Divide and Conquer

The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms. Documents that match only the less important terms are probably of very little interest. Really, we want documents that match as many of the more important terms as possible.

The match query accepts a cutoff_frequency parameter, which allows it to divide the terms in the query string into a low-frequency group and a high-frequency group. The low-frequency group (the more-important terms) forms the bulk of the query, while the high-frequency group (the less-important terms) is used only for scoring, not for matching. By treating these two groups differently, we can gain a real speed boost on previously slow queries.

Take this query as an example:

{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01
    }
  }
}

With a cutoff_frequency of 0.01, any term that occurs in more than 1% of documents is considered to be high frequency. The cutoff_frequency can be specified as a fraction (0.01) or as an absolute number of documents (5).
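The grouping rule can be sketched in a few lines of Python. This is a simplified illustration, not the Elasticsearch implementation: the function name split_by_frequency, the doc_freq mapping, and the document counts are all hypothetical, and it assumes that a cutoff below 1 is a fraction of the total document count while a cutoff of 1 or more is an absolute document count.

```python
def split_by_frequency(terms, doc_freq, total_docs, cutoff_frequency):
    """Split terms into (low_freq, high_freq) lists.

    Hypothetical helper: doc_freq maps each term to the number of
    documents containing it. A cutoff below 1 is treated as a fraction
    of total_docs; a cutoff of 1 or more as an absolute document count.
    """
    if cutoff_frequency < 1:
        threshold = cutoff_frequency * total_docs
    else:
        threshold = cutoff_frequency

    low_freq, high_freq = [], []
    for term in terms:
        # Terms found in more documents than the threshold are "high frequency".
        if doc_freq.get(term, 0) > threshold:
            high_freq.append(term)
        else:
            low_freq.append(term)
    return low_freq, high_freq


# Made-up document frequencies for an index of 10,000 documents.
doc_freq = {"quick": 40, "and": 9200, "the": 9800, "dead": 15}
low, high = split_by_frequency(
    ["quick", "and", "the", "dead"], doc_freq, 10_000, 0.01
)
print(low)   # ['quick', 'dead']
print(high)  # ['and', 'the']
```

With these invented counts, the 1% cutoff corresponds to 100 documents, so only "and" and "the" land in the high-frequency group.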

This query uses the cutoff_frequency to first divide the query terms into a low-frequency group (quick, dead) and a high-frequency group (and, the). Then, the query is rewritten to produce the following bool query:

{
  "bool": {
    "must": { 
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead"  }}
        ]
      }
    },
    "should": { 
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}

In the must clause, at least one low-frequency/high-importance term must match.

In the should clause, the high-frequency/low-importance terms are entirely optional.

The must clause means that at least one of the low-frequency terms, quick or dead, must be present for a document to be considered a match. All other documents are excluded. The should clause then looks for the high-frequency terms “and” and “the”, but only in the documents collected by the must clause. The sole job of the should clause is to score a document like “Quick and the dead” higher than “The quick but dead”. This approach greatly reduces the number of documents that need to be examined and scored.
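As a rough sketch of this rewrite, the following Python builds the same bool structure from two term groups. The helper name rewrite_as_bool is hypothetical and the code only mirrors the shape of the query shown above; it is not how Elasticsearch performs the rewrite internally.

```python
import json


def rewrite_as_bool(field, low_freq, high_freq):
    """Build the two-part bool query from two term groups (hypothetical helper)."""
    def term_clauses(terms):
        return [{"term": {field: t}} for t in terms]

    return {
        "bool": {
            # At least one low-frequency term has to match.
            "must": {"bool": {"should": term_clauses(low_freq)}},
            # High-frequency terms are optional and only affect the score.
            "should": {"bool": {"should": term_clauses(high_freq)}},
        }
    }


rewritten = rewrite_as_bool("text", ["quick", "dead"], ["and", "the"])
print(json.dumps(rewritten, indent=2))
```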

Setting the operator parameter to and would make all low-frequency terms required, and score documents that contain all high-frequency terms higher. However, matching documents would not be required to contain all high-frequency terms. If you would prefer all low- and high-frequency terms to be required, you should use a bool query instead. As we saw in “and Operator”, this is already an efficient query.

Controlling Precision

The minimum_should_match parameter can be combined with cutoff_frequency, but it applies only to the low-frequency terms. This query:

{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": "75%"
    }
  }
}

would be rewritten as follows:

{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead"  }}
        ],
        "minimum_should_match": 1 
      }
    },
    "should": { 
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}

Because there are only two low-frequency terms, the original 75% (1.5 of 2 terms) is rounded down to 1; that is, one out of two low-frequency terms must match.

The high-frequency terms are still optional and used only for scoring.
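The rounding of “75%” to 1 can be checked with a small calculation. The function below is a simplified, hypothetical sketch that handles only positive percentage strings and rounds down; the real minimum_should_match parameter accepts other forms that are not modeled here.

```python
def resolve_minimum_should_match(percentage, clause_count):
    """Convert a positive percentage string such as "75%" into a clause count.

    Hypothetical helper: uses integer arithmetic so the result is
    always rounded down.
    """
    percent = int(percentage.rstrip("%"))
    return percent * clause_count // 100


print(resolve_minimum_should_match("75%", 2))  # 1
print(resolve_minimum_should_match("75%", 4))  # 3
```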

Only High-Frequency Terms

An or query for high-frequency terms only (“To be, or not to be”) is the worst case for performance. It is pointless to score all the documents that contain only one of these terms in order to return just the top 10 matches. We are really interested only in documents in which the terms all occur together, so in the case where there are no low-frequency terms, the query is rewritten to make all high-frequency terms required:

{
  "bool": {
    "must": [
      { "term": { "text": "to" }},
      { "term": { "text": "be" }},
      { "term": { "text": "or" }},
      { "term": { "text": "not" }},
      { "term": { "text": "to" }},
      { "term": { "text": "be" }}
    ]
  }
}
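The special case can be expressed as a single branch. In this hypothetical Python sketch (the name rewrite_terms is an assumption, and the code only reproduces the query shapes shown in this section), an empty low-frequency group causes every high-frequency term to become a required clause.

```python
def rewrite_terms(field, low_freq, high_freq):
    """Hypothetical sketch of the rewrite, including the high-frequency-only case."""
    def term_clauses(terms):
        return [{"term": {field: t}} for t in terms]

    if not low_freq:
        # No low-frequency terms: require every high-frequency term.
        return {"bool": {"must": term_clauses(high_freq)}}

    return {
        "bool": {
            "must": {"bool": {"should": term_clauses(low_freq)}},
            "should": {"bool": {"should": term_clauses(high_freq)}},
        }
    }


query = rewrite_terms("text", [], ["to", "be", "or", "not", "to", "be"])
print(len(query["bool"]["must"]))  # 6
```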

More Control with Common Terms

While the high/low frequency functionality in the match query is useful, sometimes you want more control over how the high- and low-frequency groups should be handled. The match query exposes a subset of the functionality available in the common terms query.

For instance, we could make all low-frequency terms required, and score only documents that contain at least 75% of the high-frequency terms, with a query like this:

{
  "common": {
    "text": {
      "query":                  "Quick and the dead",
      "cutoff_frequency":       0.01,
      "low_freq_operator":      "and",
      "minimum_should_match": {
        "high_freq":            "75%"
      }
    }
  }
}

See the common terms query reference page for more options.