WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Divide and Conquer
The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms. Documents that match only the less important terms are probably of very little interest. Really, we want documents that match as many of the more important terms as possible.
The match query accepts a cutoff_frequency parameter, which allows it to divide the terms in the query string into a low-frequency and a high-frequency group. The low-frequency group (more-important terms) forms the bulk of the query, while the high-frequency group (less-important terms) is used only for scoring, not for matching. By treating these two groups differently, we can gain a real speed boost on previously slow queries.
Take this query as an example:
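A query consistent with the description that follows would look like the sketch below. It assumes the same text field and query string used later in this section, with a cutoff_frequency of 0.01 to match the 1% threshold:

```json
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01
    }
  }
}
```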
Any term that occurs in more than 1% of documents is considered to be high frequency.
This query uses the cutoff_frequency to first divide the query terms into a low-frequency group (quick, dead) and a high-frequency group (and, the). Then, the query is rewritten to produce the following bool query:
{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ]
      }
    },
    "should": {
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
In the must clause, at least one low-frequency/high-importance term must match. In the outer should clause, the high-frequency/low-importance terms are entirely optional.
The must clause means that at least one of the low-frequency terms (quick or dead) must be present for a document to be considered a match. All other documents are excluded. The should clause then looks for the high-frequency terms "and" and "the", but only in the documents collected by the must clause. The sole job of the should clause is to score a document like “Quick and the dead” higher than “The quick but dead”. This approach greatly reduces the number of documents that need to be examined and scored.
Setting the operator parameter to and would make all low-frequency terms required, and score documents that contain all high-frequency terms higher. However, matching documents would not be required to contain all high-frequency terms. If you would prefer all low- and high-frequency terms to be required, you should use a bool query instead. As we saw in the "and Operator" section, this is already an efficient query.
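As a sketch of the two options just described, the first query below adds the operator parameter to a match query that uses cutoff_frequency. The field name, query string, and cutoff value are carried over from the other examples in this section and are assumptions here:

```json
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "operator": "and"
    }
  }
}
```

If every term should be required regardless of its frequency, a bool query along these lines would do it. The lowercase terms assume the same analysis as in the rewritten query shown earlier:

```json
{
  "bool": {
    "must": [
      { "term": { "text": "quick" }},
      { "term": { "text": "and" }},
      { "term": { "text": "the" }},
      { "term": { "text": "dead" }}
    ]
  }
}
```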
Controlling Precision
The minimum_should_match parameter can be combined with cutoff_frequency, but it applies only to the low-frequency terms. This query:
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": "75%"
    }
  }
}
would be rewritten as follows:
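Applying the same rewrite as in the earlier example, the result would presumably look like the sketch below. It assumes the same low-frequency group (quick, dead) and high-frequency group (and, the), and that the percentage is rounded down as minimum_should_match percentages normally are: 75% of two low-frequency terms is 1.5, which becomes 1.

```json
{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ],
        "minimum_should_match": 1
      }
    },
    "should": {
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
```

With only two low-frequency terms, the effect is that one of the two must match, which is the same as the default behavior. The parameter makes more of a difference on longer query strings.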
Only High-Frequency Terms
An or query for high-frequency terms only (“To be, or not to be”) is the worst case for performance. It is pointless to score all the documents that contain only one of these terms in order to return just the top 10 matches. We are really interested only in documents in which the terms all occur together, so in the case where there are no low-frequency terms, the query is rewritten to make all high-frequency terms required:
{
  "bool": {
    "must": [
      { "term": { "text": "to" }},
      { "term": { "text": "be" }},
      { "term": { "text": "or" }},
      { "term": { "text": "not" }},
      { "term": { "text": "to" }},
      { "term": { "text": "be" }}
    ]
  }
}
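For illustration, a match query like the following sketch would take this path. The field name and cutoff value are carried over from the earlier examples, and the sketch assumes that every one of these terms occurs in more than 1% of the documents in the index, which depends on the data and not on the query:

```json
{
  "match": {
    "text": {
      "query": "To be, or not to be",
      "cutoff_frequency": 0.01
    }
  }
}
```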
More Control with Common Terms
While the high/low frequency functionality in the match query is useful, sometimes you want more control over how the high- and low-frequency groups should be handled. The match query exposes a subset of the functionality available in the common terms query.
For instance, we could make all low-frequency terms required, and score only documents that have 75% of all high-frequency terms with a query like this:
{
  "common": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "low_freq_operator": "and",
      "minimum_should_match": {
        "high_freq": "75%"
      }
    }
  }
}
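To make the effect concrete, the common terms query above is roughly equivalent to the bool query sketched below. This is an approximation under stated assumptions: the same split into low-frequency terms (quick, dead) and high-frequency terms (and, the) as in the earlier example, and 75% of two high-frequency terms rounding down to 1.

```json
{
  "bool": {
    "must": [
      { "term": { "text": "quick" }},
      { "term": { "text": "dead" }}
    ],
    "should": {
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ],
        "minimum_should_match": 1
      }
    }
  }
}
```

Both low-frequency terms are required, and the high-frequency group contributes to the score only when enough of its terms are present.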
See the common terms query reference page for more options.