Phrase Matching | Elasticsearch: The Definitive Guide [2.x]

原文地址: https://www.elastic.co/guide/en/elasticsearch/guide/current/phrase-matching.html, 版权归 www.elastic.co 所有

WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

» » »

« Proximity Matching Mixing It Up »

Phrase Matchingedit

In the same way that the match query is the go-to query for standard full-text search, the match_phrase query is the one you should reach for when you want to find words that are near each other:

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}

Like the match query, the match_phrase query first analyzes the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same positions relative to each other. A query for the phrase quick fox would not match any of our documents, because no document contains the word quick immediately followed by fox.

The match_phrase query can also be written as a match query with type phrase:

"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}

Term Positionsedit

When a string is analyzed, the analyzer returns not only a list of terms, but also the position, or order, of each term in the original string:

GET /_analyze?analyzer=standard
Quick brown fox

This returns the following:

{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1 
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2 
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3 
      }
   ]
}

The position of each term in the original string.

Positions can be stored in the inverted index, and position-aware queries like the match_phrase query can use them to match only documents that contain all the words in exactly the order specified, with no words in-between.

What Is a Phraseedit

For a document to be considered a match for the phrase “quick brown fox”, the following must be true:

quick, brown, and fox must all appear in the field.
The position of brown must be 1 greater than the position of quick.
The position of fox must be 2 greater than the position of quick.

If any of these conditions is not met, the document is not considered a match.

Internally, the match_phrase query uses the low-level span query family to do position-aware matching. Span queries are term-level queries, so they have no analysis phase; they search for the exact term specified.

Thankfully, most people never need to use the span queries directly, as the match_phrase query is usually good enough. However, certain specialized fields, like patent searches, use these low-level queries to perform very specific, carefully constructed positional searches.

« Proximity Matching Mixing It Up »