What Is Relevance? | Elasticsearch: The Definitive Guide [2.x]

原文地址: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html, 版权归 www.elastic.co 所有

WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

» » »

« String Sorting and Multifields Doc Values Intro »

What Is Relevance?edit

We’ve mentioned that, by default, results are returned in descending order of relevance. But what is relevance? How is it calculated?

The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.

A query clause generates a _score for each document. How that score is calculated depends on the type of query clause. Different query clauses are used for different purposes: a fuzzy query might determine the _score by calculating how similar the spelling of the found word is to the original search term; a terms query would incorporate the percentage of terms that were found. However, what we usually mean by relevance is the algorithm that we use to calculate how similar the contents of a full-text field are to a full-text query string.

The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:

Term frequency: How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
Inverse document frequency: How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms.
Field-length norm: How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.

Individual queries may combine the TF/IDF score with other factors such as the term proximity in phrase queries, or term similarity in fuzzy queries.

Relevance is not just about full-text search, though. It can equally be applied to yes/no clauses, where the more clauses that match, the higher the _score.

When multiple query clauses are combined using a compound query like the bool query, the _score from each of these query clauses is combined to calculate the overall _score for the document.

We have a whole chapter dedicated to relevance calculations and how to bend them to your will: Controlling Relevance.

Understanding the Scoreedit

When debugging a complex query, it can be difficult to understand exactly how a _score has been calculated. Elasticsearch has the option of producing an explanation with every search result, by setting the explain parameter to true.

GET /_search?explain 
{
   "query"   : { "match" : { "tweet" : "honeymoon" }}
}

The explain parameter adds an explanation of how the _score was calculated to every result.

Adding explain produces a lot of output for every hit, which can look overwhelming, but it is worth taking the time to understand what it all means. Don’t worry if it doesn’t all make sense now; you can refer to this section when you need it. We’ll work through the output for one hit bit by bit.

First, we have the metadata that is returned on normal search requests:

{
    "_index" :      "us",
    "_type" :       "tweet",
    "_id" :         "12",
    "_score" :      0.076713204,
    "_source" :     { ... trimmed ... },

It adds information about the shard and the node that the document came from, which is useful to know because term and document frequencies are calculated per shard, rather than per index:

    "_shard" :      1,
    "_node" :       "mzIVYCsqSWCG_M_ZffSs9Q",

Then it provides the _explanation. Each entry contains a description that tells you what type of calculation is being performed, a value that gives you the result of the calculation, and the details of any subcalculations that were required:

"_explanation": { 
   "description": "weight(tweet:honeymoon in 0)
                  [PerFieldSimilarity], result of:",
   "value":       0.076713204,
   "details": [
      {
         "description": "fieldWeight in 0, product of:",
         "value":       0.076713204,
         "details": [
            {  
               "description": "tf(freq=1.0), with freq of:",
               "value":       1,
               "details": [
                  {
                     "description": "termFreq=1.0",
                     "value":       1
                  }
               ]
            },
            { 
               "description": "idf(docFreq=1, maxDocs=1)",
               "value":       0.30685282
            },
            { 
               "description": "fieldNorm(doc=0)",
               "value":        0.25,
            }
         ]
      }
   ]
}

	Summary of the score calculation for `honeymoon`
	Term frequency
	Inverse document frequency
	Field-length norm

Producing the explain output is expensive. It is a debugging tool only. Don’t leave it turned on in production.

The first part is the summary of the calculation. It tells us that it has calculated the weight—the TF/IDF—of the term honeymoon in the field tweet, for document 0. (This is an internal document ID and, for our purposes, can be ignored.)

It then provides details of how the weight was calculated:

Term frequency: How many times did the term honeymoon appear in the tweet field in this document?
Inverse document frequency: How many times did the term honeymoon appear in the tweet field of all documents in the index?
Field-length norm: How long is the tweet field in this document? The longer the field, the smaller this number.

Explanations for more-complicated queries can appear to be very complex, but really they just contain more of the same calculations that appear in the preceding example. This information can be invaluable for debugging why search results appear in the order that they do.

The output from explain can be difficult to read in JSON, but it is easier when it is formatted as YAML. Just add format=yaml to the query string.

Understanding Why a Document Matchededit

While the explain option adds an explanation for every result, you can use the explain API to understand why one particular document matched or, more important, why it didn’t match.

The path for the request is /index/type/id/_explain, as in the following:

GET /us/tweet/12/_explain
{
   "query" : {
      "bool" : {
         "filter" : { "term" :  { "user_id" : 2           }},
         "must" :  { "match" : { "tweet" :   "honeymoon" }}
      }
   }
}

Along with the full explanation that we saw previously, we also now have a description element, which tells us this:

"failure to match filter: cache(user_id:[2 TO 2])"

In other words, our user_id filter clause is preventing the document from matching.

« String Sorting and Multifields Doc Values Intro »