Field Collapsing | Elasticsearch: The Definitive Guide [2.x]

原文地址: https://www.elastic.co/guide/en/elasticsearch/guide/current/top-hits.html, 版权归 www.elastic.co 所有

WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

» » »

« Denormalizing Your Data Denormalization and Concurrency »

Field Collapsingedit

A common requirement is the need to present search results grouped by a particular field. We might want to return the most relevant blog posts grouped by the user’s name. Grouping by name implies the need for a terms aggregation. To be able to group on the user’s whole name, the name field should be available in its original not_analyzed form, as explained in Aggregations and Analysis:

PUT /my_index/_mapping/blogpost
{
  "properties": {
    "user": {
      "properties": {
        "name": { 
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

	The `user.name` field will be used for full-text search.
	The `user.name.raw` field will be used for grouping with the `terms` aggregation.

Then add some data:

PUT /my_index/user/1
{
  "name": "John Smith",
  "email": "john@smith.com",
  "dob": "1970/10/24"
}

PUT /my_index/blogpost/2
{
  "title": "Relationships",
  "body": "It's complicated...",
  "user": {
    "id": 1,
    "name": "John Smith"
  }
}

PUT /my_index/user/3
{
  "name": "Alice John",
  "email": "alice@john.com",
  "dob": "1979/01/04"
}

PUT /my_index/blogpost/4
{
  "title": "Relationships are cool",
  "body": "It's not complicated at all...",
  "user": {
    "id": 3,
    "name": "Alice John"
  }
}

Now we can run a query looking for blog posts about relationships, by users called John, and group the results by user, thanks to the top_hits aggregation:

GET /my_index/blogpost/_search
{
  "size" : 0, 
  "query": { 
    "bool": {
      "must": [
        { "match": { "title":     "relationships" }},
        { "match": { "user.name": "John"          }}
      ]
    }
  },
  "aggs": {
    "users": {
      "terms": {
        "field":   "user.name.raw",      
        "order": { "top_score": "desc" } 
      },
      "aggs": {
        "top_score": { "max":      { "script":  "_score"           }}, 
        "blogposts": { "top_hits": { "_source": "title", "size": 5 }}  
      }
    }
  }
}

	The blog posts that we are interested in are returned under the `blogposts` aggregation, so we can disable the usual search `hits` by setting the `size` to 0.
	The `query` returns blog posts about `relationships` by users named `John`.
	The `terms` aggregation creates a bucket for each `user.name.raw` value.
	The `top_score` aggregation orders the terms in the `users` aggregation by the top-scoring document in each bucket.
	The `top_hits` aggregation returns just the `title` field of the five most relevant blog posts for each user.

The abbreviated response is shown here:

...
"hits": {
  "total":     2,
  "max_score": 0,
  "hits":      [] 
},
"aggregations": {
  "users": {
     "buckets": [
        {
           "key":       "John Smith", 
           "doc_count": 1,
           "blogposts": {
              "hits": { 
                 "total":     1,
                 "max_score": 0.35258877,
                 "hits": [
                    {
                       "_index": "my_index",
                       "_type":  "blogpost",
                       "_id":    "2",
                       "_score": 0.35258877,
                       "_source": {
                          "title": "Relationships"
                       }
                    }
                 ]
              }
           },
           "top_score": { 
              "value": 0.3525887727737427
           }
        },
...

	The `hits` array is empty because we set `size` to 0.
	There is a bucket for each user who appeared in the top results.
	Under each user bucket there is a `blogposts.hits` array containing the top results for that user.
	The user buckets are sorted by the user’s most relevant blog post.

Using the top_hits aggregation is the equivalent of running a query to return the names of the users with the most relevant blog posts, and then running the same query for each user, to get their best blog posts. But it is much more efficient.

The top hits returned in each bucket are the result of running a light mini-query based on the original main query. The mini-query supports the usual features that you would expect from search such as highlighting and pagination.

« Denormalizing Your Data Denormalization and Concurrency »