Field Collapsing | Elasticsearch: The Definitive Guide

Source: https://www.elastic.co/guide/cn/elasticsearch/guide/current/top-hits.html, copyright www.elastic.co
English version: https://www.elastic.co/guide/en/elasticsearch/guide/current/top-hits.html

Note:
This book is based on Elasticsearch 2.x, so some of its content may be out of date.


Field Collapsing

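A common requirement is to present search results grouped by a particular field. For example, we might want to return the most relevant blog posts grouped by the user's name. Grouping by name implies a `terms` aggregation. To group on the user's whole name, the name field must be available in its original `not_analyzed` form (no tokenization), as explained in Aggregations and Analysis: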

PUT /my_index/_mapping/blogpost
{
  "properties": {
    "user": {
      "properties": {
        "name": { 
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

	The `user.name` field will be used for full-text search (it is analyzed by default).
	The `user.name.raw` field will be used for grouping with the `terms` aggregation.
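
To see why the `raw` sub-field matters, here is a minimal Python sketch that imitates, very roughly, how values from the two fields would fall into buckets. The `analyze` function is a hypothetical stand-in for an analyzer (it only lowercases and splits on whitespace); it is not how Elasticsearch actually implements analysis.

```python
from collections import Counter

def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()

names = ["John Smith", "Alice John"]

# Grouping on the analyzed field: one bucket per token, counted once per document.
analyzed_buckets = Counter(tok for name in names for tok in set(analyze(name)))

# Grouping on the not_analyzed field: one bucket per exact, whole value.
raw_buckets = Counter(names)

print(dict(analyzed_buckets))
print(dict(raw_buckets))
```

With the analyzed field, both users would land in a shared `john` bucket, which is not the per-user grouping we are after.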

Then add some data:

PUT /my_index/user/1
{
  "name": "John Smith",
  "email": "john@smith.com",
  "dob": "1970/10/24"
}

PUT /my_index/blogpost/2
{
  "title": "Relationships",
  "body": "It's complicated...",
  "user": {
    "id": 1,
    "name": "John Smith"
  }
}

PUT /my_index/user/3
{
  "name": "Alice John",
  "email": "alice@john.com",
  "dob": "1979/01/04"
}

PUT /my_index/blogpost/4
{
  "title": "Relationships are cool",
  "body": "It's not complicated at all...",
  "user": {
    "id": 3,
    "name": "Alice John"
  }
}

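Now we can run a query for blog posts whose title contains relationships, written by users whose name contains John, and group the results by user name, thanks to the `top_hits` aggregation: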

GET /my_index/blogpost/_search
{
  "size" : 0, 
  "query": { 
    "bool": {
      "must": [
        { "match": { "title":     "relationships" }},
        { "match": { "user.name": "John"          }}
      ]
    }
  },
  "aggs": {
    "users": {
      "terms": {
        "field":   "user.name.raw",      
        "order": { "top_score": "desc" } 
      },
      "aggs": {
        "top_score": { "max":      { "script":  "_score"           }}, 
        "blogposts": { "top_hits": { "_source": "title", "size": 5 }}  
      }
    }
  }
}

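	The blog posts we are interested in are returned under the `blogposts` aggregation, so we can disable the usual search `hits` by setting `size` to 0.
	The `query` returns blog posts about `relationships` by users named `John`.
	The `terms` aggregation creates a bucket for each `user.name.raw` value.
	The `top_score` aggregation orders the terms in the `users` aggregation by the top-scoring document in each bucket.
	The `top_hits` aggregation returns just the `title` field of the five most relevant blog posts for each user.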
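
The combined effect of these aggregations can be imitated in memory. The following Python sketch is only an illustration of the grouping logic, not of how Elasticsearch executes it; the `hits` list, its scores, the third blog post, and the `collapse` function are all made up for the example.

```python
from collections import defaultdict

# Hypothetical (user, title, score) hits, standing in for what the query matched.
hits = [
    ("John Smith", "Relationships", 0.35),
    ("Alice John", "Relationships are cool", 0.22),
    ("Alice John", "More on relationships", 0.10),
]

def collapse(hits, per_user=5):
    """Group hits by user, keep the best `per_user` titles per group,
    and order the groups by their highest score, descending."""
    groups = defaultdict(list)
    for user, title, score in hits:
        groups[user].append((score, title))
    buckets = []
    for user, docs in groups.items():
        docs.sort(reverse=True)  # best score first
        buckets.append({
            "key": user,
            "doc_count": len(docs),
            "top_score": docs[0][0],
            "blogposts": [title for _, title in docs[:per_user]],
        })
    buckets.sort(key=lambda b: b["top_score"], reverse=True)
    return buckets

for bucket in collapse(hits):
    print(bucket)
```

Here `key` and `doc_count` play the role of the `terms` bucket, `top_score` that of the `max` aggregation, and `blogposts` that of `top_hits`.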

The abbreviated response is shown here:

...
"hits": {
  "total":     2,
  "max_score": 0,
  "hits":      [] 
},
"aggregations": {
  "users": {
     "buckets": [
        {
           "key":       "John Smith", 
           "doc_count": 1,
           "blogposts": {
              "hits": { 
                 "total":     1,
                 "max_score": 0.35258877,
                 "hits": [
                    {
                       "_index": "my_index",
                       "_type":  "blogpost",
                       "_id":    "2",
                       "_score": 0.35258877,
                       "_source": {
                          "title": "Relationships"
                       }
                    }
                 ]
              }
           },
           "top_score": { 
              "value": 0.3525887727737427
           }
        },
...

	The `hits` array is empty because we set `size` to 0.
	There is a bucket for each user who appeared in the top results.
	Under each user bucket there is a `blogposts.hits` array containing the top results for that user.
	The user buckets are sorted by each user's most relevant blog post (the `top_score` value), from highest to lowest.
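
On the client side, a response of this shape can be flattened into per-user title lists. The sketch below assumes the response has already been parsed into a Python dict; it reuses only the first bucket from the abbreviated response, and `titles_by_user` is a hypothetical helper name.

```python
# Abbreviated response, shaped like the example above (first bucket only).
response = {
    "hits": {"total": 2, "max_score": 0, "hits": []},
    "aggregations": {
        "users": {
            "buckets": [
                {
                    "key": "John Smith",
                    "doc_count": 1,
                    "blogposts": {
                        "hits": {
                            "total": 1,
                            "max_score": 0.35258877,
                            "hits": [
                                {
                                    "_index": "my_index",
                                    "_type": "blogpost",
                                    "_id": "2",
                                    "_score": 0.35258877,
                                    "_source": {"title": "Relationships"},
                                }
                            ],
                        }
                    },
                    "top_score": {"value": 0.3525887727737427},
                }
            ]
        }
    },
}

def titles_by_user(response):
    """Return [(user, [title, ...]), ...] in the order the buckets were returned."""
    result = []
    for bucket in response["aggregations"]["users"]["buckets"]:
        posts = bucket["blogposts"]["hits"]["hits"]
        result.append((bucket["key"], [p["_source"]["title"] for p in posts]))
    return result

print(titles_by_user(response))
```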

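Using the `top_hits` aggregation is equivalent to running one query to return the names of the users with the most relevant blog posts, and then running the same query again for each user to get that user's best blog posts. The `top_hits` approach is much more efficient.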
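
To make the comparison concrete, the following Python sketch builds the request bodies that the manual two-step approach would need. It is an assumption-laden illustration: `two_step_requests` and `base_query` are hypothetical helpers, the user names are passed in by hand rather than read from a real step-one response, and restricting step two with a `filter` clause inside `bool` is one plausible way to do it, not something the text prescribes.

```python
def base_query():
    """The main query from the example: title and user.name must both match."""
    return {
        "bool": {
            "must": [
                {"match": {"title": "relationships"}},
                {"match": {"user.name": "John"}},
            ]
        }
    }

def two_step_requests(user_names, per_user=5):
    """Build the search bodies the manual (non-top_hits) approach would send."""
    # Step 1: find the users, ordered by their best-scoring post.
    step1 = {
        "size": 0,
        "query": base_query(),
        "aggs": {
            "users": {
                "terms": {"field": "user.name.raw", "order": {"top_score": "desc"}},
                "aggs": {"top_score": {"max": {"script": "_score"}}},
            }
        },
    }
    # Step 2: one follow-up search per user found in step 1.
    step2 = []
    for name in user_names:
        query = base_query()
        query["bool"]["filter"] = {"term": {"user.name.raw": name}}
        step2.append({"size": per_user, "_source": "title", "query": query})
    return [step1] + step2

bodies = two_step_requests(["John Smith", "Alice John"])
print(len(bodies), "requests instead of 1")
```

The number of requests grows with the number of users, whereas the `top_hits` version needs a single request.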

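The top hits returned in each bucket are the result of running a lightweight mini-query based on the original main query. The mini-query supports the usual features you would expect from search, such as highlighting and pagination.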
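
As a sketch of those features, the Python snippet below builds an `aggs` body in which the `top_hits` aggregation asks for the second page of each bucket and highlights the `title` field. The helper `top_hits_page` and its parameters are hypothetical; `from`, `size`, and `highlight` are standard `top_hits` options, but check them against the documentation for your Elasticsearch version.

```python
import json

def top_hits_page(page, per_page=5, highlight_field="title"):
    """Build a top_hits body for a zero-based page, with highlighting."""
    return {
        "top_hits": {
            "_source": "title",
            "from": page * per_page,  # offset within each bucket
            "size": per_page,
            "highlight": {"fields": {highlight_field: {}}},
        }
    }

aggs = {
    "users": {
        "terms": {"field": "user.name.raw", "order": {"top_score": "desc"}},
        "aggs": {
            "top_score": {"max": {"script": "_score"}},
            "blogposts": top_hits_page(page=1),  # second page of each bucket
        },
    }
}

print(json.dumps(aggs, indent=2))
```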
