English version (local): ../en/search-aggregations-bucket-composite-aggregation.html
A multi-bucket aggregation that creates composite buckets from different sources.
Unlike the other multi-bucket aggregations, the composite aggregation can be used to paginate efficiently over all the buckets of a multi-level aggregation. This aggregation provides a way to stream all the buckets of a specific aggregation, similar to what scroll does for documents.
The composite buckets are built from the combinations of the values extracted/created for each document, and each combination is considered a composite bucket.
For example, consider the following document:
{ "keyword": ["foo", "bar"], "number": [23, 65, 76] }
... when keyword and number are used as value sources for the aggregation, the following composite buckets are created:
{ "keyword": "foo", "number": 23 } { "keyword": "foo", "number": 65 } { "keyword": "foo", "number": 76 } { "keyword": "bar", "number": 23 } { "keyword": "bar", "number": 65 } { "keyword": "bar", "number": 76 }
The sources parameter controls the sources that should be used to build the composite buckets. The order in which the sources are defined matters, because it also controls the order in which the keys are returned. The name given to each source must be unique.
There are three different types of value sources:
Terms
The terms value source is equivalent to a simple terms aggregation. The values are extracted from a field or a script, exactly like in a terms aggregation. For example:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "product": { "terms": { "field": "product" } } }
        ]
      }
    }
  }
}
As with the terms aggregation, a script can also be used to create the values for the composite buckets:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "product": {
              "terms": {
                "script": {
                  "source": "doc['product'].value",
                  "lang": "painless"
                }
              }
            }
          }
        ]
      }
    }
  }
}
Histogram
The histogram value source can be applied to numeric values to build fixed-size intervals over those values. The interval parameter defines how the numeric values should be transformed. For instance, an interval set to 5 translates any numeric value to its closest interval of 5: the value 101 is translated to 100, because 101 falls in the interval between 100 and 105. For example:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "histo": { "histogram": { "field": "price", "interval": 5 } } }
        ]
      }
    }
  }
}
The values are built from a numeric field or from a script that returns numeric values:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "histo": {
              "histogram": {
                "interval": 5,
                "script": {
                  "source": "doc['price'].value",
                  "lang": "painless"
                }
              }
            }
          }
        ]
      }
    }
  }
}
Date histogram
The date_histogram value source is similar to histogram, except that the interval is specified by a date/time expression:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } }
        ]
      }
    }
  }
}
The example above creates an interval per day and translates all timestamp values to the start of their closest interval. The available expressions for the interval are: year, quarter, month, week, day, hour, minute, second.
Time values can also be specified via abbreviations supported by time units parsing. Note that time values with a decimal point are not supported, but you can work around this by shifting to another time unit (for example, 1.5h can instead be specified as 90m).
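As a minimal sketch of that workaround (assuming a date field named timestamp), the 90-minute interval goes in fixed_interval rather than calendar_interval, since calendar_interval only accepts single calendar units:

GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "date": { "date_histogram": { "field": "timestamp", "fixed_interval": "90m" } } }
        ]
      }
    }
  }
}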
Format
Internally, a date is represented as a 64-bit number: a timestamp, in milliseconds since the epoch. These timestamps are returned as the bucket keys. A formatted date string can be returned instead, using the format specified with the format parameter:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "timestamp",
                "calendar_interval": "1d",
                "format": "yyyy-MM-dd"
              }
            }
          }
        ]
      }
    }
  }
}
The format parameter supports expressive date format patterns.
Time Zone
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and rounding is also done in UTC. The time_zone parameter can be used to indicate that a different time zone should be used for bucketing. Time zones may be specified either as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as a timezone ID (an identifier used in the TZ database), such as America/Los_Angeles.
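For example, a minimal sketch (assuming a date field named timestamp) that buckets days in the America/Los_Angeles time zone instead of UTC:

GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "timestamp",
                "calendar_interval": "1d",
                "time_zone": "America/Los_Angeles"
              }
            }
          }
        ]
      }
    }
  }
}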
Offset
Use the offset parameter to change the start value of each bucket by the specified positive (+) or negative (-) offset duration, such as 1h for an hour or 1d for a day. See time units for more possible duration options.
For example, when using day as the interval, each bucket runs from midnight to midnight. Setting the offset parameter to +6h changes each bucket to run from 6am to 6am:
# Add and index two documents
PUT my_index/_doc/1?refresh
{
  "date": "2015-10-01T05:30:00Z"
}

PUT my_index/_doc/2?refresh
{
  "date": "2015-10-01T06:30:00Z"
}

# Search
GET my_index/_search?size=0
{
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "date",
                "calendar_interval": "day",
                "offset": "+6h",
                "format": "iso8601"
              }
            }
          }
        ]
      }
    }
  }
}
Instead of a single bucket starting at midnight, the above request gives buckets that start at 6am:
{
  ...
  "aggregations": {
    "my_buckets": {
      "after_key": { "date": "2015-10-01T06:00:00.000Z" },
      "buckets": [
        {
          "key": { "date": "2015-09-30T06:00:00.000Z" },
          "doc_count": 1
        },
        {
          "key": { "date": "2015-10-01T06:00:00.000Z" },
          "doc_count": 1
        }
      ]
    }
  }
}
The start offset of each bucket is calculated after time_zone adjustments have been made.
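As an illustrative sketch of that rule (again assuming a date field named timestamp), both parameters can be combined; the +6h offset below is applied to day boundaries that have already been shifted into the America/Los_Angeles time zone:

GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "timestamp",
                "calendar_interval": "day",
                "time_zone": "America/Los_Angeles",
                "offset": "+6h"
              }
            }
          }
        ]
      }
    }
  }
}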
Mixing different value sources
The sources parameter accepts an array of value sources, and different value sources can be mixed to create composite buckets. For example:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } },
          { "product": { "terms": { "field": "product" } } }
        ]
      }
    }
  }
}
This creates composite buckets from the values produced by two value sources, a date_histogram and a terms. Each bucket is composed of two values, one for each value source defined in the aggregation. Any type of combination is allowed, and the array order is preserved in the composite buckets.
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "shop": { "terms": { "field": "shop" } } },
          { "product": { "terms": { "field": "product" } } },
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } }
        ]
      }
    }
  }
}
Order
By default the composite buckets are sorted by their natural ordering: values are sorted in ascending order. When multiple value sources are requested, the ordering is done per value source: the first value of a composite bucket is compared to the first value of the other composite bucket, and if they are equal, the next values in the composite buckets are used for the comparison. This means that the composite bucket [foo, 100] is considered smaller than [foobar, 0], because foo is considered smaller than foobar.
The sort direction for each value source can be defined by setting order to asc (ascending, the default) or desc (descending) directly in the value source definition. For example:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } },
          { "product": { "terms": { "field": "product", "order": "asc" } } }
        ]
      }
    }
  }
}
This sorts the composite buckets in descending order when comparing values from the date_histogram source, and in ascending order when comparing values from the terms source.
Missing bucket
By default, documents without a value for a given source are ignored. They can be included in the response by setting missing_bucket to true (defaults to false):
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "product_name": { "terms": { "field": "product", "missing_bucket": true } } }
        ]
      }
    }
  }
}
In the example above, the source product_name emits an explicit null value for documents without a value for the product field. The order specified in the source dictates whether the null values should rank first (ascending order, asc) or last (descending order, desc).
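For instance, a minimal sketch (reusing the product field from above) that sorts the source in descending order, so that the null bucket for documents missing the field ranks last:

GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "product_name": {
              "terms": {
                "field": "product",
                "missing_bucket": true,
                "order": "desc"
              }
            }
          }
        ]
      }
    }
  }
}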
Size
The size parameter can be set to define how many composite buckets should be returned. Each composite bucket is considered a single bucket, so setting a size of 10 returns the first 10 composite buckets created from the value sources. The response contains the values for each composite bucket in an array containing the values extracted from each value source.
After
If the number of composite buckets is too high (or unknown) to be returned in a single response, the retrieval can be split into multiple requests. Since the composite buckets are flat by nature, the requested size is exactly the number of composite buckets that will be returned in the response (assuming there are at least size composite buckets to return). If all composite buckets should be retrieved, it is preferable to use a small size (100 or 1000, for instance) and then use the after parameter to retrieve the next results. For example:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 2,
        "sources": [
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } },
          { "product": { "terms": { "field": "product" } } }
        ]
      }
    }
  }
}
... returns:
{
  ...
  "aggregations": {
    "my_buckets": {
      "after_key": {
        "date": 1494288000000,
        "product": "mad max"
      },
      "buckets": [
        {
          "key": {
            "date": 1494201600000,
            "product": "rocky"
          },
          "doc_count": 1
        },
        {
          "key": {
            "date": 1494288000000,
            "product": "mad max"
          },
          "doc_count": 2
        }
      ]
    }
  }
}
To get the next set of buckets, resend the same aggregation with the after parameter set to the after_key value returned in the response. For example, the following request uses the after_key value provided in the previous response:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 2,
        "sources": [
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } },
          { "product": { "terms": { "field": "product", "order": "asc" } } }
        ],
        "after": { "date": 1494288000000, "product": "mad max" }
      }
    }
  }
}
The after_key is usually the key of the last bucket returned in the response, but that isn't guaranteed. Always use the returned after_key instead of taking it from the buckets.
Early termination
For optimal performance, the index sort should be set on the index so that it matches part or all of the source order in the composite aggregation. For example, the following index sort:
PUT twitter
{
  "settings": {
    "index": {
      "sort.field": ["username", "timestamp"],
      "sort.order": ["asc", "desc"]
    }
  },
  "mappings": {
    "properties": {
      "username": {
        "type": "keyword",
        "doc_values": true
      },
      "timestamp": {
        "type": "date"
      }
    }
  }
}
... could be used to optimize these composite aggregations:

GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "user_name": { "terms": { "field": "username" } } }
        ]
      }
    }
  }
}
and:

GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "user_name": { "terms": { "field": "username" } } },
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } }
        ]
      }
    }
  }
}
In order to optimize early termination, it is advised to set track_total_hits to false in the request. The total number of hits that match the request can be retrieved on the first request, and it would be costly to compute this count on every page:
GET /_search
{
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "user_name": { "terms": { "field": "username" } } },
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } }
        ]
      }
    }
  }
}
Note that the order of the sources is important: in the example above, swapping user_name with timestamp would deactivate the sort optimization, since that configuration would no longer match the index sort specification.
If the order of the sources doesn't matter for your use case, you can follow these simple guidelines:
- Put the field with the highest cardinality first. (This is similar to composite-index optimization in MySQL.)
- Make sure that the order of the fields matches the order of the index sort.
- Put multi-valued fields last, since they cannot be used for early termination.
Index sorting can slow down indexing, so it is very important to test index sorting with your specific use case and dataset to ensure that it meets your requirements. Even when it doesn't apply, the composite aggregation will still attempt to terminate early on non-sorted indices if the query matches all documents (a match_all query).
Sub-aggregations
Like any multi-bucket aggregation, the composite aggregation can hold sub-aggregations. These sub-aggregations can be used to compute other buckets or statistics on each composite bucket created by this parent aggregation. For instance, the following example computes the average value of the price field per composite bucket:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } },
          { "product": { "terms": { "field": "product" } } }
        ]
      },
      "aggregations": {
        "the_avg": {
          "avg": { "field": "price" }
        }
      }
    }
  }
}
... returns:
{
  ...
  "aggregations": {
    "my_buckets": {
      "after_key": {
        "date": 1494201600000,
        "product": "rocky"
      },
      "buckets": [
        {
          "key": {
            "date": 1494460800000,
            "product": "apocalypse now"
          },
          "doc_count": 1,
          "the_avg": {
            "value": 10.0
          }
        },
        {
          "key": {
            "date": 1494374400000,
            "product": "mad max"
          },
          "doc_count": 1,
          "the_avg": {
            "value": 27.0
          }
        },
        {
          "key": {
            "date": 1494288000000,
            "product": "mad max"
          },
          "doc_count": 2,
          "the_avg": {
            "value": 22.5
          }
        },
        {
          "key": {
            "date": 1494201600000,
            "product": "rocky"
          },
          "doc_count": 1,
          "the_avg": {
            "value": 10.0
          }
        }
      ]
    }
  }
}