Local English version: ../en/percolator.html
percolator
The percolator field type parses a json structure into a native query and stores that query, so that the percolate query can use it to match provided documents.

Any field that contains a json object can be configured to be a percolator field. Just configuring the percolator field type is enough to instruct Elasticsearch to treat the field as a query.

If the following mapping configures the percolator field type for the query field:

PUT my_index
{
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "field": {
        "type": "text"
      }
    }
  }
}

Then you can index a query:

PUT my_index/_doc/match_value
{
  "query": {
    "match": {
      "field": "value"
    }
  }
}
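Once stored, a query like this can be matched against ad-hoc documents with the percolate query. A minimal sketch against the my_index mapping above (the document values are illustrative):

GET /my_index/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "field": "value"
      }
    }
  }
}

This request returns the stored match_value query, because the supplied document contains "value" in field.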
Fields referenced in percolator queries must already exist in the mapping associated with the index used for percolation. To make sure these fields exist, add or update the mapping via the create index or put mapping APIs.
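For example, if a percolator query is going to reference a field that is not yet mapped, the mapping can be extended first with the put mapping API; the field name priority below is purely illustrative:

PUT my_index/_mapping
{
  "properties": {
    "priority": {
      "type": "keyword"
    }
  }
}

After this, percolator queries referencing priority can be indexed without errors.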
Reindexing percolator queries
Reindexing percolator queries is sometimes required to benefit from improvements made to the percolator field type in new releases.

Percolator queries can be reindexed by using the reindex API. Let's take a look at the following index with a percolator field type:

PUT /index
{
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "body": {
        "type": "text"
      }
    }
  }
}

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "index",
        "alias": "queries"
      }
    }
  ]
}

PUT /queries/_doc/1?refresh
{
  "query": {
    "match": {
      "body": "quick brown fox"
    }
  }
}
Let's say you want to upgrade to a new major Elasticsearch version. In order for the new version to still be able to read your queries, you need to reindex your queries into a new index on the current Elasticsearch version:

PUT /new_index
{
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "body": {
        "type": "text"
      }
    }
  }
}

POST /_reindex?refresh
{
  "source": {
    "index": "index"
  },
  "dest": {
    "index": "new_index"
  }
}

POST /_aliases
{
  "actions": [
    {
      "remove": {
        "index": "index",
        "alias": "queries"
      }
    },
    {
      "add": {
        "index": "new_index",
        "alias": "queries"
      }
    }
  ]
}
Execute the percolate query via the queries alias:

GET /queries/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "body": "fox jumps over the lazy dog"
      }
    }
  }
}
The matching query is now returned from the new index:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.13076457,
    "hits": [
      {
        "_index": "new_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.13076457,
        "_source": {
          "query": {
            "match": {
              "body": "quick brown fox"
            }
          }
        },
        "fields": {
          "_percolator_document_slot": [0]
        }
      }
    ]
  }
}
Optimizing query-time text analysis
When the percolator verifies a candidate percolator match, it parses the query, performs query-time text analysis, and actually runs the percolator query against the document being percolated. This is done for each candidate match, every time a percolate query executes. If query-time text analysis is a relatively expensive part of query parsing, then text analysis can become the dominant factor in the time spent percolating. This query-parsing overhead becomes noticeable when the percolator ends up verifying many candidate percolator query matches.

To avoid the most expensive part of text analysis at percolate time, one can choose to perform the expensive text analysis when indexing the percolator query instead.
This requires using two different analyzers. The first analyzer actually performs the text analysis that needs to be done (the expensive part). The second analyzer (usually whitespace) just splits the tokens generated by the first analyzer. Then, before the percolator query is indexed, the analyze API should be used to analyze the query text with the more expensive analyzer. The result of the analyze API, the tokens, should be used to replace the original query text in the percolator query. It is important that the query is now configured to override the analyzer from the mapping and to use only the second analyzer. Most text-based queries support an analyzer option (match, query_string, simple_query_string). With this approach the expensive text analysis is performed only once instead of many times.
Let's demonstrate this workflow with a simple example. Suppose we want to index the following percolator query:

{
  "query": {
    "match": {
      "body": {
        "query": "missing bicycles"
      }
    }
  }
}
with these settings and mapping:

PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "body": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
First we need to use the analyze API to perform the text analysis prior to indexing:

POST /test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "missing bicycles"
}
This results in the following response:

{
  "tokens": [
    {
      "token": "miss",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "bicycl",
      "start_offset": 8,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
All tokens need to replace the query text in the percolator query, in the order in which they were returned:
PUT /test_index/_doc/1?refresh
{
  "query": {
    "match": {
      "body": {
        "query": "miss bicycl",
        "analyzer": "whitespace"
      }
    }
  }
}
A whitespace analyzer is selected here; otherwise the analyzer defined in the mapping would be applied again, which would defeat the purpose of this workflow.
This analyze API step should be performed for each percolator query before it is indexed.
Nothing changes at percolate time; the percolate query can be specified as usual:

GET /test_index/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "body": "Bycicles are missing"
      }
    }
  }
}
This results in a response like the following:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.13076457,
    "hits": [
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.13076457,
        "_source": {
          "query": {
            "match": {
              "body": {
                "query": "miss bicycl",
                "analyzer": "whitespace"
              }
            }
          }
        },
        "fields": {
          "_percolator_document_slot": [0]
        }
      }
    ]
  }
}
Optimizing wildcard queries
Wildcard queries are more expensive for the percolator than other queries, especially if the wildcard expressions are large.

In the case of wildcard queries with prefix wildcard expressions, or just the prefix query, the edge_ngram token filter can be used to replace these queries with regular term queries on a field where the edge_ngram token filter is configured.

Create an index with custom analysis settings:
PUT my_queries1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "wildcard_prefix": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "wildcard_edge_ngram"
          ]
        }
      },
      "filter": {
        "wildcard_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 32
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "my_field": {
        "type": "text",
        "fields": {
          "prefix": {
            "type": "text",
            "analyzer": "wildcard_prefix",
            "search_analyzer": "standard"
          }
        }
      }
    }
  }
}
In this setup:

- The wildcard_prefix analyzer generates the prefix tokens and is used at index time only.
- Increase the min_gram and decrease the max_gram settings based on your prefix search needs.
- The my_field.prefix multi-field should be used to do the prefix search with a term or match query instead of a prefix or wildcard query.
Then, instead of indexing the following query:

{
  "query": {
    "wildcard": {
      "my_field": "abc*"
    }
  }
}
this query should be indexed:

PUT /my_queries1/_doc/1?refresh
{
  "query": {
    "term": {
      "my_field.prefix": "abc"
    }
  }
}
This way the second query can be handled much more efficiently by the percolator than the first query.

The following search request will match with the previously indexed percolator query:

GET /my_queries1/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "my_field": "abcd"
      }
    }
  }
}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.18864399,
    "hits": [
      {
        "_index": "my_queries1",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.18864399,
        "_source": {
          "query": {
            "term": {
              "my_field.prefix": "abc"
            }
          }
        },
        "fields": {
          "_percolator_document_slot": [0]
        }
      }
    ]
  }
}
The same technique can also be used to speed up suffix wildcard searches, by using the reverse token filter before the edge_ngram token filter:

PUT my_queries2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "wildcard_suffix": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "reverse",
            "wildcard_edge_ngram"
          ]
        },
        "wildcard_suffix_search_time": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "reverse"
          ]
        }
      },
      "filter": {
        "wildcard_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 32
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "my_field": {
        "type": "text",
        "fields": {
          "suffix": {
            "type": "text",
            "analyzer": "wildcard_suffix",
            "search_analyzer": "wildcard_suffix_search_time"
          }
        }
      }
    }
  }
}
Then, instead of indexing the following query:

{
  "query": {
    "wildcard": {
      "my_field": "*xyz"
    }
  }
}
this query, which mirrors the prefix example but targets the my_field.suffix multi-field, should be indexed instead:
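PUT /my_queries2/_doc/1?refresh
{
  "query": {
    "match": {
      "my_field.suffix": "xyz"
    }
  }
}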
The following search request will match with the previously indexed percolator query:

GET /my_queries2/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "my_field": "wxyz"
      }
    }
  }
}
Dedicated percolator index

Percolate queries can be added to any index. Instead of adding percolate queries to the index where the data resides, they can also be added to a dedicated index. The advantage of this is that the dedicated percolator index can have its own index settings (for example the number of primary and replica shards). If you choose to use a dedicated percolator index, you need to make sure that the mappings of the normal index are also available on the percolator index; otherwise percolate queries can be parsed incorrectly.
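A minimal sketch of such a setup, assuming the data lives in an index with a body text field; the index name, shard counts, and field are illustrative. The dedicated index gets its own settings but repeats the relevant mappings of the data index:

PUT /dedicated_queries
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "body": {
        "type": "text"
      }
    }
  }
}

Percolator queries are then indexed into dedicated_queries, while percolate requests supply the documents to be matched.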
Forcing unmapped fields to be handled as strings

In certain cases it is unknown what kind of percolator queries will be registered, and if a percolator query refers to fields for which no mapping exists, adding the percolator query fails. This means the mapping needs to be updated so that the field has the appropriate settings before the percolator query can be added. But sometimes it is sufficient to handle all unmapped fields as if they were default text fields. In those cases the index.percolator.map_unmapped_fields_as_text setting can be set to true (defaults to false); then, if a field referred to in a percolator query does not exist, it is handled as a default text field so that adding the percolator query doesn't fail.
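For example, the setting can be enabled when creating the percolator index (the index name queries_index is illustrative):

PUT /queries_index
{
  "settings": {
    "index.percolator.map_unmapped_fields_as_text": true
  },
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      }
    }
  }
}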
Limitations

Parent/child

Because the percolate query is processing one document at a time, it doesn't support queries and filters that run against child documents, such as has_child and has_parent.
Fetching queries

There are a number of queries that fetch data via a GET call during query parsing: for example, the terms query when using terms lookup, the template query when using an indexed script, and geo_shape when using a pre-indexed shape. When these queries are indexed by the percolator field type, the GET call is executed once, so the terms, shapes, etc. that were fetched at index time are what is used each time the percolate query evaluates these queries.

An important point to note is that fetching of terms happens each time the percolator query is indexed, on both primary and replica shards, so the terms that actually get indexed can differ between shard copies if the source index changed while indexing.
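As an illustration of such a fetching query, a percolator query can contain a terms lookup; the index, id, and path below are hypothetical. The lookup against tag_lists is executed when this query is indexed (once per shard copy), not at percolate time:

PUT /queries_index/_doc/lookup_example?refresh
{
  "query": {
    "terms": {
      "tags": {
        "index": "tag_lists",
        "id": "starred",
        "path": "tags"
      }
    }
  }
}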
Script query

The script inside a script query can only access doc values fields. The percolate query indexes the provided document into an in-memory index. This in-memory index doesn't support stored fields, so the _source field and other stored fields are not stored. This is why the _source and other stored fields aren't available inside a script query.
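For example, a stored script query can inspect a numeric doc values field, assuming a mapped field such as price (hypothetical); reading _source or stored fields inside the script would not work at percolate time because the in-memory index does not provide them:

PUT /queries_index/_doc/script_example?refresh
{
  "query": {
    "script": {
      "script": {
        "source": "doc['price'].size() > 0 && doc['price'].value > 100",
        "lang": "painless"
      }
    }
  }
}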
Field aliases

Percolator queries that contain field aliases may not always behave as expected. In particular, if a percolator query is registered that contains a field alias, and that alias is later updated in the mapping to refer to a different field, the stored query will still refer to the original target field. To pick up the change to the field alias, the percolator query must be explicitly reindexed.