Keyword marker token filter
Marks specified tokens as keywords, which are not stemmed.
The keyword_marker filter assigns specified tokens a keyword attribute of
true. Stemmer token filters, such as
stemmer or
porter_stem, skip tokens with a keyword
attribute of true.
To work properly, the keyword_marker filter must be listed before any stemmer
token filters in the analyzer configuration.
The keyword_marker filter uses Lucene’s
KeywordMarkerFilter.
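The ordering requirement can be illustrated with a toy model. The following Python sketch is not Elasticsearch or Lucene code: the Token class, the keyword_marker and toy_stemmer functions, and the crude suffix-stripping rule are all assumptions made for illustration. It only shows why a stemmer that runs before the marker never sees the keyword attribute.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Token:
    text: str
    keyword: bool = False  # stands in for the keyword attribute


def keyword_marker(keywords: Iterable[str]) -> Callable[[List[Token]], List[Token]]:
    """Build a filter that flags tokens whose text is in `keywords`."""
    wanted = set(keywords)

    def apply(tokens: List[Token]) -> List[Token]:
        return [Token(t.text, t.keyword or t.text in wanted) for t in tokens]

    return apply


def toy_stemmer(tokens: List[Token]) -> List[Token]:
    """Hypothetical stemmer: strips a trailing 'ing' and undoubles the final
    letter, unless the token is flagged as a keyword."""
    out = []
    for t in tokens:
        text = t.text
        if not t.keyword and text.endswith("ing") and len(text) > 5:
            text = text[:-3]
            if len(text) >= 2 and text[-1] == text[-2]:
                text = text[:-1]
        out.append(Token(text, t.keyword))
    return out


def analyze(text: str, filters: List[Callable[[List[Token]], List[Token]]]) -> List[str]:
    tokens = [Token(word) for word in text.split()]  # whitespace tokenizer
    for token_filter in filters:
        tokens = token_filter(tokens)
    return [t.text for t in tokens]


marker = keyword_marker(["jumping"])
marker_first = analyze("fox running and jumping", [marker, toy_stemmer])
stemmer_first = analyze("fox running and jumping", [toy_stemmer, marker])
print(marker_first)   # ['fox', 'run', 'and', 'jumping']
print(stemmer_first)  # ['fox', 'run', 'and', 'jump']
```

When the stemmer runs first, jumping has already become jump by the time the marker looks for it, so nothing is protected.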
Example
To see how the keyword_marker filter works, you first need to produce a token
stream containing stemmed tokens.
The following analyze API request uses the
stemmer filter to create stemmed tokens for
fox running and jumping.
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [ "stemmer" ],
  "text": "fox running and jumping"
}
The request produces the following tokens. Note that running was stemmed to
run and jumping was stemmed to jump.
[ fox, run, and, jump ]
To prevent jumping from being stemmed, add the keyword_marker filter before
the stemmer filter in the previous analyze API request. Specify jumping in
the keywords parameter of the keyword_marker filter.
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ]
    },
    "stemmer"
  ],
  "text": "fox running and jumping"
}
The request produces the following tokens. running is still stemmed to run,
but jumping is not stemmed.
[ fox, run, and, jumping ]
To see the keyword attribute for these tokens, add the following arguments to
the analyze API request:
- explain: true
- attributes: keyword
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ]
    },
    "stemmer"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}
The API returns the following response. Note the jumping token has a
keyword attribute of true.
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "whitespace",
      "tokens": [
        {
          "token": "fox",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 0
        },
        {
          "token": "running",
          "start_offset": 4,
          "end_offset": 11,
          "type": "word",
          "position": 1
        },
        {
          "token": "and",
          "start_offset": 12,
          "end_offset": 15,
          "type": "word",
          "position": 2
        },
        {
          "token": "jumping",
          "start_offset": 16,
          "end_offset": 23,
          "type": "word",
          "position": 3
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "__anonymous__keyword_marker",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          }
        ]
      },
      {
        "name": "stemmer",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "run",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          }
        ]
      }
    ]
  }
}
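When the response is processed by a script, the keyword attribute can be read from each entry under detail.tokenfilters. The Python sketch below assumes the response has already been parsed into a dictionary. The keywords_by_filter helper is a hypothetical name, and the sample data is an abbreviated copy of the response above with offsets, types, and positions left out.

```python
from typing import Dict, List


def keywords_by_filter(response: dict) -> Dict[str, List[str]]:
    """Map each token filter name to the tokens it emitted with keyword == True."""
    result: Dict[str, List[str]] = {}
    for stage in response["detail"]["tokenfilters"]:
        result[stage["name"]] = [
            tok["token"] for tok in stage["tokens"] if tok.get("keyword") is True
        ]
    return result


# Abbreviated copy of the explain response (only token text and keyword flag).
response = {
    "detail": {
        "custom_analyzer": True,
        "charfilters": [],
        "tokenfilters": [
            {
                "name": "__anonymous__keyword_marker",
                "tokens": [
                    {"token": "fox", "keyword": False},
                    {"token": "running", "keyword": False},
                    {"token": "and", "keyword": False},
                    {"token": "jumping", "keyword": True},
                ],
            },
            {
                "name": "stemmer",
                "tokens": [
                    {"token": "fox", "keyword": False},
                    {"token": "run", "keyword": False},
                    {"token": "and", "keyword": False},
                    {"token": "jumping", "keyword": True},
                ],
            },
        ],
    }
}

flagged = keywords_by_filter(response)
print(flagged)
# {'__anonymous__keyword_marker': ['jumping'], 'stemmer': ['jumping']}
```

The flag set by the keyword_marker stage is still present after the stemmer stage, which is why jumping passes through unchanged.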
Configurable parameters
- ignore_case
  (Optional, boolean) If true, matching for the keywords and keywords_path parameters ignores letter case. Defaults to false.

- keywords
  (Required*, array of strings) Array of keywords. Tokens that match these keywords are not stemmed.
  This parameter, keywords_path, or keywords_pattern must be specified. You cannot specify this parameter and keywords_pattern.

- keywords_path
  (Required*, string) Path to a file that contains a list of keywords. Tokens that match these keywords are not stemmed.
  This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break.
  This parameter, keywords, or keywords_pattern must be specified. You cannot specify this parameter and keywords_pattern.

- keywords_pattern
  (Required*, string) Java regular expression used to match tokens. Tokens that match this expression are marked as keywords and not stemmed.
  This parameter, keywords, or keywords_path must be specified. You cannot specify this parameter and keywords or keywords_path.
  Poorly written regular expressions can cause Elasticsearch to run slowly or result in stack overflow errors, causing the running node to suddenly exit.
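The rules above can be summarized in a small model. This Python sketch is an illustration, not the filter's implementation: build_keyword_matcher is a hypothetical helper, keywords stands for a word list given inline or already loaded from keywords_path, and Python's re module is used in place of Java regular expressions, whose syntax differs in places. It also assumes that a pattern must match the whole token, and that ignore_case can be approximated with str.lower().

```python
import re
from typing import Callable, Optional, Sequence


def build_keyword_matcher(
    keywords: Optional[Sequence[str]] = None,
    keywords_pattern: Optional[str] = None,
    ignore_case: bool = False,
) -> Callable[[str], bool]:
    """Return a predicate that tells whether a token would be marked as a keyword."""
    if keywords is None and keywords_pattern is None:
        raise ValueError("specify keywords, keywords_path, or keywords_pattern")
    if keywords is not None and keywords_pattern is not None:
        raise ValueError("keywords_pattern cannot be combined with a word list")

    if keywords_pattern is not None:
        # Assumption: the whole token must match. ignore_case is documented
        # only for word lists, so it is not applied to the pattern here.
        compiled = re.compile(keywords_pattern)
        return lambda token: compiled.fullmatch(token) is not None

    if ignore_case:
        lowered = {word.lower() for word in keywords}
        return lambda token: token.lower() in lowered

    exact = set(keywords)
    return lambda token: token in exact


exact = build_keyword_matcher(keywords=["Jumping"])
loose = build_keyword_matcher(keywords=["Jumping"], ignore_case=True)
by_pattern = build_keyword_matcher(keywords_pattern="jump.*")

print(exact("jumping"))          # False: case differs
print(loose("JUMPING"))          # True: case is ignored
print(by_pattern("jumping"))     # True
print(by_pattern("outjumping"))  # False under the whole-token assumption
```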
Customize and add to an analyzer
To customize the keyword_marker filter, duplicate it to create the basis for a
new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following create index API request
uses a custom keyword_marker filter and the porter_stem
filter to configure a new custom analyzer.
The custom keyword_marker filter marks tokens specified in the
analysis/example_word_list.txt file as keywords. The porter_stem filter does
not stem these tokens.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "my_custom_keyword_marker_filter",
            "porter_stem"
          ]
        }
      },
      "filter": {
        "my_custom_keyword_marker_filter": {
          "type": "keyword_marker",
          "keywords_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
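The contents of analysis/example_word_list.txt are not shown here. As a rough sketch of the expected file format (UTF-8, one word per line), the following Python snippet writes a hypothetical word list to a temporary directory and reads it back. The words, the read_keyword_file helper, and the choice to trim whitespace and skip blank lines are assumptions for illustration and may not match how Elasticsearch parses the file.

```python
import os
import tempfile
from typing import List


def read_keyword_file(path: str) -> List[str]:
    """Read a UTF-8 word list with one keyword per line."""
    with open(path, encoding="utf-8") as handle:
        # Assumption: surrounding whitespace and blank lines are ignored.
        return [line.strip() for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as config_dir:
    # Hypothetical stand-in for <config>/analysis/example_word_list.txt
    path = os.path.join(config_dir, "example_word_list.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("jumping\nrunning\nnaïve\n")
    words = read_keyword_file(path)

print(words)  # ['jumping', 'running', 'naïve']
```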