CJK bigram token filter
Forms bigrams out of CJK (Chinese, Japanese, and Korean) tokens.
This filter is included in Elasticsearch’s built-in CJK language analyzer. It uses Lucene’s CJKBigramFilter.
Example
The following analyze API request demonstrates how the CJK bigram token filter works.
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_bigram"],
  "text" : "東京都は、日本の首都であり"
}
The filter produces the following tokens:
[ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
Add to an analyzer
The following create index API request uses the CJK bigram token filter to configure a new custom analyzer.
PUT /cjk_bigram_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_bigram" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_bigram"]
        }
      }
    }
  }
}
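You can verify the new analyzer with the analyze API against the index. This request is a sketch; the index name cjk_bigram_example and analyzer name standard_cjk_bigram come from the request above.

GET /cjk_bigram_example/_analyze
{
  "analyzer" : "standard_cjk_bigram",
  "text" : "東京都は、日本の首都であり"
}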
Configurable parameters
ignored_scripts
(Optional, array of character scripts) Array of character scripts for which to disable bigrams. Possible values:
- han
- hangul
- hiragana
- katakana
All non-CJK input is passed through unmodified.
output_unigrams
(Optional, boolean) If true, emit tokens in both bigram and unigram form. If false, a CJK character is output in unigram form when it has no adjacent characters. Defaults to false.
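The analyze API also accepts inline filter definitions, so you can try output_unigrams without creating an index. The following request is a sketch; with output_unigrams set to true, the filter should emit single characters such as 東 alongside bigrams such as 東京.

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : [
    {
      "type" : "cjk_bigram",
      "output_unigrams" : true
    }
  ],
  "text" : "東京都は、日本の首都であり"
}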
Customize
To customize the CJK bigram token filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
PUT /cjk_bigram_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "han_bigrams" : {
          "tokenizer" : "standard",
          "filter" : ["han_bigrams_filter"]
        }
      },
      "filter" : {
        "han_bigrams_filter" : {
          "type" : "cjk_bigram",
          "ignored_scripts" : [
            "hangul",
            "hiragana",
            "katakana"
          ],
          "output_unigrams" : true
        }
      }
    }
  }
}
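With this configuration, only Han characters are paired into bigrams; hangul, hiragana, and katakana tokens pass through in unigram form. You can check the behavior with the analyze API (a sketch, using the index and analyzer names defined above):

GET /cjk_bigram_example/_analyze
{
  "analyzer" : "han_bigrams",
  "text" : "東京都は、日本の首都であり"
}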