Original source: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-cjk-width-tokenfilter.html. Copyright of the original documentation belongs to www.elastic.co.
CJK width token filter
Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters as follows:
- Folds full-width ASCII character variants into the equivalent basic Latin characters
- Folds half-width Katakana character variants into the equivalent Kana characters
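For instance, the full-width ASCII folding can be checked directly with the _analyze API. The request below is an illustrative sketch, not taken from the original page; the input string ＡＢＣ１２３ is a hypothetical example.

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "ＡＢＣ１２３"
}

This should produce the single token ABC123.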
This filter is included in Elasticsearch’s built-in CJK language analyzer. It uses Lucene’s CJKWidthFilter.
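As a quick way to see the filter in context, the built-in cjk analyzer can also be called by name through the _analyze API. This is only a sketch; the exact token output depends on the analyzer's other filters (such as lowercasing and bigramming), which run after width normalization.

GET /_analyze
{
  "analyzer" : "cjk",
  "text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}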
This token filter can be viewed as a subset of NFKC/NFKD Unicode normalization. See the analysis-icu plugin for full normalization support.
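For comparison, here is a minimal sketch of full NFKC normalization, assuming the analysis-icu plugin is installed (the icu_normalizer token filter defaults to the nfkc_cf form):

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["icu_normalizer"],
  "text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}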
Example
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}
The filter produces the following token:
シーサイドライナー
Add to an analyzer
The following create index API request uses the CJK width token filter to configure a new custom analyzer.
PUT /cjk_width_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_width" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_width"]
        }
      }
    }
  }
}
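Once the index exists, the custom analyzer can be exercised against it. This follow-up request is an illustrative sketch reusing the example text from above:

GET /cjk_width_example/_analyze
{
  "analyzer" : "standard_cjk_width",
  "text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}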