WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Unicode Character Folding
Just as the lowercase token filter is a good starting point for many languages but falls short when exposed to the entire tower of Babel, the asciifolding token filter needs a more effective Unicode character-folding counterpart to deal with the many languages of the world.
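For comparison, asciifolding is typically wired into an analyzer like this; the index and analyzer names below (my_ascii_index, folding_analyzer) are only illustrative:

PUT /my_ascii_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

This handles Latin-script diacritics such as é or ü, but leaves Greek, Cyrillic, Arabic, and Han text untouched, because those characters have no ASCII equivalents to fold to.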
The icu_folding token filter (provided by the icu plug-in) does the same job as the asciifolding filter, but extends the transformation to scripts that are not ASCII-based, such as Greek, Hebrew, and Han. It also converts numbers in other scripts into their Latin equivalents and applies various other numeric, symbolic, and punctuation transformations.
The icu_folding token filter applies Unicode normalization and case folding from nfkc_cf automatically, so the icu_normalizer is not required:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥
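The Arabic-Indic digits ١٢٣٤٥ are folded to their Latin equivalents, so the _analyze request above should return a single token whose text is 12345. As a rough sketch of the response (the offsets, type, and position shown here are illustrative, not taken from an actual run):

{
  "tokens": [
    {
      "token": "12345",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 0
    }
  ]
}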
If there are particular characters that you would like to protect from folding, you can use a UnicodeSet (much like a character class in regular expressions) to specify which Unicode characters may be folded. For instance, to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, you would specify a character class representing all Unicode characters except for those letters: [^åäöÅÄÖ] (^ means everything except).
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
The swedish_folding token filter customizes the icu_folding token filter to exclude the Swedish letters, both uppercase and lowercase.

The swedish_analyzer first tokenizes words, then folds each token by using the swedish_folding filter, and then lowercases each token in case it includes some of the uppercase excluded letters: Å, Ä, or Ö.
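As a quick check, you could run some Swedish text containing both protected and unprotected accented characters through the new analyzer (the sample sentence is only an illustration):

GET /my_index/_analyze?analyzer=swedish_analyzer
Störst är Göteborgs café

With the swedish_folding filter in place, the tokens should come back as störst, är, göteborgs, and cafe: the protected ö and ä are only lowercased, while the unprotected é is folded to e.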