WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Unicode Character Foldingedit
In the same way as the lowercase
token filter is a good starting point for
many languages but falls short when exposed to the entire tower of Babel, so
the asciifolding
token filter requires a more
effective Unicode character-folding counterpart for dealing with the many
languages of the world.
The icu_folding
token filter (provided by the icu
plug-in)
does the same job as the asciifolding
filter, but extends the transformation
to scripts that are not ASCII-based, such as Greek, Hebrew, Han, conversion
of numbers in other scripts into their Latin equivalents, plus various other
numeric, symbolic, and punctuation transformations.
The icu_folding
token filter applies Unicode normalization and case folding
from nfkc_cf
automatically, so the icu_normalizer
is not required:
PUT /my_index { "settings": { "analysis": { "analyzer": { "my_folder": { "tokenizer": "icu_tokenizer", "filter": [ "icu_folding" ] } } } } } GET /my_index/_analyze?analyzer=my_folder ١٢٣٤٥
If there are particular characters that you would like to protect from
folding, you can use a
UnicodeSet
(much like a character class in regular expressions) to specify which Unicode
characters may be folded. For instance, to exclude the Swedish letters å
,
ä
, ö
, Å
, Ä
, and Ö
from folding, you would specify a character class
representing all Unicode characters, except for those letters: [^åäöÅÄÖ]
(^
means everything except).
PUT /my_index { "settings": { "analysis": { "filter": { "swedish_folding": { "type": "icu_folding", "unicodeSetFilter": "[^åäöÅÄÖ]" } }, "analyzer": { "swedish_analyzer": { "tokenizer": "icu_tokenizer", "filter": [ "swedish_folding", "lowercase" ] } } } } }
The |
|
The |
- Elasticsearch - The Definitive Guide:
- Foreword
- Preface
- Getting Started
- You Know, for Search…
- Installing and Running Elasticsearch
- Talking to Elasticsearch
- Document Oriented
- Finding Your Feet
- Indexing Employee Documents
- Retrieving a Document
- Search Lite
- Search with Query DSL
- More-Complicated Searches
- Full-Text Search
- Phrase Search
- Highlighting Our Searches
- Analytics
- Tutorial Conclusion
- Distributed Nature
- Next Steps
- Life Inside a Cluster
- Data In, Data Out
- What Is a Document?
- Document Metadata
- Indexing a Document
- Retrieving a Document
- Checking Whether a Document Exists
- Updating a Whole Document
- Creating a New Document
- Deleting a Document
- Dealing with Conflicts
- Optimistic Concurrency Control
- Partial Updates to Documents
- Retrieving Multiple Documents
- Cheaper in Bulk
- Distributed Document Store
- Searching—The Basic Tools
- Mapping and Analysis
- Full-Body Search
- Sorting and Relevance
- Distributed Search Execution
- Index Management
- Inside a Shard
- You Know, for Search…
- Search in Depth
- Structured Search
- Full-Text Search
- Multifield Search
- Proximity Matching
- Partial Matching
- Controlling Relevance
- Theory Behind Relevance Scoring
- Lucene’s Practical Scoring Function
- Query-Time Boosting
- Manipulating Relevance with Query Structure
- Not Quite Not
- Ignoring TF/IDF
- function_score Query
- Boosting by Popularity
- Boosting Filtered Subsets
- Random Scoring
- The Closer, The Better
- Understanding the price Clause
- Scoring with Scripts
- Pluggable Similarity Algorithms
- Changing Similarities
- Relevance Tuning Is the Last 10%
- Dealing with Human Language
- Aggregations
- Geolocation
- Modeling Your Data
- Administration, Monitoring, and Deployment