WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Unicode Character Folding
Just as the lowercase token filter is a good starting point for many languages but falls short when exposed to the entire tower of Babel, the asciifolding token filter needs a more effective Unicode character-folding counterpart to deal with the many languages of the world.
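For comparison, asciifolding is typically wired into an analyzer like this; the index and analyzer names below (my_ascii_index, folding_analyzer) are only illustrative:

PUT /my_ascii_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

This handles Latin-script diacritics such as é or ü, but leaves Greek, Cyrillic, Arabic, and Han text untouched, because those characters have no ASCII equivalents to fold to.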
The icu_folding token filter (provided by the icu plug-in) does the same job as the asciifolding filter, but extends the transformation to scripts that are not ASCII-based, such as Greek, Hebrew, and Han. It also converts numbers in other scripts into their Latin equivalents and applies various other numeric, symbolic, and punctuation transformations.
The icu_folding token filter applies Unicode normalization and case folding from nfkc_cf automatically, so the icu_normalizer is not required:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥
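The Arabic-Indic digits ١٢٣٤٥ are folded to their Latin equivalents, so the _analyze request above should return a single token whose text is 12345. As a rough sketch of the response (the offsets, type, and position shown here are illustrative, not taken from an actual run):

{
  "tokens": [
    {
      "token": "12345",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 0
    }
  ]
}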
If there are particular characters that you would like to protect from folding, you can use a UnicodeSet (much like a character class in regular expressions) to specify which Unicode characters may be folded. For instance, to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, you would specify a character class representing all Unicode characters except for those letters: [^åäöÅÄÖ] (^ means everything except).
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
The swedish_folding token filter customizes the icu_folding token filter to exclude the Swedish letters, both uppercase and lowercase.

The swedish_analyzer first tokenizes words, then folds each token by using the swedish_folding filter, and then lowercases each token in case it includes some of the uppercase excluded letters: Å, Ä, or Ö.
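As a quick check, you could run some Swedish text containing both protected and unprotected accented characters through the new analyzer (the sample sentence is only an illustration):

GET /my_index/_analyze?analyzer=swedish_analyzer
Störst är Göteborgs café

With the swedish_folding filter in place, the tokens should come back as störst, är, göteborgs, and cafe: the protected ö and ä are only lowercased, while the unprotected é is folded to e.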