Getting Started with Languages
Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages:
Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
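To use one of these analyzers, you specify it by name in a field's mapping. As a minimal sketch (the index name my_index, the type blog, and the field title are placeholders), the following configures a string field to be analyzed with the english analyzer at both index and search time:

PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english"
        }
      }
    }
  }
}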
These analyzers typically perform four roles:
- Tokenize text into individual words:
  The quick brown foxes → [The, quick, brown, foxes]
- Lowercase tokens:
  The → the
- Remove common stopwords:
  [The, quick, brown, foxes] → [quick, brown, foxes]
- Stem tokens to their root form:
  foxes → fox
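You can inspect what any of these analyzers does to a piece of text with the analyze API. As a rough sketch (using the query-string form, with the text to analyze in the request body), running the example sentence through the english analyzer should return the stopword-filtered, stemmed tokens quick, brown, and fox:

GET /_analyze?analyzer=english
The quick brown foxes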
Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:
- The english analyzer removes the possessive 's:
  John's → john
- The french analyzer removes elisions like l' and qu' and diacritics like ¨ or ^:
  l'église → eglis
- The german analyzer normalizes terms, replacing ä and ae with a, or ß with ss, among others:
  äußerst → ausserst