WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Getting Started with Languages
Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages:
Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
These analyzers typically perform four roles:
- Tokenize text into individual words: The quick brown foxes → [The, quick, brown, foxes]
- Lowercase tokens: The → the
- Remove common stopwords: [The, quick, brown, foxes] → [quick, brown, foxes]
- Stem tokens to their root form: foxes → fox
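The four roles can be sketched as a small pipeline. This is only a toy illustration in Python, not how Elasticsearch implements its analyzers: the function names, the five-word stopword list, and the naive plural-stripping rule are all assumptions made for the example, standing in for the much more complete tokenizers, stopword lists, and stemming algorithms that the real analyzers use.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# real analyzers ship much longer, language-specific lists.
STOPWORDS = {"the", "a", "an", "and", "of"}


def tokenize(text):
    """Split text into word tokens on runs of non-word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def toy_stem(token):
    """Naive plural stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("xes", "ches", "shes", "sses"):
        if token.endswith(suffix):
            return token[:-2]  # foxes -> fox
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]  # dogs -> dog
    return token


def analyze(text):
    """Apply the four roles in order: tokenize, lowercase, stop, stem."""
    tokens = tokenize(text)
    tokens = lowercase(tokens)
    tokens = remove_stopwords(tokens)
    return [toy_stem(token) for token in tokens]


print(analyze("The quick brown foxes"))  # ['quick', 'brown', 'fox']
```

Lowercasing runs before stopword removal in this sketch so that The matches the lowercase entry the in the stopword list.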
Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:
- The english analyzer removes the possessive 's: John's → john
- The french analyzer removes elisions like l' and qu' and diacritics like ¨ or ^: l'église → eglis
- The german analyzer normalizes terms, replacing ä and ae with a, or ß with ss, among others: äußerst → ausserst
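The language-specific transformations above can be approximated with a few string operations. The following Python sketch is illustrative only: the helper names are hypothetical, the rules cover just the cases named in the list, and it does not reproduce the actual filters inside Elasticsearch. It also performs no stemming, so the French example yields eglise rather than the stemmed eglis.

```python
import re
import unicodedata


def strip_english_possessive(token):
    """Drop a trailing 's (straight or curly apostrophe)."""
    return re.sub(r"['\u2019]s$", "", token)


def strip_french_elision(token):
    """Drop a leading elided l' or qu' (only the two forms named above)."""
    return re.sub(r"^(?:l|qu)['\u2019]", "", token)


def strip_diacritics(token):
    """Decompose characters, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_german(token):
    """Fold ß to ss, ae to a, and ä to a (a subset of the real rules)."""
    return token.replace("ß", "ss").replace("ae", "a").replace("ä", "a")


print(strip_english_possessive("John's").lower())          # john
print(strip_diacritics(strip_french_elision("l'église")))  # eglise
print(normalize_german("äußerst"))                         # ausserst
```

Stripping diacritics by Unicode decomposition is a generic technique chosen here for brevity; a real analyzer may handle each language's accents and elisions with dedicated rules.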