WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Mixed-Language Fields
Usually, documents that mix multiple languages in a single field come from sources beyond your control, such as pages scraped from the Web:
{ "body": "Page not found / Seite nicht gefunden / Page non trouvée" }
They are the most difficult type of multilingual document to handle correctly.
Although you can simply use the standard analyzer on all fields, your documents will be less searchable than if you had used an appropriate stemmer. But of course, you can’t choose just one stemmer, because stemmers are language specific. Or rather, stemmers are language and script specific. As discussed in Stemmer per Script, if every language uses a different script, then stemmers can be combined.
Assuming that your mix of languages uses the same script, such as Latin, you have three choices available to you:
- Split into separate fields
- Analyze multiple times
- Use n-grams
Split into Separate Fields
The Compact Language Detector mentioned in Identifying Language can tell you which parts of the document are in which language. You can split up the text based on language and use the same approach as was used in One Language per Field.
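As a rough sketch of this approach, the scraped page shown earlier could be split into one field per detected language, each mapped with its own language analyzer. The index name pages, the type name page, and the field names body_en, body_de, and body_fr are hypothetical and chosen only for illustration. The mapping could be sent as the body of PUT /pages:

```json
{
  "mappings": {
    "page": {
      "properties": {
        "body_en": { "type": "string", "analyzer": "english" },
        "body_de": { "type": "string", "analyzer": "german" },
        "body_fr": { "type": "string", "analyzer": "french" }
      }
    }
  }
}
```

After language detection, the mixed-language document would then be indexed with each part in its own field:

```json
{
  "body_en": "Page not found",
  "body_de": "Seite nicht gefunden",
  "body_fr": "Page non trouvée"
}
```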
Analyze Multiple Times
If you primarily deal with a limited number of languages, you could use multi-fields to analyze the text once per language:
PUT /movies
{
  "mappings": {
    "title": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "de": { "type": "string", "analyzer": "german" },
            "en": { "type": "string", "analyzer": "english" },
            "fr": { "type": "string", "analyzer": "french" },
            "es": { "type": "string", "analyzer": "spanish" }
          }
        }
      }
    }
  }
}
Use n-grams
You could index all words as n-grams, using the same approach as described in Ngrams for Compound Words. Most inflections involve adding a suffix (or in some languages, a prefix) to a word, so by breaking each word into n-grams, you have a good chance of matching words that are similar but not exactly the same. This can be combined with the analyze-multiple-times approach to provide a catchall field for unsupported languages:
PUT /movies
{
  "settings": {
    "analysis": {...}
  },
  "mappings": {
    "title": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "de": { "type": "string", "analyzer": "german" },
            "en": { "type": "string", "analyzer": "english" },
            "fr": { "type": "string", "analyzer": "french" },
            "es": { "type": "string", "analyzer": "spanish" },
            "general": { "type": "string", "analyzer": "trigrams" }
          }
        }
      }
    }
  }
}
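In the analysis section, the trigrams analyzer is defined as described in Ngrams for Compound Words.
The title.general field uses the trigrams analyzer to index any language.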
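The analysis settings are elided in the request above and are left that way. For orientation only, a trigrams analyzer is typically built from an ngram token filter whose minimum and maximum gram sizes are both 3. The following settings fragment is an assumption about what such a definition could look like, not necessarily the one intended here; the filter name trigrams_filter is hypothetical:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "trigrams_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "trigrams": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "trigrams_filter" ]
        }
      }
    }
  }
}
```

With an analyzer like this, a word such as gefunden would be indexed as the trigrams gef, efu, fun, und, nde, and den, so inflected forms that share most of their letters also share most of their indexed terms. Words shorter than three characters would produce no terms at all.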
When querying the catchall general field, you can use minimum_should_match to reduce the number of low-quality matches. It may also be necessary to boost the other fields slightly more than the general field, so that matches on the main language fields are given more weight than those on the general field.
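A query along these lines might look like the following sketch, sent as the body of GET /movies/_search. It assumes the movies index and the multi-field mapping shown earlier. The search text, the boost of 1.5, and the 75% threshold are illustrative values rather than recommendations, and would need tuning against real data:

```json
{
  "query": {
    "multi_match": {
      "query": "club de la lucha",
      "type": "most_fields",
      "fields": [
        "title.de^1.5",
        "title.en^1.5",
        "title.fr^1.5",
        "title.es^1.5",
        "title.general"
      ],
      "minimum_should_match": "75%"
    }
  }
}
```

Listing the language subfields explicitly makes the boost on each one visible. The most_fields type combines the scores from every matching field, so a document that matches both a stemmed language field and the general field ranks above one that matches the n-gram field alone.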