Mixed-Language Fields | Elasticsearch: The Definitive Guide [2.x]

Original source: https://www.elastic.co/guide/en/elasticsearch/guide/current/mixed-lang-fields.html. Copyright belongs to www.elastic.co.

WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

Mixed-Language Fields

Usually, documents that mix multiple languages in a single field come from sources beyond your control, such as pages scraped from the Web:

{ "body": "Page not found / Seite nicht gefunden / Page non trouvée" }

They are the most difficult type of multilingual document to handle correctly. Although you can simply use the standard analyzer on all fields, your documents will be less searchable than if you had used an appropriate stemmer. But you can't choose just one stemmer, because stemmers are language-specific. Or rather, stemmers are language- and script-specific. As discussed in Stemmer per Script, if every language uses a different script, then stemmers can be combined.

Assuming that your mix of languages uses the same script, such as Latin, you have three choices:

Split into separate fields
Analyze multiple times
Use n-grams

Split into Separate Fields

The Compact Language Detector mentioned in Identifying Language can tell you which parts of the document are in which language. You can split up the text based on language and use the same approach as was used in One Language per Field.
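The splitting step can be sketched roughly as follows. This is a minimal illustration, not the guide's own code: it assumes a language detector has already produced `(language, text)` segments, and the `body_<lang>` field naming scheme and the `body_other` fallback are hypothetical.

```python
from collections import defaultdict


def split_by_language(segments, supported=("en", "de", "fr", "es")):
    """Group (language, text) segments into one field per language.

    `segments` is assumed to be the output of a language detector.
    The `body_<lang>` field names are hypothetical.
    """
    grouped = defaultdict(list)
    for lang, text in segments:
        # Text in an unsupported language goes to a generic fallback field.
        field = f"body_{lang}" if lang in supported else "body_other"
        grouped[field].append(text)
    # Join the pieces so each field holds a single string.
    return {field: " ".join(parts) for field, parts in grouped.items()}


# Segments as a detector might report them for the example document.
segments = [
    ("en", "Page not found"),
    ("de", "Seite nicht gefunden"),
    ("fr", "Page non trouvée"),
]
doc = split_by_language(segments)
print(doc)
```

Each resulting field can then be mapped with the analyzer for its language, as in One Language per Field.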

Analyze Multiple Times

If you primarily deal with a limited number of languages, you could use multi-fields to analyze the text once per language:

PUT /movies
{
  "mappings": {
    "title": {
      "properties": {
        "title": { 
          "type": "string",
          "fields": {
            "de": { 
              "type":     "string",
              "analyzer": "german"
            },
            "en": { 
              "type":     "string",
              "analyzer": "english"
            },
            "fr": { 
              "type":     "string",
              "analyzer": "french"
            },
            "es": { 
              "type":     "string",
              "analyzer": "spanish"
            }
          }
        }
      }
    }
  }
}

	The main `title` field uses the `standard` analyzer.
	Each subfield applies a different language analyzer to the text in the `title` field.
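With this mapping in place, a search has to target the main field and every language subfield. The sketch below builds such a request body in Python; the helper name `multi_language_query` is hypothetical, and the choice of a `multi_match` query of type `most_fields` is an assumption that mirrors the query shown later in this section.

```python
import json


def multi_language_query(text, languages=("de", "en", "fr", "es")):
    """Build a search body that queries `title` and each language subfield."""
    # The main field plus one subfield per language analyzer.
    fields = ["title"] + [f"title.{lang}" for lang in languages]
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                # Combine scores from every field that matches.
                "type": "most_fields",
            }
        }
    }


body = multi_language_query("club de la lucha")
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed JSON can be sent as the body of a search request against the `movies` index.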

Use n-grams

You could index all words as n-grams, using the same approach as described in Ngrams for Compound Words. Most inflections involve adding a suffix (or in some languages, a prefix) to a word, so by breaking each word into n-grams, you have a good chance of matching words that are similar but not exactly the same. This can be combined with the analyze-multiple-times approach to provide a catchall field for unsupported languages:

PUT /movies
{
  "settings": {
    "analysis": {...} 
  },
  "mappings": {
    "title": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "de": {
              "type":     "string",
              "analyzer": "german"
            },
            "en": {
              "type":     "string",
              "analyzer": "english"
            },
            "fr": {
              "type":     "string",
              "analyzer": "french"
            },
            "es": {
              "type":     "string",
              "analyzer": "spanish"
            },
            "general": { 
              "type":     "string",
              "analyzer": "trigrams"
            }
          }
        }
      }
    }
  }
}

	In the `analysis` section, we define the same `trigrams` analyzer as described in Ngrams for Compound Words.
	The `title.general` field uses the `trigrams` analyzer to index any language.
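To see why n-grams tolerate inflections, the following sketch approximates trigram generation in plain Python. It is an illustration only: the functions `trigrams` and `shared_trigrams` are hypothetical helpers, and the real `trigrams` analyzer runs inside Elasticsearch on tokenized text.

```python
def trigrams(word):
    """Return all 3-character substrings of a lowercased word."""
    word = word.lower()
    # Words shorter than three characters produce no trigrams.
    return [word[i:i + 3] for i in range(len(word) - 2)]


def shared_trigrams(a, b):
    """Return the trigrams that two words have in common, sorted."""
    return sorted(set(trigrams(a)) & set(trigrams(b)))


print(trigrams("Seiten"))                  # ['sei', 'eit', 'ite', 'ten']
print(shared_trigrams("Seite", "Seiten"))  # ['eit', 'ite', 'sei']
```

The singular and plural forms share three of their trigrams, so a query for one can still match a document containing the other without any language-specific stemmer.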

When querying the catchall `title.general` field, you can use `minimum_should_match` to reduce the number of low-quality matches. It may also be necessary to boost the other fields slightly more than the `title.general` field, so that matches on the main language fields are given more weight than those on the catchall field:

GET /movies/_search
{
    "query": {
        "multi_match": {
            "query":    "club de la lucha",
            "fields": [ "title*^1.5", "title.general" ], 
            "type":     "most_fields",
            "minimum_should_match": "75%" 
        }
    }
}

	All `title` or `title.*` fields are given a slight boost over the `title.general` field.
	The `minimum_should_match` parameter reduces the number of low-quality matches returned, especially important for the `title.general` field.
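As a rough illustration of what a percentage value means in practice, the sketch below computes how many query clauses must match. It assumes a positive percentage and relies on Elasticsearch rounding the computed number down; the helper `required_matches` is hypothetical.

```python
def required_matches(num_clauses, percent):
    """Number of clauses that must match for a positive percentage.

    The value computed from the percentage is rounded down.
    """
    return num_clauses * percent // 100


# "club de la lucha" yields four word-level terms in a given field.
print(required_matches(4, 75))   # 3
# A field that produces more terms (for example, many trigrams) scales up.
print(required_matches(10, 75))  # 7
```

Because n-gram fields turn each word into several terms, requiring most of them to match filters out documents that share only one or two incidental trigrams with the query.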
