Algorithmic Stemmers | Elasticsearch: The Definitive Guide [2.x]

Original source: https://www.elastic.co/guide/en/elasticsearch/guide/current/algorithmic-stemmers.html. Copyright belongs to www.elastic.co.

WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

Algorithmic Stemmers

Most of the stemmers available in Elasticsearch are algorithmic in that they apply a series of rules to a word in order to reduce it to its root form, such as stripping the final s or es from plurals. They don’t have to know anything about individual words in order to stem them.

These algorithmic stemmers have the advantage that they are available out of the box, are fast, use little memory, and work well for regular words. The downside is that they don’t cope well with irregular words like be, are, and am, or mice and mouse.
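One way to see this trade-off is to run a few words through the built-in english analyzer with the analyze API. The request below is a minimal sketch, assuming a version whose analyze API accepts a JSON body. The sample text is our own, and no output is shown because the exact tokens depend on the Elasticsearch version.

```json
# Sample words chosen for illustration: one regular plural, one irregular plural
GET /_analyze
{
  "analyzer": "english",
  "text":     "foxes fox mice mouse"
}
```

If the stemmer behaves as described above, foxes and fox should come back as the same term, while mice and mouse should remain distinct.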

One of the earliest stemming algorithms is the Porter stemmer for English, which is still the recommended English stemmer today. Martin Porter subsequently went on to create the Snowball language for creating stemming algorithms, and a number of the stemmers available in Elasticsearch are written in Snowball.

The kstem token filter is a stemmer for English which combines the algorithmic approach with a built-in dictionary. The dictionary contains a list of root words and exceptions in order to avoid conflating words incorrectly. kstem tends to stem less aggressively than the Porter stemmer.
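To compare the two approaches on your own vocabulary, you could define one analyzer per filter and analyze the same text with each. This is only a sketch: the index name stemmer_demo, the analyzer names porter_english and kstem_english, and the sample text are all made up for illustration.

```json
# Hypothetical index with one analyzer per stemmer
PUT /stemmer_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "porter_english": {
          "tokenizer": "standard",
          "filter":    [ "lowercase", "porter_stem" ]
        },
        "kstem_english": {
          "tokenizer": "standard",
          "filter":    [ "lowercase", "kstem" ]
        }
      }
    }
  }
}

# Same text through each analyzer; compare the returned tokens
GET /stemmer_demo/_analyze
{
  "analyzer": "porter_english",
  "text":     "organization organizing organs"
}

GET /stemmer_demo/_analyze
{
  "analyzer": "kstem_english",
  "text":     "organization organizing organs"
}
```

Words that the first analyzer collapses to one term but the second keeps apart are the cases where kstem is being less aggressive.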

Using an Algorithmic Stemmer

While you can use the porter_stem or kstem token filter directly, or create a language-specific Snowball stemmer with the snowball token filter, all of the algorithmic stemmers are exposed via a single unified interface: the stemmer token filter, which accepts the language parameter.
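As a sketch of what these options look like in practice, the settings below declare an English stemmer three ways: the porter_stem filter referenced by name, a snowball filter, and a stemmer filter. The index, filter, and analyzer names are hypothetical. The three are not guaranteed to produce identical stems, since each may select a different underlying algorithm.

```json
# Hypothetical index showing three ways to reference an English stemmer
PUT /stemmer_styles
{
  "settings": {
    "analysis": {
      "filter": {
        "my_snowball": { "type": "snowball", "language": "English" },
        "my_stemmer":  { "type": "stemmer",  "language": "english" }
      },
      "analyzer": {
        "direct_porter": { "tokenizer": "standard", "filter": [ "lowercase", "porter_stem" ] },
        "via_snowball":  { "tokenizer": "standard", "filter": [ "lowercase", "my_snowball" ] },
        "via_stemmer":   { "tokenizer": "standard", "filter": [ "lowercase", "my_stemmer" ] }
      }
    }
  }
}
```

With the stemmer filter, switching to another algorithm only means changing the language value, for example to light_english.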

For instance, perhaps you find the default stemmer used by the english analyzer too aggressive and want a gentler one. The first step is to look up the configuration for the english analyzer in the language analyzers documentation, which shows the following:

{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "english_keywords": {
          "type":       "keyword_marker", 
          "keywords":   []
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english" 
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english" 
        }
      },
      "analyzer": {
        "english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

	The `english_keywords` filter (type `keyword_marker`) lists words that should not be stemmed. It defaults to an empty list.
	The `english` analyzer uses two stemmers: `possessive_english` and `english`. The possessive stemmer removes `'s` from each word before the token passes through `lowercase`, `english_stop`, `english_keywords`, and `english_stemmer`.

Having reviewed the current configuration, we can use it as the basis for a new analyzer, with the following changes:

Change the english_stemmer from english (which maps to the porter_stem token filter) to light_english (which maps to the less aggressive kstem token filter).
Add the asciifolding token filter to remove any diacritics from foreign words.
Remove the keyword_marker token filter, as we don’t need it. (We discuss this in more detail in Controlling Stemming.)

Our new custom analyzer would look like this:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "light_english_stemmer": {
          "type":       "stemmer",
          "language":   "light_english" 
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "light_english_stemmer", 
            "asciifolding" 
          ]
        }
      }
    }
  }
}

	Replaced the `english` stemmer with the less aggressive `light_english` stemmer
	Added the `asciifolding` token filter
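
Once the index exists, the analyze API offers a quick check that the new analyzer behaves as intended. This is a sketch, assuming a version whose analyze API accepts a JSON body. The sample text is our own, and the output is omitted because it depends on the Elasticsearch version.

```json
# Sample sentence chosen to exercise the possessive, stopword, stemming, and folding steps
GET /my_index/_analyze
{
  "analyzer": "english",
  "text":     "The café's owners were organizing weekly tastings"
}
```

If the filters run as configured, the possessive 's should be stripped, the should be dropped as a stopword, and é should be folded to e. Sending the same body to GET /_analyze (no index, so the built-in english analyzer applies) gives a baseline for judging how much gentler the new stemming is.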
