Using Stopwords | Elasticsearch: The Definitive Guide [2.x]

原文地址: https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html, 版权归 www.elastic.co 所有

WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

» » »

« Pros and Cons of Stopwords Stopwords and Performance »

Using Stopwordsedit

The removal of stopwords is handled by the stop token filter which can be used when creating a custom analyzer (see Using the stop Token Filter). However, some out-of-the-box analyzers come with the stop filter pre-integrated:

Language analyzers: Each language analyzer defaults to using the appropriate stopwords list for that language. For instance, the english analyzer uses the _english_ stopwords list.
standard analyzer: Defaults to the empty stopwords list: _none_, essentially disabling stopwords.
pattern analyzer: Defaults to _none_, like the standard analyzer.

Stopwords and the Standard Analyzeredit

To use custom stopwords in conjunction with the standard analyzer, all we need to do is to create a configured version of the analyzer and pass in the list of stopwords that we require:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { 
          "type": "standard", 
          "stopwords": [ "and", "the" ] 
        }
      }
    }
  }
}

	This is a custom analyzer called `my_analyzer`.
	This analyzer is the `standard` analyzer with some custom configuration.
	The stopwords to filter out are `and` and `the`.

This same technique can be used to configure custom stopword lists for any of the language analyzers.

Maintaining Positionsedit

The output from the analyze API is quite interesting:

GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead

{
   "tokens": [
      {
         "token":        "quick",
         "start_offset": 4,
         "end_offset":   9,
         "type":         "<ALPHANUM>",
         "position":     1 
      },
      {
         "token":        "dead",
         "start_offset": 18,
         "end_offset":   22,
         "type":         "<ALPHANUM>",
         "position":     4 
      }
   ]
}

Note the position of each token.

The stopwords have been filtered out, as expected, but the interesting part is that the position of the two remaining terms is unchanged: quick is the second word in the original sentence, and dead is the fifth. This is important for phrase queries—if the positions of each term had been adjusted, a phrase query for quick dead would have matched the preceding example incorrectly.

Specifying Stopwordsedit

Stopwords can be passed inline, as we did in the previous example, by specifying an array:

"stopwords": [ "and", "the" ]

The default stopword list for a particular language can be specified using the _lang_ notation:

"stopwords": "_english_"

The predefined language-specific stopword lists available in Elasticsearch can be found in the stop token filter documentation.

Stopwords can be disabled by specifying the special list: _none_. For instance, to use the english analyzer without stopwords, you can do the following:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":      "english", 
          "stopwords": "_none_" 
        }
      }
    }
  }
}

	The `my_english` analyzer is based on the `english` analyzer.
	But stopwords are disabled.

Finally, stopwords can also be listed in a file with one word per line. The file must be present on all nodes in the cluster, and the path can be specified with the stopwords_path parameter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords_path": "stopwords/english.txt" 
        }
      }
    }
  }
}

The path to the stopwords file, relative to the Elasticsearch config directory

Using the stop Token Filteredit

The stop token filter can be combined with a tokenizer and other token filters when you need to create a custom analyzer. For instance, let’s say that we wanted to create a Spanish analyzer with the following:

A custom stopwords list
The light_spanish stemmer
The asciifolding filter to remove diacritics

We could set that up as follows:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type":        "stop",
          "stopwords": [ "si", "esta", "el", "la" ]  
        },
        "light_spanish": { 
          "type":     "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "spanish",
          "filter": [ 
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}

	The `stop` token filter takes the same `stopwords` and `stopwords_path` parameters as the `standard` analyzer.
	See Algorithmic Stemmers.
	The order of token filters is important, as explained next.

We have placed the spanish_stop filter after the asciifolding filter. This means that esta, ésta, and está will first have their diacritics removed to become just esta, which will then be removed as a stopword. If, instead, we wanted to remove esta and ésta, but not está, we would have to put the spanish_stop filter before the asciifolding filter, and specify both words in the stopwords list.

Updating Stopwordsedit

A few techniques can be used to update the list of stopwords used by an analyzer. Analyzers are instantiated at index creation time, when a node is restarted, or when a closed index is reopened.

If you specify stopwords inline with the stopwords parameter, your only option is to close the index and update the analyzer configuration with the update index settings API, then reopen the index.

Updating stopwords is easier if you specify them in a file with the stopwords_path parameter. You can just update the file (on every node in the cluster) and then force the analyzers to be re-created by either of these actions:

Closing and reopening the index (see open/close index), or
Restarting each node in the cluster, one by one

Of course, updating the stopwords list will not change any documents that have already been indexed. It will apply only to searches and to new or updated documents. To apply the changes to existing documents, you will need to reindex your data. See Reindexing Your Data.

« Pros and Cons of Stopwords Stopwords and Performance »