Edge n-gram token filter | Elasticsearch Guide [7.7]

Original source: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-edgengram-tokenfilter.html. Copyright of the original document belongs to www.elastic.co.

IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Edge n-gram token filter

Forms an n-gram of a specified length from the beginning of a token.

For example, you can use the edge_ngram token filter to change quick to qu.

When not customized, the filter creates 1-character edge n-grams by default.

This filter uses Lucene’s EdgeNGramTokenFilter.

The edge_ngram filter is similar to the ngram token filter. However, the edge_ngram filter only outputs n-grams that start at the beginning of a token. These edge n-grams are useful for search-as-you-type queries.
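
To see the difference in practice, one option is to run a short text through the ngram filter with the analyze API. The request below is an illustrative sketch: the sample text fox and the gram sizes are arbitrary choices made for this comparison.

```console
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "ngram",
      "min_gram": 1,
      "max_gram": 2
    }
  ],
  "text": "fox"
}
```

With these settings, the ngram filter should also emit grams that start mid-token, such as o, ox, and x. An edge_ngram filter with the same min_gram and max_gram would emit only f and fo.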

Example

The following analyze API request uses the edge_ngram filter to convert the quick brown fox jumps to 1-character and 2-character edge n-grams:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2
    }
  ],
  "text": "the quick brown fox jumps"
}

The filter produces the following tokens:

[ t, th, q, qu, b, br, f, fo, j, ju ]

Add to an analyzer

The following create index API request uses the edge_ngram filter to configure a new custom analyzer.

PUT edge_ngram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_edge_ngram": {
          "tokenizer": "standard",
          "filter": [ "edge_ngram" ]
        }
      }
    }
  }
}
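
To check what the new analyzer emits, you could run it through the analyze API on the index created above. This is a minimal sketch, and the sample text is an arbitrary choice.

```console
GET edge_ngram_example/_analyze
{
  "analyzer": "standard_edge_ngram",
  "text": "quick fox"
}
```

Because the built-in edge_ngram filter creates 1-character edge n-grams by default, this request should reduce each token to its first character (here, q and f).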

Configurable parameters

max_gram

(Optional, integer) Maximum character length of a gram. For custom token filters, defaults to 2. For the built-in edge_ngram filter, defaults to 1.

See Limitations of the max_gram parameter.

min_gram

(Optional, integer) Minimum character length of a gram. Defaults to 1.

side

(Optional, string) Deprecated. Indicates whether to truncate tokens from the front or back. Defaults to front.

Instead of using the back value, you can use the reverse token filter before and after the edge_ngram filter to achieve the same results.
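
The reverse-based approach can be sketched with the analyze API as follows. The filter chain reverses each token, takes edge n-grams from the front of the reversed token, and then reverses the grams again. The sample text and gram sizes are assumptions chosen for illustration.

```console
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "reverse",
    { "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2
    },
    "reverse"
  ],
  "text": "quick"
}
```

Here quick is reversed to kciuq, the edge_ngram filter keeps k and kc, and the second reverse filter turns those into k and ck, which are the n-grams anchored at the end of the original token.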

Customize

To customize the edge_ngram filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom edge_ngram filter that forms n-grams of 3 to 5 characters.

PUT edge_ngram_custom_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "3_5_edgegrams" ]
        }
      },
      "filter": {
        "3_5_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}
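
One way to verify the custom filter is to analyze a sample text against the new index. This is a sketch only, and the sample text is an arbitrary choice.

```console
GET edge_ngram_custom_example/_analyze
{
  "analyzer": "default",
  "text": "quick fox"
}
```

With min_gram set to 3 and max_gram set to 5, this should return qui, quic, and quick for the first token, and fox for the second.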

Limitations of the `max_gram` parameter

The edge_ngram filter’s max_gram value limits the character length of tokens. When the edge_ngram filter is used with an index analyzer, this means search terms longer than the max_gram length may not match any indexed terms.

For example, if the max_gram is 3, searches for apple won’t match the indexed term app.

To account for this, you can use the truncate filter with a search analyzer to shorten search terms to the max_gram character length. However, this could return irrelevant results.

For example, if the max_gram is 3 and search terms are truncated to three characters, the search term apple is shortened to app. This means searches for apple return any indexed terms matching app, such as apply, approximate, and apple.
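
The truncate approach could be configured along the following lines. This is a hedged sketch, not a recommended setup: the index name, the analyzer and filter names, and the title field are all hypothetical, and the lowercase filter is an extra assumption added so that index-time and search-time terms are normalized the same way.

```console
PUT edge_ngram_truncate_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_index": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "1_3_edgegrams" ]
        },
        "autocomplete_search": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "truncate_to_3" ]
        }
      },
      "filter": {
        "1_3_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 3
        },
        "truncate_to_3": {
          "type": "truncate",
          "length": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
```

In this sketch, the truncate filter's length matches the edge_ngram filter's max_gram, so a search term such as apple is cut to app before it is compared against the indexed edge n-grams.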

We recommend testing both approaches to see which best fits your use case and desired search experience.