Remove duplicates token filter | Elasticsearch Guide [7.7]

Original source: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-remove-duplicates-tokenfilter.html. Copyright of the original document belongs to www.elastic.co.

IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.


Remove duplicates token filter

Removes duplicate tokens in the same position.

The remove_duplicates filter uses Lucene’s RemoveDuplicatesTokenFilter.
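The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's implementation: the `Token` type and the `remove_duplicates` function are hypothetical names, and tokens are modeled as simple (term, position) pairs.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical, simplified token: just a term and its position."""
    term: str
    position: int


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Drop any token whose term was already seen at the same position."""
    result: List[Token] = []
    seen = set()
    current_position = None
    for token in tokens:
        # Reset the set of seen terms whenever the stream moves to a new position.
        if token.position != current_position:
            current_position = token.position
            seen = set()
        if token.term in seen:
            continue  # duplicate term in the same position: skip it
        seen.add(token.term)
        result.append(token)
    return result


stream = [
    Token("jumping", 0),
    Token("jump", 0),
    Token("dog", 1),
    Token("dog", 1),  # duplicate of the previous token, same position
    Token("dog", 2),  # same term, different position: kept
]
deduped = remove_duplicates(stream)
print(deduped)
```

In this sketch only the second `dog` at position 1 is dropped. The `dog` at position 2 survives, because identical terms count as duplicates only when they share a position.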

Example

To see how the remove_duplicates filter works, you first need to produce a token stream containing duplicate tokens in the same position.

The following analyze API request uses the keyword_repeat and stemmer filters to create stemmed and unstemmed tokens for jumping dog.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer"
  ],
  "text": "jumping dog"
}

The API returns the following response. Note that the dog token in position 1 is duplicated: stemming leaves dog unchanged, so its stemmed and unstemmed copies are identical.

{
  "tokens": [
    {
      "token": "jumping",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "dog",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 1
    }
  ]
}
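To make the source of the duplicate concrete, here is a minimal Python sketch of the two filters working together. It rests on loud assumptions: `toy_stem` is a hypothetical stand-in that only strips a trailing "ing" and is not the algorithm the stemmer filter uses, and the function names are invented for illustration.

```python
from typing import List, Tuple


def keyword_repeat(terms: List[str]) -> List[Tuple[str, int, bool]]:
    """Emit each term twice at the same position: a protected copy, then a stemmable copy."""
    out = []
    for position, term in enumerate(terms):
        out.append((term, position, True))   # protected copy, left as-is
        out.append((term, position, False))  # copy that the stemmer may change
    return out


def toy_stem(term: str) -> str:
    """Hypothetical toy stemmer (assumption): strips a trailing 'ing' from longer words."""
    if term.endswith("ing") and len(term) > 5:
        return term[:-3]
    return term


def stem_unprotected(tokens: List[Tuple[str, int, bool]]) -> List[Tuple[str, int]]:
    """Stem only the unprotected copies, keeping each token's position."""
    return [
        (term if protected else toy_stem(term), position)
        for term, position, protected in tokens
    ]


tokens = stem_unprotected(keyword_repeat("jumping dog".split()))
print(tokens)
# [('jumping', 0), ('jump', 0), ('dog', 1), ('dog', 1)]
```

Because the toy stemmer has nothing to strip from dog, both copies come out identical at position 1, which mirrors the duplicate in the response above.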

To remove one of the duplicate dog tokens, add the remove_duplicates filter to the previous analyze API request.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer",
    "remove_duplicates"
  ],
  "text": "jumping dog"
}

The API returns the following response. There is now only one dog token in position 1.

{
  "tokens": [
    {
      "token": "jumping",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "dog",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 1
    }
  ]
}

Add to an analyzer

The following create index API request uses the remove_duplicates filter to configure a new custom analyzer.

This custom analyzer uses the keyword_repeat and stemmer filters to create a stemmed and unstemmed version of each token in a stream. The remove_duplicates filter then removes any duplicate tokens in the same position.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "stemmer",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}
