Word delimiter graph token filter | Elasticsearch Guide [7.7]

Original URL: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-word-delimiter-graph-tokenfilter.html. Copyright of the original document belongs to www.elastic.co.

IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.


Word delimiter graph token filter

Splits tokens at non-alphanumeric characters. The word_delimiter_graph filter also performs optional token normalization based on a set of rules. By default, the filter uses the following rules:

Split tokens at non-alphanumeric characters. The filter uses these characters as delimiters. For example: Super-Duper → Super, Duper
Remove leading or trailing delimiters from each token. For example: XL---42+'Autocoder' → XL, 42, Autocoder
Split tokens at letter case transitions. For example: PowerShot → Power, Shot
Split tokens at letter-number transitions. For example: XL500 → XL, 500
Remove the English possessive ('s) from the end of each token. For example: Neil's → Neil

The word_delimiter_graph filter uses Lucene’s WordDelimiterGraphFilter.

The word_delimiter_graph filter was designed to remove punctuation from complex identifiers, such as product IDs or part numbers. For these use cases, we recommend using the word_delimiter_graph filter with the keyword tokenizer.

Avoid using the word_delimiter_graph filter to split hyphenated words, such as wi-fi. Because users often search for these words both with and without hyphens, we recommend using the synonym_graph filter instead.

Example

The following analyze API request uses the word_delimiter_graph filter to split Neil's-Super-Duper-XL500--42+AutoCoder into normalized tokens using the filter’s default rules:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter_graph" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}

The filter produces the following tokens:

[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]

Add to an analyzer

The following create index API request uses the word_delimiter_graph filter to configure a new custom analyzer.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter_graph" ]
        }
      }
    }
  }
}

Avoid using the word_delimiter_graph filter with tokenizers that remove punctuation, such as the standard tokenizer. This could prevent the word_delimiter_graph filter from splitting tokens correctly. It can also interfere with the filter’s configurable parameters, such as catenate_all or preserve_original. We recommend using the keyword or whitespace tokenizer instead.

Configurable parameters

adjust_offsets

(Optional, boolean) If true, the filter adjusts the offsets of split or catenated tokens to better reflect their actual position in the token stream. Defaults to true.

Set adjust_offsets to false if your analyzer uses filters, such as the trim filter, that change the length of tokens without changing their offsets. Otherwise, the word_delimiter_graph filter could produce tokens with illegal offsets.
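As a minimal sketch of the case described above, the following create index API request places the trim filter before the word_delimiter_graph filter and sets adjust_offsets to false. The index, analyzer, and filter names are placeholders chosen for illustration.

```console
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "trim", "my_word_delimiter_graph_filter" ]
        }
      },
      "filter": {
        "my_word_delimiter_graph_filter": {
          "type": "word_delimiter_graph",
          "adjust_offsets": false
        }
      }
    }
  }
}
```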

catenate_all

(Optional, boolean) If true, the filter produces catenated tokens for chains of alphanumeric characters separated by non-alphabetic delimiters. For example: super-duper-xl-500 → [ superduperxl500, super, duper, xl, 500 ]. Defaults to false.

Setting this parameter to true produces multi-position tokens, which are not supported by indexing.

If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.

When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
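If you do need catenated tokens at index time, one possible setup is sketched below: a custom word_delimiter_graph filter with catenate_all enabled, followed by the flatten_graph filter. The index, analyzer, and filter names are placeholders, not names defined by Elasticsearch.

```console
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_index_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_catenating_filter", "flatten_graph" ]
        }
      },
      "filter": {
        "my_catenating_filter": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      }
    }
  }
}
```

The flatten_graph filter must come after the word_delimiter_graph filter so that it receives the multi-position tokens and flattens them.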

catenate_numbers

(Optional, boolean) If true, the filter produces catenated tokens for chains of numeric characters separated by non-alphabetic delimiters. For example: 01-02-03 → [ 010203, 01, 02, 03 ]. Defaults to false.

Setting this parameter to true produces multi-position tokens, which are not supported by indexing.

If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.

When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
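One way to follow both recommendations is to catenate only at index time and keep the search analyzer free of catenated tokens. The following request is a sketch under that assumption. The part_number field and all analyzer and filter names are hypothetical.

```console
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_index_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_catenate_numbers_filter", "flatten_graph" ]
        },
        "my_search_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter_graph" ]
        }
      },
      "filter": {
        "my_catenate_numbers_filter": {
          "type": "word_delimiter_graph",
          "catenate_numbers": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "part_number": {
        "type": "text",
        "analyzer": "my_index_analyzer",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}
```

With this arrangement, a value such as 01-02-03 is indexed with the catenated token 010203 as well as its parts, while query strings are only split. Whether this trade-off suits your data depends on how users search for these identifiers.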

catenate_words

(Optional, boolean) If true, the filter produces catenated tokens for chains of alphabetical characters separated by non-alphabetic delimiters. For example: super-duper-xl → [ superduperxl, super, duper, xl ]. Defaults to false.

Setting this parameter to true produces multi-position tokens, which are not supported by indexing.

If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.

When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.

generate_number_parts

(Optional, boolean) If true, the filter includes tokens consisting of only numeric characters in the output. If false, the filter excludes these tokens from the output. Defaults to true.

generate_word_parts

(Optional, boolean) If true, the filter includes tokens consisting of only alphabetical characters in the output. If false, the filter excludes these tokens from the output. Defaults to true.
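To see the effect of these two parameters, you can define the filter inline in an analyze API request. The following sketch sets generate_number_parts to false. The input text is only an illustration. Going by the description above, the numeric-only part 500 should be left out of the output.

```console
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "generate_number_parts": false
    }
  ],
  "text": "super-duper-xl-500"
}
```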

preserve_original

(Optional, boolean) If true, the filter includes the original version of any split tokens in the output. This original version includes non-alphanumeric delimiters. For example: super-duper-xl-500 → [ super-duper-xl-500, super, duper, xl, 500 ]. Defaults to false.

Setting this parameter to true produces multi-position tokens, which are not supported by indexing.

If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.
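The following analyze API request is a small sketch for trying preserve_original on the example text used above. The filter is defined inline, so no index is required.

```console
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "preserve_original": true
    }
  ],
  "text": "super-duper-xl-500"
}
```

Based on the example above, the response should contain the original super-duper-xl-500 token in addition to super, duper, xl, and 500.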

protected_words

(Optional, array of strings) Array of tokens the filter won’t split.
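As an illustration, the following analyze API request protects the token PowerShot while leaving other tokens subject to the default rules. The input text and the protected token are arbitrary examples. The whitespace tokenizer is used so that each word reaches the filter as a separate token.

```console
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "protected_words": [ "PowerShot" ]
    }
  ],
  "text": "PowerShot XL500"
}
```

In this sketch, PowerShot should pass through unsplit, while XL500 is still split at the letter-number transition.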

protected_words_path

(Optional, string) Path to a file that contains a list of tokens the filter won’t split.

This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each token in the file must be separated by a line break.
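A filter definition that reads protected tokens from a file might look like the following sketch. The path analysis/protected_words.txt is a hypothetical location relative to the config directory, and the index and filter names are placeholders. The file would contain one token per line.

```console
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_protected_word_delimiter_filter" ]
        }
      },
      "filter": {
        "my_protected_word_delimiter_filter": {
          "type": "word_delimiter_graph",
          "protected_words_path": "analysis/protected_words.txt"
        }
      }
    }
  }
}
```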

split_on_case_change

(Optional, boolean) If true, the filter splits tokens at letter case transitions. For example: camelCase → [ camel, Case ]. Defaults to true.

split_on_numerics

(Optional, boolean) If true, the filter splits tokens at letter-number transitions. For example: j2se → [ j, 2, se ]. Defaults to true.
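To check how the two split parameters interact, you can turn both off in an inline filter definition. This is a sketch only, and the input text is an arbitrary example. With both parameters set to false, PowerShot2000 contains no remaining split points and should come back as a single token.

```console
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "split_on_case_change": false,
      "split_on_numerics": false
    }
  ],
  "text": "PowerShot2000"
}
```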

stem_english_possessive

(Optional, boolean) If true, the filter removes the English possessive ('s) from the end of each token. For example: O'Neil's → [ O, Neil ]. Defaults to true.

type_table

(Optional, array of strings) Array of custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.

For example, the following array maps the plus (+) and hyphen (-) characters as alphanumeric, which means they won’t be treated as delimiters:

[ "+ => ALPHA", "- => ALPHA" ]

Supported types include:

ALPHA (Alphabetical)
ALPHANUM (Alphanumeric)
DIGIT (Numeric)
LOWER (Lowercase alphabetical)
SUBWORD_DELIM (Non-alphanumeric delimiter)
UPPER (Uppercase alphabetical)
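The mapping array shown above can be tried directly in an analyze API request. The following sketch uses an arbitrary input string. Because the plus and hyphen characters are mapped to ALPHA, the filter should no longer treat them as delimiters.

```console
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "type_table": [ "+ => ALPHA", "- => ALPHA" ]
    }
  ],
  "text": "super-duper+xl"
}
```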

type_table_path

(Optional, string) Path to a file that contains custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.

For example, the contents of this file may contain the following:

# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\u002C => DIGIT

# in some cases you might not want to split on ZWJ
# this also tests the case where we need a bigger byte[]
# see http://en.wikipedia.org/wiki/Zero-width_joiner
\u200D => ALPHANUM

Supported types include:

ALPHA (Alphabetical)
ALPHANUM (Alphanumeric)
DIGIT (Numeric)
LOWER (Lowercase alphabetical)
SUBWORD_DELIM (Non-alphanumeric delimiter)
UPPER (Uppercase alphabetical)

This file path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each mapping in the file must be separated by a line break.
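A filter that loads its mappings from such a file could be configured as in the following sketch. The path analysis/type_table.txt is a hypothetical location relative to the config directory, and the index, analyzer, and filter names are placeholders.

```console
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_type_table_filter" ]
        }
      },
      "filter": {
        "my_type_table_filter": {
          "type": "word_delimiter_graph",
          "type_table_path": "analysis/type_table.txt"
        }
      }
    }
  }
}
```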

Customize

To customize the word_delimiter_graph filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a word_delimiter_graph filter that uses the following rules:

Split tokens at non-alphanumeric characters, except the hyphen (-) character.
Remove leading or trailing delimiters from each token.
Do not split tokens at letter case transitions.
Do not split tokens at letter-number transitions.
Remove the English possessive ('s) from the end of each token.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_custom_word_delimiter_graph_filter" ]
        }
      },
      "filter": {
        "my_custom_word_delimiter_graph_filter": {
          "type": "word_delimiter_graph",
          "type_table": [ "- => ALPHA" ],
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
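Once the index exists, you can check the customized analyzer with the analyze API. The following request is a sketch that reuses the sample text from the earlier example. Compare its response with the default output to confirm that the hyphen is no longer treated as a delimiter.

```console
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
```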

Differences between `word_delimiter_graph` and `word_delimiter`

Both the word_delimiter_graph and word_delimiter filters produce tokens that span multiple positions when any of the following parameters are true:

catenate_all
catenate_numbers
catenate_words
preserve_original

However, only the word_delimiter_graph filter assigns multi-position tokens a positionLength attribute, which indicates the number of positions a token spans. This ensures the word_delimiter_graph filter always produces valid token graphs.

The word_delimiter filter does not assign multi-position tokens a positionLength attribute. This means it produces invalid graphs for streams including these tokens.

While indexing does not support token graphs containing multi-position tokens, queries, such as the match_phrase query, can use these graphs to generate multiple sub-queries from a single query string.

To see how token graphs produced by the word_delimiter and word_delimiter_graph filters differ, check out the following example.

Example

Basic token graph

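Both the word_delimiter and word_delimiter_graph filters produce the same token graph for PowerShot2000 when the following parameters are false:

catenate_all
catenate_numbers
catenate_words
preserve_original

The graph contains the tokens Power, Shot, and 2000 at consecutive positions.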

This graph does not contain multi-position tokens. All tokens span only one position.

word_delimiter_graph graph with a multi-position token

When catenate_words is true, the word_delimiter_graph filter produces a token graph for PowerShot2000 that contains the catenated token PowerShot alongside Power, Shot, and 2000.

This graph correctly indicates the catenated PowerShot token spans two positions.
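To inspect these attributes yourself, you can ask the analyze API for detailed token output. The following request is a sketch that uses the explain and attributes parameters of the analyze API to limit the extra output to positionLength. The filter is defined inline.

```console
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "catenate_words": true
    }
  ],
  "text": "PowerShot2000",
  "explain": true,
  "attributes": [ "positionLength" ]
}
```

Changing the filter type to word_delimiter in the same request lets you compare the two filters side by side.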

word_delimiter graph with a multi-position token

When catenate_words is true, the word_delimiter filter also produces a token graph for PowerShot2000 that contains the catenated PowerShot token.

Note that the catenated PowerShot token should span two positions but only spans one in the token graph, making it invalid.