Delimited payload token filter | Elasticsearch Guide [7.7]

Original source: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-delimited-payload-tokenfilter.html. Copyright of the original documentation belongs to www.elastic.co.

IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Delimited payload token filter

The older name delimited_payload_filter is deprecated and should not be used with new indices. Use delimited_payload instead.

Separates a token stream into tokens and payloads based on a specified delimiter.

For example, you can use the delimited_payload filter with a | delimiter to split the|1 quick|2 fox|3 into the tokens the, quick, and fox with respective payloads of 1, 2, and 3.

This filter uses Lucene’s DelimitedPayloadTokenFilter.
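The split described above can be illustrated with a minimal Python sketch. It is not Lucene's implementation: the helper name `split_delimited` is hypothetical, and the sketch assumes whitespace tokenization and that the token ends at the first occurrence of the delimiter.

```python
def split_delimited(text, delimiter="|"):
    """Split whitespace-separated 'token<delimiter>payload' items into pairs.

    Illustration only: assumes the token ends at the first delimiter and
    that items without a delimiter carry no payload (None).
    """
    pairs = []
    for item in text.split():
        token, sep, payload = item.partition(delimiter)
        pairs.append((token, payload if sep else None))
    return pairs


print(split_delimited("the|1 quick|2 fox|3"))
# [('the', '1'), ('quick', '2'), ('fox', '3')]
```

The payloads are kept as raw strings here; the filter itself converts them to bytes according to its encoding parameter.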

Payloads

A payload is user-defined binary data associated with a token position and stored as base64-encoded bytes.

Elasticsearch does not store token payloads by default. To store payloads, you must:

Set the term_vector mapping parameter to with_positions_payloads or with_positions_offsets_payloads for any field storing payloads.
Use an index analyzer that includes the delimited_payload filter.

You can view stored payloads using the term vectors API.

Example

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["delimited_payload"],
  "text": "the|0 brown|10 fox|5 is|0 quick|10"
}

The filter produces the following tokens:

[ the, brown, fox, is, quick ]

Note that the analyze API does not return stored payloads. For an example that includes returned payloads, see Return stored payloads.

Add to an analyzer

The following create index API request uses the delimited_payload filter to configure a new custom analyzer.

PUT delimited_payload
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_delimited_payload": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}

Configurable parameters

delimiter

(Optional, string) Character used to separate tokens from payloads. Defaults to |.

encoding

(Optional, string) Datatype for the stored payload. Valid values are:

float: (Default) Float
identity: Characters
int: Integer
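To make the three encodings more concrete, the following Python sketch shows how a raw payload string might end up as the base64 text that the term vectors API displays. The function name `encode_payload` is hypothetical. The 4-byte big-endian layout for float is consistent with the sample response in Return stored payloads (a payload of 3 appears as QEAAAA==); the same layout for int and UTF-8 bytes for identity are assumptions of this sketch.

```python
import base64
import struct


def encode_payload(raw, encoding="float"):
    """Return the base64 text for a raw payload string (illustrative sketch)."""
    if encoding == "float":
        data = struct.pack(">f", float(raw))  # 4-byte big-endian IEEE 754 float
    elif encoding == "int":
        data = struct.pack(">i", int(raw))    # assumed: 4-byte big-endian integer
    elif encoding == "identity":
        data = raw.encode("utf-8")            # assumed: characters as UTF-8 bytes
    else:
        raise ValueError(f"unknown encoding: {encoding!r}")
    return base64.b64encode(data).decode("ascii")


print(encode_payload("3"))              # QEAAAA==
print(encode_payload("3", "int"))       # AAAAAw==
print(encode_payload("3", "identity"))  # Mw==
```

The same input therefore produces different stored bytes depending on the encoding, so the encoding used at index time must be known when reading payloads back.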

Customize and add to an analyzer

To customize the delimited_payload filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom delimited_payload filter to configure a new custom analyzer. The custom delimited_payload filter uses the + delimiter to separate tokens from payloads. Payloads are encoded as integers.

PUT delimited_payload_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_plus_delimited": {
          "tokenizer": "whitespace",
          "filter": [ "plus_delimited" ]
        }
      },
      "filter": {
        "plus_delimited": {
          "type": "delimited_payload",
          "delimiter": "+",
          "encoding": "int"
        }
      }
    }
  }
}
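With a custom delimiter, the text sent to the field has to use that delimiter too. As a rough sketch, a client could build such text from (term, weight) pairs before indexing. The helper `build_payload_text` is hypothetical and assumes the `+` delimiter and integer payloads from the example above, with terms separated by single spaces for the whitespace tokenizer.

```python
def build_payload_text(weighted_terms, delimiter="+"):
    """Join (term, weight) pairs into 'term+weight' items separated by spaces.

    Sketch only: rejects terms that are empty or that contain whitespace or
    the delimiter, because either would change how the text is split.
    """
    parts = []
    for term, weight in weighted_terms:
        if not term or delimiter in term or any(ch.isspace() for ch in term):
            raise ValueError(
                f"term {term!r} is empty or contains whitespace or the delimiter"
            )
        parts.append(f"{term}{delimiter}{int(weight)}")
    return " ".join(parts)


print(build_payload_text([("the", 0), ("brown", 3), ("fox", 4)]))
# the+0 brown+3 fox+4
```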

Return stored payloads

Use the create index API to create an index that:

Includes a field that stores term vectors with payloads.
Uses a custom index analyzer with the delimited_payload filter.

PUT text_payloads
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_payloads",
        "analyzer": "payload_delimiter"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "payload_delimiter": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}

Add a document containing payloads to the index.

POST text_payloads/_doc/1
{
  "text": "the|0 brown|3 fox|4 is|0 quick|10"
}

Use the term vectors API to return the document’s tokens and base64-encoded payloads.

GET text_payloads/_termvectors/1
{
  "fields": [ "text" ],
  "payloads": true
}

The API returns the following response:

{
  "_index": "text_payloads",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 8,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 5,
        "doc_count": 1,
        "sum_ttf": 5
      },
      "terms": {
        "brown": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "payload": "QEAAAA=="
            }
          ]
        },
        "fox": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "payload": "QIAAAA=="
            }
          ]
        },
        "is": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "payload": "AAAAAA=="
            }
          ]
        },
        "quick": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "payload": "QSAAAA=="
            }
          ]
        },
        "the": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "payload": "AAAAAA=="
            }
          ]
        }
      }
    }
  }
}
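The payloads in this response are base64 text. A short Python sketch can turn them back into numbers, assuming the default float encoding and a 4-byte big-endian layout, which is consistent with the values indexed above (brown|3 appears as QEAAAA==). The function name `decode_float_payloads` is hypothetical, and the sample below is a trimmed copy of the response.

```python
import base64
import struct


def decode_float_payloads(response, field="text"):
    """Map each term to a list of (position, decoded float payload) tuples."""
    terms = response["term_vectors"][field]["terms"]
    decoded = {}
    for term, info in terms.items():
        decoded[term] = [
            (tok["position"], struct.unpack(">f", base64.b64decode(tok["payload"]))[0])
            for tok in info["tokens"]
            if "payload" in tok  # skip tokens stored without a payload
        ]
    return decoded


# Trimmed copy of the term vectors response shown above.
sample = {
    "term_vectors": {
        "text": {
            "terms": {
                "brown": {"term_freq": 1, "tokens": [{"position": 1, "payload": "QEAAAA=="}]},
                "fox": {"term_freq": 1, "tokens": [{"position": 2, "payload": "QIAAAA=="}]},
                "quick": {"term_freq": 1, "tokens": [{"position": 4, "payload": "QSAAAA=="}]},
            }
        }
    }
}

print(decode_float_payloads(sample))
# {'brown': [(1, 3.0)], 'fox': [(2, 4.0)], 'quick': [(4, 10.0)]}
```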
