Dealing with Null Values | Elasticsearch: The Definitive Guide [2.x]

原文地址: https://www.elastic.co/guide/en/elasticsearch/guide/current/_dealing_with_null_values.html, 版权归 www.elastic.co 所有

WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

» » »

« Ranges All About Caching »

Dealing with Null Valuesedit

Think back to our earlier example, where documents have a field named tags. This is a multivalue field. A document may have one tag, many tags, or potentially no tags at all. If a field has no values, how is it stored in an inverted index?

That’s a trick question, because the answer is: it isn’t stored at all. Let’s look at that inverted index from the previous section:

Token	DocIDs
`open_source`	`2`
`search`	`1`,`2`

How would you store a field that doesn’t exist in that data structure? You can’t! An inverted index is simply a list of tokens and the documents that contain them. If a field doesn’t exist, it doesn’t hold any tokens, which means it won’t be represented in an inverted index data structure.

Ultimately, this means that a null, [] (an empty array), and [null] are all equivalent. They simply don’t exist in the inverted index!

Obviously, the world is not simple, and data is often missing fields, or contains explicit nulls or empty arrays. To deal with these situations, Elasticsearch has a few tools to work with null or missing values.

exists Queryedit

The first tool in your arsenal is the exists query. This query will return documents that have any value in the specified field. Let’s use the tagging example and index some example documents:

POST /my_index/posts/_bulk
{ "index": { "_id": "1"              }}
{ "tags" : ["search"]                }  
{ "index": { "_id": "2"              }}
{ "tags" : ["search", "open_source"] }  
{ "index": { "_id": "3"              }}
{ "other_field" : "some data"        }  
{ "index": { "_id": "4"              }}
{ "tags" : null                      }  
{ "index": { "_id": "5"              }}
{ "tags" : ["search", null]          }

	The `tags` field has one value.
	The `tags` field has two values.
	The `tags` field is missing altogether.
	The `tags` field is set to `null`.
	The `tags` field has one value and a `null`.

The resulting inverted index for our tags field will look like this:

Token	DocIDs
`open_source`	`2`
`search`	`1`,`2`,`5`

Our objective is to find all documents where a tag is set. We don’t care what the tag is, so long as it exists within the document. In SQL parlance, we would use an IS NOT NULL query:

SELECT tags
FROM   posts
WHERE  tags IS NOT NULL

In Elasticsearch, we use the exists query:

GET /my_index/posts/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "exists" : { "field" : "tags" }
            }
        }
    }
}

Our query returns three documents:

"hits" : [
    {
      "_id" :     "1",
      "_score" :  1.0,
      "_source" : { "tags" : ["search"] }
    },
    {
      "_id" :     "5",
      "_score" :  1.0,
      "_source" : { "tags" : ["search", null] } 
    },
    {
      "_id" :     "2",
      "_score" :  1.0,
      "_source" : { "tags" : ["search", "open source"] }
    }
]

Document 5 is returned even though it contains a null value. The field exists because a real-value tag was indexed, so the null had no impact on the filter.

The results are easy to understand. Any document that has terms in the tags field was returned as a hit. The only two documents that were excluded were documents 3 and 4.

missing Queryedit

The missing query is essentially the inverse of exists: it returns documents where there is no value for a particular field, much like this SQL:

SELECT tags
FROM   posts
WHERE  tags IS NULL

Let’s swap the exists query for a missing query from our previous example:

GET /my_index/posts/_search
{
    "query" : {
        "constant_score" : {
            "filter": {
                "missing" : { "field" : "tags" }
            }
        }
    }
}

And, as you would expect, we get back the two docs that have no real values in the tags field—documents 3 and 4:

"hits" : [
    {
      "_id" :     "3",
      "_score" :  1.0,
      "_source" : { "other_field" : "some data" }
    },
    {
      "_id" :     "4",
      "_score" :  1.0,
      "_source" : { "tags" : null }
    }
]

When null Means null

Sometimes you need to be able to distinguish between a field that doesn’t have a value, and a field that has been explicitly set to null. With the default behavior that we saw previously, this is impossible; the data is lost. Luckily, there is an option that we can set that replaces explicit null values with a placeholder value of our choosing.

When specifying the mapping for a string, numeric, Boolean, or date field, you can also set a null_value that will be used whenever an explicit null value is encountered. A field without a value will still be excluded from the inverted index.

When choosing a suitable null_value, ensure the following:

It matches the field’s type. You can’t use a string null_value in a field of type date.
It is different from the normal values that the field may contain, to avoid confusing real values with null values.

exists/missing on Objectsedit

The exists and missing queries also work on inner objects, not just core types. With the following document

{
   "name" : {
      "first" : "John",
      "last" :  "Smith"
   }
}

you can check for the existence of name.first and name.last but also just name. However, in Types and Mappings, we said that an object like the preceding one is flattened internally into a simple field-value structure, much like this:

{
   "name.first" : "John",
   "name.last"  : "Smith"
}

So how can we use an exists or missing query on the name field, which doesn’t really exist in the inverted index?

The reason that it works is that a filter like

{
    "exists" : { "field" : "name" }
}

is really executed as

{
    "bool": {
        "should": [
            { "exists": { "field": "name.first" }},
            { "exists": { "field": "name.last" }}
        ]
    }
}

That also means that if first and last were both empty, the name namespace would not exist.

« Ranges All About Caching »