Finding Distinct Countsedit

The first approximate aggregation provided by Elasticsearch is the cardinality metric. This provides the cardinality of a field, also called a distinct or unique count. You may be familiar with the SQL version:

SELECT COUNT(DISTINCT color)
FROM cars

Distinct counts are a common operation, and answer many fundamental business questions:

  • How many unique visitors have come to my website?
  • How many unique cars have we sold?
  • How many distinct users purchased a product each month?

We can use the cardinality metric to determine the number of car colors being sold at our dealership:

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color"
            }
        }
    }
}

This returns a minimal response showing that we have sold three different-colored cars:

...
"aggregations": {
  "distinct_colors": {
     "value": 3
  }
}
...

We can make our example more useful: how many colors were sold each month? For that metric, we just nest the cardinality metric under a date_histogram:

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
      "months" : {
        "date_histogram": {
          "field": "sold",
          "interval": "month"
        },
        "aggs": {
          "distinct_colors" : {
              "cardinality" : {
                "field" : "color"
              }
          }
        }
      }
  }
}

Understanding the Trade-offsedit

As mentioned at the top of this chapter, the cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm. HLL works by hashing your input and using the bits from the hash to make probabilistic estimations on the cardinality.

You don’t need to understand the technical details (although if you’re interested, the paper is a great read!), but you should be aware of the properties of the algorithm:

  • Configurable precision, which controls memory usage (more precise == more memory).
  • Excellent accuracy on low-cardinality sets.
  • Fixed memory usage. Whether there are thousands or billions of unique values, memory usage depends on only the configured precision.

To configure the precision, you must specify the precision_threshold parameter. This threshold defines the point under which cardinalities are expected to be very close to accurate. Consider this example:

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color",
              "precision_threshold" : 100 
            }
        }
    }
}

precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000.

This example will ensure that fields with 100 or fewer distinct values will be extremely accurate. Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric.

For a given threshold, the HLL data-structure will use about precision_threshold * 8 bytes of memory. So you must balance how much memory you are willing to sacrifice for additional accuracy.

Practically speaking, a threshold of 100 maintains an error under 5% even when counting millions of unique values.

Optimizing for Speededit

If you want a distinct count, you usually want to query your entire dataset (or nearly all of it). Any operation on all your data needs to execute quickly, for obvious reasons. HyperLogLog is very fast already—​it simply hashes your data and does some bit-twiddling.

But if speed is important to you, we can optimize it a little bit further. Since HLL simply needs the hash of the field, we can precompute that hash at index time. When the query executes, we can skip the hash computation and load the value directly out of fielddata.

Precomputing hashes is useful only on very large and/or high-cardinality fields. Calculating the hash on these fields is non-negligible at query time.

However, numeric fields hash very quickly, and storing the original numeric often requires the same (or less) memory. This is also true on low-cardinality string fields; there are internal optimizations that guarantee that hashes are calculated only once per unique value.

Basically, precomputing hashes is not guaranteed to make all fields faster — only those that have high cardinality and/or large strings. And remember, precomputing simply shifts the cost to index time. You still pay the price; you just choose when to pay it.

To do this, we need to add a new multifield to our data. We’ll delete our index, add a new mapping that includes the hashed field, and then reindex:

DELETE /cars/

PUT /cars/
{
  "mappings": {
    "transactions": {
      "properties": {
        "color": {
          "type": "string",
          "fields": {
            "hash": {
              "type": "murmur3" 
            }
          }
        }
      }
    }
  }
}

POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

This multifield is of type murmur3, which is a hashing function.

Now when we run an aggregation, we use the color.hash field instead of the color field:

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color.hash" 
            }
        }
    }
}

Notice that we specify the hashed multifield, rather than the original.

Now the cardinality metric will load the values (the precomputed hashes) from "color.hash" and use those in place of dynamically hashing the original value.

The savings per document is small, but if hashing each field adds 10 nanoseconds and your aggregation touches 100 million documents, that adds 1 second per query. If you find yourself using cardinality across many documents, perform some profiling to see if precomputing hashes makes sense for your deployment.