Boxplot Aggregation | Elasticsearch Guide [7.7]

Original source: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/search-aggregations-metrics-boxplot-aggregation.html. Copyright of the original document belongs to www.elastic.co.

IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.



Boxplot Aggregation

A boxplot metrics aggregation computes a box plot of numeric values extracted from the aggregated documents. These values can be generated by a provided script or extracted from specific numeric or histogram fields in the documents.

The boxplot aggregation returns the essential information for drawing a box plot: the minimum, maximum, median, first quartile (25th percentile) and third quartile (75th percentile) values.
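To make the five returned values concrete, the following Python sketch computes an exact five-number summary for a small in-memory list. It is only an illustration: `five_number_summary` is a hypothetical helper, not part of Elasticsearch, the sample data is made up, and the interpolation used by Python's `statistics.quantiles` may differ slightly from the approximate values the aggregation returns.

```python
import statistics


def five_number_summary(values):
    """Return min, max, q1, q2 (median) and q3 for a list of two or more numbers.

    Hypothetical helper for illustration; it is exact, unlike the aggregation.
    """
    data = sorted(values)
    # n=4 yields the three quartile cut points. The "inclusive" method treats
    # the data as the whole population, so min and max are the extremes.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "max": data[-1], "q1": q1, "q2": q2, "q3": q3}


# Sample load times in milliseconds (made-up data).
load_times = [120, 80, 450, 300, 990, 0, 610]
summary = five_number_summary(load_times)
```

The keys mirror the names used in the aggregation response (`q2` is the median), so the result can be compared side by side with what Elasticsearch returns for the same values.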

Syntax

A boxplot aggregation looks like this in isolation:

{
    "boxplot": {
        "field": "load_time"
    }
}

Let’s look at a boxplot representing load time:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_boxplot" : {
            "boxplot" : {
                "field" : "load_time" 
            }
        }
    }
}

The load_time field must be a numeric field.

The response will look like this:

{
    ...

   "aggregations": {
      "load_time_boxplot": {
         "min": 0.0,
         "max": 990.0,
         "q1": 165.0,
         "q2": 445.0,
         "q3": 725.0
      }
   }
}
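The response contains only the quartiles and extremes, so anything else needed for plotting has to be derived on the client. As a minimal sketch, the Python below computes the interquartile range and whisker fences from the example response. The 1.5 * IQR rule is a common plotting convention that is assumed here, and `whisker_fences` is a hypothetical helper, not part of any Elasticsearch client.

```python
def whisker_fences(boxplot, k=1.5):
    """Compute the interquartile range and Tukey-style fences.

    `boxplot` is the dict found under the aggregation name in the response.
    The k * IQR rule is an assumed convention; the aggregation shown above
    does not return these values itself.
    """
    iqr = boxplot["q3"] - boxplot["q1"]
    return {
        "iqr": iqr,
        "lower_fence": boxplot["q1"] - k * iqr,
        "upper_fence": boxplot["q3"] + k * iqr,
    }


# Values copied from the example response.
response = {
    "aggregations": {
        "load_time_boxplot": {
            "min": 0.0, "max": 990.0, "q1": 165.0, "q2": 445.0, "q3": 725.0
        }
    }
}
fences = whisker_fences(response["aggregations"]["load_time_boxplot"])
```

With the example values, both fences lie outside the min and max, so under this convention the whiskers would simply extend to the minimum and maximum.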

Script

The boxplot metric supports scripting. For example, if our load times are in milliseconds but we want values calculated in seconds, we could use a script to convert them on-the-fly:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_boxplot" : {
            "boxplot" : {
                "script" : {
                    "lang": "painless",
                    "source": "doc['load_time'].value / params.timeUnit", 
                    "params" : {
                        "timeUnit" : 1000   
                    }
                }
            }
        }
    }
}

The `field` parameter is replaced with a `script` parameter; the script generates the values on which the quartiles are calculated.
Scripting supports parameterized input, just like any other script.

This interprets the script parameter as an inline script written in the painless script language, with timeUnit passed in as a script parameter. To use a stored script, use the following syntax:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_boxplot" : {
            "boxplot" : {
                "script" : {
                    "id": "my_script",
                    "params": {
                        "field": "load_time"
                    }
                }
            }
        }
    }
}
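The inline and stored forms differ only in the contents of the `script` object. The Python sketch below assembles both variants as plain dictionaries; `boxplot_script_agg` is a hypothetical helper introduced for illustration and makes no calls to Elasticsearch.

```python
import json


def boxplot_script_agg(name, source=None, script_id=None, params=None):
    """Build the "aggs" fragment for a scripted boxplot aggregation.

    Exactly one of `source` (inline Painless) or `script_id` (stored script)
    must be given. Hypothetical helper; it only assembles a dict.
    """
    if (source is None) == (script_id is None):
        raise ValueError("pass exactly one of source or script_id")
    if source is not None:
        script = {"lang": "painless", "source": source}
    else:
        script = {"id": script_id}
    if params:
        script["params"] = params
    return {name: {"boxplot": {"script": script}}}


# Inline script, as in the first request of this section.
inline = boxplot_script_agg(
    "load_time_boxplot",
    source="doc['load_time'].value / params.timeUnit",
    params={"timeUnit": 1000},
)
# Stored script, as in the second request.
stored = boxplot_script_agg(
    "load_time_boxplot", script_id="my_script", params={"field": "load_time"}
)
# Serialize a full request body that could be sent to latency/_search.
body = json.dumps({"size": 0, "aggs": stored}, indent=4)
```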

Boxplot values are (usually) approximate

The algorithm used by the boxplot metric is called TDigest (introduced by Ted Dunning in Computing Accurate Quantiles using T-Digests).

Like other percentile aggregations, boxplot is also non-deterministic. This means you can get slightly different results using the same data.

Compression

Approximate algorithms must balance memory utilization with estimation accuracy. This balance can be controlled using a compression parameter:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_boxplot" : {
            "boxplot" : {
                "field" : "load_time",
                "compression" : 200 
            }
        }
    }
}

Compression controls memory usage and approximation error.

The TDigest algorithm uses a number of "nodes" to approximate percentiles: the more nodes available, the higher the accuracy (and the larger the memory footprint), proportional to the volume of data. The compression parameter limits the maximum number of nodes to 20 * compression.

Therefore, by increasing the compression value, you can increase the accuracy of your percentiles at the cost of more memory. Larger compression values also make the algorithm slower since the underlying tree data structure grows in size, resulting in more expensive operations. The default compression value is 100.

A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount of data which arrives sorted and in-order) the default settings will produce a TDigest roughly 64KB in size. In practice data tends to be more random and the TDigest will use less memory.
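The 64KB figure follows directly from the two numbers given above: at most 20 * compression nodes, at roughly 32 bytes each. The short Python sketch below reproduces that arithmetic; `tdigest_worst_case` is a hypothetical helper, and the 32-byte node size is the rough estimate from the text, not a measured value.

```python
def tdigest_worst_case(compression=100, bytes_per_node=32):
    """Estimate the worst-case node count and size of a TDigest.

    Uses the figures from the text: at most 20 * compression nodes,
    roughly 32 bytes per node. Returns (max_nodes, approx_bytes).
    """
    max_nodes = 20 * compression
    return max_nodes, max_nodes * bytes_per_node


default_nodes, default_bytes = tdigest_worst_case()   # compression = 100
doubled_nodes, doubled_bytes = tdigest_worst_case(200)  # compression = 200
```

Under these assumptions the default compression of 100 gives 2,000 nodes and about 64,000 bytes, and raising compression to 200 doubles both figures.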

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they are ignored, but it is also possible to treat them as if they had a value.

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "grade_boxplot" : {
            "boxplot" : {
                "field" : "grade",
                "missing": 10 
            }
        }
    }
}

Documents without a value in the grade field will be treated as if they had the value 10.
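The effect of the missing parameter can be pictured as a substitution applied before the quartiles are computed. The Python sketch below mimics that behavior on a handful of made-up documents; `values_with_missing` is a hypothetical helper that models the described semantics and is not how Elasticsearch implements it.

```python
def values_with_missing(docs, field, missing=None):
    """Collect the values of `field`, mimicking the `missing` parameter.

    Documents lacking the field are skipped by default; if `missing` is
    given, they contribute that value instead.
    """
    values = []
    for doc in docs:
        if doc.get(field) is not None:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)
    return values


# Made-up documents; the second one has no grade field.
docs = [{"grade": 7}, {"name": "no grade here"}, {"grade": 9}]
ignored = values_with_missing(docs, "grade")
substituted = values_with_missing(docs, "grade", missing=10)
```

Because the substituted value takes part in the calculation like any other, a missing value far from the real data can shift the quartiles noticeably.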
