WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Doc Values
Aggregations work via a data structure known as doc values (briefly introduced in Doc Values Intro). Doc values are what make aggregations fast, efficient and memory-friendly, so it is useful to understand how they work.
Doc values exist because inverted indices are efficient for only certain operations. The inverted index excels at finding documents that contain a term. It does not perform well in the opposite direction: determining which terms exist in a single document. Aggregations need this second access pattern.
Consider the following inverted index:
Term      Doc_1   Doc_2   Doc_3
------------------------------------
brown   |   X   |   X   |
dog     |   X   |       |   X
dogs    |       |   X   |   X
fox     |   X   |       |   X
foxes   |       |   X   |
in      |       |   X   |
jumped  |   X   |       |   X
lazy    |   X   |   X   |
leap    |       |   X   |
over    |   X   |   X   |   X
quick   |   X   |   X   |   X
summer  |       |   X   |
the     |   X   |       |   X
------------------------------------
If we want to compile a complete list of terms in any document that mentions brown, we might build a query like so:
GET /my_index/_search
{
  "query" : {
    "match" : {
      "body" : "brown"
    }
  },
  "aggs" : {
    "popular_terms": {
      "terms" : {
        "field" : "body"
      }
    }
  }
}
The query portion is easy and efficient. The inverted index is sorted by terms, so first we find brown in the terms list, and then scan across all the columns to see which documents contain brown. We can very quickly see that Doc_1 and Doc_2 contain the token brown.
Then, for the aggregation portion, we need to find all the unique terms in Doc_1 and Doc_2. Trying to do this with the inverted index would be a very expensive process: we would have to iterate over every term in the index and collect tokens from the Doc_1 and Doc_2 columns. This would be slow and scale poorly: as the number of terms and documents grows, so would the execution time.
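To make the cost concrete, here is a minimal Python sketch that models the inverted index above as a dictionary from each term to the set of documents containing it. The dictionary layout and the function names docs_containing and terms_in_docs are assumptions made for illustration; this is not how Lucene actually stores its postings.

```python
# Toy model of the inverted index above: term -> set of documents containing it.
# Illustration only; the real on-disk structure is different.
INVERTED_INDEX = {
    "brown":  {"Doc_1", "Doc_2"},
    "dog":    {"Doc_1", "Doc_3"},
    "dogs":   {"Doc_2", "Doc_3"},
    "fox":    {"Doc_1", "Doc_3"},
    "foxes":  {"Doc_2"},
    "in":     {"Doc_2"},
    "jumped": {"Doc_1", "Doc_3"},
    "lazy":   {"Doc_1", "Doc_2"},
    "leap":   {"Doc_2"},
    "over":   {"Doc_1", "Doc_2", "Doc_3"},
    "quick":  {"Doc_1", "Doc_2", "Doc_3"},
    "summer": {"Doc_2"},
    "the":    {"Doc_1", "Doc_3"},
}


def docs_containing(term):
    """Query side: a single lookup by term (hypothetical helper)."""
    return INVERTED_INDEX.get(term, set())


def terms_in_docs(doc_ids):
    """Aggregation side, done the hard way: visit every term in the index."""
    found = set()
    for term, postings in INVERTED_INDEX.items():
        # Keep the term if any of the requested documents contains it.
        if postings & doc_ids:
            found.add(term)
    return found


matching = docs_containing("brown")
all_terms = terms_in_docs(matching)
print(sorted(matching))
print(sorted(all_terms))
```

The lookup touches one entry, while terms_in_docs walks the whole dictionary no matter how few documents matched. That full walk is the part that grows with the size of the index.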
Doc values address this problem by inverting the relationship. While the inverted index maps terms to the documents containing the term, doc values map documents to the terms contained by the document:
Doc      Terms
-----------------------------------------------------------------
Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
Doc_3 | dog, dogs, fox, jumped, over, quick, the
-----------------------------------------------------------------
Once the data has been uninverted, it is trivial to collect the unique tokens from Doc_1 and Doc_2. Go to the rows for each document, collect all the terms, and take the union of the two sets.
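The same steps can be sketched in a few lines of Python, assuming doc values are modeled as a dictionary from document ID to the set of terms in that document. This is a deliberate simplification of the real structure, and the helper names unique_terms and term_doc_counts are hypothetical.

```python
from collections import Counter

# Toy model of the doc values table above: document -> set of terms it contains.
DOC_VALUES = {
    "Doc_1": {"brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"},
    "Doc_2": {"brown", "dogs", "foxes", "in", "lazy", "leap", "over", "quick", "summer"},
    "Doc_3": {"dog", "dogs", "fox", "jumped", "over", "quick", "the"},
}


def unique_terms(doc_ids):
    """Union the term sets of the given documents; only those rows are read."""
    result = set()
    for doc_id in doc_ids:
        result |= DOC_VALUES[doc_id]
    return result


def term_doc_counts(doc_ids):
    """Count how many of the given documents contain each term."""
    counts = Counter()
    for doc_id in doc_ids:
        counts.update(DOC_VALUES[doc_id])
    return counts


terms = unique_terms(["Doc_1", "Doc_2"])
counts = term_doc_counts(["Doc_1", "Doc_2"])
print(sorted(terms))
print(counts.most_common(4))
```

Here the work is proportional to the number of matching documents rather than to the number of terms in the index. The per-term counts are roughly the kind of result a terms aggregation such as popular_terms would report for the matching documents.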
Thus, search and aggregations are closely intertwined. Search finds documents by using the inverted index. Aggregations collect and aggregate values from doc values.
Doc values are not just used for aggregations. They are required for any operation that must look up the value contained in a specific document. Besides aggregations, this includes sorting, scripts that access field values, and parent-child relationships (see Parent-Child Relationship).