WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Scrolledit
A scroll
query is used to retrieve
large numbers of documents from Elasticsearch efficiently, without paying the
penalty of deep pagination.
Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left. It’s a bit like a cursor in a traditional database.
A scrolled search takes a snapshot in time. It doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old data files around, so that it can preserve its “view” on what the index looked like at the time it started.
The costly part of deep pagination is the global sorting of results, but if we
disable sorting, then we can return all documents quite cheaply. To do this, we
sort by _doc
. This instructs Elasticsearch just return the next batch of
results from every shard that still has results to return.
To scroll through results, we execute a search request and set the scroll
value to
the length of time we want to keep the scroll window open. The scroll expiry
time is refreshed every time we run a scroll request, so it only needs to be long enough
to process the current batch of results, not all of the documents that match
the query. The timeout is important because keeping the scroll window open
consumes resources and we want to free them as soon as they are no longer needed.
Setting the timeout enables Elasticsearch to automatically free the resources
after a small period of inactivity.
The response to this request includes a
_scroll_id
, which is a long Base-64 encoded string. Now we can pass the
_scroll_id
to the _search/scroll
endpoint to retrieve the next batch of
results:
GET /_search/scroll { "scroll": "1m", "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs=" }
The response to this scroll request includes the next batch of results.
Although we specified a size
of 1,000, we get back many more
documents. When scanning, the size
is applied to each shard, so you will
get back a maximum of size * number_of_primary_shards
documents in each
batch.
The scroll request also returns a new _scroll_id
. Every time
we make the next scroll request, we must pass the _scroll_id
returned by the
previous scroll request.
When no more hits are returned, we have processed all matching documents.
Some of the official Elasticsearch clients such as Python client and Perl client provide scroll helpers that provide easy-to-use wrappers around this funtionality.
- Elasticsearch - The Definitive Guide:
- Foreword
- Preface
- Getting Started
- You Know, for Search…
- Installing and Running Elasticsearch
- Talking to Elasticsearch
- Document Oriented
- Finding Your Feet
- Indexing Employee Documents
- Retrieving a Document
- Search Lite
- Search with Query DSL
- More-Complicated Searches
- Full-Text Search
- Phrase Search
- Highlighting Our Searches
- Analytics
- Tutorial Conclusion
- Distributed Nature
- Next Steps
- Life Inside a Cluster
- Data In, Data Out
- What Is a Document?
- Document Metadata
- Indexing a Document
- Retrieving a Document
- Checking Whether a Document Exists
- Updating a Whole Document
- Creating a New Document
- Deleting a Document
- Dealing with Conflicts
- Optimistic Concurrency Control
- Partial Updates to Documents
- Retrieving Multiple Documents
- Cheaper in Bulk
- Distributed Document Store
- Searching—The Basic Tools
- Mapping and Analysis
- Full-Body Search
- Sorting and Relevance
- Distributed Search Execution
- Index Management
- Inside a Shard
- You Know, for Search…
- Search in Depth
- Structured Search
- Full-Text Search
- Multifield Search
- Proximity Matching
- Partial Matching
- Controlling Relevance
- Theory Behind Relevance Scoring
- Lucene’s Practical Scoring Function
- Query-Time Boosting
- Manipulating Relevance with Query Structure
- Not Quite Not
- Ignoring TF/IDF
- function_score Query
- Boosting by Popularity
- Boosting Filtered Subsets
- Random Scoring
- The Closer, The Better
- Understanding the price Clause
- Scoring with Scripts
- Pluggable Similarity Algorithms
- Changing Similarities
- Relevance Tuning Is the Last 10%
- Dealing with Human Language
- Aggregations
- Geolocation
- Modeling Your Data
- Administration, Monitoring, and Deployment