Request Body Search
Specifies search criteria as request body parameters.

GET /twitter/_search
{
  "query": {
    "term": { "user": "kimchy" }
  }
}
Request

GET /<index>/_search
{
  "query": {<parameters>}
}
Description
The search request can be executed with a search DSL, which includes the Query DSL, within its body.
Path parameters

<index>
    (Optional, string) Comma-separated list or wildcard expression of index names used to limit the request.
Request body
See the search API’s request body parameters.
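For orientation, several common body parameters can be combined in a single request. The following sketch is illustrative only (the index, field, and values are placeholders, not part of the reference above):

GET /twitter/_search
{
  "query": { "match": { "message": "elasticsearch" } },
  "from": 0,
  "size": 10,
  "sort": [ { "likes": "desc" } ]
}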
Fast check for any matching docs
terminate_after is always applied after the post_filter and stops the query as well as the aggregation executions when enough hits have been collected on the shard. The doc count in aggregations may therefore not reflect the hits.total in the response, since aggregations are applied before the post filtering.
In case we only want to know whether there are any documents matching a specific query, we can set size to 0 to indicate that we are not interested in the search results, and we can set terminate_after to 1 to indicate that query execution can be terminated as soon as the first matching document is found (per shard).
GET /_search?q=message:number&size=0&terminate_after=1
The response will not contain any hits as the size was set to 0. The hits.total will be either equal to 0, indicating that there were no matching documents, or greater than 0, meaning that there were at least that many documents matching the query when it was terminated early. Also, if the query was terminated early, the terminated_early flag will be set to true in the response.
{ "took": 3, "timed_out": false, "terminated_early": true, "_shards": { "total": 1, "successful": 1, "skipped" : 0, "failed": 0 }, "hits": { "total" : { "value": 1, "relation": "eq" }, "max_score": null, "hits": [] } }
The took time in the response contains the milliseconds that this request took for processing, beginning shortly after the node received the query and lasting until all search-related work is done, before the above JSON is returned to the client. This means it includes the time spent waiting in thread pools, executing a distributed search across the whole cluster, and gathering all the results.
Doc value fields
See doc value fields.
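For reference, a minimal request returning doc values alongside hits might look like the following sketch (the field names are illustrative and must have doc_values enabled):

GET /twitter/_search
{
  "query": { "match_all": {} },
  "docvalue_fields": [ "user", "likes" ]
}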
Field Collapsing
Allows to collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key. For instance, the query below retrieves the best tweet for each user and sorts them by number of likes.

GET /twitter/_search
{
  "query": {
    "match": { "message": "elasticsearch" }
  },
  "collapse": {
    "field": "user" (1)
  },
  "sort": ["likes"], (2)
  "from": 10 (3)
}

(1) collapse the result set using the "user" field
(2) sort the top docs by number of likes
(3) define the offset of the first collapsed result

The total number of hits in the response indicates the number of matching documents without collapsing. The total number of distinct groups is unknown.
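If an estimate of the number of distinct groups is needed, one common workaround (not part of collapse itself) is a cardinality aggregation on the collapse field. A sketch, reusing the user field from above:

GET /twitter/_search
{
  "query": { "match": { "message": "elasticsearch" } },
  "collapse": { "field": "user" },
  "aggs": {
    "distinct_users": { "cardinality": { "field": "user" } }
  }
}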
The field used for collapsing must be a single valued keyword or numeric field with doc_values activated.
The collapsing is applied to the top hits only and does not affect aggregations.
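A mapping that satisfies this requirement could look like the following sketch (keyword and numeric fields have doc_values enabled by default; the field names mirror the example above):

PUT /twitter
{
  "mappings": {
    "properties": {
      "user": { "type": "keyword" },
      "likes": { "type": "integer" }
    }
  }
}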
Expand collapse results
It is also possible to expand each collapsed top hit with the inner_hits option.

GET /twitter/_search
{
  "query": {
    "match": { "message": "elasticsearch" }
  },
  "collapse": {
    "field": "user", (1)
    "inner_hits": {
      "name": "last_tweets", (2)
      "size": 5, (3)
      "sort": [{ "date": "asc" }] (4)
    },
    "max_concurrent_group_searches": 4 (5)
  },
  "sort": ["likes"]
}

(1) collapse the result set using the "user" field
(2) the name used for the inner hit section in the response
(3) the number of inner_hits to retrieve per collapse key
(4) how to sort the documents inside each group
(5) the number of concurrent requests allowed to retrieve the inner_hits per group

See inner hits for the complete list of supported options and the format of the response.
It is also possible to request multiple inner_hits for each collapsed hit. This can be useful when you want to get multiple representations of the collapsed hits.

GET /twitter/_search
{
  "query": {
    "match": { "message": "elasticsearch" }
  },
  "collapse": {
    "field": "user", (1)
    "inner_hits": [
      {
        "name": "most_liked", (2)
        "size": 3,
        "sort": ["likes"]
      },
      {
        "name": "most_recent", (3)
        "size": 3,
        "sort": [{ "date": "asc" }]
      }
    ]
  },
  "sort": ["likes"]
}

(1) collapse the result set using the "user" field
(2) return the three most liked tweets for the user
(3) return the three most recent tweets for the user
The expansion of the group is done by sending an additional query for each inner_hit request for each collapsed hit returned in the response. This can significantly slow things down if you have too many groups and/or inner_hit requests.
The max_concurrent_group_searches request parameter can be used to control the maximum number of concurrent searches allowed in this phase. The default is based on the number of data nodes and the default search thread pool size.
collapse cannot be used in conjunction with scroll, rescore or search after.
Second level of collapsing
A second level of collapsing is also supported and is applied to inner_hits. For example, the following request finds the top scored tweets for each country, and within each country finds the top scored tweets for each user.

GET /twitter/_search
{
  "query": {
    "match": { "message": "elasticsearch" }
  },
  "collapse": {
    "field": "country",
    "inner_hits": {
      "name": "by_location",
      "collapse": { "field": "user" },
      "size": 3
    }
  }
}
Response:
{ ... "hits": [ { "_index": "twitter", "_type": "_doc", "_id": "9", "_score": ..., "_source": {...}, "fields": {"country": ["UK"]}, "inner_hits":{ "by_location": { "hits": { ..., "hits": [ { ... "fields": {"user" : ["user124"]} }, { ... "fields": {"user" : ["user589"]} }, { ... "fields": {"user" : ["user001"]} } ] } } } }, { "_index": "twitter", "_type": "_doc", "_id": "1", "_score": .., "_source": {...}, "fields": {"country": ["Canada"]}, "inner_hits":{ "by_location": { "hits": { ..., "hits": [ { ... "fields": {"user" : ["user444"]} }, { ... "fields": {"user" : ["user1111"]} }, { ... "fields": {"user" : ["user999"]} } ] } } } }, .... ] }
The second level of collapsing doesn't allow inner_hits.
Highlighting
Highlighters enable you to get highlighted snippets from one or more fields
in your search results so you can show users where the query matches are.
When you request highlights, the response contains an additional highlight
element for each search hit that includes the highlighted fields and the
highlighted fragments.
Highlighters don't reflect the boolean logic of a query when extracting terms to highlight. Thus, for some complex boolean queries (e.g. nested boolean queries, queries using minimum_should_match etc.), parts of documents may be highlighted that don't correspond to query matches.
Highlighting requires the actual content of a field. If the field is not stored (the mapping does not set store to true), the actual _source is loaded and the relevant field is extracted from _source.
For example, to get highlights for the content field in each search hit using the default highlighter, include a highlight object in the request body that specifies the content field:

GET /_search
{
  "query": {
    "match": { "content": "kimchy" }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}
Elasticsearch supports three highlighters: unified, plain, and fvh (fast vector highlighter). You can specify the highlighter type you want to use for each field.
Unified highlighter
The unified highlighter uses the Lucene Unified Highlighter. This highlighter breaks the text into sentences and uses the BM25 algorithm to score individual sentences as if they were documents in the corpus. It also supports accurate phrase and multi-term (fuzzy, prefix, regex) highlighting. This is the default highlighter.
Plain highlighter
The plain highlighter uses the standard Lucene highlighter. It attempts to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.
The plain highlighter works best for highlighting simple query matches in a single field. To accurately reflect query logic, it creates a tiny in-memory index and re-runs the original query criteria through Lucene's query execution planner to get access to low-level match information for the current document. This is repeated for every field and every document that needs to be highlighted. If you want to highlight a lot of fields in a lot of documents with complex queries, we recommend using the unified highlighter on postings or term_vector fields.
Fast vector highlighter
The fvh highlighter uses the Lucene Fast Vector highlighter. This highlighter can be used on fields with term_vector set to with_positions_offsets in the mapping. The fast vector highlighter:
- Can be customized with a boundary_scanner.
- Requires setting term_vector to with_positions_offsets, which increases the size of the index.
- Can combine matches from multiple fields into one result. See matched_fields.
- Can assign different weights to matches at different positions, allowing for things like phrase matches being sorted above term matches when highlighting a Boosting Query that boosts phrase matches over term matches.
The fvh highlighter does not support span queries. If you need support for span queries, try an alternative highlighter, such as the unified highlighter.
Offsets Strategy
To create meaningful search snippets from the terms being queried, the highlighter needs to know the start and end character offsets of each word in the original text. These offsets can be obtained from:
- The postings list. If index_options is set to offsets in the mapping, the unified highlighter uses this information to highlight documents without re-analyzing the text. It re-runs the original query directly on the postings and extracts the matching offsets from the index, limiting the collection to the highlighted documents. This is important if you have large fields because it doesn't require reanalyzing the text to be highlighted. It also requires less disk space than using term_vectors.
- Term vectors. If term_vector information is provided by setting term_vector to with_positions_offsets in the mapping, the unified highlighter automatically uses the term_vector to highlight the field. It's fast especially for large fields (> 1MB) and for highlighting multi-term queries like prefix or wildcard because it can access the dictionary of terms for each document. The fvh highlighter always uses term vectors.
- Plain highlighting. This mode is used by the unified highlighter when there is no other alternative. It creates a tiny in-memory index and re-runs the original query criteria through Lucene's query execution planner to get access to low-level match information on the current document. This is repeated for every field and every document that needs highlighting. The plain highlighter always uses plain highlighting.
Plain highlighting for large texts may require a substantial amount of time and memory. To protect against this, the maximum number of text characters that will be analyzed is limited to 1,000,000 by default. This limit can be changed for a particular index with the index setting index.highlight.max_analyzed_offset.
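For instance, the limit could be raised for a single index with a settings update; a sketch (the index name and the new value are illustrative):

PUT /example/_settings
{
  "index.highlight.max_analyzed_offset": 2000000
}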
Highlighting Settings
Highlighting settings can be set on a global level and overridden at the field level.

boundary_chars
    A string that contains each boundary character. Defaults to .,!? \t\n.
boundary_max_scan
    How far to scan for boundary characters. Defaults to 20.
boundary_scanner
    Specifies how to break the highlighted fragments: chars, sentence, or word. Only valid for the unified and fvh highlighters. Defaults to sentence for the unified highlighter and to chars for the fvh highlighter.
    chars
        Use the characters specified by boundary_chars as highlighting boundaries. The boundary_max_scan setting controls how far to scan for boundary characters. Only valid for the fvh highlighter.
    sentence
        Break highlighted fragments at the next sentence boundary, as determined by Java's BreakIterator. You can specify the locale to use with boundary_scanner_locale. When used with the unified highlighter, the sentence scanner splits sentences bigger than fragment_size at the first word boundary next to fragment_size. You can set fragment_size to 0 to never split any sentence.
    word
        Break highlighted fragments at the next word boundary, as determined by Java's BreakIterator. You can specify the locale to use with boundary_scanner_locale.
boundary_scanner_locale
    Controls which locale is used to search for sentence and word boundaries. This parameter takes the form of a language tag, e.g. "en-US", "fr-FR", "ja-JP". More info can be found in the Locale Language Tag documentation. The default value is Locale.ROOT.
encoder
    Indicates if the snippet should be HTML encoded: default (no encoding) or html (HTML-escape the snippet text and then insert the highlighting tags).
fields
    Specifies the fields to retrieve highlights for. You can use wildcards to specify fields. For example, you could specify comment_* to get highlights for all text and keyword fields that start with comment_. Only text and keyword fields are highlighted when you use wildcards. If you use a custom mapper and want to highlight on a field anyway, you must explicitly specify that field name.
force_source
    Highlight based on the source even if the field is stored separately. Defaults to false.
fragmenter
    Specifies how text should be broken up in highlight snippets: simple or span. Only valid for the plain highlighter. Defaults to span.
    simple
        Breaks up text into same-sized fragments.
    span
        Breaks up text into same-sized fragments, but tries to avoid breaking up text between highlighted terms. This is helpful when you're querying for phrases. Default.
fragment_offset
    Controls the margin from which you want to start highlighting. Only valid when using the fvh highlighter.
fragment_size
    The size of the highlighted fragment in characters. Defaults to 100.
highlight_query
    Highlight matches for a query other than the search query. This is especially useful if you use a rescore query because those are not taken into account by highlighting by default. Elasticsearch does not validate that highlight_query contains the search query in any way, so it is possible to define it so legitimate query results are not highlighted. Generally, you should include the search query as part of the highlight_query.
matched_fields
    Combine matches on multiple fields to highlight a single field. This is most intuitive for multifields that analyze the same string in different ways. All matched_fields must have term_vector set to with_positions_offsets, but only the field to which the matches are combined is loaded, so only that field benefits from having store set to yes. Only valid for the fvh highlighter.
no_match_size
    The amount of text you want to return from the beginning of the field if there are no matching fragments to highlight. Defaults to 0 (nothing is returned).
number_of_fragments
    The maximum number of fragments to return. If the number of fragments is set to 0, no fragments are returned. Instead, the entire field contents are highlighted and returned. This can be handy when you need to highlight short texts such as a title or address, but fragmentation is not required. If number_of_fragments is 0, fragment_size is ignored. Defaults to 5.
order
    Sorts highlighted fragments by score when set to score. By default, fragments will be output in the order they appear in the field (order: none). Setting this option to score will output the most relevant fragments first. Each highlighter applies its own logic to compute relevancy scores. See How highlighters work internally for more details on how different highlighters find the best fragments.
phrase_limit
    Controls the number of matching phrases in a document that are considered. Prevents the fvh highlighter from analyzing too many phrases and consuming too much memory. When using matched_fields, phrase_limit phrases per matched field are considered. Raising the limit increases query time and consumes more memory. Only supported by the fvh highlighter. Defaults to 256.
pre_tags
    Use in conjunction with post_tags to define the HTML tags to use for the highlighted text. By default, highlighted text is wrapped in <em> and </em> tags. Specify as an array of strings.
post_tags
    Use in conjunction with pre_tags to define the HTML tags to use for the highlighted text. By default, highlighted text is wrapped in <em> and </em> tags. Specify as an array of strings.
require_field_match
    By default, only fields that contain a query match are highlighted. Set require_field_match to false to highlight all fields. Defaults to true.
tags_schema
    Set to styled to use the built-in tag schema. The styled schema defines post_tags as </em> and the following pre_tags:
    <em class="hlt1">, <em class="hlt2">, <em class="hlt3">, <em class="hlt4">, <em class="hlt5">, <em class="hlt6">, <em class="hlt7">, <em class="hlt8">, <em class="hlt9">, <em class="hlt10">
Highlighting Examples
- Override global settings
- Specify a highlight query
- Set highlighter type
- Configure highlighting tags
- Highlight on source
- Highlight in all fields
- Combine matches on multiple fields
- Explicitly order highlighted fields
- Control highlighted fragments
- Highlight using the postings list
- Specify a fragmenter for the plain highlighter
Override global settings
You can specify highlighter settings globally and selectively override them for individual fields.

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "number_of_fragments": 3,
    "fragment_size": 150,
    "fields": {
      "body": { "pre_tags": ["<em>"], "post_tags": ["</em>"] },
      "blog.title": { "number_of_fragments": 0 },
      "blog.author": { "number_of_fragments": 0 },
      "blog.comment": { "number_of_fragments": 5, "order": "score" }
    }
  }
}
Specify a highlight query
You can specify a highlight_query to take additional information into account when highlighting. For example, the following query includes both the search query and rescore query in the highlight_query. Without the highlight_query, highlighting would only take the search query into account.

GET /_search
{
  "query": {
    "match": { "comment": { "query": "foo bar" } }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": { "comment": { "query": "foo bar", "slop": 1 } }
      },
      "rescore_query_weight": 10
    }
  },
  "_source": false,
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "highlight_query": {
          "bool": {
            "must": {
              "match": { "comment": { "query": "foo bar" } }
            },
            "should": {
              "match_phrase": { "comment": { "query": "foo bar", "slop": 1, "boost": 10.0 } }
            },
            "minimum_should_match": 0
          }
        }
      }
    }
  }
}
Set highlighter type
The type field allows you to force a specific highlighter type. The allowed values are: unified, plain and fvh. The following is an example that forces the use of the plain highlighter:

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": { "type": "plain" }
    }
  }
}
Configure highlighting tags
By default, the highlighting will wrap highlighted text in <em> and </em>. This can be controlled by setting pre_tags and post_tags, for example:

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "pre_tags": ["<tag1>"],
    "post_tags": ["</tag1>"],
    "fields": {
      "body": {}
    }
  }
}

When using the fast vector highlighter, you can specify additional tags and the "importance" is ordered.

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "body": {}
    }
  }
}

You can also use the built-in styled tag schema:

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "tags_schema": "styled",
    "fields": {
      "comment": {}
    }
  }
}
Highlight on source
Forces the highlighting to highlight fields based on the source even if fields are stored separately. Defaults to false.

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": { "force_source": true }
    }
  }
}
Highlight in all fields
By default, only fields that contain a query match are highlighted. Set require_field_match to false to highlight all fields.

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "require_field_match": false,
    "fields": {
      "body": { "pre_tags": ["<em>"], "post_tags": ["</em>"] }
    }
  }
}
Combine matches on multiple fields
This is only supported by the fvh highlighter.
The fast vector highlighter can combine matches on multiple fields to highlight a single field. This is most intuitive for multifields that analyze the same string in different ways. All matched_fields must have term_vector set to with_positions_offsets, but only the field to which the matches are combined is loaded, so only that field would benefit from having store set to yes.
In the following examples, comment is analyzed by the english analyzer and comment.plain is analyzed by the standard analyzer.

GET /_search
{
  "query": {
    "query_string": {
      "query": "comment.plain:running scissors",
      "fields": ["comment"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": ["comment", "comment.plain"],
        "type": "fvh"
      }
    }
  }
}

The above matches both "run with scissors" and "running with scissors" and would highlight "running" and "scissors" but not "run". If both phrases appear in a large document then "running with scissors" is sorted above "run with scissors" in the fragments list because there are more matches in that fragment.

GET /_search
{
  "query": {
    "query_string": {
      "query": "running scissors",
      "fields": ["comment", "comment.plain^10"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": ["comment", "comment.plain"],
        "type": "fvh"
      }
    }
  }
}

The above highlights "run" as well as "running" and "scissors" but still sorts "running with scissors" above "run with scissors" because the plain match ("running") is boosted.

GET /_search
{
  "query": {
    "query_string": {
      "query": "running scissors",
      "fields": ["comment", "comment.plain^10"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": ["comment.plain"],
        "type": "fvh"
      }
    }
  }
}
The above query wouldn't highlight "run" or "scissor" but shows that it is just fine not to list the field to which the matches are combined (comment) in the matched fields.
Technically it is also fine to add fields to matched_fields that don't share the same underlying string as the field to which the matches are combined. The results might not make much sense, and if one of the matches is off the end of the text then the whole query will fail.
There is a small amount of overhead involved with setting matched_fields to a non-empty array, so always prefer

"highlight": { "fields": { "comment": {} } }

to

"highlight": { "fields": { "comment": { "matched_fields": ["comment"], "type": "fvh" } } }
Explicitly order highlighted fields
Elasticsearch highlights the fields in the order that they are sent, but per the JSON spec, objects are unordered. If you need to be explicit about the order in which fields are highlighted, specify the fields as an array:

GET /_search
{
  "highlight": {
    "fields": [
      { "title": {} },
      { "text": {} }
    ]
  }
}

None of the highlighters built into Elasticsearch care about the order that the fields are highlighted, but a plugin might.
Control highlighted fragments
Each field highlighted can control the size of the highlighted fragment in characters (defaults to 100), and the maximum number of fragments to return (defaults to 5). For example:

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": { "fragment_size": 150, "number_of_fragments": 3 }
    }
  }
}

On top of this it is possible to specify that highlighted fragments need to be sorted by score:

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": { "fragment_size": 150, "number_of_fragments": 3 }
    }
  }
}
If the number_of_fragments value is set to 0, then no fragments are produced; instead the whole content of the field is returned, and of course it is highlighted. This can be very handy if short texts (like a document title or address) need to be highlighted but no fragmentation is required. Note that fragment_size is ignored in this case.

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "body": {},
      "blog.title": { "number_of_fragments": 0 }
    }
  }
}

When using fvh one can use the fragment_offset parameter to control the margin to start highlighting from.
In the case where there is no matching fragment to highlight, the default is to not return anything. Instead, we can return a snippet of text from the beginning of the field by setting no_match_size (default 0) to the length of the text that you want returned. The actual length may be shorter or longer than specified as it tries to break on a word boundary.

GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "no_match_size": 150
      }
    }
  }
}
Highlight using the postings list
Here is an example of setting the comment field in the index mapping to allow for highlighting using the postings:

PUT /example
{
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "index_options": "offsets"
      }
    }
  }
}

Here is an example of setting the comment field to allow for highlighting using the term_vectors (this will cause the index to be bigger):

PUT /example
{
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
Specify a fragmenter for the plain highlighter
When using the plain highlighter, you can choose between the simple and span fragmenters:

GET twitter/_search
{
  "query": {
    "match_phrase": { "message": "number 1" }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "plain",
        "fragment_size": 15,
        "number_of_fragments": 3,
        "fragmenter": "simple"
      }
    }
  }
}

Response:

{
  ...
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 1.6011951,
    "hits": [
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.6011951,
        "_source": {
          "user": "test",
          "message": "some message with the number 1",
          "date": "2009-11-15T14:12:12",
          "likes": 1
        },
        "highlight": {
          "message": [
            " with the <em>number</em>",
            " <em>1</em>"
          ]
        }
      }
    ]
  }
}

GET twitter/_search
{
  "query": {
    "match_phrase": { "message": "number 1" }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "plain",
        "fragment_size": 15,
        "number_of_fragments": 3,
        "fragmenter": "span"
      }
    }
  }
}

Response:

{
  ...
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 1.6011951,
    "hits": [
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.6011951,
        "_source": {
          "user": "test",
          "message": "some message with the number 1",
          "date": "2009-11-15T14:12:12",
          "likes": 1
        },
        "highlight": {
          "message": [
            " with the <em>number</em> <em>1</em>"
          ]
        }
      }
    ]
  }
}
If the number_of_fragments option is set to 0, NullFragmenter is used, which does not fragment the text at all. This is useful for highlighting the entire contents of a document or field.
How highlighters work internally
Given a query and a text (the content of a document field), the goal of a highlighter is to find the best text fragments for the query, and highlight the query terms in the found fragments. For this, a highlighter needs to address several questions:
- How to break a text into fragments?
- How to find the best fragments among all fragments?
- How to highlight the query terms in a fragment?
How to break a text into fragments?
Relevant settings: fragment_size, fragmenter, type of highlighter, boundary_chars, boundary_max_scan, boundary_scanner, boundary_scanner_locale.
The plain highlighter begins by analyzing the text using the given analyzer and creating a token stream from it. It uses a very simple algorithm to break the token stream into fragments: it loops through the terms in the token stream, and every time the current term's end_offset exceeds fragment_size multiplied by the number of created fragments, a new fragment is created. A little more computation is done with the span fragmenter to avoid breaking up text between highlighted terms. But overall, since the breaking is done only by fragment_size, some fragments can be quite odd, e.g. beginning with a punctuation mark.
The unified and fvh highlighters do a better job of breaking up a text into fragments by utilizing Java's BreakIterator. This ensures that a fragment is a valid sentence as long as fragment_size allows for this.
How to find the best fragments?
Relevant settings: number_of_fragments.
To find the best, most relevant fragments, a highlighter needs to score each fragment with respect to the given query. The goal is to score only those terms that participated in generating the hit on the document. For some complex queries, this is still a work in progress.
The plain highlighter creates an in-memory index from the current token stream and re-runs the original query criteria through Lucene's query execution planner to get access to low-level match information for the current text. For more complex queries the original query may be converted to a span query, as span queries can handle phrases more accurately. The obtained low-level match information is then used to score each individual fragment. The scoring method of the plain highlighter is quite simple: each fragment is scored by the number of unique query terms found in it. The score of an individual term is equal to its boost, which is 1 by default. Thus, by default, a fragment that contains one unique query term will get a score of 1, a fragment that contains two unique query terms will get a score of 2, and so on. The fragments are then sorted by their scores, so the highest scored fragments will be output first.
The fvh highlighter doesn't need to analyze the text and build an in-memory index, as it uses pre-indexed document term vectors and finds among them the terms that correspond to the query. It scores each fragment by the number of query terms found in that fragment. Similarly to the plain highlighter, the score of an individual term is equal to its boost value. In contrast to the plain highlighter, all query terms are counted, not only unique terms.
The unified highlighter can use pre-indexed term vectors or pre-indexed term offsets, if they are available. Otherwise, similar to the plain highlighter, it has to create an in-memory index from the text. The unified highlighter uses the BM25 scoring model to score fragments.
How to highlight the query terms in a fragment?
Relevant settings: pre_tags, post_tags.
The goal is to highlight only those terms that participated in generating the hit on the document. For some complex boolean queries, this is still a work in progress, as highlighters don't reflect the boolean logic of a query and only extract leaf (terms, phrases, prefix etc.) queries.
The plain highlighter, given the token stream and the original text, recomposes the original text to highlight only terms from the token stream that are contained in the low-level match information structure from the previous step.
The fvh and unified highlighters use intermediate data structures to represent fragments in some raw form, and then populate them with actual text.
A highlighter uses pre_tags and post_tags to encode highlighted terms.
An example of the work of the unified highlighter
Let's look in more detail at how the unified highlighter works.
First, we create an index with a text field content, which will be indexed using the english analyzer, and will be indexed without offsets or term vectors.

PUT test_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

We put the following document into the index:

PUT test_index/_doc/doc1
{
  "content" : "For you I'm only a fox like a hundred thousand other foxes. But if you tame me, we'll need each other. You'll be the only boy in the world for me. I'll be the only fox in the world for you."
}

And we run the following query with a highlight request:

GET test_index/_search
{
  "query": {
    "match_phrase": { "content": "only fox" }
  },
  "highlight": {
    "type": "unified",
    "number_of_fragments": 3,
    "fields": {
      "content": {}
    }
  }
}

After doc1 is found as a hit for this query, the hit will be passed to the unified highlighter for highlighting the field content of the document. Since the field content was not indexed either with offsets or term vectors, its raw field value will be analyzed, and an in-memory index will be built from the terms that match the query:

{"token":"onli","start_offset":12,"end_offset":16,"position":3},
{"token":"fox","start_offset":19,"end_offset":22,"position":5},
{"token":"fox","start_offset":53,"end_offset":58,"position":11},
{"token":"onli","start_offset":117,"end_offset":121,"position":24},
{"token":"onli","start_offset":159,"end_offset":163,"position":34},
{"token":"fox","start_offset":164,"end_offset":167,"position":35}
Our complex phrase query will be converted to the span query spanNear([text:onli, text:fox], 0, true), meaning that we are looking for the terms "onli" and "fox" within 0 distance from each other, and in the given order. The span query will be run against the previously created in-memory index, to find the following match:

{"term":"onli", "start_offset":159, "end_offset":163},
{"term":"fox", "start_offset":164, "end_offset":167}
In our example, we have got a single match, but there could be several matches. Given the matches, the unified highlighter breaks the text of the field into so-called "passages". Each passage must contain at least one match. The unified highlighter, with the use of Java's BreakIterator, ensures that each passage represents a full sentence as long as it doesn't exceed fragment_size. For our example, we have got a single passage with the following properties (showing only a subset of the properties here):

Passage:
    startOffset: 147
    endOffset: 189
    score: 3.7158387
    matchStarts: [159, 164]
    matchEnds: [163, 167]
    numMatches: 2
Notice how a passage has a score, calculated using the BM25 scoring formula adapted for passages. Scores allow us to choose the best scoring passages if there are more passages available than the number_of_fragments requested by the user. Scores also let us sort passages by order: "score" if requested by the user.
As the final step, the unified highlighter will extract from the field's text a string corresponding to each passage:

"I'll be the only fox in the world for you."

and will format all matches in this string with the tags <em> and </em>, using the passage's matchStarts and matchEnds information:

I'll be the <em>only</em> <em>fox</em> in the world for you.

These formatted strings are the final result of the highlighter returned to the user.
Index Boost
Allows to configure a different boost level per index when searching across more than one index. This is very handy when hits coming from one index matter more than hits coming from another index (think social graph where each user has an index).

Deprecated in 5.2.0. This format is deprecated. Please use the array format instead.

GET /_search
{
  "indices_boost": {
    "index1": 1.4,
    "index2": 1.3
  }
}

You can also specify it as an array to control the order of boosts.

GET /_search
{
  "indices_boost": [
    { "alias1": 1.4 },
    { "index*": 1.3 }
  ]
}

This is important when you use aliases or wildcard expressions. If multiple matches are found, the first match will be used. For example, if an index is included in both alias1 and index*, a boost value of 1.4 is applied.
Inner hits
The parent-join and nested features allow the return of documents that have matches in a different scope. In the parent/child case, parent documents are returned based on matches in child documents or child documents are returned based on matches in parent documents. In the nested case, documents are returned based on matches in nested inner objects.
In both cases, the actual matches in the different scopes that caused a document to be returned are hidden. In many cases, it’s very useful to know which inner nested objects (in the case of nested) or children/parent documents (in the case of parent/child) caused certain information to be returned. The inner hits feature can be used for this. This feature returns per search hit in the search response additional nested hits that caused a search hit to match in a different scope.
Inner hits can be used by defining an inner_hits definition on a nested, has_child or has_parent query and filter. The structure looks like this:

"<query>" : {
  "inner_hits" : {
    <inner_hits_options>
  }
}

If inner_hits is defined on a query that supports it, then each search hit will contain an inner_hits json object with the following structure:

"hits": [
  {
    "_index": ...,
    "_type": ...,
    "_id": ...,
    "inner_hits": {
      "<inner_hits_name>": {
        "hits": {
          "total": ...,
          "hits": [
            { "_type": ..., "_id": ..., ... },
            ...
          ]
        }
      }
    },
    ...
  },
  ...
]
Options
Inner hits support the following options:

from
    The offset from where the first hit to fetch for each inner_hits in the returned regular search hits.
size
    The maximum number of hits to return per inner_hits. By default the top three matching hits are returned.
sort
    How the inner hits should be sorted per inner_hits. By default the hits are sorted by the score.
name
    The name to be used for the particular inner hit definition in the response. Useful when multiple inner hits have been defined in a single search request. The default depends on which query the inner hit is defined in: for the nested query and filter it is the nested path, for the has_child query and filter it is the child type, and for the has_parent query and filter it is the parent type.
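To make these options concrete, here is a sketch combining them in a nested query, assuming the comments nested mapping (with a numeric comments.number field) used in the examples below:

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": { "match": { "comments.number": 2 } },
      "inner_hits": {
        "name": "top_comments",
        "from": 0,
        "size": 2,
        "sort": [ { "comments.number": "asc" } ]
      }
    }
  }
}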
Inner hits also support per-document features such as highlighting, explain, source filtering, script fields, and doc value fields.
Nested inner hits
The nested inner_hits can be used to include nested inner objects as inner hits to a search hit.
PUT test { "mappings": { "properties": { "comments": { "type": "nested" } } } } PUT test/_doc/1?refresh { "title": "Test title", "comments": [ { "author": "kimchy", "number": 1 }, { "author": "nik9000", "number": 2 } ] } POST test/_search { "query": { "nested": { "path": "comments", "query": { "match": {"comments.number" : 2} }, "inner_hits": {} } } }
An example of a response snippet that could be generated from the above search request:
{ ..., "hits": { "total" : { "value": 1, "relation": "eq" }, "max_score": 1.0, "hits": [ { "_index": "test", "_type": "_doc", "_id": "1", "_score": 1.0, "_source": ..., "inner_hits": { "comments": { "hits": { "total" : { "value": 1, "relation": "eq" }, "max_score": 1.0, "hits": [ { "_index": "test", "_type": "_doc", "_id": "1", "_nested": { "field": "comments", "offset": 1 }, "_score": 1.0, "_source": { "author": "nik9000", "number": 2 } } ] } } } } ] } }
The name used in the inner hit definition in the search request ("comments" above). A custom key can be used via the name option.
The _nested metadata is crucial in the above example, because it defines from what inner nested object this inner hit came from. The field defines the object array field the nested hit is from, and the offset is relative to its location in the _source. Due to sorting and scoring, the actual location of the hit objects in the inner_hits is usually different from the location where the nested inner object was defined.
By default the _source is also returned for the hit objects in inner_hits, but this can be changed: via the _source filtering feature, part of the source can be returned, or it can be disabled entirely. If stored fields are defined on the nested level, these can also be returned via the fields feature.
An important default is that the _source returned in hits inside inner_hits is relative to the _nested metadata. So in the above example only the comment part is returned per nested hit, and not the entire source of the top-level document that contained the comment.
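As a sketch of the _source filtering option mentioned above (assuming the comments mapping from the example, and assuming the full dotted path is used as the include pattern), the source returned per inner hit can be narrowed to a single field:

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": { "match": { "comments.number": 2 } },
      "inner_hits": {
        "_source": [ "comments.author" ]
      }
    }
  }
}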
Nested inner hits and _source
Nested documents don't have a _source field, because the entire source of the document is stored with the root document under its _source field. To include the source of just the nested document, the source of the root document is parsed and just the relevant bit for the nested document is included as source in the inner hit. Doing this for each matching nested document has an impact on the time it takes to execute the entire search request, especially when size and the inner hits' size are set higher than the default. To avoid the relatively expensive source extraction for nested inner hits, one can disable including the source and solely rely on doc values fields, like this:
PUT test { "mappings": { "properties": { "comments": { "type": "nested" } } } } PUT test/_doc/1?refresh { "title": "Test title", "comments": [ { "author": "kimchy", "text": "comment text" }, { "author": "nik9000", "text": "words words words" } ] } POST test/_search { "query": { "nested": { "path": "comments", "query": { "match": {"comments.text" : "words"} }, "inner_hits": { "_source" : false, "docvalue_fields" : [ "comments.text.keyword" ] } } } }
Hierarchical levels of nested object fields and inner hits
If a mapping has multiple levels of hierarchical nested object fields, each level can be accessed via a dot notated path. For example, if there is a comments nested field that contains a votes nested field, and votes should be returned directly with the root hits, then the following path can be defined:
PUT test { "mappings": { "properties": { "comments": { "type": "nested", "properties": { "votes": { "type": "nested" } } } } } } PUT test/_doc/1?refresh { "title": "Test title", "comments": [ { "author": "kimchy", "text": "comment text", "votes": [] }, { "author": "nik9000", "text": "words words words", "votes": [ {"value": 1 , "voter": "kimchy"}, {"value": -1, "voter": "other"} ] } ] } POST test/_search { "query": { "nested": { "path": "comments.votes", "query": { "match": { "comments.votes.voter": "kimchy" } }, "inner_hits" : {} } } }
Which would look like:
{ ..., "hits": { "total" : { "value": 1, "relation": "eq" }, "max_score": 0.6931471, "hits": [ { "_index": "test", "_type": "_doc", "_id": "1", "_score": 0.6931471, "_source": ..., "inner_hits": { "comments.votes": { "hits": { "total" : { "value": 1, "relation": "eq" }, "max_score": 0.6931471, "hits": [ { "_index": "test", "_type": "_doc", "_id": "1", "_nested": { "field": "comments", "offset": 1, "_nested": { "field": "votes", "offset": 0 } }, "_score": 0.6931471, "_source": { "value": 1, "voter": "kimchy" } } ] } } } } ] } }
This indirect referencing is only supported for nested inner hits.
Parent/child inner hits
The parent/child inner_hits can be used to include parent or child documents:

PUT test
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "my_parent": "my_child"
        }
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "number": 1,
  "my_join_field": "my_parent"
}

PUT test/_doc/2?routing=1&refresh
{
  "number": 1,
  "my_join_field": {
    "name": "my_child",
    "parent": "1"
  }
}

POST test/_search
{
  "query": {
    "has_child": {
      "type": "my_child",
      "query": { "match": { "number": 1 } },
      "inner_hits": {}
    }
  }
}
An example of a response snippet that could be generated from the above search request:
{ ..., "hits": { "total" : { "value": 1, "relation": "eq" }, "max_score": 1.0, "hits": [ { "_index": "test", "_type": "_doc", "_id": "1", "_score": 1.0, "_source": { "number": 1, "my_join_field": "my_parent" }, "inner_hits": { "my_child": { "hits": { "total" : { "value": 1, "relation": "eq" }, "max_score": 1.0, "hits": [ { "_index": "test", "_type": "_doc", "_id": "2", "_score": 1.0, "_routing": "1", "_source": { "number": 1, "my_join_field": { "name": "my_child", "parent": "1" } } } ] } } } } ] } }
min_score
Exclude documents which have a _score less than the minimum specified in min_score:

GET /_search
{
  "min_score": 0.5,
  "query": {
    "term": { "user": "kimchy" }
  }
}

Note that, most of the time, this does not make much sense, but it is provided for advanced use cases.
Named Queries
Each filter and query can accept a _name in its top level definition.

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name.first": { "query": "shay", "_name": "first" } } },
        { "match": { "name.last": { "query": "banon", "_name": "last" } } }
      ],
      "filter": {
        "terms": {
          "name.last": ["banon", "kimchy"],
          "_name": "test"
        }
      }
    }
  }
}

The search response will include, for each hit, the matched_queries it matched on. The tagging of queries and filters only makes sense for the bool query.
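For instance, a hit that matched the queries named "first" and "test" above would carry a matched_queries entry along these lines (a sketch of a single hit, not a full response; the index and id are illustrative):

{
  "_index": "test",
  "_id": "1",
  "_score": 1.0,
  "_source": { ... },
  "matched_queries": ["first", "test"]
}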
Post filter
The post_filter is applied to the search hits at the very end of a search request, after aggregations have already been calculated. Its purpose is best explained by example:
Imagine that you are selling shirts that have the following properties:

PUT /shirts
{
  "mappings": {
    "properties": {
      "brand": { "type": "keyword" },
      "color": { "type": "keyword" },
      "model": { "type": "keyword" }
    }
  }
}

PUT /shirts/_doc/1?refresh
{
  "brand": "gucci",
  "color": "red",
  "model": "slim"
}

Imagine a user has specified two filters: color:red and brand:gucci. You only want to show them red shirts made by Gucci in the search results. Normally you would do this with a bool query:

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red" } },
        { "term": { "brand": "gucci" } }
      ]
    }
  }
}

However, you would also like to use faceted navigation to display a list of other options that the user could click on. Perhaps you have a model field that would allow the user to limit their search results to red Gucci t-shirts or dress-shirts. This can be done with a terms aggregation:

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red" } },
        { "term": { "brand": "gucci" } }
      ]
    }
  },
  "aggs": {
    "models": {
      "terms": { "field": "model" }
    }
  }
}

But perhaps you would also like to tell the user how many Gucci shirts are available in other colors. If you just add a terms aggregation on the color field, you will only get back the color red, because your query returns only red shirts by Gucci.
Instead, you want to include shirts of all colors during aggregation, then apply the colors filter only to the search results. This is the purpose of the post_filter:

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": { "brand": "gucci" }
      }
    }
  },
  "aggs": {
    "colors": {
      "terms": { "field": "color" }
    },
    "color_red": {
      "filter": {
        "term": { "color": "red" }
      },
      "aggs": {
        "models": {
          "terms": { "field": "model" }
        }
      }
    }
  },
  "post_filter": {
    "term": { "color": "red" }
  }
}
Preference
Controls a preference of the shard copies on which to execute the search. By default, Elasticsearch selects from the available shard copies in an unspecified order, taking the allocation awareness and adaptive replica selection configuration into account. However, it may sometimes be desirable to try and route certain searches to certain sets of shard copies.
A possible use case would be to make use of per-copy caches like the request cache. Doing this, however, runs contrary to the idea of search parallelization and can create hotspots on certain nodes because the load might not be evenly distributed anymore.
The preference is a query string parameter which can be set to:

_only_local
    The operation will be executed only on shards allocated to the local node.
_local
    The operation will be executed on shards allocated to the local node if possible, and will fall back to other shards if not.
_prefer_nodes:abc,xyz
    The operation will be executed on nodes with one of the provided node ids (abc or xyz in this case) if possible.
_shards:2,3
    Restricts the operation to the specified shards (2 and 3 in this case). This preference can be combined with other preferences, but it has to appear first: _shards:2,3|_local.
_only_nodes:abc*,x*yz,...
    Restricts the operation to nodes specified according to the node specification. If suitable shard copies exist on more than one of the selected nodes then the order of preference between these copies is unspecified.
Custom (string) value
    Any value that does not start with _. If two searches both give the same custom string value for their preference, and the underlying cluster state does not change, then the same ordering of shards will be used for the searches.
For instance, use the user's session ID xyzabc123 as follows:

GET /_search?preference=xyzabc123
{
  "query": {
    "match": { "title": "elasticsearch" }
  }
}

This can be an effective strategy to increase usage of e.g. the request cache for unique users running similar searches repeatedly by always hitting the same cache, while requests of different users are still spread across all shard copies.
The _only_local preference guarantees only to use shard copies on the local node, which is sometimes useful for troubleshooting. All other options do not fully guarantee that any particular shard copies are used in a search, and on a changing index this may mean that repeated searches may yield different results if they are executed on different shard copies which are in different refresh states.
Rescoring
Rescoring can help to improve precision by reordering just the top (e.g. 100 - 500) documents returned by the query and post_filter phases, using a secondary (usually more costly) algorithm, instead of applying the costly algorithm to all documents in the index.
A rescore request is executed on each shard before it returns its results to be sorted by the node handling the overall search request.
Currently the rescore API has only one implementation: the query rescorer, which uses a query to tweak the scoring. In the future, alternative rescorers may be made available, for example, a pair-wise rescorer.
An error will be thrown if an explicit sort (other than _score in descending order) is provided with a rescore query.
When exposing pagination to your users, you should not change window_size as you step through each page (by passing different from values) since that can alter the top hits, causing results to confusingly shift as the user steps through pages.
Query rescorer
The query rescorer executes a second query only on the top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
By default the scores from the original query and the rescore query are combined linearly to produce the final _score for each document. The relative importance of the original query and of the rescore query can be controlled with query_weight and rescore_query_weight respectively. Both default to 1.
For example:

POST /_search
{
  "query": {
    "match": {
      "message": {
        "operator": "or",
        "query": "the quick brown"
      }
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "message": {
            "query": "the quick brown",
            "slop": 2
          }
        }
      },
      "query_weight": 0.7,
      "rescore_query_weight": 1.2
    }
  }
}
The way the scores are combined can be controlled with the score_mode:

total
    Add the original score and the rescore query score. The default.
multiply
    Multiply the original score by the rescore query score. Useful for function query rescores.
avg
    Average the original score and the rescore query score.
max
    Take the max of the original score and the rescore query score.
min
    Take the min of the original score and the rescore query score.
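For example, to keep only the better of the two scores rather than summing them, score_mode can be set inside the rescore query; a sketch based on the example above:

POST /_search
{
  "query": {
    "match": { "message": "the quick brown" }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "score_mode": "max",
      "rescore_query": {
        "match_phrase": {
          "message": { "query": "the quick brown", "slop": 2 }
        }
      }
    }
  }
}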
Multiple Rescores
It is also possible to execute multiple rescores in sequence:

POST /_search
{
  "query": {
    "match": {
      "message": {
        "operator": "or",
        "query": "the quick brown"
      }
    }
  },
  "rescore": [
    {
      "window_size": 100,
      "query": {
        "rescore_query": {
          "match_phrase": {
            "message": {
              "query": "the quick brown",
              "slop": 2
            }
          }
        },
        "query_weight": 0.7,
        "rescore_query_weight": 1.2
      }
    },
    {
      "window_size": 10,
      "query": {
        "score_mode": "multiply",
        "rescore_query": {
          "function_score": {
            "script_score": {
              "script": {
                "source": "Math.log10(doc.likes.value + 2)"
              }
            }
          }
        }
      }
    }
  ]
}

The first one gets the results of the query, then the second one gets the results of the first, and so on. The second rescore will "see" the sorting done by the first rescore, so it is possible to use a large window on the first rescore to pull documents into a smaller window for the second rescore.
Script Fields
Allows to return a script evaluation (based on different fields) for each hit, for example:

GET /_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "test1": {
      "script": {
        "lang": "painless",
        "source": "doc['price'].value * 2"
      }
    },
    "test2": {
      "script": {
        "lang": "painless",
        "source": "doc['price'].value * params.factor",
        "params": {
          "factor": 2.0
        }
      }
    }
  }
}

Script fields can work on fields that are not stored (price in the above case), and allow custom values to be returned (the evaluated value of the script).
Script fields can also access the actual _source document and extract specific elements to be returned from it by using params['_source']. Here is an example:

GET /_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "test1": {
      "script": "params['_source']['message']"
    }
  }
}

Note the _source keyword here, used to navigate the json-like model.
It's important to understand the difference between doc['my_field'].value and params['_source']['my_field']. The first, using the doc keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[...] notation only allows for simple valued fields (you can't return a json object from it) and makes sense only for non-analyzed or single term based fields. However, using doc is still the recommended way to access values from the document, if at all possible, because _source must be loaded and parsed every time it's used. Using _source is very slow.
Scroll
While a search request returns a single "page" of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.
In order to use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the "search context" alive (see Keeping the search context alive), e.g. ?scroll=1m.

POST /twitter/_search?scroll=1m
{
  "size": 100,
  "query": {
    "match": { "title": "elasticsearch" }
  }
}
The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.

POST /_search/scroll (1)
{
  "scroll" : "1m", (2)
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" (3)
}

(1) GET or POST can be used and the URL should not include the index name; this is specified in the original search request instead.
(2) The scroll parameter tells Elasticsearch to keep the search context open for another 1m.
(3) The scroll_id parameter is the _scroll_id returned by the previous request.
The size parameter allows you to configure the maximum number of hits to be returned with each batch of results. Each call to the scroll API returns the next batch of results until there are no more results left to return, i.e. the hits array is empty.
The initial search request and each subsequent scroll request each return a _scroll_id. While the _scroll_id may change between requests, it doesn't always change; in any case, only the most recently received _scroll_id should be used.
If the request specifies aggregations, only the initial search response will contain the aggregations results.
Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

GET /_search?scroll=1m
{
  "sort": [
    "_doc"
  ]
}
Keeping the search context aliveedit
A scroll returns all the documents which matched the search at the time of the
initial search request. It ignores any subsequent changes to these documents.
The scroll_id
identifies a search context which keeps track of everything
that Elasticsearch needs to return the correct documents. The search context is created
by the initial request and kept alive by subsequent requests.
The scroll
parameter (passed to the search
request and to every scroll
request) tells Elasticsearch how long it should keep the search context alive.
Its value (e.g. 1m
, see Time units) does not need to be long enough to
process all data — it just needs to be long enough to process the previous
batch of results. Each scroll
request (with the scroll
parameter) sets a
new expiry time. If a scroll
request doesn’t pass in the scroll
parameter, then the search context will be freed as part of that scroll
request.
Normally, the background merge process optimizes the index by merging together smaller segments to create new, bigger segments. Once the smaller segments are no longer needed they are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted since they are still in use.
Keeping older segments alive means that more disk space and file handles are needed. Ensure that you have configured your nodes to have ample free file handles. See File Descriptors.
Additionally, if a segment contains deleted or updated documents then the search context must keep track of whether each document in the segment was live at the time of the initial search request. Ensure that your nodes have sufficient heap space if you have many open scrolls on an index that is subject to ongoing deletes or updates.
To prevent issues caused by having too many scrolls open, the
user is not allowed to open scrolls past a certain limit. By default, the
maximum number of open scrolls is 500. This limit can be updated with the
search.max_open_scroll_context
cluster setting.
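As a sketch, the limit could be raised with the cluster update settings API (the value 1024 here is purely illustrative):
PUT /_cluster/settings { "persistent" : { "search.max_open_scroll_context" : 1024 } }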
You can check how many search contexts are open with the nodes stats API:
GET /_nodes/stats/indices/search
Clear scroll APIedit
Search contexts are automatically removed when the scroll
timeout has been exceeded. However, keeping scrolls open has a cost, as
discussed in the previous section, so scrolls should be cleared explicitly
as soon as they are no longer needed, using the
clear-scroll
API:
DELETE /_search/scroll { "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" }
Multiple scroll IDs can be passed as an array:
DELETE /_search/scroll { "scroll_id" : [ "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==", "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB" ] }
All search contexts can be cleared with the _all
parameter:
DELETE /_search/scroll/_all
The scroll_id
can also be passed as a query string parameter or in the request body.
Multiple scroll IDs can be passed as comma separated values:
DELETE /_search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==,DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB
Sliced Scrolledit
For scroll queries that return a lot of documents it is possible to split the scroll into multiple slices which can be consumed independently:
GET /twitter/_search?scroll=1m { "slice": { "id": 0, "max": 2 }, "query": { "match" : { "title" : "elasticsearch" } } } GET /twitter/_search?scroll=1m { "slice": { "id": 1, "max": 2 }, "query": { "match" : { "title" : "elasticsearch" } } }
The first request returns documents that belong to the first slice (id: 0) and the
second request returns documents that belong to the second slice. Since the maximum number of slices is set to 2,
the union of the results of the two requests is equivalent to the results of a scroll query without slicing.
By default the splitting is done on the shards first and then locally on each shard using the _id field
with the following formula:
slice(doc) = floorMod(hashCode(doc._id), max)
For instance if the number of shards is equal to 2 and the user requested 4 slices then the slices 0 and 2 are assigned
to the first shard and the slices 1 and 3 are assigned to the second shard.
Each scroll is independent and can be processed in parallel like any scroll request.
If the number of slices is bigger than the number of shards the slice filter is very slow on the first calls: it has a complexity of O(N) and a memory cost equal to N bits per slice, where N is the total number of documents in the shard. After a few calls the filter should be cached and subsequent calls should be faster, but you should limit the number of sliced queries you perform in parallel to avoid memory explosion.
To avoid this cost entirely it is possible to use the doc_values
of another field to do the slicing
but the user must ensure that the field has the following properties:
- The field is numeric.
- doc_values are enabled on that field.
- Every document should contain a single value. If a document has multiple values for the specified field, the first value is used.
- The value for each document should be set once when the document is created and never updated. This ensures that each slice gets deterministic results.
- The cardinality of the field should be high. This ensures that each slice gets approximately the same amount of documents.
GET /twitter/_search?scroll=1m { "slice": { "field": "date", "id": 0, "max": 10 }, "query": { "match" : { "title" : "elasticsearch" } } }
For append only time-based indices, the timestamp
field can be used safely.
By default the maximum number of slices allowed per scroll is limited to 1024.
You can update the index.max_slices_per_scroll
index setting to bypass this limit.
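For instance, assuming the twitter index from the examples above, the limit could be raised with the update index settings API (2048 is an arbitrary illustrative value):
PUT /twitter/_settings { "index.max_slices_per_scroll" : 2048 }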
Search Afteredit
Pagination of results can be done by using the from
and size
parameters, but the cost becomes prohibitive for deep pagination.
The index.max_result_window
setting, which defaults to 10,000, is a safeguard: search requests take heap memory and time proportional to from + size
.
The scroll API is recommended for efficient deep scrolling, but scroll contexts are costly and it is not
recommended for real time user requests.
The search_after
parameter circumvents this problem by providing a live cursor.
The idea is to use the results from the previous page to help the retrieval of the next page.
Suppose that the query to retrieve the first page looks like this:
GET twitter/_search { "size": 10, "query": { "match" : { "title" : "elasticsearch" } }, "sort": [ {"date": "asc"}, {"tie_breaker_id": "asc"} ] }
A field with one unique value per document should be used as the tiebreaker
of the sort specification. Otherwise the sort order for documents that have
the same sort values would be undefined and could lead to missing or duplicate
results. The _id
field has a unique value per document
but it is not recommended to use it as a tiebreaker directly.
Beware that search_after
looks for the first document which fully or partially
matches the tiebreaker’s provided value. Therefore if a document has a tiebreaker value of
"654323"
and you search_after
for "654"
it would still match that document
and return results found after it.
Doc values are disabled on this field, so sorting on it requires
loading a lot of data in memory. Instead it is advised to duplicate (client side
or with a set ingest processor) the content
of the _id
field in another field that has
doc values enabled, and to use this new field as the tiebreaker
for the sort.
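A minimal sketch of that duplication with a set ingest processor, assuming documents are indexed with an explicit _id (the pipeline name tie-breaker and the target field tie_breaker_id are hypothetical):
PUT /_ingest/pipeline/tie-breaker { "processors" : [ { "set" : { "field" : "tie_breaker_id", "value" : "{{_id}}" } } ] }
Documents indexed through this pipeline, e.g. with ?pipeline=tie-breaker, then carry a doc-values-enabled copy of their _id that can be used as the sort tiebreaker.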
The result from the above request includes an array of sort values
for each document.
These sort values
can be used in conjunction with the search_after
parameter to start returning results "after" any
document in the result list.
For instance we can use the sort values
of the last document and pass them to search_after
to retrieve the next page of results:
GET twitter/_search { "size": 10, "query": { "match" : { "title" : "elasticsearch" } }, "search_after": [1463538857, "654323"], "sort": [ {"date": "asc"}, {"tie_breaker_id": "asc"} ] }
The parameter from
must be set to 0 (or -1) when search_after
is used.
search_after
is not a solution to jump freely to a random page but rather to scroll many queries in parallel.
It is very similar to the scroll
API but unlike it, the search_after
parameter is stateless: it is always resolved against the latest
version of the searcher. For this reason the sort order may change during a walk depending on the updates and deletes of your index.
Search Typeedit
A distributed search can follow different execution paths. The search operation needs to be scattered to all the relevant shards and all the results then gathered back, and this scatter/gather execution can be done in several ways, particularly with search engines.
One of the questions when executing a distributed search is how many results to retrieve from each shard. For example, if we have 10 shards, the 1st shard might hold the most relevant results from 0 till 10, with other shards’ results ranking below it. For this reason, when executing a request, we will need to get results from 0 till 10 from all shards, sort them, and then return the results if we want to ensure correct results.
Another question, which relates to the search engine, is the fact that each shard stands on its own. When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first gather the term frequencies from all shards to calculate global term frequencies, then execute the query on each shard using these global frequencies.
Also, because of the need to sort the results, getting back a large
document set, or even scrolling it, while maintaining the correct sorting
behavior can be a very expensive operation. For large result set
scrolling, it is best to sort by _doc
if the order in which documents
are returned is not important.
Elasticsearch is very flexible and allows you to control the type of search to execute on a per search request basis. The type can be configured by setting the search_type parameter in the query string. The types are:
Query Then Fetchedit
Parameter value: query_then_fetch.
The request is processed in two phases. In the first phase, the query
is forwarded to all involved shards. Each shard executes the search
and generates a sorted list of results, local to that shard. Each
shard returns just enough information to the coordinating node
to allow it to merge and re-sort the shard level results into a globally
sorted set of results, of maximum length size
.
During the second phase, the coordinating node requests the document content (and highlighted snippets, if any) from only the relevant shards.
GET twitter/_search?search_type=query_then_fetch
This is the default setting if you do not specify a search_type
in your request.
Dfs, Query Then Fetchedit
Parameter value: dfs_query_then_fetch.
Same as "Query Then Fetch", except for an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
GET twitter/_search?search_type=dfs_query_then_fetch
Sortedit
Allows you to add one or more sorts on specific fields. Each sort can be
reversed as well. The sort is defined on a per field level, with the special
field names _score
to sort by score, and _doc
to sort by index order.
Assuming the following index mapping:
PUT /my_index { "mappings": { "properties": { "post_date": { "type": "date" }, "user": { "type": "keyword" }, "name": { "type": "keyword" }, "age": { "type": "integer" } } } }
GET /my_index/_search { "sort" : [ { "post_date" : {"order" : "asc"}}, "user", { "name" : "desc" }, { "age" : "desc" }, "_score" ], "query" : { "term" : { "user" : "kimchy" } } }
_doc
has no real use-case besides being the most efficient sort order.
So if you don’t care about the order in which documents are returned, then you
should sort by _doc
. This especially helps when scrolling.
Sort Valuesedit
The sort values for each document returned are also returned as part of the response.
Sort Orderedit
The order
option can have the following values:
asc |
Sort in ascending order |
desc |
Sort in descending order |
The order defaults to desc
when sorting on the _score
, and defaults
to asc
when sorting on anything else.
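The default can be overridden per field; for example, a sketch that sorts by _score ascending (reusing the kimchy query from the examples above):
GET /_search { "sort" : [ { "_score" : { "order" : "asc" } } ], "query" : { "term" : { "user" : "kimchy" } } }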
Sort mode optionedit
Elasticsearch supports sorting by array or multi-valued fields. The mode
option
controls what array value is picked for sorting the document it belongs
to. The mode
option can have the following values:
min |
Pick the lowest value. |
max |
Pick the highest value. |
sum |
Use the sum of all values as sort value. Only applicable for number based array fields. |
avg |
Use the average of all values as sort value. Only applicable for number based array fields. |
median |
Use the median of all values as sort value. Only applicable for number based array fields. |
The default sort mode in the ascending sort order is min
— the lowest value
is picked. The default sort mode in the descending order is max
— the highest value is picked.
Sort mode example usageedit
In the example below the field price has multiple prices per document. In this case the result hits will be sorted by price ascending based on the average price per document.
PUT /my_index/_doc/1?refresh { "product": "chocolate", "price": [20, 4] } POST /_search { "query" : { "term" : { "product" : "chocolate" } }, "sort" : [ {"price" : {"order" : "asc", "mode" : "avg"}} ] }
Sorting numeric fieldsedit
For numeric fields it is also possible to cast the values from one type
to another using the numeric_type
option.
This option accepts the following values: ["double", "long", "date", "date_nanos"]
and can be useful for cross-index search if the sort field is mapped differently on some
indices.
Consider for instance these two indices:
PUT /index_double { "mappings": { "properties": { "field": { "type": "double" } } } }
PUT /index_long { "mappings": { "properties": { "field": { "type": "long" } } } }
Since field
is mapped as a double
in the first index and as a long
in the second index, it is not possible to use this field to sort requests
that query both indices by default. However you can cast the values to one type
or the other with the numeric_type
option in order to force a specific
type for all indices:
POST /index_long,index_double/_search { "sort" : [ { "field" : { "numeric_type" : "double" } } ] }
In the example above, values for the index_long
index are cast to
a double in order to be compatible with the values produced by the
index_double
index.
It is also possible to transform a floating point field into a long,
but note that in this case floating point values are replaced by the largest
value that is less than or equal (greater than or equal if the value
is negative) to the argument and is equal to a mathematical integer; in
other words, the value is truncated toward zero.
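For example, reusing the two indices above, the following sketch forces the long type instead, so values from index_double are truncated as just described:
POST /index_long,index_double/_search { "sort" : [ { "field" : { "numeric_type" : "long" } } ] }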
This option can also be used to convert a date
field that uses millisecond
resolution to a date_nanos
field with nanosecond resolution.
Consider for instance these two indices:
PUT /index_double { "mappings": { "properties": { "field": { "type": "date" } } } }
PUT /index_long { "mappings": { "properties": { "field": { "type": "date_nanos" } } } }
Values in these indices are stored with different resolutions so sorting on these
fields will always sort the date
before the date_nanos
(ascending order).
With the numeric_type
option it is possible to set a single resolution for
the sort: setting it to date
will convert the date_nanos
to the millisecond resolution
while date_nanos
will convert the values in the date
field to the nanoseconds resolution:
POST /index_long,index_double/_search { "sort" : [ { "field" : { "numeric_type" : "date_nanos" } } ] }
To avoid overflow, the conversion to date_nanos
cannot be applied to dates before
1970 or after 2262, as nanoseconds are represented as longs.
Sorting within nested objectsedit
Elasticsearch also supports sorting by
fields that are inside one or more nested objects. The sorting by nested
field support has a nested
sort option with the following properties:
-
path
- Defines on which nested object to sort. The actual sort field must be a direct field inside this nested object. When sorting by nested field, this field is mandatory.
-
filter
-
A filter that the inner objects inside the nested path
should match in order for their field values to be taken into account
by sorting. A common case is to repeat the query / filter inside the
nested filter or query. By default no
filter is active.
-
max_children
- The maximum number of children to consider per root document when picking the sort value. Defaults to unlimited. (See the sketch after this list.)
-
nested
-
Same as top-level
nested
but applies to another nested path within the current nested object.
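A sketch of max_children in a nested sort, reusing the offer field from the examples below and limiting the sort to the first 10 matching children per root document (the value 10 is purely illustrative):
POST /_search { "query" : { "term" : { "product" : "chocolate" } }, "sort" : [ { "offer.price" : { "mode" : "avg", "order" : "asc", "nested" : { "path" : "offer", "max_children" : 10 } } } ] }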
Nested sort options before Elasticsearch 6.1
The nested_path
and nested_filter
options have been deprecated in
favor of the options documented above.
Nested sorting examplesedit
In the below example offer
is a field of type nested
.
The nested path
needs to be specified; otherwise, Elasticsearch doesn’t know on what nested level sort values need to be captured.
POST /_search { "query" : { "term" : { "product" : "chocolate" } }, "sort" : [ { "offer.price" : { "mode" : "avg", "order" : "asc", "nested": { "path": "offer", "filter": { "term" : { "offer.color" : "blue" } } } } } ] }
In the below example parent
and child
fields are of type nested
.
The nested path
needs to be specified at each level; otherwise, Elasticsearch doesn’t know on what nested level sort values need to be captured.
POST /_search { "query": { "nested": { "path": "parent", "query": { "bool": { "must": {"range": {"parent.age": {"gte": 21}}}, "filter": { "nested": { "path": "parent.child", "query": {"match": {"parent.child.name": "matt"}} } } } } } }, "sort" : [ { "parent.child.age" : { "mode" : "min", "order" : "asc", "nested": { "path": "parent", "filter": { "range": {"parent.age": {"gte": 21}} }, "nested": { "path": "parent.child", "filter": { "match": {"parent.child.name": "matt"} } } } } } ] }
Nested sorting is also supported when sorting by scripts and sorting by geo distance.
Missing Valuesedit
The missing
parameter specifies how docs which are missing
the sort field should be treated: the missing
value can be
set to _last
, _first
, or a custom value (that
will be used for missing docs as the sort value).
The default is _last
.
For example:
GET /_search { "sort" : [ { "price" : {"missing" : "_last"} } ], "query" : { "term" : { "product" : "chocolate" } } }
If a nested inner object doesn’t match
the nested filter
then a missing value is used.
Ignoring Unmapped Fieldsedit
By default, the search request will fail if there is no mapping
associated with a field. The unmapped_type
option allows you to ignore
fields that have no mapping and not sort by them. The value of this
parameter is used to determine what sort values to emit. Here is an
example of how it can be used:
GET /_search { "sort" : [ { "price" : {"unmapped_type" : "long"} } ], "query" : { "term" : { "product" : "chocolate" } } }
If any of the indices that are queried doesn’t have a mapping for price
then Elasticsearch will handle it as if there was a mapping of type
long
, with all documents in this index having no value for this field.
Geo Distance Sortingedit
Allows you to sort by _geo_distance
. Here is an example, assuming pin.location
is a field of type geo_point
:
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : [-70, 40], "order" : "asc", "unit" : "km", "mode" : "min", "distance_type" : "arc", "ignore_unmapped": true } } ], "query" : { "term" : { "user" : "kimchy" } } }
-
distance_type
-
How to compute the distance. Can either be
arc
(default), or plane
(faster, but inaccurate on long distances and close to the poles). -
mode
-
What to do in case a field has several geo points. By default, the shortest
distance is taken into account when sorting in ascending order and the
longest distance when sorting in descending order. Supported values are
min
, max
, median
and avg
.
unit
-
The unit to use when computing sort values. The default is
m
(meters). -
ignore_unmapped
-
Indicates if the unmapped field should be treated as a missing value. Setting it to
true
is equivalent to specifying an unmapped_type
in the field sort. The default is false
(an unmapped field causes the search to fail).
geo distance sorting does not support configurable missing values: the
distance will always be considered equal to Infinity
when a document does not
have values for the field that is used for distance computation.
The following formats are supported in providing the coordinates:
Lat Lon as Propertiesedit
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : { "lat" : 40, "lon" : -70 }, "order" : "asc", "unit" : "km" } } ], "query" : { "term" : { "user" : "kimchy" } } }
Lat Lon as Stringedit
Format in lat,lon
.
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : "40,-70", "order" : "asc", "unit" : "km" } } ], "query" : { "term" : { "user" : "kimchy" } } }
Geohashedit
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : "drm3btev3e86", "order" : "asc", "unit" : "km" } } ], "query" : { "term" : { "user" : "kimchy" } } }
Multiple reference pointsedit
Multiple geo points can be passed as an array containing any geo_point
format, for example
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : [[-70, 40], [-71, 42]], "order" : "asc", "unit" : "km" } } ], "query" : { "term" : { "user" : "kimchy" } } }
and so forth.
The final distance for a document will then be the min
/max
/avg
(defined via mode
) of the distances from all points contained in the document to all points given in the sort request.
Script Based Sortingedit
Allows sorting based on custom scripts. Here is an example:
GET /_search { "query" : { "term" : { "user" : "kimchy" } }, "sort" : { "_script" : { "type" : "number", "script" : { "lang": "painless", "source": "doc['field_name'].value * params.factor", "params" : { "factor" : 1.1 } }, "order" : "asc" } } }
Track Scoresedit
When sorting on a field, scores are not computed. By setting
track_scores
to true, scores will still be computed and tracked.
GET /_search { "track_scores": true, "sort" : [ { "post_date" : {"order" : "desc"} }, { "name" : "desc" }, { "age" : "desc" } ], "query" : { "term" : { "user" : "kimchy" } } }
Memory Considerationsedit
When sorting, the relevant sorted field values are loaded into memory.
This means that per shard, there should be enough memory to contain
them. For string based types, the field sorted on should not be analyzed
/ tokenized. For numeric types, if possible, it is recommended to
explicitly set the type to narrower types (like short
, integer
and
float
).
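As an illustration, a mapping sketch that uses narrower numeric types (the index name my_narrow_index and its fields are hypothetical):
PUT /my_narrow_index { "mappings" : { "properties" : { "age" : { "type" : "short" }, "rating" : { "type" : "float" } } } }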
Source filteringedit
See source filtering.
Stored Fieldsedit
The stored_fields
parameter is about fields that are explicitly marked as
stored in the mapping, which is off by default and generally not recommended.
Use source filtering instead to select
subsets of the original source document to be returned.
Allows you to selectively load specific stored fields for each document represented by a search hit.
GET /_search { "stored_fields" : ["user", "postDate"], "query" : { "term" : { "user" : "kimchy" } } }
*
can be used to load all stored fields from the document.
An empty array will cause only the _id
and _type
for each hit to be
returned, for example:
GET /_search { "stored_fields" : [], "query" : { "term" : { "user" : "kimchy" } } }
If the requested fields are not stored (store
mapping set to false
), they will be ignored.
Stored field values fetched from the document itself are always returned as an array. By contrast, metadata fields like _routing
are never returned as an array.
Also, only leaf fields can be returned via the stored_fields
option. If an object field is specified, it will be ignored.
On its own, stored_fields
cannot be used to load fields in nested
objects — if a field contains a nested object in its path, then no data will
be returned for that stored field. To access nested fields, stored_fields
must be used within an inner_hits
block.
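A sketch of that pattern, assuming a hypothetical nested comments field whose comments.text subfield is mapped with store set to true:
GET /_search { "query" : { "nested" : { "path" : "comments", "query" : { "match" : { "comments.text" : "elasticsearch" } }, "inner_hits" : { "stored_fields" : ["comments.text"] } } } }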
Track total hitsedit
Generally the total hit count can’t be computed accurately without visiting all
matches, which is costly for queries that match lots of documents. The
track_total_hits
parameter allows you to control how the total number of hits
should be tracked.
Given that it is often enough to have a lower bound of the number of hits,
such as "there are at least 10000 hits", the default is set to 10,000
.
This means that requests will count the total hits accurately up to 10,000
hits.
It is a good trade-off to speed up searches if you don’t need the accurate number
of hits after a certain threshold.
When set to true
the search response will always track the number of hits that
match the query accurately (e.g. total.relation
will always be equal to "eq"
when track_total_hits
is set to true). Otherwise the "total.relation"
returned
in the "total"
object in the search response determines how the "total.value"
should be interpreted. A value of "gte"
means that the "total.value"
is a
lower bound of the total hits that match the query and a value of "eq"
indicates
that "total.value"
is the accurate count.
GET twitter/_search { "track_total_hits": true, "query": { "match" : { "message" : "Elasticsearch" } } }
... returns:
{ "_shards": ... "timed_out": false, "took": 100, "hits": { "max_score": 1.0, "total" : { "value": 2048, "relation": "eq" }, "hits": ... } }
It is also possible to set track_total_hits
to an integer.
For instance the following query will accurately track the number of hits that match
the query up to 100 documents:
GET twitter/_search { "track_total_hits": 100, "query": { "match" : { "message" : "Elasticsearch" } } }
The hits.total.relation
in the response will indicate if the
value returned in hits.total.value
is accurate ("eq"
) or a lower
bound of the total ("gte"
).
For instance the following response:
{ "_shards": ... "timed_out": false, "took": 30, "hits" : { "max_score": 1.0, "total" : { "value": 42, "relation": "eq" }, "hits": ... } }
... indicates that the number of hits returned in the total
is accurate.
If the total number of hits that match the query is greater than the
value set in track_total_hits
, the total hits in the response
will indicate that the returned value is a lower bound:
{ "_shards": ... "hits" : { "max_score": 1.0, "total" : { "value": 100, "relation": "gte" }, "hits": ... } }
If you don’t need to track the total number of hits at all you can improve query
times by setting this option to false
:
GET twitter/_search { "track_total_hits": false, "query": { "match" : { "message" : "Elasticsearch" } } }
... returns a response in which the hits.total object is omitted entirely:
{ "_shards": ... "timed_out": false, "took": ... "hits" : { "max_score": 1.0, "hits": ... } }
Finally you can force an accurate count by setting "track_total_hits"
to true
in the request.