WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Cross-fields Entity Search
Now we come to a common pattern: cross-fields entity search. With entities like person, product, or address, the identifying information is spread across several fields. We may have a person indexed as follows:
{ "firstname": "Peter", "lastname": "Smith" }
Or an address like this:
{ "street": "5 Poland Street", "city": "London", "country": "United Kingdom", "postcode": "W1V 3DG" }
This sounds a lot like the example we described in Multiple Query Strings, but there is a big difference between these two scenarios. In Multiple Query Strings, we used a separate query string for each field. In this scenario, we want to search across multiple fields with a single query string.
Our user might search for the person “Peter Smith” or for the address “Poland Street W1V.” Each of those words appears in a different field, so using a dis_max / best_fields query to find the single best-matching field is clearly the wrong approach.
A Naive Approach
Really, we want to query each field in turn and add up the scores of every field that matches, which sounds like a job for the bool query:
{ "query": { "bool": { "should": [ { "match": { "street": "Poland Street W1V" }}, { "match": { "city": "Poland Street W1V" }}, { "match": { "country": "Poland Street W1V" }}, { "match": { "postcode": "Poland Street W1V" }} ] } } }
Repeating the query string for every field soon becomes tedious. We can use the multi_match query instead, and set the type to most_fields to tell it to combine the scores of all matching fields:
{ "query": { "multi_match": { "query": "Poland Street W1V", "type": "most_fields", "fields": [ "street", "city", "country", "postcode" ] } } }
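To make the relationship between the two requests concrete, here is a minimal Python sketch that builds both bodies from one query string and one field list. The helper names bool_should_query and most_fields_query are hypothetical, introduced only for illustration. The sketch only constructs the request bodies as dictionaries and does not talk to a cluster.

```python
import json

# Field list taken from the address example above.
ADDRESS_FIELDS = ["street", "city", "country", "postcode"]


def bool_should_query(query_string, fields):
    """Build the naive bool query: one match clause per field.

    Hypothetical helper; it mirrors the hand-written request body above.
    """
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: query_string}} for field in fields]
            }
        }
    }


def most_fields_query(query_string, fields):
    """Build the shorter multi_match request with type most_fields.

    Hypothetical helper; the query string is stated once instead of per field.
    """
    return {
        "query": {
            "multi_match": {
                "query": query_string,
                "type": "most_fields",
                "fields": list(fields),
            }
        }
    }


if __name__ == "__main__":
    # Print both bodies so they can be compared side by side.
    print(json.dumps(bool_should_query("Poland Street W1V", ADDRESS_FIELDS), indent=2))
    print(json.dumps(most_fields_query("Poland Street W1V", ADDRESS_FIELDS), indent=2))
```

Either dictionary could be serialized and sent as the body of a search request. The multi_match form simply avoids repeating the query string for each field.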
Problems with the most_fields Approach
The most_fields approach to entity search has some problems that are not immediately obvious:
- It is designed to find the most fields matching any words, rather than to find the most matching words across all fields.
- It can’t use the operator or minimum_should_match parameters to reduce the long tail of less-relevant results.
- Term frequencies are different in each field and could interfere with each other to produce badly ordered results.
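The first problem is easier to see with a toy model. The sketch below is not Elasticsearch's scoring: it assumes a deliberately simplified score in which each field contributes the number of query words it contains, and the per-field scores are added up in the spirit of most_fields. The documents, the whitespace tokenizer, and the function names are all hypothetical and exist only for illustration.

```python
FIELDS = ["street", "city", "country", "postcode"]
QUERY = "Poland Street W1V"


def tokenize(text):
    """Lowercase and split on whitespace (a crude stand-in for an analyzer)."""
    return text.lower().split()


def toy_most_fields_score(doc, query, fields):
    """Toy score: sum over fields of how many query words each field contains.

    This is an assumption made for illustration, not real TF/IDF scoring.
    """
    terms = set(tokenize(query))
    return sum(len(terms & set(tokenize(doc.get(field, "")))) for field in fields)


def distinct_words_matched(doc, query, fields):
    """Count how many distinct query words appear anywhere in the document."""
    terms = set(tokenize(query))
    seen = set()
    for field in fields:
        seen |= set(tokenize(doc.get(field, "")))
    return len(terms & seen)


# The address from the example above: every query word matches somewhere.
target = {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V 3DG",
}

# A made-up document that repeats one query word in several fields
# but never contains "W1V".
repeater = {
    "street": "Poland Street",
    "city": "Poland",
    "country": "Poland",
    "postcode": "00-001",
}

if __name__ == "__main__":
    for name, doc in [("target", target), ("repeater", repeater)]:
        print(
            name,
            "toy score:", toy_most_fields_score(doc, QUERY, FIELDS),
            "distinct words matched:", distinct_words_matched(doc, QUERY, FIELDS),
        )
```

Under this toy model the made-up document receives the higher summed score even though it matches fewer of the words the user typed. Adding up per-field matches rewards the number of matching fields rather than the number of matching words.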