WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Inverted Indexedit
Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
For example, let’s say we have two documents, each with a content
field
containing the following:
- The quick brown fox jumped over the lazy dog
- Quick brown foxes leap over lazy dogs in summer
To create an inverted index, we first split the content
field of each
document into separate words (which we call terms, or tokens), create a
sorted list of all the unique terms, and then list in which document each term
appears. The result looks something like this:
Term Doc_1 Doc_2 ------------------------- Quick | | X The | X | brown | X | X dog | X | dogs | | X fox | X | foxes | | X in | | X jumped | X | lazy | X | X leap | | X over | X | X quick | X | summer | | X the | X | ------------------------
Now, if we want to search for quick brown
, we just need to find the
documents in which each term appears:
Term Doc_1 Doc_2 ------------------------- brown | X | X quick | X | ------------------------ Total | 2 | 1
Both documents match, but the first document has more matches than the second. If we apply a naive similarity algorithm that just counts the number of matching terms, then we can say that the first document is a better match—is more relevant to our query—than the second document.
But there are a few problems with our current inverted index:
-
Quick
andquick
appear as separate terms, while the user probably thinks of them as the same word. -
fox
andfoxes
are pretty similar, as aredog
anddogs
; They share the same root word. -
jumped
andleap
, while not from the same root word, are similar in meaning. They are synonyms.
With the preceding index, a search for +Quick +fox
wouldn’t match any
documents. (Remember, a preceding +
means that the word must be present.)
Both the term Quick
and the term fox
have to be in the same document
in order to satisfy the query, but the first doc contains quick fox
and
the second doc contains Quick foxes
.
Our user could reasonably expect both documents to match the query. We can do better.
If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:
-
Quick
can be lowercased to becomequick
. -
foxes
can be stemmed--reduced to its root form—to becomefox
. Similarly,dogs
could be stemmed todog
. -
jumped
andleap
are synonyms and can be indexed as just the single termjump
.
Now the index looks like this:
Term Doc_1 Doc_2 ------------------------- brown | X | X dog | X | X fox | X | X in | | X jump | X | X lazy | X | X over | X | X quick | X | X summer | | X the | X | X ------------------------
But we’re not there yet. Our search for +Quick +fox
would still fail,
because we no longer have the exact term Quick
in our index. However, if
we apply the same normalization rules that we used on the content
field to
our query string, it would become a query for +quick +fox
, which would
match both documents!
This is very important. You can find only terms that exist in your index, so both the indexed text and the query string must be normalized into the same form.
This process of tokenization and normalization is called analysis, which we discuss in the next section.
- Elasticsearch - The Definitive Guide:
- Foreword
- Preface
- Getting Started
- You Know, for Search…
- Installing and Running Elasticsearch
- Talking to Elasticsearch
- Document Oriented
- Finding Your Feet
- Indexing Employee Documents
- Retrieving a Document
- Search Lite
- Search with Query DSL
- More-Complicated Searches
- Full-Text Search
- Phrase Search
- Highlighting Our Searches
- Analytics
- Tutorial Conclusion
- Distributed Nature
- Next Steps
- Life Inside a Cluster
- Data In, Data Out
- What Is a Document?
- Document Metadata
- Indexing a Document
- Retrieving a Document
- Checking Whether a Document Exists
- Updating a Whole Document
- Creating a New Document
- Deleting a Document
- Dealing with Conflicts
- Optimistic Concurrency Control
- Partial Updates to Documents
- Retrieving Multiple Documents
- Cheaper in Bulk
- Distributed Document Store
- Searching—The Basic Tools
- Mapping and Analysis
- Full-Body Search
- Sorting and Relevance
- Distributed Search Execution
- Index Management
- Inside a Shard
- You Know, for Search…
- Search in Depth
- Structured Search
- Full-Text Search
- Multifield Search
- Proximity Matching
- Partial Matching
- Controlling Relevance
- Theory Behind Relevance Scoring
- Lucene’s Practical Scoring Function
- Query-Time Boosting
- Manipulating Relevance with Query Structure
- Not Quite Not
- Ignoring TF/IDF
- function_score Query
- Boosting by Popularity
- Boosting Filtered Subsets
- Random Scoring
- The Closer, The Better
- Understanding the price Clause
- Scoring with Scripts
- Pluggable Similarity Algorithms
- Changing Similarities
- Relevance Tuning Is the Last 10%
- Dealing with Human Language
- Aggregations
- Geolocation
- Modeling Your Data
- Administration, Monitoring, and Deployment