WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Reducing Words to Their Root Form
Most languages of the world are inflected, meaning that words can change their form to express differences in the following:
- Number: fox, foxes
- Tense: pay, paid, paying
- Gender: waiter, waitress
- Person: hear, hears
- Case: I, me, my
- Aspect: ate, eaten
- Mood: so be it, were it so
While inflection aids expressivity, it interferes with retrievability, as a single root word sense (or meaning) may be represented by many different sequences of letters. English is a weakly inflected language (you could ignore inflections and still get reasonable search results), but some other languages are highly inflected and need extra work in order to achieve high-quality search results.
Stemming attempts to remove the differences between inflected forms of a word, in order to reduce each word to its root form. For instance, foxes may be reduced to the root fox, to remove the difference between singular and plural in the same way that we removed the difference between lowercase and uppercase.
The root form of a word may not even be a real word. The words jumping and jumpiness may both be stemmed to jumpi. It doesn’t matter: as long as the same terms are produced at index time and at search time, search will just work.
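The point about producing the same terms at index time and at search time can be sketched in a few lines of Python. This is only an illustration: `toy_stem`, its suffix rules, `build_index`, and `search` are hypothetical names invented here, and the rules are chosen to reproduce the examples above, not to mirror any stemmer that ships with Elasticsearch.

```python
from collections import defaultdict

# Hypothetical suffix rules (assumption), checked in order, longest first.
SUFFIX_RULES = [("iness", "i"), ("ing", "i"), ("es", ""), ("s", "")]

def toy_stem(word):
    """Reduce a word to a crude root form; the result need not be a real word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least three letters of stem so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

def build_index(docs):
    """Map each stemmed term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index

def search(index, query):
    """Stem the query with the same function used at index time, then look it up."""
    results = set()
    for token in query.split():
        results |= index.get(toy_stem(token), set())
    return results

docs = {1: "quick brown foxes", 2: "jumpiness of the fox", 3: "lazy dogs"}
index = build_index(docs)

print(search(index, "fox"))      # documents 1 and 2: foxes and fox share the term fox
print(search(index, "jumping"))  # document 2: jumping and jumpiness share the term jumpi
```

The term jumpi never appears in any document or query as typed. The match works only because both sides pass through the same function.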
If stemming were easy, there would be only one implementation. Unfortunately, stemming is an inexact science that suffers from two issues: understemming and overstemming.
Understemming is the failure to reduce words with the same meaning to the same root. For example, jumped and jumps may be reduced to jump, while jumping may be reduced to jumpi. Understemming reduces recall: relevant documents are not returned.
Overstemming is the failure to keep two words with distinct meanings separate. For instance, general and generate may both be stemmed to gener. Overstemming reduces precision: irrelevant documents are returned when they shouldn’t be.
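Both failure modes can be reproduced with a deliberately naive suffix stripper. The sketch below is an assumption-laden toy: `RULES`, `toy_stem`, and `group_by_stem` are hypothetical names, and the rules were picked only so that the example words behave as described above. Grouping words by their stem shows which forms would match each other in a search.

```python
from collections import defaultdict

# Hypothetical rules (assumption), chosen only to reproduce the examples in the text.
RULES = [("ing", "i"), ("ate", ""), ("ed", ""), ("al", ""), ("s", "")]

def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least three letters of stem."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def group_by_stem(words):
    """Group words by the term they would be indexed under."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)

groups = group_by_stem(["jumped", "jumps", "jumping", "general", "generate"])
for stem, members in groups.items():
    print(stem, members)

# Understemming: jumping lands under jumpi, apart from jumped and jumps (jump),
# so a search for "jumps" would miss documents that contain only "jumping".
# Overstemming: general and generate share gener,
# so a search for "general" would also return documents about "generate".
```

Tightening the rules to fix one problem tends to worsen the other, which is why there are several stemmers to choose from.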
First we will discuss the two classes of stemmers available in Elasticsearch (Algorithmic Stemmers and Dictionary Stemmers), and then look at how to choose the right stemmer for your needs in Choosing a Stemmer. Finally, we will discuss options for tailoring stemming in Controlling Stemming and Stemming in situ.