Controlling Stemming
Out-of-the-box stemming solutions are never perfect. Algorithmic stemmers, especially, will blithely apply their rules to any words they encounter, perhaps conflating words that you would prefer to keep separate. Maybe, for your use case, it is important to keep skies and skiing as distinct words rather than stemming them both down to ski (as would happen with the english analyzer).

The keyword_marker and stemmer_override token filters allow us to customize the stemming process.
Preventing Stemming
The stem_exclusion parameter for language analyzers (see Configuring Language Analyzers) allows us to specify a list of words that should not be stemmed. Internally, these language analyzers use the keyword_marker token filter to mark the listed words as keywords, which prevents subsequent stemming token filters from touching those words.

For instance, we can create a simple custom analyzer that uses the porter_stem token filter, but prevents the word skies from being stemmed:
PUT /my_index { "settings": { "analysis": { "filter": { "no_stem": { "type": "keyword_marker", "keywords": [ "skies" ] } }, "analyzer": { "my_english": { "tokenizer": "standard", "filter": [ "lowercase", "no_stem", "porter_stem" ] } } } } }
Testing it with the analyze API shows that just the word skies has been excluded from stemming:
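A minimal check might look like the following (the sample text is our own illustration):

GET /my_index/_analyze?analyzer=my_english
sky skies skiing skis

With the my_english analyzer above, skiing and skis should be stemmed down to ski, while skies passes through unchanged because the keyword_marker filter has marked it as a keyword.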
While the language analyzers allow us only to specify an array of words in the stem_exclusion parameter, the keyword_marker token filter also accepts a keywords_path parameter that allows us to store all of our keywords in a file. The file should contain one word per line, and must be present on every node in the cluster. See Updating Stopwords for tips on how to update this file.
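As a sketch, the no_stem filter from the previous example could read its keywords from such a file instead of an inline array; the filename stemmer_keywords.txt is only an assumed example, resolved relative to the Elasticsearch config directory:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords_path": "stemmer_keywords.txt"
        }
      }
    }
  }
}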
Customizing Stemming
In the preceding example, we prevented skies from being stemmed, but perhaps we would prefer it to be stemmed to sky instead. The stemmer_override token filter allows us to specify our own custom stemming rules. At the same time, we can handle some irregular forms like stemming mice to mouse and feet to foot:
PUT /my_index { "settings": { "analysis": { "filter": { "custom_stem": { "type": "stemmer_override", "rules": [ "skies=>sky", "mice=>mouse", "feet=>foot" ] } }, "analyzer": { "my_english": { "tokenizer": "standard", "filter": [ "lowercase", "custom_stem", "porter_stem" ] } } } } } GET /my_index/_analyze?analyzer=my_english The mice came down from the skies and ran over my feet
Rules take the form original=>stem. The stemmer_override filter must be placed before the stemmer. The analyze request above returns the mouse came down from the sky and ran over my foot.
Just as for the keyword_marker token filter, rules can be stored in a file whose location should be specified with the rules_path parameter.
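For instance (a sketch only; the filename custom_stems.txt is an assumed example), the rules could live in a file under the config directory, with one original=>stem mapping per line:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules_path": "custom_stems.txt"
        }
      }
    }
  }
}

As with keywords_path, such a file would need to be present on every node in the cluster.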