significant_terms Demo

Because the significant_terms aggregation works by analyzing statistics, you need to have a certain threshold of data for it to become effective. That means we won’t be able to index a small amount of example data for the demo.

Instead, we prepared a dataset that contains about 80,000 documents and saved it as a snapshot in our public demo repository. To "restore" this dataset into your cluster:

  1. Add the following setting to your elasticsearch.yml configuration file to whitelist the Elastic demo repository:

    repositories.url.allowed_urls: ["http://download.elastic.co/*"]
  2. Restart Elasticsearch.
  3. Run the following snapshot commands. (For more information about using snapshots, see Backing Up Your Cluster.)

    PUT /_snapshot/sigterms 
    {
        "type": "url",
        "settings": {
            "url": "http://download.elastic.co/definitiveguide/sigterms_demo/"
        }
    }
    
    GET /_snapshot/sigterms/_all 
    
    POST /_snapshot/sigterms/snapshot/_restore 
    
    GET /mlmovies,mlratings/_recovery 

    The PUT registers a new read-only URL repository pointing at the demo snapshot.

    The first GET (optional) inspects the repository to learn details about the available snapshots.

    The POST begins the restore process, which downloads two indices into your cluster: mlmovies and mlratings.

    The final GET (optional) monitors the restore progress by using the Recovery API.

The dataset is around 50 MB and may take some time to download.
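Once the recovery completes, you can optionally verify the restore by counting the documents in each index; the totals should match the figures you will see in the searches below (10,681 movies and 69,796 ratings):

GET mlmovies/_count

GET mlratings/_count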

In this demo, we are going to look at movie ratings by users of MovieLens. At MovieLens, users make movie recommendations so that other users can find new movies to watch. We will use significant_terms to recommend movies based on an input movie.

Let’s take a look at some sample data, to get a feel for what we are working with. There are two indices in this dataset, mlmovies and mlratings. Let’s look at mlmovies first:

GET mlmovies/_search 

{
   "took": 4,
   "timed_out": false,
   "_shards": {...},
   "hits": {
      "total": 10681,
      "max_score": 1,
      "hits": [
         {
            "_index": "mlmovies",
            "_type": "mlmovie",
            "_id": "2",
            "_score": 1,
            "_source": {
               "offset": 2,
               "bytes": 34,
               "title": "Jumanji (1995)"
            }
         },
         ....

The search has no query, so we get back a random sampling of docs.

Each document in mlmovies represents a single movie. The two important pieces of data are the _id of the movie and the title of the movie. You can ignore offset and bytes; they are artifacts of the process used to extract this data from the original CSV files. There are 10,681 movies in this dataset.
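Because each movie’s _id is its MovieLens movie ID, you can also fetch a single movie directly by ID instead of searching. For example, this retrieves the Jumanji document from the sample above:

GET mlmovies/mlmovie/2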

Now let’s look at mlratings:

GET mlratings/_search

{
   "took": 3,
   "timed_out": false,
   "_shards": {...},
   "hits": {
      "total": 69796,
      "max_score": 1,
      "hits": [
         {
            "_index": "mlratings",
            "_type": "mlrating",
            "_id": "00IC-2jDQFiQkpD6vhbFYA",
            "_score": 1,
            "_source": {
               "offset": 1,
               "bytes": 108,
               "movie": [122,185,231,292,
                  316,329,355,356,362,364,370,377,420,
                  466,480,520,539,586,588,589,594,616
               ],
               "user": 1
            }
         },
         ...

Here we can see the recommendations of individual users. Each document represents a single user, identified by the user field, which holds the user ID. The movie field holds a list of movies that this user watched and recommended.
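If you want to inspect a single user’s recommendations in isolation, a term query on the user field will do it; user 1 is the sample document shown above:

GET mlratings/_search
{
  "query": {
    "term": {
      "user": 1
    }
  }
}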

Recommending Based on Popularity

The first strategy we could try is recommending movies based on popularity. Given a particular movie, we find all users who recommended that movie. Then we aggregate all of their recommendations and take the top five most popular.

We can express that easily with a terms aggregation and some filtering. Let’s look at Talladega Nights, a comedy about NASCAR racing starring Will Ferrell. Ideally, our recommender should find other comedies in a similar style (and more than likely also starring Will Ferrell).

First we need to find the Talladega Nights ID:

GET mlmovies/_search
{
  "query": {
    "match": {
      "title": "Talladega Nights"
    }
  }
}

    ...
    "hits": [
     {
        "_index": "mlmovies",
        "_type": "mlmovie",
        "_id": "46970", 
        "_score": 3.658795,
        "_source": {
           "offset": 9575,
           "bytes": 74,
           "title": "Talladega Nights: The Ballad of Ricky Bobby (2006)"
        }
     },
    ...

Talladega Nights is ID 46970.

Armed with the ID, we can now filter the ratings and apply our terms aggregation to find the most popular movies from people who also like Talladega Nights:

GET mlratings/_search
{
  "size" : 0, 
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "movie": 46970 
        }
      }
    }
  },
  "aggs": {
    "most_popular": {
      "terms": {
        "field": "movie", 
        "size": 6
      }
    }
  }
}

We execute our query on mlratings this time, and set the size to 0 since we are interested only in the aggregation results.

Apply a filter on the ID corresponding to Talladega Nights.

Finally, find the most popular movies by using a terms bucket.

We perform the search on the mlratings index and apply a filter for the ID of Talladega Nights. Since aggregations operate on query scope, this effectively restricts the aggregation to the users who recommended Talladega Nights. Finally, we execute a terms aggregation to bucket the most popular movies. We request the top six results, since it is likely that Talladega Nights itself will show up as one of the top buckets (and we don’t want to recommend the same movie).

The results come back like so:

{
...
   "aggregations": {
      "most_popular": {
         "buckets": [
            {
               "key": 46970,
               "key_as_string": "46970",
               "doc_count": 271
            },
            {
               "key": 2571,
               "key_as_string": "2571",
               "doc_count": 197
            },
            {
               "key": 318,
               "key_as_string": "318",
               "doc_count": 196
            },
            {
               "key": 296,
               "key_as_string": "296",
               "doc_count": 183
            },
            {
               "key": 2959,
               "key_as_string": "2959",
               "doc_count": 183
            },
            {
               "key": 260,
               "key_as_string": "260",
               "doc_count": 90
            }
         ]
      }
   }
...

We need to correlate these back to their original titles, which can be done with a simple filtered query:

GET mlmovies/_search
{
  "query": {
    "filtered": {
      "filter": {
        "ids": {
          "values": [2571,318,296,2959,260]
        }
      }
    }
  }
}

And finally, we end up with the following list:

  1. Matrix, The
  2. Shawshank Redemption
  3. Pulp Fiction
  4. Fight Club
  5. Star Wars Episode IV: A New Hope

OK, well, that is certainly a good list! I like all of those movies. But that’s the problem: almost everyone likes that list. Those movies are universally well liked, which means they are popular on everyone’s recommendations. The list is basically a recommendation of popular movies, not a set of recommendations related to Talladega Nights.

This is easily verified by running the aggregation again, but without the filter on Talladega Nights, which gives us the top five most popular movies overall:

GET mlratings/_search
{
  "size" : 0,
  "aggs": {
    "most_popular": {
      "terms": {
        "field": "movie",
        "size": 5
      }
    }
  }
}

This returns a list that is very similar:

  1. Shawshank Redemption
  2. Silence of the Lambs, The
  3. Pulp Fiction
  4. Forrest Gump
  5. Star Wars Episode IV: A New Hope

Clearly, just checking the most popular movies is not sufficient to build a good, discriminating recommender.

Recommending Based on Statistics

Now that the scene is set, let’s try using significant_terms. significant_terms will analyze the group of people who enjoy Talladega Nights (the foreground group) and determine what movies are most popular. It will then construct a list of popular films for everyone (the background group) and compare the two.

The statistical anomalies will be the movies that are over-represented in the foreground compared to the background. Theoretically, this should be a list of comedies, since people who enjoy Will Ferrell comedies will recommend them at a higher rate than the background population of people.

Let’s give it a shot:

GET mlratings/_search
{
  "size" : 0,
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "movie": 46970
        }
      }
    }
  },
  "aggs": {
    "most_sig": {
      "significant_terms": { 
        "field": "movie",
        "size": 6
      }
    }
  }
}

The only change is that we use significant_terms instead of terms.

As you can see, the query is nearly the same. We filter for users who liked Talladega Nights; this forms the foreground group. By default, significant_terms will use the entire index as the background, so we don’t need to do anything special.
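If you ever do need a narrower comparison set, significant_terms also accepts an optional background_filter parameter (check that your version of Elasticsearch supports it), which restricts the background to a subset of the index. Purely as a sketch, here is what it might look like if we compared against only the users who recommended movie 2571 (one of the IDs from the earlier popularity list) rather than the whole index:

GET mlratings/_search
{
  "size" : 0,
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "movie": 46970
        }
      }
    }
  },
  "aggs": {
    "most_sig": {
      "significant_terms": {
        "field": "movie",
        "size": 6,
        "background_filter": {
          "term": {
            "movie": 2571
          }
        }
      }
    }
  }
}

For this demo, though, the index-wide default background is exactly what we want.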

The results come back as a list of buckets similar to terms, but with some extra metadata:

...
   "aggregations": {
      "most_sig": {
         "doc_count": 271, 
         "buckets": [
            {
               "key": 46970,
               "key_as_string": "46970",
               "doc_count": 271,
               "score": 256.549815498155,
               "bg_count": 271
            },
            {
               "key": 52245, 
               "key_as_string": "52245",
               "doc_count": 59, 
               "score": 17.66462367106966,
               "bg_count": 185 
            },
            {
               "key": 8641,
               "key_as_string": "8641",
               "doc_count": 107,
               "score": 13.884387742677438,
               "bg_count": 762
            },
            {
               "key": 58156,
               "key_as_string": "58156",
               "doc_count": 17,
               "score": 9.746428133759462,
               "bg_count": 28
            },
            {
               "key": 52973,
               "key_as_string": "52973",
               "doc_count": 95,
               "score": 9.65770100311672,
               "bg_count": 857
            },
            {
               "key": 35836,
               "key_as_string": "35836",
               "doc_count": 128,
               "score": 9.199001116457955,
               "bg_count": 1610
            }
         ]
 ...

The top-level doc_count shows the number of docs in the foreground group.

Each bucket lists the key (the movie ID) being aggregated, the doc_count for that bucket within the foreground group, and a bg_count, which is the number of times that movie appears in the entire background set.

You can see that the first bucket we get back is Talladega Nights. It is found in all 271 documents, which is not surprising. Let’s look at the next bucket: key 52245.

This ID corresponds to Blades of Glory, a comedy about male figure skating that also stars Will Ferrell. We can see that it was recommended 59 times by the people who also liked Talladega Nights. This means that 21% of the foreground group recommended Blades of Glory (59 / 271 = 0.2177).

In contrast, Blades of Glory was recommended only 185 times in the entire dataset, which equates to a mere 0.26% (185 / 69796 = 0.00265). Blades of Glory is therefore a statistical anomaly: it is uncommonly common in the group of people who like Talladega Nights. We just found a good recommendation!
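As a side note, the score column reflects these same ratios. Assuming the default JLH heuristic (which rewards terms whose foreground rate is far above their background rate), the score is roughly (foreground% − background%) × (foreground% / background%); for Blades of Glory that works out to (0.2177 − 0.00265) × (0.2177 / 0.00265) ≈ 17.66, matching the score reported in the bucket above.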

If we look at the entire list, we see that they are all comedies that would make good recommendations (many of which also star Will Ferrell):

  1. Blades of Glory
  2. Anchorman: The Legend of Ron Burgundy
  3. Semi-Pro
  4. Knocked Up
  5. 40-Year-Old Virgin, The
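To verify these titles yourself, you can map the remaining bucket keys back through mlmovies with the same ids filter we used earlier:

GET mlmovies/_search
{
  "query": {
    "filtered": {
      "filter": {
        "ids": {
          "values": [52245,8641,58156,52973,35836]
        }
      }
    }
  }
}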

This is just one example of the power of significant_terms. Once you start using significant_terms, you find many situations where you don’t want the most popular; you want the most uncommonly common. This simple aggregation can uncover some surprisingly sophisticated trends in your data.