Practical Considerations | Elasticsearch: The Definitive Guide [2.x]

Original source: https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child-performance.html. Copyright belongs to www.elastic.co.

WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.

Practical Considerations

Parent-child joins can be a useful technique for managing relationships when index-time performance is more important than search-time performance, but they come at a significant cost: parent-child queries can be 5 to 10 times slower than the equivalent nested query.

Global Ordinals and Latency

Parent-child uses global ordinals to speed up joins. Regardless of whether the parent-child map uses an in-memory cache or on-disk doc values, global ordinals still need to be rebuilt after any change to the index.

The more parents in a shard, the longer global ordinals will take to build. Parent-child is best suited to situations where there are many children for each parent, rather than many parents and few children.

Global ordinals, by default, are built lazily: the first parent-child query or aggregation after a refresh will trigger building of global ordinals. This can introduce a significant latency spike for your users. You can use eager_global_ordinals to shift the cost of building global ordinals from query time to refresh time, by mapping the _parent field as follows:

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch",
        "fielddata": {
          "loading": "eager_global_ordinals" 
        }
      }
    }
  }
}

With the eager_global_ordinals setting in this mapping, global ordinals for the _parent field will be built before a new segment becomes visible to search.

With many parents, global ordinals can take several seconds to build. In this case, it makes sense to increase the refresh_interval so that refreshes happen less often and global ordinals remain valid for longer. This will greatly reduce the CPU cost of rebuilding global ordinals every second.
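As a minimal sketch, the refresh interval can be raised through the update index settings API. The index name is taken from the mapping above; the 30s value is only an illustrative assumption, not a recommendation, and should be chosen according to how stale your search results are allowed to be.

```json
PUT /company/_settings
{
  "refresh_interval": "30s"
}
```

The trade-off is that newly indexed documents can take up to that long to become visible to search.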

Multigenerations and Concluding Thoughts

The ability to join multiple generations (see Grandparents and Grandchildren) sounds attractive until you think of the costs involved:

The more joins you have, the worse performance will be.
Each generation of parents needs to have its string _id fields stored in memory, which can consume a lot of RAM.

As you consider your relationship schemes and whether parent-child is right for you, consider this advice about parent-child relationships:

Use parent-child relationships sparingly, and only when there are many more children than parents.
Avoid using multiple parent-child joins in a single query.
Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.
Keep the parent IDs short, so that they compress better in doc values, and use less memory when transiently loaded.
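The advice on avoiding scoring could look like the following sketch. The index and type names come from the mapping shown earlier; the hobby field and its value are hypothetical and stand in for whatever child field you need to match on.

```json
GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "score_mode": "none",
      "query": {
        "match": {
          "hobby": "hiking"
        }
      }
    }
  }
}
```

With score_mode set to none, matching branches are returned without the scores of their child documents being aggregated into the parent score.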

Above all: think about the other relationship techniques that we have discussed before reaching for parent-child.
