Join data type

Original English version: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/parent-join.html. Copyright of the original document belongs to www.elastic.co.

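Important: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.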

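The join data type is a special field that creates parent/child relations within documents of the same index. The relations section defines a set of possible relations within the documents, each relation being a parent name and a child name. A parent/child relation can be defined as follows: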

PUT /my_index
{
  "mappings": {
    "properties": {
      "my_id": {
        "type": "keyword"
      },
      "my_join_field": { 
        "type": "join",
        "relations": {
          "question": "answer" 
        }
      }
    }
  }
}

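	`my_join_field`: the name of the join field.
	`"question": "answer"`: defines a single relation where `question` is the parent of `answer`.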

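To index a document with a join, the name of the relation and, optionally, the parent of the document must be provided in the _source. For instance, the following example creates two parent documents in the question context: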

PUT /my_index/_doc/1?refresh
{
  "my_id": "1",
  "text": "This is a question",
  "my_join_field": {
    "name": "question" 
  }
}

PUT /my_index/_doc/2?refresh
{
  "my_id": "2",
  "text": "This is another question",
  "my_join_field": {
    "name": "question"
  }
}

`"name": "question"`: this document is a question document.

When indexing parent documents, you can specify just the name of the relation as a shortcut instead of wrapping it in the normal object notation:

PUT my_index/_doc/1?refresh
{
  "my_id": "1",
  "text": "This is a question",
  "my_join_field": "question" 
}

PUT my_index/_doc/2?refresh
{
  "my_id": "2",
  "text": "This is another question",
  "my_join_field": "question"
}

The simpler notation for a parent document uses just the relation name.
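
The two notations carry the same information for a parent document. As a minimal client-side sketch (the helper name `normalize_join_value` is hypothetical and not part of any Elasticsearch API), application code that reads `_source` could bring both forms to the object notation before inspecting them:

```python
def normalize_join_value(value):
    """Return a join field value in its full object notation."""
    if isinstance(value, str):
        # Shorthand form: a bare relation name, with no parent attached.
        return {"name": value}
    # Already in object notation; copy so the caller's dict is not mutated.
    return dict(value)


print(normalize_join_value("question"))
print(normalize_join_value({"name": "answer", "parent": "1"}))
```

This only reshapes values on the client; it does not change how Elasticsearch stores or indexes the field.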

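When indexing a child, the name of the relation as well as the parent id of the document must be added in the _source.

The whole lineage of a parent must be indexed on the same shard, so you must always route child documents using the id of their top-most ancestor (the "greater parent").

For instance, the following example shows how to index two child documents: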

PUT my_index/_doc/3?routing=1&refresh 
{
  "my_id": "3",
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer", 
    "parent": "1" 
  }
}

PUT my_index/_doc/4?routing=1&refresh
{
  "my_id": "4",
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}

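	`routing=1`: the routing value is mandatory because parent and child documents must be indexed on the same shard.
	`"name": "answer"`: `answer` is the name of the join for this document.
	`"parent": "1"`: the parent id of this child document.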
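
To illustrate why the routing value matters, here is a rough sketch that builds the request for a child document and simulates shard selection. The helper names, the shard count of 5, and the use of `zlib.crc32` are all assumptions for illustration: Elasticsearch uses its own hash function, and the sketch only relies on the fact that, by default, a document is routed by its `_id` unless a routing value is given.

```python
import zlib


def child_index_request(index, doc_id, relation, parent_id, routing):
    """Build the path and body for indexing a child document (hypothetical helper)."""
    path = f"{index}/_doc/{doc_id}?routing={routing}&refresh"
    body = {"my_join_field": {"name": relation, "parent": parent_id}}
    return path, body


def toy_shard(doc_id, routing=None, num_shards=5):
    """Pick a shard from the routing value, falling back to the document id.

    crc32 is a deterministic stand-in, NOT Elasticsearch's real routing hash.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_shards


path, body = child_index_request("my_index", "3", "answer", "1", routing="1")
print(path)
print(body)

# With routing=1 the child lands on the parent's shard; without it, it may not.
print(toy_shard("1") == toy_shard("3", routing="1"))
```

Because the shard depends only on the routing key, any child routed with the parent's id is placed with the parent, whatever hash is used.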

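Parent-join and performance

The join field shouldn't be used like joins in a relational database. In Elasticsearch, the key to good performance is to denormalize your data into documents. Each join field, has_child query, or has_parent query adds a significant cost to query performance.

The only case where the join field makes sense is when your data contains a one-to-many relationship in which one entity significantly outnumbers the other. An example is a use case with products and offers for these products. When offers significantly outnumber products, it makes sense to model the product as the parent document and the offer as the child document.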

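Parent-join restrictions

Only one join field mapping is allowed per index.
Parent and child documents must be indexed on the same shard. This means that the same routing value needs to be provided when getting, deleting, or updating a child document.
An element can have multiple children but only one parent.
It is possible to add a new relation to an existing join field.
It is also possible to add a child to an existing element, but only if the element is already a parent.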
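
The "multiple children but only one parent" rule can be checked on the client before a mapping is sent. The following is a minimal sketch under that assumption; `parent_of` is a hypothetical helper, not an Elasticsearch API, and it takes a `relations` object in the same shape as the mapping.

```python
def parent_of(relations):
    """Map each child relation name to its parent relation name.

    Raises ValueError if a relation name is listed under two parents.
    """
    parents = {}
    for parent, children in relations.items():
        # A single child may be given as a plain string instead of a list.
        if isinstance(children, str):
            children = [children]
        for child in children:
            if child in parents:
                raise ValueError(
                    f"{child!r} already has parent {parents[child]!r}, "
                    f"cannot also be a child of {parent!r}"
                )
            parents[child] = parent
    return parents


print(parent_of({"question": ["answer", "comment"]}))
```

A relation name that does not appear as a key in the result has no parent, which makes it a root of the relation tree.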

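Searching with parent-join

The parent-join creates one field to index the name of the relation within the document (my_parent, my_child, ...).

It also creates one field per parent/child relation. The name of this field is the name of the join field followed by # and the name of the parent in the relation. For instance, for the my_parent → [my_child, another_child] relation, the join field creates an additional field named my_join_field#my_parent.

This field contains the _id of the parent that the document links to if the document is a child (my_child or another_child), and the _id of the document itself if it is a parent (my_parent).

When searching an index that contains a join field, these two fields are always returned in the search response: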
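
The naming rule and the stored value can be mirrored in a few lines. This is only an illustrative sketch for a single-level relation; `join_id_field` and `join_id_value` are hypothetical helper names, and the real field is created and populated by Elasticsearch itself.

```python
def join_id_field(join_field, parent_relation):
    """Name of the extra field: join field name, '#', then the parent relation name."""
    return f"{join_field}#{parent_relation}"


def join_id_value(doc_id, join_value):
    """Value held by that field for a document in a single-level relation."""
    if isinstance(join_value, dict) and "parent" in join_value:
        # Child document: the field holds the _id of its parent.
        return join_value["parent"]
    # Parent document: the field holds the document's own _id.
    return doc_id


print(join_id_field("my_join_field", "my_parent"))
print(join_id_value("1", "my_parent"))
print(join_id_value("3", {"name": "my_child", "parent": "1"}))
```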

GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": ["my_id"]
}

Will return:

{
    ...,
    "hits": {
        "total" : {
            "value": 4,
            "relation": "eq"
        },
        "max_score": null,
        "hits": [
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "1",
                "_score": null,
                "_source": {
                    "my_id": "1",
                    "text": "This is a question",
                    "my_join_field": "question" 
                },
                "sort": [
                    "1"
                ]
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "2",
                "_score": null,
                "_source": {
                    "my_id": "2",
                    "text": "This is another question",
                    "my_join_field": "question" 
                },
                "sort": [
                    "2"
                ]
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "3",
                "_score": null,
                "_routing": "1",
                "_source": {
                    "my_id": "3",
                    "text": "This is an answer",
                    "my_join_field": {
                        "name": "answer", 
                        "parent": "1"  
                    }
                },
                "sort": [
                    "3"
                ]
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "4",
                "_score": null,
                "_routing": "1",
                "_source": {
                    "my_id": "4",
                    "text": "This is another answer",
                    "my_join_field": {
                        "name": "answer",
                        "parent": "1"
                    }
                },
                "sort": [
                    "4"
                ]
            }
        ]
    }
}

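	Hit `1`: this document belongs to the `question` join.
	Hit `2`: this document belongs to the `question` join.
	Hit `3`: this document belongs to the `answer` join.
	Hit `3`, `"parent": "1"`: the parent id linked to this child document.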
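
As a small sketch of how a client might consume such a response, the snippet below groups child hits under their parent id. The `hits` list is a trimmed copy of the response above, and `children_by_parent` is a hypothetical helper that assumes the join field is named `my_join_field`.

```python
from collections import defaultdict


def children_by_parent(hits, join_field="my_join_field"):
    """Group the _id of each child hit under the _id of its parent."""
    grouped = defaultdict(list)
    for hit in hits:
        join = hit["_source"][join_field]
        # Parent hits use either the bare-name shorthand or an object without "parent".
        if isinstance(join, dict) and "parent" in join:
            grouped[join["parent"]].append(hit["_id"])
    return dict(grouped)


hits = [
    {"_id": "1", "_source": {"my_join_field": "question"}},
    {"_id": "2", "_source": {"my_join_field": "question"}},
    {"_id": "3", "_source": {"my_join_field": {"name": "answer", "parent": "1"}}},
    {"_id": "4", "_source": {"my_join_field": {"name": "answer", "parent": "1"}}},
]
print(children_by_parent(hits))  # {'1': ['3', '4']}
```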

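Parent-join queries and aggregations

See the has_child and has_parent queries, the children aggregation, and inner hits for more information.

The value of the join field is accessible in aggregations and scripts, and can be queried with the parent_id query: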

GET my_index/_search
{
  "query": {
    "parent_id": { 
      "type": "answer",
      "id": "1"
    }
  },
  "aggs": {
    "parents": {
      "terms": {
        "field": "my_join_field#question", 
        "size": 10
      }
    }
  },
  "script_fields": {
    "parent": {
      "script": {
         "source": "doc['my_join_field#question']" 
      }
    }
  }
}

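	`parent_id` query: queries the parent id field (see also the `has_parent` query and the `has_child` query).
	`terms` aggregation on `my_join_field#question`: aggregates on the parent id field (see also the `children` aggregation).
	`doc['my_join_field#question']`: accesses the parent id field in a script.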

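Global ordinals

The join field uses global ordinals to speed up joins. Global ordinals need to be rebuilt after any change to a shard. The more parent id values are stored in a shard, the longer it takes to rebuild the global ordinals for the join field.

Global ordinals are built eagerly by default: if the index has changed, global ordinals for the join field are rebuilt as part of the refresh. This can add significant time to the refresh. However, most of the time this is the right trade-off; otherwise global ordinals are rebuilt when the first parent-join query or aggregation is used, which can introduce a significant latency spike for your users. That is usually worse, because when many writes are occurring, several rebuilds of the join field's global ordinals may be attempted within a single refresh interval.

When the join field is used infrequently and writes occur frequently, it may make sense to disable eager loading: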

PUT my_index
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
           "question": "answer"
        },
        "eager_global_ordinals": false
      }
    }
  }
}

The amount of heap used by global ordinals can be checked per parent relation as follows:

# Per-index
GET _stats/fielddata?human&fields=my_join_field#question

# Per-node per-index
GET _nodes/stats/indices/fielddata?human&fields=my_join_field#question
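
When these requests are sent from a generic HTTP client rather than a console, the `#` in the field name would normally be read as the start of a URL fragment, so it likely needs to be percent-encoded. The sketch below builds the encoded paths with Python's standard library; `fielddata_stats_path` is a hypothetical helper name.

```python
from urllib.parse import quote


def fielddata_stats_path(join_field, parent_relation, per_node=False):
    """Build the stats path with the '#' in the field name percent-encoded."""
    field = quote(f"{join_field}#{parent_relation}", safe="")
    prefix = "_nodes/stats/indices/fielddata" if per_node else "_stats/fielddata"
    return f"{prefix}?human&fields={field}"


print(fielddata_stats_path("my_join_field", "question"))
# _stats/fielddata?human&fields=my_join_field%23question
print(fielddata_stats_path("my_join_field", "question", per_node=True))
# _nodes/stats/indices/fielddata?human&fields=my_join_field%23question
```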

Multiple children per parent

It is also possible to define multiple children for a single parent:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "question": ["answer", "comment"]  
        }
      }
    }
  }
}

`"question": ["answer", "comment"]`: `question` is the parent of `answer` and `comment`.

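Multiple levels of parent join

Using multiple levels of relations to replicate a relational model is not recommended. Each level of relation adds overhead at query time in terms of memory and computation. You should denormalize your data if you care about performance.

Multiple levels of parent/child: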

PUT my_index
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "question": ["answer", "comment"],  
          "answer": "vote" 
        }
      }
    }
  }
}

	`"question": ["answer", "comment"]`: `question` is the parent of `answer` and `comment`.
	`"answer": "vote"`: `answer` is the parent of `vote`.

The mapping above represents the following tree:

   question
    /    \
   /      \
comment  answer
           |
           |
          vote

Indexing a grandchild document requires a routing value equal to the id of the grandparent (the top-most ancestor of the lineage):

PUT my_index/_doc/3?routing=1&refresh 
{
  "text": "This is a vote",
  "my_join_field": {
    "name": "vote",
    "parent": "2" 
  }
}

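	`routing=1`: this child document must be on the same shard as its grandparent and parent.
	`"parent": "2"`: the parent id of this document (must point to an `answer` document).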
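
The routing value for a document at any depth can be found by walking up the parent links until a document with no parent is reached. The sketch below assumes the application keeps its own map from document id to parent id; `routing_for` is a hypothetical helper, and the ids assume that answer `2` is a child of question `1`, which the example above implies but does not show.

```python
def routing_for(doc_id, parent_ids):
    """Return the id of the top-most ancestor, to be used as the routing value."""
    seen = set()
    current = doc_id
    while current in parent_ids:
        if current in seen:
            # Guard against malformed data where parent links form a loop.
            raise ValueError(f"cycle detected in parent links at {current!r}")
        seen.add(current)
        current = parent_ids[current]
    return current


# Assumed lineage: vote 3 -> answer 2 -> question 1
parent_ids = {"2": "1", "3": "2"}
print(routing_for("3", parent_ids))  # 1
print(routing_for("1", parent_ids))  # 1
```

A document without a parent resolves to its own id, which matches the default routing of parent documents.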
