Elasticsearch document id 生成方式

手动指定

根据应用情况来说，是否满足手动指定 document id 的前提：

一般来说，是从某些其他的系统中，导入一些数据到es时，会采取这种方式，就是使用系统中已有数据的唯一标识，作为es中document的id。举个例子，比如说，我们现在在开发一个电商网站，做搜索功能，或者是OA系统，做员工检索功能。这个时候，数据首先会在网站系统或者IT系统内部的数据库中，会先有一份，此时就肯定会有一个数据库的primary key（自增长，UUID，或者是业务编号）。如果将数据导入到 Elasticsearch 中，此时就比较适合采用数据在数据库中已有的primary key。

如果说，我们是在做一个系统，这个系统主要的数据存储就是 Elasticsearch 一种，也就是说，数据产生出来以后，可能就没有id，直接就放es一个存储，那么这个时候，可能就不太适合说手动指定document id的形式了，因为你也不知道id应该是什么，此时可以采取下面要讲解的让 Elasticsearch 自动生成id的方式。

# put /index/type/id

PUT /test_index/test_type/2
{
  "test_content": "my test"
}

{
  "_index" : "test_index",
  "_type" : "test_type",
  "_id" : "2",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

自动生成


# post /index/type

PUT test_index/test_type
{
  "test_content": "my test automated document id"
}

{
  "error" : "Incorrect HTTP method for uri [/test_index/test_type?pretty=true] and method [PUT], allowed: [POST]",
  "status" : 405
}

POST /test_index/test_type
{
  "test_content": "my test"
}

{
  "_index" : "test_index",
  "_type" : "test_type",
  "_id" : "A7Ma5XYB_s8SuYmy2Xg0",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}
# post /index/type

PUT test_index/test_type
{
  "test_content": "my test automated document id"
}

{
  "error" : "Incorrect HTTP method for uri [/test_index/test_type?pretty=true] and method [PUT], allowed: [POST]",
  "status" : 405
}

POST /test_index/test_type
{
  "test_content": "my test"
}

{
  "_index" : "test_index",
  "_type" : "test_type",
  "_id" : "A7Ma5XYB_s8SuYmy2Xg0",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}

有可能两个创建 Document 的请求是完全在同一时间执行的（小概率事件），只不过在不同的 Elastic 节点上，那么，如果 _id 自动生成的算法不够好的话，有没有可能出现两个节点，给两个不同的 Document 创建了相同的 _id ?

当然是不可能的。

GUID 算法可以保证在分布式的环境下，不同节点同一时间创建的 _id 一定是不冲突的（即使是同一个节点，也不会有任何的问题）。

Elasticsearch 自动生成 _id 的机制，可以保证不会出现两个不同的 Document 的 _id 是一样的。

注意，自动生成 ID 的时候，使用的是 POST 而不是 PUT；手动生成 ID 的时候使用 PUT 或者 POST 都可以。

另外，这一节的实际操作，我是在 cloud.elastic.co 提供的虚拟机上进行的。其实在准备认证期间，我觉得可以考虑购买两个月左右的服务；也可以考虑在阿里云上购买。

自动生成的id，长度为20个字符，URL安全，base64编码，GUID，分布式系统并行生成时不可能会发生冲突。

GUID

以下文字来自维基百科

> A universally unique identifier (UUID) is a 128-bit number used to identify information in computer systems. The term globally unique identifier (GUID) is also used, typically in software created by Microsoft.

file

而 CODING HORROR 说

> Each globally unique ID is like a beautiful snowflake: every one a unique item waiting to be born.

> GUID Pros
>
>
>
> * Unique across every table, every database, every server
>
> * Allows easy merging of records from different databases
>
> * Allows easy distribution of databases across multiple servers
>
> * You can generate IDs anywhere, instead of having to roundtrip to the database
>
> * Most replication scenarios require GUID columns anyway
>
>
>
> GUID Cons
>
>
>
> * It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you’re not careful
>
> * Cumbersome to debug where userid=’{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}’
>
> * The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

就我个人来说，我是不太喜欢 GUID 的。

本文转自网络，文章版权归原作者所有。暂未找到原作者，如有知晓请留言。