Postgres分区性能：哈希 vs 列表-原创手记-慕课网

在我们的设计中，我们设计了一个计划中的一个大型PostgreSQL表，用于存储传入的Kafka事件的ID，目的是为了去重。这些ID的类型为UUID（通用唯一识别码）。

创建表 t，其中 id 是 UUID 主键。

经过一番思考，我们决定将这个表分成16个部分。问题是这样的：我们应该使用哈希分区还是列表分区？哪种更有效？

👉 最初的想法是哈希函数会比简单的模数运算慢。因此，列表分区应该比哈希分区更高效。但这种猜测能否通过某些基准测试得到验证？

> 在我们继续之前，让我们先明确一点，本文讨论的是一个非常特定的案例和工作负载。然而，它应该为你提供足够的信息，以便你可以根据你自己的需求进行调整。

哈希分片

最初的设想是使用哈希分片

    drop table if exists t;  

    create table t(id uuid primary key)  
        分区依据 hash(id);  

    create table t_00  
        是分区，值为16的模数且余数为0;
    create table t_01  
        是分区，值为16的模数且余数为1;
    create table t_02  
        是分区，值为16的模数且余数为2;
    create table t_03  
        是分区，值为16的模数且余数为3;
    create table t_04  
        是分区，值为16的模数且余数为4;
    create table t_05  
        是分区，值为16的模数且余数为5;
    create table t_06  
        是分区，值为16的模数且余数为6;
    create table t_07  
        是分区，值为16的模数且余数为7;
    create table t_08  
        是分区，值为16的模数且余数为8;
    create table t_09  
        是分区，值为16的模数且余数为9;
    create table t_10  
        是分区，值为16的模数且余数为10;
    create table t_11  
        是分区，值为16的模数且余数为11;
    create table t_12  
        是分区，值为16的模数且余数为12;
    create table t_13  
        是分区，值为16的模数且余数为13;
    create table t_14  
        是分区，值为16的模数且余数为14;
    create table t_15  
        是分区，值为16的模数且余数为15;

另一种想法是放弃使用 hash，转而使用 id 的最后一位作为分区键。接下来我们会讨论这个想法，但在进行讨论之前，我们先来测试一下基于哈希的分区方法。

time \  
pgbench -c10 -t 900 -j30 -n -f - << EOF  
将数据插入表t中  
生成gen_random_uuid()  
从generate_series生成1到1000的序列;  
EOF

我选择了建立10个数据库连接来，并使用30个并发线程。每个连接发送900个查询到数据库。每个查询会插入1000个UUID到表中。

只是好玩一下。只有符合真实流量才有意义。但让我们看看在我的笔记本电脑上会怎样：一台旧的2017年的笔记本电脑！

    pgbench (16.2 (Ubuntu 16.2-1ubuntu4))  
    事务种类: -  
    缩放因子: 1  
    查询模式: 简单  
    客户端数量: 10  
    线程数量: 10  
    最大尝试次数: 1（即只尝试一次）  
    每个客户端处理的事务数: 900  
    实际完成的事务数: 9000/9000  
    失败的事务数（共0个）: 0 (0.000%)  
    平均延迟 = 13.776 ms  
    初始连接时间 = 6.491 ms  
    tps = 725.901931 (不考虑初始连接时间)

    实际运行时间 12.438s  
    用户CPU时间 0.162s  
    系统CPU时间 0.396s

插入900万行数据耗时12.4秒。平均TPS为725.9次/秒。

使用 psql 元命令（metacommands），我们就可以看到表的/索引的大小。

使用 \dt+ 命令查看表的大小（省略了一些列具体内容）：

    ┌──────┬───────────────────┬─────────┐  
    │ 名称 │ 分区表             │   大小  │  
    ├──────┼───────────────────┼─────────┤  
    │ t    │ 分区表            │ 0 字节 (0B) │  
    │ t_00 │ 表                │ 24 MB   │  
    │ t_01 │ 表                │ 24 MB   │  
    │ t_02 │ 表                │ 24 MB   │  
    │ t_03 │ 表                │ 24 MB   │  
    │ t_04 │ 表                │ 24 MB   │  
    │ t_05 │ 表                │ 24 MB   │  
    │ t_06 │ 表                │ 24 MB   │  
    │ t_07 │ 表                │ 24 MB   │  
    │ t_08 │ 表                │ 24 MB   │  
    │ t_09 │ 表                │ 24 MB   │  
    │ t_10 │ 表                │ 24 MB   │  
    │ t_11 │ 表                │ 24 MB   │  
    │ t_12 │ 表                │ 24 MB   │  
    │ t_13 │ 表                │ 24 MB   │  
    │ t_14 │ 表                │ 24 MB   │  
    │ t_15 │ 表                │ 24 MB   │  
    └──────┴───────────────────┴─────────┘

使用 \di+ 查看索引的大小。省略了一些内容。

    ┌───────────┬───────────────────┬─────────┐  
    │   名称    │       类型        │  大小   │  
    ├───────────┼───────────────────┼─────────┤  
    │ t_pkey    │ 分区索引（分区索引表） │ 0 MB   │  
    │ t_00_pkey │ 索引             │ 21 MB   │  
    │ t_01_pkey │ 索引             │ 21 MB   │  
    │ t_02_pkey │ 索引             │ 22 MB   │  
    │ t_03_pkey │ 索引             │ 20 MB   │  
    │ t_04_pkey │ 索引             │ 21 MB   │  
    │ t_05_pkey │ 索引             │ 21 MB   │  
    │ t_06_pkey │ 索引             │ 21 MB   │  
    │ t_07_pkey │ 索引             │ 20 MB   │  
    │ t_08_pkey │ 索引             │ 20 MB   │  
    │ t_09_pkey │ 索引             │ 21 MB   │  
    │ t_10_pkey │ 索引             │ 21 MB   │  
    │ t_11_pkey │ 索引             │ 21 MB   │  
    │ t_12_pkey │ 索引             │ 21 MB   │  
    │ t_13_pkey │ 索引             │ 21 MB   │  
    │ t_14_pkey │ 索引             │ 21 MB   │  
    │ t_15_pkey │ 索引             │ 21 MB   │  
    └───────────┴───────────────────┴─────────┘

索引的大小几乎与表本身相当。此外，虽然每个分区的数据量都是 24 MB，但索引的大小却在 20 MB 到 22 MB 之间。索引的总大小是 334 MB。

列表分割

如果我们想用 id 的最后一位数字作为分区键，主键就不能加到父表上：

创建表t，其中id是主键且为uuid类型；按(id转换为文本，取第一个字符并转为小写)进行列表分区。

这导致了错误。

    错误：不支持带有分区键定义的主键约束  
    详细信息：当分区键包含表达式时，主键约束将无法使用。

所以我们决定这么做，在每个单独的分区上为主键添加（这实际上保证了所有数据的唯一性）。

    drop table if exists t;  

    create table t(id uuid not null)  
    partition by list (lower(left(id::text, 1))); -- 分区表t，基于id的首字母进行列表分区

    create table t_00  
        partition of t  
        (id primary key)  
        for values in ('0');  
    create table t_01  
        partition of t  
        (id primary key)   
        for values in ('1');  
    create table t_02  
        partition of t  
        (id primary key)   
        for values in ('2');  
    create table t_03  
        partition of t  
        (id primary key)   
        for values in ('3');  
    create table t_04  
        partition of t  
        (id primary key)   
        for values in ('4');  
    create table t_05  
        partition of t  
        (id primary key)   
        for values in ('5');  
    create table t_06  
        partition of t  
        (id primary key)   
        for values in ('6');  
    create table t_07  
        partition of t  
        (id primary key)   
        for values in ('7');  
    create table t_08  
        partition of t  
        (id primary key)   
        for values in ('8');  
    create table t_09  
        partition of t  
        (id primary key)   
        for values in ('9');  
    create table t_10  
        partition of t  
        (id primary key)   
        for values in ('a');  
    create table t_11  
        partition of t  
        (id primary key)   
        for values in ('b');  
    create table t_12  
        partition of t  
        (id primary key)   
        for values in ('c');  
    create table t_13  
        partition of t  
        (id primary key)   
        for values in ('d');  
    create table t_14  
        partition of t  
        (id primary key)   
        for values in ('e');  
    create table t_15  
        partition of t  
        (id primary key)   
        for values in ('f');

现在再来一次基准测试：

time \  
# 为数据库插入操作计时
pgbench -c10 -t 900 -j30 -n -f - << 结束  
# pgbench是一个用于测试PostgreSQL性能的工具，这里用来模拟插入操作
insert into t  
select 生成随机UUID()  
from 生成序列(1, 1000);  
结束
# 生成序列(1, 1000)生成从1到1000的序列

生成随机UUID()是生成随机UUID的函数

结果:

    pgbench (16.2 (Ubuntu 16.2-1ubuntu4))  
    事务类型: 无  
    缩放因子: 1  
    查询模式: 简单  
    客户端数量: 10  
    线程数量: 10  
    最大尝试次数: 仅一次  
    每个客户端的事务数: 900  
    实际完成的事务数: 9000/9000  
    失败的事务数: 0 (0.000%)  
    平均延迟时间: 15.123 ms  
    初始连接时间时间: 8.810 ms  
    tps = 661.264382 (不包括初始连接时间)

    实际运行时间 0分13.654秒  
    用户占用时间 0分0.150秒  
    系统占用时间 0分0.409秒

这比哈希分区要慢一些：

持续时间: 13.654s 而不是 12.438s
TPS: 661.264382, 而不是 725.901931

所以，我们失去了主键，而且变慢了！哈希分区在这里显然更胜一筹。

使用 \dt+ 和 \di+ 会得到几乎相同的效果，所以就不再重复了。

使用哈希索引（hash索引）而不是B树索引

另一种建议的方式是使用哈希索引而非B树索引来确保唯一性。好处是，在这种情况下，当唯一重要的操作是等值检查时，它们通常比B树索引更小、更快。

Postgres 的主键目前还不支持这一点，但可以使用一种变通手段，通过应用基于哈希的非等式约束。

    drop table if exists t;  

    -- 创建名为t的表，包含一个非空的uuid类型id字段，并根据id字段的左部分进行分区
    create table t(id uuid not null)  
    partition by list (left(id::text, 1));  

    -- 创建名为t_00的子表，该子表是t的分区，用于存放id左部分为'0'的记录，并排除哈希冲突
    create table t_00  
        partition of t  
        (exclude using hash(id with =))  
        for values in ('0');  
    -- 创建名为t_01的子表，该子表是t的分区，用于存放id左部分为'1'的记录，并排除哈希冲突
    create table t_01  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('1');  
    -- 创建名为t_02的子表，该子表是t的分区，用于存放id左部分为'2'的记录，并排除哈希冲突
    create table t_02  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('2');  
    -- 创建名为t_03的子表，该子表是t的分区，用于存放id左部分为'3'的记录，并排除哈希冲突
    create table t_03  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('3');  
    -- 创建名为t_04的子表，该子表是t的分区，用于存放id左部分为'4'的记录，并排除哈希冲突
    create table t_04  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('4');  
    -- 创建名为t_05的子表，该子表是t的分区，用于存放id左部分为'5'的记录，并排除哈希冲突
    create table t_05  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('5');  
    -- 创建名为t_06的子表，该子表是t的分区，用于存放id左部分为'6'的记录，并排除哈希冲突
    create table t_06  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('6');  
    -- 创建名为t_07的子表，该子表是t的分区，用于存放id左部分为'7'的记录，并排除哈希冲突
    create table t_07  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('7');  
    -- 创建名为t_08的子表，该子表是t的分区，用于存放id左部分为'8'的记录，并排除哈希冲突
    create table t_08  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('8');  
    -- 创建名为t_09的子表，该子表是t的分区，用于存放id左部分为'9'的记录，并排除哈希冲突
    create table t_09  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('9');  
    -- 创建名为t_10的子表，该子表是t的分区，用于存放id左部分为'a'的记录，并排除哈希冲突
    create table t_10  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('a');  
    -- 创建名为t_11的子表，该子表是t的分区，用于存放id左部分为'b'的记录，并排除哈希冲突
    create table t_11  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('b');  
    -- 创建名为t_12的子表，该子表是t的分区，用于存放id左部分为'c'的记录，并排除哈希冲突
    create table t_12  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('c');  
    -- 创建名为t_13的子表，该子表是t的分区，用于存放id左部分为'd'的记录，并排除哈希冲突
    create table t_13  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('d');  
    -- 创建名为t_14的子表，该子表是t的分区，用于存放id左部分为'e'的记录，并排除哈希冲突
    create table t_14  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('e');  
    -- 创建名为t_15的子表，该子表是t的分区，用于存放id左部分为'f'的记录，并排除哈希冲突
    create table t_15  
        partition of t  
        (exclude using hash(id with =))   
        for values in ('f');

我们也要来跑个分测试一下：

    time \  
    pgbench -c10 -t 900 -j30 -n -f - << EOF  
    INSERT INTO t  
    SELECT gen_random_uuid()  
    FROM generate_series(1, 1000);  
    EOF

结果是：

    pgbench (16.2 (Ubuntu 16.2-1ubuntu4))  
    事务种类: -  
    放大因子: 1  
    查询模式: 简单  
    客户端数量: 10  
    线程数量: 10  
    最大尝试次数: 1  
    每个客户端的事务数: 900  
    实际处理的事务数: 9000/9000  
    失败的事务数: 0 (0.000%)  
    平均延迟 = 16.686 毫秒  
    初始连接耗时 = 7.089 毫秒  
    tps = 599.314265 (不含初始连接耗时)

    实际耗时：0m15.067s  
    用户耗时：0m0.127s  
    系统耗时：0m0.468s

嗯，我没想到会这样子。现在居然更慢了。看表的大小（\dt+），还是24MB。

不过，索引的大小（\di+）稍微小一点。

    ┌──────────────┬───────┐  
    │     名字     │ 体积  │  
    ├──────────────┼───────┤  
    │ t_00_id_excl │ 20 MB │  
    │ t_01_id_excl │ 20 MB │  
    │ t_02_id_excl │ 20 MB │  
    │ t_03_id_excl │ 20 MB │  
    │ t_04_id_excl │ 20 MB │  
    │ t_05_id_excl │ 20 MB │  
    │ t_06_id_excl │ 20 MB │  
    │ t_07_id_excl │ 20 MB │  
    │ t_08_id_excl │ 20 MB │  
    │ t_09_id_excl │ 20 MB │  
    │ t_10_id_excl │ 20 MB │  
    │ t_11_id_excl │ 20 MB │  
    │ t_12_id_excl │ 20 MB │  
    │ t_13_id_excl │ 20 MB │  
    │ t_14_id_excl │ 20 MB │  
    │ t_15_id_excl │ 20 MB │  
    └──────────────┴───────┘

因此，索引大小从334 MB缩减到了320 MB。

摘要

在该示例中，哈希分区表现更佳（不过需要注意的是，这并不总是成立）
哈希分区的一个好处是，在此示例中每个表都有主键（这对于使用逻辑复制来说相当重要）
使用哈希索引替代B树索引并未提升性能，但索引大小减少了不到5%

终章

我的一位不愿透露姓名的同事解释说，为什么列表分割会变慢。

在计算基于列表的方法的分区键时，你使用了一个类型转换（将UUID转换为text），然后应用了两个函数（LEFT和LOWER）。这些函数运行速度较快，但是类型转换比较慢。这就是为什么整体效果比hash函数慢，hash函数是用C语言编写的，运行速度非常快。

另一位同事汤姆也做了个很好的总结如下：

所以，如果我没理解错的话，意思是“别耍花样，只需要用简单的方式去做，PostgreSQL 会自动优化处理。”

这让我想起了一个故事，让我想起当我们实现了一个strlen()函数的变体时，发现它比GLIBC库中的函数慢300倍！我也应该写一篇关于这件事的文章 🙂

更多阅读：

想了解更多关于哈希索引的信息，可以查看this Twitter thread以及里面的链接。

Erwin Brandstetter 在 pgsql-general 邮件列表中的帖子有关 EXCLUDE USING hash(i WITH =)。
这篇文章再次介绍了 PostgreSQL 中的哈希索引：重新介绍 PostgreSQL 中的哈希索引。

_(文章已经发布到dev.to，您可以在这里查看：这里 _)