Getting the CASIA Offline and Online Handwritten Chinese Character Databases with Python (Part 3)

For how to use the features of this dataset, see my earlier blog posts.

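The offline and online handwritten single-character images are encoded in the .gnt and .pot formats, respectively. Let's first see what the offline handwritten characters look like.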

1.4.1 Parsing the offline handwritten single-character images

First, collect the archive files that hold the offline handwritten single-character images:

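import os
# root is the dataset directory defined earlier (here 'E:/OCR/CASIA/data')
image_s = [f'{root}/{name}' for name in os.listdir(root) if 'gnt' in name] # source archives of the images
image_s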

Output:

['E:/OCR/CASIA/data/HWDB1.0trn_gnt.zip',
 'E:/OCR/CASIA/data/HWDB1.0tst_gnt.zip',
 'E:/OCR/CASIA/data/HWDB1.1trn_gnt.zip',
 'E:/OCR/CASIA/data/HWDB1.1tst_gnt.zip']

First, define a gnt decoder:

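import struct
import zipfile

import numpy as np


class GNT:
    """Decoder for a GNT file stored inside a zip archive."""

    def __init__(self, Z, set_name):
        self.Z = Z  # an open zipfile.ZipFile
        self.set_name = set_name  # name of the .gnt member in the archive

    def __iter__(self):
        with self.Z.open(self.set_name) as fp:
            while True:
                head = fp.read(4)  # 4-byte record size
                if not head:  # reached the end of the file
                    break  # stop as soon as the file is exhausted
                head = struct.unpack('<I', head)[0]  # little-endian unsigned int
                tag_code = fp.read(2).decode('gb2312-80')  # 2-byte GB2312 label
                width, height = struct.unpack('<2H', fp.read(4))
                bitmap = np.frombuffer(fp.read(width*height), np.uint8)
                img = bitmap.reshape((height, width))
                yield img, tag_code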
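To make the record layout that the decoder reads more explicit, here is a minimal sketch that packs one synthetic record and decodes it again. It assumes little-endian fields (a 4-byte record size, a 2-byte GB2312 label, a 2-byte width, a 2-byte height, then width*height grayscale bytes) and that the size field counts the whole record including its 10-byte header. The helper names pack_record and unpack_record are hypothetical and exist only for this illustration.

```python
import struct


def pack_record(char, width, height, pixels):
    """Build one synthetic GNT-style record (hypothetical helper)."""
    tag = char.encode('gb2312')  # 2-byte label
    assert len(tag) == 2 and len(pixels) == width * height
    size = 4 + 2 + 4 + width * height  # assumed: total record length in bytes
    return (struct.pack('<I', size) + tag
            + struct.pack('<2H', width, height) + bytes(pixels))


def unpack_record(buf):
    """Decode one record into (size, rows, label), mirroring the decoder's reads."""
    size, = struct.unpack('<I', buf[:4])
    label = buf[4:6].decode('gb2312')
    width, height = struct.unpack('<2H', buf[6:10])
    flat = buf[10:10 + width * height]
    # Split the flat byte string into `height` rows of `width` pixels each
    rows = [list(flat[r * width:(r + 1) * width]) for r in range(height)]
    return size, rows, label


record = pack_record('啊', 3, 2, [0, 255, 0, 255, 0, 255])
size, rows, label = unpack_record(record)
print(size, label, rows)
```

Note that the bitmap is stored row by row, which is why the decoder reshapes it to (height, width) rather than (width, height).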

We use the HWDB1.0trn_gnt.zip subset as an example to show how GNT is used:

Z = zipfile.ZipFile(f'{root}/HWDB1.0trn_gnt.zip')
Z.namelist()

Output:

['1.0train-gb1.gnt']

The output shows that HWDB1.0trn_gnt.zip contains a single file, '1.0train-gb1.gnt', so we pass it straight to the GNT class:

set_name = '1.0train-gb1.gnt'
gnt = GNT(Z, set_name)
for imgs, labels in gnt: # look at just one character
    break
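The loop above stops after the first sample. Because the decoder's __iter__ is a generator, only the samples that are actually consumed get decoded, so the same idea extends to the first few samples with itertools.islice. The sketch below uses a stand-in generator, fake_samples (a hypothetical name), in place of the real GNT object so that it runs without the dataset.

```python
from itertools import islice


def fake_samples():
    """Stand-in for GNT(Z, set_name): lazily yields (image, label) pairs."""
    for i, label in enumerate('一二三四五'):
        yield [[i]], label  # a 1x1 "image", for illustration only


# Take the first three samples; the remaining ones are never produced
first_three = list(islice(fake_samples(), 3))
first_labels = [label for _, label in first_three]
print(first_labels)
```

With the real data, islice(gnt, 3) would play the same role.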

To make this easier to see, import a visualization package:

%matplotlib inline
from matplotlib import pyplot as plt

Now the image can be displayed:

plt.imshow(imgs)
plt.title(labels)
plt.show()

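Output:

[Screenshot of the plot output]

As you can see, an error is reported about a missing font. This happens because matplotlib's default settings do not support Chinese characters. To enable them, set the following: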

plt.rcParams['font.sans-serif'] = ['SimHei']  # so that Chinese labels render correctly
plt.rcParams['axes.unicode_minus'] = False  # so that the minus sign renders correctly
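The setting above only helps if a font named SimHei is installed, which may not be the case on every system. As a hedged sketch, one could pick the first installed font from a list of candidates. The function names and the candidate list below are assumptions made for illustration, not part of the original post; the lookup relies on matplotlib's font_manager.fontManager.ttflist.

```python
def first_available(candidates, installed):
    """Return the first candidate font name that is installed, else None."""
    installed = set(installed)
    for name in candidates:
        if name in installed:
            return name
    return None


def configure_cjk_font(candidates=('SimHei', 'Microsoft YaHei', 'Noto Sans CJK SC')):
    """Point matplotlib at the first installed candidate font, if any."""
    # Imported lazily so the pure helper above works without matplotlib
    from matplotlib import font_manager, pyplot as plt
    installed = [f.name for f in font_manager.fontManager.ttflist]
    name = first_available(candidates, installed)
    if name is not None:
        plt.rcParams['font.sans-serif'] = [name]
        plt.rcParams['axes.unicode_minus'] = False
    return name


chosen = first_available(['SimHei', 'Noto Sans CJK SC'],
                         ['DejaVu Sans', 'Noto Sans CJK SC'])
print(chosen)
```

Calling configure_cjk_font() returns the chosen font name, or None when no candidate is installed, in which case the Chinese title would still fail to render.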

Now the plot displays correctly:

plt.imshow(imgs)
plt.title(labels);

Screenshot:

[Screenshot of the plot with the Chinese title rendered]

How many characters does '1.0train-gb1.gnt' contain in total?

labels = np.asanyarray([l for _, l in gnt])
labels.shape[0]

Output:

1246991

So '1.0train-gb1.gnt' contains $1246991$ characters in total, which agrees with the information given on the official website.
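Counting this way decodes every bitmap into an array even though only the labels are needed. A lighter alternative, sketched below under the same assumed record layout (10-byte little-endian header followed by width*height bytes), reads each header and discards the bitmap bytes without decoding them. The names count_labels and make_record are hypothetical, and the stream here is synthetic so the example runs without the dataset.

```python
import io
import struct
from collections import Counter


def count_labels(fp):
    """Count labels in a GNT-style stream, skipping the bitmap bytes."""
    counts = Counter()
    while True:
        header = fp.read(10)  # size (4) + label (2) + width (2) + height (2)
        if len(header) < 10:  # end of stream
            break
        size, tag, width, height = struct.unpack('<I2s2H', header)
        fp.read(width * height)  # discard the bitmap without decoding it
        counts[tag.decode('gb2312')] += 1
    return counts


def make_record(char, width, height):
    """Build a synthetic record with an all-zero bitmap (illustration only)."""
    body = bytes(width * height)
    return struct.pack('<I2s2H', 10 + len(body),
                       char.encode('gb2312'), width, height) + body


stream = io.BytesIO(make_record('啊', 2, 2) + make_record('阿', 3, 1)
                    + make_record('啊', 1, 1))
counts = count_labels(stream)
print(sum(counts.values()), counts['啊'])
```

With the real archive, the file object returned by Z.open(set_name) could be passed as fp. The resulting Counter also gives the number of samples per character class, not just the total.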

For parsing the online single-character images, please see my GitHub: https://xinetzone.github.io/loader/. I have also made a pre-release version: CASIA-HWDB & CASIA-OLHWDB.
