For more details, see my blog post: "Converting a Bunch to an HDF5 File: Efficient Storage for Cifar and Other Datasets".

Feature

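I start by converting the Bunch of "Feature data" generated in "Using Python to Obtain the CASIA Offline and Online Handwritten Chinese Character Databases (Part 2)" into an HDF5 file.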

Loading the `Bunch`

Because this Bunch contains three structures, XFeature, Feature, and Writer, we need to import them as well:

import sys

sys.path.append('E:/xlab')
from base.xhw import json2bunch, XFeature, Feature, Writer

%%time
root = 'E:/OCR/CASIA/'

feature = json2bunch(f'{root}mpf/feature.json')

Wall time: 33.5 s

feature.HWDB10trn.writer001.keys()

dict_keys(['text', 'feature'])
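The `json2bunch` helper and the `Bunch` class live in `base.xhw` and are not shown here. As a rough illustration only, assuming a Bunch is a dictionary whose keys can also be read as attributes, the nested access used above can be sketched as follows. The names `Bunch` and `to_bunch` below are hypothetical stand-ins, not the actual implementation.

```python
class Bunch(dict):
    """Hypothetical stand-in: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value


def to_bunch(obj):
    """Recursively convert nested dicts (e.g. parsed JSON) into Bunch objects."""
    if isinstance(obj, dict):
        return Bunch({key: to_bunch(value) for key, value in obj.items()})
    return obj


# Toy data that mimics the layout: dataset -> writer -> {'text', 'feature'}
raw = {'HWDB10trn': {'writer001': {'text': 'demo', 'feature': {'扼': [0, 1, 2]}}}}
demo = to_bunch(raw)
keys = list(demo.HWDB10trn.writer001.keys())
print(keys)
```

Attribute access and key access are interchangeable in this sketch, so `demo.HWDB10trn` and `demo['HWDB10trn']` return the same object.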

Converting to an HDF5 file

import numpy as np
import pandas as pd
import tables as tb

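def mpf2tables(path, feature):
    # Compression settings passed to the file and to each group
    filters = tb.Filters(complevel=7)
    # The file title means "handwritten single-character features"
    with tb.open_file(path, 'w', title='手写单字特征', filters=filters) as h5:
        for setname in feature.keys():
            # One group per dataset, e.g. /HWDB10trn
            h5.create_group('/', setname, filters=filters)
            for writername in feature[setname].keys():
                # One subgroup per writer, e.g. /HWDB10trn/writer001
                group = h5.create_group(h5.root[setname], writername, filters=filters)
                X = feature[setname][writername]
                # Each column is one character; its values are the feature vector
                df = pd.DataFrame.from_dict(dict(X.feature))
                # Store the characters as UTF-8 byte strings
                labels = np.array([char.encode() for char in df.columns])
                h5.create_array(group, 'label', labels, title=X.text)
                # Transpose so that each row corresponds to one character
                h5.create_array(group, 'feature', np.array(df).T, title=X.text)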

%%time
path = f'{root}mpf/feature.h5'
mpf2tables(path, feature)

Wall time: 5min 29s
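One point worth checking about the function above: as far as I know, PyTables applies `Filters` only to chunked datasets (`CArray`, `EArray`, `Table`), while a plain `Array` made with `create_array` is stored uncompressed. The sketch below is one possible variant, not the method used in this post, that stores the features with `create_carray` so that `complevel=7` takes effect. The helper `write_writer` and the toy data are made up for illustration.

```python
import os
import tempfile

import numpy as np
import tables as tb


def write_writer(h5, setname, writername, labels, features):
    """Store one writer's labels and features under /setname/writername."""
    group = h5.create_group(f'/{setname}', writername, createparents=True)
    # Labels: UTF-8 byte strings in a plain (uncompressed) Array
    encoded = np.array([char.encode('utf-8') for char in labels])
    h5.create_array(group, 'label', encoded)
    # Features: chunked CArray, so the compression filter is applied
    filters = tb.Filters(complevel=7, complib='zlib')
    h5.create_carray(group, 'feature', obj=features, filters=filters)


# Toy data: 3 characters with 512 feature values each
labels = ['扼', '遏', '鄂']
features = (np.arange(3 * 512) % 256).astype(np.uint8).reshape(3, 512)

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with tb.open_file(path, 'w') as h5:
    write_writer(h5, 'HWDB10trn', 'writer001', labels, features)

with tb.open_file(path) as h5:
    node = h5.root.HWDB10trn.writer001
    restored = node.feature[:]
    chars = [b.decode('utf-8') for b in node.label[:]]
    complevel = node.feature.filters.complevel
```

Reading works the same way for both node types, so code that loads the file does not need to change.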

Loading `feature.h5` from disk

h5 = tb.open_file(path)

h5.root.HWDB10trn.writer001.feature.shape

(3728, 512)

This shows that, for writer writer001 in dataset HWDB10trn, the feature array covers 3728 characters, each described by a 512-dimensional feature vector.

Next, we decode the character labels:

np.array([label.decode() for label in h5.root.HWDB10trn.writer001.label])

array(['扼', '遏', '鄂', ..., '娥', '恶', '厄'], dtype='<U1')
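The labels make a round trip: they are encoded to bytes before being written and decoded back to strings after being read. Here is a small NumPy-only sketch of that round trip, using three sample characters from the output above. The motivation given in the comment (PyTables handling fixed-width byte strings rather than NumPy Unicode arrays) is my understanding and should be treated as an assumption.

```python
import numpy as np

chars = ['扼', '遏', '鄂']

# Encode: each of these characters takes 3 bytes in UTF-8, so the array is
# fixed-width bytes. (Assumption: PyTables arrays accept this kind of dtype
# but not NumPy Unicode arrays, hence the encoding step.)
encoded = np.array([char.encode('utf-8') for char in chars])

# Decode: turn each byte string back into a one-character Python string
decoded = np.array([item.decode('utf-8') for item in encoded])

print(encoded.dtype.kind, encoded.dtype.itemsize)
print(decoded.tolist())
```

The decoded array has a Unicode dtype with one character per element, which matches the single-character labels of this dataset.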

For more, see my GitHub: https://github.com/xinetzone/loader/blob/casia/casia/README.md
