For more details, see my blog post: Converting a Bunch to an HDF5 File: Efficient Storage for Cifar and Other Datasets
Feature
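I start by converting the Bunch of "Feature data" generated in "Obtaining the CASIA Offline and Online Handwritten Chinese Character Databases with Python (II)" into an HDF5 file.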
Loading the Bunch
Because this Bunch contains three structures (XFeature, Feature, and Writer), we need to import them as well:
import sys
sys.path.append('E:/xlab')
from base.xhw import json2bunch, XFeature, Feature, Writer
%%time
root = 'E:/OCR/CASIA/'
feature = json2bunch(f'{root}mpf/feature.json')
Wall time: 33.5 s
feature.HWDB10trn.writer001.keys()
dict_keys(['text', 'feature'])
Converting to an HDF5 file
import numpy as np
import pandas as pd
import tables as tb
def mpf2tables(path, feature):
    # Compression settings, passed to the file and to every group
    filters = tb.Filters(complevel=7)
    with tb.open_file(path, 'w', title='手写单字特征', filters=filters) as h5:
        for setname in feature.keys():
            # One group per dataset, e.g. HWDB10trn
            h5.create_group('/', setname, filters=filters)
            for writername in feature[setname].keys():
                # One subgroup per writer, e.g. writer001
                h5.create_group(h5.root[setname], writername, filters=filters)
                X = feature[setname][writername]
                # Columns are the character labels, rows are feature dimensions
                df = pd.DataFrame.from_dict(dict(X.feature))
                # Store the labels as UTF-8 encoded bytes
                label = np.array([label.encode() for label in df.columns])
                h5.create_array(h5.root[setname][writername], 'label', label, title=X.text)
                # Transpose so that each row is one character
                h5.create_array(h5.root[setname][writername], 'feature', np.array(df).T, title=X.text)
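To see why the function transposes the DataFrame, here is a minimal sketch with a made-up feature dictionary. The three labels and the 4-dimensional vectors are illustrative assumptions, not CASIA data. `DataFrame.from_dict` turns each key into a column, so the transpose yields one row per character:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X.feature: label -> feature vector.
toy_feature = {'扼': [1, 2, 3, 4], '遏': [5, 6, 7, 8], '鄂': [9, 10, 11, 12]}

df = pd.DataFrame.from_dict(toy_feature)   # columns are the labels
labels = np.array([name.encode() for name in df.columns])  # UTF-8 bytes
features = np.array(df).T                  # one row per character

print(df.shape)        # dimensions x characters
print(features.shape)  # characters x dimensions
```

Row `i` of `features` then lines up with `labels[i]`, which is the layout the two arrays are stored in.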
%%time
path = f'{root}mpf/feature.h5'
mpf2tables(path, feature)
Wall time: 5min 29s
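One caveat worth noting: in PyTables, `create_array` produces a plain `Array`, which is stored contiguously and is not compressible, so `complevel=7` has no effect on these leaves. If file size matters, a chunked `CArray` is one option. The sketch below compares the two on hypothetical zero-filled data written to temporary files (the data, the helper `save_compressed`, and the file names are assumptions for illustration):

```python
import os
import tempfile

import numpy as np
import tables as tb

def save_compressed(path, name, data, complevel=7):
    """Write `data` as a chunked, compressed CArray under the root group."""
    filters = tb.Filters(complevel=complevel, complib='zlib')
    with tb.open_file(path, 'w') as h5:
        h5.create_carray(h5.root, name, obj=data, filters=filters)

# Hypothetical stand-in for one writer's features: highly compressible zeros.
data = np.zeros((3728, 512), dtype='uint8')
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, 'plain.h5')
carray_path = os.path.join(tmpdir, 'carray.h5')

# Plain Array: the file-level filters are not applied to this leaf.
with tb.open_file(plain_path, 'w', filters=tb.Filters(complevel=7)) as h5:
    h5.create_array(h5.root, 'feature', data)

save_compressed(carray_path, 'feature', data)

plain_size = os.path.getsize(plain_path)
carray_size = os.path.getsize(carray_path)
print(plain_size, carray_size)
```

Real feature data will compress far less than zeros, so how much space this saves on the CASIA features would need to be measured.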
Loading feature.h5 from disk
h5 = tb.open_file(path)
h5.root.HWDB10trn.writer001.feature.shape
(3728, 512)
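We can see that, in the dataset HWDB10trn, the characters written by writer001 have the following feature information: 3728 characters, each with a 512-dimensional feature vector.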
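Reading `.shape` only touches metadata; the data itself stays on disk until the node is indexed, and a slice reads just the requested rows. A small self-contained sketch, using a temporary file and random values as stand-ins for feature.h5 (both are assumptions for illustration):

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical stand-in for one writer's feature matrix.
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(100, 512)).astype('uint8')

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with tb.open_file(path, 'w') as h5:
    group = h5.create_group('/', 'HWDB10trn')
    writer = h5.create_group(group, 'writer001')
    h5.create_array(writer, 'feature', data)

with tb.open_file(path) as h5:
    node = h5.root.HWDB10trn.writer001.feature
    shape = node.shape      # metadata only, no data read
    first_ten = node[:10]   # reads the first 10 rows into a NumPy array
    one_char = node[42]     # a single 512-dimensional feature vector
```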
Next, we decode the character labels, which were stored as bytes:
np.array([label.decode() for label in h5.root.HWDB10trn.writer001.label])
array(['扼', '遏', '鄂', ..., '娥', '恶', '厄'], dtype='
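Putting the two reads together, the following sketch pairs each decoded label with its feature row and closes the file through a context manager. The helper name `load_writer` and the toy file contents are assumptions for illustration; only the group layout mirrors feature.h5:

```python
import os
import tempfile

import numpy as np
import tables as tb

def load_writer(h5, setname, writername):
    """Return (labels, features) for one writer, with labels decoded to str."""
    node = h5.get_node(f'/{setname}/{writername}')
    labels = np.array([raw.decode() for raw in node.label])
    return labels, node.feature[:]

# Build a tiny file with the same layout (toy values, not CASIA data).
path = os.path.join(tempfile.mkdtemp(), 'toy.h5')
with tb.open_file(path, 'w') as h5:
    writer = h5.create_group('/HWDB10trn', 'writer001', createparents=True)
    h5.create_array(writer, 'label', np.array([c.encode() for c in '扼遏鄂']))
    h5.create_array(writer, 'feature', np.arange(12, dtype='uint8').reshape(3, 4))

with tb.open_file(path) as h5:   # the context manager closes the file on exit
    labels, features = load_writer(h5, 'HWDB10trn', 'writer001')

# Map each character to its feature vector.
by_label = dict(zip(labels, features))
```

A dictionary like `by_label` assumes each character appears once per writer; if a label repeats, later rows overwrite earlier ones.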