
Converting MPF (Bunch) to HDF5

Feature

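First, I convert the Bunch "Feature data" generated in the earlier post "python 获取 CASIA 脱机和在线手写汉字库 (二)" (obtaining the CASIA offline and online handwritten Chinese character databases with Python, part 2) into an HDF5 file.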

Loading the Bunch

Because this Bunch contains three structures, XFeature, Feature, and Writer, we need to import them as well:

import sys

# Add the directory that contains the base package to the import path.
sys.path.append('E:/xlab')
from base.xhw import json2bunch, XFeature, Feature, Writer

%%time
root = 'E:/OCR/CASIA/'

# Load the JSON dump back into a Bunch.
feature = json2bunch(f'{root}mpf/feature.json')

Wall time: 33.5 s

feature.HWDB10trn.writer001.keys()

dict_keys(['text', 'feature'])
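
The attribute-style access used above (`feature.HWDB10trn.writer001`) is what a Bunch provides on top of a plain nested dict. The implementation of `json2bunch` in `base.xhw` is not shown in this post, so the following is only a minimal sketch of the idea: `Bunch` and `to_bunch` are hypothetical names, and the JSON string is a made-up stand-in for feature.json.

```python
import json


class Bunch(dict):
    """Hypothetical dict subclass whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def to_bunch(obj):
    """Recursively wrap nested dicts (for example parsed JSON) in Bunch."""
    if isinstance(obj, dict):
        return Bunch({key: to_bunch(value) for key, value in obj.items()})
    if isinstance(obj, list):
        return [to_bunch(item) for item in obj]
    return obj


# Made-up miniature of the real file layout: set -> writer -> fields.
raw = json.loads('{"HWDB10trn": {"writer001": {"text": "demo", "feature": {}}}}')
demo = to_bunch(raw)
print(demo.HWDB10trn.writer001.keys())
```

Because `Bunch` is still a dict, key-based access such as `demo['HWDB10trn']['writer001']` keeps working alongside the attribute form.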

Converting to an HDF5 file

import numpy as np
import pandas as pd
import tables as tb

def mpf2tables(path, feature):
    # Compression settings passed to the file and to every group.
    filters = tb.Filters(complevel=7)
    # The file title means "handwritten single-character features".
    with tb.open_file(path, 'w', title='手写单字特征', filters=filters) as h5:
        for setname in feature.keys():
            # One group per dataset ...
            h5.create_group('/', setname, filters=filters)
            for writername in feature[setname].keys():
                # ... and one subgroup per writer.
                h5.create_group(h5.root[setname], writername, filters=filters)
                X = feature[setname][writername]
                # Columns are character labels; rows are feature dimensions.
                df = pd.DataFrame.from_dict(dict(X.feature))
                # Store the labels as UTF-8 bytes.
                label = np.array([label.encode() for label in df.columns])
                h5.create_array(h5.root[setname][writername], 'label', label, title=X.text)
                # Transpose so that each row holds one character's feature vector.
                h5.create_array(h5.root[setname][writername], 'feature', np.array(df).T, title=X.text)

%%time
path = f'{root}mpf/feature.h5'
mpf2tables(path, feature)

Wall time: 5min 29s
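
One detail worth knowing about the function above: in PyTables, plain `Array` nodes created with `create_array` are not compressed, so the `complevel=7` filter only takes effect on chunked node types such as `CArray`. If file size matters, an optional variation is to store the feature matrix with `create_carray`. The sketch below is not the method used in this post; it uses a made-up three-character toy dataset and a temporary file (`toy`, `toy_path`, and `toy_writer` are illustrative names).

```python
import os
import tempfile

import numpy as np
import tables as tb

# Toy stand-in for one writer: 3 characters, 4 feature values each (made-up numbers).
toy = {'扼': [1, 2, 3, 4], '遏': [5, 6, 7, 8], '鄂': [9, 10, 11, 12]}
labels = np.array([label.encode() for label in toy])      # UTF-8 bytes
features = np.array(list(toy.values()), dtype=np.uint8)   # shape (3, 4)

toy_path = os.path.join(tempfile.mkdtemp(), 'toy_feature.h5')
filters = tb.Filters(complevel=7)

with tb.open_file(toy_path, 'w', filters=filters) as h5:
    group = h5.create_group('/', 'toy_writer')
    # Labels are small, so a plain (uncompressed) Array is enough here.
    h5.create_array(group, 'label', labels)
    # create_carray builds a chunked array, so the filters are applied to it.
    h5.create_carray(group, 'feature', obj=features, filters=filters)

# Read everything back inside a with-block so the file is always closed.
with tb.open_file(toy_path) as h5:
    stored = h5.root.toy_writer.feature
    shape = stored.shape
    complevel = stored.filters.complevel
    restored = [label.decode() for label in h5.root.toy_writer.label.read()]

print(shape, complevel, restored)
```

The read side is unchanged: a `CArray` is indexed and sliced the same way as an `Array`.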

Loading feature.h5 from local disk

h5 = tb.open_file(path)
h5.root.HWDB10trn.writer001.feature.shape
(3728, 512)

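From this we can see that the feature data for the characters written by writer001 in the HWDB10trn dataset covers 3728 characters, each represented by a 512-dimensional feature vector.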

Next, we decode the character labels, which were stored as bytes:

np.array([label.decode() for label in h5.root.HWDB10trn.writer001.label])
array(['扼', '遏', '鄂', ..., '娥', '恶', '厄'], dtype='<U1')
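
The labels go through `encode()` when written and `decode()` when read because they are stored as byte strings rather than as Python `str` objects. A minimal sketch of that round trip, using three sample characters copied from the output above and no HDF5 file at all:

```python
# Sample labels copied from the output above.
labels = ['扼', '遏', '鄂']

# str.encode() defaults to UTF-8; each of these characters occupies 3 bytes.
encoded = [label.encode() for label in labels]
widths = [len(item) for item in encoded]

# bytes.decode() reverses the step and recovers the original strings.
decoded = [item.decode() for item in encoded]

print(widths)
print(decoded == labels)
```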