Converting MPF (Bunch) to HDF5

心之宙

Feature

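I start by converting the Bunch of feature data produced in "Using Python to Fetch the CASIA Offline and Online Handwritten Chinese Character Databases (Part 2)" into an HDF5 file.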

Loading the Bunch

Because this Bunch contains three structures, XFeature, Feature, and Writer, we need to import them as well:

import sys

# Add the directory that holds the base.xhw module to the import path.
sys.path.append('E:/xlab')
from base.xhw import json2bunch, XFeature, Feature, Writer
%%time
root = 'E:/OCR/CASIA/'

feature = json2bunch(f'{root}mpf/feature.json')
Wall time: 33.5 s
feature.HWDB10trn.writer001.keys()
dict_keys(['text', 'feature'])
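
The attribute-style access above (feature.HWDB10trn.writer001) is what a Bunch provides. The implementation in base.xhw is not shown here, so the following is only a minimal sketch of the idea: the Bunch class and the to_bunch helper are hypothetical stand-ins, and the toy data is made up.

```python
class Bunch(dict):
    """Hypothetical minimal Bunch: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value


def to_bunch(obj):
    """Recursively wrap nested dicts so that a.b.c lookups work."""
    if isinstance(obj, dict):
        return Bunch({key: to_bunch(value) for key, value in obj.items()})
    return obj


# Toy data mirroring the set -> writer -> {text, feature} layout (values are made up).
raw = {'HWDB10trn': {'writer001': {'text': 'demo header', 'feature': {'扼': [0, 1, 2]}}}}
feature = to_bunch(raw)
print(feature.HWDB10trn.writer001.keys())
```

One caveat of this design: a key that collides with a dict method name (for example keys) is only reachable through item access, not as an attribute.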

Converting to an HDF5 file

import numpy as np
import pandas as pd
import tables as tb
def mpf2tables(path, feature):
    """Write the nested Bunch (set -> writer -> features) to an HDF5 file."""
    filters = tb.Filters(complevel=7)
    # The file title means "handwritten single-character features".
    with tb.open_file(path, 'w', title='手写单字特征', filters=filters) as h5:
        for setname in feature.keys():
            set_group = h5.create_group('/', setname, filters=filters)
            for writername in feature[setname].keys():
                writer_group = h5.create_group(set_group, writername, filters=filters)
                X = feature[setname][writername]
                # Columns are the characters; rows are the feature dimensions.
                df = pd.DataFrame.from_dict(dict(X.feature))
                # Store the labels as UTF-8 bytes.
                labels = np.array([char.encode() for char in df.columns])
                h5.create_array(writer_group, 'label', labels, title=X.text)
                # Transpose so that each row holds one character's feature vector.
                h5.create_array(writer_group, 'feature', np.array(df).T, title=X.text)
%%time
path = f'{root}mpf/feature.h5'
mpf2tables(path, feature)
Wall time: 5min 29s
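
To try the same set -> writer -> (label, feature) layout without the CASIA data, here is a small round-trip sketch on made-up values. The toy dictionary, the file name, and the uint8 dtype are all assumptions for illustration. As far as I know, plain Array nodes created with create_array are stored uncompressed, so the sketch uses create_carray for the feature matrix so that the Filters setting takes effect; that substitution is mine, not the method used in mpf2tables.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Made-up toy data in the set -> writer -> {character: feature vector} layout.
toy = {
    'demo_set': {
        'writer001': {'扼': [1, 2, 3, 4], '遏': [5, 6, 7, 8], '鄂': [9, 10, 11, 12]},
        'writer002': {'娥': [4, 3, 2, 1], '恶': [8, 7, 6, 5]},
    }
}

filters = tb.Filters(complevel=7)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_feature.h5')  # hypothetical file name

    # Write: one group per set, one subgroup per writer, two arrays per writer.
    with tb.open_file(path, 'w', title='toy features') as h5:
        for setname, writers in toy.items():
            set_group = h5.create_group('/', setname)
            for writername, chars in writers.items():
                writer_group = h5.create_group(set_group, writername)
                labels = np.array([char.encode() for char in chars])
                features = np.array(list(chars.values()), dtype=np.uint8)
                h5.create_array(writer_group, 'label', labels)
                h5.create_carray(writer_group, 'feature', obj=features, filters=filters)

    # Read back one writer while the temporary file still exists.
    with tb.open_file(path) as h5:
        node = h5.root.demo_set.writer001
        restored_features = node.feature[:]
        restored_labels = [raw.decode() for raw in node.label[:]]

print(restored_features.shape)
print(restored_labels)
```

Opening the file in a with block also guarantees it is closed, which matters on Windows before the file can be deleted or rewritten.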

Loading feature.h5 from disk

h5 = tb.open_file(path)
h5.root.HWDB10trn.writer001.feature.shape
(3728, 512)

From this we can see that the feature data for writer001 in the HWDB10trn dataset covers 3728 characters, each described by a 512-dimensional feature vector.
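
The (characters, dimensions) orientation comes from the transpose in mpf2tables: pd.DataFrame.from_dict puts each character in its own column, so the raw frame is (dimensions, characters). A toy sketch with three made-up characters and 4-dimensional vectors (instead of 512) shows the effect:

```python
import numpy as np
import pandas as pd

# Made-up feature vectors: 3 characters, 4 dimensions each.
toy_feature = {'扼': [1, 2, 3, 4], '遏': [5, 6, 7, 8], '鄂': [9, 10, 11, 12]}

df = pd.DataFrame.from_dict(toy_feature)  # keys become columns
matrix = np.array(df).T                   # one row per character

print(df.shape)      # (dimensions, characters)
print(matrix.shape)  # (characters, dimensions)
print(matrix[0])     # feature vector of the first character
```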

Next, we decode the character labels back into strings:

np.array([label.decode() for label in h5.root.HWDB10trn.writer001.label])
array(['扼', '遏', '鄂', ..., '娥', '恶', '厄'], dtype='
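
The labels are stored as UTF-8 bytes and turned back into strings on read. The sketch below reproduces that round trip on three sample characters with NumPy alone, and also shows np.char.decode as a vectorized alternative to the list comprehension; the variable names are illustrative only.

```python
import numpy as np

chars = ['扼', '遏', '鄂']

# Encode: each character becomes a UTF-8 byte string, giving a bytes ('S') array.
encoded = np.array([char.encode() for char in chars])

# Decode, variant 1: element by element, as in the text above.
decoded_loop = np.array([raw.decode() for raw in encoded])

# Decode, variant 2: vectorized over the whole array.
decoded_vec = np.char.decode(encoded, 'utf-8')

print(encoded.dtype.kind)  # 'S' marks a bytes array
print(decoded_loop)
print(decoded_vec)
```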