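I have a huge CSV file (3.5 GB, and growing every day) with ordinary columns plus a column named "Metadata" that contains nested JSON values. My script is below; its only purpose is to turn each key-value pair in the JSON column into a regular column. I am using Python 3 (Anaconda, Windows).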
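import json  # needed for the json.loads converter below
import pandas as pd
import numpy as np
import csv
import datetime as dt
from pandas.io.json import json_normalize  # exposed as pd.json_normalize in pandas >= 1.0

for df in pd.read_csv("source.csv", engine='c',
                      dayfirst=True,
                      encoding='utf-8',
                      header=0,
                      nrows=10,
                      chunksize=2,
                      converters={'Metadata': json.loads}):
    ## parsing code comes here
    # newline='' avoids blank lines between rows when writing through a file handle on Windows
    with open("output.csv", 'a', encoding='utf-8', newline='') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')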
The column holds JSON in the following format:
{
    "content_id":"xxxx",
    "parental":"F",
    "my_custom_data":{
        "GroupId":"NA",
        "group":null,
        "userGuid":"xxxxxxxxxxxxxx",
        "deviceGuid":"xxxxxxxxxxxxx",
        "connType":"WIFI",
        "channelName":"VOD",
        "assetId":"xxxxxxxxxxxxx",
        "GroupName":"NA",
        "playType":"VOD",
        "appVersion":"2.1.0",
        "userEnvironmentContext":"",
        "vodEncode":"H.264",
        "language":"English"
    }
}
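For reference, here is a minimal sketch of what flattening one such record might look like, assuming pandas 1.0 or later (where `pd.json_normalize` is available at the top level). The `sample` dict is a shortened, hypothetical version of the record above, and the `_` separator is an arbitrary choice for illustration.

```python
import pandas as pd

# Shortened, hypothetical version of the record shown above.
sample = {
    "content_id": "xxxx",
    "parental": "F",
    "my_custom_data": {"GroupId": "NA", "group": None, "connType": "WIFI"},
}

# Nested keys are joined to their parent key with the chosen separator.
flat = pd.json_normalize(sample, sep="_")
columns = list(flat.columns)
print(columns)
```

The nested keys come out as columns such as `my_custom_data_GroupId`, next to the top-level `content_id` and `parental`.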
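The expected output is to have all of the key-value pairs above as columns. The DataFrame has other, non-JSON columns, and I need to add the columns parsed from the JSON to them. I tried json_normalize, but I am not sure how to apply json_normalize to a Series object and then convert (or explode) the result into multiple columns.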
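One possible approach, sketched here under some assumptions: parse the column into a list of dicts, pass that list to `pd.json_normalize`, and join the result back onto the other columns chunk by chunk. The helper names `expand_metadata` and `flatten_csv`, the `_` separator, and the chunk size are hypothetical, not part of the original script. The sketch also assumes pandas 1.0 or later, that the flattened names do not collide with existing columns, and that every chunk yields the same set of keys (otherwise the appended rows would not line up and the output would need to be reindexed to a fixed column list).

```python
import json

import pandas as pd


def expand_metadata(chunk, column="Metadata", sep="_"):
    """Return the chunk with the JSON column replaced by one column per key."""
    # Parse each cell; anything that is not a string (e.g. NaN) becomes an empty record.
    records = [json.loads(s) if isinstance(s, str) else {} for s in chunk[column]]
    flat = pd.json_normalize(records, sep=sep)
    # Chunks after the first do not start at index 0, so align before joining.
    flat.index = chunk.index
    return chunk.drop(columns=[column]).join(flat)


def flatten_csv(src, dst, chunksize=50_000):
    """Stream src in chunks and write the flattened rows to dst."""
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize, encoding="utf-8"):
        out = expand_metadata(chunk)
        # Write the header once, then append the remaining chunks without it.
        out.to_csv(dst, mode="w" if first else "a", header=first,
                   index=False, encoding="utf-8")
        first = False
```

Calling `flatten_csv("source.csv", "output.csv")` would process the file without loading it all into memory. Parsing inside the helper, rather than through `converters`, keeps the handling of missing cells in one place.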