I have three fairly large files (67 GB, 36 GB, and 30 GB) that I need to train a model on. However, the features are the rows and the samples are the columns. Since Dask has not implemented transpose and stores its DataFrames split by rows, I need to write something myself to do this. Is there a way to transpose efficiently without loading everything into memory?
I have 16 GB of RAM available and am working in a Jupyter notebook. I wrote some code that works but is quite slow, and I would really appreciate a faster solution. At its current speed, the code below would take about a month to finish all the files. The slowest step by several orders of magnitude is the awk call: each invocation re-reads the entire input file, so the 67 GB file alone gets scanned once per column, i.e. 2,000 times.
import os
import subprocess
import dask.dataframe as dd
import pandas as pd
from IPython.display import display, clear_output

# only used to get the column count without loading the data into memory
df = dd.read_csv('~/VeryLarge.tsv', sep='\t')

with open('output.csv', 'wb') as fout:
    for i in range(1, len(df.columns) + 1):
        print('AWKing')
        # read a column from the original data and store it elsewhere
        cmd = "awk '{print $" + str(i) + "}' ~/VeryLarge.tsv > ~/file.temp"
        subprocess.check_call(cmd, shell=True)

        print('Reading')
        # load and transpose the column
        col = pd.read_csv('~/file.temp')
        row = col.T
        display(row)

        print('Deleting')
        # remove the temporary file created
        os.remove(os.path.expanduser('~/file.temp'))

        print('Storing')
        # store the row in its own csv just to be safe. not entirely necessary
        row.to_csv('~/columns/col_{:09d}'.format(i), header=False)

        print('Appending')
        # append the row (transposed column) to the new file
        with open(os.path.expanduser('~/columns/col_{:09d}'.format(i)), 'rb') as fin:
            for line in fin:
                fout.write(line)

        clear_output()
        # just a measure of progress
        print(i / len(df.columns))
The data itself has 10 million rows (features) and 2,000 columns (samples). It just needs to be transposed.
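One alternative I am considering, but have not tested, is the rough sketch below. It assumes the file is tab-separated, that a chunk of about 100,000 rows by 2,000 columns fits comfortably in 16 GB of RAM, and the ~/parts/ and ~/transposed.csv paths are just placeholders. The idea is to read each file exactly once in row chunks, transpose every chunk in memory, write each transposed chunk to its own part file, and finally join the parts column-wise with paste:

import glob
import os
import subprocess
import pandas as pd

src = '~/VeryLarge.tsv'   # placeholder path; assumed tab-separated
chunksize = 100_000       # ~100k rows x 2,000 cols of floats is roughly 1.6 GB in memory

# Pass 1: stream the file in row chunks, transpose each chunk in memory,
# and write one part file per chunk.
os.makedirs(os.path.expanduser('~/parts'), exist_ok=True)
for n, chunk in enumerate(pd.read_csv(src, sep='\t', chunksize=chunksize)):
    chunk.T.to_csv('~/parts/part_{:05d}.csv'.format(n),
                   header=False,
                   index=(n == 0))   # write the feature names only in the first part

# Pass 2: every part file has exactly one line per feature (2,000 lines),
# so pasting the parts side by side yields the transposed table.
parts = sorted(glob.glob(os.path.expanduser('~/parts/part_*.csv')))
with open(os.path.expanduser('~/transposed.csv'), 'wb') as fout:
    subprocess.check_call(['paste', '-d', ','] + parts, stdout=fout)

That way each input file would only be read from disk twice in total (once by pandas, once by paste) instead of once per column, but I have not verified how much faster it actually is in practice.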