Using Dask to prevent reading data multiple times


What can I do to prevent the same file from being read more than once?

For background, here are the details.

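I am trying to read a list of files from a folder, apply a transformation to them, write the result to a file, and check the differences between the data before and after the transformation.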

First, the reading part:

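import glob2
import pandas as pd
import dask.dataframe as dd
from dask import delayed

def load_file(file):
    # Read one Excel file into a pandas DataFrame
    return pd.read_excel(file)

file_list = glob2.glob("folder path here")
future_list = [delayed(load_file)(file) for file in file_list]
read_result_dd = dd.from_delayed(future_list)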

After that, I apply some transformations to the data:

def transform(df):
    # do something to df
    return df

transformation_result = read_result_dd.map_partitions(transform)

I want to achieve two things. First, get the transformed output:

Outputfile = transformation_result.compute()

Outputfile.to_csv("path and param here")

Second, get the comparison:

read_result_comp = read_result_dd.groupby("groupby param here")["result param here"].sum().reset_index()

transformation_result_comp = transformation_result.groupby("groupby param here")["result param here"].sum().reset_index()

Checker = read_result_comp.merge(transformation_result_comp, on=['header_list'], how='outer').compute()

Checker.to_csv("path and param here")

The problem appears when I call them in sequence, Outputfile first and then Checker:

Outputfile = transformation_result.compute()

Checker = read_result_comp.merge(transformation_result_comp, on=['header_list'], how='outer').compute()

Outputfile.to_csv("path and param here")

Checker.to_csv("path and param here")

This reads every file twice, once for each compute() call.

Is there a way to have the read happen only once?

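Also, is there a way to have both compute() calls run in a single pass? (If I run them on two separate lines, the Dask dashboard shows that it runs the first one, clears the dashboard, and then runs the second one, instead of running both together.)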

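I cannot call .compute() on the read result itself because it does not fit in my RAM; the resulting dataframe is too large. Both the checker and the output file are much smaller than the original data.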

繁华开满天机

105 views, 1 answer

慕容3067478

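You can call the dask.compute function on multiple Dask collections at once:

a, b = dask.compute(a, b)

https://docs.dask.org/en/latest/api.html#dask.compute

Tasks that both collections share, such as the file reads, are then computed only once.

In the future, I suggest making an MCVE.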

