墨色风雨
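Question 2

There is a way to partially solve this problem. I assume that `df2` is much smaller than `df1` and that it actually fits in memory, so we can read it as a pandas DataFrame. If that is the case, the following function works whether `df1` is a pandas or a dask DataFrame, but `df2` should be a pandas one.

```python
import pandas as pd
import dask.dataframe as dd

def replace_names(df1,  # can be a pandas or dask dataframe
                  df2,  # this should be pandas
                  idxCol='id',
                  srcCol='name',
                  dstCol='name'):
    # Build an {id: name} lookup from the small in-memory frame.
    diz = df2[[idxCol, srcCol]].set_index(idxCol).to_dict()[srcCol]
    out = df1.copy()
    # Ids that are missing from df2 map to NaN.
    out[dstCol] = out[idxCol].map(diz)
    return out
```

Question 1

Regarding the first question, the following code works with both pandas and dask.

```python
df = pd.DataFrame({'a': {0: 1, 1: 2},
                   'b': {0: 3, 1: 4},
                   '1/1/20': {0: 10, 1: 10},
                   '1/2/20': {0: 20, 1: 30},
                   '1/3/20': {0: 0, 1: 30},
                   '1/4/20': {0: 40, 1: 0},
                   '1/5/20': {0: 0, 1: 0},
                   '1/6/20': {0: 50, 1: 50}})

# if you want to try with dask
# df = dd.from_pandas(df, npartitions=2)

# Date columns are the ones whose name contains a slash.
cols = [col for col in df.columns if "/" in col]
# Turn zeros into NaN, then forward-fill along each row.
df[cols] = df[cols].mask(lambda x: x == 0).ffill(axis=1)  # .astype(int)
```

If you want the output as integers, uncomment `.astype(int)` at the end of the last line. This only works when no NaN is left after the fill, that is, when no row starts with a zero.

Update: Question 2

If you want a dask-only solution, you can try the following.

Data

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = pd.DataFrame({'id': {0: 123, 1: 456, 2: 789, 3: 789, 4: 456, 5: 123},
                    'name': {0: 'city a', 1: 'city b', 2: 'city c',
                             3: 'city c', 4: 'city b', 5: 'city a'}})

df2 = pd.DataFrame({'id': {0: 123, 1: 456, 2: 789},
                    'name': {0: 'City A', 1: 'City B', 2: 'City C'}})

df1 = dd.from_pandas(df1, npartitions=2)
df2 = dd.from_pandas(df2, npartitions=2)
```

Case 1

In this case, if an `id` exists in `df1` but not in `df2`, the name from `df1` is kept.

```python
def replace_names_dask(df1, df2,
                       idxCol='id',
                       srcCol='name',
                       dstCol='name'):
    # Avoid a column-name clash in the merge when source and destination match.
    if srcCol == dstCol:
        df2 = df2.rename(columns={srcCol: f"{srcCol}_new"})
        srcCol = f"{srcCol}_new"

    def map_replace(x, srcCol, dstCol):
        # Take the new name where the join found one, otherwise keep the old name.
        x[dstCol] = np.where(x[srcCol].notnull(), x[srcCol], x[dstCol])
        return x

    df = dd.merge(df1, df2, on=idxCol, how="left")
    df = df.map_partitions(lambda x: map_replace(x, srcCol, dstCol))
    df = df.drop(srcCol, axis=1)
    return df

df = replace_names_dask(df1, df2)
```

Case 2

In this case, if an `id` exists in `df1` but not in `df2`, then `name` in the output `df` will be `NaN` (as in a standard left join).

```python
def replace_names_dask(df1, df2,
                       idxCol='id',
                       srcCol='name',
                       dstCol='name'):
    # Drop the old names and let the left join bring in the new ones.
    df1 = df1.drop(dstCol, axis=1)
    df2 = df2.rename(columns={srcCol: dstCol})
    df = dd.merge(df1, df2, on=idxCol, how="left")
    return df

df = replace_names_dask(df1, df2)
```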
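
To see how the two cases differ, here is a minimal pandas-only sketch of the same logic. It is not part of the original answer: the id `999` and the name `city z` are made up for illustration, so that one row of `df1` has no match in `df2`. The dask versions should behave the same way once computed, but this sketch does not run dask.

```python
import numpy as np
import pandas as pd

# Sample data in the same shape as above, plus a hypothetical id 999
# that is deliberately missing from df2.
df1 = pd.DataFrame({"id": [123, 456, 999],
                    "name": ["city a", "city b", "city z"]})
df2 = pd.DataFrame({"id": [123, 456],
                    "name": ["City A", "City B"]})

# Case 1: left join, then fall back to the old name where no match was found.
merged = df1.merge(df2.rename(columns={"name": "name_new"}), on="id", how="left")
case1 = merged.assign(
    name=np.where(merged["name_new"].notnull(), merged["name_new"], merged["name"])
).drop(columns="name_new")

# Case 2: drop the old name and keep whatever the left join returns.
case2 = df1.drop(columns="name").merge(df2, on="id", how="left")

print(case1)
print(case2)
```

In Case 1 the unmatched row keeps `city z`; in Case 2 its `name` is missing. Which one you want depends on whether an unmatched `id` should be treated as "leave as is" or as a data problem to surface.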