How do I apply a function to a DataFrame column using data from the previous row?

5 Answers

呼如林

If you want to keep the function some_calc_func and not use another library, you should not access each element through the DataFrame at every iteration. Instead, you can zip the columns nums and b, offset by one row because you need nums from the previous row, and keep prev_res in memory between iterations. Also, append to a list rather than to the DataFrame, and assign the list to the column after the loop.

    prev_res = df.loc[0, 'result']  # get first result
    l_res = [prev_res]              # initialize the list of results

    # loop with zip to get both values at the same time,
    # use loc to start b at the second row but not nums
    for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
        # use your function to calculate the new prev_res
        prev_res = some_calc_func(prev_res, prev_num, current_b)
        # add to the list of results
        l_res.append(prev_res)

    # assign to the column
    df['result'] = l_res
    print(df)  # same result as with your method

       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    37.0
    2  30.0  1   407.0
    3  29.1  1  6105.0
    4  20.0  0    46.1

Now, with a DataFrame df of 5000 rows, I get:

    %%timeit
    prev_res = df.loc[0, 'result']
    l_res = [prev_res]
    for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
        prev_res = some_calc_func(prev_res, prev_num, current_b)
        l_res.append(prev_res)
    df['result'] = l_res
    # 4.42 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Your original solution is about 750 times slower:

    %%timeit
    for i in range(1, len(df.index)):
        row = df.index[i]
        new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
        df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'],
                                               prev_num=df.loc[new_row, 'nums'],
                                               current_b=df.loc[row, 'b'])
    # 3.25 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

EDIT: using another library called numba, if the function some_calc_func can easily be used with the Numba decorator.

    import numpy as np
    from numba import jit

    # decorate your function
    @jit
    def some_calc_func(prev_result, prev_num, current_b):
        if current_b == 1:
            return prev_result * prev_num / 2
        else:
            return prev_num + 17

    # create a function to do your job
    # numba likes numpy arrays
    @jit
    def with_numba(prev_res, arr_nums, arr_b):
        # array for results, initialized with the first result
        arr_res = np.zeros_like(arr_nums)
        arr_res[0] = prev_res
        # loop over the length of arr_b
        for i in range(len(arr_b)):
            # do the calculation and set the value in the result array
            prev_res = some_calc_func(prev_res, arr_nums[i], arr_b[i])
            arr_res[i+1] = prev_res
        return arr_res

Finally, call it like this:

    df['result'] = with_numba(df.loc[0, 'result'],
                              df['nums'].to_numpy(),
                              df.loc[1:, 'b'].to_numpy())

With timeit, this is about 9 times faster than the zip approach, and the speedup can grow with the size of the data:

    %timeit df['result'] = with_numba(df.loc[0, 'result'],
                                      df['nums'].to_numpy(),
                                      df.loc[1:, 'b'].to_numpy())
    # 526 µs ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that, depending on what your actual some_calc_func does, using Numba may be problematic.
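For readers who want to try the zip approach from this answer without rebuilding the question's data by hand, here is a minimal self-contained sketch. The sample values are copied from the output tables in this thread and some_calc_func is the plain (undecorated) version of the function shown in this answer. The helper name fill_result is hypothetical and only used for illustration.

```python
import pandas as pd


def some_calc_func(prev_result, prev_num, current_b):
    # Same rule as in the answer, without the Numba decorator
    if current_b == 1:
        return prev_result * prev_num / 2
    return prev_num + 17


def fill_result(df, first_result):
    # Hypothetical helper: builds the whole result column in a plain list
    results = [first_result]
    # nums starts at row 0 and b at row 1, so each pair is (previous nums, current b)
    for prev_num, current_b in zip(df["nums"], df["b"].iloc[1:]):
        results.append(some_calc_func(results[-1], prev_num, current_b))
    return results


# Sample data taken from the tables printed in this thread
df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0],
                   "b": [1, 0, 1, 1, 0]})
df["result"] = fill_result(df, 20.0)
print(df)
```

Keeping the running value in a Python list and assigning the column once avoids the repeated df.loc lookups that make the original loop slow.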


慕田峪9158850

IIUC:

    >>> df['result'] = (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums
                        ).fillna(df.result).cumsum()
    >>> df
       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    42.0
    2  30.0  1    12.0
    3  29.1  1   -17.1
    4  20.0  0     2.9

Explanation:

    # replace 0 with 1 and 1 with -1 in column `b` for rows where result==0
    >>> df[df.result.eq(0)].b.replace({0: 1, 1: -1})
    1    1
    2   -1
    3   -1
    4    1
    Name: b, dtype: int64

    # multiply with nums
    >>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums)
    0     NaN
    1    22.0
    2   -30.0
    3   -29.1
    4    20.0
    dtype: float64

    # fill the NaN with the corresponding value from df.result (which is 20 here)
    >>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result)
    0    20.0
    1    22.0
    2   -30.0
    3   -29.1
    4    20.0
    dtype: float64

    # take the cumulative sum (cumsum)
    >>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result).cumsum()
    0    20.0
    1    42.0
    2    12.0
    3   -17.1
    4     2.9
    dtype: float64

For the rule you asked about in the comments, I cannot think of a way to do it without a loop:

    c1, c2 = 2, 1
    l = [df.loc[0, 'result']]            # store the first result in a list

    # then loop over the series (df.b * df.nums)
    # (.items() replaces .iteritems(), which was removed in pandas 2.0)
    for i, val in (df.b * df.nums).items():
        if i:                            # except for 0th index
            if val == 0:                 # (df.b * df.nums) == 0 if df.b == 0
                l.append(l[-1])          # append the last result
            else:                        # otherwise apply the rule
                t = l[-1] * c2 + val * c1
                l.append(t)

    >>> l
    [20.0, 20.0, 80.0, 138.2, 138.2]

    >>> df['result'] = l
    >>> df
       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    20.0
    2  30.0  1    80.0   # [ 20 * 1 +   30 * 2]
    3  29.1  1   138.2   # [ 80 * 1 + 29.1 * 2]
    4  20.0  0   138.2

This may not be fast enough; I have not tested it on a large sample.
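The loop in this answer is a running fold over the series, so it could also be written with itertools.accumulate from the standard library. This is a swapped-in formulation, not the answerer's code, and it still iterates in Python rather than vectorizing. The name step is hypothetical, the sample data is copied from the tables above, a default integer index is assumed, and the initial keyword needs Python 3.8 or later.

```python
from itertools import accumulate

import pandas as pd

C1, C2 = 2, 1  # same constants as c1, c2 in the answer


def step(prev, val):
    # Hypothetical helper: carry the previous result forward when val == 0,
    # otherwise apply the rule prev * C2 + val * C1
    return prev if val == 0 else prev * C2 + val * C1


df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0],
                   "b": [1, 0, 1, 1, 0]})

# Skip the first row: its result is the starting value, not a computed one
vals = (df["b"] * df["nums"]).iloc[1:]
df["result"] = list(accumulate(vals, step, initial=20.0))
print(df)
```

Because accumulate yields the initial value first, the list has exactly one entry per row and can be assigned to the column directly.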


回首忆惘然

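You have a function f(...) that you want to apply, but you cannot apply it directly because it needs to remember the (previous) row. You can do this with either a closure or a class. Below is a class implementation:

    import pandas as pd

    class Func():
        def __init__(self, value):
            self._prev = value
            self._init = True

        def __call__(self, x):
            if self._init:
                # first row: return the initial value unchanged
                res = self._prev
                self._init = False
            elif x.b == 0:
                res = x.nums - self._prev
            else:
                res = x.nums + self._prev
            self._prev = res
            return res

    #df = pd.read_clipboard()
    f = Func(20)
    df['result'] = df.apply(f, axis=1)

You can replace the body of __call__ with whatever you want from the body of some_calc_func.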
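The answer mentions a closure as the alternative to a class but only shows the class. A minimal sketch of the closure variant could look like the following. The names make_func and f are hypothetical, the arithmetic is copied from the class above, and the sample data is taken from the tables in this thread. It assumes a pandas version in which apply calls the function exactly once per row.

```python
import pandas as pd


def make_func(value):
    # State lives in the enclosing scope instead of on an instance
    prev = value
    first = True

    def f(row):
        nonlocal prev, first
        if first:
            # First row: return the initial value unchanged
            first = False
            return prev
        if row["b"] == 0:
            res = row["nums"] - prev
        else:
            res = row["nums"] + prev
        prev = res
        return res

    return f


df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0],
                   "b": [1, 0, 1, 1, 0]})
df["result"] = df.apply(make_func(20.0), axis=1)
print(df)
```

As with the class, a fresh function has to be created with make_func before each apply call, because the stored state is not reset automatically.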


守着一只汪

I realize this is what @Prodipta's answer was getting at, but this approach uses the global keyword to remember the previous result at each iteration of apply:

    prev_result = 20

    def my_calc(row):
        global prev_result
        i = int(row.name)   # the index of the current row
        if i == 0:
            return prev_result
        elif row['b'] == 1:
            out = prev_result * df.loc[i-1, 'nums'] / 2   # loc to get prev_num
        else:
            out = df.loc[i-1, 'nums'] + 17
        prev_result = out
        return out

    df['result'] = df.apply(my_calc, axis=1)

Result for your sample data:

       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    37.0
    2  30.0  1   407.0
    3  29.1  1  6105.0
    4  20.0  0    46.1

Here is a speed test in the style of @Ben T's answer. Not the best, but not the worst either?

    In[0]
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'nums': np.random.randint(0, 100, 5000),
                       'b': np.random.choice([0, 1], 5000)})
    prev_result = 20

    %%timeit
    df['result'] = df.apply(my_calc, axis=1)

    Out[0]
    117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


临摹微笑

Reusing your loop and some_calc_func

I am using your loop, reduced to the bare minimum, as follows:

    for i in range(1, len(df)):
        df.loc[i, 'result'] = some_calc_func(df.loc[i, 'b'], df.loc[i - 1, 'result'], df.loc[i, 'nums'])

and some_calc_func is implemented as follows:

    def some_calc_func(bval, prev_result, curr_num):
        if bval == 0:
            return prev_result + curr_num
        else:
            return prev_result - curr_num

The result is:

       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    42.0
    2  30.0  1    12.0
    3  29.1  1   -17.1
    4  20.0  0     2.9
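Because this particular some_calc_func only adds or subtracts the current nums value, the loop amounts to a signed running sum, the same idea as the cumsum answer earlier in this thread. The sketch below expresses it with NumPy under that assumption. It would not carry over to rules that multiply by the previous result. The variable names signs and steps are hypothetical, and the sample data is copied from the table above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0],
                   "b": [1, 0, 1, 1, 0]})
first_result = 20.0  # starting value for row 0, as in the question's data

# +nums where b == 0, -nums where b == 1 (mirrors the two branches of some_calc_func)
signs = np.where(df["b"].to_numpy() == 0, 1.0, -1.0)
steps = signs * df["nums"].to_numpy()

# Row 0 is not computed from a previous row, so it just contributes the starting value
steps[0] = first_result

df["result"] = np.cumsum(steps)
print(df)
```

This avoids per-row df.loc access entirely, which is usually the main cost of the loop version.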
