加快 pandas 迭代,以测试后续元素的条件

给定一个具有三列 (C1,C2,C3) 和一系列等长正数 (coeff) 的 pandas 数据帧,我正在计算第四列 C4,如下所示


def event(data, coeff, rate_low=2, rate_high=2):


    bot_col_name = 'C4'


    data[bot_col_name] = -1


    I = data.index 

    for k in range(len(I)-1):

        i = I[k]


        next_val = data.at[ I[k+1], 'C1']

        c = coeff.at[i]


        low_bound = next_val - rate_low*c

        high_bound = next_val + rate_high*c


        for j in range(k+1, len(data)):

            if data.at[ I[j], 'C2'] < low_bound:

                data.at[i, bot_col_name] = 0 

                break


            if data.at[ I[j], 'C3'] >= high_bound:

                data.at[i, bot_col_name] = 1 

                break

    return data

换句话说,给定一行,我们计算某个上限和下限,然后根据我们首先达到 C2 下的上限还是 C3 上的下限来设置相应的行元素。


作为一个例子,考虑熊猫表 D


   C1  C2  C3

0  2   5   5

1  10  12   2

2   8   3  17 

3  30  25   3

现在如果 coeff = [3,3,5,7] 那么在计算第一行的值时,low_bound 为 10-2*3=4,high_bound 为 10+2*3=16。我们现在必须找到最小索引 i>0 使得 D.loc[i, 'C2'] < 4 或 D.loc[i,'C3'] >= 16。我们看到第一个这样的 i 是 1 并且因为这恰好满足第一个条件,所以我们将这一行的新列设置为 0。


不幸的是,上述解决方案效率很低。我已经尝试通过向后计算值并尝试缓存结果来优化它(有时可以从“过去”值推断出 C4 的值),但不幸的是它并没有明显更好。


根据我的经验,获得最大性能的最佳方法是尝试在 pandas 框架内尽可能多地表达。


是否有任何有意义的方式可以优化上述代码?


编辑。使用已接受答案的代码并替换以下函数可获得最佳结果。


@njit

def get_c4(low_bound, high_bound, c2, c3):

    r1 = np.argwhere( c2 < low_bound )

    r2 = np.argwhere( c3 >= high_bound )


    if len(r1) == 0 and len(r2) == 0:

        return -1

    elif len(r1) == 0:

        return 1

    elif len(r2) == 0:

        return 0


    return int (r1[0] > r2[0])


红颜莎娜
浏览 129回答 2
2回答

人到中年有点甜

如果你真的需要一个快速的解决方案,你应该使用numba。numba的替代方案是cython。两者都编译你的 python 代码c以使其更快,但我认为numba更简单,它们或多或少具有相同的性能。将代码编译到c/Fortran是numpy / pandas内部函数如此快速的原因。更多信息请参阅pandas 文档。让我们首先创建示例:import numpy as npimport pandas as pdfrom numba import njitdf = pd.DataFrame({&nbsp; &nbsp; 'C1': [2, 10, 8, 30],&nbsp; &nbsp; 'C2': [5, 12, 3, 25],&nbsp; &nbsp; 'C3': [5, 2, 17, 3]})coeff = pd.Series([3, 3, 5, 7])然后通过转换为numba我们得到答案的代码:@njitdef event_v(data, coeff, rate_low=2, rate_high=2):&nbsp; &nbsp; out = -np.ones(len(data), dtype=np.int8)&nbsp; &nbsp; for k in range(len(data) - 1):&nbsp; &nbsp; &nbsp; &nbsp; next_val = data[k + 1, 0]&nbsp; &nbsp; &nbsp; &nbsp; c = coeff[k]&nbsp; &nbsp; &nbsp; &nbsp; low_bound = next_val - rate_low * c&nbsp; &nbsp; &nbsp; &nbsp; high_bound = next_val + rate_high * c&nbsp; &nbsp; &nbsp; &nbsp; for j in range(k + 1, len(data)):&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if data[j, 1] < low_bound:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; out[k] = 0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; break&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if data[j, 2] >= high_bound:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; out[k] = 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; break&nbsp; &nbsp; return outdf["C4"] = event_v(df.values, coeff.values)测试 10 000 行:n = 10_000df = pd.DataFrame(np.random.randint(30, size=[n, 3]), columns=["C1", "C2", "C3"])coeff = pd.Series(np.random.randint(10, size=n))%timeit event_v(df.values, coeff.values)3.39 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)%timeit event(df, coeff) # Code from the question28.4 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)它快了大约8500 倍测试 1 000 000 行:n = 1_000_000df = pd.DataFrame(np.random.randint(30, size=[n, 3]), columns=["C1", "C2", "C3"])coeff = pd.Series(np.random.randint(10, size=n))%timeit event_v(df.values, coeff.values)27.6 s ± 1.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)我尝试使用问题的代码运行它,但超过 2 小时%timeit后没有完成。

Helenr

next_value、low_bound 和 high_bound 可以很容易地向量化,并且它们的计算速度非常快。第二部分不容易矢量化,因为它可能需要扫描整个数组的每一行。通过在 numpy 数组中进行比较,可以获得对您的实现的轻微改进(对于较大的 n 变得更加重要)。def get_c4(low_bound, high_bound, c2, c3):&nbsp; &nbsp; for idx in range(len(c2)):&nbsp; &nbsp; &nbsp; &nbsp; if c2[idx] < low_bound:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; return 0&nbsp; &nbsp; &nbsp; &nbsp; if c3[idx] >= high_bound:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; return 1&nbsp; &nbsp; return -1def event_new(data: pd.DataFrame, coeff, rate_low=2, rate_high=2):&nbsp; &nbsp; data['next_val'] = data['C1'].shift(periods=-1).ffill().astype('int')&nbsp; &nbsp; data['low_bound'] = (data['next_val'] - rate_low * coeff).astype('int')&nbsp; &nbsp; data['high_bound'] = (data['next_val'] + rate_high * coeff).astype('int')&nbsp; &nbsp; c2 = data['C2'].to_numpy()&nbsp; &nbsp; c3 = data['C3'].to_numpy()&nbsp; &nbsp; data['C4'] = data.apply(lambda x: get_c4(x.low_bound, x.high_bound, c2[data.index.get_loc(x) + 1:], c3[data.index.get_loc(x) + 1:]), axis=1)&nbsp; &nbsp; data.drop(columns=['next_val', 'low_bound', 'high_bound'])&nbsp; &nbsp; return data基准代码:for n in [1e2, 1e3, 1e4, 1e5, 1e6]:&nbsp; &nbsp; n = int(n)&nbsp; &nbsp; df = pd.DataFrame({'C1': random_list(n=n), 'C2': random_list(n=n), 'C3': random_list(n=n)})&nbsp; &nbsp; coeff = pd.Series(random_list(start=2, stop=7, n=n))&nbsp; &nbsp; print(f"n={n}:")&nbsp; &nbsp; print(f"Time org: {timeit.timeit(lambda: event(df.copy(), coeff), number=1):.3f} seconds")&nbsp; &nbsp; print(f"Time new: {timeit.timeit(lambda: event_new(df.copy(), coeff), number=1):.3f} seconds")输出:n=100:Time org: 0.007 secondsTime new: 0.012 secondsn=1000:Time org: 0.070 secondsTime new: 0.048 secondsn=10000:Time org: 0.854 secondsTime new: 0.493 secondsn=100000:Time org: 7.565 secondsTime new: 4.456 secondsn=1000000:Time org: 216.408 secondsTime new: 45.199 seconds
打开App,查看更多内容
随时随地看视频慕课网APP

相关分类

Python