
Binary shift on a pandas Series

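I have some boolean variables in a pandas DataFrame and I need to get all the unique tuples. My idea is to create a new column that concatenates the values of those variables, and then call pandas.Series.unique() on it to get all the unique tuples.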


So my idea is to do the concatenation with a binary expansion. For example, for the DataFrame:


import pandas as pd

df = pd.DataFrame({'v1':[0,1,0,0,1],'v2':[0,0,0,1,1], 'v3':[0,1,1,0,1], 'v4':[0,1,1,1,1]})

I can create the column like this:


df['added'] = df['v1'] + df['v2']*2 + df['v3']*4 + df['v4']*8

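My idea was to iterate over the list of variables like this (note that in my real problem I do not know the number of columns in advance):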


variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var] << ind

However, this raises an error: "TypeError: unsupported operand type(s) for <<: 'Series' and 'int'".
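For reference, two things are going on in the failing line: `+` binds more tightly than `<<`, so the expression is parsed as `(df['added'] + df[var]) << ind`, and the error message indicates that the Series in use does not support the `<<` operator. A minimal sketch of a vectorized loop, assuming the intended encoding is the one from the explicit formula above (v1 + 2*v2 + 4*v3 + 8*v4) and using NumPy's `np.left_shift` in place of the operator:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [0, 1, 0, 0, 1], 'v2': [0, 0, 0, 1, 1],
                   'v3': [0, 1, 1, 0, 1], 'v4': [0, 1, 1, 1, 1]})
variables = ['v1', 'v2', 'v3', 'v4']

# Start from an integer Series of zeros aligned with df's index.
added = pd.Series(0, index=df.index, dtype='int64')
for ind, var in enumerate(variables):
    # np.left_shift works element-wise on the whole column at once.
    # Calling it as a function also sidesteps the `+` vs `<<` precedence issue.
    added = added + np.left_shift(df[var], ind)

df['added'] = added
```

Because the loop runs over `variables`, it does not need to know the number of columns in advance.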


I can solve my problem with pandas.Series.apply():


variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var].apply(lambda x: x << ind)

However, apply is (usually) slow. How can I do this more efficiently?


RISEBY
206 views, 3 answers
3 Answers

肥皂起泡泡

Use a dot product with the powers of two. It is only simplified here because the ordering is swapped (the first column is the least significant bit):

import numpy as np

df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
print (df)

   v1  v2  v3  v4  new
0   0   0   0   0    0
1   1   0   1   1   13
2   0   0   1   1   12
3   0   1   0   1   10
4   1   1   1   1   15

Performance with 1000 rows and 4 columns:

np.random.seed(2019)
N = 1000
df = pd.DataFrame(np.random.choice([0,1], size=(N, 4)))
df.columns = [f'v{x+1}' for x in df.columns]

In [60]: %%timeit
    ...: df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
113 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Yuca's solution (the list comprehension):

In [65]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:     df['added'] = df['added'] + [x<<ind for x in df[var]]
    ...:
1.82 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Original solution:

In [66]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:     df['added'] = df['added'] + df[var].apply(lambda x : x << ind )
    ...:
3.14 ms ± 8.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
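Since the goal in the question is the set of unique tuples, the integer codes can be deduplicated and then decoded back into tuples. The following is a minimal sketch of that round trip; the sample data (chosen to contain repeated rows) and the names `weights`, `codes`, `bits` and `unique_tuples` are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with repeated rows, so that deduplication has an effect.
df = pd.DataFrame({'v1': [0, 1, 0, 1, 0], 'v2': [0, 0, 0, 0, 1],
                   'v3': [0, 1, 0, 1, 0], 'v4': [0, 1, 0, 1, 1]})
variables = list(df.columns)
shifts = np.arange(len(variables))

weights = 1 << shifts                           # [1, 2, 4, 8]
codes = df[variables].to_numpy().dot(weights)   # one integer code per row
unique_codes = np.unique(codes)                 # sorted distinct codes

# Decode: bit i of each code is the value of variables[i].
bits = (unique_codes[:, None] >> shifts) & 1
unique_tuples = [tuple(int(b) for b in row) for row in bits]
print(unique_tuples)
```

This encoding relies on every column holding only 0 or 1, and with 64-bit integer weights it is only safe for up to roughly 63 columns. For wider frames a different approach would be needed.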

守着一只汪

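Getting the unique rows is the same operation as drop_duplicates. (It finds the rows that repeat an earlier row and drops them, so only one copy of each distinct row is left.)

df[["v2","v3","v4"]].drop_duplicates()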
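As a sketch of how this could be applied to every variable at once, assuming the column names are held in a list as in the question (the sample data and the name `unique_tuples` are illustrative):

```python
import pandas as pd

# Hypothetical sample with repeated rows.
df = pd.DataFrame({'v1': [0, 1, 0, 1, 0], 'v2': [0, 0, 0, 0, 1],
                   'v3': [0, 1, 0, 1, 0], 'v4': [0, 1, 0, 1, 1]})
variables = ['v1', 'v2', 'v3', 'v4']

# drop_duplicates keeps the first occurrence of each distinct row by default.
unique_rows = df[variables].drop_duplicates()

# Convert the remaining rows to plain tuples, without the index.
unique_tuples = list(unique_rows.itertuples(index=False, name=None))
print(unique_tuples)
```

This version needs no integer encoding, so it does not depend on the number of columns or on the values being 0 or 1.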

慕桂英3389331

In answer to your question about a more efficient alternative, I found that a list comprehension does help:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]) :
    %timeit df['added'] = df['added'] + [x<<ind for x in df[var]]

308 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
322 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
316 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So about 315 µs, versus:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]) :
    %timeit df['added'] = df['added'] + df[var].apply(lambda x : x << ind )

500 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
503 µs ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
481 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As a disclaimer, I don't agree with the value of the sum, but that's a different topic :)
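The answer does not say what is wrong with the sum. One possible reading, offered here as an assumption, is that `enumerate(variables[1:])` starts counting at 0, so v2 is shifted by 0 bits and receives the same weight as v1. The sketch below compares the two numberings on a tiny hypothetical frame; the helper `encode` and its `first_shift` parameter are illustrative names only.

```python
import pandas as pd

# Two hypothetical rows that differ only in v1 and v2.
df = pd.DataFrame({'v1': [1, 0], 'v2': [0, 1], 'v3': [0, 0], 'v4': [0, 0]})
variables = ['v1', 'v2', 'v3', 'v4']

def encode(frame, first_shift):
    """Sum the columns after shifting, numbering the shifts from first_shift."""
    total = frame[variables[0]].copy()
    for ind, var in enumerate(variables[1:], start=first_shift):
        total = total + [x << ind for x in frame[var]]
    return total

as_posted = encode(df, first_shift=0)  # v2 shifted by 0, as in the loops above
shifted = encode(df, first_shift=1)    # v2 shifted by 1: v1 + 2*v2 + 4*v3 + 8*v4

print(as_posted.tolist())  # both rows get the same code
print(shifted.tolist())    # the rows get distinct codes
```

If that reading is right, passing `start=1` to `enumerate` would be enough to make the loop agree with the explicit formula in the question.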