明月笑刀无情
Use get_dummies, then try dot:

    df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
    Out[307]: 
    cat        1,3,4
    dog        1,2,4
    dolphin      3,5
    hamster        5
    dtype: object

If the result should follow the order of the animals list, add reindex:

    df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
    Out[308]: 
    cat        1,3,4
    dog        1,2,4
    hamster        5
    dolphin      3,5
    dtype: object
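To make the dot trick less opaque, here is a minimal sketch that spells out the intermediate steps on the five-row sample from this thread. The variable names (`dummies`, `labels`, `joined`) are illustrative, not part of the answer. The dot product is done explicitly with NumPy on object arrays, which is my reading of what the one-liner reduces to: multiplying a string by 1 keeps it, multiplying by 0 gives an empty string, and the "sum" concatenates.

```python
import numpy as np
import pandas as pd

# Sample data as shown elsewhere in this thread
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "animals": ["dog,cat", "dog", "cat,dolphin", "cat,dog", "hamster,dolphin"],
})

# 0/1 indicator matrix: one row per record, one column per animal
# (columns come out in sorted order)
dummies = df["animals"].str.get_dummies(",")

# Each id as a string with a trailing comma: "1,", "2,", ...
labels = (df["id"].astype(str) + ",").to_numpy(dtype=object)

# Object-dtype dot product: 1 * "3," == "3,", 0 * "3," == "",
# and adding strings concatenates them
joined = np.dot(dummies.T.to_numpy().astype(object), labels)

# Drop the trailing comma left over from the concatenation
result = pd.Series(joined, index=dummies.columns).str[:-1]
print(result)
```

The indicator matrix has one column per distinct animal, so this approach gets expensive once there are many distinct items.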
牧羊人nacy
A NumPy-based one for performance -

    import numpy as np
    import pandas as pd

    def list_occ(df):
        # Note: relies on a list named `animals` defined in the enclosing scope
        id_col = 'id'
        item_col = 'animals'

        # Map every comma-separated token to its position in `animals`
        sidx = np.argsort(animals)
        s = [i.split(',') for i in df[item_col]]
        d = np.concatenate(s)
        p = sidx[np.searchsorted(animals, d, sorter=sidx)]

        # Number of occurrences per animal
        C = np.bincount(p, minlength=len(animals))

        # Row index of each token, then ids ordered by animal position
        l = list(map(len, s))
        r = np.repeat(np.arange(len(l)), l)
        v = df[id_col].values[r[np.lexsort((r, p))]]

        # Split the ordered ids into one group per animal
        out = pd.DataFrame({'ids': np.split(v, C[:-1].cumsum())}, index=animals)
        return out

Sample run -

    In [41]: df
    Out[41]: 
       id          animals
    0   1          dog,cat
    1   2              dog
    2   3      cat,dolphin
    3   4          cat,dog
    4   5  hamster,dolphin

    In [42]: animals
    Out[42]: ['cat', 'dog', 'hamster', 'dolphin']

    In [43]: list_occ(df)
    Out[43]: 
                   ids
    cat      [1, 3, 4]
    dog      [1, 2, 4]
    hamster        [5]
    dolphin     [3, 5]

Benchmarking

Using the given sample and simply scaling up the number of items.

    # Setup
    N = 100 # scale factor
    s = [i.split(',') for i in df['animals']]
    df_big = pd.DataFrame({'animals':[[j+str(ID) for j in i] for i in s for ID in range(1,N+1)]})
    df_big['id'] = range(1, len(df_big)+1)
    animals = np.unique(np.concatenate(df_big.animals)).tolist()
    df_big['animals'] = [','.join(i) for i in df_big.animals]
    df = df_big

Timings -

    # Using given df & scaling it up by replicating elems with progressive IDs
    In [9]: N = 100 # scale factor
       ...: s = [i.split(',') for i in df['animals']]
       ...: df_big = pd.DataFrame({'animals':[[j+str(ID) for j in i] for i in s for ID in range(1,N+1)]})
       ...: df_big['id'] = range(1, len(df_big)+1)
       ...: animals = np.unique(np.concatenate(df_big.animals)).tolist()
       ...: df_big['animals'] = [','.join(i) for i in df_big.animals]
       ...: df = df_big

    # @BEN_YO's soln-1
    In [10]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
    163 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    # @BEN_YO's soln-2
    In [11]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
    166 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    # @Andy L.'s soln
    %timeit (df.astype(str).assign(animals=df.animals.str.split(',')).explode('animals').groupby('animals').id.agg(','.join).reset_index())
    13.4 ms ± 74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [12]: %timeit list_occ(df)
    2.81 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
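For readers who find the NumPy version dense, the grouping idea can be walked through on the small sample without pandas. This is only a sketch under my own naming (`tokens`, `flat`, `row_of`, `pos`, `counts`, `order` are illustrative), and it uses a stable `np.argsort` in place of `np.lexsort`. That gives the same ordering here only because the row indices are already non-decreasing.

```python
import numpy as np

animals = ["cat", "dog", "hamster", "dolphin"]
ids = np.array([1, 2, 3, 4, 5])
rows = ["dog,cat", "dog", "cat,dolphin", "cat,dog", "hamster,dolphin"]

# Split each row, flatten, and remember which row each token came from
tokens = [r.split(",") for r in rows]
flat = np.concatenate(tokens)
row_of = np.repeat(np.arange(len(rows)), [len(t) for t in tokens])

# Position of each token in `animals`: searchsorted needs sorted input,
# so search through the sorter and map back to the original positions
sidx = np.argsort(animals)
pos = sidx[np.searchsorted(animals, flat, sorter=sidx)]

# How many ids belong to each animal
counts = np.bincount(pos, minlength=len(animals))

# Stable sort by animal position keeps ids in row order within each group
order = np.argsort(pos, kind="stable")
grouped = np.split(ids[row_of[order]], counts[:-1].cumsum())

result = {a: g.tolist() for a, g in zip(animals, grouped)}
print(result)
# {'cat': [1, 3, 4], 'dog': [1, 2, 4], 'hamster': [5], 'dolphin': [3, 5]}
```

One caveat: `np.searchsorted` returns an insertion point rather than an exact match, so this relies on every token actually being present in `animals`. A token outside the list would be attributed to the wrong group or raise an index error.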