How can I make text processing in a pandas DataFrame column faster for large text data?

3 Answers

千万里不及你

You can split the comma-separated column into lists with str.split, turn every list element into its own row with explode, count occurrences with groupby on the name and the exploded word, reshape the result into the desired wide format with unstack or pivot_table, and finish by cleaning up the multi-level columns with droplevel(), reset_index(), and so on.

Everything below is a vectorized pandas method, so hopefully it is fast.

Note: the three columns are named [0, 1, 2] in the code below because I read the data from the clipboard and passed header=None.

Input:

    import pandas as pd

    df = pd.DataFrame({
        0: {0: 'john', 1: 'tom', 2: 'mary', 3: 'john', 4: 'mary'},
        1: {0: '12-02-1999', 1: '12-02-1999', 2: '12-03-1999',
            3: '12-02-2000', 4: '12-03-2000'},
        2: {0: 'hello#,there#,how#,are#,you#,tom$ ',
            1: 'hey#,john$,hows#, it#, goin#',
            2: "hello#,boys#,fancy#,meetin#,ya'll#,here#",
            3: 'well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$',
            4: 'catch#,you#,on#,the#,flipside#,tom$,and#,john$'},
    })

Code:

    # Remove the '#' and '$' markers, then split each string into a list of words
    df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')
    df = (df.explode(2)                  # one row per (name, word)
          .groupby([0, 2])[2].count()    # occurrences of each word per name
          .rename('Count')
          .reset_index()
          .set_index([0, 2])
          .unstack(1)                    # words become columns
          .fillna(0))
    df.columns = df.columns.droplevel()  # drop the leftover 'Count' level
    df = df.reset_index()
    df

    Out[1]:
    2     0   goin   it   mary  and  are  been  boys  catch  catching  ...   on  \
    0  john    0.0  0.0    1.0  1.0  1.0   1.0   0.0    0.0       1.0  ...  0.0
    1  mary    0.0  0.0    0.0  1.0  0.0   0.0   1.0    1.0       0.0  ...  1.0
    2   tom    1.0  1.0    0.0  0.0  0.0   0.0   0.0    0.0       0.0  ...  0.0

    2  the  there  tom  tom    up  well  with  ya'll  you
    0  0.0    1.0  0.0   1.0  1.0   1.0   1.0    0.0  2.0
    1  1.0    0.0  1.0   0.0  0.0   0.0   0.0    1.0  1.0

The words are not stripped of whitespace here, so ' goin', ' it', ' mary' and 'tom ' (with a leading or trailing space) show up as their own columns, which is why tom appears twice.

You can also use .pivot_table instead of .unstack(), which saves you the line df.columns = df.columns.droplevel():

    df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')
    df = (df.explode(2)
          .groupby([0, 2])[2].count()
          .rename('Count')
          .reset_index()
          .pivot_table(index=0, columns=2, values='Count')
          .fillna(0)
          .astype(int)
          .reset_index())
    df

    Out[45]:
    2     0   goin   it   mary  and  are  been  boys  catch  catching  ...  on  \
    0  john      0    0      1    1    1     1     0      0         1  ...   0
    1  mary      0    0      0    1    0     0     1      1         0  ...   1
    2   tom      1    1      0    0    0     0     0      0         0  ...   0

    2  the  there  tom  tom   up  well  with  ya'll  you
    0    0      1    0     1   1     1     1      0    2
    1    1      0    1     0   0     0     0      1    1
    2    0      0    0     0   0     0     0      0    0

    [3 rows x 31 columns]
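The duplicated tom column in the output above comes from whitespace left around some words. As a minimal sketch of the same explode-and-count idea with the words stripped first, assuming the columns are named user and words (hypothetical names; this answer uses 0, 1, 2), something like the following should give one column per distinct word:

```python
import pandas as pd

# Sample data from the question; the column names "user" and "words" are assumptions.
df = pd.DataFrame({
    "user": ["john", "tom", "mary", "john", "mary"],
    "words": [
        "hello#,there#,how#,are#,you#,tom$ ",
        "hey#,john$,hows#, it#, goin#",
        "hello#,boys#,fancy#,meetin#,ya'll#,here#",
        "well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$",
        "catch#,you#,on#,the#,flipside#,tom$,and#,john$",
    ],
})

# Remove the '#' and '$' markers, split on commas, and give every word its own row.
tokens = (
    df.assign(word=df["words"].str.replace(r"[#$]", "", regex=True).str.split(","))
      .explode("word", ignore_index=True)
)

# Strip surrounding whitespace so ' it' and 'it' are counted together,
# and drop any empty strings left behind by stray commas.
tokens["word"] = tokens["word"].str.strip()
tokens = tokens[tokens["word"] != ""]

# Count (user, word) pairs and pivot the words into columns.
counts = tokens.groupby(["user", "word"]).size().unstack(fill_value=0)
print(counts)
```

Using size().unstack(fill_value=0) keeps the counts as integers, so no fillna/astype step is needed afterwards.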


九州编程

If you can use scikit-learn, this is easy with CountVectorizer:

    from sklearn.feature_extraction.text import CountVectorizer

    # Remove '#', '$' and whitespace; regex=True treats the pattern as a regular expression
    s = df['words'].str.replace(r"#|\$|\s+", "", regex=True)
    # Tokenize on commas instead of the default word pattern
    model = CountVectorizer(tokenizer=lambda x: x.split(','))

    # One row per input row, then sum the rows of each user;
    # sort=False keeps the users in order of first appearance
    df_final = (pd.DataFrame(model.fit_transform(s).toarray(),
                             columns=model.get_feature_names_out(),
                             index=df.user)
                .groupby(level=0, sort=False).sum())

    Out[279]:
          and  are  been  boys  catch  catching  fancy  flipside  goin  hello  \
    user
    john    1    1     1     0      0         1      0         0     0      1
    tom     0    0     0     0      0         0      0         0     1      0
    mary    1    0     0     1      1         0      1         1     0      1

          here  hey  how  hows  it  its  john  mary  meetin  nice  on  the  there  \
    user
    john     0    0    1     0   0    1     0     1       0     1   0    0      1
    tom      0    1    0     1   1    0     1     0       0     0   0    0      0
    mary     1    0    0     0   0    0     1     0       1     0   1    1      0

          tom  up  well  with  ya'll  you
    user
    john    1   1     1     1      0    2
    tom     0   0     0     0      0    0
    mary    1   0     0     0      1    1


江户川乱折腾

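I'm not sure how fast this approach is on a large DataFrame, but you can give it a try. First, remove the special characters and split the strings into lists of words, which gives you another column:

    from itertools import chain
    from collections import Counter

    # regex=True treats "#|\$" as a regular expression
    df['lists'] = df['words'].str.replace(r"#|\$", "", regex=True).str.split(",")

Now group by user, chain each user's lists into one, and count the occurrences with Counter:

    (df.groupby('user')['lists'].apply(chain.from_iterable)  # one iterator of words per user
       .apply(Counter)                                       # word -> count per user
       .apply(pd.Series)                                     # expand each Counter into columns
       .fillna(0).astype(int))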
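Expanding each Counter with apply(pd.Series) tends to be one of the slower steps on big frames. A minimal sketch of the same Counter idea that builds the table in one go, assuming the columns are named user and words and stripping whitespace from each word (which the code above does not do), might look like this:

```python
from collections import Counter

import pandas as pd

# Sample data from the question; the column names "user" and "words" are assumptions.
df = pd.DataFrame({
    "user": ["john", "tom", "mary", "john", "mary"],
    "words": [
        "hello#,there#,how#,are#,you#,tom$ ",
        "hey#,john$,hows#, it#, goin#",
        "hello#,boys#,fancy#,meetin#,ya'll#,here#",
        "well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$",
        "catch#,you#,on#,the#,flipside#,tom$,and#,john$",
    ],
})

# Keep one Counter per user and update it row by row in plain Python.
counters = {}
for user, text in zip(df["user"], df["words"]):
    cleaned = text.replace("#", "").replace("$", "")
    words = (w.strip() for w in cleaned.split(","))
    counters.setdefault(user, Counter()).update(w for w in words if w)

# Build the wide table once: users as rows, words as columns, missing counts as 0.
counts = pd.DataFrame.from_dict(counters, orient="index").fillna(0).astype(int)
print(counts)
```

The loop is not vectorized, but it only touches each string once and avoids creating a Series per user, so it may be worth benchmarking against the groupby version.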
