Classifying and counting unique elements based on conditions across columns

Hello, I am using Python to classify some data:


Articles                                       Filename

A New Marine Ascomycete from Brunei.    Invasive Species.csv

A new genus and four new species        Forestry.csv

A new genus and four new species        Invasive Species.csv

I want to know how many unique "Articles" each "Filename" has, that is, articles that appear in that file and in no other.


So the output I want looks like this:


Filename                             Count_Unique

Invasive Species.csv                 1

Forestry.csv                         0

In addition, I would also like to get this output:


Filename1                        Filename2                         Count_Common articles

Forestry.csv                     Invasive Species.csv               1

I concatenated the datasets, but I only ended up counting the elements present in each "Filename".


Would anyone be willing to help? I have already tried unique(), drop_duplicates(), and so on, but I cannot seem to get the output I want.


Anyway, here are the last few lines of my code:


concatenated = pd.concat(data, ignore_index=True)
concatenated.groupby(['Title', 'Filename']).count().reset_index()
res = {col: concatenated[col].value_counts() for col in concatenated.columns}
res['Filename']


一只斗牛犬
1 Answer

胡说叔叔

No magic, just some routine operations.

(1) Count the "unique" articles in each file

Edit: added (quick and dirty) code to include filenames whose count is zero.

# prevent repetitive counting
df = df.drop_duplicates()

# articles to be removed (the ones that appear more than once)
dup_articles = df["Articles"].value_counts()
dup_articles = dup_articles[dup_articles > 1].index

# remove duplicate articles and count
mask_dup_articles = df["Articles"].isin(dup_articles)
df_unique = df[~mask_dup_articles]
df_unique["Filename"].value_counts()
# N.B. every filename not shown here has a count of 0;
#      the zero counts are added in the next step.

Out[68]:
Invasive Species.csv    1
Name: Filename, dtype: int64

# unique article count with zeros
df_unique_nonzero_count = df_unique["Filename"].value_counts().to_frame().reset_index()
df_unique_nonzero_count.columns = ["Filename", "count"]
df_all_filenames = pd.DataFrame(
    data={"Filename": df["Filename"].unique()}
)

# join: all filenames with counted filenames
df_unique_count = df_all_filenames.merge(df_unique_nonzero_count, on="Filename", how="outer")

# postprocess
df_unique_count.fillna(0, inplace=True)
df_unique_count["count"] = df_unique_count["count"].astype(int)

# print
df_unique_count

Out[119]:
               Filename  count
0  Invasive Species.csv      1
1          Forestry.csv      0

(2) Count the articles shared between files

# pick out records containing duplicate articles
df_dup = df[mask_dup_articles]

# merge on articles and then discard self- and duplicate pairs
df_merge = df_dup.merge(df_dup, on=["Articles"], suffixes=("1", "2"))
df_merge = df_merge[df_merge["Filename1"] > df_merge["Filename2"]]  # alphabetical ordering

# count
df_ans2 = df_merge.groupby(["Filename1", "Filename2"]).count()
df_ans2.reset_index(inplace=True)  # optional
df_ans2

Out[70]:
              Filename1     Filename2  Articles
0  Invasive Species.csv  Forestry.csv         1
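For reference, the two steps above can also be folded into a pair of small functions. This is only a minimal sketch, not the answerer's code: the function names count_unique_articles and count_common_articles and the output column names Count_Unique and Count_Common are made up for illustration, and the sample frame is simply the three rows shown in the question. It assumes an article counts as "unique" when it appears in exactly one file.

```python
import pandas as pd

# Sample data copied from the question (three rows, two files).
df = pd.DataFrame({
    "Articles": [
        "A New Marine Ascomycete from Brunei.",
        "A new genus and four new species",
        "A new genus and four new species",
    ],
    "Filename": ["Invasive Species.csv", "Forestry.csv", "Invasive Species.csv"],
})


def count_unique_articles(df):
    """Hypothetical helper: per file, count articles found in no other file."""
    # One row per (article, file) pair, so repeats inside a file count once.
    pairs = df.drop_duplicates(["Articles", "Filename"])
    # For every row, the number of distinct files its article appears in.
    n_files = pairs.groupby("Articles")["Filename"].transform("nunique")
    counts = pairs[n_files == 1].groupby("Filename").size()
    # reindex brings back the files whose count is zero.
    all_files = pairs["Filename"].unique()
    return counts.reindex(all_files, fill_value=0).rename("Count_Unique")


def count_common_articles(df):
    """Hypothetical helper: count shared articles for every pair of files."""
    pairs = df.drop_duplicates(["Articles", "Filename"])
    # Self-merge on the article, giving every (file, file) combination.
    merged = pairs.merge(pairs, on="Articles", suffixes=("1", "2"))
    # Keep each unordered pair once and drop self-pairs.
    merged = merged[merged["Filename1"] < merged["Filename2"]]
    sizes = merged.groupby(["Filename1", "Filename2"]).size()
    return sizes.rename("Count_Common").reset_index()


unique_counts = count_unique_articles(df)
common = count_common_articles(df)
print(unique_counts)
print(common)
```

Using "<" rather than ">" in the pair filter puts the alphabetically first file in Filename1, which matches the column order of the output requested in the question; either comparison keeps each pair exactly once.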