Filtering pandas groups using the value of a column (string dtype)


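I have been working on a large genomics dataset that contains multiple reads per sample to make sure we capture the data. When analysing it, though, we need to reduce each sample to a single row so that we don't skew the results (counting a gene as present 6 times when it is really one instance read several times). Every row has an ID, so I used the pandas df.groupby() function on ID. Here is a table to illustrate what I am trying to do: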

# ID  | functionality | v_region_score | constant_region
# -----------------------------------------------------------------
# 123 | productive    | 820            | NaN
#     | unknown       | 720            | NaN
#     | unknown       | 720            | IgM
# 456 | unknown       | 690            | NaN
#     | unknown       | 670            | NaN
# 789 | productive    | 780            | IgM
#     | productive    | 780            | NaN

(Edit) Here is the code for the sample dataframe:

import numpy as np
import pandas as pd

# Sample data: several reads (rows) per ID
df1 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "unknown", 720, np.nan],
    [123, "unknown", 720, "IgM"],
    [789, "productive", 780, np.nan],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan],
    [456, "unknown", 670, np.nan]],
    columns=["ID", "functionality", "v_region_score", "constant_region"])

This would be the final output once the correct rows have been selected:

df2 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan]],
    columns=["ID", "functionality", "v_region_score", "constant_region"])

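So, after grouping, for each group: if it has a "productive" value in functionality, I want to keep that row; if it is all "unknown", I take the row with the highest v_region_score; and if there are several "productive" rows, I take the one that has some value in its constant_region.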

I have tried several ways of accessing these values:

id, frame = next(iter(df_grouped))
if frame["functionality"].equals("productive"):
    pass  # do something

Looking at just one group:

x = df_grouped.get_group("1:1101:10897:22442")
for index, value in x["functionality"].items():
    print(value)
# returns the correct value and type "str"

Even putting each group into a list:

new_groups = []
for id, frame in df_grouped:
    new_groups.append(frame)
# accessing a specific index returns a dataframe
new_groups[30]

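The error I get from all of these is "The truth value of a Series is ambiguous". I now understand why this doesn't work, but I can't use a.any(), a.all(), or a.bool() because of how complex the conditions are.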
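For readers unfamiliar with that error, here is a minimal sketch of where it comes from. The small `frame` below is a hypothetical stand-in for one group, not the real data. Comparing a column to a string gives one boolean per row, so the result has to be reduced to a single value (or used as a row mask) before it can drive an `if`.

```python
import pandas as pd

# Hypothetical stand-in for a single group from the groupby
frame = pd.DataFrame({
    "functionality": ["productive", "unknown", "unknown"],
    "v_region_score": [820, 720, 720],
})

# Element-wise comparison: a boolean Series with one value per row
mask = frame["functionality"] == "productive"

# Using the Series directly as a condition raises ValueError, because
# it is unclear whether "any row matches" or "all rows match" is meant
try:
    if mask:
        pass
    raised = False
except ValueError:
    raised = True

# Series.equals compares the whole Series against another object,
# so comparing with a plain string simply yields False
whole_series_equal = frame["functionality"].equals("productive")

# Reducing the mask to one bool gives a well-defined condition
has_productive = bool(mask.any())

# The same mask can also select rows instead of driving an `if`
productive_rows = frame[mask]
```

Here `raised` ends up `True`, while `has_productive` answers the question "does this group contain a productive row?" as a plain boolean.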

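Is there any way to select a specific row in each group based on the column values within that group? Sorry for such a complicated question, and thanks in advance! :)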

有只小跳蛙


1 Answer

眼眸繁星

You can approach the problem from a different angle:

1. Sort the values according to your criteria
2. Group by ID
3. Keep the first result of each ID group

For example:

df1 = df1.sort_values(['ID','functionality','v_region_score','constant_region'], ascending=[True,True,False,True], na_position='last')
df1.groupby('ID').first().reset_index()

Out[0]:
    ID functionality  v_region_score constant_region
0  123    productive             820             IgM
1  456       unknown             690             NaN
2  789    productive             780             IgM

In addition, if you want to fill in constant_region where it is null, you can use fillna(method='ffill') so that the values that do exist are kept:

## sorted here
df1['constant_region'] = df1.groupby('ID')['constant_region'].fillna(method='ffill')
df1

Out[1]:
    ID functionality  v_region_score constant_region
4  123    productive             820             NaN
2  123       unknown             720             IgM
1  123       unknown             720             IgM
5  456       unknown             690             NaN
6  456       unknown             670             NaN
0  789    productive             780             IgM
3  789    productive             780             IgM

## Group by here
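One detail worth noting about the output above: `groupby(...).first()` picks the first non-null value in each column separately, which is why ID 123 shows `IgM` even though its productive row has `NaN` there. The sketch below is a variation on the same sort-then-pick idea, not the answerer's exact method. It assumes the goal is to keep each chosen row intact (as in the question's `df2`) and uses `drop_duplicates` for that. The names `ordered`, `picked`, `mixed`, and `filled` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [
        [789, "productive", 780, "IgM"],
        [123, "unknown", 720, np.nan],
        [123, "unknown", 720, "IgM"],
        [789, "productive", 780, np.nan],
        [123, "productive", 820, np.nan],
        [456, "unknown", 690, np.nan],
        [456, "unknown", 670, np.nan],
    ],
    columns=["ID", "functionality", "v_region_score", "constant_region"],
)

# Same ordering as in the answer: "productive" sorts before "unknown",
# higher scores come first, and missing constant_region values go last
ordered = df1.sort_values(
    ["ID", "functionality", "v_region_score", "constant_region"],
    ascending=[True, True, False, True],
    na_position="last",
)

# Keep the first whole row per ID, so values are never mixed across rows
picked = ordered.drop_duplicates("ID", keep="first").reset_index(drop=True)

# For comparison: first() takes the first non-null value per column
mixed = ordered.groupby("ID").first().reset_index()

# Forward fill within each ID; .ffill() is the spelling that newer
# pandas versions prefer over fillna(method="ffill")
filled = ordered.groupby("ID")["constant_region"].ffill()
```

With this data, `picked` holds the same three rows as the question's `df2` (sorted by ID), whereas `mixed` carries `IgM` for ID 123. Which of the two behaviours is wanted depends on whether a value borrowed from another read of the same ID is acceptable.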
