如何使用 R 或 Python 合并两个 id 列,识别哪些行属于同一组相关 ID

我有 2 个独立创建/收集的 ID 列。我试图通过根据两个 ID 列中的任何一个确定哪些行是同一相关组 id 的一部分,将这两个 ID 列合并为一个列。我会根据一些规则考虑相关的行:


1:如果一个LOAN在多行中具有相同的值,则它们属于同一个组(示例中仅供参考)。我称之为loan_group。这里没有问题。


2:如果一个 COLLATERAL 在多行中具有相同的值,则它们属于临时组。我称它为lateral_group(与#1 相同的规则。)这里没有问题。


3:最后,我不确定如何准确表达这一点,但任何时候属于同一组的值之间存在重叠(跨贷款和抵押列),这些组应该进一步合并。例如:


df <- data.frame('LOAN' = c('L1', 'L2', 'L5', 'L2', 'L6', 'L7', 'L8'),

                 'COLLATERAL' = c('C1', 'C1', 'C8', 'C4', 'C8', 'C9', 'C4'))

df$laon_group <- as.numeric(factor(df$LOAN))

df$collateral_group <- as.numeric(factor(df$COLLATERAL))

df$final_grouping <- NA

LOAN  COLLATERAL  loan_group  collateral_group  final_grouping

----  ----------- ----------  ----------------  --------------

L1    C1*         1           1                 **1**

L2**  C1*         2           1                 **1**

L5    C8          3           2                 2

L2**  C4***       2           3                 **1**

L6    C8          4           2                 2

L7    C9          5           4                 3

L8    C4***       6           3                 **1**

*因为第 1 行和第 2 行的值都为 C1,所以它们将被分配到相同的最终分组中


**因为第 2 行的 LOAN 值为 L2,这意味着我们可以为第 4 行分配“1”的最终分组,因为该行可以通过 L2/C1 链接链接回第 1 行


***最后,因为第 4 行包含 COLLATERAL 值 C4,这意味着我们可以在合并的最终分组中包含第 7 行。该行可以通过 L2/C4 和 L2/C1 链接链接回第一行


该数据集大约是 15m 个独特的 LOAN + COLLATERAL 组合。在某些边缘情况下,这些组可能会交叉几千个(可能 +1 万个)ID。我在 BQ 测试一些解决方案时遇到了一些资源问题,包括我原来的问题中的建议,这就是为什么我想尝试在 R/Python 中执行此操作


一只甜甜圈
浏览 101回答 1
1回答

天涯尽头无女友

如果您将此视为图形问题,则可以执行以下操作:library(igraph)g <- make_empty_graph(directed = FALSE, n = nrow(tab))for (loan_id in unique(tab$loan)) {&nbsp; &nbsp; loan_idx = which(tab$loan == loan_id)&nbsp; &nbsp; if (length(loan_idx) >= 2) {&nbsp; &nbsp; &nbsp; &nbsp; g <- g + path(loan_idx)&nbsp; &nbsp; }}for (collateral_id in unique(tab$collateral)) {&nbsp; &nbsp; collateral_idx = which(tab$collateral == collateral_id)&nbsp; &nbsp; if (length(collateral_idx) >= 2) {&nbsp; &nbsp; &nbsp; &nbsp; g <- g + path(collateral_idx)&nbsp; &nbsp; }}tab$grouping = components(g)$membership即,您制作一个图表并在具有匹配贷款或抵押品 ID 的任何行之间添加边。我不确定这是如何优化的,因为forR 中的循环很少是正确的答案。输出与您的预期输出匹配:> tab&nbsp; loan collateral loan_group collateral_group final grouping1&nbsp; &nbsp;L1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;C1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp;1&nbsp; &nbsp; &nbsp; &nbsp; 12&nbsp; &nbsp;L2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;C1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp;1&nbsp; &nbsp; &nbsp; &nbsp; 13&nbsp; &nbsp;L5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;C8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2&nbsp; &nbsp; &nbsp;2&nbsp; &nbsp; &nbsp; &nbsp; 24&nbsp; &nbsp;L2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;C4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3&nbsp; &nbsp; &nbsp;1&nbsp; &nbsp; &nbsp; &nbsp; 15&nbsp; &nbsp;L6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;C8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2&nbsp; &nbsp; &nbsp;2&nbsp; &nbsp; &nbsp; &nbsp; 26&nbsp; &nbsp;L7&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;C9&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 5&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 4&nbsp; &nbsp; &nbsp;3&nbsp; &nbsp; &nbsp; &nbsp; 37&nbsp; &nbsp;L8&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;C4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 6&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3&nbsp; &nbsp; &nbsp;1&nbsp; &nbsp; &nbsp; &nbsp; 1
打开App,查看更多内容
随时随地看视频慕课网APP

相关分类

Python