如何匹配两个数据集中的模糊匹配字符串?

我一直在研究一种基于不完善的字符串(例如公司名称)来联接两个数据集的方法。过去,我必须匹配两个非常脏的列表,一个列表包含名称和财务信息,另一个列表包含名称和地址。都没有唯一的ID可以匹配!假设已经应用了清洁,并且可能存在打字和插入错误。


到目前为止,AGREP是我发现最有效的工具。我可以在AGREP包中使用levenshtein距离,该距离用于测量两个字符串之间的删除,插入和替换的数量。AGREP将返回距离最小(最相似)的字符串。


但是,我一直无法将命令从单个值转换为将其应用于整个数据帧。我已经粗略地使用了for循环来重复AGREP函数,但是总有一种更简单的方法。


请参见以下代码:


a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))

b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))


for (i in 1:6){

    a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))

    a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))

}


收到一只叮咚
浏览 1020回答 3
3回答

慕田峪7331174

这是使用该fuzzyjoin包装的解决方案。它使用类似dplyr语法,并stringdist作为模糊匹配的可能类型之一。如C8H10N4O2 所建议,stringdistmethod =“ jw”为您的示例创建最佳匹配。作为建议由dgrtwo,fuzzyjoin的开发商,我用了一个大max_dist,然后使用dplyr::group_by和dplyr::top_n只得到最小距离的最佳匹配。a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; price = c(10, 13, 2, 1, 15, 1))b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; qty = c(9, 99, 10))library(fuzzyjoin); library(dplyr);stringdist_join(a, b,&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; by = "name",&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; mode = "left",&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ignore_case = FALSE,&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; method = "jw",&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; max_dist = 99,&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; distance_col = "dist") %>%&nbsp; group_by(name.x) %>%&nbsp; top_n(1, -dist)#> # A tibble: 6 x 5#> # Groups:&nbsp; &nbsp;name.x [6]#>&nbsp; &nbsp;name.x price&nbsp; &nbsp; &nbsp;name.y&nbsp; &nbsp;qty&nbsp; &nbsp; &nbsp; &nbsp;dist#>&nbsp; &nbsp;<fctr> <dbl>&nbsp; &nbsp; &nbsp;<fctr> <dbl>&nbsp; &nbsp; &nbsp; <dbl>#> 1 Ace Co&nbsp; &nbsp; 10&nbsp; &nbsp; Ace Co.&nbsp; &nbsp; &nbsp;9 0.04761905#> 2&nbsp; Bayes&nbsp; &nbsp; 13 Bayes Inc.&nbsp; &nbsp; 99 0.16666667#> 3&nbsp; &nbsp; asd&nbsp; &nbsp; &nbsp;2&nbsp; &nbsp; &nbsp; &nbsp;asdf&nbsp; &nbsp; 10 0.08333333#> 4&nbsp; &nbsp; Bcy&nbsp; &nbsp; &nbsp;1 Bayes Inc.&nbsp; &nbsp; 99 0.37777778#> 5&nbsp; &nbsp;Baes&nbsp; &nbsp; 15 Bayes Inc.&nbsp; &nbsp; 99 0.20000000#> 6&nbsp; &nbsp;Bays&nbsp; &nbsp; &nbsp;1 Bayes Inc.&nbsp; &nbsp; 99 0.20000000
打开App,查看更多内容
随时随地看视频慕课网APP