I have a SequenceMatcher function that finds the closest match between:
- a string
- a list of strings
Code:
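from difflib import SequenceMatcher

def seq_match(text, values, min_match=10):
    # Track the best (value, score) pair seen so far.
    highest = (None, 0)
    for v in values:
        sm = SequenceMatcher(a=text, b=v, autojunk=False)
        # quick_ratio() is an upper bound on ratio(); scale it to 0-100.
        ratio = int(sm.quick_ratio() * 100)
        print(f'{text} : {v} : {ratio}')
        if ratio > min_match and ratio > highest[1]:
            highest = v, ratio
    return highest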
I also have a dataset:
# (text, value1, value2, value3...): expected_output
test_map = {
    # 1
    ('super delicious cat food', 'decent', 'delicious', 'super delicious'): 'super delicious',
    # 2
    ('salmon: does not contain real salmon', 'chicken', 'salmon', 'arctic salmon'): 'arctic salmon',
}
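While the #1 data is matched correctly, the #2 match treats the longer string 'arctic salmon' as a better match than plain 'salmon'. In other words, I'd like 'salmon' to score equal to or higher than 'arctic salmon'.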
Here are the full match results:
# correct
super delicious cat food : decent : 33
super delicious cat food : delicious : 54
super delicious cat food : super delicious : 76
salmon: does not contain real salmon : chicken : 18
salmon: does not contain real salmon : salmon : 28
# incorrect
salmon: does not contain real salmon : arctic salmon : 48
# expected
salmon: does not contain real salmon : arctic salmon : 28 or less
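Can I force SequenceMatcher to behave more sensibly here? How can I get the result I want? And why does 'arctic' contribute to the score at all?
I tried turning off autojunk, but it doesn't seem to have any effect.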
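For reference, here is a minimal sketch that puts quick_ratio() next to ratio() for the second test case. The difflib documentation describes quick_ratio() as an upper bound on ratio(), so the two scores may differ. The helper name compare_scores is hypothetical and exists only for this illustration.

```python
from difflib import SequenceMatcher


def compare_scores(text, values):
    """Return {value: (quick, full)} with both scores scaled to 0-100."""
    scores = {}
    for v in values:
        sm = SequenceMatcher(a=text, b=v, autojunk=False)
        quick = int(sm.quick_ratio() * 100)  # upper bound, as used in seq_match
        full = int(sm.ratio() * 100)         # score based on actual matching blocks
        scores[v] = (quick, full)
    return scores


if __name__ == '__main__':
    text = 'salmon: does not contain real salmon'
    candidates = ('chicken', 'salmon', 'arctic salmon')
    for value, (quick, full) in compare_scores(text, candidates).items():
        print(f'{value} : quick_ratio={quick} : ratio={full}')
```

As far as I can tell from CPython's difflib, quick_ratio() counts the characters the two strings have in common without regard to order, which may be why the letters of 'arctic' add to the score. The documentation also says the autojunk heuristic only applies to sequences of at least 200 items, so toggling it is probably not expected to change anything for strings this short.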