Running a fuzzywuzzy ratio over a single pandas column

I have a large collection of full names, for example:

datafile.csv:

full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012

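I am trying to use fuzz.ratio to check whether any of the names in the full_name column are similar to each other, but the code takes a very long time, mainly because of the nested for loops.

Sample code:

import pandas as pd
from fuzzywuzzy import fuzz

dataframe = pd.read_csv('datafile.csv')
_list = []

# compares every name with every name, including itself
for row1 in dataframe['full_name']:
    for row2 in dataframe['full_name']:
        x = fuzz.ratio(row1, row2)
        if x > 90:
            _list.append([row1, row2, x])

print(_list)

Is there a better way to iterate over a single pandas column to get the ratios for potential duplicates?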


慕婉清6462132
Views: 170, Answers: 4

4 Answers

宝慕林4294392

import pandas as pd
from io import StringIO
from fuzzywuzzy import process

s = """full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012"""

df = pd.read_csv(StringIO(s))

# 1 - use fuzzywuzzy.process.extract with a list comprehension
# 2 - you still have to iterate once, but this method avoids apply, which can be very slow
# 3 - convert the list comprehension results to a dataframe
# Note that I am limiting the results to one match. You can adjust the code as you see fit.
df2 = pd.DataFrame(
    [process.extract(df['full_name'][i], df[~df.index.isin([i])]['full_name'], limit=1)[0]
     for i in range(len(df))],
    index=df.index, columns=['match_name', 'match_percent', 'match_index'])

# join the new dataframe to the original
final = df.join(df2)
print(final)

Output:

      full_name         dob   match_name  match_percent  match_index
0   Jerry Smith  21/01/2010   Jery Smith             95            3
1   Morty Smith  18/06/2008  Morti Smith             91            4
2  Rick Sanchez  27/04/1993  Morti Smith             43            4
3    Jery Smith  27/12/2012  Jerry Smith             95            0
4   Morti Smith  13/03/2012  Morty Smith             91            1
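The "best match for each row" idea above can also be sketched without third-party packages. This is only an illustration under stated assumptions, not the answer's code: `difflib.SequenceMatcher` from the standard library is used as a stand-in scorer so the example runs anywhere, and `ratio` and `best_matches` are hypothetical helper names. In real code you would likely swap `ratio` for `fuzz.ratio` or keep `process.extract`.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer (assumption): 0-100 similarity in the spirit of fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches(names):
    """For each name, return (name, best other name, score, index of that match)."""
    results = []
    for i, name in enumerate(names):
        # Score against every other entry, skipping the row itself.
        candidates = [(ratio(name, other), j)
                      for j, other in enumerate(names) if j != i]
        score, j = max(candidates, key=lambda c: c[0])
        results.append((name, names[j], score, j))
    return results


names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]
for row in best_matches(names):
    print(row)
```

Like the answer's version, this still scores every name against every other name, so it is convenient for reporting the closest match per row rather than a way to reduce the number of comparisons.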

GCT1015

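There are generally two things that can help you improve performance: (1) reduce the number of comparisons, and (2) use a faster way to match the strings.

In your implementation you perform a lot of unnecessary comparisons, because you always compare A <-> B and later B <-> A. You also compare A <-> A, which is always 100. So you can cut the number of comparisons by more than 50%. Since you only want to add matches with a score over 90, this information can also be used to speed up each comparison. Your code can be changed as follows to apply both improvements, which should be a lot faster. When I tested on my machine, your code ran for about 12 seconds, while this improved version takes only 1.7 seconds.

import pandas as pd
from io import StringIO
from rapidfuzz import fuzz

# generate a bigger list of examples to show the performance benefits
s = "fullname,dob\n"
s += '''Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012
''' * 500

dataframe = pd.read_csv(StringIO(s))
_list = []

# only create the data series once
full_names = dataframe['fullname']

for index, row1 in full_names.items():
    # skip elements that are already compared
    # (relies on the default 0..n-1 index, so the label equals the position)
    for row2 in full_names.iloc[index + 1:]:
        # use a score_cutoff to improve the runtime for bad matches;
        # scores below the cutoff come back as 0
        score = fuzz.ratio(row1, row2, score_cutoff=90)
        if score:
            _list.append([row1, row2, score])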
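The "more than 50%" figure follows from simple counting: a nested loop over n names does n * n comparisons, while comparing each unordered pair once does n * (n - 1) / 2. A minimal sketch of that counting, using `itertools.combinations` to generate each pair exactly once, is shown here. It assumes `difflib.SequenceMatcher` as a stand-in scorer so it runs without extra packages; `ratio` and `similar_pairs` are hypothetical names, and the scorer is meant to be replaced by `fuzz.ratio`.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a, b):
    """Stand-in scorer (assumption); replace with fuzz.ratio in real code."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similar_pairs(names, threshold=90):
    """Compare each unordered pair once and keep those scoring above threshold."""
    pairs = []
    # combinations() yields (A, B) but never (B, A) or (A, A).
    for a, b in combinations(names, 2):
        score = ratio(a, b)
        if score > threshold:
            pairs.append((a, b, score))
    return pairs


names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]
n = len(names)
nested_count = n * n               # what the nested for loops do
pair_count = n * (n - 1) // 2      # what is actually needed
print(nested_count, pair_count)    # 25 10
print(similar_pairs(names))
```

For the five sample names this is 10 comparisons instead of 25, and the gap grows quadratically with the number of rows.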

慕码人8056858

You can first build a table of fuzzy scores:

import pandas as pd
from io import StringIO
from fuzzywuzzy import fuzz

data = StringIO("""
Jerry Smith
Morty Smith
Rick Sanchez
Jery Smith
Morti Smith
""")

df = pd.read_csv(data, names=['full_name'])

# add one score column per name
for index, row in df.iterrows():
    df[row['full_name']] = df['full_name'].apply(lambda x: fuzz.ratio(row['full_name'], x))

print(df.to_string())

Output:

      full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0   Jerry Smith          100           73            26          95           64
1   Morty Smith           73          100            26          76           91
2  Rick Sanchez           26           26           100          27           35
3    Jery Smith           95           76            27         100           67
4   Morti Smith           64           91            35          67          100

Then find the best matches for a chosen name:

data_rows = df[df['Jerry Smith'] > 90]
print(data_rows)

Output:

     full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0  Jerry Smith          100           73            26          95           64
3   Jery Smith           95           76            27         100           67
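The score table above is symmetric, so only the entries above the diagonal need to be computed; the rest can be mirrored. A rough sketch of that idea in plain Python follows. It rests on a few assumptions: `difflib.SequenceMatcher` stands in for `fuzz.ratio`, the scorer is treated as symmetric, identical strings are taken to score 100, and `ratio` and `score_matrix` are hypothetical names.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer (assumption); replace with fuzz.ratio in real code."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def score_matrix(names):
    """Build an n x n score table, computing each pair only once."""
    n = len(names)
    # Diagonal: a name compared with itself is assumed to score 100.
    matrix = [[100] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumes score(a, b) == score(b, a), so the value is mirrored.
            matrix[i][j] = matrix[j][i] = ratio(names[i], names[j])
    return matrix


names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]
matrix = score_matrix(names)

# Names scoring above 90 against the first entry ("Jerry Smith").
matches = [names[j] for j, score in enumerate(matrix[0]) if score > 90]
print(matches)
```

Keep in mind that a full table needs memory proportional to n squared, so for a large column it may be preferable to keep only the pairs above the threshold.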

千万里不及你

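This way of comparing does double the work, because running fuzz.ratio between "Jerry Smith" and "Morti Smith" gives the same result as running it between "Morti Smith" and "Jerry Smith". If you iterate over sub-arrays instead, comparing each name only with the entries that come after it, you can get this done faster.

import pandas as pd
from fuzzywuzzy import fuzz, process

dataframe = pd.read_csv('datafile.csv')
full_names = dataframe['full_name'].tolist()
_list = []

for i, comparison_fullname in enumerate(full_names[:-1]):
    # only compare against the entries after the current one;
    # limit=None returns every candidate instead of the default top 5
    for entry_fullname, entry_score in process.extract(
            comparison_fullname, full_names[i + 1:], scorer=fuzz.ratio, limit=None):
        if entry_score >= 90:
            _list.append((comparison_fullname, entry_fullname, entry_score))

print(_list)

This prevents any repeated work.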