Running fuzzywuzzy ratio on a single pandas column

4 answers

宝慕林4294392

    import pandas as pd
    from io import StringIO
    from fuzzywuzzy import process

    s = """full_name,dob
    Jerry Smith,21/01/2010
    Morty Smith,18/06/2008
    Rick Sanchez,27/04/1993
    Jery Smith,27/12/2012
    Morti Smith,13/03/2012"""

    df = pd.read_csv(StringIO(s))

    # 1 - use fuzzywuzzy.process.extract with list comprehension
    # 2 - You still have to iterate once but this method avoids the use of apply, which can be very slow
    # 3 - convert the list comprehension results to a dataframe
    # Note that I am limiting the results to one match. You can adjust the code as you see fit
    df2 = pd.DataFrame(
        [process.extract(df['full_name'][i], df[~df.index.isin([i])]['full_name'], limit=1)[0]
         for i in range(len(df))],
        index=df.index, columns=['match_name', 'match_percent', 'match_index'])

    # join the new dataframe to the original
    final = df.join(df2)

Contents of final:

          full_name         dob   match_name  match_percent  match_index
    0   Jerry Smith  21/01/2010   Jery Smith             95            3
    1   Morty Smith  18/06/2008  Morti Smith             91            4
    2  Rick Sanchez  27/04/1993  Morti Smith             43            4
    3    Jery Smith  27/12/2012  Jerry Smith             95            0
    4   Morti Smith  13/03/2012  Morty Smith             91            1

GCT1015

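There are generally two things that help improve performance:

- reduce the number of comparisons
- use a faster way to match the strings

Your implementation performs many unnecessary comparisons, because it always compares A <-> B and later B <-> A. It also compares A <-> A, which is generally always 100. Skipping these cuts the number of comparisons by more than 50%. And since you only want to keep matches that score above 90, that threshold can be passed to the scorer to speed up the comparison.

Your code can be changed as follows to apply both improvements, which should be much faster. When I tested on my machine, your code ran for about 12 seconds, while this improved version took only 1.7 seconds.

    import pandas as pd
    from io import StringIO
    from rapidfuzz import fuzz

    _list = []

    # generate a bigger list of examples to show the performance benefits
    s = "fullname,dob"
    s += '''
    Jerry Smith,21/01/2010
    Morty Smith,18/06/2008
    Rick Sanchez,27/04/1993
    Jery Smith,27/12/2012
    Morti Smith,13/03/2012''' * 500

    dataframe = pd.read_csv(StringIO(s))

    # only create the data series once
    full_names = dataframe['fullname']
    for index, row1 in full_names.items():
        # skip elements that are already compared
        for row2 in full_names.iloc[index + 1:]:
            # use a score_cutoff to improve the runtime for bad matches
            score = fuzz.ratio(row1, row2, score_cutoff=90)
            if score:
                _list.append([row1, row2, score])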
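To make the "more than 50%" figure concrete, here is a small sketch that counts the comparisons for the 2,500-row frame used in this answer. It assumes the original loop compared every row with every row, as described above; the two function names are hypothetical and exist only for illustration.

```python
def all_ordered_comparisons(n: int) -> int:
    """Every row compared against every row, including itself (A <-> A)."""
    return n * n


def unique_pair_comparisons(n: int) -> int:
    """Each unordered pair compared once, with no self-comparisons."""
    return n * (n - 1) // 2


n_rows = 5 * 500  # five sample names repeated 500 times, as in this answer
full = all_ordered_comparisons(n_rows)
reduced = unique_pair_comparisons(n_rows)
saved_fraction = 1 - reduced / full

print(full, reduced, round(saved_fraction, 4))  # 6250000 3123750 0.5002
```

The saving from pair deduplication alone is just over half; the rest of the measured speedup would then come from the faster scorer and the `score_cutoff` early exit.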

慕码人8056858

You can first build the full table of fuzzy scores:

    import pandas as pd
    from io import StringIO
    from fuzzywuzzy import fuzz

    data = StringIO("""
    Jerry Smith
    Morty Smith
    Rick Sanchez
    Jery Smith
    Morti Smith
    """)

    df = pd.read_csv(data, names=['full_name'])

    for index, row in df.iterrows():
        df[row['full_name']] = df['full_name'].apply(lambda x: fuzz.ratio(row['full_name'], x))

    print(df.to_string())

Output:

          full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
    0   Jerry Smith          100           73            26          95           64
    1   Morty Smith           73          100            26          76           91
    2  Rick Sanchez           26           26           100          27           35
    3    Jery Smith           95           76            27         100           67
    4   Morti Smith           64           91            35          67          100

Then find the best matches for a chosen name:

    data_rows = df[df['Jerry Smith'] > 90]

    print(data_rows)

Output:

         full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
    0  Jerry Smith          100           73            26          95           64
    3   Jery Smith           95           76            27         100           67

千万里不及你

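This way of comparing does double work, because running fuzz.ratio between "Jerry Smith" and "Morti Smith" is the same as running it between "Morti Smith" and "Jerry Smith". If you compare each name only against the rows that come after it, you will get through the data faster.

    import pandas as pd
    from fuzzywuzzy import fuzz, process

    dataframe = pd.read_csv('datafile.csv')
    _list = []
    for i_dataframe in range(len(dataframe) - 1):
        comparison_fullname = dataframe['fullname'][i_dataframe]
        # compare only against the rows after the current one
        remaining = dataframe['fullname'][i_dataframe + 1:]
        # a Series is dict-like, so each result is (match, score, index);
        # limit=None returns every candidate instead of only the default top 5
        for entry_fullname, entry_score, _ in process.extract(
                comparison_fullname, remaining, scorer=fuzz.ratio, limit=None):
            if entry_score >= 90:
                _list.append((comparison_fullname, entry_fullname, entry_score))
    print(_list)

This prevents any repeated work.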
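The same "each pair only once" idea can also be sketched with `itertools.combinations`, which yields every unordered pair exactly once. To stay free of third-party dependencies, this sketch swaps in the standard library's `difflib.SequenceMatcher` for `fuzz.ratio`; its scores are similar in spirit but are not guaranteed to match fuzzywuzzy's. `similarity` and `close_pairs` are hypothetical helper names, not part of any library.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Stand-in scorer (assumption): difflib ratio scaled to 0-100, not fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def close_pairs(names, threshold=90):
    """Score each unordered pair once and keep those at or above the threshold."""
    matches = []
    for left, right in combinations(names, 2):
        score = similarity(left, right)
        if score >= threshold:
            matches.append((left, right, score))
    return matches


names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]
pairs = close_pairs(names)
print(pairs)
```

With the five sample names this keeps the Jerry/Jery and Morty/Morti pairs. Swapping `similarity` for `fuzz.ratio` should give the behavior this answer describes, without the index arithmetic.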