
How to split strings in a pandas DataFrame on capital letters

I'm working with some NFL data, and my DataFrame has a column that looks like this:


0         Lamar JacksonL. Jackson BAL

1     Patrick Mahomes IIP. Mahomes KC

2         Dak PrescottD. Prescott DAL

3              Josh AllenJ. Allen BUF

4         Russell WilsonR. Wilson SEA

Each cell holds 3 pieces of information: FullName, ShortName and Team, and I'd like to create a new column for each.


Expected output:


         FullName                ShortName        Team

0         Lamar Jackson           L. Jackson        BAL

1         Patrick Mahomes II      P. Mahomes        KC

2         Dak Prescott            D. Prescott       DAL

3         Josh Allen              J. Allen          BUF

4         Russell Wilson          R. Wilson         SEA

I've managed to get the Team, but I'm not sure how to do all three in one line.


I was thinking of splitting the string at the character just before the full stop, but some names come up such as:


Anthony McFarland Jr.A. McFarland PIT

which have more than one full stop.


Does anyone know the best way to solve this? Thanks!


繁星淼淼
4 answers

qq_遁去的一_1

The pandas Series str.extract method is what you're looking for. This regular expression works for all the cases you posted, though there may be some other edge cases.

    import pandas as pd

    df = pd.DataFrame({
        "bad_col": ["Lamar JacksonL. Jackson BAL", "Patrick Mahomes IIP. Mahomes KC",
                    "Dak PrescottD. Prescott DAL", "Josh AllenJ. Allen BUF",
                    "Josh AllenJ. Allen SEA", "Anthony McFarland Jr.A. McFarland PIT"],
    })

    print(df)
                                     bad_col
    0            Lamar JacksonL. Jackson BAL
    1        Patrick Mahomes IIP. Mahomes KC
    2            Dak PrescottD. Prescott DAL
    3                 Josh AllenJ. Allen BUF
    4                 Josh AllenJ. Allen SEA
    5  Anthony McFarland Jr.A. McFarland PIT

    pattern = r"(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)"
    new_df = df["bad_col"].str.extract(pattern, expand=True)

    print(new_df)
                   full_name    short_name team
    0          Lamar Jackson    L. Jackson  BAL
    1     Patrick Mahomes II    P. Mahomes   KC
    2           Dak Prescott   D. Prescott  DAL
    3             Josh Allen      J. Allen  BUF
    4             Josh Allen      J. Allen  SEA
    5  Anthony McFarland Jr.  A. McFarland  PIT

Breaking down the regular expression:

    (?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)

(?P<full_name>.+)(?=[A-Z]\.)
Captures any characters until we see a capital letter followed by a period. Because .+ is greedy, it runs up to the last such position from which the rest of the pattern can still match. We use a lookahead (?=...) so that the capital letter and period are not consumed, since that part of the string belongs to the short name.

(?P<short_name>[A-Z]\.\s.*)\s
Captures a capital letter (the player's first initial), then a period (the one after the initial), then a space (between the initial and the last name), then every character up to the space before the team (the player's last name). That space is not included in the capture group.

(?P<team>[A-Z]+)
Captures the run of capital letters that follows (which ends up being the player's team).

You may have noticed that I used named capture groups, written with the (?P<name>pattern) construct. In pandas, the name of each capture group becomes the name of a column, and whatever that group captures becomes the values in that column.

Now join the new DataFrame back onto the original one to come full circle:

    df = df.join(new_df)

    print(df)
                                     bad_col              full_name    short_name  \
    0            Lamar JacksonL. Jackson BAL          Lamar Jackson    L. Jackson
    1        Patrick Mahomes IIP. Mahomes KC     Patrick Mahomes II    P. Mahomes
    2            Dak PrescottD. Prescott DAL           Dak Prescott   D. Prescott
    3                 Josh AllenJ. Allen BUF             Josh Allen      J. Allen
    4                 Josh AllenJ. Allen SEA             Josh Allen      J. Allen
    5  Anthony McFarland Jr.A. McFarland PIT  Anthony McFarland Jr.  A. McFarland

      team
    0  BAL
    1   KC
    2  DAL
    3  BUF
    4  SEA
    5  PIT
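Since other edge cases may exist, it can help to see which rows the pattern fails on. The following is a minimal sketch that assumes the same pattern as above; the row "Tom Brady TB" is a made-up malformed entry with no short name, included only to trigger a failure. str.extract leaves NaN in every column for a row that does not match, so those rows can be filtered out for inspection.

```python
import pandas as pd

pattern = r"(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)"

# The last entry is a hypothetical malformed row with no short name.
raw = pd.Series([
    "Lamar JacksonL. Jackson BAL",
    "Anthony McFarland Jr.A. McFarland PIT",
    "Tom Brady TB",
])

extracted = raw.str.extract(pattern, expand=True)

# Rows the pattern could not match come back as NaN in every column.
unmatched = raw[extracted["full_name"].isna()]
print(unmatched)
```

Anything that shows up in unmatched is a candidate for adjusting the pattern or for handling by hand.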

哆啦的时光机

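My guess is that the last name in the short name won't contain a full stop. So you can search for the first full stop counting from the end of the line. Everything from one character before that full stop up to the first space from the end is your short name. Anything before the letter that precedes the full stop is the full name.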
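The idea above can be sketched in plain Python roughly as follows. The function name split_player is made up for illustration, and the sketch relies on the same guess as the answer: the last name in the short name contains no full stop (a name such as "St. Brown" would break it).

```python
def split_player(cell):
    """Split 'Full NameX. Surname TEAM' into (full_name, short_name, team)."""
    # The last space separates the team code from everything before it.
    rest, team = cell.rsplit(" ", 1)
    # Assumption: the last remaining full stop is the one after the initial.
    last_period = rest.rfind(".")
    if last_period < 1:
        raise ValueError(f"no short name found in {cell!r}")
    # The initial sits one character before that full stop.
    split_at = last_period - 1
    return rest[:split_at], rest[split_at:], team


for cell in [
    "Patrick Mahomes IIP. Mahomes KC",
    "Anthony McFarland Jr.A. McFarland PIT",
]:
    print(split_player(cell))
```

Because the split points are found from the right-hand end, extra full stops in the full name (as in "Jr.") and team codes of different lengths do not affect the result.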

喵喵时光机

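This might help.

    import re

    name = 'Anthony McFarland Jr.A. McFarland PIT'

    # Short name: an initial, a full stop, a space and a last name,
    # followed by the team code at the end of the string.
    # The team is captured too, so 2-letter codes such as KC also work.
    short_name, team = re.findall(r'(\w\.\s\w+)\s(\w+)$', name)[0]
    # Remove the short name, then the trailing space and team code.
    full_name = name.replace(short_name, "")[:-(len(team) + 1)]

    print(short_name)
    print(full_name)
    print(team)

Output:

    A. McFarland
    Anthony McFarland Jr.
    PIT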

泛舟湖上清波郎朗

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'players': ['Lamar JacksonL. Jackson BAL', 'Patrick Mahomes IIP. Mahomes KC',
                                   'Anthony McFarland Jr.A. McFarland PIT']})

    def splitName(name):
        # Position of the last full stop, which follows the short name's initial
        last_period_pos = np.max(np.where(np.array(list(name)) == '.'))
        full_name = name[:(last_period_pos - 1)]
        short_name_team = name[(last_period_pos - 1):]
        # Position of the last space, which precedes the team code
        team_pos = np.max(np.where(np.array(list(short_name_team)) == ' '))
        short_name = short_name_team[:team_pos]
        team = short_name_team[(team_pos + 1):]
        return full_name, short_name, team

    df['full_name'], df['short_name'], df['team'] = zip(*df.players.apply(splitName))