How do I split a pandas string column to extract the middle name?

3 Answers

智慧大石

A single str.extract call will work here:

    p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'
    u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
    pd.concat([df, u], axis=1).fillna('')

       ID               Complete_Name Type     Last_Name First_Name   Middle_Name
    0   1                  JERRY, Ben    I         JERRY        Ben
    1   2          VON HELSINKI, Olga    I  VON HELSINKI       Olga
    2   3  JENSEN, James Goodboy Dean    I        JENSEN      James  Goodboy Dean
    3   4                 THE COMPANY    C
    4   5         CRUZ, Juan S. de la    I          CRUZ       Juan      S. de la

Regex breakdown

    ^                # Start-of-line
    (?P<Last_Name>   # First named capture group - Last Name
        .*           # Match anything until...
    )
    ,                # ...we see a comma
    \s               # whitespace (written as a literal space in p)
    (?P<First_Name>  # Second capture group - First Name
        \S+          # Match all non-whitespace characters
    )
    \b               # Word boundary
    \s*              # Optional whitespace chars (mostly housekeeping)
    (?P<Middle_Name> # Third capture group - Zero or more middle names
        .*           # Match everything till the end of string
    )
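To see what each named group captures before running it on a whole column, the same pattern can be tried with the standard library's re module. This is only a sketch: the helper name split_name and the sample list are made up for illustration and are not part of the answer above.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(
    r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'
)


def split_name(complete_name):
    """Return a dict of name parts, or None when the pattern does not match.

    Hypothetical helper, used only to illustrate the regex.
    """
    match = PATTERN.match(complete_name)
    return match.groupdict() if match else None


samples = [
    'JERRY, Ben',
    'JENSEN, James Goodboy Dean',
    'THE COMPANY',
    'CRUZ, Juan S. de la',
]

for name in samples:
    # Names without ", " (such as company names) do not match at all.
    print(name, '->', split_name(name))
```

Strings that do not contain a comma followed by a space produce no match, which is why str.extract leaves those rows empty in the DataFrame version.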


有只小跳蛙

I think you can do it like this:

    # take the Complete_Name column and split it on the comma
    df2 = (df.loc[df['Type'].eq('I'), 'Complete_Name'].str
           .split(',', expand=True)
           .fillna(''))

    # remove extra spaces
    for col in df2.columns:
        df2[col] = df2[col].str.strip()

    # split the given names on the first space and join the pieces
    df2 = pd.concat([df2[0], df2[1].str.split(' ', n=1, expand=True)], axis=1)
    df2.columns = ['last', 'first', 'middle']

    # join the data frames
    df = pd.concat([df[['ID', 'Complete_Name', 'Type']], df2], axis=1)

    # rearrange columns - not necessary though
    df = df[['ID', 'Complete_Name', 'Type', 'first', 'middle', 'last']]

    # replace missing values with empty strings
    df = df.fillna('')

       ID               Complete_Name Type  first        middle          last
    0   1                  JERRY, Ben    I    Ben                       JERRY
    1   2          VON HELSINKI, Olga    I   Olga                VON HELSINKI
    2   3  JENSEN, James Goodboy Dean    I  James  Goodboy Dean        JENSEN
    3   4                 THE COMPANY    C
    4   5         CRUZ, Juan S. de la    I   Juan      S. de la          CRUZ
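The key step in this answer is limiting the second split to one cut, so everything after the first name stays together as the middle name. A rough plain-Python sketch of the same idea on a single string follows; the function name split_on_comma_then_space is hypothetical and chosen only for this illustration.

```python
def split_on_comma_then_space(complete_name):
    """Split 'LAST, First Middle...' into parts using str.split with a limit.

    Hypothetical helper; returns None when there is no comma to split on.
    """
    parts = complete_name.split(',', 1)
    if len(parts) < 2:
        return None
    last = parts[0].strip()
    # maxsplit=1 cuts only at the first space, keeping multi-word
    # middle names such as 'S. de la' in one piece.
    given = parts[1].strip().split(' ', 1)
    first = given[0]
    middle = given[1] if len(given) > 1 else ''
    return {'last': last, 'first': first, 'middle': middle}


print(split_on_comma_then_space('CRUZ, Juan S. de la'))
print(split_on_comma_then_space('JERRY, Ben'))
print(split_on_comma_then_space('THE COMPANY'))
```

Without the limit, 'Juan S. de la' would break into four pieces and the middle name would have to be stitched back together.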


MM们

Here is another answer that uses a few simple lambda functions.

    import numpy as np
    import pandas as pd

    # Create data and data frame
    info_dict = {
        'ID': [1, 2, 3, 4, 5],
        'Complete_Name': [
            'JERRY, Ben',
            'VON HELSINKI, Olga',
            'JENSEN, James Goodboy Dean',
            'THE COMPANY',
            'CRUZ, Juan S. de la',
        ],
        'Type': ['I', 'I', 'I', 'C', 'I'],
    }

    data = pd.DataFrame(info_dict, columns=info_dict.keys())

    # List of columns to add
    name_cols = [
        'First Name',
        'Middle Name',
        'Last Name',
    ]

    # Use partition() to separate first and middle names into pandas Series.
    # Note: data[data['Type'] == 'I']['Complete_Name'] lets us target only
    # the values that we want.
    NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
    LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())

    # We can use index positions to quickly add columns to the dataframe.
    # partition() keeps the delimiter itself at index 1, so we use
    # index positions 0 and 2 for the first and middle names.
    data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
    data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]

    # Finally, we add our Last Name column
    data[name_cols[2]] = LAST_NAMES

    # Optional: replace all blank values with NaN using a regular expression.
    data = data.replace(r'^$', np.nan, regex=True)

Then you should get a result like this:

       ID               Complete_Name Type First Name   Middle Name     Last Name
    0   1                  JERRY, Ben    I        Ben           NaN         JERRY
    1   2          VON HELSINKI, Olga    I       Olga           NaN  VON HELSINKI
    2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN
    3   4                 THE COMPANY    C        NaN           NaN           NaN
    4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

Alternatively, replace the NaN values with empty strings:

    data = data.fillna('')

Then you have:

       ID               Complete_Name Type First Name   Middle Name     Last Name
    0   1                  JERRY, Ben    I        Ben                       JERRY
    1   2          VON HELSINKI, Olga    I       Olga                VON HELSINKI
    2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN
    3   4                 THE COMPANY    C
    4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ
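For readers unfamiliar with partition(), the following small sketch shows the three-part tuple it returns on a plain Python string, which is what the index positions 0 and 2 in the answer refer to. The variable names are illustrative only.

```python
full = 'JENSEN, James Goodboy Dean'

# partition() cuts at the FIRST occurrence of the separator and always
# returns a 3-tuple: (text before, the separator itself, text after).
last, sep, rest = full.partition(',')

# Applying it again on the remainder separates first and middle names.
first, _, middle = rest.strip().partition(' ')

# When the separator is absent, the whole string lands at index 0 and
# the other two items are empty strings.
no_comma = 'THE COMPANY'.partition(',')

print(last, '|', first, '|', middle)
print(no_comma)
```

The last case suggests why the answer filters on Type == 'I' first: applied to a company name, index 0 would hold the full string and it would be treated as a last name.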
