
Help splitting a dataframe column into new columns

I'm having trouble splitting a dataframe column on _ and creating new columns from the pieces.


The original string:


AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt    as section

My current code:


import pandas as pd

# myList is built earlier (not shown)
df = pd.DataFrame(myList, columns=['section', 'text'])
#df['text'] = df['text'].str.replace('•','')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors')
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis')
# regex=True is needed for the pattern replacement on pandas 2.0+, where str.replace defaults to literal matching
df['section'] = df['section'].str.replace('excerpt.txt', '').str.replace(r'\d{10}_|\d{8}_', '', regex=True)
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',', index=False)

Output:


section                                 text
AMAT_10Q_Filing Section: Risk Factors_  The COVID-19 pandemic and global measures taken in response
                                        thereto have adversely impacted, and may continue to adversely
                                        impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_  The COVID-19 pandemic and measures taken in response by
                                        governments and businesses worldwide to contain its spread,
AMAT_10Q_Filing Section: Risk Factors_  The degree to which the pandemic ultimately impacts Applied’s
                                        financial condition and results of operations and the global
                                        economy will depend on future developments beyond our control

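I would really like to split "section" on "_" and put the pieces into new columns. I have tried many regex variations to split "section", and all of them either gave me headers with no values filled in, or added the new columns after section and text, which is not useful. I should also add that there are about 100,000 observations.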


Desired result:


Ticker  Filing type  Section                       Text

AMAT    10Q          Filing Section: Risk Factors  The COVID-19 pandemic and global measures taken in response 

Any guidance would be greatly appreciated.


素胚勾勒不出你
1 Answer

jeck猫

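If you always know the number of splits, you can do the following:

import pandas as pd

df = pd.DataFrame({ "a": [ "test_a_b", "test2_c_d" ] })

# Split column by "_"
items = df["a"].str.split("_")
# Get the last item from the split column and place it in "b"
df["b"] = items.apply(list.pop)
# Get the next-to-last item from the split column and place it in "c"
df["c"] = items.apply(list.pop)
# Get the remaining item from the split column and place it in "d"
df["d"] = items.apply(list.pop)

Each call to list.pop removes the last element of every list in place, so successive calls walk backwards through the split pieces. This way, the dataframe becomes:

           a  b  c      d
0   test_a_b  b  a   test
1  test2_c_d  d  c  test2

Since you want the columns in a specific order, you can reorder the dataframe's columns like this:

>>> df = df[[ "d", "c", "b", "a" ]]
>>> df
       d  c  b          a
0   test  a  b   test_a_b
1  test2  c  d  test2_c_d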
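For the file names in the question, a variant of the same idea might use str.split with expand=True, which returns the pieces as separate columns directly. This is only a sketch: it assumes every file name has the same underscore-separated fields in the same positions as the single example shown, the text values are placeholders, and the SECTION_NAMES mapping is a hypothetical helper built from the two replacements in the question's code.

```python
import pandas as pd

# Sample rows modeled on the file name in the question; the text values are placeholders.
df = pd.DataFrame(
    {
        "section": [
            "AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt",
            "AMAT_0000006951_10Q_20200726_Item2_excerpt.txt",
        ],
        "text": ["first excerpt", "second excerpt"],
    }
)

# Hypothetical mapping from item codes to readable section names.
SECTION_NAMES = {
    "Item1A": "Filing Section: Risk Factors",
    "Item2": "Filing Section: Management Discussion and Analysis",
}

# Split on "_"; expand=True returns a DataFrame with one column per piece,
# labeled 0, 1, 2, ... (assumes the same number of pieces in every row).
parts = df["section"].str.split("_", expand=True)

# Pick the pieces by position and build the columns in the desired order.
result = pd.DataFrame(
    {
        "Ticker": parts[0],
        "Filing type": parts[2],
        "Section": parts[4].map(SECTION_NAMES),  # unmapped codes become NaN
        "Text": df["text"],
    }
)
print(result)
```

Building a new dataframe from the selected pieces avoids the reordering step, and the vectorized split should cope with around 100,000 rows without a Python-level loop.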