I'm having trouble splitting a data frame column on _ and creating new columns from the pieces.
Original string, stored as section:
AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt
My current code:
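import pandas as pd

# myList holds the [section, text] rows built earlier (not shown)
df = pd.DataFrame(myList, columns=['section', 'text'])
#df['text'] = df['text'].str.replace('•','')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors', regex=False)
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis', regex=False)
# Drop the file suffix, then the 10-digit and 8-digit numeric fields (regex=True must be explicit in pandas >= 2.0)
df['section'] = df['section'].str.replace('excerpt.txt', '', regex=False).str.replace(r'\d{10}_|\d{8}_', '', regex=True)
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',', index=False)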
Output:
section text
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and global measures taken in response
thereto have adversely impacted, and may continue to adversely
impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and measures taken in response by
governments and businesses worldwide to contain its spread,
AMAT_10Q_Filing Section: Risk Factors_ The degree to which the pandemic ultimately impacts Applied’s
financial condition and results of operations and the global
economy will depend on future developments beyond our control
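I would really like to split "section" on "_" and put the pieces into new columns. I have tried many regex variations to split "section", and every one of them either gave me headers with no data filled in, or added the new columns after section and text, which is not useful. I should also add that there are roughly 100,000 observations.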
Desired result:
Ticker Filing type Section Text
AMAT 10Q Filing Section: Risk Factors The COVID-19 pandemic and global measures taken in response
Any guidance would be greatly appreciated.
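One possible approach, offered as a minimal sketch rather than a definitive fix: split the raw file name before doing any replacements, then build the new frame from the pieces so the column order is under your control. It assumes every name follows the same six-part layout as the example above (ticker, numeric ID, filing type, date, item, excerpt.txt). The second sample row, the `section_names` mapping, and the `result` name are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the original data frame.
df = pd.DataFrame(
    [
        ["AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt",
         "The COVID-19 pandemic and global measures taken in response"],
        ["AMAT_0000006951_10Q_20200726_Item2_excerpt.txt",
         "Placeholder text for a second section"],
    ],
    columns=["section", "text"],
)

# Split on "_" into separate columns (0..5), one per part of the file name.
parts = df["section"].str.split("_", expand=True)

# Assumed mapping from the item code to a readable section label.
section_names = {
    "Item1A": "Filing Section: Risk Factors",
    "Item2": "Filing Section: Management Discussion and Analysis",
}

# Build the output explicitly so the new columns come first, in the desired order.
result = pd.DataFrame(
    {
        "Ticker": parts[0],
        "Filing type": parts[2],
        # Keep the raw item code when it has no entry in the mapping.
        "Section": parts[4].map(section_names).fillna(parts[4]),
        "Text": df["text"],
    }
)

print(result)
```

Because `str.split(..., expand=True)` is vectorized, it should cope with around 100,000 rows without trouble. If some names have a different number of parts, the extra or missing pieces will show up as additional columns or missing values in `parts`, so it is worth inspecting `parts` before building `result`.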