Iterating over DataFrame rows with re.compile().split()

I have a DataFrame with one column and several rows. Every row is structured the same way: -timestamp- value1 value2 value3 -timestamp- value4 value5 value6 ...


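The timestamps have the format YYYY-MM-DD HH:MM:SS, and the values are numbers with 2 decimal places. I want to build a new DataFrame in which one row holds a timestamp by itself and the next row holds the associated values.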


Using a regular expression, I managed to get the expected result for a single row, but not for the whole DataFrame.


My code so far:


#input dataframe
data.head()

                  values
0   2020-05-12 10:00:00 12.07 13 11.56 ... 2020-05-12 10:00:01 11.49 17 5.67...
1   2020-05-12 10:01:00 11.49 17 5.67 ... 2020-05-12 10:01:01 12.07 13 11.56...
2   2020-05-12 10:02:00 14.29 18 11.28 ... 2020-05-12 10:02:01 13.77 18 7.43...

test = data['values'].iloc[0] #first row of data
row1 = re.compile("(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(test)
df_row1 = pd.DataFrame(row1)

df_row1.head()

             values
0   2020-05-12 10:00:00
1   12.07 13.79 15.45 17.17 18.91 14.91 12.35 14....
2   2020-05-12 10:00:01
3   12.48 13.96 13.88 15.57 18.46 15.0 13.65 14.6...

#trying the same for the entire dataframe
for row in data:
    df_new = re.compile("(\d\d\d\d\S\d\d\S\d\d\s\d\d\S\d\d\S\d\d)").split(row)

print(df_new)
['values']


My question now is: how do I loop over the rows of the DataFrame and get the expected result?


月关宝盒
1 answer

慕勒3428872

If you want to split each row first and then extract the values into columns, note that you can use str.extract. If you use named groups in your regular expression, the group names are automatically used as the column names of the resulting DataFrame.

import pandas as pd

# Split before every timestamp (lookahead keeps the timestamp in the next piece)
split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
# Named groups become the column names
extract_values = r"(?P<date>\d{4}-\d{2}-\d{2})\s(?P<time>\d{2}:\d{2}:\d{2})\s(?P<value_one>.*?)\s(?P<value_two>.*?)\s(?P<value_three>.*?)$"

df = pd.DataFrame([
    {"value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67"},
    {"value": "2020-05-13 10:00:00 14.07 13 15.56 2020-05-16 10:00:02 11.51 18 5.69"},
])

df = df["value"].str.split(split_line).explode().str.extract(extract_values, expand=True)
print(df)

#          date      time value_one value_two value_three
# 0  2020-05-12  10:00:00     12.07        13       11.56
# 0  2020-06-12  11:00:00     13.07        16       11.16
# 0  2020-05-12  10:00:01     11.49        17        5.67
# 1  2020-05-13  10:00:00     14.07        13       15.56
# 1  2020-05-16  10:00:02     11.51        18        5.69

If you do not know how many values follow the date and time, use split instead of an extraction regex. I would suggest something like this:

import pandas as pd

split_line = r"\s+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"

df = pd.DataFrame([
    {"value": "2020-05-12 10:00:00 12.07 13 11.56 2020-06-12 11:00:00 13.07 16 11.16 2020-05-12 10:00:01 11.49 17 5.67"},
    {"value": "2020-05-13 10:00:00 14.07 13 14 15 15.56 2020-05-16 10:00:02 11.51 18 5.69"},
])

# One row per timestamp, then one column per space-separated token
df = df["value"].str.split(split_line).explode().reset_index()
df = df['value'].str.split(" ").apply(pd.Series)
df.columns = [f"col_{col}" for col in df.columns]
print(df)

#         col_0     col_1  col_2 col_3  col_4 col_5  col_6
# 0  2020-05-12  10:00:00  12.07    13  11.56   NaN    NaN
# 1  2020-06-12  11:00:00  13.07    16  11.16   NaN    NaN
# 2  2020-05-12  10:00:01  11.49    17   5.67   NaN    NaN
# 3  2020-05-13  10:00:00  14.07    13     14    15  15.56
# 4  2020-05-16  10:00:02  11.51    18   5.69   NaN    NaN
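For the layout the question originally asked for (a timestamp on one row, its values on the next), a minimal sketch could look like the one below. It is not part of the answer above. It assumes the column is called `values` as in the question, and the two sample rows are shortened stand-ins for the real data. It also shows why the loop in the question printed `['values']`: iterating over a DataFrame directly yields its column labels, so the loop has to run over the column itself.

```python
import re

import pandas as pd

# Shortened sample data modelled on the question (assumption: column "values").
data = pd.DataFrame({
    "values": [
        "2020-05-12 10:00:00 12.07 13 11.56 2020-05-12 10:00:01 11.49 17 5.67",
        "2020-05-12 10:01:00 11.49 17 5.67 2020-05-12 10:01:01 12.07 13 11.56",
    ]
})

# Iterating over the DataFrame itself gives column labels, not rows.
column_labels = list(data)  # ['values']

# The capturing group makes re.split keep the timestamps in its result.
timestamp = re.compile(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})")

pieces = []
for row in data["values"]:  # loop over the cells of the column
    for part in timestamp.split(row):
        part = part.strip()
        if part:  # skip the empty string that precedes the first timestamp
            pieces.append(part)

# Alternating rows: timestamp, values, timestamp, values, ...
df_new = pd.DataFrame({"values": pieces})
print(df_new)
```

Collecting the pieces in a plain list and building the DataFrame once at the end avoids overwriting `df_new` on every iteration, which is the other thing the loop in the question did.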
