Correcting muddled dates in a Pandas DataFrame

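I have a million-row time-series DataFrame in which some values in the Date column have their day and month swapped.

How can I untangle them efficiently without breaking the ones that are already correct?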


# this creates a dataframe with muddled dates
import pandas as pd
import numpy as np
from pandas import Timestamp

start = Timestamp(2013,1,1)
dates = pd.date_range(start, periods=942)[::-1]

muddler = {}
for d in dates:
    if d.day < 13:
        muddler[d] = Timestamp(d.year, d.day, d.month)
    else:
        muddler[d] = Timestamp(d.year, d.month, d.day)

df = pd.DataFrame()
df['Date'] = dates
df['Date'] = df['Date'].map(muddler)

# now what? (assuming I don't know how the dates are muddled)


潇湘沐
2 Answers

小唯快跑啊

One option might be to compute a fit of the timestamps and modify those that deviate from the fit by more than a certain threshold. Example:

import pandas as pd
import numpy as np

start = pd.Timestamp(2013, 1, 1)
dates = pd.date_range(start, periods=942)[::-1]

muddler = {}
for d in dates:
    if d.day < 13:
        muddler[d] = pd.Timestamp(d.year, d.day, d.month)
    else:
        muddler[d] = pd.Timestamp(d.year, d.month, d.day)

df = pd.DataFrame()
df['Date'] = dates
df['Date'] = df['Date'].map(muddler)

# convert date col to posix timestamp (float seconds since the epoch)
epoch = pd.Timestamp(1970, 1, 1)
df['ts'] = (df['Date'] - epoch) / pd.Timedelta(seconds=1)

# calculate a linear fit for ts col
x = np.linspace(df['ts'].iloc[0], df['ts'].iloc[-1], df['ts'].size)
df['ts_linfit'] = np.polyval(np.polyfit(x, df['ts'], 1), x)

# set a thresh and derive a mask that masks differences between
# fit and timestamp greater than thresh:
thresh = 1.2e6 # you might want to tweak this...
m = np.absolute(df['ts'] - df['ts_linfit']) > thresh

# create new date col as copy of original
df['Date_filtered'] = df['Date']

# modify values that were caught in the mask
# (swapping raises ValueError if a masked date has day > 12)
df.loc[m, 'Date_filtered'] = df['Date_filtered'][m].apply(lambda x: pd.Timestamp(x.year, x.day, x.month))

# also to posix timestamp
df['ts_filtered'] = (df['Date_filtered'] - epoch) / pd.Timedelta(seconds=1)

# compare original and filtered timestamps (plotting needs matplotlib)
ax = df['ts'].plot(label='original')
ax = df['ts_filtered'].plot(label='filtered')
ax.legend()
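To get a feel for the threshold in the example above, it may help to know how far a day/month swap can move a date. The sketch below is only an illustration and not part of the answer; the helper name `min_swap_shift` is made up here. It tries every (month, day) pair with both values in 1..12 and distinct, and reports the smallest shift a swap can cause in a given year.

```python
import itertools

import pandas as pd


def min_swap_shift(year: int) -> pd.Timedelta:
    """Smallest non-zero shift caused by swapping day and month in `year`.

    Hypothetical helper for illustration only.
    """
    shifts = [
        # Both m and d are in 1..12, so both Timestamps are valid dates.
        abs(pd.Timestamp(year, d, m) - pd.Timestamp(year, m, d))
        for m, d in itertools.permutations(range(1, 13), 2)
    ]
    return min(shifts)


shift = min_swap_shift(2013)
print(shift, shift.total_seconds())
```

For 2013 the smallest shift is 27 days (3 February versus 2 March), that is 2,332,800 seconds, so `thresh = 1.2e6` sits at roughly half of the smallest possible swap. The fit itself is pulled around by the muddled points, which is one reason the threshold may still need tweaking on real data.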

翻翻过去那场雪

While trying to create a minimal reproducible example, I actually solved my problem, but I hope there is a more efficient way to do what I want...

# uses pd and Timestamp as imported in the question
# i first define a function to examine the dates
def disordered_muddle(date_series, future_first=True):
    """Check whether a series of dates is disordered or just muddled"""
    disordered = []
    muddle = []
    dates = date_series
    different_dates = pd.Series(dates.unique())
    date = different_dates[0]
    for i, d in enumerate(different_dates[1:]):
        # we expect the date's dayofyear to decrease by one
        if d.dayofyear != date.dayofyear - 1:
            # unless the year is changing
            if d.year != date.year - 1:
                try:
                    # we check if the day and month are muddled
                    # if d.day > 12 this will cause an Exception
                    unmuddle = Timestamp(d.year, d.day, d.month)
                    if unmuddle.dayofyear == date.dayofyear - 1:
                        muddle.append(d)
                        d = unmuddle
                    elif unmuddle.year == date.year - 1:
                        muddle.append(d)
                        d = unmuddle
                    else:
                        disordered.append(d)
                except ValueError:
                    disordered.append(d)
        date = d
    if len(disordered) == 0 and len(muddle) == 0:
        return False
    else:
        return disordered, muddle

disorder, muddle = disordered_muddle(df['Date'])

# finally unmuddle the dates
date_correction = {}
for d in df['Date']:
    if d in muddle:
        date_correction[d] = Timestamp(d.year, d.day, d.month)
    else:
        date_correction[d] = Timestamp(d.year, d.month, d.day)

df['CorrectedDate'] = df['Date'].map(date_correction)

# check again: returns False once nothing is disordered or muddled
disordered_muddle(df['CorrectedDate'])
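As for a more efficient way, one possible vectorized variant is sketched below. It is not the method used above, and the function name `unmuddle_dates` is made up for illustration. It assumes that the only error is a day/month swap, that the values are timezone-naive dates without a time of day, that the series changes roughly linearly with row position (as in the example data), and that at least one row has day > 12. A swap always yields day <= 12, so rows with day > 12 can serve as trusted anchors; each remaining row keeps whichever of its two readings lies closer to the value interpolated between anchors.

```python
import numpy as np
import pandas as pd


def unmuddle_dates(s: pd.Series) -> pd.Series:
    """Undo day/month swaps in a roughly linear datetime Series (sketch)."""
    s = s.reset_index(drop=True)

    # A swapped date always has day <= 12, so day > 12 marks trusted rows.
    ambiguous = (s.dt.day <= 12).to_numpy()
    anchors = ~ambiguous

    # Candidate reading: day and month exchanged on ambiguous rows only,
    # which is always a valid date because both values are <= 12 there.
    month = np.where(ambiguous, s.dt.day, s.dt.month)
    day = np.where(ambiguous, s.dt.month, s.dt.day)
    swapped = pd.to_datetime(
        pd.DataFrame({"year": s.dt.year, "month": month, "day": day})
    )

    # Work in float days since the epoch.
    epoch = pd.Timestamp(1970, 1, 1)
    orig_days = ((s - epoch) / pd.Timedelta(days=1)).to_numpy()
    swap_days = ((swapped - epoch) / pd.Timedelta(days=1)).to_numpy()

    # Expected value per row, interpolated from the anchors by row position
    # (np.interp holds the nearest anchor value beyond the first/last anchor).
    pos = np.arange(len(s))
    expected = np.interp(pos, pos[anchors], orig_days[anchors])

    # Keep the swapped reading only where it is strictly closer to expected.
    use_swapped = np.abs(swap_days - expected) < np.abs(orig_days - expected)
    return swapped.where(use_swapped, s)
```

With the DataFrame from the question this could be called as `df['CorrectedDate'] = unmuddle_dates(df['Date'])`. Note that the result carries a fresh 0..n-1 index, so a DataFrame with a different index would need the values assigned via `.to_numpy()`. If the data contains repeated dates or gaps, the interpolation still gives a local estimate, but the closer-reading rule should be checked against a sample before trusting it on a million rows.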