How to group and sum in Python without counting duplicate values

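I want to convert time strings such as 12:45 to a datetime format while keeping that HH:MM form, and then compute the time difference for each activity (stored as activity_duration). After that, I want to sum activity_duration grouped by activity_station.

I converted the times to datetime, but the result picks up an arbitrary year, month, and day. I know how to group, but not how to eliminate duplicates when applying the grouping.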


import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Shift_id' :[ 123,123,123,123,123,123,123,123,123,123,123,123,123,123,123,
                345,345,345,345,345,345,345,345,345,345,345,345,345,345,345,345],
    'activity_id' : [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,
                    6,7,8,9,6,7,8,9,6,7,8,9,6,7,8,9],
    'activity_begin_time' : ['09:00','09:05','12:00','12:30','17:25','09:00','09:05','12:00','12:30','17:25','09:00','09:05','12:00','12:30','17:25',
                            '09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30'],
    'activity_end_time' : ['09:05','12:00','12:30', '17:25','17:30','09:05','12:00','12:30', '17:25','17:30','09:05','12:00','12:30', '17:25','17:30',
                          '09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25'],
    'activity_station' : ['None', 'Za','None','Ba','None','None', 'Za','None','Ba','None','None', 'Za','None','Ba','None',
                         'None','Za','Ba','Ra','None','Za','Ba','Ra','None','Za','Ba','Ra','None','Za','Ba','Ra']
})

# Parse the HH:MM strings as datetimes (this is where the unwanted date appears)
df['activity_begin_time'] = pd.to_datetime(df['activity_begin_time'])
df['activity_end_time'] = pd.to_datetime(df['activity_end_time'])

# Duration of each activity, first as a timedelta and then in hours
df['activity_duration'] = df['activity_end_time'] - df['activity_begin_time']
df['activity_duration'] = df['activity_duration']/np.timedelta64(1,'h')

I want to sum activity_duration grouped by activity_station, while eliminating the duplicate values.
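On the date component mentioned above: a clock string such as '09:00' carries no date, so any conversion to datetime has to supply one. The following is a minimal sketch, not the asker's code; the three-row `times` frame and the names `duration_hours` and `begin_label` are made up for illustration. It assumes an explicit `format='%H:%M'`, which pins the date to 1900-01-01, and shows that the subtraction is unaffected by that placeholder date while `dt.strftime` recovers the HH:MM form for display.

```python
import pandas as pd

# Hypothetical three-row sample, only for illustration
times = pd.DataFrame({
    'begin': ['09:00', '09:05', '12:00'],
    'end':   ['09:05', '12:00', '12:30'],
})

# With an explicit format, the missing date defaults to 1900-01-01
begin = pd.to_datetime(times['begin'], format='%H:%M')
end = pd.to_datetime(times['end'], format='%H:%M')

# The placeholder date cancels out in the subtraction
duration = end - begin
times['duration_hours'] = duration.dt.total_seconds() / 3600

# Format back to HH:MM strings when only the clock time should be shown
times['begin_label'] = begin.dt.strftime('%H:%M')

print(begin.dt.year.unique())
print(times)
```

The datetime column itself always stores a full date; only the formatted string drops it.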


慕盖茨4494581
1 Answer

慕码人8056858

Here is my solution:

import pandas as pd

df = pd.DataFrame({
    'Shift_id' :[ 123,123,123,123,123,123,123,123,123,123,123,123,123,123,123,
                345,345,345,345,345,345,345,345,345,345,345,345,345,345,345,345],
    'activity_id' : [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,
                    6,7,8,9,6,7,8,9,6,7,8,9,6,7,8,9],
    'activity_begin_time' : ['09:00','09:05','12:00','12:30','17:25','09:00','09:05','12:00','12:30','17:25','09:00','09:05','12:00','12:30','17:25',
                            '09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30'],
    'activity_end_time' : ['09:05','12:00','12:30', '17:25','17:30','09:05','12:00','12:30', '17:25','17:30','09:05','12:00','12:30', '17:25','17:30',
                          '09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25'],
    'activity_station' : ['None', 'Za','None','Ba','None','None', 'Za','None','Ba','None','None', 'Za','None','Ba','None',
                         'None','Za','Ba','Ra','None','Za','Ba','Ra','None','Za','Ba','Ra','None','Za','Ba','Ra']
})

First, remove the duplicates:

df = df.drop_duplicates()

Then use pandas.to_timedelta, which stores a duration rather than a point in time, so no date is attached:

# Append ':00' so each string has the HH:MM:SS form
df['activity_begin_time'] = pd.to_timedelta(df['activity_begin_time']+':00')
df['activity_end_time'] = pd.to_timedelta(df['activity_end_time']+':00')
df['activity_duration'] = df['activity_end_time'] - df['activity_begin_time']

Then you can use groupby with a specific aggregation for each column:

df.groupby('activity_station').agg({'activity_duration': 'sum'})

which yields:

                   activity_duration
activity_station
Ba                 05:25:00
None               00:45:00
Ra                 04:55:00
Za                 05:50:00
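The steps above can also be sketched so that the totals come out in hours, as in the question's own code. This is only an illustration under stated assumptions: it assumes that the pair (Shift_id, activity_id) identifies one activity, so duplicates are dropped on that subset, and it rebuilds the data from its nine distinct rows, repeated three times to mimic the duplicates. The names `rows`, `unique_rows`, `deduped`, `hours`, and `totals` are made up for this example.

```python
import pandas as pd

# The nine distinct activities from the question's data
rows = [
    (123, 1, '09:00', '09:05', 'None'),
    (123, 2, '09:05', '12:00', 'Za'),
    (123, 3, '12:00', '12:30', 'None'),
    (123, 4, '12:30', '17:25', 'Ba'),
    (123, 5, '17:25', '17:30', 'None'),
    (345, 6, '09:00', '09:05', 'None'),
    (345, 7, '09:05', '12:00', 'Za'),
    (345, 8, '12:00', '12:30', 'Ba'),
    (345, 9, '12:30', '17:25', 'Ra'),
]
columns = ['Shift_id', 'activity_id', 'activity_begin_time',
           'activity_end_time', 'activity_station']
unique_rows = pd.DataFrame(rows, columns=columns)

# Repeat every row three times to mimic the duplicated records
df = pd.concat([unique_rows] * 3, ignore_index=True)

# Assumption: (Shift_id, activity_id) identifies one activity
deduped = df.drop_duplicates(subset=['Shift_id', 'activity_id'])

# Durations as timedeltas, then converted to fractional hours
begin = pd.to_timedelta(deduped['activity_begin_time'] + ':00')
end = pd.to_timedelta(deduped['activity_end_time'] + ':00')
hours = (end - begin).dt.total_seconds() / 3600

# Sum the hours per station; 'None' is an ordinary string label here
totals = hours.groupby(deduped['activity_station']).sum()
print(totals)
```

Dropping duplicates on a subset of columns is a design choice: it still works if other columns differ between repeated records, whereas a plain `drop_duplicates()` only removes rows that match in every column.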