Group data by specific conditions and find durations in R or Python

I have a dataset, df, as shown below:


subject  recipient                  length  folder  message  date                 edit
                                    80      out              1/2/2020 1:00:01 AM  T
                                    80      out              1/2/2020 1:00:05 AM  T
hey      sarah@mail.com,g@mail.com  80      out              1/2/2020 1:00:10 AM  T
hey      sarah@mail.com,g@mail.com  80      out              1/2/2020 1:00:15 AM  T
hey      sarah@mail.com,g@mail.com  80      out              1/2/2020 1:00:30 AM  T
some     k                          900     in      jjjjj    1/2/2020 1:00:35 AM  F
some     k                          900     in      jjjjj    1/2/2020 1:00:36 AM  F
some     k                          900     in      jjjjj    1/2/2020 1:00:37 AM  F
hey      sarah@mail.com,g@mail.com  80      draft            1/2/2020 1:02:00 AM  T
hey      sarah@mail.com,g@mail.com  80      draft            1/2/2020 1:02:05 AM  T
no       a                          900     in      iii      1/2/2020 1:02:10 AM  F
no       a                          900     in      iii      1/2/2020 1:02:15 AM  F
no       a                          900     in      iii      1/2/2020 1:02:20 AM  F
no       a                          900     in      iii      1/2/2020 1:02:25 AM  F



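The dataset records when a user edits a message, leaves it, and later comes back to continue working on it. I am trying to capture the total time spent on a given message. I know I first have to group the rows, and I would like to group them according to the following conditions: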


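A row qualifies if the folder column is "out" or "draft", the message column is "" (empty), and edit is "T"; within a group, the length column must also stay the same across consecutive rows. Once I have these groups, I want to find each group's duration (start and end). For example, the first group lasts 29 seconds, because it starts at 1/2/2020 1:00:01 AM and ends at 1/2/2020 1:00:30 AM. The second group starts at 1/2/2020 1:02:00 AM and ends at 1:02:05 AM. Finally, the third group starts at 1/2/2020 1:03:00 AM and ends at 1:03:20 AM. In addition, because these groups all belong to the same message, I would like to link them together using the following logic: if the subject, recipient, and length in the last row of one group match the subject, recipient, and length in the first row of the next group, then they all belong to the same group.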
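To make the target number concrete, here is a minimal Python check of the 29-second figure for the first group. It uses only the standard library; the format string `FMT` is an assumption about how the date column is written (month/day/year with a 12-hour clock and an AM/PM marker).

```python
from datetime import datetime

# Assumed layout of the date column, e.g. "1/2/2020 1:00:01 AM".
FMT = "%m/%d/%Y %I:%M:%S %p"

start = datetime.strptime("1/2/2020 1:00:01 AM", FMT)  # first row of the first group
end = datetime.strptime("1/2/2020 1:00:30 AM", FMT)    # last row of the first group

duration_seconds = (end - start).total_seconds()
print(duration_seconds)  # 29.0
```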



陪伴而非守候
1 answer

POPMUISE

    library(dplyr)

    df %>%
      # The original data was loaded as factors, which have their uses, but
      #   converting those to characters will be simpler to work with here.
      mutate_if(is.factor, as.character) %>%
      # I'm replacing NA in Subject & Recipient with an empty string, and trimming
      #   excess spaces from the start and end. One of the recipients is " "
      #   but I assume that's functionally the same as blank.
      mutate_at(c("Subject", "Recipient"),
                ~if_else(is.na(.), "", stringr::str_trim(.))) %>%
      # Rows with a blank Subject are dropped before segmenting.
      filter(Subject != '') %>%
      # %I and %p parse the 12-hour clock together with its AM/PM marker.
      mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %I:%M:%OS %p')) %>%
      mutate(cond = Edit & Folder %in% c('out', 'draft') & Message == '') %>%
      # segment increases at every non-matching row, so each run of
      #   consecutive matching rows shares one segment number.
      mutate(segment = cumsum(!cond)) %>%
      filter(cond) %>%   # EDIT: Added to keep only the rows matching cond
      # Get summary stats for each segment
      group_by(Subject, Recipient, Length, segment) %>%
      summarize(Start = min(Date),
                End = max(Date),
                Duration = End - Start,
                .groups = "drop_last") %>%
      # This flags rows where any of these columns doesn't match its
      #   predecessor. lag() gives NA on the first row of each remaining group
      #   (Subject, Recipient, Length), and coalesce() counts that as a change.
      #   Look at ?lag for more on what those parameters mean.
      mutate(new_group = as.integer(
               coalesce(Subject   != lag(Subject),   TRUE) |
               coalesce(Recipient != lag(Recipient), TRUE) |
               coalesce(Length    != lag(Length),    TRUE))) %>%
      ungroup() %>%
      mutate(group = LETTERS[cumsum(new_group)])

    # A tibble: 3 x 9
      Subject Recipient                   Length segment Start               End                 Duration new_group group
      <chr>   <chr>                        <int>   <int> <dttm>              <dttm>              <drtn>       <int> <chr>
    1 hey     sarah@mail.com,gee@mail.com     80       0 2020-01-02 01:00:10 2020-01-02 01:00:30 20 secs          1 A
    2 hey     sarah@mail.com,gee@mail.com     80       3 2020-01-02 01:02:00 2020-01-02 01:02:05  5 secs          0 A
    3 hey     sarah@mail.com,gee@mail.com     80       7 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs          0 A
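Since the question also allows Python, here is a rough standard-library sketch of the same segment-then-link idea. It is not a port of the pipeline above: the `rows` list is a hypothetical, shortened copy of the question's table, and `qualifies`, `segments`, and `summarize` are illustrative names. Unlike the R code, it keeps rows with a blank subject, so the first segment comes out at 29 seconds, as the question describes.

```python
from datetime import datetime
from itertools import groupby

FMT = "%m/%d/%Y %I:%M:%S %p"  # assumed date layout, 12-hour clock with AM/PM
TO = "sarah@mail.com,g@mail.com"

# Hypothetical sample: (subject, recipient, length, folder, message, date, edit)
rows = [
    ("", "", 80, "out", "", "1/2/2020 1:00:01 AM", True),
    ("", "", 80, "out", "", "1/2/2020 1:00:05 AM", True),
    ("hey", TO, 80, "out", "", "1/2/2020 1:00:10 AM", True),
    ("hey", TO, 80, "out", "", "1/2/2020 1:00:15 AM", True),
    ("hey", TO, 80, "out", "", "1/2/2020 1:00:30 AM", True),
    ("some", "k", 900, "in", "jjjjj", "1/2/2020 1:00:35 AM", False),
    ("hey", TO, 80, "draft", "", "1/2/2020 1:02:00 AM", True),
    ("hey", TO, 80, "draft", "", "1/2/2020 1:02:05 AM", True),
    ("no", "a", 900, "in", "iii", "1/2/2020 1:02:10 AM", False),
]

def qualifies(row):
    """The question's condition: edited, in out/draft, with an empty message."""
    _subject, _recipient, _length, folder, message, _date, edit = row
    return edit and folder in ("out", "draft") and message == ""

def segments(rows):
    """Yield runs of consecutive qualifying rows that share the same length."""
    for (ok, _length), run in groupby(rows, key=lambda r: (qualifies(r), r[2])):
        if ok:
            yield list(run)

def summarize(rows):
    """Return (message_id, start, end, seconds) for each segment."""
    out, message_id, prev_last = [], 0, None
    for run in segments(rows):
        first, last = run[0], run[-1]
        # Link to the previous segment when subject, recipient and length of
        # its last row match this segment's first row; otherwise start anew.
        if prev_last is None or first[:3] != prev_last[:3]:
            message_id += 1
        start = datetime.strptime(first[5], FMT)
        end = datetime.strptime(last[5], FMT)
        out.append((message_id, start, end, (end - start).total_seconds()))
        prev_last = last
    return out

for message_id, start, end, seconds in summarize(rows):
    print(message_id, start, end, seconds)
```

On this sample both segments receive the same message id, with durations of 29 and 5 seconds. Summing the seconds per message id would give the total time spent on that message.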
