How do I group rows together based on certain conditions? (R or Python)

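Goal:

Group rows together when the Subject, Re, and Length columns have the same consecutive values, Folder == "out" | "draft", Message == "", and Edit == "T", then get the duration of each group.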


Subject Re                    Length         Folder      Message   Date                   Edit
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:01 AM     T
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:05 AM     T
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:10 AM     T
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:15 AM     T
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:30 AM     T


hey     a@mail.com,b@mail.com 80             draft                 1/2/2020 1:02:00 AM     T
hey     a@mail.com,b@mail.com 80             draft                 1/2/2020 1:02:05 AM     T


hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:10 AM     T
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:20 AM     T

Desired output


 Start                  End                        Duration          Group
 1/2/2020 1:00:10 AM    1/2/2020 1:00:30 AM        20                A
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A
 1/2/2020 1:03:10 AM    1/2/2020 1:03:20 AM        10                A

I know I can filter like this:


   df1 <- df2 %>%
     mutate(Date = lubridate::mdy_hms(Date),
            cond = Edit == "T" & (Folder == "out" | Folder == "draft") &
              Message == "" & Subject == ? & Re == ? & Length == ?)

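But I am not sure how to incorporate the "if there are consecutive values" part. I will keep researching; any help or suggestions are much appreciated.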


皈依舞
1 Answer

胡子哥哥

Your structure looks slightly different from the data frame you posted:

    > df
       Subject                   Recipient Length Folder Message                Date Edit
    1                                          80    out      NA 1/2/2020 1:00:01 AM TRUE
    2                                          80    out      NA 1/2/2020 1:00:05 AM TRUE
    3      hey sarah@mail.com,gee@mail.com     80    out      NA 1/2/2020 1:00:10 AM TRUE
    4      hey sarah@mail.com,gee@mail.com     80    out      NA 1/2/2020 1:00:15 AM TRUE
    5      hey sarah@mail.com,gee@mail.com     80    out      NA 1/2/2020 1:00:30 AM TRUE
    6                                          NA             NA                       NA
    7                                          NA             NA                       NA
    8      hey sarah@mail.com,gee@mail.com     80  draft      NA 1/2/2020 1:02:00 AM TRUE
    9      hey sarah@mail.com,gee@mail.com     80  draft      NA 1/2/2020 1:02:05 AM TRUE
    10                                         NA             NA                       NA
    11                                         NA             NA                       NA
    12     hey sarah@mail.com,gee@mail.com    100  draft      NA 1/2/2020 1:03:00 AM TRUE
    13     hey sarah@mail.com,gee@mail.com    100  draft      NA 1/2/2020 1:03:20 AM TRUE

Also, your desired output suggests that you want to split the groups by Folder as an additional category, but that is not what your description says, so I did not group by Folder. That is easy to change if you want it, though.

You can use run-length encoding to tell apart groups of identical consecutive values in sorted data, but in R, turning the rle result into a data frame column is a little tricky. I used another answer to accomplish that.

    library(lubridate)
    library(dplyr)

    df %>%
      # Parse the timestamps and build one key per row from the grouping columns
      mutate(Date = mdy_hms(Date),
             Key = paste(Subject, Recipient, Length, sep = "_")) %>%
      arrange(Date) %>%
      # Parentheses matter here: & binds more tightly than | in R
      filter((Folder == "out" | Folder == "draft") & Edit == TRUE) %>%
      # Give every run of identical consecutive keys its own integer id
      mutate(RLE = {runs <- rle(Key); rep(seq_along(runs$lengths), runs$lengths)}) %>%
      group_by(RLE) %>%
      summarize(Start = first(Date),
                End = last(Date),
                Duration = as.numeric(End) - as.numeric(Start))

This creates groups from rows 1:2, 3:5 + 8:9, and 12:13. These groups give the following durations (in seconds):

    # A tibble: 3 x 4
        RLE Start               End                 Duration
      <int> <dttm>              <dttm>                 <dbl>
    1     1 2020-01-02 01:00:01 2020-01-02 01:00:05        4
    2     2 2020-01-02 01:00:10 2020-01-02 01:02:05      115
    3     3 2020-01-02 01:03:00 2020-01-02 01:03:20       20

If you want to include Folder in the grouping, add it to the columns pasted together when creating Key. This makes the groups 1:2, 3:5, 8:9, and 12:13. Doing so gives this result:

    # A tibble: 4 x 4
        RLE Start               End                 Duration
      <int> <dttm>              <dttm>                 <dbl>
    1     1 2020-01-02 01:00:01 2020-01-02 01:00:05        4
    2     2 2020-01-02 01:00:10 2020-01-02 01:00:30       20
    3     3 2020-01-02 01:02:00 2020-01-02 01:02:05        5
    4     4 2020-01-02 01:03:00 2020-01-02 01:03:20       20
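Since the question also allows Python, here is a rough sketch of the same run-length idea using only the standard library. It is not the answerer's code: the names rows, FIELDS, TO, and summarize_runs are made up for illustration, the sample rows copy the data frame above with the all-NA rows left out, and itertools.groupby stands in for rle because it also groups only consecutive equal keys.

```python
from datetime import datetime
from itertools import groupby

FIELDS = ("Subject", "Recipient", "Length", "Folder", "Date", "Edit")
TO = "sarah@mail.com,gee@mail.com"

# Hypothetical sample rows mirroring the answer's data frame (all-NA rows left out).
rows = [dict(zip(FIELDS, r)) for r in [
    ("",    "", 80,  "out",   "1/2/2020 1:00:01 AM", True),
    ("",    "", 80,  "out",   "1/2/2020 1:00:05 AM", True),
    ("hey", TO, 80,  "out",   "1/2/2020 1:00:10 AM", True),
    ("hey", TO, 80,  "out",   "1/2/2020 1:00:15 AM", True),
    ("hey", TO, 80,  "out",   "1/2/2020 1:00:30 AM", True),
    ("hey", TO, 80,  "draft", "1/2/2020 1:02:00 AM", True),
    ("hey", TO, 80,  "draft", "1/2/2020 1:02:05 AM", True),
    ("hey", TO, 100, "draft", "1/2/2020 1:03:00 AM", True),
    ("hey", TO, 100, "draft", "1/2/2020 1:03:20 AM", True),
]]


def summarize_runs(rows, key_fields=("Subject", "Recipient", "Length")):
    """Return (start, end, duration_in_seconds) for each run of consecutive
    rows that share the same values in key_fields."""
    # Keep only edited rows in the "out" or "draft" folders and parse the dates.
    kept = [
        dict(r, Date=datetime.strptime(r["Date"], "%m/%d/%Y %I:%M:%S %p"))
        for r in rows
        if r["Folder"] in ("out", "draft") and r["Edit"]
    ]
    kept.sort(key=lambda r: r["Date"])

    result = []
    # groupby starts a new group whenever the key changes, like rle in R.
    for _, run in groupby(kept, key=lambda r: tuple(r[f] for f in key_fields)):
        run = list(run)
        start, end = run[0]["Date"], run[-1]["Date"]
        result.append((start, end, (end - start).total_seconds()))
    return result


for start, end, seconds in summarize_runs(rows):
    print(start, end, seconds)
```

With these sample rows the durations come out as 4, 115, and 20 seconds, matching the first tibble. Passing key_fields=("Subject", "Recipient", "Length", "Folder") splits the runs by folder as well and gives 4, 20, 5, and 20 seconds.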