How do I join two DataFrames in Spark on a combined condition?

I can't figure out how to join two DataFrames like these with each other.


The first DataFrame stores information about when users made requests to the service center.


Let's call it df1:


+-----------+---------------------+
| USER_NAME | REQUEST_DATE        |
+-----------+---------------------+
| Alex      | 2018-03-01 00:00:00 |
| Alex      | 2018-09-01 00:00:00 |
| Bob       | 2018-03-01 00:00:00 |
| Mark      | 2018-02-01 00:00:00 |
| Mark      | 2018-07-01 00:00:00 |
| Kate      | 2018-02-01 00:00:00 |
+-----------+---------------------+

The second DataFrame stores information about the periods during which each user may use the service center's services (the license periods).


Let's call it df2:


+-----------+---------------------+---------------------+------------+
| USER_NAME | START_SERVICE       | END_SERVICE         | STATUS     |
+-----------+---------------------+---------------------+------------+
| Alex      | 2018-01-01 00:00:00 | 2018-06-01 00:00:00 | Active     |
| Bob       | 2018-01-01 00:00:00 | 2018-02-01 00:00:00 | Not Active |
| Mark      | 2018-01-01 00:00:00 | 2018-05-01 23:59:59 | Active     |
| Mark      | 2018-05-01 00:00:00 | 2018-08-01 23:59:59 | VIP        |
+-----------+---------------------+---------------------+------------+

How do I join these two DataFrames to return a result like the one below? In other words, how do I get each user's license status at the time of the request?


+-----------+---------------------+----------------+
| USER_NAME | REQUEST_DATE        | STATUS         |
+-----------+---------------------+----------------+
| Alex      | 2018-03-01 00:00:00 | Active         |
| Alex      | 2018-09-01 00:00:00 | No information |
| Bob       | 2018-03-01 00:00:00 | Not Active     |
| Mark      | 2018-02-01 00:00:00 | Active         |
| Mark      | 2018-07-01 00:00:00 | VIP            |
| Kate      | 2018-02-01 00:00:00 | No information |
+-----------+---------------------+----------------+

Code:


import org.apache.spark.sql.DataFrame
import spark.implicits._  // needed for .toDF on a local Seq

val df1: DataFrame = Seq(
    ("Alex", "2018-03-01 00:00:00"),
    ("Alex", "2018-09-01 00:00:00"),
    ("Bob", "2018-03-01 00:00:00"),
    ("Mark", "2018-02-01 00:00:00"),
    ("Mark", "2018-07-01 00:00:00"),
    ("Kate", "2018-02-01 00:00:00")
).toDF("USER_NAME", "REQUEST_DATE")
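For completeness, df2 from the question can be built the same way from the values shown in its table. A minimal sketch, assuming a SparkSession named `spark` is in scope (as in spark-shell):

```scala
// Sketch: reconstructs df2 from the question's table data.
import spark.implicits._

val df2 = Seq(
    ("Alex", "2018-01-01 00:00:00", "2018-06-01 00:00:00", "Active"),
    ("Bob",  "2018-01-01 00:00:00", "2018-02-01 00:00:00", "Not Active"),
    ("Mark", "2018-01-01 00:00:00", "2018-05-01 23:59:59", "Active"),
    ("Mark", "2018-05-01 00:00:00", "2018-08-01 23:59:59", "VIP")
).toDF("USER_NAME", "START_SERVICE", "END_SERVICE", "STATUS")
```

Note the date columns here are still strings; as the second answer below points out, they should be cast to timestamps before any range comparison.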



Asked by 饮歌长啸

2 Answers

繁花如伊

How do I join these two DataFrames and return such a result?

    val df_joined = df1.join(df2, Seq("USER_NAME"), "left")

How do I get each user's license status at the time of the request?

    import org.apache.spark.sql.functions.{col, when}

    val df_relevant = df_joined
      .withColumn("STATUS",
        when(col("REQUEST_DATE") > col("START_SERVICE") &&
             col("REQUEST_DATE") < col("END_SERVICE"), col("STATUS"))
          .otherwise("No information"))
      .select("USER_NAME", "REQUEST_DATE", "STATUS")

烙印99

You are comparing with <= and >= on string values, which is not correct. You should cast these columns to timestamps before making such comparisons. The code below works for me. By the way, the filter condition you are using does not produce the result you posted in the question — please check it again.

scala> val df = Seq(("Alex","2018-03-01 00:00:00"),("Alex","2018-09-01 00:00:00"),("Bob","2018-03-01 00:00:00"),("Mark","2018-02-01 00:00:00"),("Mark","2018-07-01 00:00:00"),("Kate","2018-02-01 00:00:00")).toDF("USER_NAME","REQUEST_DATE").withColumn("REQUEST_DATE",to_timestamp('REQUEST_DATE))
df: org.apache.spark.sql.DataFrame = [USER_NAME: string, REQUEST_DATE: timestamp]

scala> df.printSchema
root
 |-- USER_NAME: string (nullable = true)
 |-- REQUEST_DATE: timestamp (nullable = true)

scala> df.show(false)
+---------+-------------------+
|USER_NAME|REQUEST_DATE       |
+---------+-------------------+
|Alex     |2018-03-01 00:00:00|
|Alex     |2018-09-01 00:00:00|
|Bob      |2018-03-01 00:00:00|
|Mark     |2018-02-01 00:00:00|
|Mark     |2018-07-01 00:00:00|
|Kate     |2018-02-01 00:00:00|
+---------+-------------------+

scala> val df2 = Seq(("Alex","2018-01-01 00:00:00","2018-06-01 00:00:00","Active"),("Bob","2018-01-01 00:00:00","2018-02-01 00:00:00","Not Active"),("Mark","2018-01-01 00:00:00","2018-05-01 23:59:59","Active"),("Mark","2018-05-01 00:00:00","2018-08-01 23:59:59","VIP")).toDF("USER_NAME","START_SERVICE","END_SERVICE","STATUS").withColumn("START_SERVICE",to_timestamp('START_SERVICE)).withColumn("END_SERVICE",to_timestamp('END_SERVICE))
df2: org.apache.spark.sql.DataFrame = [USER_NAME: string, START_SERVICE: timestamp ... 2 more fields]

scala> df2.printSchema
root
 |-- USER_NAME: string (nullable = true)
 |-- START_SERVICE: timestamp (nullable = true)
 |-- END_SERVICE: timestamp (nullable = true)
 |-- STATUS: string (nullable = true)

scala> df2.show(false)
+---------+-------------------+-------------------+----------+
|USER_NAME|START_SERVICE      |END_SERVICE        |STATUS    |
+---------+-------------------+-------------------+----------+
|Alex     |2018-01-01 00:00:00|2018-06-01 00:00:00|Active    |
|Bob      |2018-01-01 00:00:00|2018-02-01 00:00:00|Not Active|
|Mark     |2018-01-01 00:00:00|2018-05-01 23:59:59|Active    |
|Mark     |2018-05-01 00:00:00|2018-08-01 23:59:59|VIP       |
+---------+-------------------+-------------------+----------+

scala> df.join(df2,Seq("USER_NAME"),"leftOuter").filter("REQUEST_DATE >= START_SERVICE and REQUEST_DATE <= END_SERVICE").select(df("*"),df2("STATUS")).show(false)
+---------+-------------------+------+
|USER_NAME|REQUEST_DATE       |STATUS|
+---------+-------------------+------+
|Alex     |2018-03-01 00:00:00|Active|
|Mark     |2018-02-01 00:00:00|Active|
|Mark     |2018-07-01 00:00:00|VIP   |
+---------+-------------------+------+
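The filter above drops requests that fall outside every license period, while the expected output in the question keeps them with "No information". A minimal sketch of one way to keep them, assuming `df` and `df2` with timestamp columns as built above: put the range condition into the join itself and fill the misses with coalesce. (As this answer notes, the question's expected table is not fully consistent with the range condition — e.g. Bob's request falls outside his license period, so it would come out as "No information", not "Not Active".)

```scala
// Sketch: left join on user name plus the date-range condition,
// then replace unmatched (null) statuses with "No information".
import org.apache.spark.sql.functions.{coalesce, lit}

val result = df.join(
    df2,
    df("USER_NAME") === df2("USER_NAME") &&
      df("REQUEST_DATE") >= df2("START_SERVICE") &&
      df("REQUEST_DATE") <= df2("END_SERVICE"),
    "left")
  .select(
    df("USER_NAME"),
    df("REQUEST_DATE"),
    coalesce(df2("STATUS"), lit("No information")).as("STATUS"))
```

Because the range condition is inside the join, a user with several license periods (like Mark) matches only the period covering the request, and users with no matching period (like Kate) still keep their request rows.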