How to map each i-th element of a DataFrame to a key of another DataFrame defined by ranges, in PySpark

What I want to do

Convert the input file df0 into the desired output df2, according to the cluster definitions in df1.


What I have

df0 = spark.createDataFrame(
    [('A',0.05),('B',0.01),('C',0.75),('D',1.05),('E',0.00),('F',0.95),('G',0.34),('H',0.13)],
    ("items","quotient")
)


df1 = spark.createDataFrame(
    [('C0',0.00,0.00),('C1',0.01,0.05),('C2',0.06,0.10),('C3',0.11,0.30),('C4',0.31,0.50),('C5',0.51,99.99)],
    ("cluster","from","to")
)

What I want

df2 = spark.createDataFrame(
    [('A',0.05,'C1'),('B',0.01,'C1'),('C',0.75,'C5'),('D',1.05,'C5'),('E',0.00,'C0'),('F',0.95,'C3'),('G',0.34,'C2'),('H',0.13,'C4')],
    ("items","quotient","cluster")
)

Notes

The coding environment is PySpark in Palantir.


To simplify the coding, the structure and content of DataFrame df1 may be adjusted: df1 only needs to tell which cluster each item in df0 should be linked to.
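One possible simplification of df1's structure, sketched here in plain Python rather than Spark: keep only each cluster's lower bound in a sorted list, and assign a quotient to the last cluster whose bound does not exceed it. The names `bounds` and `cluster_for` are illustrative, not from the original post.

```python
from bisect import bisect_right

# Hypothetical simplified form of df1: one lower bound per cluster,
# sorted ascending. With only lower bounds, the gaps between one
# cluster's 'to' and the next cluster's 'from' disappear.
bounds = [(0.00, 'C0'), (0.01, 'C1'), (0.06, 'C2'),
          (0.11, 'C3'), (0.31, 'C4'), (0.51, 'C5')]
lows = [b for b, _ in bounds]

def cluster_for(quotient):
    # bisect_right counts how many lower bounds are <= quotient;
    # the cluster is the one owning the last such bound.
    idx = bisect_right(lows, quotient) - 1
    return bounds[idx][1] if idx >= 0 else None
```

With this representation a value such as 0.055, which falls between the original ranges C1 (ends at 0.05) and C2 (starts at 0.06), still gets a cluster instead of no match.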


Many thanks in advance for your time and feedback!


慕姐8265434
1 Answer

白猪掌柜的

This is a simple left-join problem.

df0.join(df1, df0['quotient'].between(df1['from'], df1['to']), "left") \
    .select(*df0.columns, df1['cluster']).show()

+-----+--------+-------+
|items|quotient|cluster|
+-----+--------+-------+
|    A|    0.05|     C1|
|    B|    0.01|     C1|
|    C|    0.75|     C5|
|    D|    1.05|     C5|
|    E|     0.0|     C0|
|    F|    0.95|     C5|
|    G|    0.34|     C4|
|    H|    0.13|     C3|
+-----+--------+-------+
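The join condition can be sanity-checked with equivalent plain-Python logic (a minimal sketch outside Spark, using the same ranges; the helper name `assign_cluster` is mine, not part of the answer):

```python
# Plain-Python re-implementation of the between() join condition,
# useful to verify which cluster each quotient should land in.
clusters = [('C0', 0.00, 0.00), ('C1', 0.01, 0.05), ('C2', 0.06, 0.10),
            ('C3', 0.11, 0.30), ('C4', 0.31, 0.50), ('C5', 0.51, 99.99)]

def assign_cluster(quotient):
    # Mirrors df0['quotient'].between(df1['from'], df1['to']):
    # between() is inclusive on both ends.
    for name, lo, hi in clusters:
        if lo <= quotient <= hi:
            return name
    return None  # falls into a gap between ranges (e.g. 0.055)

items = [('A', 0.05), ('B', 0.01), ('C', 0.75), ('D', 1.05),
         ('E', 0.00), ('F', 0.95), ('G', 0.34), ('H', 0.13)]
result = [(i, q, assign_cluster(q)) for i, q in items]
```

This reproduces the answer's output, including F -> C5, G -> C4 and H -> C3, which differ from the df2 written in the question but do follow from the ranges given in df1.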
