PySpark: add columns to a DataFrame based on values from a different DataFrame

I have two DataFrames.

AA =

+---+---+---+-----+-----+
|id1|id2| nr|cell1|cell2|
+---+---+---+-----+-----+
|  1|  1|  0|  ab2|  ac3|
|  1|  1|  1|  dg6|  jf2|
|  2|  1|  1|  84d|  kf6|
|  2|  2|  1|  89m|  k34|
|  3|  1|  0|  5bd|  nc4|
+---+---+---+-----+-----+

The second DataFrame, BB, looks like this:

BB =

+---+---+---+-----+
|  a|  b|use| cell|
+---+---+---+-----+
|  1|  1|  x|  ab2|
|  1|  1|  a|  dg6|
|  2|  1|  b|  84d|
|  2|  2|  t|  89m|
|  3|  1|  d|  5bd|
+---+---+---+-----+

The cell column of BB contains every cell that can appear in the cell1 and cell2 columns of AA (cell1 - cell2 is an interval).

I want to add two columns to BB, val1 and val2. The conditions are as follows.

val1 is 1 when:

id1 == id2 (in AA),

and cell (in BB) == cell1 or cell2 (in AA),

and nr == 1 (in AA);

and 0 otherwise.

The other column is built as follows:

val2 is 1 when:

id1 != id2 (in AA),

and cell (in BB) == cell1 or cell2 (in AA),

and nr == 1 (in AA);

and 0 otherwise.
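To make the two rules concrete, here is a minimal plain-Python sketch (no Spark) that applies them to the sample rows shown above. It assumes that a BB row is matched to AA only through its cell value, and that a single qualifying AA row is enough to set a flag to 1. The question does not say whether a and b in BB must also match id1 and id2, so that reading is an assumption. The helper name flags_for is hypothetical.

```python
# Sample rows copied from the AA and BB tables in the question.
AA = [
    # (id1, id2, nr, cell1, cell2)
    (1, 1, 0, "ab2", "ac3"),
    (1, 1, 1, "dg6", "jf2"),
    (2, 1, 1, "84d", "kf6"),
    (2, 2, 1, "89m", "k34"),
    (3, 1, 0, "5bd", "nc4"),
]
BB = [
    # (a, b, use, cell)
    (1, 1, "x", "ab2"),
    (1, 1, "a", "dg6"),
    (2, 1, "b", "84d"),
    (2, 2, "t", "89m"),
    (3, 1, "d", "5bd"),
]


def flags_for(cell):
    """Return (val1, val2) for one BB cell by scanning every AA row.

    Assumption: only the cell value links BB to AA.
    """
    val1 = val2 = 0
    for id1, id2, nr, cell1, cell2 in AA:
        # Both rules require nr == 1 and a match on cell1 or cell2.
        if nr == 1 and cell in (cell1, cell2):
            if id1 == id2:
                val1 = 1
            else:
                val2 = 1
    return val1, val2


# BB with the two new columns appended to each row.
expected = [(a, b, use, cell) + flags_for(cell) for a, b, use, cell in BB]
for row in expected:
    print(row)
```

Under that reading, val1 is 1 for cells dg6 and 89m, val2 is 1 for cell 84d, and both flags are 0 for ab2 and 5bd.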

My attempt: I tried the following:

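from pyspark.sql.functions import when, col

# True for rows of AA where both ids agree
condition = col("id1") == col("id2")

# AA is the first DataFrame above; rows that fail the condition get 0
result = AA.withColumn("val1", when(condition, 1).otherwise(0))

result.show()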

But I quickly realized that this task is far beyond my PySpark skill level.

心有法竹
