PySpark - replace null values in a column with the distinct value of that column

How can I fill the null values in the category column with the distinct non-null value of category for each id?


+---+--------+----------+
| id|category|      Date|
+---+--------+----------+
| A1|    Null|2010-01-02|
| A1|    Null|2010-01-03|
| A1|   Nixon|2010-01-04|
| A1|    Null|2010-01-05|
| A9|    Null|2010-05-02|
| A9| Leonard|2010-05-03|
| A9|    Null|2010-05-04|
| A9|    Null|2010-05-05|
+---+--------+----------+
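
For reference, the sample data can be recreated with something like the following sketch; the local SparkSession setup and the string-typed Date column are assumptions, and the Null cells above become Python None:

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession just for reproducing the example.
spark = SparkSession.builder.master("local[*]").getOrCreate()

data = [
    ("A1", None, "2010-01-02"),
    ("A1", None, "2010-01-03"),
    ("A1", "Nixon", "2010-01-04"),
    ("A1", None, "2010-01-05"),
    ("A9", None, "2010-05-02"),
    ("A9", "Leonard", "2010-05-03"),
    ("A9", None, "2010-05-04"),
    ("A9", None, "2010-05-05"),
]
df = spark.createDataFrame(data, ["id", "category", "Date"])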

Desired dataframe:


+---+--------+----------+
| id|category|      Date|
+---+--------+----------+
| A1|   Nixon|2010-01-02|
| A1|   Nixon|2010-01-03|
| A1|   Nixon|2010-01-04|
| A1|   Nixon|2010-01-05|
| A9| Leonard|2010-05-02|
| A9| Leonard|2010-05-03|
| A9| Leonard|2010-05-04|
| A9| Leonard|2010-05-05|
+---+--------+----------+

I have tried:


w = Window().partitionBy("ID").orderBy("Date")
df = df.withColumn("category", F.when(col("category").isNull(), col("category")\
    .distinct().over(w))\
    .otherwise(col("category")))

I have also tried:


df = df.fillna({'category': col('category').distinct()})

And I have also tried:


df = df.withColumn('category', when(df.category.isNull(), df.category.distinct()).otherwise(df.category))


红糖糍粑
1 Answer

斯蒂芬大帝

You can use first() with its ignorenulls argument set to True, and apply rowsBetween(-sys.maxsize, sys.maxsize) to the window so that the first non-null category in each id partition is propagated to every row. (Your attempts fail because distinct() is a DataFrame method, not a Column method, so it cannot be used inside when(), over(), or fillna().)

from pyspark.sql import functions as F
from pyspark.sql.window import Window
import sys

w = Window().partitionBy("id").orderBy("Date")
df.withColumn("category", F.first('category', True).over(w.rowsBetween(-sys.maxsize, sys.maxsize)))\
        .orderBy("id", "Date").show()

+---+--------+----------+
| id|category|      Date|
+---+--------+----------+
| A1|   Nixon|2010-01-02|
| A1|   Nixon|2010-01-03|
| A1|   Nixon|2010-01-04|
| A1|   Nixon|2010-01-05|
| A9| Leonard|2010-05-02|
| A9| Leonard|2010-05-03|
| A9| Leonard|2010-05-04|
| A9| Leonard|2010-05-05|
+---+--------+----------+
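
As a side note, the same fill can be written without sys.maxsize by using Spark's symbolic frame boundaries. A minimal sketch, assuming Spark 2.1 or later where Window.unboundedPreceding and Window.unboundedFollowing are available:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Same partitioning, but with Spark's built-in unbounded frame markers
# instead of the sys.maxsize workaround.
w = (Window.partitionBy("id")
           .orderBy("Date")
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

# first(..., ignorenulls=True) picks the first non-null category in each id partition.
df = df.withColumn("category", F.first("category", ignorenulls=True).over(w))

Note that if an id ever had more than one non-null category, first() would silently pick one of them, so both versions assume each id maps to exactly one category, as in the sample data.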
