How do I get the index of the maximum value in each row's list in a Spark DataFrame?

You can create a user-defined function (UDF) that returns the index of the maximum value:

    from pyspark.sql import functions as f
    from pyspark.sql.types import IntegerType

    # Return the position of the largest element in each row's list
    max_index = f.udf(lambda x: x.index(max(x)), IntegerType())
    df = df.withColumn("topicID", max_index("topicDistribution"))

Example:

    >>> from pyspark.sql import functions as f
    >>> from pyspark.sql.types import IntegerType
    >>> df = spark.createDataFrame([{"topicDistribution": [0.2, 0.3, 0.5]}])
    >>> df.show()
    +-----------------+
    |topicDistribution|
    +-----------------+
    |  [0.2, 0.3, 0.5]|
    +-----------------+
    >>> max_index = f.udf(lambda x: x.index(max(x)), IntegerType())
    >>> df.withColumn("topicID", max_index("topicDistribution")).show()
    +-----------------+-------+
    |topicDistribution|topicID|
    +-----------------+-------+
    |  [0.2, 0.3, 0.5]|      2|
    +-----------------+-------+

Edit: Since you mentioned that the lists in topicDistribution are NumPy arrays, you can update the max_index UDF as follows:

    max_index = f.udf(lambda x: x.tolist().index(max(x)), IntegerType())
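One thing to watch for: the lambda used for max_index raises an error if a row holds a null or an empty list, because max() cannot handle either. Below is a minimal sketch of a more defensive helper, assuming your data may contain such rows. The name safe_max_index is hypothetical and not part of PySpark; the function is plain Python so it can be checked without a Spark session.

```python
from typing import Optional, Sequence


def safe_max_index(values: Optional[Sequence[float]]) -> Optional[int]:
    """Return the index of the first maximum in values, or None if there is none.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if values is None:
        # A null cell has no maximum, so return a null index.
        return None
    # NumPy arrays expose tolist(); plain Python lists do not.
    items = values.tolist() if hasattr(values, "tolist") else list(values)
    if not items:
        # An empty list has no maximum either.
        return None
    # list.index returns the first match, so ties resolve to the lowest index.
    return items.index(max(items))
```

It should be possible to register it the same way as the lambda, for example f.udf(safe_max_index, IntegerType()), in which case rows with no maximum would get a null topicID instead of failing the job.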
