PySpark: Concatenating two columns of data type "Struct" --> Error: cannot resolve due to data type mismatch

Your code is almost there.

Assuming your schema is as follows:

    df.printSchema()
    #root
    # |-- word_verb: struct (nullable = true)
    # |    |-- _1: string (nullable = true)
    # |    |-- _2: string (nullable = true)
    # |-- word_noun: struct (nullable = true)
    # |    |-- _1: string (nullable = true)
    # |    |-- _2: string (nullable = true)

You just need to access the value of the _1 field in each column:

    import pyspark.sql.functions as F

    # Pull the _1 string out of each struct, then join the two with a space.
    df.withColumn(
        "word_chunk_final",
        F.concat_ws(' ', F.col('word_verb')['_1'], F.col('word_noun')['_1'])
    ).show()
    #+-----------------+------------+----------------+
    #|        word_verb|   word_noun|word_chunk_final|
    #+-----------------+------------+----------------+
    #|        [cook,VB]|[chicken,NN]|    cook chicken|
    #|       [pack,VBN]|  [lunch,NN]|      pack lunch|
    #|[reconnected,VBN]|   [wifi,NN]|reconnected wifi|
    #+-----------------+------------+----------------+

Also, you should use concat_ws ("concatenate with separator") rather than concat to join the strings with a space between them. It works much like str.join in Python.
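To make the str.join comparison concrete, here is a minimal plain-Python sketch of how the two functions treat their inputs. The helper names `concat_ws_model` and `concat_model` are hypothetical and are not part of PySpark; they only imitate the documented behavior that concat_ws skips null inputs while concat returns null as soon as any input is null.

```python
from typing import Optional


def concat_ws_model(sep: str, *values: Optional[str]) -> str:
    """Hypothetical model of concat_ws: join the non-null values with sep."""
    return sep.join(v for v in values if v is not None)


def concat_model(*values: Optional[str]) -> Optional[str]:
    """Hypothetical model of concat: any null input makes the result null."""
    if any(v is None for v in values):
        return None
    return "".join(values)


# Rows mirror the (word_verb._1, word_noun._1) pairs from the table above.
rows = [("cook", "chicken"), ("pack", "lunch"), ("reconnected", "wifi")]
chunks = [concat_ws_model(" ", verb, noun) for verb, noun in rows]
print(chunks)
```

With concat you would have to pass the separator as its own argument, and a single null field would turn the whole result into null, which is one more reason concat_ws tends to be the safer choice here.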
