Explode in PySpark

I would like to transform a DataFrame that contains lists of words into a DataFrame with each word in its own row.

How do I explode a column in a DataFrame?

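Here is an example with some of my attempts. You can uncomment each line of code to get the error listed in the comment that follows it. I am using PySpark in Python 2.7 with Spark 1.6.1.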

from pyspark.sql.functions import split, explode

DF = sqlContext.createDataFrame([('cat \n\n elephant rat \n rat cat', )], ['word'])
print 'Dataset:'
DF.show()

print '\n\n Trying to do explode: \n'
DFsplit_explode = (
    DF
    .select(split(DF['word'], ' '))
    # .select(explode(DF['word']))  # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch: input to function explode should be array or map type, not StringType;"
    # .map(explode)  # AttributeError: 'PipelinedRDD' object has no attribute 'show'
    # .explode()  # AttributeError: 'DataFrame' object has no attribute 'explode'
).show()

# Trying without split
print '\n\n Only explode: \n'
DFsplit_explode = (
    DF
    .select(explode(DF['word']))  # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch: input to function explode should be array or map type, not StringType;"
).show()

Please advise.

牧羊人nacy
