Explode in PySpark

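I would like to transform a DataFrame that contains lists of words into a DataFrame in which each word is in its own row.


How do I explode a column in a DataFrame?


Below is an example with some of my attempts. You can uncomment each code line to get the error listed in the comment on that line. I am using PySpark with Python 2.7 and Spark 1.6.1.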


from pyspark.sql.functions import split, explode

DF = sqlContext.createDataFrame([('cat \n\n elephant rat \n rat cat', )], ['word'])
print 'Dataset:'
DF.show()

print '\n\n Trying to do explode: \n'
DFsplit_explode = (
    DF
    .select(split(DF['word'], ' '))
#   .select(explode(DF['word']))  # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch: input to function explode should be array or map type, not StringType;"
#   .map(explode)  # AttributeError: 'PipelinedRDD' object has no attribute 'show'
#   .explode()  # AttributeError: 'DataFrame' object has no attribute 'explode'
).show()

# Trying without split
print '\n\n Only explode: \n'

DFsplit_explode = (
    DF
    .select(explode(DF['word']))  # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch: input to function explode should be array or map type, not StringType;"
).show()

Please advise.


牧羊人nacy
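
The AnalysisException quoted above says that explode expects an array or map, so one possible direction is to apply split first and explode its result, roughly DF.select(explode(split(DF['word'], r'\s+')).alias('word')). The \s+ pattern and the alias are assumptions made here, not part of the question. Splitting on a single space, as in the attempt above, would leave the newline characters as separate array elements. The plain-Python sketch below does not call Spark; it only models what the two steps do to the sample row, and the helper names split_words and explode_rows are hypothetical.

```python
import re


def split_words(text):
    # Model of split(col, r'\s+'): break the string on runs of whitespace.
    return re.split(r'\s+', text)


def explode_rows(rows):
    # Model of explode(): emit one output row per element of each array.
    return [(word,) for (words,) in rows for word in words]


# Same sample row as in the question.
rows = [('cat \n\n elephant rat \n rat cat',)]

# Step 1: the string column becomes an array column.
split_rows = [(split_words(text),) for (text,) in rows]

# Step 2: each array element gets its own row.
exploded = explode_rows(split_rows)

for row in exploded:
    print(row)
```

In this model the single input row becomes five rows, one per word, with duplicates kept in their original order.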