Unable to load data from XML using PySpark

In Jupyter, I extract the data with the following command.


 !7z x stackoverflow.com-Posts.7z -oposts

# load xml file into spark data frame.

posts = spark.read.format("xml").option("rowTag", "row").load("./posts/Posts.xml")

This produces the following error:


Py4JJavaError: An error occurred while calling o532.load.

: java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html

    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)

    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)

    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)

    at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)


慕村9548890
1 answer

绝地无双

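You need to pass the spark-xml jar to the SparkContext when launching PySpark:

 pyspark --jars /home/Downloads/spark_jars/spark-xml_2.11-0.9.0.jar

Then read the file using the fully qualified data source name:

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "row").load("./posts/Posts.xml")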
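Since the question runs inside Jupyter rather than the pyspark shell, the same jar can probably be attached from the notebook itself. The following is a minimal sketch, not a confirmed fix: it assumes the jar sits at the path given in this answer, that its Scala suffix (2.11) matches the Spark build in use, and that no Spark session has been started yet in the kernel. The helper names submit_args_with_jar and load_posts are hypothetical and exist only for illustration.

```python
import os

# Path copied from the answer; adjust it to wherever the jar actually lives.
SPARK_XML_JAR = "/home/Downloads/spark_jars/spark-xml_2.11-0.9.0.jar"


def submit_args_with_jar(jar_path: str) -> str:
    """Build a PYSPARK_SUBMIT_ARGS value that attaches one jar (hypothetical helper)."""
    # "pyspark-shell" has to come last so PySpark starts its usual gateway.
    return f"--jars {jar_path} pyspark-shell"


def load_posts(xml_path: str = "./posts/Posts.xml"):
    """Start a session with the spark-xml jar attached and read the <row> elements."""
    # The variable is only read when the JVM is launched, so set it before
    # the first SparkSession is created in this kernel.
    os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args_with_jar(SPARK_XML_JAR)

    # Imported here so that the environment variable is already in place.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("load-posts-xml").getOrCreate()
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", "row")
        .load(xml_path)
    )
```

In a freshly restarted kernel, posts = load_posts() followed by posts.printSchema() should show whether the data source was found. If a session already exists in the notebook, the jar setting is likely to be ignored until the kernel is restarted.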