Use the `from_json` function with a schema you define. Example:

```python
from pyspark.sql.functions import *
from pyspark.sql.types import *

sampleJson = [
    ('{"user":100, "ips" : ["191.168.192.101", "191.168.192.103", "191.168.192.96", "191.168.192.99"]}',),
    ('{"user":101, "ips" : ["191.168.192.102", "191.168.192.105", "191.168.192.103", "191.168.192.107"]}',),
    ('{"user":102, "ips" : ["191.168.192.105", "191.168.192.101", "191.168.192.105", "191.168.192.107"]}',),
    ('{"user":103, "ips" : ["191.168.192.96", "191.168.192.100", "191.168.192.107", "191.168.192.101"]}',),
    ('{"user":104, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.102", "191.168.192.99"]}',),
    ('{"user":105, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.100", "191.168.192.96"]}',),
]

df1 = spark.createDataFrame(sampleJson)

# Schema for the JSON string: a user id plus an array of IP addresses.
sch = StructType([
    StructField('user', StringType(), False),
    StructField('ips', ArrayType(StringType()))
])

# Parse the JSON column (createDataFrame named it "_1") and flatten the struct.
df1.withColumn("n", from_json(col("_1"), sch)).select("n.*").show(10, False)

# +----+--------------------------------------------------------------------+
# |user|ips                                                                 |
# +----+--------------------------------------------------------------------+
# |100 |[191.168.192.101, 191.168.192.103, 191.168.192.96, 191.168.192.99]  |
# |101 |[191.168.192.102, 191.168.192.105, 191.168.192.103, 191.168.192.107]|
# |102 |[191.168.192.105, 191.168.192.101, 191.168.192.105, 191.168.192.107]|
# |103 |[191.168.192.96, 191.168.192.100, 191.168.192.107, 191.168.192.101] |
# |104 |[191.168.192.99, 191.168.192.99, 191.168.192.102, 191.168.192.99]   |
# |105 |[191.168.192.99, 191.168.192.99, 191.168.192.100, 191.168.192.96]   |
# +----+--------------------------------------------------------------------+

# Resulting schema:
df1.withColumn("n", from_json(col("_1"), sch)).select("n.*").printSchema()

# root
#  |-- user: string (nullable = true)
#  |-- ips: array (nullable = true)
#  |    |-- element: string (containsNull = true)
```
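If you would rather not build the `StructType` by hand, `from_json` also accepts a DDL-formatted schema string, and `schema_of_json` (Spark 2.4+) can infer the schema from a sample record. Below is a minimal sketch of both variants, assuming the same `df1` with the raw JSON in column `_1`; note that inference types `user` as a number rather than a string:

```python
from pyspark.sql.functions import from_json, schema_of_json, col, lit

# Variant 1: same parse, but with a DDL-formatted schema string instead of a StructType.
df1.withColumn("n", from_json(col("_1"), "user STRING, ips ARRAY<STRING>")) \
   .select("n.*").show(10, False)

# Variant 2: let Spark infer the schema from one sample row.
sample = df1.select("_1").first()[0]
inferred = df1.select(schema_of_json(lit(sample))).first()[0]
# inferred is a DDL string, e.g. 'STRUCT<ips: ARRAY<STRING>, user: BIGINT>'
df1.withColumn("n", from_json(col("_1"), inferred)).select("n.*").show(10, False)
```

The explicit schema is the safer choice when the JSON shape is known, since inference depends on whichever sample row you feed it.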