PySpark 2.2 explode drops null rows (how to implement explode

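I am working with some deeply nested data in a PySpark DataFrame. While flattening the structure into rows and columns, I noticed that when I call withColumn with explode, any row that contains null in the source column is dropped from the resulting DataFrame. Instead, I would like to find a way to keep that row and have null in the resulting column.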
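To make the behavior concrete, here is a minimal pure-Python sketch (no Spark required) that models what an inner explode does to rows with a null or empty array, next to an outer variant that keeps them. The names `explode_rows` and `explode_outer_rows` are hypothetical helpers that only mimic the semantics on lists of dicts; they are not PySpark APIs.

```python
def explode_rows(rows, column):
    """Model of explode: one output row per array element.

    Rows whose array is None or empty produce no output at all.
    """
    out = []
    for row in rows:
        for item in row[column] or []:
            out.append({**row, column: item})
    return out


def explode_outer_rows(rows, column):
    """Model of an outer explode: None or empty arrays yield one row with None."""
    out = []
    for row in rows:
        items = row[column] or [None]
        for item in items:
            out.append({**row, column: item})
    return out


# Illustrative data only: one populated array, one empty array, one null.
rows = [
    {"posx": 0, "shape": ["square"]},
    {"posx": 1, "shape": []},
    {"posx": 2, "shape": None},
]

inner = explode_rows(rows, "shape")        # rows with posx 1 and 2 disappear
outer = explode_outer_rows(rows, "shape")  # every input row survives
```

The second function describes the result I am after: every input row appears at least once, with None standing in for the missing element.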

A sample DataFrame to work with:

from pyspark.sql.functions import explode, first, col, monotonically_increasing_id
from pyspark.sql import Row

# Assumes an active SparkSession named `spark`.
# One row whose dataCells array holds three structs; the second struct has an empty shape array.
df = spark.createDataFrame([
    Row(dataCells=[
        Row(posx=0, posy=1, posz=.5, value=1.5, shape=[Row(_type='square', _len=1)]),
        Row(posx=1, posy=3, posz=.5, value=4.5, shape=[]),
        Row(posx=2, posy=5, posz=.5, value=7.5, shape=[Row(_type='circle', _len=.5)])
    ])
])

I also have a function I use to flatten structs:

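def flatten_struct_cols(df):
    # Columns whose type string does not start with 'struct' are kept unchanged.
    flat_cols = [column[0] for column in df.dtypes if 'struct' not in column[1][:6]]
    # Columns whose type string starts with 'struct' are expanded field by field.
    struct_columns = [column[0] for column in df.dtypes if 'struct' in column[1][:6]]

    # Select every struct field as a top-level column named <struct>_<field>.
    df = df.select(flat_cols +
                   [col(sc + '.' + c).alias(sc + '_' + c)
                    for sc in struct_columns
                    for c in df.select(sc + '.*').columns])

    return df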
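As a rough illustration of how `flatten_struct_cols` classifies columns, the sketch below applies the same prefix test to a hand-written list of `(name, type_string)` pairs shaped like `df.dtypes`. The helper names `split_struct_columns` and `flattened_names`, and the sample column names, are assumptions made for this example only.

```python
def split_struct_columns(dtypes):
    """Partition (name, type_string) pairs the way flatten_struct_cols does."""
    flat_cols = [name for name, dtype in dtypes if not dtype.startswith("struct")]
    struct_cols = [name for name, dtype in dtypes if dtype.startswith("struct")]
    return flat_cols, struct_cols


def flattened_names(struct_fields):
    """Build the <struct>_<field> aliases for a {struct_name: [field, ...]} mapping."""
    return [sc + "_" + c for sc, fields in struct_fields.items() for c in fields]


# Hand-written stand-in for df.dtypes; column names and types are illustrative.
dtypes = [
    ("id", "bigint"),
    ("cell", "struct<posx:bigint,posy:bigint>"),
    ("shape", "array<struct<_type:string>>"),
]

flat_cols, struct_cols = split_struct_columns(dtypes)
aliases = flattened_names({"cell": ["posx", "posy"]})
```

Note that an array of structs counts as a flat column under this test, so it has to be exploded into a struct column before the function can expand its fields.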

The schema looks like this:

df.printSchema()
root
 |-- dataCells: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- posx: long (nullable = true)
 |    |    |-- posy: long (nullable = true)
 |    |    |-- posz: double (nullable = true)
 |    |    |-- shape: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |-- value: double (nullable = true)

尚方宝剑之说
