Split a Spark Dataframe string column into multiple columns

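I've seen various people suggest that Dataframe.explode is a useful way to do this, but it produces more rows than the original dataframe, which isn't what I want at all. I simply want the Dataframe equivalent of the very simple: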

rdd.map(lambda row: tuple(row) + tuple(row.my_str_col.split('-')))

It takes something that looks like:

col1 | my_str_col
-----+-----------
  18 |  856-yygrm
 201 |  777-psgdg

and converts it to this:

col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
  18 |  856-yygrm |   856 | yygrm
 201 |  777-psgdg |   777 | psgdg
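
To pin down the transformation being asked for, here is a minimal plain-Python sketch of the per-row behavior, with no Spark involved. The `Record` namedtuple and the `split_row` helper are hypothetical names used only for illustration; `Record` merely stands in for a Spark Row with the two columns from the example.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row; field names mirror the example table.
Record = namedtuple("Record", ["col1", "my_str_col"])


def split_row(row, sep="-"):
    """Return the original fields followed by each piece of my_str_col
    as its own top-level field (not as one nested list)."""
    return tuple(row) + tuple(row.my_str_col.split(sep))


rows = [Record(18, "856-yygrm"), Record(201, "777-psgdg")]
flattened = [split_row(r) for r in rows]

for r in flattened:
    print(r)
```

The point of the sketch is that the number of rows stays the same and each piece of the split string becomes a separate field, rather than a single array-valued field.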

I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of the two top-level columns I want.

Ideally, I want these new columns to be named as well.

胡子哥哥
