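I have a PySpark DataFrame with 4 columns. I want to extract some strings from one column whose type is array of strings. I used the regexp_extract function, but it returned an error because regexp_extract only accepts a string.
Example DataFrame: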
id | last_name | age | Identificator
------------------------------------------------------------------
12 | AA | 23 | "[""AZE","POI","76759","T86420","ADAPT"]"
------------------------------------------------------------------
24 | BB | 24 | "[""SDN","34","35","AZE","21054","20126"]"
------------------------------------------------------------------
I want to extract all numbers, with these requirements:
- the number contains 4, 5 or 6 digits
- it must not be attached to a letter
- if it is attached to the letter Z, that is OK and I should still extract it
- the result is saved in a new column of my DataFrame
I started like this, but it does not work because the column is an array of strings:
from pyspark.sql import functions as F

expression = r'([0-9]){4,6}'
df = df.withColumn("extract", F.regexp_extract(F.col("Identificator"), expression, 1))
How can I extract these numbers using regexp_extract or some other solution? Thanks.
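One possible approach, as a minimal sketch rather than a definitive answer: test each array element against the pattern separately instead of passing the whole array to regexp_extract. The sketch assumes that Identificator really is an array<string> column, that "attached to the letter Z" means an optional Z directly before or after the digits, and that only the digits should be kept. The names ID_PATTERN, extract_ids and add_extract_column are hypothetical helpers introduced for illustration. Note that the capture group wraps the whole quantified run, ([0-9]{4,6}), since ([0-9]){4,6} would capture only the last digit.

```python
import re

# Assumption: an optional "Z" may sit directly before or after the digits;
# any other attached letter disqualifies the element.
ID_PATTERN = re.compile(r"Z?([0-9]{4,6})Z?")


def extract_ids(values):
    """Return the 4 to 6 digit numbers found in a list of strings."""
    if values is None:
        return []
    # fullmatch forces the whole element to fit the pattern, so "T86420"
    # or a 7-digit string is rejected.
    matches = (ID_PATTERN.fullmatch(v) for v in values if v is not None)
    return [m.group(1) for m in matches if m]


def add_extract_column(df):
    """Add an "extract" column built from the "Identificator" array column."""
    # Imported here so the pure-Python helper above works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    extract_udf = F.udf(extract_ids, ArrayType(StringType()))
    return df.withColumn("extract", extract_udf(F.col("Identificator")))


# The two sample rows from the question, checked without Spark:
print(extract_ids(["AZE", "POI", "76759", "T86420", "ADAPT"]))
print(extract_ids(["SDN", "34", "35", "AZE", "21054", "20126"]))
```

With a real DataFrame this would be called as df = add_extract_column(df). A Python UDF is the simplest thing to reason about, but it is usually slower than built-in functions. On Spark 2.4 or later the same per-element check could probably be written with the SQL higher-order functions transform and filter inside F.expr, which is not shown here.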
蛊毒传说