PySpark: search for keywords with a regular expression, then join with another DataFrame

I have two DataFrames.

DataFrame A

name       groceries
Mike       apple, orange, banana, noodle, red wine
Kate       white wine, green beans, extra pineapple hawaiian pizza
Leah       red wine, juice, rice, grapes, green beans
Ben        water, spaghetti

DataFrame B

id       item
0001     red wine
0002     green beans

I iterate over B row by row and use a regular expression to check whether each item appears in the groceries column of A.


import re
import pyspark.sql.functions as F

df = None
# Collect every item in B to the driver, then filter A once per keyword
for keyword in B.select('item').rdd.flatMap(lambda x: x).collect():
    if keyword is None:
        continue
    # Build one case-insensitive lookahead per word of the keyword
    pattern = '(?i)^'
    start = '(?=.*\\b'
    end = '\\b)'
    for word in re.split('\\s+', keyword):
        pattern = pattern + start + word + end
    pattern = pattern + '.*$'

    matched = A.filter(A['groceries'].rlike(pattern)).withColumn('item', F.lit(keyword))
    # Start with the first result, then append each later one
    df = matched if df is None else df.union(matched)
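One property of the pattern built above may explain unexpected matches: each word of the keyword gets its own lookahead, so the words only need to appear somewhere in the string, not next to each other. The sketch below reproduces the pattern with Python's `re` module (not Spark) and compares it with a contiguous-phrase pattern. The helper names and the sample row are hypothetical and do not come from the question.

```python
import re

def build_pattern(keyword):
    """Rebuild the question's pattern: one lookahead per word of the keyword."""
    pattern = r'(?i)^'
    for word in re.split(r'\s+', keyword):
        pattern += r'(?=.*\b' + word + r'\b)'
    return pattern + r'.*$'

def build_phrase_pattern(keyword):
    """Alternative (an assumption): match the keyword as one contiguous phrase."""
    words = [re.escape(w) for w in re.split(r'\s+', keyword.strip())]
    return r'(?i)\b' + r'\s+'.join(words) + r'\b'

# Hypothetical row: contains "red" and "wine", but not the phrase "red wine"
row = 'red apple, white wine'

lookahead_hit = re.search(build_pattern('red wine'), row) is not None
phrase_hit = re.search(build_phrase_pattern('red wine'), row) is not None
print(lookahead_hit, phrase_hit)
```

If rows like this exist in the real data, the lookahead version would pair them with "red wine" while the phrase version would not.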

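The output I want is the rows of A that contain an item from B, with the matching item keyword added as a new column:

name       groceries                                                     item
Mike       apple, orange, banana, noodle, red wine                       red wine
Leah       red wine, juice, rice, grapes, green beans                    red wine
Kate       white wine, green beans, extra pineapple hawaiian pizza       green beans
Leah       red wine, juice, rice, grapes, green beans                    green beans

The actual output is not what I want, and I don't understand what is wrong with this approach.

I would also like to know whether there is a way to join A and B directly using rlike, so that rows are joined only when an item from B appears in the groceries of A. Thanks!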


一只名叫tom的猫
1 answer

慕尼黑的夜晚无繁华

You can do an rlike join with F.expr(). In your case, use it together with an inner join. Try this:

import pyspark.sql.functions as F

test1 = sqlContext.createDataFrame([("Mike","apple,greenbeans,redwine,the little prince 70th anniversary gift set (book/cd/downloadable audio)"),("kate","Whitewine,greenbeans,pineapple"),("Ben","Water,Spaghetti")],schema=["name","groceries"])
test2 = sqlContext.createDataFrame([("001","redwine"),("002","greenbeans"),("003","cd")],schema=["id","item"])

# Join on a SQL expression: a row pair matches when groceries matches item as a regex
test_join = test1.join(test2,F.expr("""groceries rlike item"""),how='inner')

Result:

test_join.show(truncate=False)

+----+-------------------------------------------------------------------------------------------------+---+----------+
|name|groceries                                                                                        |id |item      |
+----+-------------------------------------------------------------------------------------------------+---+----------+
|Mike|apple,greenbeans,redwine,the little prince 70th anniversary gift set (book/cd/downloadable audio)|001|redwine   |
|Mike|apple,greenbeans,redwine,the little prince 70th anniversary gift set (book/cd/downloadable audio)|002|greenbeans|
|Mike|apple,greenbeans,redwine,the little prince 70th anniversary gift set (book/cd/downloadable audio)|003|cd        |
|kate|Whitewine,greenbeans,pineapple                                                                   |002|greenbeans|
+----+-------------------------------------------------------------------------------------------------+---+----------+

For your more complex dataset, the contains() function should work, because it compares item as a literal substring instead of interpreting it as a regular expression:

import pyspark.sql.functions as F

test1 = spark.createDataFrame([("Mike","apple, oranges, red wine,green beans"),("Kate","Whitewine, green beans waterrr, pineapple, red wine"), ("Leah", "red wine, juice, rice, grapes, green beans"),("Ben","Water,Spaghetti, the little prince 70th anniversary gift set (book/cd/downloadable audio)")],schema=["name","groceries"])
test2 = spark.createDataFrame([("001","red wine"),("002","green beans waterrr"), ("003", "the little prince 70th anniversary gift set (book/cd/downloadable audio)")],schema=["id","item"])

# Join when the groceries string contains the item string verbatim
test_join = test1.join(test2,F.col('groceries').contains(F.col('item')),how='inner')

Result:

+----+-----------------------------------------------------------------------------------------+---+------------------------------------------------------------------------+
|name|groceries                                                                                |id |item                                                                    |
+----+-----------------------------------------------------------------------------------------+---+------------------------------------------------------------------------+
|Mike|apple, oranges, red wine,green beans                                                     |001|red wine                                                                |
|Kate|Whitewine, green beans waterrr, pineapple, red wine                                      |001|red wine                                                                |
|Kate|Whitewine, green beans waterrr, pineapple, red wine                                      |002|green beans waterrr                                                     |
|Leah|red wine, juice, rice, grapes, green beans                                               |001|red wine                                                                |
|Ben |Water,Spaghetti, the little prince 70th anniversary gift set (book/cd/downloadable audio)|003|the little prince 70th anniversary gift set (book/cd/downloadable audio)|
+----+-----------------------------------------------------------------------------------------+---+------------------------------------------------------------------------+
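To illustrate why a literal substring test can behave differently from a regex test on items like the ones above, here is a minimal sketch in plain Python. It only emulates the join condition with `re.search` and the `in` operator; it is not Spark code, and the helper `cross_match` is a hypothetical name introduced for illustration. The parentheses in the last item are treated as a regex group, so the regex form no longer matches the literal "(" in the groceries text.

```python
import re

groceries = [
    ("Mike", "apple, oranges, red wine,green beans"),
    ("Ben", "Water,Spaghetti, the little prince 70th anniversary gift set (book/cd/downloadable audio)"),
]
items = [
    ("001", "red wine"),
    ("003", "the little prince 70th anniversary gift set (book/cd/downloadable audio)"),
]

def cross_match(predicate):
    """Return (name, id) pairs for which predicate(groceries_text, item) holds."""
    return [(name, item_id)
            for name, text in groceries
            for item_id, item in items
            if predicate(text, item)]

# Item used as a regex: "(" and ")" become a group, not literal characters
as_regex = cross_match(lambda text, item: re.search(item, text) is not None)
# Item used as a literal substring, similar in spirit to contains()
as_substring = cross_match(lambda text, item: item in text)
# Item escaped first, so the regex engine treats every character literally
as_escaped = cross_match(lambda text, item: re.search(re.escape(item), text) is not None)

print(as_regex)
print(as_substring)
print(as_escaped)
```

In this emulation the unescaped regex misses the Ben row, while the substring and escaped variants both find it. If regex features such as word boundaries are still wanted, escaping the item before building the pattern is one possible approach.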