Python (PySpark): building a nested list with reduceByKey; Python list append creates nested lists



I have an RDD input in the following format:

[('2002', ['cougar', 1]),

('2002', ['the', 10]),

('2002', ['network', 4]),

('2002', ['is', 1]),

('2002', ['database', 13])]

'2002' is the key. So my key-value pairs look like this:

('year', ['word', count])

The count is an integer, and I want to use reduceByKey to get the following result:

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]

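I am having trouble producing the nested list shown above; building the nested list is the main problem. For example, suppose I have three lists a, b, and c: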

a = ['cougar', 1]

b = ['the', 10]

c = ['network', 4]

a.append(b)

leaves a as

['cougar', 1, ['the', 10]]

and

x = []

x.append(a)

x.append(b)

leaves x as (with a and b as originally defined)

[['cougar', 1], ['the', 10]]

However, if I then run

c.append(x)

c becomes

['network', 4, [['cougar', 1], ['the', 10]]]

None of the operations above gives me the result I want.

What I want is

[('2002', [[word1, c1],[word2, c2], [word3, c3], ...]),

('2003', [[w1, count1],[w2, count2], [w3, count3], ...])]

That is, the nested list should be:

[a, b, c]

where a, b, and c are themselves two-element lists.

I hope the question is clear. Any suggestions?

慕哥9229398


2 answers

牛魔王的故事

You do not need reduceByKey for this problem.

Define the RDD:

rdd = sc.parallelize([('2002', ['cougar', 1]), ('2002', ['the', 10]), ('2002', ['network', 4]), ('2002', ['is', 1]), ('2002', ['database', 13])])

Check the RDD values with rdd.collect():

[('2002', ['cougar', 1]), ('2002', ['the', 10]), ('2002', ['network', 4]), ('2002', ['is', 1]), ('2002', ['database', 13])]

Apply the groupByKey function and map the values to a list, as described in the Apache Spark documentation:

rdd_nested = rdd.groupByKey().mapValues(list)

Check the grouped values with rdd_nested.collect():

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]
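To see what groupByKey().mapValues(list) computes without starting a Spark session, the grouping can be imitated on a local list. This is only a minimal sketch: the helper name group_by_key is hypothetical, and it mimics the shape of the result, not Spark's distributed execution or its ordering guarantees.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key.

    Hypothetical local stand-in for rdd.groupByKey().mapValues(list).
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        # append keeps each [word, count] pair intact as one element
        grouped[key].append(value)
    # Sort by key so the output order is deterministic
    return sorted(grouped.items())


pairs = [
    ('2002', ['cougar', 1]),
    ('2002', ['the', 10]),
    ('2002', ['network', 4]),
    ('2002', ['is', 1]),
    ('2002', ['database', 13]),
]

result = group_by_key(pairs)
print(result)
```

Each value is appended to a list that belongs to the key, never to another [word, count] pair, which is why the pairs stay two elements long.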


POPMUISE

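I came up with one solution:

def wagg(a, b):
    # Each argument is either a single [word, count] pair or an
    # already merged list of pairs; the first element tells which.
    if isinstance(a[0], list):
        if isinstance(b[0], list):
            a.extend(b)    # both are merged lists: concatenate
        else:
            a.append(b)    # a is merged, b is a single pair
        w = a
    elif isinstance(b[0], list):
        b.append(a)        # b is merged, a is a single pair
        w = b
    else:
        w = [a, b]         # two single pairs: start a new nested list
    return w

rdd2 = rdd1.reduceByKey(wagg)

Does anyone have a better solution?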
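One possible alternative to the type checks in wagg is to wrap every value in a one-element list first, so that the reduce function only ever concatenates lists. The sketch below assumes rdd is a PySpark pair RDD like the one in the question; the names wrap, merge, and nest_with_reduce_by_key are hypothetical, and the local check uses functools.reduce rather than Spark.

```python
from functools import reduce


def wrap(value):
    """Turn a single [word, count] pair into a one-element nested list."""
    return [value]


def merge(left, right):
    """Concatenate two nested lists without mutating either input."""
    return left + right


def nest_with_reduce_by_key(rdd):
    """Assumes rdd is a PySpark pair RDD of (year, [word, count])."""
    return rdd.mapValues(wrap).reduceByKey(merge)


# Local check of the same wrap-then-merge logic, no Spark needed
values = [['cougar', 1], ['the', 10], ['network', 4]]
nested = reduce(merge, map(wrap, values))
print(nested)
```

Because every value is wrapped before the reduce step, a key with only one pair still ends up as a nested list. With wagg, reduceByKey never calls the function for such a key, so its value would stay a flat [word, count] list.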

