Cartesian product of a PCollection with itself

I'll propose a solution in Python. First, let's implement the algorithm; then we'll deal with the memory limits.

    import itertools

    # Build a list with your pairs
    collection_items = [("foo", 0), ("bar", 1), ("baz", 2)]

    # A Python generator produces a sequence of results lazily. It keeps its
    # local state, so it resumes exactly where it left off each time it is
    # advanced. The same generator can't be consumed twice, which is why two
    # are created here. Why generators are used at all is explained further down.
    collection_generator1 = (el for el in collection_items)  # Create the first generator
    # For example: next(collection_generator1) => ("foo", 0),
    # next(collection_generator1) => ("bar", 1),
    # next(collection_generator1) => ("baz", 2)
    collection_generator2 = (el for el in collection_items)  # Create the second generator

    # Create the Cartesian product
    cartesian_product = itertools.product(collection_generator1, collection_generator2)

    for pair in cartesian_product:
        first_el, second_el = pair
        str_pair1, val_pair1 = first_el
        str_pair2, val_pair2 = second_el  # unpack the second element, not the first
        name = "{str_pair1}+{str_pair2}".format(str_pair1=str_pair1, str_pair2=str_pair2)
        item = (name, [first_el, second_el])  # Compose the item
        print(item)

Output:

    ('foo+foo', [('foo', 0), ('foo', 0)])
    ('foo+bar', [('foo', 0), ('bar', 1)])
    ('foo+baz', [('foo', 0), ('baz', 2)])
    ('bar+foo', [('bar', 1), ('foo', 0)])
    ('bar+bar', [('bar', 1), ('bar', 1)])
    ('bar+baz', [('bar', 1), ('baz', 2)])
    ('baz+foo', [('baz', 2), ('foo', 0)])
    ('baz+bar', [('baz', 2), ('bar', 1)])
    ('baz+baz', [('baz', 2), ('baz', 2)])

Now let's address the memory problem.

Since you have a lot of data, it is better to store it in a file, writing one pair per line (as in the example). Then read the file ("input.txt") and create generators over its contents.

    # Iterate over the file object directly; calling readlines() would load
    # the whole file into memory at once.
    file_generator_1 = (line.strip() for line in open("input.txt"))
    file_generator_2 = (line.strip() for line in open("input.txt"))

Now the only change you need to make is to replace the variable names collection_generator1 and collection_generator2 with file_generator_1 and file_generator_2.
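
One caveat: `itertools.product` fully consumes its input iterables and keeps them in memory before it yields the first pair, so passing file-backed generators to it does not by itself reduce memory use. Below is a minimal sketch of an alternative that re-reads the file for every outer line, so only one line from each pass is held at a time. The function name `file_cartesian_product` is hypothetical, and the sketch assumes each line of the file is a Python tuple literal such as `("foo", 0)`; adjust the parsing if your file uses another format.

```python
import ast
import os
import tempfile


def file_cartesian_product(path):
    """Yield (name, [left, right]) for every ordered pair of lines in the file.

    Assumes each non-empty line is a tuple literal like ("foo", 0).
    """
    with open(path) as outer:
        for outer_line in outer:
            if not outer_line.strip():
                continue  # skip blank lines
            left = ast.literal_eval(outer_line.strip())
            # Re-open the file for each outer line instead of caching it,
            # trading extra I/O for a constant memory footprint.
            with open(path) as inner:
                for inner_line in inner:
                    if not inner_line.strip():
                        continue
                    right = ast.literal_eval(inner_line.strip())
                    yield ("{}+{}".format(left[0], right[0]), [left, right])


# Small demo on a temporary file standing in for input.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write('("foo", 0)\n("bar", 1)\n("baz", 2)\n')

try:
    for item in file_cartesian_product(tmp.name):
        print(item)
finally:
    os.remove(tmp.name)
```

The trade-off is that the file is read once per line, so this is slower than the in-memory version but keeps memory use flat regardless of file size.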

