有效删除元组列表中的部分重复项

我有一个元组列表,该列表的长度可以在 ~8 - 1000 之间变化,具体取决于元组的长度。列表中的每个元组都是唯一的。元组的长度为 N,其中每个条目都是通用单词。

示例元组的长度可以是 N(Word 1, Word 2, Word 3, ..., Word N)

对于列表中的任何元组,所述元组中的元素 j 将是''Word j

一个非常简单的字母示例是

l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), 
     ('', '', '', 'D'), ('A', '', '', 'D'), 
     ('', 'B', '', '')]

每个元组的每个位置要么具有相同的值,要么为空。''我想删除所有在同一位置的另一个元组中具有所有非值的元组。例如,其中(A,B,'','')包含所有非''(A,B,C,''),因此应将其删除。

filtered_l = [(A,B,C,''),(A,'','',D)]

元组的长度始终相同(不一定是 4)。元组的长度在 2-10 之间。

最快的方法是什么?


慕后森
浏览 124回答 5
5回答

冉冉说

让我们将每个元组概念化为一个二进制数组,其中 1 是“包含某些内容”,2 是“包含一个空字符串”。由于每个位置的项目都是相同的,因此我们不需要关心每个位置有什么,只关心有什么。l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]# [3, 7, 8, 9, 2]# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]# that it's backwards doesn't really matter, since it's consistent现在,我们可以遍历该列表并构建一个没有“重复项”的新数据结构。由于我们将元组编码为二进制,因此我们可以通过执行按位运算来确定重复的、被另一个“包围”的元组 - 给定aand b, if a | b == a,thena必须包含b。codes = {}for tup, b in zip(l, l_bin):    # check if any existing code contains the potential new one    # in this case, skip adding the new one    if any(a | b == a for a in codes):        continue    # check if the new code contains a potential existing one or more    # in which case, replace the existing code(s) with the new code    for a in list(codes):        if b | a == b:            codes.pop(a)    # and finally, add this code to our datastructure    codes[b] = tup现在我们可以撤回我们的“过滤”元组列表:output = list(codes.values())# [('A', 'B', 'C', ''), ('A', '', '', 'D')]请注意,(A, B, C, '')包含(A, B, '', '')和('', B, '', ''),并且(A, '', '', D')包含('', '', '', D),所以这应该是正确的。从 python 3.8 开始,dict保留插入顺序,因此输出应该与元组最初出现在列表中的顺序相同。这个解决方案不会非常有效,因为代码的数量可能会堆积起来,但它应该在 O(n) 和 O(n^2) 之间,具体取决于最后剩下的唯一代码的数量(并且因为每个元组的长度明显小于 的长度l,它应该更接近 O(n) 而不是 O(n^2)。

缥缈止盈

特别是对于该限制,明显的解决方案是将每个元组转换为位掩码,将它们累积在计数器数组中,执行子集和转换,然后过滤数组l。详细代码说明见评论。时间复杂度显然是n + m * 2^m,其中n是元组的数量,m是每个元组的长度。对于n == 1000和m == 10,这显然比 更快n^2。l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]# assumes that l is not empty. (to access l[0])# The case where l is empty is trivial to handle.def tuple_to_mask(tuple_):&nbsp; &nbsp; # convert the information whether each value in (tuple_) is empty to a bit mask&nbsp; &nbsp; # (1 is empty, 0 is not empty)&nbsp; &nbsp; return sum((value == '') << index for index, value in enumerate(tuple_))count = [0] * (1 << len(l[0]))for tuple_ in l:&nbsp; &nbsp; # tuple_ is a tuple.&nbsp; &nbsp; count[tuple_to_mask(tuple_)] += 1# now count[mask] is the number of tuples in l with that mask# transform the count array.for dimension in range(len(l[0])):&nbsp; &nbsp; for mask in range(len(count)):&nbsp; &nbsp; &nbsp; &nbsp; if mask >> dimension & 1:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; count[mask] += count[mask - (1 << dimension)]# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)# (i.e. all the bits that are set in mask_ are also set in mask)filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]print(filtered_l)

喵喵时光机

我不确定这是否是最有效的或Python式的方法,但这将是直接的方法(同样,也许其他人会提供更复杂的列表理解方法):看看这个:l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]def item_in_list(item, l):&nbsp; &nbsp; for item2comp in l:&nbsp; &nbsp; &nbsp; &nbsp; if item!=item2comp:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; found = True&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; for part,rhs_part in zip(item, item2comp):&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if part!='' and part!=rhs_part:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; found = False&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; break&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if found:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; return True&nbsp; &nbsp; return False&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;new_arr = []for item in l:&nbsp; &nbsp; if not item_in_list(item, l):&nbsp; &nbsp; &nbsp; &nbsp; new_arr.append(item)print(new_arr)输出:[('A', 'B', 'C', ''), ('A', '', '', 'D')]我认为时间复杂度是 - O((N**2)*M)N - 列表中的元素数量M - 每个元素中的零件数

哈士奇WWW

这些字符串始终位于同一位置,因此我将它们替换为布尔值,以便更轻松地比较它们。首先,我进行排序,然后仅保留与所有其他元素相比,前一个元素在任何地方都为 true 或与后一个元素相同的元素。然后,当比较完成后,我会将其从列表中删除。f = sorted(map(lambda x: list(map(bool, x)), l), key=sum, reverse=True)to_keep = []while len(f) > 1:    if all(map(lambda x, y: True if x == y or x else False, f[0], f[1])):        to_keep.append(len(l) - len(f) + 1)    f = f[1:]print([l[i] for i in to_keep])[('A', 'B', 'C', ''), ('A', '', '', 'D')]速度为 43.7 µs,也是投票最高答案的两倍。

莫回无

L = [('A', 'B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]keys = collections.defaultdict(lambda: collections.defaultdict(set))# maintain a record of tuple-indices that contain each character in each positionfor i,t in enumerate(L):&nbsp; &nbsp; for c,e in enumerate(t):&nbsp; &nbsp; &nbsp; &nbsp; if not e: continue&nbsp; &nbsp; &nbsp; &nbsp; keys[e][c].add(i)delme = set()for i,t in enumerate(L):&nbsp; &nbsp; collocs = set.intersection(*[keys[e][c] for c,e in enumerate(t) if e])&nbsp; &nbsp; if len(collocs)>1:&nbsp; # if all characters appear in this position in >1 index&nbsp; &nbsp; &nbsp; &nbsp; # ignore the collocation with the most non-empty characters&nbsp; &nbsp; &nbsp; &nbsp; # mark the rest for deletion&nbsp; &nbsp; &nbsp; &nbsp; C = max(collocs, key=lambda i: sum(bool(e) for bool in L[i]))&nbsp; &nbsp; &nbsp; &nbsp; for c in collocs:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if c!=C: delme.add(c)filtered = [t for i,t in enumerate(L) if i not in delme]
打开App,查看更多内容
随时随地看视频慕课网APP

相关分类

Python