How to avoid using _siftup or _siftdown in heapq

I have no idea how to solve the following problem efficiently without using _siftup or _siftdown:

How do you restore the heap invariant when one element is out of order?

In other words, update old_value in heap to new_value, and keep the heap working. You can assume there is only one old_value in the heap. The function is defined like:

def update_value_in_heap(heap, old_value, new_value):

Here is my real scenario, read it if you are interested.

  • You can imagine it as a small autocomplete system. I need to count the frequency of words and maintain the top k max-count words, ready for output at any time. So I use a heap here. When one word's count increases, I need to update it if it is in the heap.

  • All the words and counts are stored in the leaves of a trie-tree, and heaps
    are stored in the trie-tree's middle nodes. If you are wondering how to get
    the word from outside the heap, don't worry: I can get it from the trie-tree's leaf nodes.

  • When a user types a word, it will first be read from the heap and then
    updated. For better performance, we can consider decreasing the update frequency by batching updates.
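As an aside, for the plain top-k-by-count part of this scenario the standard library can be used directly (a minimal sketch that ignores the trie; top_k_words is a name I made up for illustration):

```python
from collections import Counter
import heapq

def top_k_words(counts, k):
    """Return the k most frequent (count, word) pairs, largest count first."""
    # heapq.nlargest keeps an internal heap of size k: O(n log k) overall.
    return heapq.nlargest(k, ((c, w) for w, c in counts.items()))

counts = Counter(["the", "cat", "the", "hat", "the", "cat"])
print(top_k_words(counts, 2))  # [(3, 'the'), (2, 'cat')]
```

This rebuilds the top-k from scratch on each call, which is only reasonable when reads are rare; the question below is precisely about avoiding that full rebuild.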

So how do you update the heap when a particular word's count increases?

Here is a simple example of the _siftup or _siftdown version (not my scenario):

>>> from heapq import _siftup, _siftdown, heapify, heappop

>>> data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
>>> heapify(data)
>>> old, new = 8, 22              # increase the 8 to 22
>>> i = data.index(old)
>>> data[i] = new
>>> _siftup(data, i)
>>> [heappop(data) for i in range(len(data))]
[1, 2, 3, 5, 7, 10, 18, 19, 22, 37]

>>> data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
>>> heapify(data)
>>> old, new = 8, 4              # decrease the 8 to 4
>>> i = data.index(old)
>>> data[i] = new
>>> _siftdown(data, 0, i)
>>> [heappop(data) for i in range(len(data))]
[1, 2, 3, 4, 5, 7, 10, 18, 19, 37]

Indexing costs O(n) and updating costs O(log n). heapify is another solution, but it is less efficient than _siftup or _siftdown.
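For reference, the heapify fallback written out with only public heapq calls, reusing the question's update_value_in_heap signature (a sketch):

```python
import heapq

def update_value_in_heap(heap, old_value, new_value):
    # Public-API-only variant: O(n) index lookup plus O(n) heapify repair.
    i = heap.index(old_value)
    heap[i] = new_value
    heapq.heapify(heap)

data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
heapq.heapify(data)
update_value_in_heap(data, 8, 22)      # same update as the example above
result = [heapq.heappop(data) for _ in range(len(data))]
```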


But _siftup and _siftdown are protected members of heapq, so accessing them from outside is not recommended.
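If the protected members are the only concern, the two sift operations are short enough to reimplement yourself (a sketch; sift_to_root and sift_to_leaves are my own names for what heapq internally calls _siftdown and _siftup):

```python
from heapq import heapify, heappop

def sift_to_root(heap, i):
    # Move heap[i] toward the root while it is smaller than its parent.
    while i > 0:
        parent = (i - 1) // 2
        if heap[i] < heap[parent]:
            heap[i], heap[parent] = heap[parent], heap[i]
            i = parent
        else:
            return

def sift_to_leaves(heap, i):
    # Move heap[i] toward the leaves while it is larger than a child.
    n = len(heap)
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < n and heap[left] < heap[smallest]:
            smallest = left
        if right < n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest

def update_value_in_heap(heap, old_value, new_value):
    i = heap.index(old_value)      # the O(n) lookup remains
    heap[i] = new_value
    if new_value < old_value:
        sift_to_root(heap, i)      # value decreased: may need to rise
    else:
        sift_to_leaves(heap, i)    # value increased: may need to sink

data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
heapify(data)
update_value_in_heap(data, 8, 22)
result = [heappop(data) for _ in range(len(data))]
```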


So is there a better and more efficient way to solve this problem? What is the best practice for this situation?


Thanks for reading, I really appreciate your help. :)


慕桂英3389331
2 Answers

慕勒3428872

Use heapify.

One important thing you have to keep in mind is that theoretical complexity and performance are two different things (even though they are related). In other words, implementation matters too. Asymptotic complexities give you lower bounds that you can see as guarantees; for example, an algorithm in O(n) ensures that in the worst case you will execute a number of instructions linear in the input size. There are two important things here:

  • constants are ignored, but constants matter in real life;
  • the worst case depends on the algorithm you consider, not only on the input.

Depending on the topic/problem you consider, the first point can be very important. In some domains, the constants hidden in asymptotic complexities are so big that you can't even build inputs bigger than the constants (or such inputs wouldn't be realistic to consider). That's not the case here, but it's something you always have to keep in mind.

Given these two observations, you can't really say: implementation B is faster than A because A is derived from an O(n) algorithm and B is derived from an O(log n) algorithm. Even if that's a good argument to start with, it's not always sufficient. Theoretical complexities are especially good for comparing algorithms when all inputs are equally likely to happen. In other words, when your algorithms are very generic.

In the case where you know what your use cases and inputs will be, you can simply test for performance. Using both the tests and the asymptotic complexity will give you a good idea of how your algorithm will perform (in both extreme cases and arbitrary practical cases).

That being said, let's run some performance tests on the following class, which implements three different strategies (there are actually four strategies here, but Invalidate and Reinsert doesn't seem right in your case, as you'd invalidate each item as many times as you see a given word). I'll include most of my code so you can double-check that I haven't messed up (you can even check the complete notebook):

from heapq import _siftup, _siftdown, heapify, heappop

class Heap(list):

    def __init__(self, values, sort=False, heap=False):
        super().__init__(values)
        heapify(self)
        self._broken = False
        self.sort = sort
        self.heap = heap or not sort

    # Solution 1) repair using the knowledge we have after every update:
    def update(self, key, value):
        old, self[key] = self[key], value
        if value > old:
            _siftup(self, key)
        else:
            _siftdown(self, 0, key)

    # Solution 2 and 3) repair using sort/heapify in a lazzy way:
    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._broken = True

    def __getitem__(self, key):
        if self._broken:
            self._repair()
            self._broken = False
        return super().__getitem__(key)

    def _repair(self):
        if self.sort:
            self.sort()
        elif self.heap:
            heapify(self)

    # … you'll also need to delegate all other heap functions, for example:
    def pop(self):
        self._repair()
        return heappop(self)

We can first check that all three approaches work:

data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]

heap = Heap(data[:])
heap.update(8, 22)
heap.update(7, 4)
print(heap)

heap = Heap(data[:], sort_fix=True)
heap[8] = 22
heap[7] = 4
print(heap)

heap = Heap(data[:], heap_fix=True)
heap[8] = 22
heap[7] = 4
print(heap)

Then we can run some performance tests using the following functions:

import time
import random

def rand_update(heap, lazzy_fix=False, **kwargs):
    index = random.randint(0, len(heap)-1)
    new_value = random.randint(max_int+1, max_int*2)
    if lazzy_fix:
        heap[index] = new_value
    else:
        heap.update(index, new_value)

def rand_updates(n, heap, lazzy_fix=False, **kwargs):
    for _ in range(n):
        rand_update(heap, lazzy_fix)

def run_perf_test(n, data, **kwargs):
    test_heap = Heap(data[:], **kwargs)
    t0 = time.time()
    rand_updates(n, test_heap, **kwargs)
    test_heap[0]
    return (time.time() - t0)*1e3

results = []
max_int = 500
nb_updates = 1

for i in range(3, 7):
    test_size = 10**i
    test_data = [random.randint(0, max_int) for _ in range(test_size)]

    perf = run_perf_test(nb_updates, test_data)
    results.append((test_size, "update", perf))

    perf = run_perf_test(nb_updates, test_data, lazzy_fix=True, heap_fix=True)
    results.append((test_size, "heapify", perf))

    perf = run_perf_test(nb_updates, test_data, lazzy_fix=True, sort_fix=True)
    results.append((test_size, "sort", perf))

And the results:

import pandas as pd
import seaborn as sns

dtf = pd.DataFrame(results, columns=["heap size", "method", "duration (ms)"])
print(dtf)

sns.lineplot(
    data=dtf,
    x="heap size",
    y="duration (ms)",
    hue="method",
)

From these tests we can see that heapify seems to be the most reasonable choice: it has a decent worst-case complexity, O(n), and performs better in practice. On the other hand, it's probably a good idea to investigate the other options (like having a data structure dedicated to that particular problem; for example, using bins to drop words into, then moving them from one bin to the next looks like a possible track to investigate).

Important remark: this scenario (update vs. read ratio of 1:1) is unfavorable to both the heapify and sort solutions. So if you manage to have a k:1 ratio, this conclusion will be even clearer (you can replace nb_updates = 1 with nb_updates = k in the code above).

Dataframe details:

    heap size   method  duration in ms
0        1000   update        0.435114
1        1000  heapify        0.073195
2        1000     sort        0.101089
3       10000   update        1.668930
4       10000  heapify        0.480175
5       10000     sort        1.151085
6      100000   update       13.194084
7      100000  heapify        4.875898
8      100000     sort       11.922121
9     1000000   update      153.587103
10    1000000  heapify       51.237106
11    1000000     sort      145.306110
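For completeness, the "Invalidate and Reinsert" strategy set aside above is usually implemented with lazy deletion, along the lines of the priority-queue recipe in the heapq documentation (a sketch; entry_finder, REMOVED, and the tiebreak counter follow that recipe's conventions):

```python
import heapq
import itertools

heap = []                      # entries: [value, tiebreak, key]
entry_finder = {}              # key -> live entry
REMOVED = object()             # sentinel marking an invalidated entry
counter = itertools.count()    # tiebreaker so keys are never compared

def update(key, value):
    """Invalidate any old entry for key and push a fresh one: O(log n)."""
    if key in entry_finder:
        entry_finder[key][2] = REMOVED
    entry = [value, next(counter), key]
    entry_finder[key] = entry
    heapq.heappush(heap, entry)

def pop():
    """Pop the smallest live entry, skipping invalidated ones."""
    while heap:
        value, _, key = heapq.heappop(heap)
        if key is not REMOVED:
            del entry_finder[key]
            return value, key
    raise KeyError("pop from an empty queue")

update("cat", 5)
update("dog", 3)
update("cat", 1)               # cat's old entry is lazily invalidated
first = pop()                  # (1, 'cat')
```

The cost is that dead entries linger in the heap until popped, which is exactly why the answer rules it out for a counter that is updated on every word occurrence.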

RISEBY

The code snippets he provided are completely broken! They are also hard to read. _siftup() is called n//2 times by heapify(), so it can't be faster than _siftup() itself.

To answer the original question: there is no better way. If you are worried about the methods being private, create your own that do the same thing.

The only point I agree with is that if you don't need to read from the heap for a long time, it might be beneficial to lazily heapify() it once you do need it. The question is whether you should be using a heap for that at all, then.

Let's go over the problems with his snippet. The heapify() function gets called more than once during his "update" run. The chain of mistakes that leads to this is as follows:

  • He passes heap_fix, but expects heap (and the same goes for sort)
  • Since self.sort is always False, self.heap is always True
  • He redefines __getitem__() and __setitem__(), which are called every time _siftup() or _siftdown() assigns or reads something (note: these two are not implemented in C, so they do use __getitem__() and __setitem__())
  • Since self.heap is True and __getitem__() and __setitem__() are being called, the _repair() function gets called every time _siftup() or _siftdown() swaps elements. But the call to heapify() is done in C, so __getitem__() doesn't get called and it doesn't end in an infinite loop
  • He redefines self.sort, so calling it, as he attempts to do, would fail
  • He reads once, but updates an item nb_updates times, not the 1:1 ratio he claims

I fixed the example. I verified it as much as I could, but we all make mistakes. Feel free to check it yourself.

Code:

import time
import random
from heapq import _siftup, _siftdown, heapify, heappop

class UpdateHeap(list):
    def __init__(self, values):
        super().__init__(values)
        heapify(self)

    def update(self, index, value):
        old, self[index] = self[index], value
        if value > old:
            _siftup(self, index)
        else:
            _siftdown(self, 0, index)

    def pop(self):
        return heappop(self)

class SlowHeap(list):
    def __init__(self, values):
        super().__init__(values)
        heapify(self)
        self._broken = False

    # Solution 2 and 3) repair using sort/heapify in a lazy way:
    def update(self, index, value):
        super().__setitem__(index, value)
        self._broken = True

    def __getitem__(self, index):
        if self._broken:
            self._repair()
            self._broken = False
        return super().__getitem__(index)

    def _repair(self):
        ...

    def pop(self):
        if self._broken:
            self._repair()
        return heappop(self)

class HeapifyHeap(SlowHeap):
    def _repair(self):
        heapify(self)

class SortHeap(SlowHeap):
    def _repair(self):
        self.sort()

def rand_update(heap):
    index = random.randint(0, len(heap)-1)
    new_value = random.randint(max_int+1, max_int*2)
    heap.update(index, new_value)

def rand_updates(update_count, heap):
    for i in range(update_count):
        rand_update(heap)
        heap[0]

def verify(heap):
    last = None
    while heap:
        item = heap.pop()
        if last is not None and item < last:
            raise RuntimeError(f"{item} was smaller than last {last}")
        last = item

def run_perf_test(update_count, data, heap_class):
    test_heap = heap_class(data)
    t0 = time.time()
    rand_updates(update_count, test_heap)
    perf = (time.time() - t0)*1e3
    verify(test_heap)
    return perf

results = []
max_int = 500
update_count = 100

for i in range(2, 7):
    test_size = 10**i
    test_data = [random.randint(0, max_int) for _ in range(test_size)]

    perf = run_perf_test(update_count, test_data, UpdateHeap)
    results.append((test_size, "update", perf))

    perf = run_perf_test(update_count, test_data, HeapifyHeap)
    results.append((test_size, "heapify", perf))

    perf = run_perf_test(update_count, test_data, SortHeap)
    results.append((test_size, "sort", perf))

import pandas as pd
import seaborn as sns

dtf = pd.DataFrame(results, columns=["heap size", "method", "duration (ms)"])
print(dtf)

sns.lineplot(
    data=dtf,
    x="heap size",
    y="duration (ms)",
    hue="method",
)

Results:

As you can see, the "update" method using _siftdown() and _siftup() is asymptotically faster.

You should know what your code does and how long it takes to run. When in doubt, you should check it. @cglaced checked how long the execution takes, but he didn't question how long it should take. Had he done that, he'd have found that the two don't match. And others fell for it.

    heap size   method  duration (ms)
0         100   update       0.219107
1         100  heapify       0.412703
2         100     sort       0.242710
3        1000   update       0.198841
4        1000  heapify       2.947330
5        1000     sort       0.605345
6       10000   update       0.203848
7       10000  heapify      32.759190
8       10000     sort       4.621506
9      100000   update       0.348568
10     100000  heapify     327.646971
11     100000     sort      49.481153
12    1000000   update       0.256062
13    1000000  heapify    3475.244761
14    1000000     sort    1106.570005
