猿问

使用 python 在大 .txt 中进行二进制搜索(按哈希排序)

我想搜索一个非常大的文本文件,其中SHA1哈希使用 Python 按哈希值排序。文本文件有10GB和500 000 000行。每行如下所示:


000009F0DA8DA60DFCC93B39F3DD51A088ED3FD9:27

我由此比较文件中是否出现给定的哈希值。我用BinarySearch试过,但它只适用于一个小测试文件。如果我使用该文件进行10GB搜索需要太长时间,并且该过程有时会因为16GB RAM超出而被终止。


f=open('testfile.txt', 'r')

text=f.readlines()

data=text

#print data

x = '000009F0DA8DA60DFCC93B39F3DD51A088ED3FD9:27'

def binarySearch(data, l, r, x):


  while l <= r:

    mid = l + (r - l)/2;

    # Check if x is present at mid

    if data[mid] == x:

        return mid

    # If x is greater, ignore left half

    elif data[mid] < x:

        l = mid + 1

        #print l

    # If x is smaller, ignore right half

    else:

        r = mid - 1

        #print r

# If we reach here, then the element

# was not present

  return -1



result = binarySearch(data,0, len(data)-1, x)

if result != -1:

   print "Element is present at index % d" % result

else:

   print "Element is not present in array"

有没有办法将10GB文本文件一次加载到 RAM 中并一遍又一遍地访问它?我有16GB RAM可用的。应该够了吧?我还能做些什么来加快搜索速度吗?不幸的是,我不再知道了。



千万里不及你
浏览 103回答 1
1回答

GCT1015

将您的示例输入input.txt如下000000005AD76BD555C1D6D771DE417A4B87E4B4:400000000A8DAE4228F821FB418F59826079BF368:300000000DD7F2A1C68A35673713783CA390C9E93:63000000001E225B908BAC31C56DB04D892E47536E0:500000006BAB7FC3113AA73DE3589630FC08218E7:200000008CD1806EB7B9B46A8F87690B2AC16F617:40000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:150000000A1D4B746FAA3FD526FF6D5BC8052FDB38:160000000CAEF405439D57847A8657218C618160B2:150000000FC1C08E6454BED24F463EA2129E254D43:40并删除计数,使您的文件变为(in.txt如下):000000005AD76BD555C1D6D771DE417A4B87E4B400000000A8DAE4228F821FB418F59826079BF36800000000DD7F2A1C68A35673713783CA390C9E9300000001E225B908BAC31C56DB04D892E47536E000000006BAB7FC3113AA73DE3589630FC08218E700000008CD1806EB7B9B46A8F87690B2AC16F6170000000A0E3B9F25FF41DE4B5AC238C2D545C7A80000000A1D4B746FAA3FD526FF6D5BC8052FDB380000000CAEF405439D57847A8657218C618160B20000000FC1C08E6454BED24F463EA2129E254D43这将确保每个条目都有固定的大小。现在您可以使用基于 mmap 的文件读取方法,如https://docs.python.org/3/library/mmap.htmlimport mmapimport osFIELD_SIZE=40+1&nbsp; # also include newline separatordef binarySearch(mm, l, r, x):&nbsp; &nbsp; while l <= r:&nbsp; &nbsp; &nbsp; &nbsp; mid = int(l + (r - l)/2);&nbsp; &nbsp; &nbsp; &nbsp; # Check if x is present at mid&nbsp; &nbsp; &nbsp; &nbsp; mid_slice = mm[mid*FIELD_SIZE:(mid+1)*FIELD_SIZE]&nbsp; &nbsp; &nbsp; &nbsp; mid_slice = mid_slice.decode('utf-8').strip()&nbsp; &nbsp; &nbsp; &nbsp; # print(mid_slice)&nbsp; &nbsp; &nbsp; &nbsp; if mid_slice == x:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; return mid&nbsp; &nbsp; &nbsp; &nbsp; # If x is greater, ignore left half&nbsp; &nbsp; &nbsp; &nbsp; elif mid_slice < x:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; l = mid + 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; #print l&nbsp; &nbsp; &nbsp; &nbsp; # If x is smaller, ignore right half&nbsp; &nbsp; &nbsp; &nbsp; else:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; r = mid - 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; #print r&nbsp; &nbsp; # If we reach here, then the element&nbsp; &nbsp; # was not present&nbsp; &nbsp; return -1# text=f.readlines()# data=text#print datax = '0000000CAEF405439D57847A8657218C618160B2'with open('in.txt', 'r+b') as f:&nbsp; &nbsp; mm = mmap.mmap(f.fileno(), 0)&nbsp; &nbsp; f.seek(0, os.SEEK_END)&nbsp; &nbsp; size = f.tell()&nbsp; &nbsp; result = binarySearch(mm, 0, size/FIELD_SIZE, x)&nbsp; &nbsp; if result != -1:&nbsp; &nbsp; &nbsp; &nbsp; print("Element is present at index % d" % result)&nbsp; &nbsp; else:&nbsp; &nbsp; &nbsp; &nbsp; print("Element is not present in array")输出:$ python3 find.py&nbsp;Element is present at index&nbsp; 8由于文件没有完全在内存中读取,因此不会出现内存不足的错误。
随时随地看视频慕课网APP

相关分类

Python
我要回答