
A fast way to compute MD5 hashes of a NumPy array

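I am working with one-dimensional NumPy arrays of thousands of uint64 numbers in Python 2.7. What is the fastest way to compute the MD5 of each number individually?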

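Each number has to be converted to a string before calling the md5 function. I have read in many places that iterating over a NumPy array and doing the work in pure Python is very slow. Is there any way to avoid that?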
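
For reference, the straightforward approach described above might look like the following sketch. It is written for Python 3 rather than 2.7, the helper name `md5_of_decimal_strings` is hypothetical, and a plain list stands in for the NumPy array so that the snippet runs without extra dependencies.

```python
import hashlib

def md5_of_decimal_strings(values):
    """Return the MD5 digest of each value's decimal string form."""
    # One Python-level loop iteration per number: simple, but this is the
    # per-element overhead the question wants to avoid.
    return [hashlib.md5(str(int(v)).encode("ascii")).digest() for v in values]

digests = md5_of_decimal_strings([0, 1, 2**64 - 1])
print([d.hex() for d in digests])
```

Note that this hashes the decimal text of each number, so the digests differ from those obtained by hashing the raw 8-byte representation.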


慕村225694
3 answers

温温酱

You can write a wrapper around OpenSSL's MD5() function that accepts NumPy arrays. Our baseline will be a pure-Python implementation.

Create the wrapper using cffi:

    import cffi

    ffi = cffi.FFI()

    header = r"""
    void md5_array(uint64_t* buffer, int len, unsigned char* out);
    """

    source = r"""
    #include <stdint.h>
    #include <openssl/md5.h>

    void md5_array(uint64_t * buffer, int len, unsigned char * out) {
        int i = 0;
        for(i=0; i<len; i++) {
            /* hash the 8 raw bytes of each value; each digest is 16 bytes */
            MD5((const unsigned char *) &buffer[i], 8, out + i*16);
        }
    }
    """

    # MD5() is provided by OpenSSL's libcrypto, so link against 'crypto'
    ffi.set_source("_md5", source, libraries=['crypto'])
    ffi.cdef(header)

    if __name__ == "__main__":
        ffi.compile()

and

    import numpy as np
    import _md5

    def md5_array(data):
        # data is expected to be a C-contiguous uint64 array
        out = np.zeros(data.shape, dtype='|S16')
        _md5.lib.md5_array(
            # from_buffer() yields a char[] view, so cast it to the declared pointer types
            _md5.ffi.cast("uint64_t *", _md5.ffi.from_buffer(data)),
            data.size,
            _md5.ffi.cast("unsigned char *", _md5.ffi.from_buffer(out))
        )
        return out

Then compare the two:

    import numpy as np
    import hashlib

    data = np.arange(16, dtype=np.uint64)
    out = [hashlib.md5(i).digest() for i in data]
    print(data)
    # [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
    print(out)
    # [b'}\xea6+?\xac\x8e\x00\x95jIR\xa3\xd4\xf4t', ... , b'w)\r\xf2^\x84\x11w\xbb\xa1\x94\xc1\x8c8XS']

    out = md5_array(data)
    print(out)
    # [b'}\xea6+?\xac\x8e\x00\x95jIR\xa3\xd4\xf4t', ... , b'w)\r\xf2^\x84\x11w\xbb\xa1\x94\xc1\x8c8XS']

For large arrays it is roughly 14 times faster, 169 ms versus 12.1 ms (honestly, I am a little disappointed by that...):

    data = np.arange(100000, dtype=np.uint64)

    %timeit [hashlib.md5(i).digest() for i in data]
    169 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %timeit md5_array(data)
    12.1 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
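
If compiling an extension is not an option, a variation on the pure-Python baseline is to slice the array's raw bytes instead of iterating over NumPy scalars. The sketch below is only an illustration: `md5_each_item` is a hypothetical helper, no timings were measured for it, and the demo uses the standard library's `array` type so that it runs without NumPy. A C-contiguous uint64 NumPy array exposes the same buffer protocol and should work the same way.

```python
import hashlib
from array import array

def md5_each_item(buf, itemsize=8):
    """MD5-digest each `itemsize`-byte item of a contiguous buffer."""
    raw = memoryview(buf).cast("B")  # flat byte view of the buffer, no copy
    if len(raw) % itemsize:
        raise ValueError("buffer length is not a multiple of itemsize")
    # Each slice covers the raw bytes of one item, in native byte order.
    return [hashlib.md5(raw[i:i + itemsize]).digest()
            for i in range(0, len(raw), itemsize)]

data = array("Q", range(16))  # stand-in for np.arange(16, dtype=np.uint64)
digests = md5_each_item(data, data.itemsize)
print(len(digests), digests[0].hex())
```

Like the C wrapper, this hashes the 8 raw bytes of every value, so the results depend on the machine's byte order.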

倚天杖

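I would definitely recommend avoiding the conversion of uint64 to a string. You can use struct to get the binary data, which can then be fed to hashlib.md5():

    >>> import struct, hashlib
    >>> a = struct.pack('<Q', 0x423423423423)
    >>> a
    '#4B#4B\x00\x00'
    >>> hashlib.md5(a).hexdigest()
    'de0fc624a1b287881eee581ed83500d1'
    >>>

Because there is no conversion, only a simple byte copy, this should speed things up. In addition, hexdigest() can be replaced with digest(), which returns the binary data and is faster than converting it to a hexadecimal string. Depending on how you plan to use the data later, this may be a good approach.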
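
The same idea can be wrapped in a small function. This is a minimal sketch for Python 3, assuming little-endian packing as in the session above; the name `md5_of_uint64` and its `hex_output` parameter are hypothetical.

```python
import hashlib
import struct

def md5_of_uint64(value, hex_output=False):
    """Hash the 8-byte little-endian encoding of an unsigned 64-bit integer."""
    packed = struct.pack("<Q", value)  # 8 raw bytes, no text conversion
    h = hashlib.md5(packed)
    # digest() returns 16 raw bytes; hexdigest() returns 32 hex characters
    return h.hexdigest() if hex_output else h.digest()

value = 0x423423423423
packed = struct.pack("<Q", value)
raw = md5_of_uint64(value)
text = md5_of_uint64(value, hex_output=True)
print(packed, len(raw), len(text))
```

Keeping the binary `digest()` form halves the size of each result compared with the hex string, which may matter when hashing many values.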

慕无忌1623718

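    >>> import hashlib
    >>> import numpy as np
    >>> arr = np.array([1, 2, 3, 4, 5], dtype="uint64")
    >>> # note: astype("uint8") keeps only the low 8 bits of each value,
    >>> # and this produces a single digest for the whole array
    >>> m = hashlib.md5(arr.astype("uint8"))
    >>> m.hexdigest()
    '7cfdd07889b3295d6a550914ab35e068'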