简单的Python挑战:数据缓冲区上最快的按位异或

挑战:


对两个相等大小的缓冲区执行按位XOR。缓冲区将要求是python str类型,因为传统上这是python中数据缓冲区的类型。将结果值作为返回str。尽快执行此操作。


输入是两个1兆字节(2 ** 20字节)字符串。


目前的挑战是基本使用python的或现有的第三方Python模块打我的低效算法(放宽的规定:或者创建自己的模块)的边际增加是无用的。


from os import urandom

from numpy import frombuffer,bitwise_xor,byte


def slow_xor(aa,bb):

    a=frombuffer(aa,dtype=byte)

    b=frombuffer(bb,dtype=byte)

    c=bitwise_xor(a,b)

    r=c.tostring()

    return r


aa=urandom(2**20)

bb=urandom(2**20)


def test_it():

    for x in xrange(1000):

        slow_xor(aa,bb)


慕莱坞森
浏览 753回答 3
3回答

小怪兽爱吃肉

第一次尝试使用scipy.weave和SSE2内在函数会带来一点改进。第一次调用要慢一些,因为需要从磁盘加载代码并进行缓存,随后的调用会更快:import numpyimport timefrom os import urandomfrom scipy import weaveSIZE = 2**20def faster_slow_xor(aa,bb):&nbsp; &nbsp; b = numpy.fromstring(bb, dtype=numpy.uint64)&nbsp; &nbsp; numpy.bitwise_xor(numpy.frombuffer(aa,dtype=numpy.uint64), b, b)&nbsp; &nbsp; return b.tostring()code = """const __m128i* pa = (__m128i*)a;const __m128i* pend = (__m128i*)(a + arr_size);__m128i* pb = (__m128i*)b;__m128i xmm1, xmm2;while (pa < pend) {&nbsp; xmm1 = _mm_loadu_si128(pa); // must use unaligned access&nbsp;&nbsp; xmm2 = _mm_load_si128(pb); // numpy will align at 16 byte boundaries&nbsp; _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));&nbsp; ++pa;&nbsp; ++pb;}"""def inline_xor(aa, bb):&nbsp; &nbsp; a = numpy.frombuffer(aa, dtype=numpy.uint64)&nbsp; &nbsp; b = numpy.fromstring(bb, dtype=numpy.uint64)&nbsp; &nbsp; arr_size = a.shape[0]&nbsp; &nbsp; weave.inline(code, ["a", "b", "arr_size"], headers = ['"emmintrin.h"'])&nbsp; &nbsp; return b.tostring()第二次尝试考虑到注释,我重新检查了代码,以查明是否可以避免复制。原来我读错了字符串对象的文档,所以这是我的第二次尝试:support = """#define ALIGNMENT 16static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {&nbsp; &nbsp; const char* end = in1 + n;&nbsp; &nbsp; while (in1 < end) {&nbsp; &nbsp; &nbsp; &nbsp;*out = *in1 ^ *in2;&nbsp; &nbsp; &nbsp; &nbsp;++in1;&nbsp;&nbsp; &nbsp; &nbsp; &nbsp;++in2;&nbsp; &nbsp; &nbsp; &nbsp;++out;&nbsp; &nbsp; }}"""code2 = """PyObject* res = PyString_FromStringAndSize(NULL, real_size);const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);const __m128i* pa = (__m128i*)((char*)a + head);const __m128i* pend = (__m128i*)((char*)a + real_size - tail);const __m128i* pb = (__m128i*)((char*)b + head);__m128i xmm1, xmm2;__m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);while (pa < pend) {&nbsp; &nbsp; xmm1 = _mm_loadu_si128(pa);&nbsp; &nbsp; xmm2 = _mm_loadu_si128(pb);&nbsp; &nbsp; _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));&nbsp; &nbsp; ++pa;&nbsp; &nbsp; ++pb;&nbsp; &nbsp; ++pc;}memxor((const char*)pa, (const char*)pb, (char*)pc, tail);return_val = res;Py_DECREF(res);"""def inline_xor_nocopy(aa, bb):&nbsp; &nbsp; real_size = len(aa)&nbsp; &nbsp; a = numpy.frombuffer(aa, dtype=numpy.uint64)&nbsp; &nbsp; b = numpy.frombuffer(bb, dtype=numpy.uint64)&nbsp; &nbsp; return weave.inline(code2, ["a", "b", "real_size"],&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; headers = ['"emmintrin.h"'],&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; support_code = support)区别在于字符串是在C代码内部分配的。不可能按照SSE2指令的要求将其对齐在16字节的边界上,因此使用字节访问来复制开头和结尾未对齐的内存区域。无论如何,使用numpy数组传递输入数据,因为weave坚持将Python str对象复制到std::strings。frombuffer不会复制,因此很好,但是内存未对齐16字节,因此我们需要使用_mm_loadu_si128而不是更快的_mm_load_si128。而不是使用_mm_store_si128,而是使用_mm_stream_si128,它可以确保所有写入均尽快流式传输到主存储器中-这样,输出数组不会耗尽宝贵的缓存行。时机至于时间安排,slow_xor第一次编辑中的条目指的是我的改进版本(内联按位xor,uint64),我消除了这种困惑。slow_xor指的是来自原始问题的代码。所有计时都完成了1000次运行。slow_xor:1.85秒(1倍)faster_slow_xor:1.25s(1.48x)inline_xor:0.95秒(1.95倍)inline_xor_nocopy:0.32s(5.78x)该代码是使用gcc 4.4.3编译的,我已经验证了编译器实际上使用了SSE指令。

ibeautiful

这是我的cython结果slow_xor&nbsp; &nbsp;0.456888198853faster_xor 0.400228977203cython_xor 0.232881069183cython_xor_vectorised 0.171468019485在cython中进行矢量化可将计算机上的for循环节省25%左右的时间,但是花了一半以上的时间来构建python字符串(该return语句)-我认为(合法地)避免多余的拷贝是可能的,因为数组可能包含空字节。非法的方法是传入Python字符串并对其进行适当的突变,这会使函数的速度加倍。xor.pyfrom time import timefrom os import urandomfrom numpy import frombuffer,bitwise_xor,byte,uint64import pyximport; pyximport.install()import xor_def slow_xor(aa,bb):&nbsp; &nbsp; a=frombuffer(aa,dtype=byte)&nbsp; &nbsp; b=frombuffer(bb,dtype=byte)&nbsp; &nbsp; c=bitwise_xor(a,b)&nbsp; &nbsp; r=c.tostring()&nbsp; &nbsp; return rdef faster_xor(aa,bb):&nbsp; &nbsp; a=frombuffer(aa,dtype=uint64)&nbsp; &nbsp; b=frombuffer(bb,dtype=uint64)&nbsp; &nbsp; c=bitwise_xor(a,b)&nbsp; &nbsp; r=c.tostring()&nbsp; &nbsp; return raa=urandom(2**20)bb=urandom(2**20)def test_it():&nbsp; &nbsp; t=time()&nbsp; &nbsp; for x in xrange(100):&nbsp; &nbsp; &nbsp; &nbsp; slow_xor(aa,bb)&nbsp; &nbsp; print "slow_xor&nbsp; ",time()-t&nbsp; &nbsp; t=time()&nbsp; &nbsp; for x in xrange(100):&nbsp; &nbsp; &nbsp; &nbsp; faster_xor(aa,bb)&nbsp; &nbsp; print "faster_xor",time()-t&nbsp; &nbsp; t=time()&nbsp; &nbsp; for x in xrange(100):&nbsp; &nbsp; &nbsp; &nbsp; xor_.cython_xor(aa,bb)&nbsp; &nbsp; print "cython_xor",time()-t&nbsp; &nbsp; t=time()&nbsp; &nbsp; for x in xrange(100):&nbsp; &nbsp; &nbsp; &nbsp; xor_.cython_xor_vectorised(aa,bb)&nbsp; &nbsp; print "cython_xor_vectorised",time()-tif __name__=="__main__":&nbsp; &nbsp; test_it()xor_.pyxcdef char c[1048576]def cython_xor(char *a,char *b):&nbsp; &nbsp; cdef int i&nbsp; &nbsp; for i in range(1048576):&nbsp; &nbsp; &nbsp; &nbsp; c[i]=a[i]^b[i]&nbsp; &nbsp; return c[:1048576]def cython_xor_vectorised(char *a,char *b):&nbsp; &nbsp; cdef int i&nbsp; &nbsp; for i in range(131094):&nbsp; &nbsp; &nbsp; &nbsp; (<unsigned long long *>c)[i]=(<unsigned long long *>a)[i]^(<unsigned long long *>b)[i]&nbsp; &nbsp; return c[:1048576]
打开App,查看更多内容
随时随地看视频慕课网APP