如何使用 numba 在 GPU 上推广快速矩阵乘法

可以说该发布的代码中至少有两个错误：这不可能是正确的范围检查：if x >= C.shape[0] and y >= C.shape[1]:为了让我们决定网格中的特定线程不执行任何加载活动，我们要求要么超出x范围，要么y超出范围。应该and是一个or.如果块中的所有线程都无法参与该语句，则在条件代码中使用是非法的。cuda.syncthreads()上面第 1 项中的前一个return语句（即使从andto更正or）几乎保证了对于问题大小不能被线程块大小整除的这种非法行为。因此，要解决这些问题，我们不能仅使用简单的return语句来处理越界线程。相反，在加载点，如果计算出的全局加载索引（forA或B）在边界内（根据定义，共享索引在边界内），我们必须只允许线程从全局加载到共享内存。此外，在写入结果时，我们必须只写入在的范围内的计算结果C。以下代码修复了这些项目。对于您给定的测试用例，它似乎可以正常工作：$ cat t49.pyfrom numba import cuda, float32import numpy as npimport math@cuda.jitdef fast_matmul(A, B, C): # Define an array in the shared memory # The size and type of the arrays must be known at compile time sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32) sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32) x, y = cuda.grid(2) tx = cuda.threadIdx.x ty = cuda.threadIdx.y bpg = cuda.gridDim.x # blocks per grid # Each thread computes one element in the result matrix. # The dot product is chunked into dot products of TPB-long vectors. tmp = float32(0.) for i in range(bpg): # Preload data into shared memory sA[tx, ty] = 0 sB[tx, ty] = 0 if x < A.shape[0] and (ty+i*TPB) < A.shape[1]: sA[tx, ty] = A[x, ty + i * TPB] if y < B.shape[1] and (tx+i*TPB) < B.shape[0]: sB[tx, ty] = B[tx + i * TPB, y] # Wait until all threads finish preloading cuda.syncthreads() # Computes partial product on the shared memory for j in range(TPB): tmp += sA[tx, j] * sB[j, ty] # Wait until all threads finish computing cuda.syncthreads() if x < C.shape[0] and y < C.shape[1]: C[x, y] = tmp#%%x_h = np.arange(16).reshape([4,4])y_h = np.ones([4,4])z_h = np.zeros([4,4])x_d = cuda.to_device(x_h)y_d = cuda.to_device(y_h)z_d = cuda.to_device(z_h)TPB = 3threadsperblock = (TPB, TPB)blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])blockspergrid = (blockspergrid_x, blockspergrid_y)fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)z_h = z_d.copy_to_host()print(z_h)print(x_h@y_h)$ cuda-memcheck python t49.py========= CUDA-MEMCHECK[[ 6. 6. 6. 6.] [22. 22. 22. 22.] [38. 38. 38. 38.] [54. 54. 54. 54.]][[ 6. 6. 6. 6.] [22. 22. 22. 22.] [38. 38. 38. 38.] [54. 54. 54. 54.]]========= ERROR SUMMARY: 0 errors$（请注意，在边界测试中使用and此处是正确的。与测试一组索引是否越界相比，测试一组索引是否为入界在布尔意义上是不同的。在入界测试中，我们要求两者都在界内。在出界测试中，任何一个索引越界都是不合格的）。我并不是说上面的代码没有缺陷或适合任何特定目的。它旨在演示我发现的问题的可能修复方法。正如您所发现的，让共享内存平铺矩阵乘法在每个可以想象的配置中工作并非易事，并且除了此处显示的内容之外，我还没有对它进行测试。（例如，如果您决定使 TPB 大于 32，您会遇到其他问题。此外，原始发布的代码仅用于方阵乘法，这在一般非方阵情况下不起作用。）如上所述，发布的代码和上面带有“修复”的代码将无法正确处理一般的非方形情况。我相信一些简单的修改将使我们能够处理非方形的情况。简而言之，我们必须将网格设置得足够大，以处理两个输入矩阵的维度，同时仍然只写入输出矩阵的界内值的结果。这是一个经过简单测试的示例：$ cat t49.pyfrom numba import cuda, float32import numpy as npimport math@cuda.jitdef fast_matmul(A, B, C): # Define an array in the shared memory # The size and type of the arrays must be known at compile time sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32) sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32) x, y = cuda.grid(2) tx = cuda.threadIdx.x ty = cuda.threadIdx.y bpg = cuda.gridDim.x # blocks per grid # Each thread computes one element in the result matrix. # The dot product is chunked into dot products of TPB-long vectors. tmp = float32(0.) for i in range(bpg): # Preload data into shared memory sA[ty, tx] = 0 sB[ty, tx] = 0 if y < A.shape[0] and (tx+i*TPB) < A.shape[1]: sA[ty, tx] = A[y, tx + i * TPB] if x < B.shape[1] and (ty+i*TPB) < B.shape[0]: sB[ty, tx] = B[ty + i * TPB, x] # Wait until all threads finish preloading cuda.syncthreads() # Computes partial product on the shared memory for j in range(TPB): tmp += sA[ty, j] * sB[j, tx] # Wait until all threads finish computing cuda.syncthreads() if y < C.shape[0] and x < C.shape[1]: C[y, x] = tmp#%%x_h = np.arange(115).reshape([5,23])y_h = np.ones([23,7])z_h = np.zeros([5,7])x_d = cuda.to_device(x_h)y_d = cuda.to_device(y_h)z_d = cuda.to_device(z_h)#TPB must be an integer between 1 and 32TPB = 32threadsperblock = (TPB, TPB)grid_y_max = max(x_h.shape[0],y_h.shape[0])grid_x_max = max(x_h.shape[1],y_h.shape[1])blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])blockspergrid = (blockspergrid_x, blockspergrid_y)fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)z_h = z_d.copy_to_host()print(z_h)print(x_h@y_h)$ cuda-memcheck python t49.py========= CUDA-MEMCHECK[[ 253. 253. 253. 253. 253. 253. 253.] [ 782. 782. 782. 782. 782. 782. 782.] [1311. 1311. 1311. 1311. 1311. 1311. 1311.] [1840. 1840. 1840. 1840. 1840. 1840. 1840.] [2369. 2369. 2369. 2369. 2369. 2369. 2369.]][[ 253. 253. 253. 253. 253. 253. 253.] [ 782. 782. 782. 782. 782. 782. 782.] [1311. 1311. 1311. 1311. 1311. 1311. 1311.] [1840. 1840. 1840. 1840. 1840. 1840. 1840.] [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]========= ERROR SUMMARY: 0 errors$x我还重新排序了and的含义（以及andy的用法）以修复上述代码中的性能问题。原始发布的文档代码中也存在相同的性能问题。txty再次强调，不声称无缺陷。此外，我确信可以得出“更优化”的代码。然而，优化矩阵乘法是一项应该很快就会导致使用库实现的练习。此处使用cupyGPU 方法应该是一种相当简单的方法，可以在 GPU 上利用高质量的矩阵乘法例程。

如何使用 numba 在 GPU 上推广快速矩阵乘法

1回答