Parallel Programming With CUDA Tutorial (Part-2: Basics)

Sequential Version:

void add(int N, float *X, float *Y)
{
    for (int i = 0; i < N; i++)
    {
        Y[i] = X[i] + Y[i];
    }
}

Parallel Version:

Algorithm:

  • We will take T threads.
  • Thread number t, where 0≤t<T, works on the indices t+aT for every integer a≥0 with t+aT<N.
  • An example will make this clear. Suppose we take 32 threads, so T=32.
  • Thread number 0 works on indices 0+0*32, 0+1*32, 0+2*32, … i.e. 0, 32, 64, 96, 128, … and so on while t+aT<N.
  • Thread number 1 works on indices 1+0*32, 1+1*32, 1+2*32, … i.e. 1, 33, 65, 97, 129, … and so on while t+aT<N.
  • This way the array is divided equally among the threads.
ADD(N, X, Y):
    t = current thread number
    T = total number of threads
    i = t
    while i < N:
        Y[i] = X[i] + Y[i]
        i = i + T

Organization of Threads:

Grids of Thread Blocks: CUDA organizes threads into blocks, and blocks into a grid.
Size Specification: grid and block sizes are specified as dim3 values, with up to three dimensions each.

Implementation:

  1. How to tell the GPU that a function is a kernel? Declare it with the __global__ qualifier, as in the code below. Inside a kernel, CUDA provides several built-in variables:
  • threadIdx : Index of the current thread within its block.
  • blockIdx : Index of the current block within the grid.
  • blockDim : Size of each dimension of the current block.
  • gridDim : Size of each dimension of the current grid.
__global__
void add(int n, float *x, float *y)
{
    int t = threadIdx.x;
    int T = blockDim.x;
    for (int i = t; i < n; i += T)
        y[i] = x[i] + y[i];
}
  • cudaMallocManaged(): Allocates Unified Memory that is accessible from both the CPU and the GPU.
  • cudaDeviceSynchronize(): Waits for the GPU to finish executing the kernel before the CPU continues.
  • add<<<dim3(1,1,1), dim3(T,1,1)>>>(N, X, Y): Launches the add kernel. The first dim3(x,y,z) specifies the dimensions of the grid and the second specifies the dimensions of the block. Since we use a grid with a single block and a one-dimensional block of T threads, we write <<<dim3(1,1,1), dim3(T,1,1)>>>.
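Putting these pieces together, a complete host program might look like the sketch below (compile with nvcc; the values of N and T are illustrative choices, and error checking is omitted for brevity):

```cuda
#include <cmath>
#include <cstdio>

__global__
void add(int n, float *x, float *y)
{
    int t = threadIdx.x;
    int T = blockDim.x;
    for (int i = t; i < n; i += T)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;              // 1M elements (illustrative)
    float *X, *Y;

    // Unified Memory: accessible from both CPU and GPU.
    cudaMallocManaged(&X, N * sizeof(float));
    cudaMallocManaged(&Y, N * sizeof(float));

    for (int i = 0; i < N; i++) { X[i] = 1.0f; Y[i] = 2.0f; }

    // One block of T threads.
    int T = 1024;
    add<<<dim3(1, 1, 1), dim3(T, 1, 1)>>>(N, X, Y);

    // Wait for the GPU to finish before the CPU reads the results.
    cudaDeviceSynchronize();

    // Every element should be 1.0f + 2.0f = 3.0f.
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmaxf(maxError, fabsf(Y[i] - 3.0f));
    printf("Max error: %f\n", maxError);

    cudaFree(X);
    cudaFree(Y);
    return 0;
}
```
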

Performance Analysis:

__global__
void add(int N, float *X, float *Y)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int T = blockDim.x * gridDim.x;
    for (int i = t; i < N; i += T)
        Y[i] = X[i] + Y[i];
}
  • With 1 block of 1024 threads, the run time is 46.54 ms.
  • With 2 blocks each with 1024 threads, it becomes 27.8 ms.
  • With 4 blocks each with 1024 threads, it becomes 19.7 ms.
  • With 8 blocks each with 1024 threads, it becomes 17.7 ms.
  • With 16 blocks each with 1024 threads, it becomes 17.6 ms.
  • With 32 blocks each with 1024 threads, it becomes 17.4 ms.
  • With 256 blocks each with 1024 threads, it becomes 17.5 ms.
