Parallel Programming With CUDA Tutorial (Part-3: Matrix Multiplication)

Saaduddin Mahmud
6 min readJan 31, 2019

In this tutorial, we will tackle a well-suited problem for Parallel Programming and quite a useful one, unlike the previous one :P. We will do Matrix Multiplication. Hurray !!! Solution to many problems in CS is formulated with Matrices. So the ability to perform fast matrix multiplication is really important. Let’s define the problem statement concretely:

Problem: Given two N*N matrix X and Y of floating point numbers, perform matrix multiplication and store the result on a N*N matrix ans.

First, I will explain why this is a well-suited problem for Parallel Programming. After that, I will show you a sequential version of matrix multiplication and will evaluate performance for the sequential version. After that, I will move on to the parallel version and its performance evaluation. Finally, we will compare the results.

Why This Problem?

Shared Resource:

In our current context, shared resources are any entities that are accessed and used by multiple threads. So array indices that are modified or read by several threads are shared resources.

Problem With Shared Resource:

It is not a problem when a memory location is read by several threads. Problems arise when a memory location is updated by several threads. For example, suppose you have a variable X=0. You want to increment X with 5 different threads simultaneously. And each thread will increment X by 50. What will be the value of X after all the threads are done? Logically it should be 5*50 = 250. But if you run this program several times you will get several different answers. Why? Because when two different thread tries to update the same memory location at the same time one of the updates might get overwritten. And many similar problems might arise from shared resources.

How To Solve This Problem?

In our code, parts, where we access shared resources to read or modify, is called critical sections. There are two ways to solve this problem:

  1. Avoid critical section altogether. This is the ideal solution.
  2. But sometimes it is not possible to avoid critical sections. In that case, we would require thread synchronization.

Finally, why this problem…?

Well, for matrix multiplication it is possible to avoid critical sections. That is why I have chosen this problem. For our next tutorial, I will show how to synchronize threads with CUDA.

Sequential Matrix Multiplication

Below is a code for matrix multiplication using C++. It is the standard O(N³) procedure. Here x, y, and ans are three N² size matrices(N² sized 1D array. We are using 1D arrays as of it were 2D).

void cpuMatMul(int N,double *x, double *y, double *ans)
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
for(int k=0;k<N;k++)

The graph below shows how the runtime scales up as N gets larger using a single core of Core i5–7500 CPU @ 3.40GHz; 16 GB DDR4 2400mhz.

For a matrix of size 1024 X 1024, it takes about 10 seconds to complete the task. For a matrix of size 4096 X 4096, it takes about 2500 seconds(larger matrices add extra overhead. So it does not scale up exactly in O(N³) after 1024) to complete the task. Above (4096 X 4096), it just gets infeasible to run the experiment.

Parallel Matrix Multiplication

Following is the parallel version of the same code we have seen above.

void GPUmatmul(int N, double *x, double *y, double *ans)
//calculates unique thread ID in the block
int t= (blockDim.x*blockDim.y)*threadIdx.z
//calculates unique block ID in the grid
int b= (gridDim.x*gridDim.y)*blockIdx.z
//block size (this is redundant though)
int T= blockDim.x*blockDim.y*blockDim.z;
//grid size (this is redundant though)
int B= gridDim.x*gridDim.y*gridDim.z;
//Each cell in the matrix is assigned to a different thread.
//Each thread do O(N*number of asssigned cell) computation.
//Assigned cells of different threads does not overlape with
//each other. And so no need for synchronization.
for (int i=b;i<N;i+=B)
for(int j=t;j<N;j+=T)
for(int k=0;k<N;k++)


t is the thread number of a thread inside a particular block. 
b is the block number of a block inside the grid.
B is total number of block.
T is total number of threads per block.

I have explained how thread id, block id and 3D structure of threads and blocks work in part-2.

We had to calculate in this way since the thread IDs are given in CUDA as 3D tuple but we needed a single integer. With t and b we can identify each thread uniquely. After that each thread iterates over only the cells they were assigned and do O(N) computation for that cell. Let’s define the number of the assigned cell to a thread as AC. So in total, a thread does O(N*AC) computation. We can calculate:

AC = (N*N)/(T*B) = (N/T)*(N/B)

We assign a cell to a thread just by writing the nested for loops following way:

for (int i=b;i<N;i+=B)
for(int j=t;j<N;j+=T)
Example: Let B=2 and T=2. Size of ans is 3*3.
(Thread 0 of block 0): will do computation on (0,0)(0,2)(2,0)(2,2).
(Thread 1 of block 0): will do computation on (0,1)(2,1).
(Thread 0 of block 1): will do computation on (1,0)(1,2).
(Thread 1 of block 1): will do computation on (1,1).
As we can see if N is perfectly devisible by both T and B each threads will be assigned equal number of cells.

This makes sure no two thread does computation on the same cell. It is really important. Because if two or more threads did computation on the same cell the results might become inconsistent.

Finally, we call this function using the following commands. Here, dim3(16,16,16) says how the blocks are structured in the grid. And here dim3(16,8,8) says how threads are structured in each block. We are using in total 16*16*16 blocks and 16*8*8 thread per block.

GPUmatmul<<<dim3(16,16,16), dim3(16,8,8)>>>(N, x, y,ans);

Here is the performance of the above code. Even though it is a naive code it does the wonders. Using GPU we were able to multiply two (8192*8192) matrix with the O(N³) algorithm(There are better algorithms like Karatsuba’s algorithm but too complicated to explain or implement here.) under a minute (47 sec). Which is in order of (5.5 * 10¹¹) computation. And we definitely did more computation since we had to calculate t, b, T, B. There is definitely room for improvement. Do it yourself 👊 .


As we discussed previously, Core i5 and GTX1050ti are comparable as they are similarly priced. With a single core of core i5, it took us 2500 sec (42 min) to compute matrix multiplication for (4096 X 4096 ) sized matrices. Even if we were optimistic; it seems like it would take at least 10 (min) if we would have used all four cores. But as we can see from the graph for (4096 X4096 ) the GPU took only 5.5 sec. So the GPU was 113 times faster even with this naive approach. We also were able to (8192 X 8192 ) in 47 sec but it was infeasible for the CPU.


The full code can be found in the following link.

Here is the gist for this part:

Update: Here is the link to part 4