Parallel Programming With CUDA Tutorial (Part-4: The Floyd-Warshall Algorithm)

Saaduddin Mahmud
5 min read · Jul 29, 2020


In this tutorial, we will tackle the famous Floyd–Warshall algorithm for finding shortest paths in a weighted graph with positive edge weights (or negative weights, as long as there are no negative cycles; we will not consider those here anyway). Why study this? Well, first of all, finding the shortest path between two nodes is a fundamental problem in graph theory. Second, it is not much of a jump from my previous tutorial (Matrix Multiplication), so it will be really easy :D! In this tutorial I will assume you are already familiar with the Floyd–Warshall (FW) algorithm and will mainly focus on how to implement it in CUDA. It is a really popular algorithm, and there are plenty of resources online that cover it (for example https://cp-algorithms.com/graph/all-pair-shortest-path-floyd-warshall.html). As usual, let's first define the problem:

Problem: Given a graph G(V, E), where V is a set of vertices and E is a set of edges, find the length of the shortest path between each pair of vertices.
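
Throughout the tutorial, dis will be this pairwise distance matrix. As a point of reference, here is a minimal sketch of how it might be initialized. The names INF and BuildDistanceMatrix, and the flattened std::vector layout, are my own assumptions, chosen to match the dis[i*V + j] indexing the CUDA code uses later:

#include <algorithm>
#include <tuple>
#include <vector>

const int INF = 100000000; // "no edge yet"; small enough that INF + INF does not overflow an int

// Builds a flattened |V| x |V| distance matrix from a directed edge list (u, v, w).
std::vector<int> BuildDistanceMatrix(int V, const std::vector<std::tuple<int, int, int>> &edges)
{
    std::vector<int> dis(V * V, INF);
    for (int i = 0; i < V; i++)
        dis[i * V + i] = 0;                            // a vertex is at distance 0 from itself
    for (auto [u, v, w] : edges)
        dis[u * V + v] = std::min(dis[u * V + v], w);  // keep the lightest edge between u and v
    return dis;
}

For an undirected graph you would also set the symmetric entry dis[v*V + u]. In the single-thread code below, dis[i][j] corresponds to dis[i*V + j] in this layout.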

In this tutorial, we will first look at the standard single-thread implementation of FW. Then we will identify the parts of the single-thread version that can be parallelized without thread synchronization. We will also see a neat trick that gets rid of the branch statement in the standard implementation and makes our code completely branchless. Using this knowledge, we will finally implement FW in CUDA (it is a good idea to read Part-3 first, trust me!). Finally, we will benchmark it against the single-thread CPU version.

The Floyd–Warshall algorithm (Single Thread):

FloydWarshall(dis)
{
    for (int k = 0; k < |V|; k++)
    {
        for (int i = 0; i < |V|; i++)
        {
            for (int j = 0; j < |V|; j++)
            {
                if (dis[i][j] > dis[i][k] + dis[k][j])
                    dis[i][j] = dis[i][k] + dis[k][j];
            }
        }
    }
}

Above is a classic O(|V|³) implementation of FW, where dis[i][j] contains the distance between vertices i and j. We can see from the pseudocode above that for a given value of k the entire |V| × |V| dis matrix is updated. Reads and writes happen in this matrix simultaneously, but the actual order of these operations does not matter: for a fixed k, the values dis[i][k] and dis[k][j] are never changed by the updates (since dis[k][k] = 0), so every cell can be updated independently. This makes it a really great candidate for GPU computation, so we will run the inner two loops on the GPU. The inner two loops have O(|V|²) complexity, and running them on a GPU yields O(|V|²/|P|) complexity, where |P| is the number of threads. With a typical GPU we get O(10³) threads, so when |V| ~ O(10³) we essentially have O(|V|) complexity for the inner loops and O(|V|²) overall complexity.

Step 1:

InnerLoops(dis, k)
{
    for (int i = 0; i < |V|; i++)
    {
        for (int j = 0; j < |V|; j++)
        {
            if (dis[i][j] > dis[i][k] + dis[k][j])
                dis[i][j] = dis[i][k] + dis[k][j];
        }
    }
}

FloydWarshall(dis)
{
    for (int k = 0; k < |V|; k++)
    {
        InnerLoops(dis, k);
    }
}

Now, in the first step, we simply separate the loops into two different functions. In the InnerLoops function we still have a branching condition that we need to get rid of, because GPUs hate branching.

Step 2:

InnerLoops(dis, k)
{
    for (int i = 0; i < |V|; i++)
    {
        for (int j = 0; j < |V|; j++)
        {
            int t = dis[i][k] + dis[k][j];
            dis[i][j] = t*(t < dis[i][j]) + dis[i][j]*(t >= dis[i][j]);
        }
    }
}

FloydWarshall(dis)
{
    for (int k = 0; k < |V|; k++)
    {
        InnerLoops(dis, k);
    }
}

Bamm! We got rid of the branch condition using the simple trick shown above. It works because C++ treats a true comparison as 1 and a false one as 0, so exactly one of the two products survives. If you are still confused, try out some values in that expression and compare it with the code from Step 1; it will start to make sense.
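
As a quick sanity check with made-up numbers (not from the article): if the candidate path length t is 5 and the current dis[i][j] is 7, the expression picks 5; if t were 9, it would keep 7.

#include <cstdio>

int main()
{
    int dij = 7;                             // current dis[i][j]
    int t = 5;                               // candidate dis[i][k] + dis[k][j]
    // (t < dij) evaluates to 1 and (t >= dij) to 0, so the result is 5*1 + 7*0 = 5
    dij = t * (t < dij) + dij * (t >= dij);
    printf("%d\n", dij);                     // prints 5, same as: if (t < dij) dij = t;
    return 0;
}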

Step 3:

__global__
void GPUInnerLoops(int *dis, int V, int k)
{
    //calculates the unique thread ID within the block
    int t = (blockDim.x*blockDim.y)*threadIdx.z + (threadIdx.y*blockDim.x) + threadIdx.x;

    //calculates the unique block ID within the grid
    int b = (gridDim.x*gridDim.y)*blockIdx.z + (blockIdx.y*gridDim.x) + blockIdx.x;

    //number of threads per block
    int T = blockDim.x*blockDim.y*blockDim.z;

    //number of blocks in the grid
    int B = gridDim.x*gridDim.y*gridDim.z;

    /*
     * Each cell of the matrix is assigned to exactly one thread.
     * Each thread does O(number of assigned cells) computation.
     * The cells assigned to different threads do not overlap,
     * so no synchronization is needed inside the kernel.
     */
    for (int i = b; i < V; i += B)
    {
        for (int j = t; j < V; j += T)
        {
            int tm = dis[i*V + k] + dis[k*V + j];
            dis[i*V + j] = tm*(tm < dis[i*V + j]) + dis[i*V + j]*(tm >= dis[i*V + j]);
        }
    }
}

void FloydWarshall(int *dis, int V)
{
    for (int k = 0; k < V; k++)
    {
        GPUInnerLoops<<<dim3(x,y,z), dim3(a,b,c)>>>(dis, V, k);
        cudaDeviceSynchronize();
    }
}

Finally, we just convert these two functions to CUDA. This is the exact same code from Part-3, with these two lines replacing the matrix-multiplication work:

int tm = dis[i*V + k] + dis[k*V + j];
dis[i*V + j] = tm*(tm < dis[i*V + j]) + dis[i*V + j]*(tm >= dis[i*V + j]);

In simple terms: in the single-thread code we incremented i and j by 1 in each iteration of the loops; here we increment i by B and j by T instead. So where the nested loops previously ran in O(|V|²), as discussed earlier, they now run in O(|V|²/(B*T)), where B*T is |P|, i.e. the number of threads. And we are done :D!
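
For completeness, here is one way the host side could allocate memory and launch the kernel using unified memory. This is only a sketch under my own assumptions: the RunFloydWarshall wrapper and the 256-block, 256-thread configuration are arbitrary choices, not the exact setup from Part-3:

#include <cstring>

void RunFloydWarshall(int *hostDis, int V)
{
    int *dis;
    size_t bytes = (size_t)V * V * sizeof(int);
    cudaMallocManaged(&dis, bytes);        // unified memory: the same pointer is valid on CPU and GPU
    memcpy(dis, hostDis, bytes);

    dim3 grid(256, 1, 1);                  // B = 256 blocks (example values only)
    dim3 block(256, 1, 1);                 // T = 256 threads per block

    for (int k = 0; k < V; k++)
    {
        GPUInnerLoops<<<grid, block>>>(dis, V, k);
        cudaDeviceSynchronize();           // all of step k must finish before step k+1 starts
    }

    memcpy(hostDis, dis, bytes);           // copy the all-pairs shortest paths back out
    cudaFree(dis);
}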

Benchmark

We will now benchmark GPU FW against CPU FW while varying the number of vertices. For this we will use Google Colab, which is a departure from the earlier tutorials; I am doing this because Google Colab is more accessible. You can find the notebook in the link below. It already contains the necessary script to set up C++ CUDA in Colab (yes, the code is still in C++).
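
If you want to reproduce the measurements, a simple wall-clock timer around the driver is enough. The sketch below uses std::chrono and the RunFloydWarshall wrapper assumed above; the notebook's actual timing code may differ:

#include <chrono>

// Measures one full GPU Floyd-Warshall run, including the per-k synchronizations.
double TimeGPUFloydWarshall(int *hostDis, int V)
{
    auto start = std::chrono::high_resolution_clock::now();
    RunFloydWarshall(hostDis, V);          // runs all |V| kernel launches, synchronizing after each
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(end - start).count();   // elapsed time in seconds
}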

Unsurprisingly, we see a huge boost in performance when FW runs on the GPU. I had to plot the times on a logarithmic scale (base 10) because the difference is so large. To give an idea: for 32000 vertices, FW takes ~20 minutes on the GPU, while on the CPU it takes ~69.7 hours (predicted by running k < |V| iterations of the outer loop and extrapolating). Colab gave me around ~9 GB of GPU memory, and with this I could run Floyd-Warshall for up to ~46000 vertices, after which Colab ran out of memory (46000 * 46000 * 4 bytes ≈ 8.46 GB).
