Introduction

Many matrix operations involve massive amounts of computation. If the matrix is huge, these computations are usually the bottleneck of the program. A GPU comes with a large number of cores, which makes it very suitable for parallel programming. In this task, we parallelize element-wise reciprocal addition for 2D matrices: $$C_{i,j}=\frac{1}{A_{i,j}}+\frac{1}{B_{i,j}}$$ where A, B, and C are N × N matrices and N = 6400.

Methods

Declare the host vectors as static arrays, since static arrays occupy contiguous host memory, which is crucial for the memory copy from host to device. However, there are also ways to declare dynamic arrays with contiguous host memory; see transfer dynamic 2D array memory to CUDA, 2016.
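For example, a minimal sketch of such a contiguous dynamic allocation (assuming a flat N * N buffer with manual index arithmetic, not the exact code from that reference):
// Sketch: a contiguous dynamic alternative to the static arrays used below.
// A single flat N * N buffer is contiguous, so one cudaMemcpy covers it;
// element (i, j) is addressed as buf[i * N + j].
float* buf = (float*)malloc((size_t)N * N * sizeof(float));
// ... fill buf, then cudaMemcpy(d_A, buf, (size_t)N * N * sizeof(float),
//     cudaMemcpyHostToDevice), etc. ...
free(buf);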
// Variables
const int N = 6400;
float h_A[N][N];   // host vectors
float h_B[N][N];
float h_C[N][N];   // GPU result, copied back to host
float h_D[N][N];   // CPU reference result, for verification
float* d_A;   // device vectors
float* d_B;
float* d_C;
The parameter work is the number of computations each thread performs along each dimension, so every thread handles a work × work tile of the output.
// Device code
__global__ void VecAdd(const float A[N][N], const float B[N][N], float C[N][N], int n, int work)
{
    // n is the matrix dimension; the array bounds use the global constant N
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    if (i < n / work && j < n / work) {
        for (int k = 0; k < work; ++k)
            for (int l = 0; l < work; ++l)
                C[i * work + k][j * work + l] = 1.0f / A[i * work + k][j * work + l]
                                              + 1.0f / B[i * work + k][j * work + l];
    }
    // No __syncthreads() is needed: each thread writes its own disjoint
    // work x work tile of C, so there is no inter-thread dependency.
}
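For completeness, the host side allocates device memory, copies the inputs, launches the kernel, and copies the result back. The following is a sketch with example values; the actual thread and work settings are the tuned parameters below, and error checking is omitted:
// Host code (sketch): allocate, copy in, launch, copy back.
int threads = 16, work = 1;                 // example values of the tuned parameters
size_t size = (size_t)N * N * sizeof(float);
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
dim3 threadsPerBlock(threads, threads);
dim3 numBlocks(N / (threads * work), N / (threads * work));
// the device pointers are cast to 2D array pointers to match the kernel signature
VecAdd<<<numBlocks, threadsPerBlock>>>((const float(*)[N])d_A, (const float(*)[N])d_B,
                                       (float(*)[N])d_C, N, work);
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);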
Randomly initialize the matrix elements to values between 0 and 1.
// Fills an array with random float entries in [0, 1].
void RandomInit(float data[N][N])
{
    for(int i = 0; i < N; ++i)
        for(int j = 0; j < N; ++j)
            data[i][j] = rand() / (float)RAND_MAX;
}
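For reference, the CPU baseline h_D and the norm(h_C - h_D) check reported in the results below can be computed along these lines (a sketch; the exact version is in the full code):
#include <cmath>   // for std::sqrt

// CPU reference (sketch): same element-wise reciprocal addition on the host.
void CpuRef(const float A[N][N], const float B[N][N], float D[N][N])
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            D[i][j] = 1.0f / A[i][j] + 1.0f / B[i][j];
}

// Euclidean norm of h_C - h_D, used to verify the GPU result.
double Norm(const float C[N][N], const float D[N][N])
{
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double diff = (double)C[i][j] - (double)D[i][j];
            sum += diff * diff;
        }
    return std::sqrt(sum);
}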
The full code can be obtained here.

Tuning block and grid size

This code runs on a cluster managed by the HTCondor resource management system, with GeForce GTX 970 GPU cards. I wrote two Python scripts for batch computing: one tests the various parameters (block size, grid size, and work per thread) and the other combines the outputs into a single CSV file for comparison.

To reproduce the results, first run test_params.py, wait for all jobs to complete, and then run gather_results.py.

test_params.py

import os
import time

# (threads x, threads y, work per thread) combinations to test
testing_params = [(4, 4, 1), (8, 8, 1), (10, 10, 1), (16, 16, 1), (20, 20, 1), (32, 32, 1),
    (16, 16, 2), (16, 16, 4), (16, 16, 16), (16, 16, 80)]
for params in testing_params:
    in_name = '{}_{}_{}_in'.format(params[0], params[1], params[2])
    cmd_name = 'cmd_{}_{}_{}'.format(params[0], params[1], params[2])
    # write the parameter file read by the CUDA program
    with open(in_name, 'w') as f:
        f.write('{}\n{}\n{}#\n'.format(params[0], params[1], params[2]))
    # write the HTCondor submit description for this parameter set
    with open(cmd_name, 'w') as f:
        cmd_content = ('Universe=vanilla\nExecutable=/opt/bin/runjob\nMachine_count=1\n\n'
            'Output=condor.out\nError=condor.err\nLog=condor.log\n\n'
            'Requirements=(MTYPE=="TEST")\n\nRequest_cpus=1\n\n'
            'Initialdir=/path/to/working/directory\n\nArguments=./vecAdd_v2 {}_{}_{}_in {}_{}_{}_out\n\n'
            'Queue\n\n').format(
            params[0], params[1], params[2],
            params[0], params[1], params[2])
        f.write(cmd_content)
    os.system('condor_submit {}'.format(cmd_name))
    time.sleep(1)  # avoid flooding the scheduler with submissions
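Each input file holds the two thread dimensions and the work per thread, one per line, terminated by '#'. A purely hypothetical sketch of how vecAdd_v2 might read it (the actual parser is in the full code):
// Hypothetical parameter parsing for vecAdd_v2; the function name and
// format handling here are illustrative only.
#include <cstdio>
void ReadParams(const char* path, int* threadsX, int* threadsY, int* work)
{
    FILE* fin = fopen(path, "r");
    // read the three integers; the trailing '#' marker is left unread
    fscanf(fin, "%d %d %d", threadsX, threadsY, work);
    fclose(fin);
}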

gather_results.py

import os
import subprocess

result_name = 'results.csv'
with open(result_name, 'w') as f:
    # CSV header matching the fields each job prints on its last line
    f.write('work for each thread,'
        'number of thread x,'
        'number of thread y,'
        'number of blocks x,'
        'number of blocks y,'
        'input time for GPU (ms),'
        'processing time for GPU (ms),'
        'GPU Gflops,'
        'output time for GPU (ms),'
        'total time for GPU (ms),'
        'processing time for CPU (ms),'
        'CPU Gflops,'
        'speed up of GPU,'
        'norm(h_C - h_D)\n')
    for file_name in os.listdir('./'):
        if file_name.endswith('_out'):  # skip condor.out and other files
            # the last line of each job's output file holds its CSV record
            line = subprocess.check_output(['tail', '-1', file_name]).decode()
            f.write(line.rstrip('\n') + '\n')

Results

I tested various parameter settings: the amount of work per thread, the thread dimensions, and the block dimensions. Two columns (GPU Gflops and CPU Gflops) are omitted here so the table fits in this document.

work per thread | threads per block (x = y) | blocks per grid (x = y) | input time for GPU (ms) | processing time for GPU (ms) | output time for GPU (ms) | total time for GPU (ms) | processing time for CPU (ms) | speed up of GPU
1 | 4 | 1600 | 61.09626 | 26.14269 | 90.63619 | 177.8751 | 362.3188 | 2.036928
1 | 8 | 800 | 61.12278 | 14.18016 | 91.81216 | 167.1151 | 362.2879 | 2.167894
1 | 10 | 640 | 61.04918 | 16.49891 | 92.22903 | 169.7771 | 362.9761 | 2.137956
1 | 16 | 400 | 61.06547 | 12.70762 | 89.49069 | 163.2638 | 362.535 | 2.220548 *
1 | 20 | 320 | 61.03123 | 15.13014 | 89.81895 | 165.9803 | 362.6187 | 2.184709
1 | 32 | 200 | 61.07587 | 14.31136 | 89.54723 | 164.9345 | 362.4345 | 2.197445
2 | 16 | 200 | 61.09696 | 11.56202 | 91.93604 | 164.595 | 362.6427 | 2.203243
4 | 16 | 100 | 61.10176 | 17.69344 | 89.79328 | 168.5885 | 362.7742 | 2.151833
16 | 16 | 25 | 61.11475 | 63.6985 | 89.61117 | 214.4244 | 362.4164 | 1.690182
80 | 16 | 5 | 61.04218 | 78.78506 | 92.21085 | 232.0381 | 362.2871 | 1.561326

The best result is marked with an asterisk. Several intriguing facts can be observed from the results:
  1. The GPU spends most of its time on memory copying: input and output time dominate the total GPU time. Specifically, the best configuration spent (61.06547 + 89.49069) / 163.2638 ≈ 92% of its time on memory copies.
  2. Altering the block dimensions does not have much influence on the GPU speedup for this simple computation.
  3. Increasing the work per thread can actually reduce the GPU speedup.

Conclusions

From the results above, some conclusions can be drawn:
  1. The total GPU time is dominated by memory copying between host and device, so it is critical to minimize this communication overhead.
  2. Since this work only involves a simple element-wise operation, a relatively easy task, fine-tuning the block size has little effect on the speedup. I look forward to solving harder problems to observe the relationship between these settings and the speedup.
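A standard next step for conclusion 1, not measured in these experiments, is pinned (page-locked) host memory, which typically speeds up host-device transfers. A minimal sketch, assuming the same N × N float buffers:
// Sketch: pinned host memory for faster host-device copies.
float* h_A_pinned;
cudaMallocHost((void**)&h_A_pinned, (size_t)N * N * sizeof(float));  // page-locked allocation
// ... initialize h_A_pinned and cudaMemcpy to the device as before;
// transfers from pinned memory avoid an extra staging copy in the driver ...
cudaFreeHost(h_A_pinned);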

References

  1. transfer dynamic 2D array memory to CUDA, 2016
  2. HTCondor
  3. GeForce GTX 970