Many matrix operations involve massive amounts of computation. When the matrix is huge, this computation is usually the bottleneck of the program. A GPU comes with a large number of cores and is well suited to parallel programming. In this task, we parallelize element-wise reciprocal addition for 2D matrices: $$C_{i,j}=\frac{1}{A_{i,j}}+\frac{1}{B_{i,j}}$$
where the dimensions of A, B, and C are N × N, with N = 6400.
Methods
Declare the host matrices as static arrays, since static arrays occupy contiguous memory on the host, which is crucial for the memory copy from host to device. There are, however, ways to declare dynamic arrays with contiguous host memory; see transfer dynamic 2D array memory to CUDA, 2016.
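The original source is not shown; the following is a minimal sketch of this setup, assuming the matrices are `float` and the variable names are illustrative:

```cuda
#include <cuda_runtime.h>

#define N 6400

// Static host arrays: the N*N floats are contiguous, so each
// matrix can be transferred with a single cudaMemcpy.
static float h_A[N][N], h_B[N][N], h_C[N][N];

int main(void) {
    const size_t bytes = (size_t)N * N * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // One contiguous host-to-device copy per input matrix.
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // ... launch the kernel here ...

    // Copy the result back in a single transfer as well.
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```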
The parameter work is the number of computations each thread performs along each dimension.
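One plausible kernel consistent with this description is sketched below, assuming a 2D launch (`dim3 block(t, t)` and `grid(b, b)`) in which each thread computes a work × work tile of C; the indexing scheme in the original code may differ:

```cuda
// Each thread computes a `work` x `work` tile of C, where
// C[i][j] = 1/A[i][j] + 1/B[i][j] (row-major flattened layout).
__global__ void reciprocalAdd(const float *A, const float *B,
                              float *C, int n, int work) {
    int i0 = (blockIdx.y * blockDim.y + threadIdx.y) * work;
    int j0 = (blockIdx.x * blockDim.x + threadIdx.x) * work;
    for (int di = 0; di < work; ++di) {
        for (int dj = 0; dj < work; ++dj) {
            int i = i0 + di, j = j0 + dj;
            if (i < n && j < n) {          // guard the matrix edge
                int idx = i * n + j;
                C[idx] = 1.0f / A[idx] + 1.0f / B[idx];
            }
        }
    }
}
```

With this scheme, work × threads × blocks = N along each dimension, which matches the parameter combinations tested below (e.g. 1 × 16 × 400 = 6400, or 2 × 16 × 200 = 6400).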
Randomly initialize the matrix elements to values between 0 and 1.
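The exact random number generator used is not shown; a sketch using the C standard library might look like this (shifting `rand()` by one is an assumption on my part, to keep values strictly positive):

```cuda
#include <stdlib.h>

#define N 6400
static float h_A[N][N], h_B[N][N];

// Fill a host matrix with values in (0, 1]. Shifting rand() by 1
// avoids exact zeros, which would make 1/x infinite.
static void randomInit(float m[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            m[i][j] = (rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
}
```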
This code runs on a cluster managed by the HTCondor computing resource management system; the GPU cards are GeForce GTX 970. I have written two Python scripts for batch computing: one tests the various parameters (block size, grid size, and work per thread), and the other combines the outputs into a single CSV for comparison.
To reproduce the results, first run test_params.py and wait for all jobs to complete. Then run gather_results.py.
test_params.py
gather_results.py
Results
I have tested various parameter settings: the work per thread, the thread dimensions, and the block dimensions. Some columns (GPU Gflops, CPU Gflops) are omitted here to fit this document.
| Work per thread | Thread dim (x = y) | Block dim (x = y) | GPU input time (ms) | GPU processing time (ms) | GPU output time (ms) | GPU total time (ms) | CPU processing time (ms) | GPU speedup |
|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 1600 | 61.09626 | 26.14269 | 90.63619 | 177.8751 | 362.3188 | 2.036928 |
| 1 | 8 | 800 | 61.12278 | 14.18016 | 91.81216 | 167.1151 | 362.2879 | 2.167894 |
| 1 | 10 | 640 | 61.04918 | 16.49891 | 92.22903 | 169.7771 | 362.9761 | 2.137956 |
| **1** | **16** | **400** | **61.06547** | **12.70762** | **89.49069** | **163.2638** | **362.535** | **2.220548** |
| 1 | 20 | 320 | 61.03123 | 15.13014 | 89.81895 | 165.9803 | 362.6187 | 2.184709 |
| 1 | 32 | 200 | 61.07587 | 14.31136 | 89.54723 | 164.9345 | 362.4345 | 2.197445 |
| 2 | 16 | 200 | 61.09696 | 11.56202 | 91.93604 | 164.595 | 362.6427 | 2.203243 |
| 4 | 16 | 100 | 61.10176 | 17.69344 | 89.79328 | 168.5885 | 362.7742 | 2.151833 |
| 16 | 16 | 25 | 61.11475 | 63.6985 | 89.61117 | 214.4244 | 362.4164 | 1.690182 |
| 80 | 16 | 5 | 61.04218 | 78.78506 | 92.21085 | 232.0381 | 362.2871 | 1.561326 |
The best result is the run with thread dimension 16 and block dimension 400 (total 163.2638 ms). Several intriguing facts can be observed from the results:
The GPU spends most of its time on memory copying: input time and output time dominate the total GPU time. Specifically, the best result spends (61.06547 + 89.49069) / 163.2638 ≈ 92% of its time on memory copying.
Changing the block dimensions has little influence on the GPU speedup in this simple computation.
Increasing the work per thread can actually reduce the GPU speedup.
Conclusions
From the results above, some conclusions can be drawn:
The total GPU time is dominated by the time spent copying memory between host and device, so it is critical to minimize this communication overhead.
Since this work only involves element-wise addition, a relatively easy task, fine-tuning the block sizes has little effect on the speedup. I look forward to solving harder problems to observe the relationship between these settings and the speedup.