Introduction

This time we are going to use CUDA to parallelize the iterative computation for Laplace's equation. The 2-D Laplace's equation has the form $$\nabla^2u=\frac{\partial^2u}{\partial x^2}+\frac{\partial^2u}{\partial y^2}=0$$ If we consider a 2-D heat equation, $$\frac{\partial u}{\partial t}=k\nabla^2u+\frac{Q}{c\rho}$$ Laplace's equation arises when solving for the equilibrium (time-independent) solution with no source term. Let's solve this equation on a rectangle given by $$0\leq x\leq L, \quad 0\leq y\leq H$$ For this geometry, Laplace's equation with four boundary conditions reads $$\nabla^2u=\frac{\partial^2u}{\partial x^2}+\frac{\partial^2u}{\partial y^2}=0$$ $$u(0,y)=g_1(y)\quad u(L,y)=g_2(y)$$ $$u(x,0)=f_1(x)\quad u(x,H)=f_2(x)$$

Here we use an iterative method, the Gauss-Seidel method in particular, to solve this boundary-value problem, in which the values of the solution function are specified on the boundary of a domain and the task is to compute the solution on its interior. Modeling such a problem of potential theory, where u is a smooth real-valued function, its second derivative can be approximated by $$\frac{\partial^2 u}{\partial x^2}=\frac{u(x-h)-2u(x)+u(x+h)}{h^2}+O(h^2)$$ Using this in two dimensions for a function u of two arguments at the point (x, y), and solving for u(x, y), gives $$u(x,y)=\frac{1}{4}\left(u(x+h,y)+u(x,y+h)+u(x-h,y)+u(x,y-h)-h^2\nabla^2u(x,y)\right)+O(h^4)$$ To approach the solution numerically on a 2-D grid with grid spacing h, the iterative method assigns the given boundary values to the grid points on the boundary and arbitrary values to the interior grid points, and then repeatedly performs the assignment u := u* on the interior points, where u* is defined by $$u^*(x,y)=\frac{1}{4}\left(u(x+h,y)+u(x,y+h)+u(x-h,y)+u(x,y-h)-h^2f(x,y)\right)$$ until convergence. Here f is the source term (f = ∇²u), which is zero for Laplace's equation.
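To make the update concrete, one relaxation sweep over a flattened row-major grid can be sketched as below (a minimal CPU sketch assuming f = 0, i.e. Laplace's equation; the names phi_old, phi_new, Nx, and Ny are illustrative, and the boundary points are left untouched):

// One relaxation sweep: every interior point becomes the average of its four
// neighbours; the squared change is accumulated for the convergence test.
double sweep(const float* phi_old, float* phi_new, int Nx, int Ny)
{
	double sum = 0.0;
	for (int y = 1; y < Ny - 1; y++) {
		for (int x = 1; x < Nx - 1; x++) {
			int site = x + y * Nx;
			float star = 0.25f * (phi_old[site - 1] + phi_old[site + 1]
			                    + phi_old[site - Nx] + phi_old[site + Nx]);
			float diff = star - phi_old[site];
			phi_new[site] = star;
			sum += diff * diff;
		}
	}
	return sum;	// compare sqrt(sum) against a tolerance to decide convergence
}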

Methods

Since the convergence criterion depends on the difference between two consecutive iterations, two texture references are declared, one bound to the old field and one to the new. In each iteration the assignment u := u* is performed on the GPU, and the squared differences between the two iterations are reduced within each block to a single value in shared memory by parallel reduction. A detailed explanation of parallel reduction can be found here.
__align__(8) texture<float>	texOld;	 // declare the texture
__align__(8) texture<float>	texNew;


__global__ void laplacianTex(float* phi_old, float* phi_new, float* C, bool flag)
{
	extern __shared__ float cache[];	 
	float	t, l, c, r, b;	 // top, left, center, right, bottom
	float	diff = 0.0;	 // initialized so boundary threads contribute zero to the error sum
	int	site, ym1, xm1, xp1, yp1;

	int Nx = blockDim.x * gridDim.x;
	int Ny = blockDim.y * gridDim.y;
	int x = blockDim.x * blockIdx.x + threadIdx.x;
	int y = blockDim.y * blockIdx.y + threadIdx.y;
	int cacheIndex = threadIdx.x + threadIdx.y * blockDim.x;	

	site = x + y * Nx;

	if((x == 0) || (x == Nx - 1) || (y == 0) || (y == Ny - 1)) {	
		// boundary sites keep their fixed boundary values; nothing to update
	}
	else {
		xm1 = site - 1;	// x-1
		xp1 = site + 1;	// x+1
		ym1 = site - Nx;	 // y-1
		yp1 = site + Nx;	 // y+1
		if(flag){
			b = tex1Dfetch(texOld, ym1);		// read d_old via texOld
			l = tex1Dfetch(texOld, xm1);
			c = tex1Dfetch(texOld, site);
			r = tex1Dfetch(texOld, xp1);
			t = tex1Dfetch(texOld, yp1);

			phi_new[site] = 0.25 * (b + l + r + t);
			diff = phi_new[site] - c;
		}
		else {
			b = tex1Dfetch(texNew, ym1);	 // read d_new via texNew
			l = tex1Dfetch(texNew, xm1);
			c = tex1Dfetch(texNew, site);
			r = tex1Dfetch(texNew, xp1);
			t = tex1Dfetch(texNew, yp1);

			phi_old[site] = 0.25 * (b + l + r + t);
			diff = phi_old[site] - c;
		}
	}

	// each thread saves its error estimate to the shared memory

	cache[cacheIndex] = diff * diff;	
	__syncthreads();

	// parallel reduction in each block 

	int ib = blockDim.x * blockDim.y / 2;	
	while (ib != 0) {	
		if(cacheIndex < ib)	
			cache[cacheIndex] += cache[cacheIndex + ib];
		__syncthreads();
		ib /= 2;	
	}

	// save the partial sum of each block to C

	int blockIndex = blockIdx.x + gridDim.x * blockIdx.y;
	if(cacheIndex == 0)
		C[blockIndex] = cache[0];	
}
The full code can be obtained here.
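The kernel only produces one partial error sum per block, so the host binds the two textures, launches the kernel with the flag flipped on alternate iterations, and sums the per-block values to test convergence. Below is a minimal host-side sketch of that loop (not the full code above; the buffer names d_old, d_new, d_C, h_C, the block dimensions tx and ty, and the tolerance eps and iteration cap MAX_ITER are assumptions):

// Host-side driver (sketch). d_old, d_new, d_C are device buffers; h_C holds
// one float per block on the host.
size_t fieldBytes = Nx * Ny * sizeof(float);
cudaBindTexture(NULL, texOld, d_old, fieldBytes);	// tex1Dfetch(texOld, i) reads d_old[i]
cudaBindTexture(NULL, texNew, d_new, fieldBytes);

dim3 threads(tx, ty);				// e.g. 16x16 threads per block
dim3 blocks(Nx / tx, Ny / ty);			// e.g. 32x32 blocks for a 512x512 lattice
int sharedBytes = tx * ty * sizeof(float);	// shared-memory cache for the in-block reduction
int numBlocks = blocks.x * blocks.y;

bool flag = true;
double error = 2.0 * eps;
int iter = 0;
while (error > eps && iter < MAX_ITER) {
	laplacianTex<<<blocks, threads, sharedBytes>>>(d_old, d_new, d_C, flag);
	cudaMemcpy(h_C, d_C, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
	error = 0.0;
	for (int i = 0; i < numBlocks; i++)	// final reduction of the per-block partial sums
		error += h_C[i];
	error = sqrt(error);
	flag = !flag;				// swap the roles of the old and new fields
	iter++;
}
cudaUnbindTexture(texOld);
cudaUnbindTexture(texNew);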

The code is compiled and run on a cluster managed by the HTCondor resource management system; the GPU cards are GeForce GTX 970. The following two Python scripts perform the batch computation, testing various parameters (boundary conditions, block size, grid size) and combining the outputs into a single CSV file for each boundary condition.

To reproduce the results, first run test_params.py, wait for all jobs to complete, and then run gather_results.py.

test_params.py

# Generate an input file and an HTCondor submit file for every parameter
# combination, then submit each job to the cluster.
import os
import time

lattice_sizes = [32, 64, 128, 256, 512, 640]
block_sizes = [4, 8, 16, 32]
bcs = [(0.0, 0.0, 0.0, 1.0), (1.0, -1.0, -2.0, 5.0)]

for lattice_size in lattice_sizes:
	for block_size in block_sizes:
		for bc in bcs:
			in_name = '{}_{}_{}_{}_{}_{}_in'.format(lattice_size, block_size, bc[0], bc[1], bc[2], bc[3])
			cmd_name = 'cmd_{}_{}_{}_{}_{}_{}'.format(lattice_size, block_size, bc[0], bc[1], bc[2], bc[3])
			with open(in_name, 'w') as f:
				f.write('{} {}\n{} {}\n{} {} {} {}\n'.format(lattice_size, lattice_size, block_size, block_size, 
					bc[0], bc[1], bc[2], bc[3]))
			with open(cmd_name, 'w') as f:
				cmd_content = ('Universe=vanilla\nExecutable=/opt/bin/runjob\nMachine_count=1\n\n'
					'Output=condor.out\nError=condor.err\nLog=condor.log\n\n'
					'Requirements=(MTYPE=="TEST")\n\nRequest_cpus=1\n\n'
					'Initialdir=/path/to/working/directory\n\nArguments=./laplace {}_{}_{}_{}_{}_{}_in {}_{}_{}_{}_{}_{}_out\n\n'
					'Queue\n\n').format(
					lattice_size, block_size, bc[0], bc[1], bc[2], bc[3],
					lattice_size, block_size, bc[0], bc[1], bc[2], bc[3])
				f.write(cmd_content)
			os.system('condor_submit {}'.format(cmd_name))
			time.sleep(1)
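For example, with lattice_size = 32, block_size = 4, and the first boundary condition, the script above writes an input file named 32_4_0.0_0.0_0.0_1.0_in containing the lattice dimensions, the block dimensions, and the four boundary values on three lines, which the submit file then passes to ./laplace:

32 32
4 4
0.0 0.0 0.0 1.0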

gather_results.py

# Collect the last line of every *_out file for each boundary condition and
# combine them into a single CSV file.
import os
import subprocess

bcs = [(0.0, 0.0, 0.0, 1.0), (1.0, -1.0, -2.0, 5.0)]
for bc in bcs:
	result_name = 'results_{}_{}_{}_{}.csv'.format(bc[0], bc[1], bc[2], bc[3])

	with open(result_name, 'w') as f:
		f.write('Size of the 2D lattice (x),'
			'Number of threads per block (x),'
			'Input time for GPU (Texture) (ms),'
			'Error for GPU (Texture),'
			'Iteration of GPU (Texture),'
			'Processing time for GPU (Texture) (ms),'
			'GPU (Texture) Gflops,'
			'Output time for GPU (Texture) (ms),'
			'Total time for GPU (Texture) (ms),'
			'Input time for GPU (ms),'
			'Error for GPU,'
			'Iteration of GPU,'
			'Processing time for GPU (ms),'
			'GPU Gflops,'
			'Output time for GPU (ms),'
			'Total time for GPU (ms),'
			'Error for CPU,'
			'Iteration of CPU,'
			'Processing time for CPU (ms),'
			'CPU Gflops,'
			'Speed up of GPU (Texture),'
			'Speed up of GPU\n')
		for file_name in os.listdir('./'):
			if '{}_{}_{}_{}_out'.format(bc[0], bc[1], bc[2], bc[3]) in file_name:
				# keep only the final summary line of each output file
				line = subprocess.check_output(['tail', '-1', file_name]).decode()
				if ',' not in line:
					continue
				f.write(line + '\n')

Results

I have tested different lattice sizes (32, 64, 128, 256, 512, and 640) and block sizes (4, 8, 16, 32). In this part, I only show the results for the boundary condition where the top is 1.0 and the others are 0.0. As a result, there are 24 rows and 22 columns in total (the columns are listed in the gather_results part of the Methods section). Since the full table is very large, it is split into three separate tables: GPU (texture), GPU, and CPU.
For 2D lattice sizes smaller than 128×128, the CPU performs better than both the GPU with texture memory and the GPU with device memory. Unlike the previous homework, this time the input and output memory-copy times have little effect on the total (less than 2 milliseconds, compared to processing times of 60 to 170,000 milliseconds).
When the 2D lattice size is 128×128 or 256×256, the GPU with device memory performs better than the CPU and the GPU with texture memory. For a 128×128 lattice, the best configuration is 16×16 threads per block, which gives 8×8 blocks per grid. For a 256×256 lattice, the best setting is 8×8 threads per block, which gives 32×32 blocks per grid.
However, once the 2D lattice size grows to 512×512 or larger, texture memory shows a substantial advantage over the GPU with device memory and the CPU, with no loss of accuracy. For a 512×512 lattice, the best configuration is 16×16 threads per block, which gives 32×32 blocks per grid. For a 640×640 lattice, the best setting is 8×8 threads per block, which gives 80×80 blocks per grid. The best speedups of the GPU and the GPU with texture memory are highlighted in yellow.
Here is the result for GPU with texture memory. The size of the 2D lattice (y) and the number of threads per block (y) are the same as the corresponding x values, so they are omitted:

| Size of the 2D lattice (x) | Number of threads per block (x) | Input time (ms) | Error | Iteration | Processing time (ms) | Gflops | Output time (ms) | Total time (ms) |
|---|---|---|---|---|---|---|---|---|
| 32 | 4 | 0.53472 | 0 | 2606 | 60.45296 | 0.27158 | 0.375648 | 61.36333 |
| 32 | 8 | 0.534208 | 0 | 2606 | 59.2072 | 0.277294 | 0.370848 | 60.11226 |
| 32 | 16 | 0.540384 | 0 | 2606 | 58.31558 | 0.281534 | 0.369632 | 59.2256 |
| 32 | 32 | 0.53168 | 0 | 2606 | 67.27274 | 0.244048 | 0.371552 | 68.17596 |
| 64 | 4 | 0.472832 | 0 | 9745 | 227.8239 | 1.15097 | 0.3424 | 228.6391 |
| 64 | 8 | 0.545952 | 0 | 9745 | 221.5026 | 1.183817 | 0.381376 | 222.4299 |
| 64 | 16 | 0.54032 | 0 | 9745 | 220.2206 | 1.190708 | 0.378624 | 221.1395 |
| 64 | 32 | 0.53952 | 0 | 9745 | 281.0736 | 0.932917 | 0.378848 | 281.992 |
| 128 | 4 | 0.575232 | 0 | 35073 | 1161.547 | 3.355639 | 0.41488 | 1162.537 |
| 128 | 8 | 0.575168 | 0 | 35073 | 1033.322 | 3.772042 | 0.421888 | 1034.319 |
| 128 | 16 | 0.581408 | 0 | 35073 | 1025.425 | 3.801092 | 0.417024 | 1026.423 |
| 128 | 32 | 0.517376 | 0 | 35073 | 940.3292 | 4.145073 | 0.375488 | 941.222 |
| 256 | 4 | 0.727232 | 0 | 124611 | 6201.993 | 9.073829 | 0.52672 | 6203.247 |
| 256 | 8 | 0.735008 | 0 | 124611 | 4199.054 | 13.40203 | 0.559168 | 4200.348 |
| 256 | 16 | 0.66064 | 0 | 124611 | 3460.476 | 16.26245 | 0.497312 | 3461.634 |
| 256 | 32 | 0.746048 | 0 | 124611 | 4435.073 | 12.68882 | 0.548864 | 4436.368 |
| 512 | 4 | 1.389184 | 0 | 423553 | 75713.64 | 10.18526 | 1.18976 | 75716.22 |
| 512 | 8 | 1.503232 | 0 | 423553 | 45034.16 | 17.12395 | 1.272224 | 45036.94 |
| 512 | 16 | 1.501248 | 0 | 423553 | 27503.4 | 28.03882 | 1.248928 | 27506.15 |
| 512 | 32 | 1.388832 | 0 | 423553 | 33944.7 | 22.71821 | 1.164576 | 33947.26 |
| 640 | 4 | 1.663968 | 0 | 619850 | 169852.2 | 10.39812 | 1.705984 | 169855.6 |
| 640 | 8 | 1.682016 | 0 | 619850 | 85307.44 | 20.70328 | 1.754656 | 85310.88 |
| 640 | 16 | 1.573088 | 0 | 619850 | 71908.79 | 24.56089 | 1.502496 | 71911.86 |
| 640 | 32 | 1.701472 | 0 | 619850 | 98867.45 | 17.86375 | 1.638976 | 98870.8 |

Here is the result for GPU with device memory. The size of the 2D lattice (y) and the number of threads per block (y) are the same as the corresponding x values, so they are omitted:

| Size of the 2D lattice (x) | Number of threads per block (x) | Input time (ms) | Error | Iteration | Processing time (ms) | Gflops | Output time (ms) | Total time (ms) |
|---|---|---|---|---|---|---|---|---|
| 32 | 4 | 0.430368 | 0 | 2606 | 44.80208 | 0.366452 | 0.344288 | 45.57673 |
| 32 | 8 | 0.43776 | 0 | 2606 | 43.6656 | 0.375989 | 0.344096 | 44.44746 |
| 32 | 16 | 0.426784 | 0 | 2606 | 43.69744 | 0.375715 | 0.343456 | 44.46768 |
| 32 | 32 | 0.440608 | 0 | 2606 | 64.35703 | 0.255105 | 0.344 | 65.14163 |
| 64 | 4 | 0.378528 | 0 | 9745 | 225.6186 | 1.16222 | 0.317824 | 226.315 |
| 64 | 8 | 0.43152 | 0 | 9745 | 167.6807 | 1.563796 | 0.35088 | 168.4631 |
| 64 | 16 | 0.439424 | 0 | 9745 | 172.7931 | 1.517529 | 0.349472 | 173.582 |
| 64 | 32 | 0.437696 | 0 | 9745 | 240.7429 | 1.089205 | 0.353376 | 241.534 |
| 128 | 4 | 0.459712 | 0 | 35073 | 967.6324 | 4.028113 | 0.361952 | 968.454 |
| 128 | 8 | 0.456672 | 0 | 35073 | 901.0295 | 4.325866 | 0.370496 | 901.8567 |
| 128 | 16 | 0.46256 | 0 | 35073 | 868.0142 | 4.490402 | 0.37264 | 868.8494 |
| 128 | 32 | 0.41152 | 0 | 35073 | 785.0016 | 4.965254 | 0.3376 | 785.7507 |
| 256 | 4 | 0.60864 | 0 | 124611 | 5542.933 | 10.15272 | 0.444672 | 5543.986 |
| 256 | 8 | 0.613536 | 0 | 124611 | 3489.02 | 16.12941 | 0.436608 | 3490.07 |
| 256 | 16 | 0.54944 | 0 | 124611 | 3227.097 | 17.43853 | 0.406112 | 3228.053 |
| 256 | 32 | 0.63232 | 0 | 124611 | 3891.992 | 14.45939 | 0.454656 | 3893.079 |
| 512 | 4 | 1.337792 | 0 | 423553 | 96573.79 | 7.98522 | 0.872256 | 96576 |
| 512 | 8 | 1.45424 | 0 | 423553 | 64176.78 | 12.01623 | 0.931552 | 64179.16 |
| 512 | 16 | 1.672544 | 0 | 423553 | 59243.65 | 13.0168 | 0.93696 | 59246.26 |
| 512 | 32 | 1.33168 | 0 | 423553 | 66473.89 | 11.60099 | 0.84448 | 66476.06 |
| 640 | 4 | 1.717184 | 0 | 619850 | 235964.7 | 7.48478 | 1.170912 | 235967.5 |
| 640 | 8 | 1.751296 | 0 | 619850 | 112914.5 | 15.64143 | 1.165984 | 112917.4 |
| 640 | 16 | 1.562688 | 0 | 619850 | 109190.2 | 16.17493 | 1.04288 | 109192.8 |
| 640 | 32 | 1.643424 | 0 | 619850 | 148595.7 | 11.88556 | 1.157728 | 148598.5 |

Here is the result for the CPU together with the speedups of the GPU with texture memory and the GPU with device memory. The size of the 2D lattice (y) and the number of threads per block (y) are the same as the corresponding x values, so they are omitted:

| Size of the 2D lattice (x) | Number of threads per block (x) | Error | Iteration | Processing time (ms) | Gflops | Speed up of GPU (Texture) | Speed up of GPU |
|---|---|---|---|---|---|---|---|
| 32 | 4 | 0 | 2606 | 13.28211 | 1.236084 | 0.21645 | 0.291423 |
| 32 | 8 | 0 | 2606 | 12.53434 | 1.309826 | 0.208515 | 0.282003 |
| 32 | 16 | 0 | 2606 | 12.74694 | 1.287979 | 0.215227 | 0.286656 |
| 32 | 32 | 0 | 2606 | 12.75574 | 1.287091 | 0.1871 | 0.195816 |
| 64 | 4 | 0 | 9745 | 123.4744 | 2.123667 | 0.54004 | 0.545586 |
| 64 | 8 | 0 | 9745 | 150.6378 | 1.740721 | 0.677237 | 0.894189 |
| 64 | 16 | 0 | 9745 | 151.4719 | 1.731136 | 0.684961 | 0.872624 |
| 64 | 32 | 0 | 9745 | 151.1162 | 1.735211 | 0.535888 | 0.625652 |
| 128 | 4 | 0 | 35073 | 2185.981 | 1.783059 | 1.880353 | 2.257186 |
| 128 | 8 | 0 | 35073 | 2187.823 | 1.781558 | 2.115231 | 2.42591 |
| 128 | 16 | 0 | 35073 | 2293.807 | 1.699242 | 2.234758 | 2.640051 |
| 128 | 32 | 0 | 35073 | 1798.081 | 2.167718 | 1.910369 | 2.288361 |
| 256 | 4 | 0 | 124611 | 32237.32 | 1.745673 | 5.196846 | 5.814826 |
| 256 | 8 | 0 | 124611 | 32830.42 | 1.714136 | 7.81612 | 9.40681 |
| 256 | 16 | 0 | 124611 | 26357.79 | 2.135074 | 7.614263 | 8.165229 |
| 256 | 32 | 0 | 124611 | 33139.02 | 1.698174 | 7.469854 | 8.512291 |
| 512 | 4 | 0 | 423553 | 362858.6 | 2.125244 | 4.792349 | 3.757233 |
| 512 | 8 | 0 | 423553 | 425090.2 | 1.814116 | 9.4387 | 6.623492 |
| 512 | 16 | 0 | 423553 | 502324.2 | 1.53519 | 18.26225 | 8.47858 |
| 512 | 32 | 0 | 423553 | 360227.9 | 2.140764 | 10.6114 | 5.418911 |
| 640 | 4 | 0 | 619850 | 997622.3 | 1.770353 | 5.873357 | 4.227795 |
| 640 | 8 | 0 | 619850 | 1178118 | 1.499123 | 13.8097 | 10.43345 |
| 640 | 16 | 0 | 619850 | 834376.6 | 2.116722 | 11.60277 | 7.641317 |
| 640 | 32 | 0 | 619850 | 998127 | 1.769458 | 10.09527 | 6.716939 |

Conclusions

This homework contains many experiments with different lattice sizes and numbers of threads per block. From the results, we can draw some insights into the optimal parameters for using a GPU to solve Laplace's equation on a 2-D lattice:
  1. For Laplace's equation on a 2-D lattice, a GPU speedup appears once the lattice size reaches 128×128 or larger.
  2. The benefit of texture memory only appears once the lattice size grows to 512×512 or larger.
  3. For a 512×512 lattice with the boundary condition (ϕ_top, ϕ_left, ϕ_right, ϕ_bottom) = (+1, -1, -2, +5), the best configuration is 16×16 threads per block on the GPU with device memory.

References

  1. 2-D Laplace's equation
  2. Parallel reduction
  3. HTCondor
  4. GeForce GTX 970