Thursday, February 3, 2011

OpenCL simple memory manipulation tests

Investigations for the 2nd homework assignment in the OpenCL-U (2011 iVEC @ UWA summer school). The aim of the assignment was to write an OpenCL kernel that reverses a buffer of bytes. It's a trivial task; however, there are many ways to achieve it in OpenCL, with varying results.

Using an NVIDIA GeForce 9600 GT I was able to make some significant performance gains by optimising the size of the kernel's memory reads. Surprisingly, however, on an NVIDIA GeForce GTS 360M the performance got worse as I optimised the kernel (see the timings below).


The NVIDIA 9600 GT was only able to handle a 16MB memory buffer. To create a 16MB random test file:

dd if=/dev/urandom of=random.bin bs=1M count=16
The "reverse" program source is a modified version of the template Derek provided here. Sample output from a run:

./reverse --input random.bin --timing --repeat 10 --verify
Loading input data from file 'random.bin'
Connecting to NVIDIA Corporation GeForce 9600 GT...
Executing 'ReverseBufferKernel' (global='522240', local='512') for 10 iterations...
Exec Time: 1.49 ms (for 10 iterations)
Verifying results...
Found '16711680' correct results!
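
For completeness, here is a minimal host-side sketch of how such a kernel can be set up and launched with the standard OpenCL 1.0 API. This is not Derek's template or the actual "reverse" program, just an illustration: the buffer is filled with rand() instead of being read from random.bin, error checking is omitted, and the global/local sizes are chosen for the trivial 1-byte-per-work-item kernel.

/* reverse_host.c - minimal host-side sketch (OpenCL 1.0).
 * Build with e.g.: gcc reverse_host.c -lOpenCL -o reverse_host */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static const char *kernel_src =
"__kernel void ReverseBufferKernel(\n"
"    __global const char* input_buffer,\n"
"    __global char* output_buffer )\n"
"{\n"
"    uint i = get_global_id(0);\n"
"    uint t = get_global_size(0);\n"
"    output_buffer[i] = input_buffer[t-i-1];\n"
"}\n";

int main(void)
{
    const size_t n = 16 * 1024 * 1024;   /* 16MB test buffer */
    char *in  = malloc(n);
    char *out = malloc(n);
    size_t i;
    for (i = 0; i < n; i++) in[i] = (char)rand();

    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program program = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "ReverseBufferKernel", &err);

    /* Input is copied to the device; output is read back afterwards */
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, n, in, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n, NULL, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);

    /* One work-item per byte for the trivial kernel; local size 512
     * matches CL_DEVICE_MAX_WORK_GROUP_SIZE on both cards */
    size_t global = n, local = 512;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(queue);

    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n, out, 0, NULL, NULL);

    /* Verify the reversal on the host */
    size_t correct = 0;
    for (i = 0; i < n; i++)
        if (out[i] == in[n - i - 1]) correct++;
    printf("Found '%zu' correct results!\n", correct);

    free(in); free(out);
    return 0;
}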


The OpenCL kernels were developed and tested on the 9600 GT. They are shown below:

Trivial solution, copying 1 char per work-item:
__kernel void ReverseBufferKernel(
    __global const char* input_buffer,
    __global char* output_buffer )
{
    uint i = get_global_id(0);
    uint t = get_global_size(0);
    output_buffer[i] = input_buffer[t-i-1];
}
Exec Time: 26.03 ms 

Each work-item loads a single byte from global memory, calculates its destination index and then writes it. On the 9600 GT this fails to use the full width of the data bus for each memory read/write.

We need to modify the data types to use the full width of the memory data bus. But how many bytes should we try accessing per read/write? GPUs are designed for rapidly calculating pixel values, and the shader units work with RGBA float values, i.e. 4 x 4 = 16 bytes per pixel. Intuitively, a char16 may be the optimal data width to request.

Using char16 vectors, full temporary variable assignments:
__kernel void ReverseBufferKernel(
    __global const char16* input_buffer,
    __global char16* output_buffer )
{
    uint id = get_global_id(0);
    char16 in = input_buffer[id];
    char16 out;
    out.s0 = in.sf;
    out.s1 = in.se;
    out.s2 = in.sd;
    out.s3 = in.sc;
    out.s4 = in.sb;
    out.s5 = in.sa;
    out.s6 = in.s9;
    out.s7 = in.s8;
    out.s8 = in.s7;
    out.s9 = in.s6;
    out.sa = in.s5;
    out.sb = in.s4;
    out.sc = in.s3;
    out.sd = in.s2;
    out.se = in.s1;
    out.sf = in.s0;
    output_buffer[ get_global_size(0) - id - 1 ] = out;
}

Exec Time: 5.50 ms 

That's a 4.7x speed-up, not bad. Let's try 16-byte vectors again, but this time using a swizzle and no explicit temporary variables.

Using char16 vectors, swizzle no temp vars:
__kernel void ReverseBufferKernel(
    __global const char16* input_buffer,
    __global char16* output_buffer )
{
    output_buffer[ get_global_size(0) - get_global_id(0) - 1 ] =
        input_buffer[ get_global_id(0) ].sfedcba9876543210;
}
Exec Time: 5.44 ms 

A 1.1% speed improvement using the swizzle instead of multiple assignment steps. However this may not be optimal, as 2.9GB/sec of data throughput represents only 7% of the GPU's memory bandwidth (see below). Next, let's increase the data request size using uint16s (4 x 16 = 64-byte reads).

Using uint16 vectors, type casting:
__kernel void ReverseBufferKernel(
    __global const uint16* input_buffer,
    __global uint16* output_buffer )
{
    uint16 in = input_buffer[ get_global_id(0) ];
    uint16 out;
    out.s0 = as_uint(as_char4(in.sf).s3210);
    out.s1 = as_uint(as_char4(in.se).s3210);
    out.s2 = as_uint(as_char4(in.sd).s3210);
    out.s3 = as_uint(as_char4(in.sc).s3210);
    out.s4 = as_uint(as_char4(in.sb).s3210);
    out.s5 = as_uint(as_char4(in.sa).s3210);
    out.s6 = as_uint(as_char4(in.s9).s3210);
    out.s7 = as_uint(as_char4(in.s8).s3210);
    out.s8 = as_uint(as_char4(in.s7).s3210);
    out.s9 = as_uint(as_char4(in.s6).s3210);
    out.sa = as_uint(as_char4(in.s5).s3210);
    out.sb = as_uint(as_char4(in.s4).s3210);
    out.sc = as_uint(as_char4(in.s3).s3210);
    out.sd = as_uint(as_char4(in.s2).s3210);
    out.se = as_uint(as_char4(in.s1).s3210);
    out.sf = as_uint(as_char4(in.s0).s3210);
    output_buffer[ get_global_size(0) - get_global_id(0) - 1 ] = out;
}
Exec Time: 1.48 ms

By type-casting each uint to a char4 vector, swizzling the chars, and then casting back to uints that are placed in reverse order into the output uint16 vector, 64 bytes can be copied in each work-item. This is a 3.7x speed improvement over the simple char16 vector case, which was a surprising result.

While trying to increase the vector size beyond 64 bytes, the maximum work-group size started dropping below 512. Perhaps this was due to limits in the size of the register file?
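
One way to check this (not something I did at the time, just a sketch using the standard OpenCL 1.0 query) is to ask the runtime for the per-kernel work-group limit after the program is built; heavier per-work-item resource usage shows up as a smaller CL_KERNEL_WORK_GROUP_SIZE:

/* Sketch: query how large a work-group this kernel can actually run with
 * on a given device (assumes 'kernel' and 'device' were created as usual). */
#include <CL/cl.h>
#include <stdio.h>

void print_kernel_limits(cl_kernel kernel, cl_device_id device)
{
    size_t   max_wg    = 0;
    cl_ulong local_mem = 0;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);

    /* If max_wg drops below the device's CL_DEVICE_MAX_WORK_GROUP_SIZE
     * (512 here), per-work-item resources such as registers are the limit. */
    printf("CL_KERNEL_WORK_GROUP_SIZE: %zu\n", max_wg);
    printf("CL_KERNEL_LOCAL_MEM_SIZE:  %llu bytes\n", (unsigned long long)local_mem);
}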

Summary: 

Processing large 64-byte vectors provided a 17.6x speed-up over the trivial solution. It also copies/reverses data at 10.5GB/sec, 27% of the 9600 GT's memory bandwidth.

Not bad. However, when testing the same kernels on the GTS 360M, the simple 1-char-per-work-item solution was the fastest! At 5.37GB/sec the simplest solution used 20.3% of the available memory bandwidth, not far from the best result on the 9600 GT. The vector optimisations made the 360M slower, an unexpected result.

Update: the GTS 360M reports NVIDIA's CL_DEVICE_COMPUTE_CAPABILITY_NV as 1.2. On such devices the memory controller coalesces the many small per-work-item memory transactions of a half-warp into wider ones, as described in section G.3.2 here, which explains why the trivial kernel already makes good use of the memory bus (thanks Derek).
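
To check which devices behave this way, the compute capability can be queried directly via NVIDIA's cl_nv_device_attribute_query extension. A sketch, assuming the CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV/MINOR_NV tokens from the SDK's cl_ext.h are available:

/* Sketch: query NVIDIA compute capability (requires the
 * cl_nv_device_attribute_query extension and CL/cl_ext.h). */
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <stdio.h>

void print_compute_capability(cl_device_id device)
{
    cl_uint major = 0, minor = 0;
    clGetDeviceInfo(device, CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV,
                    sizeof(major), &major, NULL);
    clGetDeviceInfo(device, CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV,
                    sizeof(minor), &minor, NULL);
    /* 1.2 or higher: half-warp byte accesses coalesce into one transaction */
    printf("CL_DEVICE_COMPUTE_CAPABILITY_NV: %u.%u\n", major, minor);
}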

Hardware specs for the NVIDIA GT 9600:

~/NVIDIA_GPU_Computing_SDK/OpenCL/bin/linux/release/oclDeviceQuery
CL_DEVICE_NAME: GeForce 9600 GT
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 260.19.06
CL_DEVICE_VERSION: OpenCL 1.0 CUDA
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 8
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1680 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 511 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte

CL_DEVICE_COMPUTE_CAPABILITY_NV: 1.1
NUMBER OF MULTIPROCESSORS: 8
NUMBER OF CUDA CORES: 64
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 8192
CL_DEVICE_WARP_SIZE_NV: 32


~/NVIDIA_SDK/OpenCL/bin/linux/release/oclBandwidthTest

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2272.9

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1638.4

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 39011.3



Hardware specs for the NVIDIA GTS 360M:

~/NVIDIA_GPU_Computing_SDK/OpenCL/bin/linux/release/oclDeviceQuery
CL_DEVICE_NAME: GeForce GTS 360M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 260.19.06
CL_DEVICE_VERSION: OpenCL 1.0 CUDA
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 12
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1323 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 255 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1023 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte

NUMBER OF MULTIPROCESSORS: 12
NUMBER OF CUDA CORES: 96
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 16384
CL_DEVICE_WARP_SIZE_NV: 32

~/NVIDIA_GPU_Computing_SDK/OpenCL/bin/linux/release/oclBandwidthTest

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4666.4

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4540.2

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 26466.2
