Page 1. OpenCL. Parallel Computing on the GPU and. CPU. Aaftab Munshi.
Page 2. Beyond Programmable Shading: Fundamentals. •Today's processors are
...
OpenCL Parallel Computing on the GPU and CPU
Aaftab Munshi
Opportunity: Processor •Today’s processors are increasingly parallel •CPUs ■
Multiple cores are driving performance increases
•GPUs ■
Transforming into general purpose data-parallel computational coprocessors
■
Improving numerical precision (single and double)
Beyond Programmable Shading: Fundamentals
Challenge: Processor Parallelism •Writing parallel programs different for the CPU and GPU ■
Differing domain-specific techniques
■
Vendor-specific technologies
•Graphics API is not an ideal abstraction for general purpose compute
Beyond Programmable Shading: Fundamentals
Introducing OpenCL •OpenCL – Open Computing Language •Approachable language for accessing heterogeneous computational resources
•Supports parallel execution on single or multiple processors ■
GPU, CPU, GPU + CPU or multiple GPUs
•Desktop and Handheld Profiles •Designed to work with graphics APIs such as OpenGL
Beyond Programmable Shading: Fundamentals
OpenCL = Open Standard •Specification under review ■
Royalty free, cross-platform, vendor neutral
■
Khronos OpenCL working group (www.khronos.org)
•Based on a proposal by Apple ■
Developed in collaboration with industry leaders
■
Performance-enhancing technology in Mac OS X Snow Leopard
Beyond Programmable Shading: Fundamentals
OpenCL Working Group Members Broad Industry Support
© Copyright Khronos Group, 2008 - Page
Beyond Programmable Shading: Fundamentals
OpenCL
—
A Sneak Preview
Design Goals of OpenCL •Use all computational resources in system ■
GPUs and CPUs as peers
■
Data- and task- parallel compute model
•Efficient parallel programming model ■
Based on C
■
Abstract the specifics of underlying hardware
•Specify accuracy of floating-point computations ■
IEEE 754 compliant rounding behavior
■
Define maximum allowable error of math functions
•Drive future hardware requirements
Beyond Programmable Shading: Fundamentals
OpenCL Software Stack •Platform Layer ■
query and select compute devices in the system
■
initialize a compute device(s)
■
create compute contexts and work-queues
•Runtime ■
resource management
■
execute compute kernels
•Compiler ■
A subset of ISO C99 with appropriate language additions
■
Compile and build compute program executables ■
online or offline
Beyond Programmable Shading: Fundamentals
OpenCL Execution Model •Compute Kernel ■
Basic unit of executable code — similar to a C function
■
Data-parallel or task-parallel
•Compute Program ■
Collection of compute kernels and internal functions
■
Analogous to a dynamic library
•Applications queue compute kernel execution instances ■
Queued in-order
■
Executed in-order or out-of-order
■
Events are used to implement appropriate
Beyond Programmable Shading: Fundamentals
OpenCL Data-Parallel Execution •Define N-Dimensional computation domain ■
Each independent element of execution in N-D domain is called a work-item
■
The N-D domain defines the total number of workitems that execute in parallel — global work size.
•Work-items can be grouped together — work-group ■
Work-items in group can communicate with each other
■
Can synchronize execution among work-items in group to coordinate memory access
•Execute multiple work-groups in parallel •Mapping of global work size to work-groups Beyond Programmable Shading: Fundamentals
OpenCL Task-Parallel Execution •Data-parallel execution model must be implemented by all OpenCL compute devices
•Some compute devices such as CPUs can also execute task-parallel compute kernels ■
Executes as a single work-item
■
A compute kernel written in OpenCL
■
A native C / C++ function
Beyond Programmable Shading: Fundamentals
OpenCL Memory Model •Implements a relaxed
consistency, shared memory model
•Multiple distinct address spaces ■
Address spaces can be collapsed
Beyond Programmable Shading: Fundamentals
OpenCL Memory Model •Implements a relaxed
consistency, shared memory model
Private Memory
•Multiple distinct address spaces
WorkItem 1
Private Memory
Private Memory
Private Memory
WorkItem M
WorkItem 1
WorkItem M
■
Compute Unit 1 Address spaces can be collapsed
■
Address Qualifiers ■
__private
Beyond Programmable Shading: Fundamentals
Compute Unit N
OpenCL Memory Model •Implements a relaxed
consistency, shared memory model
Private Memory
•Multiple distinct address spaces
WorkItem 1
Private Memory
Private Memory
Private Memory
WorkItem M
WorkItem 1
WorkItem M
■
Compute Unit 1 Address spaces can be collapsed
■
Address Qualifiers ■ ■
__private __local
Beyond Programmable Shading: Fundamentals
Local Memory
Compute Unit N
Local Memory
OpenCL Memory Model •Implements a relaxed
consistency, shared memory model
Private Memory
•Multiple distinct address spaces
WorkItem 1
Private Memory
Private Memory
Private Memory
WorkItem M
WorkItem 1
WorkItem M
■
Compute Unit 1 Address spaces can be collapsed
■
Address Qualifiers ■ ■ ■
__private __local
Compute Unit N
Local Memory
Local Memory
Global / Constant Memory Data Cache Compute Device
__constant and __global ■
Example: ■
__global float4 *p;
Beyond Programmable Shading: Fundamentals
Global Memory Compute Device Memory
Language for writing compute •Derived from ISO C99 •A few restrictions ■
Recursion, function pointers, functions in C99 standard headers ...
•Preprocessing directives defined by C99 are supported
•Built-in Data Types ■
Scalar and vector data types
■
Structs, Pointers
■
Data-type conversion functions ■
■
convert_type
Image types
Beyond Programmable Shading: Fundamentals
Language for writing compute
Beyond Programmable Shading: Fundamentals
Language for writing compute •Built-in Functions — Required ■
work-item functions
■
math.h
■
read and write image
■
relational
■
geometric functions
■
synchronization functions
Beyond Programmable Shading: Fundamentals
Language for writing compute •Built-in Functions — Required ■
work-item functions
■
math.h
■
read and write image
■
relational
■
geometric functions
■
synchronization functions
•Built-in Functions — Optional ■
double precision
■
atomics to global and local memory
■
selection of rounding mode
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context,
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA); memobjs[1] = clCreateBuffer(context,
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA); memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA); memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries, NULL);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create the compute program program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL); // build the compute program executable clBuildProgramExecutable(program, false, NULL, NULL); // create the compute kernel kernel = clCreateKernel(program, “fft1D_1024”);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API // create N-D range object with work-item dimensions global_work_size[0] = n; local_work_size[0] = 64; range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size); // set the args values clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL); clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL); clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL); clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL); // execute kernel clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Compute // This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into // calls to a radix 16 function, another radix 16 function and then a radix 4 function // Based on "Fitting FFT onto G80 Architecture". Vasily Volkov & Brian Kazian, UC Berkeley CS258 project report, May 2008
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out, __local float *sMemx, __local float *sMemy) { int tid = get_local_id(0); int blockIdx = get_group_id(0) * 1024 + tid; float2 data[16]; // starting index of data to/from global memory in = in + blockIdx; out = out + blockIdx; globalLoads(data, in, 64); // coalesced global reads fftRadix16Pass(data); // in-place radix-16 pass twiddleFactorMul(data, tid, 1024, 0); // local shuffle using local memory localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4))); fftRadix16Pass(data); // in-place radix-16 pass twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15))); // four radix-4 function calls fftRadix4Pass(data); fftRadix4Pass(data + 4); fftRadix4Pass(data + 8); fftRadix4Pass(data + 12); // coalesced global writes globalStores(data, out, 64); }
Beyond Programmable Shading: Fundamentals
OpenCL and OpenGL •Sharing OpenGL Resources ■
OpenCL is designed to efficiently share with OpenGL ■
Textures, Buffer Objects and Renderbuffers
■
Data is shared, not copied
•Efficient queuing of OpenCL and OpenGL commands •Apps can select compute device(s) that will run OpenGL and OpenCL
Beyond Programmable Shading: Fundamentals
Summary •A new compute language that works across GPUs and CPUs ■
C99 with extensions
■
Familiar to developers
■
Includes a rich set of built-in functions
■
Makes it easy to develop data- and task- parallel compute programs
•Defines hardware and numerical precision requirements
•Open standard for heterogeneous parallel computing
Beyond Programmable Shading: Fundamentals