OpenCL - Beyond Programmable Shading

38 downloads 72 Views 1MB Size Report
Page 1. OpenCL. Parallel Computing on the GPU and. CPU. Aaftab Munshi. Page 2. Beyond Programmable Shading: Fundamentals. •Today's processors are  ...
OpenCL Parallel Computing on the GPU and CPU

Aaftab Munshi

Opportunity: Processor •Today’s processors are increasingly parallel •CPUs ■

Multiple cores are driving performance increases

•GPUs ■

Transforming into general purpose data-parallel computational coprocessors



Improving numerical precision (single and double)

Beyond Programmable Shading: Fundamentals

Challenge: Processor Parallelism •Writing parallel programs different for the CPU and GPU ■

Differing domain-specific techniques



Vendor-specific technologies

•Graphics API is not an ideal abstraction for general purpose compute

Beyond Programmable Shading: Fundamentals

Introducing OpenCL •OpenCL – Open Computing Language •Approachable language for accessing heterogeneous computational resources

•Supports parallel execution on single or multiple processors ■

GPU, CPU, GPU + CPU or multiple GPUs

•Desktop and Handheld Profiles •Designed to work with graphics APIs such as OpenGL

Beyond Programmable Shading: Fundamentals

OpenCL = Open Standard •Specification under review ■

Royalty free, cross-platform, vendor neutral



Khronos OpenCL working group (www.khronos.org)

•Based on a proposal by Apple ■

Developed in collaboration with industry leaders



Performance-enhancing technology in Mac OS X Snow Leopard

Beyond Programmable Shading: Fundamentals

OpenCL Working Group Members Broad Industry Support

© Copyright Khronos Group, 2008 - Page

Beyond Programmable Shading: Fundamentals

OpenCL



A Sneak Preview

Design Goals of OpenCL •Use all computational resources in system ■

GPUs and CPUs as peers



Data- and task- parallel compute model

•Efficient parallel programming model ■

Based on C



Abstract the specifics of underlying hardware

•Specify accuracy of floating-point computations ■

IEEE 754 compliant rounding behavior



Define maximum allowable error of math functions

•Drive future hardware requirements

Beyond Programmable Shading: Fundamentals

OpenCL Software Stack •Platform Layer ■

query and select compute devices in the system



initialize a compute device(s)



create compute contexts and work-queues

•Runtime ■

resource management



execute compute kernels

•Compiler ■

A subset of ISO C99 with appropriate language additions



Compile and build compute program executables ■

online or offline

Beyond Programmable Shading: Fundamentals

OpenCL Execution Model •Compute Kernel ■

Basic unit of executable code — similar to a C function



Data-parallel or task-parallel

•Compute Program ■

Collection of compute kernels and internal functions



Analogous to a dynamic library

•Applications queue compute kernel execution instances ■

Queued in-order



Executed in-order or out-of-order



Events are used to implement appropriate

Beyond Programmable Shading: Fundamentals

OpenCL Data-Parallel Execution •Define N-Dimensional computation domain ■

Each independent element of execution in N-D domain is called a work-item



The N-D domain defines the total number of workitems that execute in parallel — global work size.

•Work-items can be grouped together — work-group ■

Work-items in group can communicate with each other



Can synchronize execution among work-items in group to coordinate memory access

•Execute multiple work-groups in parallel •Mapping of global work size to work-groups Beyond Programmable Shading: Fundamentals

OpenCL Task-Parallel Execution •Data-parallel execution model must be implemented by all OpenCL compute devices

•Some compute devices such as CPUs can also execute task-parallel compute kernels ■

Executes as a single work-item



A compute kernel written in OpenCL



A native C / C++ function

Beyond Programmable Shading: Fundamentals

OpenCL Memory Model •Implements a relaxed

consistency, shared memory model

•Multiple distinct address spaces ■

Address spaces can be collapsed

Beyond Programmable Shading: Fundamentals

OpenCL Memory Model •Implements a relaxed

consistency, shared memory model

Private Memory

•Multiple distinct address spaces

WorkItem 1

Private Memory

Private Memory

Private Memory

WorkItem M

WorkItem 1

WorkItem M



Compute Unit 1 Address spaces can be collapsed



Address Qualifiers ■

__private

Beyond Programmable Shading: Fundamentals

Compute Unit N

OpenCL Memory Model •Implements a relaxed

consistency, shared memory model

Private Memory

•Multiple distinct address spaces

WorkItem 1

Private Memory

Private Memory

Private Memory

WorkItem M

WorkItem 1

WorkItem M



Compute Unit 1 Address spaces can be collapsed



Address Qualifiers ■ ■

__private __local

Beyond Programmable Shading: Fundamentals

Local Memory

Compute Unit N

Local Memory

OpenCL Memory Model •Implements a relaxed

consistency, shared memory model

Private Memory

•Multiple distinct address spaces

WorkItem 1

Private Memory

Private Memory

Private Memory

WorkItem M

WorkItem 1

WorkItem M



Compute Unit 1 Address spaces can be collapsed



Address Qualifiers ■ ■ ■

__private __local

Compute Unit N

Local Memory

Local Memory

Global / Constant Memory Data Cache Compute Device

__constant and __global ■

Example: ■

__global float4 *p;

Beyond Programmable Shading: Fundamentals

Global Memory Compute Device Memory

Language for writing compute •Derived from ISO C99 •A few restrictions ■

Recursion, function pointers, functions in C99 standard headers ...

•Preprocessing directives defined by C99 are supported

•Built-in Data Types ■

Scalar and vector data types



Structs, Pointers



Data-type conversion functions ■



convert_type

Image types

Beyond Programmable Shading: Fundamentals

Language for writing compute

Beyond Programmable Shading: Fundamentals

Language for writing compute •Built-in Functions — Required ■

work-item functions



math.h



read and write image



relational



geometric functions



synchronization functions

Beyond Programmable Shading: Fundamentals

Language for writing compute •Built-in Functions — Required ■

work-item functions



math.h



read and write image



relational



geometric functions



synchronization functions

•Built-in Functions — Optional ■

double precision



atomics to global and local memory



selection of rounding mode

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0);

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context,

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA);

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA); memobjs[1] = clCreateBuffer(context,

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA); memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create a compute context with GPU device context = clCreateContextFromType(CL_DEVICE_TYPE_GPU); // create a work-queue queue = clCreateWorkQueue(context, NULL, NULL, 0); // allocate the buffer memory objects memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA); memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries, NULL);

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create the compute program program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL); // build the compute program executable clBuildProgramExecutable(program, false, NULL, NULL); // create the compute kernel kernel = clCreateKernel(program, “fft1D_1024”);

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Host API // create N-D range object with work-item dimensions global_work_size[0] = n; local_work_size[0] = 64; range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size); // set the args values clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL); clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL); clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL); clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL); // execute kernel clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);

Beyond Programmable Shading: Fundamentals

OpenCL FFT Example - Compute // This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into // calls to a radix 16 function, another radix 16 function and then a radix 4 function // Based on "Fitting FFT onto G80 Architecture". Vasily Volkov & Brian Kazian, UC Berkeley CS258 project report, May 2008

__kernel void fft1D_1024 (__global float2 *in, __global float2 *out, __local float *sMemx, __local float *sMemy) { int tid = get_local_id(0); int blockIdx = get_group_id(0) * 1024 + tid; float2 data[16]; // starting index of data to/from global memory in = in + blockIdx; out = out + blockIdx; globalLoads(data, in, 64); // coalesced global reads fftRadix16Pass(data); // in-place radix-16 pass twiddleFactorMul(data, tid, 1024, 0); // local shuffle using local memory localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4))); fftRadix16Pass(data); // in-place radix-16 pass twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15))); // four radix-4 function calls fftRadix4Pass(data); fftRadix4Pass(data + 4); fftRadix4Pass(data + 8); fftRadix4Pass(data + 12); // coalesced global writes globalStores(data, out, 64); }

Beyond Programmable Shading: Fundamentals

OpenCL and OpenGL •Sharing OpenGL Resources ■

OpenCL is designed to efficiently share with OpenGL ■

Textures, Buffer Objects and Renderbuffers



Data is shared, not copied

•Efficient queuing of OpenCL and OpenGL commands •Apps can select compute device(s) that will run OpenGL and OpenCL

Beyond Programmable Shading: Fundamentals

Summary •A new compute language that works across GPUs and CPUs ■

C99 with extensions



Familiar to developers



Includes a rich set of built-in functions



Makes it easy to develop data- and task- parallel compute programs

•Defines hardware and numerical precision requirements

•Open standard for heterogeneous parallel computing

Beyond Programmable Shading: Fundamentals