Slide #1.

Using the CUDA Programming Model: Leveraging GPUs for Application Acceleration
Dan Ernst, Brandon Holt
University of Wisconsin – Eau Claire


Slide #2.

What is (Historical) GPGPU?
- General Purpose computation using the GPU and graphics API in applications other than 3D graphics
  - GPU accelerates the critical path of the application
- Data parallel algorithms leverage GPU attributes
  - Large data arrays, streaming throughput
  - Fine-grain SIMD parallelism
  - Low-latency floating point (FP) computation
- Applications – see GPGPU.org
  - Game effects (FX) physics, image processing
  - Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting


Slide #3.

Why GPGPU Processing?
- A quiet revolution
  - Calculation: TFLOPS vs. 100 GFLOPS
  - Memory bandwidth: ~10x
  - GPU in every PC – massive volume and potential impact


Slide #4.

Intel P4 Northwood


Slide #5.

NVIDIA GT200


Slide #6.

NVIDIA GT200


Slide #7.

GeForce 8800 (2007)

[Block diagram: Host, Input Assembler, and Thread Execution Manager feeding an array of processors, each with a Parallel Data Cache and Texture unit, plus Load/Store units and Global Memory]


Slide #8.

G80 Characteristics
- 367 GFLOPS peak performance (25-50 times that of contemporary high-end microprocessors)
- 265 GFLOPS sustained for apps such as VMD
- Massively parallel: 128 cores, 90 W
- Massively threaded: sustains 1000s of threads per app
- 30-100x speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publicly until I triple check those numbers."
– John Stone, VMD group, Physics, UIUC


Slide #9.

Fermi (earlier this year)
- ~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
- 140+ GB/s DRAM bandwidth

[Photo: ASCI Red – Sandia National Labs – 1997]


Slide #10.

NVIDIA Tesla C2050 Card Specs
- 448 GPU cores
- 1.15 GHz
- Single precision floating point performance: 1030.4 GFLOPs (2 single precision flops per clock per core)
- Double precision floating point performance: 515.2 GFLOPs (1 double precision flop per clock per core)
- Internal RAM: 3 GB GDDR5
- Internal RAM speed: 144 GB/sec (compared to 21-25 GB/sec for regular RAM)
- Has to be plugged into a PCIe slot (at most 8 GB/sec)
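As a quick check of the quoted peak numbers (not on the slide), peak throughput is simply cores x clock x flops per clock per core:

  // Back-of-the-envelope check, in host C code:
  double sp_peak = 448 * 1.15 * 2;   // = 1030.4 GFLOPs single precision
  double dp_peak = 448 * 1.15 * 1;   // =  515.2 GFLOPs double precision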


Slide #11.

NVIDIA Tesla S2050 Server Specs
- 4 C2050 cards inside a 1U server (looks like a Sooner node)
- 1.15 GHz
- Single precision (SP) floating point performance: 4121.6 GFLOPs
- Double precision (DP) floating point performance: 2060.8 GFLOPs
- Internal RAM: 12 GB total (3 GB per GPU card)
- Internal RAM speed: 576 GB/sec aggregate
- Has to be plugged into two PCIe slots (at most 16 GB/sec)


Slide #12.

Compare x86 vs S2050
Let's compare the best dual-socket x86 server today vs. the S2050.

                                 Dual socket, AMD 2.3 GHz 12-core   NVIDIA Tesla S2050
Peak DP FLOPs                    220.8 GFLOPs DP                    2060.8 GFLOPs DP (9.3x)
Peak SP FLOPs                    441.6 GFLOPs SP                    4121.6 GFLOPs SP (9.3x)
Peak RAM BW                      25 GB/sec                          576 GB/sec (23x)
Peak PCIe BW                     N/A                                16 GB/sec
Needs x86 server to attach to?   No                                 Yes
Power/Heat                       ~450 W                             ~900 W + ~400 W (~2.9x)
Code portable?                   Yes                                No (CUDA); Yes (PGI, OpenCL)


Slide #13.

Compare x86 vs S2050
Here are some interesting measures:

                     Dual socket, AMD 2.3 GHz 12-core   NVIDIA Tesla S2050
DP GFLOPs/Watt       ~0.5 GFLOPs/Watt                   ~1.6 GFLOPs/Watt (~3x)
SP GFLOPs/Watt       ~1 GFLOPs/Watt                     ~3.2 GFLOPs/Watt (~3x)
DP GFLOPs/sq ft      ~590 GFLOPs/sq ft                  ~2750 GFLOPs/sq ft (4.7x)
SP GFLOPs/sq ft      ~1180 GFLOPs/sq ft                 ~5500 GFLOPs/sq ft (4.7x)
Racks per PFLOP DP   142 racks/PFLOP DP                 32 racks/PFLOP DP (23%)
Racks per PFLOP SP   71 racks/PFLOP SP                  16 racks/PFLOP SP (23%)

OU's Sooner is 34.5 TFLOPs DP, which is just over 1 rack of S2050.


Slide #14.

Previous GPGPU Constraints
- Dealing with the graphics API
  - Working with the corner cases of the graphics API
- Essentially: re-write the entire program as a collection of shaders and polygons

[Diagram: per-thread, per-shader, per-context fragment program pipeline – Input Registers, Texture, Constants, Temp Registers, Output Registers, FB Memory]


Slide #15.

CUDA
- "Compute Unified Device Architecture"
- General purpose programming model
  - User kicks off batches of threads on the GPU
  - GPU = dedicated super-threaded, massively data parallel co-processor
- Targeted software stack
  - Compute oriented drivers, language, and tools
  - Driver for loading computation programs onto the GPU


Slide #16.

Parallel Computing on a GPU
- 400-series GPUs deliver 450 to 1,400+ GFLOPS on compiled parallel C applications
  - Available in laptops, desktops, and clusters
- GPU parallelism is doubling every year
- Programming model scales transparently
- Programmable in C with CUDA tools
- Multithreaded SPMD model uses application data parallelism and thread parallelism

[Photos: GeForce GTX 460, Tesla S1070, Tesla M2050]


Slide #17.

Overview
- CUDA programming model
  - Basic concepts and data types
- CUDA application programming interface (API) basics
- A couple of simple examples
- Performance features will be covered this afternoon


Slide #18.

CUDA Devices and Threads
- A CUDA compute device
  - Is a coprocessor to the CPU, or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
  - Is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels which run on many threads
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight: very little creation overhead
  - GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few (and is hurt by having too many)


Slide #19.

CUDA – C with a Co-processor
- One program, two devices
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device kernel C code

Serial Code (host)
Parallel Kernel (device):   KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device):   KernelB<<< nBlk, nTid >>>(args);
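A minimal sketch (not from the slides) of what this interleaving looks like in a single source file; KernelA, nBlk, and nTid are the placeholder names from the diagram, and the kernel body is an assumed example:

#include <cuda_runtime.h>

// Highly parallel part: runs on the device, one thread per element.
__global__ void KernelA(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;              // example per-element work
}

int main() {
    const int n = 1024, nTid = 256, nBlk = (n + nTid - 1) / nTid;
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    // ... serial host code (setup, I/O) ...

    KernelA<<<nBlk, nTid>>>(d_data, n);      // parallel kernel on the device
    cudaDeviceSynchronize();                 // wait before using the results

    // ... more serial host code, possibly another kernel launch ...

    cudaFree(d_data);
    return 0;
}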


Slide #20.

Extended C


Slide #21.

Buzzword: Kernel
- In CUDA, a kernel is code (typically a function) that can be run inside the GPU.
- The kernel code runs on the many stream processors in the GPU in parallel.
  - Each processor runs the code over different data (SPMD)


Slide #22.

Buzzword: Thread
- In CUDA, a thread is an execution of a kernel with a given index. (Think: MPI process ID)
  - Each thread uses its index to access a specific subset of the data, such that the collection of all threads cooperatively processes the entire data set.
- These are very much like threads in OpenMP
  - They even have shared and private variables.
- So what's the difference with CUDA? Threads are free.

threadID: 0 1 2 3 4 5 6 7
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
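A minimal sketch (not on the slide) that wraps the fragment above in a complete kernel; "func", "input", and "output" are placeholder names, and here threadID comes from the built-in thread index within a single block:

__device__ float func(float x) { return x * x; }   // stand-in for any per-element work

__global__ void apply_func(const float *input, float *output) {
    int threadID = threadIdx.x;      // index 0 .. blockDim.x-1 within this block
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}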


Slide #23.

Buzzword: Block
- In CUDA, a block is a group of threads.
  - Blocks are used to organize threads into manageable chunks.
    - Can organize threads in 1D, 2D, or 3D arrangements
    - What best matches your data?
    - Some restrictions, based on hardware
  - Threads within a block can do a bit of synchronization, if necessary.
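A minimal sketch (assumed example, not from the slide) of that "bit of synchronization": threads in one block cooperate through shared memory and wait at a __syncthreads() barrier. It would be launched as reverse_block<<<1, 256>>>(d, 256):

__global__ void reverse_block(float *d, int n) {
    __shared__ float s[256];     // scratch space shared by all threads in the block
    int t = threadIdx.x;
    s[t] = d[t];                 // each thread loads one element
    __syncthreads();             // wait until the whole block has finished loading
    d[t] = s[n - 1 - t];         // then each thread reads another thread's element
}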


Slide #24.

Buzzword: Grid
- In CUDA, a grid is a group of blocks.
  - Grids are used to organize blocks into manageable chunks.
    - Can organize blocks in 1D or 2D arrangements
    - What best matches your data?
  - No synchronization at all between the blocks.
- A grid is the set of threads created by a call to a CUDA kernel.
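A minimal sketch (not from the slides) of choosing grid and block dimensions in host code; n, width, and height are assumed sizes of the application's data, and myKernel is a hypothetical kernel:

// 1D data: enough 256-thread blocks to cover n elements
int  n = 1000000;
dim3 block1D(256);
dim3 grid1D((n + block1D.x - 1) / block1D.x);

// 2D data: a 2D grid of 16x16 blocks covering a width x height image
int  width = 1920, height = 1080;
dim3 block2D(16, 16);
dim3 grid2D((width  + block2D.x - 1) / block2D.x,
            (height + block2D.y - 1) / block2D.y);

// myKernel<<<grid2D, block2D>>>(...);   // one kernel call creates the whole grid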


Slide #25.

Mapping Buzzwords to GPU Hardware
- Grids map to GPUs
- Blocks map to the MultiProcessors (MPs)
  - Blocks are never split across MPs, but an MP can have multiple blocks
- Threads map to Stream Processors (SPs)
- Warps are groups of (32) threads that execute simultaneously
  - Completely forget about these until later

Image source: NVIDIA CUDA Programming Guide


Slide #26.

Transparent Scalability
- Hardware is free to assign blocks to any SM (processor)
- A kernel scales across any number of parallel processors
- Each block can execute in any order relative to other blocks.

[Diagram: the same 8-block kernel grid (Block 0 – Block 7) running on one device two blocks at a time and on another device four blocks at a time]


Slide #27.

Block IDs and Thread IDs
- Each thread uses IDs to decide what data to work on
  - blockIdx: 1D or 2D
  - threadIdx: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - …
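A minimal sketch (assumed example, not from the slide) of the usual 2D addressing pattern for image processing, one thread per pixel of a width x height image:

__global__ void invert(unsigned char *pixels, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x position in the image
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y position in the image
    if (col < width && row < height)                   // guard the ragged edges
        pixels[row * width + col] = 255 - pixels[row * width + col];
}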


Slide #28.

CUDA Memory Model Overview
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long latency access
- We will focus on global memory for now
  - Other memories will come later
- Note: This is not hardware!

[Diagram: Host connected to a grid of blocks; each block has Shared Memory and per-thread Registers; all blocks share Global Memory]


Slide #29.

CUDA Device Memory Allocation
- cudaMalloc()
  - Allocates an object in device global memory
  - Requires two parameters
    - Address of a pointer to the allocated object
    - Size of the allocated object
- cudaFree()
  - Frees an object from device global memory
    - Pointer to freed object


Slide #30.

CUDA Device Memory Allocation (cont.)
- Code example:
  - Allocate a 64 * 64 single precision float array
  - Attach the allocated storage to a pointer named Md
  - "d" is often used in naming to indicate a device data structure

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
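A hedged addition (not on the slide): cudaMalloc returns a cudaError_t, so real code typically checks it before using Md. Continuing the example above:

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}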


Slide #31.

The Physical Reality Behind CUDA

[Diagram: CPU (host) connected to a GPU w/ local DRAM (device)]


Slide #32.

CUDA Host-Device Data Transfer
- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type of transfer
      - Host to Host
      - Host to Device
      - Device to Host
      - Device to Device
- Asynchronous transfer

[Diagram: Host connected to a grid of blocks; each block has Shared Memory and per-thread Registers; all blocks share Global Memory]


Slide #33.

CUDA Host-Device Data Transfer (cont.)
- Code example:
  - Transfer a 64 * 64 single precision float array
  - M is in host memory and Md is in device memory
  - cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);


Slide #34.

CUDA Kernel Template

In C:
  void foo(int a, float b)
  {
    // slow code goes here
  }

In CUDA:
  __global__ void foo(int a, float b)
  {
    // fast code goes here!
  }


Slide #35.

Calling a Kernel Function
- A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);

dim3 DimGrid(100, 50);    // 5000 thread blocks
dim3 DimBlock(4, 8, 8);   // 256 threads per block

KernelFunc(...);          // invoke a function


Slide #36.

Calling a Kernel Function
- A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);

dim3 DimGrid(100, 50);    // 5000 thread blocks   <- declare the dimensions for grid/blocks
dim3 DimBlock(4, 8, 8);   // 256 threads per block

KernelFunc(...);          // invoke a function


Slide #37.

Calling a Kernel Function
- A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);

dim3 DimGrid(100, 50);    // 5000 thread blocks   <- declare the dimensions for grid/blocks
dim3 DimBlock(4, 8, 8);   // 256 threads per block

KernelFunc<<<DimGrid, DimBlock>>>(...);   // invoke a kernel

- Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
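A hedged sketch (not on the slide) of blocking the host on that asynchronous launch; d_arg stands in for whatever arguments the kernel takes:

KernelFunc<<<DimGrid, DimBlock>>>(d_arg);   // returns to the host immediately
cudaDeviceSynchronize();                    // host blocks here until the kernel finishes
// A later cudaMemcpy from device memory also waits for the kernel,
// since it is queued behind it on the same (default) stream.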


Slide #38.

C SAXPY

void saxpy_serial(int n, float a, float *x, float *y)
{
  int i;
  for (i = 0; i < n; i++) {
    y[i] = a*x[i] + y[i];
  }
}

…
// invoke the kernel
saxpy_serial(n, 2.0, x, y);


Slide #39.

SAXPY on a GPU
- Doing anything across an entire vector is perfect for massively parallel computing.
- Instead of one function looping over the data set, we'll use many threads, each doing one calculation.

threadID: 0 1 2 3 4 5 6 7
…
y[tid] = a*x[tid] + y[tid];
…


Slide #40.

CUDA SAXPY

__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
  int i = (blockIdx.x * blockDim.x) + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}

…
int nblocks = (n + 255) / 256;

// invoke the kernel with 256 threads per block
saxpy_cuda<<<nblocks, 256>>>(n, 2.0, x, y);
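A hedged, self-contained host driver for the saxpy_cuda kernel above (not part of the slide), showing the full allocate / copy / launch / copy-back sequence:

#include <cstdlib>

int main() {
    int n = 1 << 20;
    size_t size = n * sizeof(float);

    float *x = (float*)malloc(size), *y = (float*)malloc(size);
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    float *xd, *yd;                          // "d" suffix: device copies
    cudaMalloc((void**)&xd, size);
    cudaMalloc((void**)&yd, size);
    cudaMemcpy(xd, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(yd, y, size, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, xd, yd);

    cudaMemcpy(y, yd, size, cudaMemcpyDeviceToHost);   // y[i] should now be 4.0f

    cudaFree(xd); cudaFree(yd);
    free(x); free(y);
    return 0;
}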


Slide #41.

Matrix Multiplication in CUDA
A case study


Slide #42.

Matrix Multiplication: A Case Study
- Matrix multiplication illustrates many of the basic features of memory and thread management in CUDA
  - Usage of thread/block IDs
  - Memory data transfer between host and device
- Motivates some performance issues:
  - Shared memory usage
  - Register usage
- Assumptions:
  - Basic unoptimized sgemm
  - Matrices are square (for simplicity)


Slide #43.

Programming Model: Square Matrix Multiplication Example
- P = M * N
  - Each is of size WIDTH x WIDTH
- Basic idea:
  - One thread calculates one element of P
  - M and N are loaded WIDTH times from global memory

[Diagram: matrices M, N, and P, each WIDTH x WIDTH]


Slide #44.

Step 1: Matrix Multiplication – A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int WIDTH)
{
  int i, j, k;
  double a, b, sum;
  for (i = 0; i < WIDTH; ++i)
    for (j = 0; j < WIDTH; ++j) {
      sum = 0;
      for (k = 0; k < WIDTH; ++k) {
        a = M[i * WIDTH + k];
        b = N[k * WIDTH + j];
        sum += a * b;
      }
      P[i * WIDTH + j] = sum;
    }
}

[Diagram: element (i, j) of P computed from row i of M and column j of N]


Slide #45.

Step 2: Input Matrix Data Transfer (Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int WIDTH)
{
  int size = WIDTH * WIDTH * sizeof(float);
  float *Md, *Nd, *Pd;
  …

  // 1. Allocate and load M, N to device memory
  cudaMalloc(&Md, size);
  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

  cudaMalloc(&Nd, size);
  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

  // Allocate P on the device
  cudaMalloc(&Pd, size);


Slide #46.

Step 3: Output Matrix Data Transfer (Host-side Code)

  // 2. Kernel invocation code – to be shown later
  …

  // 3. Read P from the device
  cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

  // Free device matrices
  cudaFree(Md);
  cudaFree(Nd);
  cudaFree(Pd);
}


Slide #47.

Step 4: Kernel Function

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int WIDTH)
{
  float Pvalue = 0;

  for (int k = 0; k < WIDTH; ++k) {
    float Melement = Md[threadIdx.y*WIDTH + k];
    float Nelement = Nd[k*WIDTH + threadIdx.x];
    Pvalue += Melement * Nelement;
  }

  Pd[threadIdx.y*WIDTH + threadIdx.x] = Pvalue;
}

[Diagram: thread (tx, ty) reads row ty of Md and column tx of Nd to compute one element of Pd]


Slide #48.

Step 5: Kernel Invocation (Host-side Code)

  // Setup the execution configuration
  dim3 dimGrid(1, 1);
  dim3 dimBlock(WIDTH, WIDTH);

  // Launch the device computation threads!
  MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WIDTH);


Slide #49.

Only One Thread Block Used
- One block of threads computes the matrix Pd
  - Each thread computes one element of the matrix Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and one addition for each pair of Md and Nd elements
  - Compute to off-chip memory access ratio close to 1:1 (not very good)
- Size of matrix limited by the number of threads allowed in a thread block (512)

[Diagram: Grid 1 / Block 1, with Thread (2, 2) combining a row of Md and a column of Nd into one element of Pd]


Slide #50.

Block IDs and Thread IDs
- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - …


Slide #51.

Matrix Multiplication Using Multiple Blocks
- Break up Pd into tiles
- Each block calculates one tile
  - Each thread calculates one element
  - Block size equals tile size

[Diagram: Pd divided into TILE_WIDTH x TILE_WIDTH sub-matrices Pdsub; block indices (bx, by) and thread indices (tx, ty) select one element]


Slide #52.

Revised mmult Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Calculate the row index of the Pd element and M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of Pd and N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  float Pvalue = 0;
  // Each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

  Pd[Row*Width+Col] = Pvalue;
}
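A hedged sketch (not on the slide) of how the Step 5 host-side launch changes to use multiple blocks, assuming Width is a multiple of TILE_WIDTH:

// Setup the execution configuration: one block per TILE_WIDTH x TILE_WIDTH tile of Pd
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);    // one thread per element of the tile
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);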


Slide #53.

G80 Block Granularity Considerations
Q: For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?
- For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
- For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
- For 32x32, we have 1024 threads per block. Not even one can fit into an SM (it exceeds the 512-thread-per-block limit)!
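A hedged sketch (not from the slide) of the arithmetic above, using the G80 limits quoted in this talk (at most 768 resident threads and 8 resident blocks per SM, 512 threads per block):

int threads_per_block = 16 * 16;                    // try 8*8, 16*16, 32*32
int blocks_by_threads = 768 / threads_per_block;    // limit from the per-SM thread cap
int resident_blocks   = blocks_by_threads < 8 ? blocks_by_threads : 8;   // per-SM block cap
int resident_threads  = resident_blocks * threads_per_block;
// 8x8   -> min(12, 8) = 8 blocks -> 512 threads per SM (underutilized)
// 16x16 -> min(3, 8)  = 3 blocks -> 768 threads per SM (full capacity)
// 32x32 -> 1024 threads per block exceeds the 512-thread block limit: will not launch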


Slide #54.

Exercise: Area Under the Curve

cp -r ~ernstdj/NCSI2010 .

Go to the "cuda_trap" directory.

less README.txt