Slide #1.

CUDA Lecture 4: CUDA Programming Basics
Prepared 6/22/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


Slide #2.

Parallel Programming Basics
Things we need to consider:
- Control
- Synchronization
- Communication
Parallel programming languages offer different ways of dealing with the above.


Slide #3.

Overview
- CUDA programming model: basic concepts and data types
- CUDA application programming interface: basics
- Simple examples to illustrate basic concepts and functionalities
- Performance features will be covered later


Slide #4.

Outline of CUDA Basics
- Basic kernels and execution on GPU
- Basic memory management
- Coordinating CPU and GPU execution
- See the programming guide for the full API


Slide #5.

CUDA – C with no shader limitations!
- Integrated host + device application program in C
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code
- Programming model
  - Parallel code (kernel) is launched and executed on a device by many threads
  - Launches are hierarchical: threads are grouped into blocks, blocks are grouped into grids
  - Familiar serial code is written for a thread
  - Each thread is free to execute a unique code path
  - Built-in thread and block ID variables


Slide #6.

CUDA – C with no shader limitations!
Execution alternates between host and device:

    Serial Code (host)
    Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args); ...
    Serial Code (host)
    Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args); ...


Slide #7.

CUDA Devices and Threads
- A compute device
  - Is a coprocessor to the CPU or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
  - Is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels which run on many threads


Slide #8.

CUDA Devices and Threads
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight: very little creation overhead
  - GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few


Slide #9.

G80 – Graphics Mode
- The future of GPUs is programmable processing
- So: build the architecture around the processor
[Figure: G80 graphics-mode block diagram – host, input assembler, vertex/geometry/pixel thread issue, thread processor, SP clusters with texture fetch (TF) and L1 caches, L2 caches and framebuffer (FB) partitions.]


Slide #10.

G80 CUDA Mode – A Device Example
- Processors execute computing threads
- New operating mode/hardware interface for computing
[Figure: G80 CUDA-mode block diagram – host, input assembler, thread execution manager, SP clusters with parallel data caches and texture units, load/store paths to global memory.]


Slide #11.

High Level View
[Figure: CPU and chipset connected over PCIe to the GPU, which contains several SMs (each with its SMEM) and global memory.]


Slide #12.

Blocks of Threads Run on an SM
[Figure: one streaming multiprocessor executes a thread block; each thread (on a streaming processor) has its own registers and per-thread memory, and the block shares per-block shared memory.]


Slide #13.

Whole Grid Runs on GPU
- Many blocks of threads
[Figure: the grid's blocks are spread across all SMs, which all share global memory.]


Slide #14.

Extended C
- Type qualifiers: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, execution management
- Function launch

    __device__ float filter[N];

    __global__ void convolve (float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>> (myimage);


Slide #15.

Extended C
Compilation flow: the integrated source (foo.cu) goes through cudacc (EDG C/C++ frontend, Open64 Global Optimizer), which splits it into GPU assembly (foo.s) and CPU host code (foo.cpp); the GPU assembly is turned into G80 SASS (foo.sass) by OCG, while the host code is compiled with gcc / cl.
Mark Murphy, “NVIDIA’s Experience with Open64,” www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc


Slide #16.

Arrays of Parallel Threads
- A CUDA kernel is executed by an array of threads
- All threads run the same code (SPMD)
- Each thread has an ID that it uses to compute memory addresses and make control decisions

    threadID:  0 1 2 3 4 5 6 7 ...

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

(A complete version of this fragment is sketched below.)
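Not on the original slide: a minimal, self-contained version of the fragment above, assuming hypothetical device arrays d_in and d_out and a stand-in computation for func(x).

    __global__ void scale_kernel(const float *input, float *output)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
        float x = input[threadID];
        float y = 2.0f * x;                                    // stand-in for func(x)
        output[threadID] = y;
    }

    // launch matching the diagram's threadID 0..7 (one block of 8 threads):
    // scale_kernel<<<1, 8>>>(d_in, d_out);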


Slide #17.

Thread Blocks: Scalable Cooperation
- Divide the monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
- Threads in different blocks cannot cooperate
[Figure: Thread Block 0, Thread Block 1, ..., Thread Block N-1; within each block threadID runs 0..7 and every block executes the same fragment:]

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;


Slide #18.

Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
  - Grid = all blocks for a given launch
- A thread block is a group of threads that can
  - Synchronize their execution
  - Communicate via shared memory


Slide #19.

Blocks Must Be Independent
- Any possible interleaving of blocks should be valid
  - Presumed to run to completion without preemption
  - Can run in any order
  - Can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  - Shared queue pointer: OK
  - Shared lock: BAD ... can easily deadlock
- The independence requirement gives scalability


Slide #20.

Basics of CUDA Programming
- A CUDA program has two pieces
  - Host code on the CPU which interfaces to the GPU
  - Kernel code which runs on the GPU
- At the host level, there is a choice of 2 APIs (Application Programming Interfaces):
  - Runtime: simpler, more convenient
  - Driver: much more verbose, more flexible, closer to OpenCL
- We will only use the Runtime API in this course


Slide #21.

Basics of CUDA Programming
At the host code level, there are library routines for:
- memory allocation on the graphics card
- data transfer to/from device memory
  - constants
  - texture arrays (useful for lookup tables)
  - ordinary data
- error-checking
- timing
There is also a special syntax for launching multiple copies of the kernel process on the GPU. (A sketch of the error-checking and timing calls follows.)
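A hedged sketch of the error-checking and timing routines the slide refers to. The checkCuda macro and timed_copy_example are illustrations of mine, not part of CUDA; cudaGetErrorString and the cudaEvent calls are the standard runtime API.

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define checkCuda(call)                                            \
        do {                                                           \
            cudaError_t err = (call);                                  \
            if (err != cudaSuccess)                                    \
                printf("CUDA error: %s\n", cudaGetErrorString(err));   \
        } while (0)

    void timed_copy_example(void *d_dst, const void *h_src, size_t bytes)
    {
        cudaEvent_t start, stop;
        float ms = 0.0f;

        checkCuda( cudaEventCreate(&start) );
        checkCuda( cudaEventCreate(&stop) );

        checkCuda( cudaEventRecord(start, 0) );
        checkCuda( cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice) );
        checkCuda( cudaEventRecord(stop, 0) );
        checkCuda( cudaEventSynchronize(stop) );      // wait for the copy to finish

        checkCuda( cudaEventElapsedTime(&ms, start, stop) );
        printf("host-to-device copy took %f ms\n", ms);

        checkCuda( cudaEventDestroy(start) );
        checkCuda( cudaEventDestroy(stop) );
    }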


Slide #22.

Block IDs and Thread IDs
- Each thread uses IDs to decide what data to work on
  - Block ID: 1-D or 2-D, unique within a grid
  - Thread ID: 1-D, 2-D or 3-D, unique within a block
- Dimensions are set at launch and can be unique for each grid


Slide #23.

Block IDs and Thread IDs
- Built-in variables
  - threadIdx, blockIdx
  - blockDim, gridDim
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - ...


Slide #24.

Basics of CUDA Programming
In its simplest form the launch of a kernel looks like:

    kernel_routine<<<gridDim, blockDim>>>(args);

where
- gridDim is the number of copies of the kernel (the “grid” size)
- blockDim is the number of threads within each copy (the “block” size)
- args is a limited number of arguments, usually mainly pointers to arrays in graphics memory, and some constants which get copied by value
The more general form allows gridDim and blockDim to be 2-D or 3-D to simplify application programs.


Slide #25.

Basics of CUDA Programming
At the lower level, when one copy of the kernel is started on an SM it is executed by a number of threads, each of which knows about:
- some variables passed as arguments
- pointers to arrays in device memory (also arguments)
- global constants in device memory
- shared memory and private registers/local variables
- some special variables:
  - gridDim: size (or dimensions) of the grid of blocks
  - blockIdx: index (or 2-D/3-D indices) of the block
  - blockDim: size (or dimensions) of each block
  - threadIdx: index (or 2-D/3-D indices) of the thread


Slide #26.

Basics of CUDA Programming
Suppose we have 1000 blocks, and each one has 128 threads – how does it get executed?
- On current Tesla hardware, we would probably get 8 blocks running at the same time on each SM; each block has 4 warps, so there are 32 warps running on each SM
- Each clock tick, the SM warp scheduler decides which warp to execute next, choosing from those not waiting for
  - data coming from device memory (memory latency)
  - completion of earlier instructions (pipeline delay)
- The programmer doesn't have to worry about this level of detail, just make sure there are lots of threads / warps


Slide #27.

Basics of CUDA Programming
In the simplest case, we have a 1-D grid of blocks and a 1-D set of threads within each block.
If we want to use a 2-D set of threads, then blockDim.x, blockDim.y give the dimensions, and threadIdx.x, threadIdx.y give the thread indices.
To launch the kernel we would use something like

    dim3 nthreads(16,4);
    my_new_kernel<<<nblocks, nthreads>>>(d_x);

(nblocks, the grid size, is assumed to be defined elsewhere) where dim3 is a special CUDA datatype with 3 components .x, .y, .z, each initialized to 1.


Slide #28.

For Example
- Launch with

    dim3 dimGrid(2, 2);
    dim3 dimBlock(4, 2, 2);
    kernelFunc<<<dimGrid, dimBlock>>>(...);

- Zoomed in on the block with blockIdx.x = blockIdx.y = 1, blockDim.x = 4, blockDim.y = blockDim.z = 2
- Each thread in the block has coordinates (threadIdx.x, threadIdx.y, threadIdx.z)


Slide #29.

Basics of CUDA Programming
A similar approach is used for 3-D threads and/or 2-D grids. This can be very useful in 2-D / 3-D finite difference applications.
How do 2-D / 3-D threads get divided into warps? A 1-D thread ID is defined by

    threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y

and this is then broken up into warps of size 32 (see the device-code sketch below).
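A small device-code sketch of the formula above; linear_thread_id and warp_of_thread are hypothetical helper names, not CUDA built-ins.

    __device__ int linear_thread_id(void)
    {
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }

    __device__ int warp_of_thread(void)
    {
        return linear_thread_id() / 32;   // consecutive groups of 32 threads form a warp
    }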


Slide #30.

CUDA Memory Model Overview
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long latency access
- We will focus on global memory for now
- Constant and texture memory will come later
[Figure: host connected to device global memory; the grid contains Block (0, 0) and Block (1, 0), each with shared memory and per-thread registers.]


Slide #31.

Memory Model
[Figure: Kernel 0 and Kernel 1 execute as sequential kernels; both access the same per-device global memory.]


Slide #32.

CUDA API Highlights: Easy and Lightweight
- The API is an extension to the ANSI C programming language: low learning curve
- The hardware is designed to enable a lightweight runtime and driver: high performance


Slide #33.

Memory Spaces
- CPU and GPU have separate memory spaces
  - Data is moved across the PCIe bus
  - Use functions to allocate/set/copy memory on the GPU, very similar to the corresponding C functions
- Pointers are just addresses
  - Can't tell from the pointer value whether the address is on the CPU or GPU
  - Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice-versa


Slide #34.

CUDA Device Memory Allocation
- cudaMalloc()
  - Allocates an object in the device global memory
  - Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object
- cudaFree()
  - Frees objects from device global memory
  - Takes a pointer to the freed object
[Figure: host, grid with blocks, shared memory, registers and global memory, as in the memory-model diagram.]


Slide #35.

CUDA Device Memory Allocation
Code example:
- Allocate a 64-by-64 single precision float array
- Attach the allocated storage to Md (“d” is often used to indicate a device data structure)

    TILE_WIDTH = 64;
    float* Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaMalloc((void**)&Md, size);
    cudaMemset(Md, 0, size);
    cudaFree(Md);


Slide #36.

CUDA Host-Device Data Transfer
- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer
    - Host to host
    - Host to device
    - Device to host
    - Device to device
- Asynchronous transfer is also available
[Figure: host and device memory-model diagram as before.]


Slide #37.

Memory Model
[Figure: cudaMemcpy() moves data between host memory and device 0 memory, and between device 0 and device 1 memories.]
- cudaMemcpy() returns after the copy is complete
  - Blocks the CPU thread until all bytes have been copied
  - Doesn't start copying until previous CUDA calls complete
- Non-blocking copies are also available (see the sketch below)
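A hedged sketch of a non-blocking copy using cudaMemcpyAsync, assuming d_a and h_a were allocated earlier; for the copy to truly overlap with CPU work, h_a should be page-locked (allocated with cudaMallocHost).

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // returns to the CPU immediately; the copy runs in the given stream
    cudaMemcpyAsync(d_a, h_a, num_bytes, cudaMemcpyHostToDevice, stream);

    // ... other CPU work can go here ...

    cudaStreamSynchronize(stream);   // block until the copy has finished
    cudaStreamDestroy(stream);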


Slide #38.

CUDA Host-Device Data Transfer
Code example:
- Transfer a 64-by-64 single precision float array
- M is in host memory and Md is in device memory
- cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost and cudaMemcpyDeviceToDevice are symbolic constants

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);


Slide #39.

First Simple CUDA Example

    #include <stdio.h>

    int main()
    {
        int dimx = 16;
        int num_bytes = dimx*sizeof(int);
        int *d_a=0, *h_a=0;              // device and host pointers

        h_a = (int*)malloc(num_bytes);
        cudaMalloc((void**)&d_a, num_bytes );

        if( 0==h_a || 0==d_a ) {
            printf("couldn't allocate memory\n");
            return 1;
        }

        cudaMemset( d_a, 0, num_bytes );
        cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

        // the original listing is cut off here; presumably it prints the
        // copied values and frees both allocations, e.g.:
        for(int i=0; i<dimx; i++)
            printf("%d ", h_a[i]);
        printf("\n");

        free( h_a );
        cudaFree( d_a );
        return 0;
    }


Slide #40.

Code Executed on GPU
- C/C++ with some restrictions
  - Can only access GPU memory
  - No variable number of arguments
  - No static variables
  - No recursion
  - No dynamic polymorphism
- Must be declared with a qualifier
  - __global__ : launched by CPU, cannot be called from GPU
  - __device__ : called from other GPU functions, cannot be called by the CPU
  - __host__ : can be called by the CPU


Slide #41.

CUDA Function Declarations

                                      Executed on the   Only callable from the
    __device__ float DeviceFunc()     Device            Device
    __global__ void  KernelFunc()     Device            Host
    __host__   float HostFunc()       Host              Host

- __global__ defines a kernel function; it must return void
- __device__ and __host__ can be used together (sample use: overloading operators; see the sketch below)
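A sketch of the "__device__ and __host__ used together" case for operator overloading; the float2 addition operator here is my own illustration, not from the slides.

    __host__ __device__ float2 operator+(float2 a, float2 b)
    {
        float2 c;
        c.x = a.x + b.x;
        c.y = a.y + b.y;
        return c;      // the same function is callable from host code and from kernels
    }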


Slide #42.

CUDA Function Declarations

    __device__ int reduction_lock = 0;

- The __device__ prefix tells nvcc this is a global variable in the GPU, not the CPU
- The variable can be read and modified by any kernel
- Its lifetime is the lifetime of the whole application
- Can also declare arrays of fixed size
- Can be read/written by host code using the special routines cudaMemcpyToSymbol, cudaMemcpyFromSymbol, or with standard cudaMemcpy in combination with cudaGetSymbolAddress (a host-side sketch follows)
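A hedged host-side sketch of accessing the __device__ variable above with the symbol-copy routines (the exact call style varies slightly across CUDA versions).

    int h_lock = 0;

    // host -> device: reset reduction_lock
    cudaMemcpyToSymbol(reduction_lock, &h_lock, sizeof(int));

    // device -> host: read reduction_lock back
    cudaMemcpyFromSymbol(&h_lock, reduction_lock, sizeof(int));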


Slide #43.

CUDA Function Declarations
- __device__ functions cannot have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments


Slide #44.

Calling a Kernel Function – Thread Creation
As seen, a kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);            // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);           // 256 threads/block
    size_t SharedMemBytes = 64;       // 64 bytes shared memory
    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below).
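A sketch of the explicit synchronization mentioned above: the launch returns immediately, so block the host before using the results on the CPU. cudaDeviceSynchronize is the current name; the CUDA versions contemporary with these slides used cudaThreadSynchronize for the same purpose.

    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

    cudaDeviceSynchronize();   // wait here until the kernel has finished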


Slide #45.

Basics of CUDA Programming
The kernel code looks fairly normal once you get used to two things:
- The code is written from the point of view of a single thread
  - quite different to OpenMP multithreading
  - similar to MPI, where you use the MPI “rank” to identify the MPI process
  - all local variables are private to that thread
- You need to think about where each variable lives
  - any operation involving data in the device memory forces its transfer to/from registers in the GPU
  - no cache on old hardware, so a second operation with the same data will force a second transfer
  - often better to copy the value into a local register variable (as sketched below)
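A hedged sketch of the "copy into a local register variable" advice, with hypothetical array names; v is read from device memory once and then reused from a register.

    __global__ void smooth(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        float v = in[i];                      // one device-memory read
        out[i] = 0.5f * v + 0.25f * v * v;    // reuse the register, not in[i]
    }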


Slide #46.

Next CUDA Example: Vector Addition

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        int N = 16;     // total number of elements in the vector/array
        int TPB = 4;    // number of threads per block

        // allocate and initialize host (CPU) memory
        float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

        // allocate device (GPU) memory
        cudaMalloc( (void**) &d_A, N * sizeof(float));
        cudaMalloc( (void**) &d_B, N * sizeof(float));
        cudaMalloc( (void**) &d_C, N * sizeof(float));

        // assign values to d_A and d_B;
        // copy host memory to device
        cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

        // Run grid of N/4 blocks of 4 threads each
        vecAdd<<< N/4, 4 >>>(d_A, d_B, d_C);

        // copy result back to host memory
        cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

        // do something with the result...

        // free device (GPU) memory
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }


Slide #47.

Next CUDA Example: Vector Addition
- The __global__ identifier says it's a kernel function
- Each thread sets one element of the C[] array
- Within each block of threads, threadIdx.x ranges from 0 to blockDim.x - 1, so each thread has a unique value for i (a bounds-checked variant is sketched below)
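Not in the original example: a commonly used bounds check for when N is not an exact multiple of the block size, so the surplus threads in the last block do nothing.

    __global__ void vecAddSafe(float* A, float* B, float* C, int N)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];       // threads with i >= N simply exit
    }

    // launch with enough blocks to cover all N elements:
    // vecAddSafe<<< (N + TPB - 1) / TPB, TPB >>>(d_A, d_B, d_C, N);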


Slide #48.

Kernel Variations and Output

    __global__ void kernel( int *a )
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        a[idx] = 7;
    }
    Output: 7777777777777777

    __global__ void kernel( int *a )
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        a[idx] = blockIdx.x;
    }
    Output: 0000111122223333

    __global__ void kernel( int *a )
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        a[idx] = threadIdx.x;
    }
    Output: 0123012301230123

(A host program that produces these outputs is sketched below.)
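A hedged reconstruction of the host program behind those outputs, assuming 16 ints, a launch of 4 blocks of 4 threads (so blockIdx.x and threadIdx.x each run 0..3), and one of the kernel definitions above.

    #include <stdio.h>

    int main()
    {
        int n = 16, bytes = n * sizeof(int);
        int *d_a = 0, h_a[16];

        cudaMalloc((void**)&d_a, bytes);
        kernel<<<4, 4>>>(d_a);                  // one of the kernels above
        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

        for (int i = 0; i < n; i++)
            printf("%d", h_a[i]);               // e.g. 0000111122223333
        printf("\n");

        cudaFree(d_a);
        return 0;
    }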


Slide #49.

Next CUDA Example: Kernel with 2-D Addressing

    __global__ void kernel( int *a, int dimx, int dimy )
    {
        int ix  = blockIdx.x*blockDim.x + threadIdx.x;
        int iy  = blockIdx.y*blockDim.y + threadIdx.y;
        int idx = iy*dimx + ix;

        a[idx] = a[idx]+1;
    }

    int main()
    {
        int dimx = 16;
        int dimy = 16;
        int num_bytes = dimx*dimy*sizeof(int);
        int *d_a=0, *h_a=0;            // device and host pointers

        h_a = (int*)malloc(num_bytes);
        cudaMalloc((void**)&d_a, num_bytes );

        if( 0==h_a || 0==d_a ) {
            printf("couldn't allocate memory\n");
            return 1;
        }

        cudaMemset( d_a, 0, num_bytes );

        dim3 grid, block;
        block.x = 4;
        block.y = 4;
        grid.x  = dimx / block.x;
        grid.y  = dimy / block.y;

        kernel<<<grid, block>>>( d_a, dimx, dimy );

        cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

        // the original listing is cut off here; presumably it prints the
        // result row by row and frees both allocations, e.g.:
        for(int row=0; row<dimy; row++) {
            for(int col=0; col<dimx; col++)
                printf("%d ", h_a[row*dimx+col]);
            printf("\n");
        }

        free( h_a );
        cudaFree( d_a );
        return 0;
    }


Slide #50.

A Simple Running Example: Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
- Leave shared memory usage until later
- Local, register usage
- Thread ID usage
- Memory data transfer API between host and device
- Assume square matrices for simplicity


Slide #51.

Programming Model: Square Matrix Multiplication Example
- P = M × N, each of size WIDTH-by-WIDTH
- Without tiling:
  - One thread calculates one element of P
  - M and N are loaded WIDTH times from global memory
[Figure: matrices M, N and P, each WIDTH wide and WIDTH tall.]


Slide #52.

Memory Layout of a Matrix in C
The 4-by-4 matrix

    M0,0 M0,1 M0,2 M0,3
    M1,0 M1,1 M1,2 M1,3
    M2,0 M2,1 M2,2 M2,3
    M3,0 M3,1 M3,2 M3,3

is laid out linearly in memory, row by row, as

    M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3


Slide #53.

Memory Layout of a Matrix in the Textbook
The same matrix, using the textbook's indexing,

    M0,0 M1,0 M2,0 M3,0
    M0,1 M1,1 M2,1 M3,1
    M0,2 M1,2 M2,2 M3,2
    M0,3 M1,3 M2,3 M3,3

is laid out linearly as

    M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3


Slide #54.

Step 1: Matrix Multiplication – A Simple Host Version in C
[Figure: element P(i, j) is the dot product of row i of M and column j of N, with k running along the WIDTH dimension.]


Slide #55.

Step 1: Matrix Multiplication – A Simple Host Version in C

    // Matrix multiplication on the (CPU) host in double precision
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                double sum = 0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[i * Width + k];
                    double b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }


Slide #56.

Step 2: Input Matrix Data Transfer (Host-Side Code)

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;

        // allocate and load M, N to device memory
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // allocate P on the device
        cudaMalloc((void**)&Pd, size);


Slide #57.

Step 3: Output Matrix Data Transfer (Host-Side Code)

        // kernel invocation code – to be shown later (Step 5)
        ...

        // read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

        // free device matrices
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }


Slide #58.

Step 4: Kernel Function (Overview)
[Figure: the thread with indices (threadIdx.x, threadIdx.y) computes one element of Pd from row threadIdx.y of Md and column threadIdx.x of Nd, with k running along the WIDTH dimension.]


Slide #59.

Step 4: Kernel Function

    // Matrix multiplication kernel – per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;

        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }

        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }


Slide #60.

Step 5: Kernel Invocation (Host-Side Code)

    // Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);


Slide #61.

Only One Thread Block Used
- One block of threads computes matrix Pd; each thread computes one element of Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and addition for each pair of Md and Nd elements
  - Compute-to-off-chip-memory-access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block


Slide #62.

Only One Thread Block Used
[Figure: Grid 1 contains a single Block 1; the highlighted Thread (2, 2) computes one element of Pd (48 in the example) from a row of Md and a column of Nd.]


Slide #63.

Handling Square Matrices with Arbitrary Size
- Have each 2-D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  - Each block has (TILE_WIDTH)² threads
- Generate a 2-D grid of (WIDTH / TILE_WIDTH)² blocks
- You still need to put a loop around the kernel call for cases where WIDTH / TILE_WIDTH is greater than the maximum grid size (64K)


Slide #64.

Matrix Multiplication Using Multiple Blocks
- Break up Pd into tiles
- Each block calculates one tile
  - Each thread calculates one element
  - Block size equals tile size
[Figure: Pd partitioned into TILE_WIDTH × TILE_WIDTH tiles; block indices (bx, by) and thread indices (tx, ty) select the tile and the element within it.]


Slide #65.

A Small Example: Multiplication with TILE_WIDTH = 2
[Figure: a 4 × 4 result matrix Pd split into four 2 × 2 tiles, computed by Block(0,0), Block(1,0), Block(0,1) and Block(1,1) from the corresponding parts of Md and Nd.]


Slide #66.

Revised Matrix Multiplication Kernel Using Multiple Blocks

    // Matrix multiplication kernel – per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Calculate the row index of the Pd element and M
        int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
        // Calculate the column index of Pd and N
        int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

        Pd[Row*Width+Col] = Pvalue;
    }

(The matching launch configuration is sketched below.)
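A sketch of the host-side launch for this multi-block kernel (replacing the single-block configuration of Step 5), assuming TILE_WIDTH divides Width and that Md, Nd, Pd were set up as in Step 2.

    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // one block per tile
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // one thread per element

    MatrixMulKernel<<< dimGrid, dimBlock >>>(Md, Nd, Pd, Width);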


Slide #67.

CUDA Thread Block
- All threads in a block execute the same kernel program (SPMD)
- Programmer declares the block:
  - Block size: 1 to 512 concurrent threads
  - Block shape: 1-D, 2-D, or 3-D
  - Block dimensions in threads
- Threads have thread id numbers within the block
- The thread program uses the thread id to select work and address shared data
[Figure: a CUDA thread block with thread ids 0, 1, 2, 3, ..., m, all running the same thread program.]
Courtesy: John Nickolls, NVIDIA


Slide #68.

CUDA Thread Block
- Threads in the same block share data and synchronize while doing their share of the work
- Threads in different blocks cannot cooperate
- Each block can execute in any order relative to other blocks!
[Figure: a CUDA thread block with thread ids 0, 1, 2, 3, ..., m, running the thread program.]
Courtesy: John Nickolls, NVIDIA


Slide #69.

Transparent Scalability
- Hardware is free to assign blocks to any processor at any time
  - A kernel scales across any number of parallel processors
- Each block can execute in any order relative to the other blocks
[Figure: the same kernel grid (Blocks 0–7) runs on a two-SM device as four successive pairs of blocks, and on a four-SM device as two successive groups of four blocks.]


Slide #70.

G80 CUDA Mode – A Review
- Processors execute computing threads
- New operating mode/hardware interface for computing
[Figure: G80 CUDA-mode block diagram – host, input assembler, thread execution manager, SP clusters with parallel data caches and texture units, load/store paths to global memory.]


Slide #71.

G80 Example: Executing Thread Blocks
- Threads are assigned to streaming multiprocessors (SMs) in block granularity
  - Up to 8 blocks to each SM, as resources allow
  - Each SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) × 3 blocks
    - Or 128 (threads/block) × 6 blocks, etc.
- Threads run concurrently
  - Each SM maintains thread/block id numbers
  - Each SM manages/schedules thread execution


Slide #72.

G80 Example: Executing Thread Blocks
[Figure: SM 0 and SM 1, each with an MT issue unit (MT IU), SPs, shared memory, and threads t0 t1 t2 ... tm from the blocks assigned to it.]
- Flexible resource allocation


Slide #73.

G80 Example: Thread Scheduling
- Each block is executed as 32-thread warps
  - An implementation decision, not part of the CUDA programming model
  - Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 × 3 = 24 warps


Slide #74.

G80 Example: Thread Scheduling
[Figure: warps from Block 1 and Block 2 (each warp is threads t0 t1 t2 ... t31) feed a streaming multiprocessor containing an instruction L1 cache, instruction fetch/dispatch, shared memory, SPs and SFUs.]


Slide #75.

G80 Example: Thread Scheduling
- Each SM implements zero-overhead warp scheduling
  - At any time, only one of the warps is executed by an SM
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected


Slide #76.

G80 Block Granularity Considerations
For matrix multiplication using multiple blocks, should I use 8 × 8, 16 × 16 or 32 × 32 blocks?
- For 8 × 8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
- For 16 × 16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
- For 32 × 32, we have 1024 threads per block – not even one block fits within the SM's 768-thread limit.


Slide #77.

Application Programming Interface
- The API is an extension to the C programming language
- It consists of:
  - Language extensions
    - To target portions of the code for execution on the device
  - A runtime library split into:
    - A common component providing built-in vector types and a subset of the C runtime library in both host and device codes
    - A host component to control and access one or more devices from the host
    - A device component providing device-specific functions


Slide #78.

Language Extensions: Built-in Variables
- dim3 gridDim;
  - Dimensions of the grid in blocks (gridDim.z unused)
- dim3 blockDim;
  - Dimensions of the block in threads
- dim3 blockIdx;
  - Block index within the grid
- dim3 threadIdx;
  - Thread index within the block


Slide #79.

Common Runtime Component: Mathematical Functions
- pow, sqrt, cbrt, hypot
- exp, exp2, expm1
- log, log2, log10, log1p
- sin, cos, tan, asin, acos, atan, atan2
- sinh, cosh, tanh, asinh, acosh, atanh
- ceil, floor, trunc, round
- Etc.
- When executed on the host, a given function uses the C runtime implementation if available
- These functions are only supported for scalar types, not vector types


Slide #80.

Common Runtime Component: Mathematical Functions
- Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sin(x))
  - __pow
  - __log, __log2, __log10
  - __exp
  - __sin, __cos, __tan
(A device-code sketch follows.)
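A hedged device-code sketch contrasting the standard and fast versions; note that in the CUDA math library the single-precision intrinsics are spelled with a trailing f (__sinf, __expf, ...).

    __global__ void wave(const float *x, float *accurate, float *fast, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            accurate[i] = sinf(x[i]) * expf(-x[i]);       // standard math functions
            fast[i]     = __sinf(x[i]) * __expf(-x[i]);   // faster, less accurate intrinsics
        }
    }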


Slide #81.

Host Runtime Component
- Provides functions to deal with:
  - Device management (including multi-device systems)
  - Memory management
  - Error handling
- Initializes the first time a runtime function is called
- A host thread can invoke device code on only one device
  - Multiple host threads are required to run on multiple devices


Slide #82.

Device Runtime Component: Synchronization Function
- void __syncthreads();
- Synchronizes all threads in a block
  - Once all threads have reached this point, execution resumes normally
  - Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block
(A shared-memory sketch follows.)
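A classic illustration of the barrier (my own sketch, not from the slides): reverse the elements handled by one block through shared memory; the __syncthreads() call guarantees every thread's write to s[] is visible before any thread reads another thread's element.

    #define BLOCK 64

    __global__ void reverse_block(int *d)
    {
        __shared__ int s[BLOCK];
        int t = threadIdx.x;

        s[t] = d[t];
        __syncthreads();              // all writes to s[] complete here

        d[t] = s[BLOCK - 1 - t];      // safe to read another thread's element
    }

    // launch with exactly one block of BLOCK threads:
    // reverse_block<<<1, BLOCK>>>(d_a);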


Slide #83.

Final Thoughts
- Memory allocation: cudaMalloc((void **)&xd, nbytes);
- Data copying: cudaMemcpy(xh, xd, nbytes, cudaMemcpyDeviceToHost);
- Reminder: using d (h) to distinguish an array on the device (host) is not mandatory, just helpful labeling
- A kernel routine is declared by the __global__ prefix, and is written from the point of view of a single thread


Slide #84.

End Credits
Reading: Chapters 3 and 4, “Programming Massively Parallel Processors” by Kirk and Hwu.
Based on original material from:
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Oxford University: Mike Giles
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/22/2011.

