Université de Perpignan Via Domitia
GPU Teaching Kit – Accelerated Computing
Lecture 2.1 – Introduction to CUDA C: CUDA C vs. Thrust vs. CUDA Libraries

Objective
– To learn the main venues and developer resources for GPU computing
– Where CUDA C fits in the big picture

3 Ways to Accelerate Applications
– Libraries: easy to use, most performance
– Compiler Directives: easy to use, portable code
– Programming Languages: most performance, most flexibility

Libraries: Easy, High-Quality Acceleration
– Ease of use: using libraries enables GPU acceleration without in-depth knowledge of GPU programming
– "Drop-in": many GPU-accelerated libraries follow standard APIs, enabling acceleration with minimal code changes
– Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications

NVIDIA GPU-Accelerated Libraries
– Deep learning: cuDNN, TensorRT, DeepStream SDK
– Linear algebra: cuBLAS, cuSPARSE, cuSOLVER
– Signal, image, video: cuFFT, NVIDIA NPP, Codec SDK
– Parallel algorithms: nvGRAPH, NCCL

Vector Addition in Thrust

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>

    int main(void) {
        size_t inputLength = 500;
        thrust::host_vector<float> hostInput1(inputLength);
        thrust::host_vector<float> hostInput2(inputLength);
        thrust::device_vector<float> deviceInput1(inputLength);
        thrust::device_vector<float> deviceInput2(inputLength);
        thrust::device_vector<float> deviceOutput(inputLength);
        thrust::copy(hostInput1.begin(), hostInput1.end(), deviceInput1.begin());
        thrust::copy(hostInput2.begin(), hostInput2.end(), deviceInput2.begin());
        thrust::transform(deviceInput1.begin(), deviceInput1.end(),
                          deviceInput2.begin(), deviceOutput.begin(),
                          thrust::plus<float>());
        return 0;
    }

Compiler Directives: Easy, Portable Acceleration
– Ease of use: the compiler takes care of the details of parallelism management and data movement
– Portable: the code is generic, not specific to any type of hardware, and can be deployed on multiple platforms
– Uncertain: performance of the code can vary across compiler versions

OpenACC – compiler directives for C, C++, and FORTRAN

    #pragma acc parallel loop \
        copyin(input1[0:inputLength], input2[0:inputLength]) \
        copyout(output[0:inputLength])
    for (i = 0; i < inputLength; ++i) {
        output[i] = input1[i] + input2[i];
    }

Programming Languages: Most Performance and Flexible Acceleration
– Performance: the programmer has the best control of parallelism and data movement
– Flexible: the computation does not need to fit into a limited set of library patterns or directive types
– Verbose: the programmer often needs to express more details

GPU Programming Languages
– Numerical analytics: MATLAB, Mathematica, LabVIEW
– Python: PyCUDA, Numba
– Fortran: CUDA Fortran, OpenACC
– C: CUDA C, OpenACC
– C++: CUDA C++, Thrust
– C#: Hybridizer

CUDA C
Among the three approaches (libraries, compiler directives, programming languages), CUDA C is a programming language: most performance and most flexibility, at the cost of ease of use.

GPU Teaching Kit – Accelerated Computing
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

GPU Teaching Kit – Accelerated Computing
Lecture 2.2 – Introduction to CUDA C: Memory Allocation and Data Movement API Functions

Objective
– To learn the basic API functions in CUDA host code
– Device memory allocation
– Host–device data transfer

Data Parallelism – Vector Addition Example
vector A: A[0] A[1] A[2] … A[N-1]
vector B: B[0] B[1] B[2] … B[N-1]
vector C: C[0] C[1] C[2] … C[N-1], where C[i] = A[i] + B[i]

Vector Addition – Traditional C Code

    // Compute vector sum C = A + B
    void vecAdd(float *h_A, float *h_B, float *h_C, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            h_C[i] = h_A[i] + h_B[i];
    }

Heterogeneous execution: serial code runs on the host; parallel kernels run on the device.

    Serial Code (host)
    Parallel Kernel (device): KernelA<<<nBlk, nTid>>>(args); ...
    Serial Code (host)
    Parallel Kernel (device): KernelB<<<nBlk, nTid>>>(args); ...

From Natural Language to Electrons
Natural Language (e.g., English) → Algorithm → High-Level Language (C/C++…) → [Compiler] → Instruction Set Architecture → Microarchitecture → Circuits → Electrons
©Yale Patt and Sanjay Patel, From bits and bytes to gates and beyond

A Program at the ISA Level
– A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware.
– Both CPUs and GPUs are designed based on (different) instruction sets
– Program instructions operate on data stored in memory and/or registers

A Thread as a Von Neumann Processor
A thread is a "virtualized" or "abstracted" Von Neumann processor: memory, I/O, a processing unit (register file and ALU), and a control unit (PC and IR).

Arrays of Parallel Threads
A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (Single Program Multiple Data)
– Each thread has indexes that it uses to compute memory addresses and make control decisions

    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];

Thread Blocks: Scalable Cooperation
Thread Block 0 … Thread Block N-1 each contain threads 0…255, and every thread executes the same code:

    i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];

– Divide the thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
– Threads in different blocks do not interact

blockIdx and threadIdx
Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (since CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– …
(Figure: a device grid of blocks, e.g. Block (0,0) through Block (1,1), each block containing a 3D array of threads such as Thread (0,0,0) through Thread (0,1,3).)
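The indexing pattern above can be put into a complete vector-addition program. The following is a minimal sketch (not from the slides); the kernel name, vector length, and block size are illustrative, and it assumes nvcc and a CUDA-capable GPU:

```cuda
#include <cuda_runtime.h>

// Each thread computes one element of C; the bounds check guards the
// last block when n is not a multiple of blockDim.x.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main(void) {
    const int n = 1000;
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    // ... initialize d_A and d_B (e.g., cudaMemcpy from host arrays) ...

    // 256 threads per block; enough blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);
    cudaDeviceSynchronize();

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```

The ceiling division `(n + threadsPerBlock - 1) / threadsPerBlock` ensures the grid covers every element even when n is not a multiple of the block size, which is why the kernel needs its `i < n` check.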
GPU Teaching Kit – Accelerated Computing
Lecture 2.4 – Introduction to CUDA C: Introduction to the CUDA Toolkit

Objective
– To become familiar with some valuable tools and resources from the CUDA Toolkit
– Compiler flags
– Debuggers
– Profilers

GPU Programming Languages
– Numerical analytics: MATLAB, Mathematica, LabVIEW
– Python: PyCUDA, Numba
– Fortran: CUDA Fortran, OpenACC
– C: CUDA C, OpenACC
– C++: CUDA C++, Thrust
– C#: Hybridizer

CUDA C (recap)
Among libraries, compiler directives, and programming languages, CUDA C is a programming language: most performance and most flexibility.

NVCC Compiler
– NVIDIA provides a CUDA-C compiler: nvcc
– NVCC compiles device code, then forwards the remaining code to the host compiler (e.g. g++)
– Can be used to compile & link host-only applications

Example 1: Hello World

    #include <stdio.h>
    int main() {
        printf("Hello World!\n");
        return 0;
    }

Instructions:
1. Build and run the hello world code
2. Modify the Makefile to use nvcc instead of g++
3. Rebuild and run

CUDA Example 1: Hello World

    #include <stdio.h>
    __global__ void mykernel(void) {
    }
    int main(void) {
        mykernel<<<1,1>>>();
        printf("Hello World!\n");
        return 0;
    }

Instructions:
1. Add the kernel and kernel launch to main.cc
2. Try to build

CUDA Example 1: Build Considerations
– Build failed
– nvcc only parses .cu files for CUDA
– Fixes:
– Rename main.cc to main.cu, OR
– nvcc -x cu: treat all input files as .cu files

Instructions:
1. Rename main.cc to main.cu
2. Rebuild and run

Hello World! with Device Code
(same code as above)
Output:
    $ nvcc main.cu
    $ ./a.out
    Hello World!
– mykernel() does nothing (somewhat anticlimactic!)
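To make the kernel less anticlimactic, a small variant (not from the slides) can print from the device using device-side printf; a sketch assuming nvcc and a GPU that supports device printf:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device-side printf: each of the 2 x 4 threads reports its coordinates.
__global__ void mykernel(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    mykernel<<<2, 4>>>();      // 2 blocks of 4 threads each
    cudaDeviceSynchronize();   // wait for the kernel so device output is flushed
    return 0;
}
```

Without the cudaDeviceSynchronize() call, the host may exit before the device printf buffer is flushed, so the messages can be lost.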
Developer Tools – Debuggers
– NVIDIA provided: Nsight, Nsight Systems, CUDA-GDB, CUDA-MEMCHECK (3rd-party debuggers also exist)
https://developer.nvidia.com/debugging-solutions

Compiler Flags
– Remember there are two compilers being used
– NVCC: device code
– Host compiler: C/C++ code
– NVCC supports some host compiler flags
– If a flag is unsupported, use -Xcompiler to forward it to the host compiler
– e.g. -Xcompiler -fopenmp
– Debugging flags
– -g: include host debugging symbols
– -G: include device debugging symbols
– -lineinfo: include line information with symbols

CUDA-MEMCHECK
– Memory debugging tool
– No recompilation necessary
    %> cuda-memcheck ./exe
– Can detect the following errors
– Memory leaks
– Memory errors (out-of-bounds, misaligned access, illegal instruction, etc.)
– Race conditions
– Illegal barriers
– Uninitialized memory
– For line numbers, use the following compiler flags:
– -Xcompiler -rdynamic -lineinfo
http://docs.nvidia.com/cuda/cuda-memcheck

Example 2: CUDA-MEMCHECK
Instructions:
1. Build & run Example 2. Output should be the numbers 0-9. Do you get the correct results?
2. Run with cuda-memcheck: %> cuda-memcheck ./a.out
3. Add the nvcc flags "-Xcompiler -rdynamic -lineinfo"
4. Rebuild & run with cuda-memcheck
5. Fix the illegal write
http://docs.nvidia.com/cuda/cuda-memcheck

CUDA-GDB
– cuda-gdb is an extension of GDB
– Provides seamless debugging of CUDA and CPU code
– Works on Linux and Macintosh
– For a Windows debugger, use NVIDIA Nsight Eclipse Edition or Visual Studio Edition
http://docs.nvidia.com/cuda/cuda-gdb

Example 3: cuda-gdb
Instructions:
1. Run exercise 3 in cuda-gdb: %> cuda-gdb --args ./a.out
2.
Run a few cuda-gdb commands:
    (cuda-gdb) b main                  // set breakpoint at main
    (cuda-gdb) r                       // run application
    (cuda-gdb) l                       // print line context
    (cuda-gdb) b foo                   // break at kernel foo
    (cuda-gdb) c                       // continue
    (cuda-gdb) cuda thread             // print current thread
    (cuda-gdb) cuda thread 10          // switch to thread 10
    (cuda-gdb) cuda block              // print current block
    (cuda-gdb) cuda block 1            // switch to block 1
    (cuda-gdb) d                       // delete all breakpoints
    (cuda-gdb) set cuda memcheck on    // turn on cuda memcheck
    (cuda-gdb) r                       // run from the beginning
3. Fix the bug
http://docs.nvidia.com/cuda/cuda-gdb

Developer Tools – Profilers
– NVIDIA provided: NSIGHT, NVVP, NVPROF
– 3rd party: TAU, VampirTrace
https://developer.nvidia.com/performance-analysis-tools

NVPROF Command Line Profiler
– Compute time in each kernel
– Compute memory transfer time
– Collect metrics and events
– Supports complex process hierarchies
– Collects profiles for the NVIDIA Visual Profiler
– No need to recompile

Example 4: nvprof
Instructions:
1. Collect profile information for the matrix add example: %> nvprof ./a.out
2. How much faster is add_v2 than add_v1?
3. View available metrics: %> nvprof --query-metrics
4. View global load/store efficiency: %> nvprof --metrics gld_efficiency,gst_efficiency ./a.out
5. Store a timeline to load in NVVP: %> nvprof -o profile.timeline ./a.out
6. Store analysis metrics to load in NVVP: %> nvprof -o profile.metrics --analysis-metrics ./a.out

NVIDIA's Visual Profiler (NVVP)
– Timeline
– Guided system analysis

Example 4: NVVP
Instructions:
1. Import the nvprof profile into NVVP: launch nvvp; click File / Import / Nvprof / Next / Single process / Next / Browse; select profile.timeline. To add metrics to the timeline, click on the 2nd Browse, select profile.metrics, then click Finish.
2. Explore the timeline: Ctrl + mouse drag in the timeline to zoom in; Ctrl + mouse drag in the measure bar (on top) to measure time.

Example 4: NVVP
Instructions:
1. Click on a kernel
2. On the Analysis tab, click on the unguided analysis
3.
Click Analyze All. Explore metrics and properties. What differences do you see between the two kernels?
Note: if kernel order is non-deterministic, you can only load the timeline or the metrics, but not both. If you load just the metrics, the timeline looks odd but the metrics are correct.

Example 4: NVVP
Let's now generate the same data within NVVP.
Instructions:
1. Click File / New Session / Browse; select Example 4/a.out; click Next / Finish
2. Click on a kernel; select Unguided Analysis; click Analyze All

NVTX
– Our current tools only profile API calls on the host
– What if we want to understand better what the host is doing?
– The NVTX library allows us to annotate profiles with ranges
– Add: #include <nvToolsExt.h>
– Link with: -lnvToolsExt
– Mark the start of a range: nvtxRangePushA("description");
– Mark the end of a range: nvtxRangePop();
– Ranges are allowed to overlap
http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/

NVTX Profile
(Figure: timeline with NVTX ranges.)

NSIGHT
– CUDA-enabled Integrated Development Environment
– Source code editor: syntax highlighting, code refactoring, etc.
– Build manager
– Visual debugger
– Visual profiler
– Linux/Macintosh
– Editor = Eclipse
– Debugger = cuda-gdb with a visual wrapper
– Profiler = NVVP
– Windows
– Integrates directly into Visual Studio
– Profiler is NSIGHT VSE

Example 4: NSIGHT
Let's import an existing Makefile project into NSIGHT.
Instructions:
1. Run nsight; select the default workspace
2. Click File / New / Makefile Project With Existing Code
3. Enter a project name and select the Example15 directory
4. Click Finish
5. Right-click on the project / Properties / Run Settings / New / C++ Application
6. Browse for Example 4/a.out
7. In Project Explorer, double-click on main.cu and explore the source
8. Click on the build icon
9.
Click on the run icon
10. Click on the profile icon

Profiler Summary
– Many profiling tools are available
– NVIDIA provided
– NVPROF: command line
– NVVP: visual profiler
– NSIGHT: IDE (Visual Studio and Eclipse)
– 3rd party
– TAU
– VampirTrace

Optimization Cycle: Assess → Parallelize → Optimize → Deploy

Assess
– Profile the code, find the hotspot(s)
– Focus your attention where it will give the most benefit

Parallelize
– Using libraries, compiler directives, or programming languages

Optimize
– Timeline
– Guided system analysis

Bottleneck Analysis
– Don't assume an optimization was wrong
– Verify whether it was wrong with the profiler
(Figure: measured bandwidths of 129 GB/s vs. 84 GB/s)

Performance Analysis
(Figure: measured bandwidths of 84 GB/s vs. 137 GB/s)

GPU Teaching Kit – Accelerated Computing
Lecture 2.5 – Nsight Compute and Nsight Systems: Introduction to the CUDA Toolkit

Objective
– To become familiar with Nsight Systems and Nsight Compute

Profiling Tools
– nvprof and NVVP: phasing out
– Nsight Systems and Nsight Compute: current
See lecture 2.4 for an overview of all tools

Nsight Compute & Nsight Systems
– Command-line and interactive profilers
– Bundled with the CUDA Toolkit
– Newer standalone downloads available on the NVIDIA website
– Nsight Systems: "feeds and speeds", i.e. getting data into the GPU and profiling GPU utilization
– Nsight Compute: kernel-level profiling

A Common GPU Development Model
– Developer system ("client") with no GPU controls, via ssh or remote desktop, a remote infrastructure ("server") with high-performance GPUs

Two-Phase Profiling (interactive profiling also supported)
– On the "server", record profiling data: nsys profile ... (Nsight Systems) or nv-nsight-compute-cli ... (Nsight Compute)
– Move the results to the "client" via scp or a shared file system, then analyze the profiling data there

Before Profiling
– Options to improve your profiling experience
– Host code annotations with the NVIDIA Tools Extension library
– Correctness & cuda-memcheck
– Compilation flags
– Ensure the Nsight Systems environment is correct

Before Profiling: Host Code Annotations
#include <nvToolsExt.h> and link with -lnvToolsExt
Will show up as a named span in the Nsight Systems GUI.
Useful for marking parts of the code for later reference.

    nvtxRangePush("sleeping");
    sleep(100);
    nvtxRangePop();

Before Profiling: cuda-memcheck
– Certain kinds of errors cause CUDA programs to complete, but crash under profiling
– Check your program with cuda-memcheck if the code behaves incorrectly under profiling
    cuda-memcheck ./my-program ...

Before Profiling: Compilation Flags
– Compile device code with optimizations
– Optimizations dramatically change performance
– Remove the -G device-debug flag from nvcc
– Compile device code with line information
– Minimal information included in the binary to map PTX/SASS to source code
– Add the -lineinfo flag to nvcc
    nvcc -G main.cu          (debug build: avoid when profiling)
    nvcc -lineinfo main.cu   (optimized build with line information)

Before Profiling: Nsight Systems Configuration
Nsight Systems uses various system hooks to accomplish profiling. Some errors reduce the amount or accuracy of the gathered information; some make system profiling impossible.
An example of a GOOD output (check with nsys status -e):

    $ nsys status -e
    Sampling Environment Check
    Linux Kernel Paranoid Level = 2: OK
    Linux Distribution = Ubuntu
    Linux Kernel Version = 4.16.15-41615: OK
    Linux perf_event_open syscall available: OK
    Sampling trigger event available: OK
    Intel(c) Last Branch Record support: Available
    Sampling Environment: OK

Consult the Nsight Systems documentation or your system administrator to correct any issues.
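Putting the host-code annotation pieces together, a minimal annotated program might look like the following sketch (the range names and sleep calls are stand-ins for real setup and compute work):

```cuda
#include <unistd.h>        // sleep()
#include <nvToolsExt.h>    // NVTX range API; link with -lnvToolsExt

int main(void) {
    // Named span that will appear on the NVTX row of the Nsight Systems timeline.
    nvtxRangePushA("initialization");
    sleep(1);              // stand-in for real setup work
    nvtxRangePop();

    // Ranges may nest or overlap; each push must be matched by a pop.
    nvtxRangePushA("compute");
    sleep(2);              // stand-in for kernel launches and synchronization
    nvtxRangePop();
    return 0;
}
```

When this program is recorded with nsys profile, the "initialization" and "compute" spans make it easy to attribute GPU activity and API calls to phases of the host code.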
Nsight Compute
– Records and analyzes detailed kernel performance metrics
– Two interfaces:
– GUI (nv-nsight-cu)
– CLI (nv-nsight-cu-cli)
– Directly consuming 1000+ metrics is challenging, so we use the GUI to help
– Uses a two-part record-then-analyze process: record on the "server" with nv-nsight-compute-cli ..., then move the results to the "client" via scp or a shared file system

Performance Counters
– The device has many performance counters that record detailed information
– Made available as "metrics"
– Nsight Compute helps you interpret these

    $ nv-nsight-cu-cli --devices 0 --query-metrics
    lts__t_sectors_srcunit_l1_op_atom_dot_cas
    l1tex__data_pipe_lsu_wavefronts_mem_shared_cmd_write
    lts__t_sectors_srcunit_l1_aperture_sysmem_op_read
    lts__t_requests_op_red_lookup_hit
    lts__t_sectors_equiv_l1tagmiss_pipe_tex_mem_texture_op_ld
    l1tex__t_bytes_pipe_tex_lookup_miss
    l1tex__texin_requests_mem_texture
    l1tex__t_bytes_pipe_lsu_mem_local_op_ld_lookup_miss
    l1tex__t_bytes_pipe_tex_mem_surface_op_red_lookup_miss
    ...

Record Kernel Traces
– Recording may cover the whole execution, but is usually restricted to a single launch of a kernel

    $ nv-nsight-cu-cli \
        --kernel-id ::mygemm:6 \
        --section ".*" \
        -o output_file \
        executable

Open Nsight Compute
– We will import the recorded file: File > Open File... > output_file.nsight-cuprof-report
– We can open multiple files in multiple tabs, if desired

First Look
– Can compare multiple open reports against a baseline (baseline button), useful for comparing the effect of optimizations

GPU Speed of Light
– Utilization compared to theoretical maximums
– Tip: mouse over each chart to see the associated metric

Workload Memory Analysis Chart
– Executed instructions that reference a memory space, requests to the memory, and the amount of data moving

Scheduler Statistics
– Activity of the scheduler issuing instructions:
– Maximum possible warps per scheduler
– Warps that have not exited a kernel
– Warps that have their execution dependencies satisfied
– Warps actually issued

Warp State Statistics
– How much time warps spend in each state
– Tip: mouse over to see the value

Occupancy
– How many warps are active compared to the maximum possible
– Achieved: the true number of active warps, as an average
– Charts show how kernel resources and grid dimensions affect occupancy

Instruction Hotspots
– Shows metrics by source line, PTX instruction, or SASS instruction
– Switch the page to "Source", then "Source and PTX" (usually) or "Source and SASS"
– The source file may not load if profiling was recorded on another system: click "Resolve" and find your local copy of the code that was compiled or run remotely
– Click a line to highlight the corresponding PTX/SASS lines on the other side
– The program counter spends most of its time on instructions from the highlighted line; mouse over for a breakdown
– Sometimes stalls show up in a following instruction that depends on a previous one

Nsight Systems
– Records and analyzes system utilization information
– Two interfaces:
– GUI (nsight-sys)
– CLI (nsys)
– Uses a two-part record-then-analyze process: record on the "server" with nsys ..., then move the results to the "client" via scp or a shared file system

Record Execution Traces
– Record system information for the entire execution

    $ nsys profile \
        -o output_file \
        executable

First Look
– File > Open > output_file.qdrep
– If multiple files are open, they are shown in the left pane
– Timeline of OS calls, CUDA calls, NVTX spans, and GPU activity

Timeline View
– NVTX annotations, CPU activity, and GPU activity; click a row to expand it
– Click and drag to zoom in; mouse over spans for more info
GPU Teaching Kit – Accelerated Computing
Lecture 2.6 – Introduction to CUDA C: Unified Memory

Objective
– To learn the basic API functions in CUDA host code for CUDA Unified Memory
– Unified Memory allocation
– Data transfer in Unified Memory

CUDA Unified Memory (UM)
Unified Memory is a single memory address space accessible both from the host and from the device. The hardware/software automatically handles data migration between the host and the device, maintaining consistency between them.

Partial Overview of CUDA Memories
– Device code can:
– R/W per-thread registers
– R/W all-shared global memory
– R/W managed memory (Unified Memory)
– Host code can:
– Transfer data to/from per-grid global memory
– R/W managed memory

cudaMallocManaged()
– Allocates an object in the Unified Memory address space
– Two parameters, with an optional third parameter:
– Address of a pointer to the allocated object
– Size of the allocated object in bytes
– [Optional] Flag indicating whether the memory can be accessed from any device or stream
cudaFree()
– Frees an object from unified memory
– One parameter: pointer to the freed object

cudaMemcpy()
– Memory data transfer
– Requires four parameters:
– Pointer to destination
– Pointer to source
– Number of bytes copied
– Type/direction of transfer
– Depending on the transfer type, the driver may decide to use the memory on the host or the device
– With Unified Memory, this function can be used to copy data between different arrays, regardless of where they reside

Putting It All Together: vecAdd CUDA Host Code Using Unified Memory

    int main() {
        float *m_A, *m_B, *m_C;
        int n;                                    // vector length, assumed set elsewhere
        int size = n * sizeof(float);
        cudaMallocManaged((void**) &m_A, size);   // allocation of managed memory
        cudaMallocManaged((void**) &m_B, size);
        cudaMallocManaged((void**) &m_C, size);
        // Memory initialization on the host: m_A and m_B get initialized here
        // Kernel invocation code (to be shown later): the device performs the
        // actual vector addition
        cudaFree(m_A);
        cudaFree(m_B);
        cudaFree(m_C);
    }

CUDA Unified Memory for Different Architectures
– Prior to compute capability 6.x:
– There are no specialized hardware units managing page faulting
– For data migration, the full memory block needs to be copied synchronously by the driver
– No memory oversubscription
– Compute capability 6.x onwards:
– There are specialized hardware units to improve UM efficiency
– Data is migrated on demand, meaning data gets copied only on a page fault
– Memory can be oversubscribed, enabling arrays larger than the device memory size

GPU Teaching Kit – Accelerated Computing
The GPU Teaching Kit is licensed by NVIDIA under the Creative Commons Attribution-NonCommercial 4.0 International License.
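The "to be shown later" kernel invocation can be filled in with the grid-of-threads pattern from lecture 2.3. The following sketch is illustrative (the kernel name, vector length, initialization values, and launch configuration are assumptions, not from the slides) and requires a CUDA-capable GPU:

```cuda
#include <cuda_runtime.h>

// One thread per element, with a bounds check for the last block.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    float *m_A, *m_B, *m_C;
    int n = 1000;
    int size = n * sizeof(float);
    cudaMallocManaged((void**) &m_A, size);
    cudaMallocManaged((void**) &m_B, size);
    cudaMallocManaged((void**) &m_C, size);

    // Managed memory is directly writable from the host: no cudaMemcpy needed.
    for (int i = 0; i < n; ++i) { m_A[i] = 1.0f; m_B[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(m_A, m_B, m_C, n);

    // Synchronize before the host reads managed memory written by the device.
    cudaDeviceSynchronize();

    cudaFree(m_A);
    cudaFree(m_B);
    cudaFree(m_C);
    return 0;
}
```

Compared with the explicit cudaMalloc/cudaMemcpy version, the same pointers are used on both host and device; the cudaDeviceSynchronize() call is what guarantees the device's writes to m_C are visible to the host.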