Full Transcript

HW/SW Co-Design for Real-Time AI, Graduate Program, Fall 2023
Instructor: Dr. Amin Safaei
Department: Computer Science
Duration: 13 sessions, 180 min each
Slide material: Xilinx 2021

Deep Learning Challenges (Recap)

3.1 Deep Neural Network
• Major challenges
  • Computationally intensive: deeper layers mean billions of compute operations per inference
  • Memory-bandwidth intensive: requires high-bandwidth memory access between layers; faster throughput demands more memory bandwidth
  • Deployment power efficiency: fast-growing market adoption calls for more power- and cost-efficient deployment solutions

3.2 Deep Neural Network
• There are several challenges in developing deep learning solutions:
  ✓ Challenge 1: Choose the right deep learning network
  ✓ Challenge 2: Billions of multiply-accumulate operations and tens of megabytes of parameter data
  ✓ Challenge 3: A continuous stream of new algorithms
• Pruning
  • Most neural networks are over-parameterized, carrying significant redundancy beyond what is needed to achieve a given accuracy.
  • Inference in machine learning is computation intensive and requires high memory bandwidth to meet the low-latency and high-throughput requirements of various applications.
  • Pruning is the process of eliminating redundant weights while keeping the accuracy loss as low as possible.

3.3 Deep Neural Network (Cont.)
• AI optimizer
  • The AI pruner (or VAI pruner) prunes redundant connections in neural networks and reduces the overall number of required operations.

3.4 Deep Neural Network (Cont.)
• Pruning methods
  • Fine-grained
  • Coarse-grained
• Coarse-grained pruning proceeds through sparsity determination, channel selection, accuracy evaluation, and iterative pruning (a minimal channel-pruning sketch appears at the end of this recap).
• The AI pruner is designed to reduce the number of model parameters while minimizing accuracy loss.
• Pruning results in some accuracy loss; retraining, or fine-tuning, recovers accuracy.

3.5 Deep Neural Network (Cont.)
• Quantization
  • Generally, 32-bit floating-point weights and activation values are used when training neural networks.
  • By converting the 32-bit floating-point weights and activations to 8-bit integer (INT8) format, the AI quantizer reduces computing complexity without losing prediction accuracy.
  • The fixed-point network model requires less memory bandwidth, thus providing faster speed and higher power efficiency than the floating-point model.
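To make the coarse-grained pruning idea from 3.4 concrete, here is a minimal, illustrative sketch in NumPy. It is not the VAI pruner itself: it only ranks the output channels of a single convolution layer by L1 norm and drops the weakest quarter, which is a common saliency heuristic for channel selection. The toy tensor and all values are invented for illustration.

```python
import numpy as np

# Toy convolution weight tensor: (out_channels, in_channels, kh, kw).
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 32, 3, 3))

# Rank output channels by L1 norm, a common coarse-grained saliency measure.
channel_l1 = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)

# Keep the strongest 75% of channels (25% channel sparsity for this layer).
keep = np.sort(np.argsort(channel_l1)[len(channel_l1) // 4:])
pruned = weights[keep]

print(weights.shape, "->", pruned.shape)  # (64, 32, 3, 3) -> (48, 32, 3, 3)
```

In an iterative flow, the model would be fine-tuned after each such step to recover accuracy before pruning further.

Similarly, the INT8 conversion described in 3.5 boils down to mapping floats onto an 8-bit integer grid. Below is a small sketch of symmetric post-training quantization; the actual AI quantizer calibrates scales per tensor and can fine-tune the model, so everything here is illustrative only.

```python
import numpy as np

def quantize_int8(x, scale):
    """Round float values onto a symmetric INT8 grid."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    """Map INT8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
activations = rng.normal(scale=0.5, size=1000).astype(np.float32)

# "Calibration": choose a scale that covers the observed dynamic range.
scale = float(np.abs(activations).max()) / 127.0

q = quantize_int8(activations, scale)
recovered = dequantize(q, scale)
print("max abs error:", float(np.abs(activations - recovered).max()))
```

The reconstruction error stays below half a quantization step, which is why a well-calibrated INT8 model loses little prediction accuracy while cutting memory traffic by 4x.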
Deep Learning Processor Unit (DPU)

4.1 Deep Learning Processor Unit (DPU)
• Introduction to the Deep Learning Processor Unit (DPU)
  • Programmable engine optimized for deep neural networks
  • Parameterizable IP cores
  • Released with the AI tools: Vitis AI and PYNQ
  • Accelerates algorithms widely adopted in computer vision applications:
    • Image or video classification
    • Semantic segmentation
    • Object detection
    • Tracking

4.1 Deep Learning Processor Unit (DPU)
• DPU Variants
  • Edge applications
    • DPUCZDX8G: Zynq-7000 SoCs and Zynq UltraScale+ MPSoCs
  • Cloud applications
    • DPUCADX8G: xDNN, Alveo U200/U250
    • DPUCAHX8H: high-throughput applications, Alveo U50/U280
    • DPUCAHX8L: low-latency applications, Alveo U50/U50LV/U280
    • DPUCADF8H: high-throughput applications, Alveo U200/U250
    • DPUCVDX8G: optimized for the Versal AI Core series
  https://thinklucid.com/triton-edge/

4.1 Deep Learning Processor Unit (DPU)
• Zynq UltraScale+ MPSoCs: DPUCZDX8G
  • The IP can be integrated as a block in the programmable logic
  • User configurable and exposes several parameters
  • Designed to be efficient and low latency
  • Supports the most commonly used network layers and operators, using hardware acceleration
  • Takes full advantage of the underlying Xilinx FPGA architecture to achieve the optimal trade-off between latency, power, and cost
  • Uses a specialized instruction set that enables it to run many convolutional neural networks efficiently

4.1 Deep Learning Processor Unit (DPU)
• Zynq UltraScale+ MPSoCs: DPUCZDX8G
  • Configuration module: provides user-configurable parameters to optimize resources or to support different features
  • Convolution computing module: has the processing engines that perform all the major convolution calculations
  • Data controller module: schedules the data flow in the DPU

4.1 Deep Learning Processor Unit (DPU)
• Alveo U200/U250 cards: DPUCADX8G
  • Classification
  • Object detection
  • Segmentation

4.1 Deep Learning Processor Unit (DPU)
• Alveo U200/U250 cards: DPUCADX8G
• Key features
  ✓ 96x16 DSP systolic array operating at 700 MHz
  ✓ Instruction-based programming model for simplicity and flexibility
  ✓ 9 MB of on-chip tensor memory composed of UltraRAM
  ✓ Distributed on-chip filter cache
  ✓ External DDR memory
  ✓ Pipelined scale, rectified linear unit, and pooling blocks
  ✓ Standalone pooling or elementwise execution block
  ✓ Hardware-assisted tiling engine
  ✓ Standard AXI-MemoryMapped and AXI4-Lite top-level interfaces
  ✓ Optional pipelined RGB tensor convolution engine

4.1 Deep Learning Processor Unit (DPU)
• Alveo U50/U280 cards: DPUCAHX8H
  • Programmable engine
  • Optimized for CNNs
  • High-throughput applications

4.1 Deep Learning Processor Unit (DPU)
• Alveo U50/U280 cards: DPUCAHX8H
• Consists of
  • High-performance scheduler
  • Hybrid computing array module
  • Instruction fetch unit module
  • Global memory pool module
• Uses a specialized instruction set that allows efficient implementation of many convolutional neural networks
  • Examples: VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, FPN
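Because each DPU variant implements a fixed instruction set, only supported operators run on the accelerator; anything else is partitioned onto the CPU at compile time. As a sketch (assuming a Vitis AI environment on the target and a hypothetical compiled model file; API details vary between Vitis AI releases), the XIR Python bindings can be used to inspect that partitioning:

```python
import xir  # ships with the Vitis AI runtime

# Load a compiled model (hypothetical path) and walk its subgraphs.
graph = xir.Graph.deserialize("resnet50.xmodel")
for sg in graph.get_root_subgraph().toposort_child_subgraph():
    device = sg.get_attr("device") if sg.has_attr("device") else "?"
    print(f"{sg.get_name()}: runs on {device}")  # "DPU" or "CPU"
```

A network drawn entirely from the supported operator set (like the examples listed above) typically compiles to a single DPU subgraph, which avoids costly round trips between the PL and the CPU.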
4.1 Deep Learning Processor Unit (DPU)
• Alveo U50/U50LV/U280 cards: DPUCAHX8L
  • Low-latency applications

4.1 Deep Learning Processor Unit (DPU)
• Alveo U50/U50LV/U280 cards: DPUCAHX8L
• Key features
  ✓ New low-latency DPU micro-architecture with an HBM memory subsystem supporting a 4 TOPS to 5.3 TOPS MAC array
  ✓ Supports back-to-back convolution and depthwise convolution engines
  ✓ Supports a hierarchical memory system (URAM and HBM) to maximize data movement

4.1 Deep Learning Processor Unit (DPU)
• Alveo U200/U250 cards: DPUCADF8H
  • High-throughput applications

4.1 Deep Learning Processor Unit (DPU)
• Alveo U200/U250 cards: DPUCADF8H
• Key features
  ✓ Throughput-oriented, high-efficiency computing engines
  ✓ Wide range of convolutional neural network support
  ✓ Support for compressed convolutional neural networks
  ✓ Optimized for high-resolution images

4.1 Deep Learning Processor Unit (DPU)
• Versal AI Core series: DPUCVDX8G

4.1 Deep Learning Processor Unit (DPU)
• DPU Options

4.1 Deep Learning Processor Unit (DPU)
• Use Cases

4.1 Deep Learning Processor Unit (DPU)
What is the DPUCZDX8G?
  ✓ Optimized for Xilinx MPSoC and SoC devices
  ✓ Integrated as a block in the programmable logic (PL) of Zynq-7000 SoCs and Zynq UltraScale+ MPSoCs, with direct connections to the processing system (PS)
  ✓ User configurable and exposes several parameters
  ✓ Designed to be efficient, low latency, and scalable
  ✓ Takes full advantage of the underlying Xilinx FPGA architecture to achieve the optimal trade-off between latency, power, and cost
  ✓ Uses a specialized instruction set that enables it to run many convolutional neural networks efficiently

4.1 Deep Learning Processor Unit (DPU)
DPUCZDX8G Hardware Architecture
  ✓ The DPUCZDX8G architecture is configurable, extensible, and provides multi-dimensional parallelism.
  ✓ The computing engine has the processing engines that perform all the major convolution calculations.
  ✓ The configuration module has the encoders and decoders that squeeze the network model size.

4.1 Deep Learning Processor Unit (DPU)
DPUCZDX8G IP Core: Supported CNN Operations

4.1 Deep Learning Processor Unit (DPU)
DPUCZDX8G IP Core: Data Flow
• Image pre-processing: the processing that occurs on each image or frame before it is fed to the network. Each neural network may require pre-processing, for example color space conversion or video decode.
• Compute: the processing that occurs in the DPU, accelerating the elements of the network graph in the PL of the Zynq device using the DPUCZDX8G IP core, for example CONV, POOL, FC, and ReLU.
• Image post-processing: the processing that occurs after inference. The DPUCZDX8G outputs vary depending on the network goal and topology.
[Figure: DPUCZDX8G IP core data flow, shown step by step across several slides]

4.1 Deep Learning Processor Unit (DPU)
Performance of Different Models
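The three data-flow stages above map directly onto application code. Below is a sketch of one inference pass using the VART (Vitis AI Runtime) Python API; the model file is hypothetical, random INT8 data stands in for a real pre-processed image, and buffer dtypes and fixed-point scaling attributes differ between Vitis AI releases, so treat this as an outline rather than a drop-in program.

```python
import numpy as np
import vart
import xir

# Locate the DPU subgraph inside a compiled model (hypothetical file).
graph = xir.Graph.deserialize("model.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_sg = next(s for s in subgraphs
              if s.has_attr("device") and s.get_attr("device") == "DPU")
runner = vart.Runner.create_runner(dpu_sg, "run")

in_tensor = runner.get_input_tensors()[0]
out_tensor = runner.get_output_tensors()[0]

# Pre-processing would happen here (resize, color conversion, scaling to
# the tensor's fixed-point format); random INT8 data stands in for a frame.
input_data = np.random.randint(-128, 128, tuple(in_tensor.dims),
                               dtype=np.int8)
output_data = np.empty(tuple(out_tensor.dims), dtype=np.int8)

# Compute: dispatch the job to the DPU and block until it finishes.
job_id = runner.execute_async([input_data], [output_data])
runner.wait(job_id)

# Post-processing: for a classification network this might be
# softmax + argmax over the output tensor.
print("top class index:", int(np.argmax(output_data)))
```

Pre- and post-processing run on the PS (or elsewhere in the PL), so keeping them lightweight matters just as much for end-to-end latency as the DPU compute itself.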
