23-24 - M2AI - DL4CV - 1 - Deep Learning 175-200.pdf

Full Transcript

Slide 175 — Inference challenges

Requirements:
- Memory to store feature maps and weights
- Processing speed

Challenges:
- Model size vs. memory size
- Compute capability vs. operations per image

Implementation scenarios:
- GPP: Intel, AMD
- GPGPU: PC + GeForce, NVIDIA Jetson, …
- Embedded (ARM) + accelerator: smartphones, Raspberry Pi
- FPGAs/SoCs: Xilinx, Altera, Lattice
- ASICs: Google TPU, Intel NCS2 (Myriad X), AWS Inferentia
- Cloud: AWS, Azure, …

Model simplification / model compression:
- Remove redundant kernels: fewer weights + fewer operations
- Quantize weights: smaller weights + faster operations

José Henrique Brito | 2Ai - EST - IPCA, PT | DL4CV @ M2AI

Slide 176 — Inference challenges: model simplification / model compression

- Pruning: removing redundant weights or kernels
  - Fewer weights – lower memory requirements
  - Fewer operations – faster inference time
- Quantizing: using fewer bits to store weights and features
  - Smaller weights/features – lower memory requirements
  - Faster operations – faster inference time
- Knowledge distillation: train a smaller, weaker network to produce outputs similar to those of a good large network
  - Fewer weights – lower memory requirements
  - Fewer operations – faster inference time

Slide 177 — Model Pruning

Reduce computation time at the cost of reduced accuracy.

Removing 1 neuron implies:
- Removing the weights and bias of the removed neuron
- Removing the memory storage of the neuron
- Removing the weights of all following neurons connected to the removed neuron
- Retraining to recover some of the lost accuracy

https://pohsoonchang.medium.com/neural-network-pruning-update-cda56343e5a2

Slide 178 — Model Pruning: removing neurons vs. kernels

Removing 1 kernel implies:
- Removing the kernel
- Removing the resulting feature map from memory
- Removing the corresponding input channel of all kernels in the following layer

Several possible strategies:
- Kernels with lower values (L1/L2 norm)
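A minimal sketch of the magnitude-based strategy just listed: rank kernels by L1 norm and keep only the strongest fraction. Toy Python lists stand in for real convolution kernels, and the helper names (`l1_norm`, `prune_kernels`) are illustrative, not a library API.

```python
# Illustrative magnitude-based (L1-norm) kernel pruning on toy data.
# Each "kernel" is a flat list of weights; kernels with the smallest
# L1 norm are assumed to contribute least and are removed.

def l1_norm(kernel):
    """Sum of absolute weight values of one kernel."""
    return sum(abs(w) for w in kernel)

def prune_kernels(kernels, keep_ratio=0.5):
    """Return the sorted indices of the kernels to keep, ranked by L1 norm."""
    ranked = sorted(range(len(kernels)),
                    key=lambda i: l1_norm(kernels[i]),
                    reverse=True)
    n_keep = max(1, int(len(kernels) * keep_ratio))
    return sorted(ranked[:n_keep])

# Four toy 1D "kernels": the two low-magnitude ones get pruned.
kernels = [
    [0.9, -1.2, 0.8],    # L1 = 2.9  -> kept
    [0.01, 0.02, 0.0],   # L1 = 0.03 -> pruned
    [-0.7, 0.6, 0.5],    # L1 = 1.8  -> kept
    [0.05, -0.04, 0.1],  # L1 = 0.19 -> pruned
]
kept = prune_kernels(kernels, keep_ratio=0.5)
print(kept)  # [0, 2]
```

In a real network, dropping a kernel also removes its output feature map, so the matching input channel of every kernel in the following layer is removed as well, and the model is typically retrained to recover accuracy.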
- Structured pruning
- Smallest effect on the activations of the next layer
- Minimize the feature-map reconstruction error of the next layer
- Network pruning as architecture search

https://medium.com/@anuj_shah/model-pruning-in-keras-with-keras-surgeon-e8e6ff439c07

Slide 179 — Model Pruning: references

- https://github.com/tensorflow/model-optimization
- Yann LeCun, John Denker, Sara Solla, “Optimal Brain Damage”, NeurIPS 1989, https://proceedings.neurips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html
- Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, Trevor Darrell, “Rethinking the Value of Network Pruning”, ICLR 2019, https://github.com/Eric-mingjie/rethinking-network-pruning
- Julieta Martinez, Jashan Shewakramani, Ting Wei Liu, Ioan Andrei Barsan, Wenyuan Zeng, Raquel Urtasun, “Permute, Quantize, and Fine-Tune: Efficient Compression of Neural Networks”, CVPR 2021, https://github.com/uber-research/permute-quantize-finetune

Slide 180 — Quantization

Weights are normally stored and used as floats (32-bit floating point). Quantization simplifies weights to use integers with fewer bits (reduced precision):
- Smaller model
- Faster operations

Different possibilities for the quantization balance:
- 8 bits for weights and features
- 4 bits for weights and features
- 2-bit weights, 6-bit features
- 1-bit weights, 8-bit features
- 1-bit weights, 32-bit features

Possible approaches:
- Quantize weights after training
- Retrain the model with reduced precision
- Quantize weights in the training phase

Lu Hou, James T. Kwok, “Loss-aware Weight Quantization of Deep Networks”, ICLR 2018
https://www.tensorflow.org/lite/performance/post_training_quantization

Slide 181 — Mobile/Embedded AI

Implementing on devices with limited resources usually involves pruning and quantization.
- https://www.tensorflow.org/lite
- https://www.udacity.com/course/intro-to-tensorflow-lite--ud190
- https://pytorch.org/mobile (deprecated)
- https://pytorch.org/edge
- Getting Started with AI on Jetson Nano: https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+S-RX-02+V2

TinyML

On-device TinyML applications usually also rethink the network architecture, e.g. SqueezeNet for image classification: Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer (2016), “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and
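The "quantize weights after training" approach from the Quantization slide can be sketched as a toy affine scheme in plain Python: each float weight maps to an 8-bit integer via q = round(w / scale) + zero_point. The `quantize`/`dequantize` helpers are hypothetical illustrations, not a real library API such as TensorFlow Lite's converter.

```python
# Illustrative 8-bit post-training quantization of a flat weight list.
# Affine scheme: q = round(w / scale) + zero_point, with q clamped to [0, 255].

def quantize(weights, num_bits=8):
    """Affine-quantize float weights to unsigned integers; returns (q, scale, zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (qmax - qmin) or 1.0  # avoid zero scale for constant weights
    zero_point = round(qmin - w_min / scale)
    q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zp = quantize(weights)
recovered = dequantize(q, scale, zp)
# Each recovered weight lies within one quantization step of the original.
assert all(abs(w - r) <= scale for w, r in zip(weights, recovered))
```

Storing 8-bit integers instead of 32-bit floats cuts the weight memory by 4x, which is the "smaller weights – lower memory requirements" trade-off described above; the precision lost is bounded by the step size `scale`.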
