INDEX (S.No., Topic, Page No.)

Week 1
1. Introduction to AI Systems Hardware - part 1 (p. 4)
2. Introduction to AI Systems Hardware - part 2 (p. 35)
3. Introduction to AI Accelerators, GPUs (p. 52)
4. Introduction to Operating Systems, Virtualization, Cloud - part 1 (p. 84)
5. Introduction to Operating Systems, Virtualization, Cloud - part 2 (p. 97)

Week 2
6. Introduction to Containers and IDE, Dockers - part 1 (p. 115)
7. Introduction to Containers and IDE, Dockers - part 2 (p. 128)
8. Scheduling and Resource Management - part 1 (p. 163)
9. Scheduling and Resource Management - part 2 (p. 178)
10. DeepOps: Deep Dive into Kubernetes with deployment of various AI based Services - part 1 (p. 200)
11. DeepOps: Deep Dive into Kubernetes with deployment of various AI based Services - part 2 (p. 212)

Week 3
12. DeepOps: Deep Dive into Kubernetes with deployment of various AI based Services, Session II - part 1 (p. 230)
13. DeepOps: Deep Dive into Kubernetes with deployment of various AI based Services, Session II - part 2 (p. 246)
14. Design Principles for Building High Performance Clusters - part 1 (p. 291)
15. Design Principles for Building High Performance Clusters - part 2 (p. 310)
16. Design Principles for Building High Performance Clusters - part 3 (p. 328)
17. Design Principles for Building High Performance Clusters - part 4 (p. 342)

Week 4
18. Introduction to PyTorch - part 1 (p. 355)
19. Introduction to PyTorch - part 2 (p. 371)
20. Introduction to PyTorch - part 3 (p. 391)
21. Introduction to PyTorch - part 4 (p. 414)
22. Profiling with DLProf, PyTorch Catalyst - part 1 (p. 444)
23. Profiling with DLProf, PyTorch Catalyst - part 2 (p. 472)

Week 5
24. Introduction to TensorFlow - part 1 (p. 500)
25. Introduction to TensorFlow - part 2 (p. 516)
26. Accelerated TensorFlow - part 1 (p. 538)
27. Accelerated TensorFlow - part 2 (p. 554)
28. Accelerated TensorFlow, XLA Approach - part 1 (p. 571)
29. Accelerated TensorFlow, XLA Approach - part 2 (p. 598)

Week 6
30. Optimizing Deep Learning Training: Automatic Mixed Precision - part 1 (p. 635)
31. Optimizing Deep Learning Training: Automatic Mixed Precision - part 2 (p. 661)
32. Optimizing Deep Learning Training: Transfer Learning - part 1 (p. 703)
33. Optimizing Deep Learning Training: Transfer Learning - part 2 (p. 734)

Week 7
34. Fundamentals of Distributed AI Computing, Session 1 - part 1 (p. 811)
35. Fundamentals of Distributed AI Computing, Session 1 - part 2 (p. 828)
36. Fundamentals of Distributed AI Computing, Session 2 - part 1 (p. 847)
37. Fundamentals of Distributed AI Computing, Session 2 - part 2 (p. 869)
38. Distributed Deep Learning using TensorFlow and Horovod (p. 901)

Week 8
39. Challenges with Distributed Deep Learning Training Convergence (p. 940)
40. Fundamentals of Accelerating Deployment - part 1 (p. 972)
41. Fundamentals of Accelerating Deployment - part 2 (p. 982)

Week 9
42. Accelerating Neural Network Inference in PyTorch and TensorFlow - part 1 (p. 997)
43. Accelerating Neural Network Inference in PyTorch and TensorFlow - part 2 (p. 1024)
44. Accelerated Data Analytics - part 1 (p. 1045)
45. Accelerated Data Analytics - part 2 (p. 1059)
46. Accelerated Data Analytics - part 3 (p. 1084)
47. Accelerated Data Analytics - part 4 (p. 1110)
48. Accelerated Machine Learning (p. 1138)

Week 10
49. Scale Out with DASK (p. 1171)
50. Web Visualizations to GPU Accelerated Crossfiltering - part 1 (p. 1197)
51. Web Visualizations to GPU Accelerated Crossfiltering - part 2 (p. 1216)
52. Accelerated ETL Pipeline with SPARK - part 1 (p. 1263)
53. Accelerated ETL Pipeline with SPARK - part 2 (p. 1287)

Week 11
54. Applied AI: Smart City (Intelligent Video Analytics), Session 1 - part 1 (p. 1328)
55. Applied AI: Smart City (Intelligent Video Analytics), Session 1 - part 2 (p. 1352)
56. Applied AI: Smart City (Intelligent Video Analytics), Session 2, Deepstream - part 1 (p. 1374)
57. Applied AI: Smart City (Intelligent Video Analytics), Session 2, Deepstream - part 2 (p. 1398)
Week 12
58. Applied AI: Health Care, Session I - part 1 (p. 1419)
59. Applied AI: Health Care, Session I - part 2 (p. 1430)
60. Applied AI: Health Care, Session II - part 1 (p. 1460)
61. Applied AI: Health Care, Session II - part 2 (p. 1490)

Applied Accelerated AI
Dr. Satyajit Das
Department of Computer Science and Engineering
Indian Institute of Technology, Palakkad

Lecture - 01
Introduction to AI Systems Hardware part 1

Hello and welcome to the Applied Accelerated Artificial Intelligence course. I am Satyajit Das. I will be handling the Introduction to AI Systems Hardware and system software, and I will also be covering some of the SDKs with PyTorch and TensorFlow. So that is about it; let us get started.

(Refer Slide Time: 00:40)

These are the contents of today's lecture. We will start with the introduction and see some applications of AI in modern days, just to get motivated, and then we will talk about computing systems from the perspective of AI. That is, from the perspective of running artificial intelligence benchmarks: what modern systems are available, how they scale, how we can use them, what shortfalls exist, and how we can minimize the gap between the requirements and the systems already available. We will also talk about some of the computing engines. We may not cover every computing engine available today, but we will try to cover some of them to see what is out there.

(Refer Slide Time: 01:49)

Let us see one application: Amazon Go.

(Refer Slide Time: 01:56)

Amazon Go gives you the flexibility of a seamless shopping experience.

(Refer Slide Time: 02:02) (Refer Slide Time: 02:04)

You go into a shop.

(Refer Slide Time: 02:06)

You just take your things.

(Refer Slide Time: 02:08)

And you just walk out of the shop.

(Refer Slide Time: 02:10) (Refer Slide Time: 02:13)

Your transaction will be charged automatically.

(Refer Slide Time: 02:15) (Refer Slide Time: 02:19)

So you have this seamless shopping experience, and a lot of AI is applied here.

(Refer Slide Time: 02:23) (Refer Slide Time: 02:27)

We will talk about what kind of algorithms or benchmarks are being used.

(Refer Slide Time: 02:27) (Refer Slide Time: 02:32)

Another application comes from the retail clothing scenario.

(Refer Slide Time: 02:34) (Refer Slide Time: 02:34)

You go into a store and, without any contact, you can try out the clothing available for your shape and size.

(Refer Slide Time: 02:41)

As you can see, you can just stand in front of the mirror and try out several clothes depending on your requirements.

(Refer Slide Time: 02:55)

You can try out similar items as well.

(Refer Slide Time: 02:59) (Refer Slide Time: 03:03)

And you can buy them as you go. So, without even going into the trial rooms, you can have this seamless experience.

(Refer Slide Time: 03:15) (Refer Slide Time: 03:17)

Another application, of course, which everybody knows nowadays, is automated cars.

(Refer Slide Time: 03:24) (Refer Slide Time: 03:25) (Refer Slide Time: 03:27)

So, basically self-driving vehicles.

(Refer Slide Time: 03:30) (Refer Slide Time: 03:31)

So, if you look at the different applications of computer vision and deep learning.
(Refer Slide Time: 03:37) Refer Slide Time: 03:40) 20 ( (Refer Slide Time: 03:43) You will see it is most intense use of artificial intelligence in terms of application nowadays. Self driving cars. Refer Slide Time: 03:47) 21 ( (Refer Slide Time: 03:48) Again you have AI application in the area of medical imaging and automated operations. Refer Slide Time: 03:57) 22 ( (Refer Slide Time: 04:01) So, based on the images that are available you can actually have. Refer Slide Time: 04:03) 23 ( Like which way will be the most efficient and reliable way to go into one tumor that is there inside your brain. (Refer Slide Time: 04:13) And inside someone’s brain and it can track that without any invasive measures right. So, these are the applications of AI and there is numerous applications you have heard of. Refer Slide Time: 04:29) 24 ( But the main question is what is artificial intelligence and here we have many definitions available. But mostly accepted definition was given by John McCarthy in the year of 2004 in one paper published from IBM and it says that it is the science and engineering of making intelligent machines especially intelligent computer programs. So, the programs that are intelligent that are not anymore the rule based or conventional programming method right. And it is related to the similar task of using computers to understand human intelligence ok. So, that is the main purpose of emulating in human intelligence into the machines through these computer programs or intelligent computer programs. But AI does not have to confine itself to methods that are biologically observable of course, this is the generic acceptable definition of artificial intelligence. But broadly it has evolved vastly from the earlier rule-based approach to solve intelligent problems to learning based approach. And these learning-based approaches became very popular with the advancement of new computing systems that are like GPUs and FPGAs and different systems like that. Refer Slide Time: 06:08) 25 ( So, as I was mentioning that learning based approaches to solve or to emulate human intelligence into computer programs have been very popular nowadays. And these are mostly widely named as deep learning and machine learning; of course, we will use them interchangeably and it is important to understand the nuances between ok. So, deep learning is basically comprising of neural networks which has more than three layers ok three or more layers. So, it includes the input and output layer as you can see here and deep learning automates this learning process by extracting the features from the data available. And that is completely or mostly automated because this does not need the human intervention as much as the classical machine learning needs that intervention of human dependence right. So, that is kind of difference between these two concepts, but again deep learning is one subset of machine learning as you can see here right. And now the most important thing is that what kind of computations or what kind of complexity these benchmarks have. And if you understand that then it will be very easy to understand what kind of computing systems that we can engage them for right. So, the configurations or if you do not understand the complexity of the algorithm, it will be hard to relate to the computing systems that are underlying or that will be running 26 your benchmarks right. So, from all these benchmarks that are available in deep learning that are mostly neural network based benchmarks. 
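To make the "three or more layers" distinction concrete, here is a minimal sketch of such a network in PyTorch. The layer widths and the 784-input/10-class shapes are illustrative placeholders, not values from the lecture:

    import torch
    import torch.nn as nn

    # A minimal "deep" network in the sense used above: an input layer,
    # at least one hidden layer, and an output layer (three or more layers).
    # The sizes (784 inputs, 128 hidden units, 10 classes) are arbitrary
    # placeholders chosen only for illustration.
    model = nn.Sequential(
        nn.Linear(784, 128),   # input layer -> first hidden layer
        nn.ReLU(),
        nn.Linear(128, 128),   # second hidden layer
        nn.ReLU(),
        nn.Linear(128, 10),    # output layer (e.g. 10 classes)
    )

    x = torch.randn(32, 784)   # a batch of 32 dummy inputs
    logits = model(x)          # forward pass; features are learned, not hand-crafted
    print(logits.shape)        # torch.Size([32, 10])
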
(Refer Slide Time: 08:17) But they are evolving day by day and that is mostly because of different algorithmic innovation. So, every year you can see hundreds thousands of paper being published from the algorithmic innovation point of view in machine learning and deep learning and their applications. Of course, when I say applications they are not confined to only deep learning or machine learning that they can be applied to let’s say natural language processing or text processing, signal processing or maybe speech processing or got it right. But of course, these core benchmarks will be used for different applications including the computer vision also that that application that we just saw as let us say in self driving cars or Amazon Go or different other applications. So, these algorithmic evaluation innovations are happening along with the more and more availability of data and nowadays we have abundance of data and in terms of learning the more data you have intuitively your learning will be better ok. So, the amount of compute or computation you need to do you train these algorithms are also getting more or more; because you have more and more data and you have to compute or you have to process those data into the systems that is available at your disposal, but of course, the amount increases and the compute intensity also will increase right. 27 So, now with this factor of advancement of AI, it is necessary to understand the computation complexity of AI algorithms right. (Refer Slide Time: 10:18) So, this paper was published last year in archive preprint as you can see here and of course, you will get to learn a lot about these neural network algorithms with basics and different algorithmic innovations as well as different systems that are available to process them this is a good read. So, in the reference I will give a link to this paper. But if you see this graph, this graph shows a very like summarized way of representing these algorithms in terms of their computational complexity right. So, if you see the x axis, x axis in this graph represents what the top 5 error. Now what is top 5 error? Top 5 error is that these algorithms are supposed to learn something and depending on their learned parameters or whatever they have learnt for. So, basically the parameters that we call them or the model itself they will probably produce different possibilities for your output. Let’s say one classification problem ok. So, this algorithms that you are seeing here like EfficientNet, DenseNet, ResNet and all these different DNN or specifically CNN networks or Convolutional Neural Networks they are supposed to provide you the classification of one data set called ImageNet ok. Now what kind of possibilities they are producing whether that represents the true class or not. So, that is the measure of the error percentage and top 5 error is if your guess top five 28 guess of these algorithms lie in that your true class of this particular image or video that you are trying to classify if they lie on their top 5 cases. Then how much percentage of time they can guess that. So, that is the measure of this top 5 error. So; that means, if you want to increase the accuracy so, this is basically then the graph showing the accuracy versus the number of operations need to be processed for these benchmarks. And these number of operations you can see in the y axis as represented as GOPS or giga operations per second you can see that in many places they are also referred to as gigaflops. 
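As a small illustration of the top-5 error just described, the following sketch computes it from a batch of model outputs. The batch size and the 1000-class, ImageNet-style setting are assumptions made only for the example:

    import torch

    def top5_error(logits: torch.Tensor, targets: torch.Tensor) -> float:
        """Fraction of samples whose true class is NOT among the model's
        five highest-scoring predictions (top-5 accuracy is 1 minus this)."""
        top5 = logits.topk(5, dim=1).indices             # (batch, 5) best guesses
        hit = (top5 == targets.unsqueeze(1)).any(dim=1)  # true class in the top 5?
        return 1.0 - hit.float().mean().item()

    # Dummy example: 8 samples, 1000 ImageNet-style classes.
    logits = torch.randn(8, 1000)
    targets = torch.randint(0, 1000, (8,))
    print(top5_error(logits, targets))
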
So, floating point operations also the same as GOPS; because all the competitions will be in these benchmarks will be mostly for floating point arithmetic right. So, now you have top 5 error you want to increase the accuracy of this benchmarks and you need to also increase the number of computations that you need to do for this benchmarks to gave more accurate. Now, this blue line you can see here is the accuracy measure right, it is linear. So, if you want to get linear increase in the accuracy the computations you can see the evolvement of the computations get exponentially increased ok. And this is the factor you need to understand before going into implementation of these algorithms that you will see in the subsequent classes right. So, now this is one fact or one take away from this slide you will remember that to increase or to get linear accuracy you need to increase the computations exponentially right. Now we will look into the computing systems that are available and we will try to relate what kind of computing systems will be more suited for this kind of computational density or computational complexity that are being incurred by this AI benchmarks right. (Refer Slide Time: 14:45) 29 So, of course, traditional computing systems as you know that we have computation engine and then we have memory, we have memory system different levels or hierarchy of memory is there, that we will see in the next slide and you have the communication unit. Now if your memory is on chip then you have on chip interconnect and if your memory is staying outside like dynamic memory like DRAM or your storage system. So, these are often housed outside of the chip and then you need off chip interconnect to get access to this memory right. (Refer Slide Time: 15:24) And if you see the hierarchy of the entire memory system in the modern computing system you have the register file at the lowest end. And then you have different levels of caches 30 and L1, L2, L3 and then you have your main memory which is your RAM dynamic RAM. And then you have the swap disk to interchange programs between the swap disk and your main memory depending on the locality or availability of the program right. And that is mostly automated and controlled by demand paging. So, most of these concepts also we will see in the subsequent class in system software, but the most important thing in this slide to look at that the speed of memory accesses. Now if you see in the register files the memory access speed is highest, then you have L1 cache in the range of nanoseconds. Then you have many nanoseconds in L2 cache; then some more nano nanoseconds in L 3 cache and few hundred nanosecond in your main memory access. So, DRAM access and then you have 10s of microseconds to milliseconds in the access of swap memory or swap disk driven. Now, in the modern systems of course, your swap disk is extendable to your remote storage through different gateways that you can increase and it can have several TBs even petabytes of swap disks available. But just to look at the local system then that you can have the sizes vary from words to KBs to several MBs in the last level of cache and then several GBs in your dynamic memory or dynamic RAM right. Now why this hierarchy is necessary? Because if you see the L1 level of cache this gives you a very faster access to the computing unit because the computations are much more faster. 
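Before continuing, a rough illustration hints at why this hierarchy matters. The sketch below times a sum over data touched contiguously (cache-friendly) versus through a random permutation (mostly misses that fall through to DRAM); the array size is arbitrary and the absolute timings depend entirely on the machine, so only the gap between the two is meaningful:

    import time
    import numpy as np

    # Rough illustration only: contiguous access streams through the caches,
    # while the random gather defeats spatial locality and hits DRAM far more.
    n = 20_000_000
    data = np.ones(n, dtype=np.float32)
    perm = np.random.permutation(n)

    t0 = time.perf_counter()
    s_seq = data.sum()            # sequential, cache-friendly traversal
    t1 = time.perf_counter()
    s_rnd = data[perm].sum()      # random gather, mostly cache misses
    t2 = time.perf_counter()

    print(f"sequential: {t1 - t0:.3f} s, random: {t2 - t1:.3f} s")
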
And you need the data available to the computation unit much more faster than the dynamic RAM; because dynamic RAM you can see the access is higher right the access time is higher. So, at the lower level you go the access time will increase the performance of access time will increase; that means, you will get access in less time. And if you go in the upper level the performance will degrade and also you can see that it can house larger size ok. So, the sizes also you can see. Now, why this is important? This is important because the computation unit now see that it has now that illusion of high or several level of memory is giving you like high volume ok. So, larger space so, this illusion of larger space with very short access time is giving you this hierarchy ok. So, this is why this hierarchy is important now; why it is important 31 from AI computational AI benchmarks point of view? Because from AI benchmarks we have seen that we need lot of data to be accessed for training these benchmarks right. And when we talk about data, memory is the first thing it will come into mind right because our memory is where you will store the data. So, what modern systems are doing is that, because if you see that access times will be very low for this L1 level of cache then you keep it the closest to your computing in it that that, but your processor of code. And then you keep your L2 cache then you keep L 3 cache and you can keep your main memory or DRAM and swap disk off chip ok. (Refer Slide Time: 19:51) And that is what actually happening nowadays and as I was mentioning that the benchmarks are bottlenecked by the data access the AI algorithms are extremely data hungry, if you can see that how much data computation it needs per second right you have seen the graph before. So, for these memory accesses another key factor is the energy consumption. So, size, the access time that we have seen in the previous slide now from the point of view of energy consumption this memory hierarchy is also important; because the more energy you will spend or accessing memory the more heat you will generate and you need more cooling system to employ or deploy to cool your systems and it is much more cost intensive than generating one system or chip right. 32 So, that is why energy consumption is a key limit of this data movement and data movement is itself is energy hungry; because if you see this graph of the data access energy consumption versus data computation. So, basically this in x axis you will see that these are the different operations. So, of course, AI benchmarks or DNN benchmarks are widely dominated by addition multiplication and multiply accumulates. So, addition and multiplication operations energy consumption you can see here. And DRAM access and SRAM access so, SRAM is static RAM which is which technology is mostly employed in the caches. And for rams your drams you can see DRAM access and SRAM access the energy consumption is much higher. And if you go into DRAM it is even several orders of magnitude higher than the computation itself. And if you just compare with add operation this DRAM axis needs almost 6400 times of energy to access this DRAM right. So, SRAMs or cache based memory that are on chip because they give you very short access time that we have seen in the previous slide. And now you can see here SRAM is also giving you much more energy efficiency in terms of memory accesses and DRAM is giving you much more energy consumption right. 
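A quick back-of-envelope calculation, using only the roughly 6400x DRAM-versus-add ratio quoted above, shows why data movement dominates. The per-add energy unit and the pessimistic "every operand comes from DRAM" assumption are placeholders for illustration, not numbers from the lecture:

    # Back-of-envelope using the quoted ratio: one DRAM access costs on the
    # order of 6400x the energy of one addition.
    E_ADD = 1.0                  # energy of one add, in arbitrary units (assumed)
    E_DRAM = 6400 * E_ADD        # energy of one DRAM access, per the quoted ratio

    # A naive NxN matrix multiply does ~2*N^3 add/multiply operations, and in the
    # worst case re-fetches operands from DRAM on the order of N^3 times.
    N = 1024
    compute_energy = 2 * N**3 * E_ADD
    dram_energy = N**3 * E_DRAM  # pessimistic: no cache reuse at all

    print(f"compute: {compute_energy:.2e} units")
    print(f"DRAM   : {dram_energy:.2e} units "
          f"({dram_energy / compute_energy:.0f}x the compute energy)")
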
Now what means this is the trend that you can see here, but what this to be done. (Refer Slide Time: 22:57) So, modern memory systems what they do is that. So, they employ the L2 caches the L1 caches inside the core itself. So, in this cores that you are seeing here core 0 core 1 core 2 33 core 3. So, this is the core that is available and inside this cores you have L1 cache it is not particularly visible here, but L2 cache you can see prominently L2 cache level cache you can see on chip and then it has shared L3 level of cache on chip and DRAM is a kind of object because you need higher memory density and it needs to deploy much larger space of memory ok. (Refer Slide Time: 23:46) Now, if you see from the AI computation point of view, I will just give you one example of how much memory that are deployed inside the chip itself to get a larger memory bandwidth for running the AI benchmarks. So, this chip or this proposal was published in the year of 2019 and this is Cerebras Wafer Scale Engine. So, basically this whole chip is basically one wafer itself and all the other chips that you see are basically separate; in one wafer you will have several or several number of chips, but this is a wafer scale engine and this is one ML compute engine. Basically it has a lot of multiply and accumulate unit stored as or arranged as an array of processing engines. And this particular engine has 400,000 of this kind of cores ok and on chip memory it has around 18 GB of on chip memory. And this if you compare the size of this chip compared to the GPU that are available. So, the largest GPU that is available nowadays is NVIDIAs Ampere GA100 GPU; and that has around 5 54.2 billion transistors. And this particular WSE or Cerebras WSE is having 1.2 trillion of transistors. 34 So, you can imagine that how much bigger it is in size. And as well as to accelerate to accommodate all the computations that needed for different AI benchmarks that you have seen before if it employs 18 GB of on chip memory and that gives it 9 petabytes per second memory bandwidth. So, 9 petabytes per second data that you can access of course, this is a full precision data that we are talking about. (Refer Slide Time: 25:55) Now, one next generation of this wafer scale engine was published in 2021 and that has 850000 cores and that can deploy 40 GB of on chip memory and having a 20 petabytes per second of memory bandwidth. So, you can imagine how much the memory it is employed ok. So, just to have this shorter access time and reduced energy and. So, all these things we have seen in the previous slide right. But of course, these are very highly specific of ML accelerators or ML compute engine or AI engine; must we will see much more generalized systems that are available and mainstream devices that are available ok. So, that brings us to the section, where we will talk about this specialized computation engines. 35 Applied Accelerated AI Dr. Satyajit Das Department of Computer Science and Engineering Indian Institute of Technology, Palakkad Lecture - 02 Introduction to AI Systems Hardware Part-2 (Refer Slide Time: 00:40) (Refer Slide Time: 00:41) 36 Now, if you see from the computation engine so, processors ok; so, is the basic computation engine. And, as you know that you cannot increase the clock speed anymore due to the end of Dennard scaling and the power wall is heating and that’s why you cannot get more than a particular fixed clock frequency that is available nowadays in the processors or computation engines. 
So, how we will get much more compute density? So, at in terms of speed we want to increase the density as well as computation, because you have to have the flexibility of including more number of transistors in a chip. So, that is fine and in terms of clock speed also you cannot increase the clock speed anymore. So, compute density you cannot increase after a certain limit. (Refer Slide Time: 01:38) So, what the traditional systems or the modern systems are going towards is that this heterogeneous computing. So, where you have the serial tasks or mostly the parallel workloads ok which are not that much of data intensive; they are being controlled or they are being executed by the processor itself. So, that might be multi-core processors, single core processor. So, in this particular figure you are seeing that if it has a dual core processor. Now, also in addition to processing engines to get the data parallel workloads executed, we have specialized computing engines and these specialized computing engines are called accelerators. Of course, you will see several accelerators that are being used for 37 particularly AI benchmarks nowadays, but in this slide you can see that one GPU is there to handle these data parallel workloads right. And, one such accelerator or specialized system that you have seen in the last slide is Cerebras wafer scale engine right. So, now, the tradition is what divide your task into different subtasks and of course, this the data parallel workloads. So, mostly this AI benchmarks that we are talking about. So, from the AI benchmarks for perspective, these AI benchmarks or AI algorithms those will be executed by these accelerators. So, that is why we are talking about GPUs and because of what that you will see in the subsequent slides. But, the main important thing is that we have now accelerators in the system in the computing system with along with the processors ok. So, that is the main take away from this slide. So, now, what kind of accelerators that are available nowadays. (Refer Slide Time: 02:58) So, we talk about specialized computation engines for AI benchmarks right and over the years that you can see here this graph shows the trends from 2012 to 2020 and you can see now different computation engines ok. So, we will discuss them very briefly here. So, what we have? We have ASICs; so, ASICs are ASIC engines or Application Specific Integrated Circuits ok. So, these ASICs are mostly specialized like highly specialized only for the AI benchmarks and also we have GPUs available in this graph. Now, GPUs can give you the flexibility to 38 run both AI benchmarks as well as let’s say video or graphics benchmarks as well ok. So, now, you can ask you can just try to realize like how much generalized way of computing that can happen. So, you have the processors which are very generalized. So, maybe the you can say that general purpose computing engine like you can do everything in the computing in your processors. Now, you have ASICs which are highly specialized only for AI benchmarks. Here we will talk about AI based ASICs of course, for any other application domain you can have different ASICs as well. And, we have GPUs which are Graphics Processing Unit, graphical processing unit. So, which has now nowadays have the flexibility of ah accelerating your AI benchmarks as well, while that we will see in the coming slides. But, you can see here what are the things available here. So, ASICs, GPUs with FP32 so, what is FP32? That is Floating Point 32. 
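Before continuing with data types, here is a minimal sketch of the heterogeneous model just described: the host (CPU) prepares FP32 data and the accelerator, if one is present, executes the data-parallel work. The matrix sizes are arbitrary placeholders:

    import torch

    # Host prepares the workload; the accelerator executes the data-parallel part.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    a = torch.randn(4096, 4096, dtype=torch.float32)  # created on the host (FP32)
    b = torch.randn(4096, 4096, dtype=torch.float32)

    a_dev = a.to(device)        # offload the data to the accelerator
    b_dev = b.to(device)
    c_dev = a_dev @ b_dev       # data-parallel work runs on the device

    c = c_dev.to("cpu")         # result copied back to the host
    print(device, c.shape, c.dtype)
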
Now, this is data type. So, now data type and the accuracy that you have seen for the AI benchmarks have very close relations ok. So, that relationship we will talk about and you have your GPU INT also 8, INT 8 means 8 bit integer units ok, GPUs with 8 bit integer units, GPUs with 32 bit floating point units. So, 32 bit floating point units means single precision floating point units, you can have double precision floating point units as well which will be then 64 bits right. And, these ASICs are a highly customized bit level implementation of computing systems for AI benchmarks. And, that is why you can see that the performance density is much higher in these cases ok; because they are only specialized in running AI benchmarks only for given data type or given precision ok. So, their performance density or GOPS/mm2 performance per area is much more higher. GPU is a much more generalized in terms of it can both accelerate your AI benchmarks as well as your video processing as well, graphics processing. And, it has almost closer or almost similar of performance density that is being achieved nowadays and the trend you can see right. So, this yellow line is the trend for your GPU FP32 which is almost generalized GPUs that you can get in the market nowadays. And, very few are; so, basically if you see that integer 8 so, basically these are with having these GPUs with RTX2080 T4 V100. So, these are also having the flexibility to run your 39 fixed point 8 bit right units. Now, what is the relationship I was talking about between the accuracy of the benchmarks, AI benchmarks and the bit precision? Ok. So, the more you have or more precision available in the computation for your AI benchmarks, the accuracy will be higher. So, this is the simple relationship. Now, how you want to get higher computing density? You can reduce the size of the feature size feature right. So, you can in the in this box here you can see that different sizes of feature map that is available. Now, you want to have much more GOPS/mm2 ok. So, that is you want to accommodate much more compute or you want to achieve much more complete density available for your systems. So, you will go for a lowering your feature size which is let us say 7 nanometer or 8 nanometer that is available that is CMOS technology that is being used to manufacture these chips like a V100 series of NVIDIA GPU. And, also around 28 nanometer is being used for this DaDianNao. This is one ASIC based version of DianNao which was published in the year of 2014. And, in 2016 TPU was published ok; so, first version of TPU. So, this is Tensor Processing Unit published by Google and that is also ASIC based ok and you have Cambricon, Eyeriss; this was published from the research group of MIT, EIE. So, several ASIC based implementations are there that you can see and their uses they use several feature map size. So, basically the size of this diamonds specify what kind of feature map they are using. Now, with decreasing the feature maps; that means, you can accommodate more compute units inside your chip right so; that means, you can increase the performance density, GOPS/mm2 right. So, how far you can go? Of course, you cannot go beyond certain limit. So, feature map size or technology scaling you cannot do let’s say beyond 1nm nowadays you cannot go beyond that. So that means, there is a certain level of performance that you can achieve. So, there is a limit that you can achieve ok; so, that is the main idea that you will get from this slide. 
So, the take away is; so, what is the trend? You can see the trend of different computation engines in terms of GPUs here you can see, what are their performance density in terms of GOPS/mm2 and also the systems that are available with different feature map sizes ok. So, higher compute density will be achieved by lower feature map sizes. 40 So, that is the intuition, intuitive idea that is very easy to understand right. So, now you want to achieve, you have seen from the previous section that your compute density is kind of exponentially increasing ok. And, your computation engine is kind of limited with this feature map sizes, you cannot put much more than that resource can allow you right. So, these are the two more much more important conclusions that we will take away after this slide. (Refer Slide Time: 12:44) Now, next what we will do is that we will go into some details of evaluation of different GPUs from the mainstream NVIDIA GPUs that are available in market nowadays and their functional number of units that are available. So, increase number of functional units. So, you can see that you are increasing the performance density and number of GFLOPS or gigaflops per second that you are actually much more. So, currently we have ampere series of NVIDIA GPU which has several thousands of functional units that are available inside these GPUs ok. So, we will see a much more clear array like in a very coarser array like what are the things available and how you can program them in the next lecture. But, we will see some abstraction of these or some features of these modern GPUs which are available and also modern ASICs which are available in their performances how they can run ok. (Refer Slide Time: 13:58) 41 So, in 2017 NVIDIA released V100 that is before in the series of NVIDIA GPU and ampere series of GPU was released in the year of 2020 and you can see that NVIDIA in NVIDIAs terminology this V100 comprises of around 5000 stream processors so, basically 80 cores with 64 SIMD functional units ok. And, the ampere series having this A100 having around 7000 of stream processors, which can be interpreted as 108 cores with 64 SIMD functional ok so, these are the processing cores or this SM stream processors that you can see here it is having L2 cache also and it has also dynamic memory or DRAM. So, which is this HBM2 that you are seeing here and the their interface that is available on this particular systems. Now, the basic difference of these two you can see here of course, in the micro architecture level there are differences that you will see. But from the memory point of view you can see that L2 cache is now in the ampere series it is kind of banked into 2 banks ok. So, L2 cache 1 and L2 cache 2 and here in the volta series, the previous series having only one cache memory. And, that is just to increase the throughput and reduce the latency of memory accesses ok. So, now, you can see that how much increase in the number of cores that can be accommodated into one particular system ok. So, from 80 cores to 108 cores in 2 years, you can imagine like how much progress that is happening and how these systems are getting scaled. Now, from the point of view of precision you can see that the tensor this series of GPUs having tensor cores ok. 42 So, tensor cores for machine learning is available in your Volta series in the NVIDIA V100 as well as your ampere series, but these having new floating point data type as TF32 ok. 
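To relate these counts to a machine in front of you, the sketch below asks PyTorch for a few of the per-device numbers discussed here; an A100, for example, reports 108 multiprocessors, matching the figure above. It assumes an NVIDIA GPU and a CUDA build of PyTorch are available:

    import torch

    # Prints SM count, memory and compute capability for each visible GPU.
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            p = torch.cuda.get_device_properties(i)
            print(f"GPU {i}: {p.name}")
            print(f"  streaming multiprocessors : {p.multi_processor_count}")
            print(f"  total memory              : {p.total_memory / 1e9:.1f} GB")
            print(f"  compute capability        : {p.major}.{p.minor}")
    else:
        print("No CUDA device visible")
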
So, that gives a bit more flexibility in terms of model training or benchmark training that there will be discussing in the coming classes, but it supports also sparsity in the machine learning or on the AI benchmark. So, these two are very important things to understand from the algorithmic point of view ok. So, you can accommodate more number of computing cores and now your cores also having tensor cores which are specialized in machine learning. (Refer Slide Time: 17:20) Now, if you see the NVIDIA core, one core itself you can see that. So, this is the core of V100, you can see that these are having this are floating point 64. So, this is double precision unit integer units; so, basically these are integer units for MAC operations so, all these are MAC units. So, with the you can see here floating point 16 who will be multiplied and accumulated with the 32 bit of floating point then it will generate 32 bit of data right. So, this these are floating point 32 bit unit as you can see here and then along with these SIMD processing units you have tensor. And, what does this tensor core do is that tensor core actually does this matrix multiplication in one side. 43 So, basically 4x 4 one matrix multiplication will be done in one side, 1 clock cycle to be precise. Now, why matrix multiplication is necessary and why we are telling or why we are calling them tensor core? So, basically the data is this tensor cores are employed or designed specifically to run AI benchmarks which are deep neural network based or convolutional neural network based to be precise. So, these convolutional neural network based computations need the main engine is the convolution engine. And, the convolution can be transferred or interpreted as matrix multiplication; it can be converted to matrix multiplication. So, if you have matrix multiplication you need inside your SIMD of matrix multiplication unit, you can perform the entire tensor of like set of input data. So, basically these two sets of data you can see 16 x 16 plus you can accumulate another 16 which is already there. So, in the accumulator and you will get the data of 16 or entire. So, basically 16 MAC operations you are doing in the one side. And, that is how you can increase the throughput in many fold and that is the purpose of these tensor cores, that are available in modern computing engines. (Refer Slide Time: 19:50) Now, we are talking about tensor cores. So, that was core available in the previous generation of NVIDIA GPUs which are V100 and then we now have ampere series of GPUs. And, here you can see in the cores, you can see this register file is there, shared 44 register file and then you have this SIMD of integer 32 then floating point 32, floating point 64. So, it supports all these different precisions of computations you can see ok. So, all these units are basically this 32 unit, 32 bit fixed point, 32 bit floating point, 64 bit floating point; all these units are mostly used in accelerating your video processing and graphics benchmarks and tensor cores are mostly used for your AI benchmarks ok. So, that is why the modern GPUs that are coming with more number of tensor cores. And the tensor core in the ampere series is even much more flexible than the earlier series. So, there is a notion of sparsity inside your training or inference in your AI benchmarks. So, what is sparsity? Sparsity means you can have multiple weights or parameters which are very close to 0 can be interpreted as 0. 
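The convolution-to-matrix-multiplication lowering mentioned above can be seen in a few lines. This is an illustrative sketch with placeholder shapes, not the exact scheme used inside the tensor cores:

    import torch
    import torch.nn.functional as F

    # Lowering a convolution to one matrix multiplication ("im2col").
    x = torch.randn(1, 3, 8, 8)        # one 3-channel 8x8 input
    w = torch.randn(16, 3, 3, 3)       # 16 filters of size 3x3

    cols = F.unfold(x, kernel_size=3)  # patch matrix of shape (1, 27, 36)
    out = w.view(16, -1) @ cols        # convolution expressed as a single matmul
    out = out.view(1, 16, 6, 6)        # fold back into the output feature map

    # Same result via the regular convolution operator:
    ref = F.conv2d(x, w)
    print(torch.allclose(out, ref, atol=1e-5))   # True
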
And, if you think from computation point of view; so, if you have let us say 4x4 matrix multiplier. And, if your let’s say half of the data is let’s say 0, then you do not need to compute those particular half number of multiplication. So that means, of course, that will be manifold for your 2D matrix multiplication, but you get the idea right. So, basically for where one operand is 0, you just don’t do the multiplication. So, in terms of energy efficiency in terms of throughput you can increase it many fold ok. So, that is why the sparsity and the matrix multiplication is introduced in this tensor cores that are available in ampere series ok. (Refer Slide Time: 22:16) 45 So, in overall idea like Volta series that the T4 series, the RTX series. So, what kind of performance or density they are getting with particular power envelope that you can see here in this table. Again this table is taken from this reference and of course, I will give the name of this reference that you can go through after this. Plus, that the main important thing is that you can increase the density; here you can see only take particular for V100 and for V100 you can see that almost 10 times increase of performance density you can get just by using mixed precision. So, mixed precision means in the whole computation engine you can have 8 to 32 bits of multiply and accumulate. And, with full precision, full means the double precision you can see the GOPS and one order magnitude you can get of more performance density in 32 bit and even more you can get it if you go for mixed precision. So, precision compute density, the energy the you can see all they are in the same power envelope ok. So, in the same power envelope without either without the inducing more energy you are actually getting much more performance density ok. But of course, again just to connect to that graph that we saw before is that you cannot increase beyond certain point; because the features size you cannot decrease beyond certain ok. If you could decrease beyond certain point then you will go to atomic level of feature map size and, but that is just theoretical ok. So, beyond 1nm it is very difficult, because the energy, the temperature that will be generated by the processing engines will be much higher and it will be hard to contain the 46 amount of temperature that will be generated ok. So, that’s why you cannot go beyond a certain level of feature size scaling. (Refer Slide Time: 24:45) Now, in terms of FPGA based accelerators, FPGAs are fully reconfigurable and this is the abbreviation of Field Programmable Gate Arrays which are field programmable means these are mostly bit level configurable devices. Now, in the FPGAs of course, you can employ different ASIC level accelerators, because it is fully configurable. But, this slide presents different performances of different ASICs that are available. So, what are different ASICs that are available? We have the TPU which is Tensor Processing Unit from Google. So, v1 is essentially only published for inferencing, but now it is v2, v3 and other versions are employed in Google data centers for a large scale deep neural networks training. And, all these computation engines that you can see here, they are essentially array of processing elements or they are called systolic array based engines. And, these processing engines as you can see, these are just array of multiply and accumulate engines. 
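Mixed precision of the kind summarized in that table is exposed in frameworks as automatic mixed precision, which the Week 6 lectures cover in detail. The following is only a sketch of the usual PyTorch pattern, with a dummy model and data, and it assumes a CUDA device is available:

    import torch
    import torch.nn as nn

    # Matmul-heavy ops run in half precision on the tensor cores, numerically
    # sensitive ops stay in FP32, and a gradient scaler guards against underflow.
    device = "cuda"
    model = nn.Linear(1024, 1024).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()
    loss_fn = nn.MSELoss()

    x = torch.randn(64, 1024, device=device)
    y = torch.randn(64, 1024, device=device)

    for _ in range(10):
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = loss_fn(model(x), y)   # forward pass in mixed precision
        scaler.scale(loss).backward()     # scaled backward pass
        scaler.step(optimizer)
        scaler.update()
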
And, to feed the huge number of processing engines you can see 12x14 multiplication engines that are being employed in one; this is the first version of the TPU we are talking about. And, one scratch card memory is deployed to just feed these data hungry processing elements ok and then you have the off-chip DRAM to offload the data to your scratch pad. 47 So, this is the kind of architectural ASIC based accelerators or ASIC based specialized AI engines that are available in markets. And, this is one picture of DaDianNao that you are seeing here and you can see that the GOPS that can be achieved is much more higher with of course, with you can see that how much power it is consuming. (Refer Slide Time: 27:10) But of course, these are in terms of milliwatt whereas, the power consumed by these GPUs are in terms of watt as you can see because of course, they are much more generalized in terms of computing systems. And, these are much more specialized because this fixed precision or mixed precision multiply and accumulate engines only for these processing engines and that is why they consume much more much less power as you can see here. (Refer Slide Time: 27:45) 48 And of course, the FPGA based accelerators as I was mentioning that these arrays that you saw in a in the ASIC based accelerators, they can be actually configured or programmed into the FPGA; they emulate one compute engine ok. So, this is basically the figure of that array of multiply and accumulate processing engines and memory hierarchy and these engines are basically this convolution engines. These are problems and then you have multiply and accumulate then you have final at the tree of multiply and accumulate engines and then you have the final output ok. So, all these AI benchmarks are mostly to train your deep neural network benchmarks to be precise. (Refer Slide Time: 28:36) 49 Well, from the market share point of view. So, what kind of market shares in these companies have in terms of different engines like GPU and FPGA based? So, GPU based of course, market is fastly dominated by Nvidia 72.8 percent and rest of the market is. So, these are the stocks of after 2018 financial year. So, now, the ratio might have changed slightly, but overall you can get the idea and then we have in the FPGA domain we have Xilinx, Altera, Microsemi and Lattice semiconductors, which is very few share. Well, these are the systems that are available nowadays. (Refer Slide Time: 29:30) Now, we are talking about a gap. So, this gap I have several times mentioned in the previous lessons, we will just again go through this gap just to summarize whatever we have studied so far right. So, what is this gap about? The gap is about; so, to get particular accuracy level; so, if you want to get linear increase in accuracy ok. So, you need to exponentially increase the computation density ok. So, this is the take away, this was the take away from the left hand side graph. And, the systems that are available we know that mixed precision and different precision we can employ and we can have much more performance density. But of course, there is some limit and we cannot go beyond certain limit of performance density right. So, the trend to if you see the computation requirement frame that is going exponentially and the resources that are available the trend is kind of getting saturated right, because this the feature size is almost going to your nanometer level right. 50 So, now how we can bridge this gap? 
We want to accelerate the AI; that means, the benchmark that we talked about. So, these benchmarks like all these different benchmarks that we have seen and few of these we will see in details in the coming classes, because we need to employ or you need to implement them for a particular target device right. Now, the gap is there ok. So, this computational requirement or the requirement of the computational density is going high exponentially and the performance density is getting saturated. So, how we can accelerate some beyond certain limit right; so, that is this course all about. We will learn how to accelerate the AI benchmarks that we have seen with the systems that are available nowadays with us and how to employ several techniques from algorithmic to different configurations in training, how we can with different libraries, with different SDKs how we can actually employ efficiently these benchmarks onto these systems that is the goal of this course. And, we will see the implementation or how we can implement on these systems that are available in the coming lectures. (Refer Slide Time: 32:40) So, to conclude we have these references. So, you can see that we have the reference from Professor Mutlu. So, you can learn about the memory systems, the state of the memory technologies that are available nowadays. And of course, about the basics of these neural networks, the specialized computation engine, their performance and how this scaling 51 happening from that algorithmic innovation point of view as well as from the system level point of view. And, how we can actually merge them to get better computation and energy efficient computation, performance wise at high accuracy computation, for these benchmarks you refer this paper. Well, that is all about for today. Applied Accelerated AI Prof. Satyadhyan Chickerur School of Computer Science and Engineering Indian Institute of Technology, Palakkad Introduction to AI Accelerators Lecture - 03 GPUs Part 1 Good evening, everyone. Welcome to the 2nd day session on Applied Accelerated Artificial Intelligence workshop. So, today we would be discussing about AI Accelerators and to be very specific we will try to discuss in terms of what we are talking about is GPUs today right. (Refer Slide Time: 00:40) So, the agenda of this particular session of the day would be like this. We will first start with what are AI accelerators, where are they used, how do they work, one view of the AI accelerators, second view of the AI accelerators. Then what are these GPUs, how do you 52 actually think of writing a program in terms of just running it on a CPU as against on multi core or a parallel program versus a program which you write it on a GPU. Then we will try to see the architecture or the setup of PARAM Shivay which is a cluster which also has GPUs and DGX I. And we will do some demos on these particular systems ok, very brief small demos to start with and then we will end up with what would be the benefits of actually working on such a hardware right, ok. 53 ( ) Refer Slide Time: 01:50 If you see a AI accelerator is a high performance specialized hardware that is specifically meant for AI workloads which in turn would be neural networks, machine learning programs, and intensive programs which are basically sensor driven, which involve certain processes, which also would link all of this right. So, you can work with a program which has neural networks, which has machine learning right which is linked to inputs from some sensor devices. 
And if you classify it, there are three main types of artificial intelligence accelerators for that matter. You have a CPU; you have a graphics processing unit or a GPU and then you have Field Programmable Gate Arrays or ASICs as well. So, these are three actual areas or devices basically, wherein you can work with for your AI workloads. Refer Slide Time: 03:11 54 ( ) And if you see that way from the classification point of view, you can divide these AI accelerators into two groups based on where actually you can use them right. So, you have these AI accelerators at data centers and you have these AI accelerators which are on your edge devices. Or the devices where you actually do inferencing, which are not very computationally strong devices right. The devices which are there on the edge. So, this is how the classification basically looks like, when you talk in terms of where they are placed and how you are going to use them. Refer Slide Time: 04:05 55 ( ) Now, technically speaking, any of these accelerators right are supposed to be working in a coprocessor mode. So, if you have not worked with it this is how you actually try to develop a program which could run on accelerator. You have got two types of programs right. The programs which are there which are run on CPU, then there is some portion right of your data the whole program which you need to actually transfer to the accelerator and the accelerator does the processing for you. You get the result again and then this result is again transferred to your host. Host is nothing but your CPU for that matter. So, CPU is a host, your GPU is a device this is the terminology which generally people use in these type of systems right. So, host is your CPU and the GPU is your device. So, when you start writing such programs, what effectively happens is not everything is to be offloaded to a GPU or accelerator for that matter. The portion which is compute intensive, which is actually required to be run by the GPU will be required to be offloaded to the GPU. And then it basically is executed and then return back to the host, which actually does all the other types of work right. So, this is how they are used by people when you start developing your programs and you will start writing your programs right. Refer Slide Time: 05:52 56 ( ) Now, one view of the AI accelerators ok is something like this. You have a CPU, you have a GPU you have FPGA and you have VPUs and all there are various types of accelerator. So, from one view of the AI accelerator, what I meant was you actually have a serial and a task parallel workload ok and then you have data parallel workloads. So, when you have a data parallel workload, it is very good for people to run it on a FPGA and a VPU. Whereas, if it is a serial and task parallel workload, these type of workloads are effectively run through the CPU or a GPU and this all is in a coprocessor mode. So, you have PCI express through which your CPU and the GPU actually communicates. So, I hope it is clear that different ways of executing compute intensive workloads by different types of accelerators right. So, this is what is a gist of how you actually can split your workload among these various categories of accelerators. Refer Slide Time: 07:15 57 ( ) Now, let us try to do a demo and see if we are trying to work on a CPU ok. It might be a multi core CPU and then you can work on different types of GPUs. So, let us try to understand and do hands-on by these two commands right which are written here. 
CPU related information if you want to gather, there is something called as lscpu which displays the CPU information and then you have this cat/proc/cpuinfo, these two are going to give you the information about the CPU configuration of your system, the snapshots are attached. Similarly, when you see the nvidia-smi command it displays the GPU information and you have this gpustat which gives a better information also a bit of it. So, let us try to do this hands-on and see at three different places ok. Refer Slide Time: 08:33 58 ( ) So, I will just go to those places and see ok. So, now, I hope this particular command prompt is visible to everyone. Student: Yes, it is. So now, this is a GPU system right, this just a workstation with a GPU ok. So, we will try to do two things now, first we will try to understand what type of CPU this has got, ok. (Refer Slide Time: 09:07) 59 So, if you see this lscpu, the architecture is informed, we have what type of operational modes this CPU can work on, what type of Endian-ism is there ok, how many CPU cores are there in this ok. (Refer Slide Time: 09:34) And then what is the megahertz speed of a CPU, what is the maximum speed, what is the minimum speed ok and the model and how many threads per core. So, hyper threading concept is there and how many cores per socket. So, this is a general thing about this particular system right. Now, let us try to understand what type of GPU this has got. (Refer Slide Time: 10:03) 60 Now, you see this; this particular system has a Quadro K5200 GPU. So, this tells about the GPU number ok, the name and all of this like fan speed, then compute node, the index ok all of this. So, this you can go through in detail when that afterwards, but you get the information about this particular GPU. Now, let us try to go to another place now which is RDGX. (Refer Slide Time: 10:52) So, let us try to do the same thing again. (Refer Slide Time: 11:13) 61 So, if you go to the DGX. So, this DGX has got 80 cores, 80 CPU cores that is what it means right. And you have 20 cores per socket, then you have on node 0 CPU, what are the numbers 0 to 19, 40 to 59 and node 1 CPUs 20 to 39 and 60 to 79. So, this is a DGX server right, DGX I. So, let us try to understand the GPU information about it ok. (Refer Slide Time: 11:41) (Refer Slide Time: 11:50) So, if you see this, this is a DGX and if you see the GPUs, you have got GPU 0, GPU 1, GPU 2, 3, 4 and up till 7 and these are all Tesla V100s ok. And you have got the memory, the utilization of that memory on that particular GPU ok. So, this is how the information about your total GPU system ok you can gather using nvidia-smi. 62 And then for each of these GPUs which are the processes which are running ok, and how many what is the memory which that particular GPU is using right, all of that information could be gathered from these two commands right. So, lscpu and nvidia-smi. (Refer Slide Time: 12:55) So, now let us go to the next slide which talks about, which talks about the parallelism which I had just told right, I had told about the task parallelism as well as the data parallelism. So, let us try to understand it in brief as to when you talk of data parallelism the data actually is split into various blocks and given to various tasks. So, there is a data decomposition which actually happens too much. And then this is how like this block of data goes to this task, this block of data goes to this task, this block of data from third block goes to the third task right. 
And something like this and then you have this aggregation of all of these results and then do this. This data parallelism when you talk task parallelism let us say you have a small section of data. Then you actually split ok, that particular data which can be unique for each of the tasks and then it is not necessary that here the tasks will use the same data or these tasks will do the same work ok. So, basically it is the decomposition at what level, decomposition at the task level, decomposition at the data level right. So, here these are different tasks, you get the same data or you get different data for the same type of task right. So, this is how actually is a basic difference between the task and data parallelism. (Refer Slide Time: 14:44) 63 And from the second view of AI accelerators, what I meant was you can use these accelerators, if you see here it starts with CPU, GPU, TPU and then all the way up till quantum accelerators. So, the idea here is that all of these different levels of complexities which are available in these accelerators could be used as different levels of your programming applications right. So, you actually can use with these your sensors, you can do data conditioning, you can develop in your algorithms, you can do actually human in loop and all of that human machine interaction at various levels, various complexities. The same hardware could be used right and then you can have application development for various users, then you talk about explainable AI, you can talk about matrices, you can talk about verification, validation, policy, ethics safety training. So, it actually is a very huge gamut of applications, course of action require requirements and then the application or the area where you are working in and the hardware. So, these are all linked right and this is what I thought like we would cover it as a second view of AI accelerator right. Two views of AI accelerator, one is from the view of how you are going to do computations only computational workload and the second thing is how can it be used for other things right, like inferencing or data conditioning or something processing the structured data or unstructured data. 64 Of course, they are all considered as data workloads, but the idea was at different places you want different type of accelerators and different type of usage for those accelerators right. So, that was the intention of this specific slide ok. (Refer Slide Time: 16:57) So, now let us try to understand the difference between the CPU and the GPU, which is something like this. It is basically having multiple cores, the CPU and you have L1 cache for each of the core, then you have controls unit for each of these cores. And then you have got L2 cache, L3 cache and DRAM. For GPU you have got these huge massively parallel processing cores with you and each of these ok, have got their own control and cache and then you have L2 cache you have the DRAM, the way in which these are used for massively parallel applications ok. The way in which they are programmed are bit different, then yeah that is how basically it is done. Not going into the details of them, but this is how the difference lies in the CPU architecture and the GPU architecture. 65 ( ) Refer Slide Time: 18:16 So, massively parallel programming is supported by GPUs and this is an example which I thought, this slide shows if there is a GPU right for the data center and then the same type or the similar family of GPUs ok for the edge. 
The idea is that the difference lies in the design: how many cores you are working with, and how the connectivity between the CPU and the GPU is arranged, because the Jetson Xavier NX is a completely embedded type of processing engine with a high-speed IO and memory fabric. This is for the edge devices; when we come to inferencing, we will show you how to use such devices for the inferencing portion of the problem you are trying to solve. And then there are the GPUs for the data center, the V100s, the A100s and so many others, which are used in data centers for training your models and for coming up with many different modelling approaches, and you have NVLink and all of that. So, this is the difference in the architecture and the design of these accelerators at the data center level and at the edge level.

(Refer Slide Time: 20:00)

Now let us try to understand a very basic thing called sequential reduction. For example, you have to add a list of numbers together. In sequential processing you go on adding: 13 plus 27, then you add 15 to that, then 14, then 33, 2, 24 and 6. That is how sequential processing happens. If you look at this reduction done this way, you will see that nothing can be done in parallel, because each partial sum depends on the previous one, and so on. This is basically the way a CPU works for you. When you come to a parallel reduction, the idea is that you can do one addition on one particular core, CPU or processing element, another addition on a different processing element, and so on, and then keep combining the partial sums. That is parallel reduction.

(Refer Slide Time: 21:22)

Now, how do you do this parallel reduction using a GPU? Assume you have N elements. You start with N/2 threads, so you use one thread for every two elements; each thread computes the sum of its corresponding pair of elements and stores the result. Then, iteratively, at each step the number of threads is halved and the step size between the corresponding elements doubles, and ultimately you are left with the sum of the whole array as the result. This is how you move from sequential thinking to parallel thinking, and then to the GPU concept of using many threads. A minimal CUDA sketch of this idea is given below.
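This is a small sketch of my own illustrating the scheme just described, not the course code. The data is decomposed across threads, which is the data-parallel idea from earlier: each thread first adds one pair of elements, and then at every step half of the threads stay active while the stride between the elements being combined doubles. It assumes a single block with a power-of-two thread count reduces the whole array; the name reduce_sum is mine.

#include <cstdio>
#include <cuda_runtime.h>

// Parallel reduction in one thread block: launch n/2 threads, each first adds
// one pair of elements; then, at every step, the number of active threads is
// halved while the stride between the elements being combined doubles.
__global__ void reduce_sum(const float *in, float *out, int n) {
    extern __shared__ float partial[];           // one partial sum per thread
    int tid = threadIdx.x;

    float v = in[2 * tid];
    if (2 * tid + 1 < n) v += in[2 * tid + 1];   // guard for odd n
    partial[tid] = v;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) *out = partial[0];
}

int main() {
    const int n = 8;
    float h_in[n] = {13, 27, 15, 14, 33, 2, 24, 6};   // the numbers on the slide
    float *d_in, *d_out, h_out = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    reduce_sum<<<1, n / 2, (n / 2) * sizeof(float)>>>(d_in, d_out, n);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("sum = %.0f\n", h_out);               // prints 134
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

With this scheme the eight numbers from the slide reduce in three parallel steps instead of seven sequential additions; for large arrays one would of course use many blocks, or a library such as Thrust or CUB.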
(Refer Slide Time: 22:39)

Now, this is a snapshot which shows the difference in processing time for a small classification problem that we ran on a GPU as against a CPU. For epochs ranging from 1 to 99, you can see the time taken on the CPU as against the GPU: for a single epoch it takes about 154.26 seconds on the CPU as against 18.21 seconds on the GPU. This was just to show you the time difference between running something on a CPU and running it on a GPU, all the way up to a hundred epochs. And this was only a very small classification problem, a diabetic retinopathy detection application we tried, but it gives you an idea of the difference in execution time on a CPU and a GPU.

(Refer Slide Time: 24:19)

Now, if you are trying to train a convolutional neural network on multiple GPUs with TensorFlow, this is how it happens. The CPU acts as the master and the GPUs are the coprocessors. When you develop a convolutional neural network (or any neural network, for that matter), you have gradients, losses and weights, and the model itself runs on the GPUs. The variables are held on the CPU, the mean of the gradients is computed, the variables are updated, and the updates are fed back to the model to improve its accuracy, and so on. So this is how the pieces are linked. The CPU does the specific portions that are not done effectively by a GPU, because a GPU is a compute-hungry kind of processing element: you have to give it a lot of computation, and you cannot expect data-intensive or IO-intensive work to run effectively on it. One thing to keep in mind is that not all applications are suited to run on a GPU; the application has to have a massively parallel requirement, and only then do you get the benefit and the advantage of solving the problem on a GPU. This is how it will be used as we go ahead in these classes, because you will be working on various types of models and on how to accelerate them. This example is a multi-GPU setup; you can also have a single-GPU setup, or a setup with clusters of GPUs, and in the days to come we will see how to use such setups for solving certain problems.

(Refer Slide Time: 26:43)

This is the PARAM Shivay cluster at IIT (BHU) Varanasi, which gives you about 837 teraflops. If you look at the configuration of this system, you have 4 login nodes, 192 compute nodes, and GPU nodes with NVIDIA Tesla V100, the same V100 that I showed you on the DGX. The DGX has a very specific setup which I will show in the next slide, but here also V100s are used: there are 11 GPU nodes, an InfiniBand switch for the communication, primary storage of about 750 terabytes and 250 terabytes of archival storage, and high-memory compute nodes, about 20 nodes with 800 cores. All of this information is available on the NSM website; the reference is also given. If you see the GPU compute node specification, there are eleven nodes with 440 CPU cores in total, and each node has 2 x V100 PCIe accelerator cards, each card with 5120 CUDA cores.
So, that is what each node looks like, and that is the overall design of the PARAM Shivay cluster. Now, this is the DGX-1 server.

(Refer Slide Time: 28:35)

If you see this architecture, it has NVLink. NVLink is a proprietary interconnect developed by NVIDIA, and NVIDIA uses it in its servers. You have 8 GPU cards, PCIe switches, the CPUs and the NICs. So the DGX-1 server has 8 Tesla V100s, 256 GB of memory, dual 20-core Intel Xeon CPUs, about 40960 CUDA cores and 5120 tensor cores in total, plus the system memory and the network. This is the architecture of the DGX-1.

(Refer Slide Time: 29:55)

Now, let us look at one of the very important features of these V100 GPUs: the tensor core. There is an animation running on the slide which shows how these tensor cores process floating-point operations, FP16, FP32, INT4 and so on. The idea is that you perform an operation of the type D = A * B + C. You are doing a matrix multiplication on small arrays, with an accumulation at the end. This is the generic representation, and you can work with different input types, say FP16 or FP32, and get the result in one of several formats.

(Refer Slide Time: 31:31)

So, why is the tensor core so important? One arithmetic operation that holds very high importance is matrix multiplication; you end up solving a lot of your problems through matrix multiplications. If you look at the multiplication of two 4 x 4 matrices, it involves 64 multiplications and 48 additions (each of the 16 output elements needs 4 multiplications and 3 additions). This is where the tensor core helps us. The GPU is made up of two types of cores, the CUDA core and the tensor core. Notionally, CUDA cores are a bit slower but give you significant precision, whereas a tensor core is very fast but, to compensate for that speed, you lose some precision. What the tensor core does is accelerate the speed of your matrix multiplication: tensor cores multiply two 4 x 4 FP16 matrices and add the product to an accumulator. This is General Matrix Multiplication, or GEMM. Instead of needing many CUDA cores and many clock cycles to accomplish the task, it can be done in a single clock cycle, which gives a dramatic speedup for applications such as machine learning that run on these tensor cores. To make the contrast concrete, a plain CUDA-core version of the same D = A * B + C computation is sketched below; a tensor-core version follows a little later.
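This is my own minimal sketch, not the demo program from the lecture: a straightforward CUDA kernel that computes D = A * B + C on ordinary CUDA cores, one output element per thread, so each thread does one row-times-column inner product plus the add. The kernel name matmul_add and the matrix size N are placeholders.

#include <cuda_runtime.h>

// D = A * B + C for N x N row-major matrices, using only CUDA cores.
// One thread per output element: N*N inner products of length N.
__global__ void matmul_add(const float *A, const float *B, const float *C,
                           float *D, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)            // N multiplications, N-1 additions
            acc += A[row * N + k] * B[k * N + col];
        D[row * N + col] = acc + C[row * N + col];
    }
}

// Hypothetical launch for N = 4096, as in the demo later: 16 x 16 threads per block.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matmul_add<<<grid, block>>>(dA, dB, dC, dD, N);

The lecture also mentions unified memory in the demo, so pointers allocated with cudaMallocManaged could be passed here directly; that host-side part is not reproduced in this sketch.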
(Refer Slide Time: 33:46)

This is how we can create easy-to-understand representations of what we use in deep learning or machine learning model development: you can express the forward propagation, the activation gradient calculation and the weight gradient calculation of a fully connected layer as equivalent matrix multiplications. You can see here the matrices, their sizes, and how the output activations, input activations and weights are arranged in a form that can be executed easily by tensor cores. In practice, the way people use tensor cores is through the CUDA libraries: cuBLAS uses tensor cores to speed up GEMM computations, and cuDNN uses them to speed up both convolutions and recurrent neural networks (RNNs). That is how you can start using tensor cores in your own application development. There is also the CUDA Warp Matrix Multiply-Accumulate API, or WMMA, which lets you program the tensor cores directly.

(Refer Slide Time: 35:57)

Tensor cores are also used for various other applications. One example is sampling: there is something called Deep Learning Super Sampling (DLSS). What it effectively does is render a frame at a low resolution and then, once it is finished, upscale it to match the screen resolution of your monitor. That way you get the performance benefit of processing fewer pixels but still get a very nice-looking image on the screen. So that is another way tensor cores are being used. Now there is a demo we thought we would show you of a matrix operation on hardware that has tensor cores, and this is how the result will look.

(Refer Slide Time: 37:50)

We use three matrices of size 4096 x 4096 each and do the same computation we were discussing, D = A * B + C.

(Refer Slide Time: 38:18)

And we have done that on the GPU both without tensor cores and with tensor cores. So, let me take you there, and then I will execute and discuss the program briefly.

(Refer Slide Time: 38:43)

So, let me just, one minute, give me a second.

(Refer Slide Time: 39:28)

(Refer Slide Time: 40:34)

Student: Can you increase the font size?

One minute, I will do that.

Student: Yes, this is good, this is good.

Ok, sure.

Student: You can explain what you have fired and what is happening.

Yes, I will explain the program as well. The idea is that we are computing D = A * B + C with tensor cores and without tensor cores. Let me show the programs once; there are three programs actually. This is a sample program that uses something called WMMA, which stands for warp matrix multiply and accumulate, and the idea is that you work with matrix tile dimensions.

(Refer Slide Time: 41:14)

This part is the WMMA initialization, and then you work with the tensor cores through a WMMAF16TensorCore kind of kernel instead of just CUDA cores. I will also show you a GPU program that does not use tensor cores. Of course, you still use blockIdx, blockDim and threadIdx, which give you the same indexing, but here you are working with warps, which is why it is called warp MMA, and the kernel does the very same basic computation, only using the tensor cores, which is what I just showed you. A minimal sketch of this kind of WMMA kernel is given below.
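This is a hedged sketch of the kind of WMMA kernel being described, my own minimal version rather than the course's WMMAF16TensorCore program. Each warp computes one 16 x 16 tile of D = A * B + C: A and B are FP16, the accumulation is in FP32, and the warp walks along the K dimension 16 elements at a time using the nvcuda::wmma fragments. It assumes M, N and K are multiples of 16, row-major storage, and a Volta-or-newer GPU (compile with nvcc -arch=sm_70); the name wmma_gemm is mine.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16 x 16 tile of D = A * B + C.
// A, B are FP16 (row-major); C, D are FP32; M, N, K are multiples of 16.
__global__ void wmma_gemm(const half *A, const half *B, const float *C,
                          float *D, int M, int N, int K) {
    int tileRow = blockIdx.y;   // which 16-row band of D this warp handles
    int tileCol = blockIdx.x;   // which 16-column band of D this warp handles
    if (tileRow * 16 >= M || tileCol * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;
    wmma::fill_fragment(accFrag, 0.0f);

    // Walk along the K dimension, 16 elements at a time, on the tensor cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileRow * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + tileCol * 16, N);
        wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);   // acc += a_tile * b_tile
    }

    // Add the corresponding tile of C, then write the result tile into D.
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::load_matrix_sync(cFrag, C + tileRow * 16 * N + tileCol * 16, N,
                           wmma::mem_row_major);
    for (int i = 0; i < accFrag.num_elements; ++i)
        accFrag.x[i] += cFrag.x[i];
    wmma::store_matrix_sync(D + tileRow * 16 * N + tileCol * 16, accFrag, N,
                            wmma::mem_row_major);
}

// Hypothetical launch for the 4096 x 4096 demo size, one warp (32 threads) per block:
// dim3 grid(4096 / 16, 4096 / 16);
// wmma_gemm<<<grid, 32>>>(dA, dB, dC, dD, 4096, 4096, 4096);

Structurally it mirrors the plain CUDA-core kernel shown earlier; the difference is that the inner product is done 16 x 16 x 16 at a time by the tensor cores instead of one multiply-add per thread per step.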
(Refer Slide Time: 42:15)

I will show you the GPU program as well; I am not going into the details, but it is a CUDA program, so if you understand CUDA programming you will follow it. The point here is the additional tensor-core support that has now been added into CUDA.

(Refer Slide Time: 42:40)

This kernel uses the tensor cores everywhere, and you run it with the same grid-dimension and block-dimension style of launch, then you synchronize and collect the result.

(Refer Slide Time: 42:58)

Then there is the GPU program without the tensor cores, where you just do the matrix multiply and add with plain CUDA, using the same blockIdx and blockDim indexing.

(Refer Slide Time: 43:16)

You run it with the same grid-dimension, block-dimension kind of launch, and finally there is a very basic, rudimentary CPU version which does the same multiplication in a straightforward way.

(Refer Slide Time: 43:32)

The code also makes use of unified memory, and that is it. The point was just to give you an idea of how people can use tensor cores, and CUDA cores for that matter, and to tell you that you can start programming with tensor cores; we will see certain other things on the DGX and execute a lot of programs as we go ahead in the days to come. From the benefits point of view: sustainability is one thing, speed is another, then scalability, the heterogeneous architecture you can utilize, and the overall efficiency improves. So these are some specific benefits of going in for GPU programming.

(Refer Slide Time: 45:10)

This is something I had left out earlier, so I thought I would share it with you. If you look at gpustat, it again tells you which GPU you are going to use, here a Tesla V100 with 32 GB of memory, the temperature at which it is working, what percentage it is being utilized, which programs are running, and how much memory is being used by which program. So, it is a way of seeing the same information in a different format instead of using nvidia-smi. These are things you can keep trying out on your own.

Applied Accelerated AI
Dr. Satyajit Das
Department of Computer Science and Engineering
Indian Institute of Technology, Palakkad

Lecture - 04
Introduction to Operating Systems, Virtualization, Cloud Part - 1

Hello, good evening everybody. Today we will start with the Operating System first. We will have a brief overview of what an operating system does and what privileges this piece of software needs. We will also talk about virtualization; this will be the most important aspect of this presentation, because we will learn the underlying technologies and methodologies available for virtualization, so that later, when we use several virtualization methods, you will know what you are actually doing. You will also get an overview of the cloud.
So, let us start with the operating systems.

(Refer Slide Time: 01:07)

An operating system is nothing but a piece of software. Basically it is the same kind of thing as the user applications, or user software, and they all reside together somewhere in memory. But the operating system is not just any other software: the OS is the most privileged software in a computer system. Privileged means the operating system can do special things that user software or user applications cannot do: writing to disks, accessing the network cards inside your computing system, control over memory and the entire memory hierarchy, CPU usage and many other things. As I was mentioning, you already have an idea of the underlying hardware resources available in a computing system: the processing system (the CPU), the storage, the memory and the IO devices for peripheral purposes. All of this underlying hardware is managed by the operating system, this piece of software.

(Refer Slide Time: 02:50)

Now, let us look at why we need this software at all, or whether we could directly run our application software on the hardware. By hardware I mean the resources available in a computing system: the compute, the memory system and the other IO devices. You have your application program, which is the piece of software you want to run, and you could run it directly on the hardware. What you need is one interface, called the instruction set architecture (ISA). This interface sits between software and hardware so that the software can talk to the hardware, that is, control the underlying computing system, the memory and the IO devices. If you want to access memory, store some files, or do some operations on the CPU, all of that goes through this instruction set architecture. Now, your application program sits on top of the ISA. So if you can access the ISA directly, why would you need an operating system? The answer is that these ISA accesses can be abstracted into libraries, and those are called the device drivers, because not every program running on this hardware needs to talk to every device at the same time. For example, say you have program 1, and program 1 only needs to talk to the CPU, or maybe only the memory and the CPU.
So, now you can write these libraries and you can take the help of these libraries and you can access them from the program, you can run them. So, this is kind of the primitive OS that is coming in between the application program and your ISA interface right. (Refer Slide Time: 06:09) Now, what if you want to run several programs. Let us say program 1, program 2, concurrent. That means, at the same time you want to run several programs. So, that means several programs need now to access the CPU the underlying hardware now they need some multiplexing. Because if you want to run several programs concurrently; that means, underlying hardware will be multiplex between this between these programs or software codes or user applications. So, now what we are doing here? We are sharing the resources that are available in underlying hardware through this hardware multiplexing. And we are giving that illusion 87 to these programs that they are accessing the entire resources that are available inside the hardware. Now you can see that we are evolving from this device libraries for IO then order multiplexing. And then you need some protection as well, when multiple programs you have allowed to run on top of your instruction set architecture and your operating system which we are talking about here. There you need to take care of several levels of thread models. So, like two programs which are running together they might not trust each other operating system also might not trust the programs that are running. So; that means, that the programs itself which are the user level applications that are running on top of your OS, they can try to potentially put some malicious instructions inside your OS ok. And hardware also cannot have the full trust over the programs, but because you can have the memory access you can you the program user level program can access the privileged sections of memory or so on and so forth. So, different levels of protections or thread models you need to take care. And all these protection models if you integrate into your device private libraries plus your scheduling, multiplexing your hardware resources and the protection model. Then you have kind of an encapsulated version of the modern operating system. So, this is where the idea of this piece of software lies, because we have the application programs, the operating systems, instruction set architecture. Interface then the underlying hardware which is essentially running all the instructions, that are coming from the operating systems to be done ok. (Refer Slide Time: 09:25) 88 Now, if you see the software layers ok. So, we are talking about user level programs we are talking about operating systems which has several libraries and other protection modules, the multiplexer or scheduling modules right. So, all these software are lying in layer wise inside your system. So, you have your user space where all these processes will be running, so processes means the user applications. Then you have the OS, so which has this scheduling memory management file system network stack device drivers and so on. Now, in the between so user space, so all these processes will talk to the OS through this system call interface. So, these interfaces are very much necessary to understand like where they are placed, to get a very comprehensive idea about the virtualization that we are trying to talk in the lecture. So, now, system call interface. 
So, basically the privilege instructions like let us say one process wants to access some memory or maybe file system, which is residing in your hard disk or your network storage. So, process will invoke system calls to this operating system which is underlying. And then operating system will convert them into instruction set architecture, which has several components as user components as well as system components that we will see next, but instruction set architecture and that will go eventually through the hardware underlying hardware. But increasingly in this stack one more layer is getting added which is called as hypervisor or which is widely known as the virtualization layer. So, with 89 hypervisor is also another a piece of software. So, you can see the layer position, so it is actually lying underneath the operating system. And most of the time hypervisors want to or try to stay invisible or very very transparent to the operating system. But if it wants to show itself before the operating system, then it will have this hyper call interface as the mid interface between the operating system and hypervisor. But most of the times operating system will not know that there is one hypervisor layer underneath it. But of course, we will see several levels of usage of hypervisors and how we are virtualizing the entire system, but yeah. So, basically this is the software stack that we are talking about and several level of interfaces, system call interface, lying between your user space and OS and hyper call interface which is lying around between your operating system and hypervisor ok. (Refer Slide Time: 12:54) So, next talking about interfaces, it is now important to know what is this system level or system call interface and some other interfaces also we need to know to get our knowledge around the virtualization layers because virtualization will be happening in several layers of our computing system. And that is why we need to know several interfaces that are present inside our computation inter compute system, computing system in different positions. Because when we are 90 talking about the virtualization, essentially what we are trying to do is that we are trying to virtualize the interfaces. So, let us look at the interfaces that are available in the computer system. So, as you can see here is the ISA the software stack is on top of this. So, basically you have operating system here I will just enable the laser pointer for better visibility. So, here you can see that ISA is here in between and beneath that we have the entire hardware stack. So, you have execution hardware memory translation unit system interconnect bus which is connecting to your controllers main memory and IO devices and controllers. And on top of that you have this interface called ISA; ISA has this two sets of instructions as we call because we have user level ISA which is 7 labelled as 7 and system ISA which is labelled as 8. So, system ISA will be mostly invoked by the operating systems as the privileged instructions of the privilege calls from the instructions to operating systems through this instruction set architecture. Now, on top of operating system we are having these libraries, now these libraries are essentially encapsulating the system calls and all. And these will give one interface to your application program as this two. So, like these libraries and application programs are interacting with this interface two interface. 
So, we will talk about that and an application program also can have the user instructions and can have system call instructions, that are directly directed to your operating system. So, now application programs have two sections open you see here, so 7 is open and 3 is open to your operating system. So, application programs are making system calls to your operating system to get privileged instructions to be executed and application programs directly talking to your hardware through the user ISA ok, so two sections is 3 and 7. Now 3 and 7 if you combine them you will have this Application Binary Interface or ABI widely known as ABI. And if you combine this 2 and 7 interface which is between the library the API mostly. So, API will be the view of the libraries and associated with the users ISA ok. So, now, these ABI and API is very important to understand. So, you see this interface 2 and 7 and 3 and 7 right. So, these two interface are essentially you see here application is talking to your operating system and hardware to these two interfaces, either ABI or API. Now ok so. 91 (Refer Slide Time: 16:54) Now we will talk about the virtualization. So, virtualization makes a real system appear to be a set of systems which are virtualized, but they will appear as real system to the operating software. So, basically let us say you have one piece of machine which you want to virtualize. So that means, you can have one-to-many virtualization, so one physical machine can appear multiple machines, multiple virtual machines. You can have your storage also virtualized, so basically if you have learnt about virtual memory and paging and so on, so that time you have read about virtual memories. So, basically you have one physical disk or physical memory and you want it look like multiple virtual disks right and also the network also may look like as multiple virtual networks. So, this is from one computing system point of view. So, all the components in your computing system you can virtualize. So, basically you are virtualizing your computing system one to many and your storage unit one to many and your network interface one to many and so on. Now you can have many to one virtualization also. So, many physical machines which are networked together and they may appear you can also emulate them to look like one virtual machine altogether ok. So, these two concepts will be very instrumental to understand the concept of cloud computing ok. And then you have many to many virtualization which you can just map however you want. (Refer Slide Time: 19:05) 92 So, we are talking about virtualization. So, basically first we will look at the virtual machines what way we can or different ways we can virtualize. So, we want to virtualize, so basically when we are talking about virtual machines then the CPU plus memory, the IO the entire computing system that we want to emulate as a full computing environment and we want to virtualize it right. And implementing by adding layers of software to this real machine to support your desired virtual machines that will be the representation of your virtual machine. So, we will discuss about that broadly like in different levels of computation how you can in your computing system how you will virtualize, but; Let us get motivated, why we will need this virtual machines creative in the first place right. So, one of the use cases will be you can run multiple OSes in one machine including your legacy OS which is native OS that is already done right. 
We can also have isolation: several processes executing on one machine share the underlying resources, and if you virtualize them you can get much stronger isolation between those processes; that is exactly what virtual machines give you. By better isolation I mean better segregation, a cleaner boundary around the resources the processes share and the way they talk to each other. You can also get enhanced security. Say a user program may be malicious and you want to test whether it is. If you try it on your native OS and it is malicious, the entire OS can get corrupted; but if you try it on a virtual machine, then only the guest OS of that virtual machine gets infected, and you can throw it away and create a new virtual machine. So you get enhanced security as well as better boundaries between processes running on different virtual machines, and we will see that this is one of the important factors that leads to the notion of containers. Then there is live migration of servers: without shutting down your PC or physical machine, you can migrate your servers from one virtual machine to another. You can create virtual environments for testing and development of your processes or user applications. You can also emulate a platform: say you have created a new ISA that will run on a newer system or a newer CPU. You do not have the hardware yet, but you want to emulate it to test how the new ISA performs; at that point you can virtualize and test this targeted ISA inside a virtual machine. You can do on-the-fly optimization as well: the user program runs, and through the virtualization layer you can optimize it on the fly. Realizing ISAs that are not found in physical machines is exactly the platform emulation use case I just mentioned. So all of these are different use cases and motivations for introducing virtual machines, or VMs.