VMware Private AI Foundation with NVIDIA
Summary
These notes summarize the VMware Private AI Foundation with NVIDIA guide. They cover generative AI, large language models, and NVIDIA GPU configuration, including deep learning and machine learning concepts and the available GPU configuration modes. They are aimed at cloud and DevOps engineers.
Full Transcript
**VMware Private AI Foundation with NVIDIA**

**Generative AI and Large Language Models**

1. Artificial Intelligence - mimicking the intelligence or behavioral pattern of humans or any other living entity.
2. Machine Learning - a computer can learn from data without using a complex set of explicit rules; mainly based on training a model with datasets.
3. Deep Learning - a technique for performing machine learning inspired by our brain's own network of neurons.
4. Generative AI - a form of AI in which LLMs offer human-like creativity, reasoning, and language understanding.
5. LLMs have revolutionized natural language processing tasks, enabling machines to understand, generate, and interact with human language in a human-like manner.
6. LLM examples - (Chat)GPT-4, MPT, Vicuna, and Falcon have gained popularity because of their ability to process vast amounts of text data and produce coherent and contextually relevant responses.
7. Components of LLMs:
   - Deep-learning (transformer) neural nets
   - Hardware accelerators
   - Machine learning software stack
   - Pre-training tasks
   - Fine-tuning tasks
   - Inference (prompt completion) tasks

**Architecture and Configuration of NVIDIA GPUs in Private AI Foundation**

- GPUs are preferred over CPUs to accelerate computational workloads in modern high-performance computing (HPC) and machine learning or deep learning landscapes.
- A GPU has significantly more cores than a CPU, and these cores can be used for processing tasks in parallel.
- A GPU is tolerant of memory latency because it is designed for higher throughput.
- A GPU works with fewer, smaller cache layers because it has more components dedicated to computation.
- Comparison:
  - CPU-only virtualization stack: apps & VMs, hypervisor, server.
  - NVIDIA with GPU: the same stack with NVIDIA GPUs and driver software added (see the component stack later in these notes).

**Configuration Modes**

- Dynamic DirectPath I/O (passthrough) mode:
  - The entire GPU is allocated to a specific VM-based workload.
- NVIDIA vGPU (shared GPU):
  - Multiple running VM workloads (or Tanzu worker node VMs) on a host have direct access to parts of the physical GPU at the same time.
  - Time-slicing mode:
    - Workloads share a physical GPU and operate in series.
    - vGPU processing is scheduled between multiple VM-based workloads using best effort, equal shares, or fixed shares.
    - This is the default setting.
    - Supported by NVIDIA A30, A100, and H100 devices.
    - Can be configured to support one VM to one full GPU, or one VM to multiple GPUs.
    - Best used when:
      - Resource contention is not a priority.
      - You want maximum GPU utilization by running as many workloads/VMs as possible.
      - 100% of the cores can be given to a single workload for a fraction of a second.
      - Large workloads need to consume more than one physical GPU device.
  - MIG mode (Multi-Instance GPU mode):
    - Fractions a physical GPU into multiple smaller GPU instances.
    - Helps to maximize utilization of GPU devices.
    - A GPU can be fractioned into a maximum of 7 slices, each individually represented as a vGPU profile.
    - Isolates internal hardware resources and pathways in a GPU device.
    - Enabled with the nvidia-smi command at the ESXi host level, after the NVIDIA host vGPU manager driver is installed (a minimal sketch follows this list).
    - Best used for:
      - Multiple workloads that need to operate in parallel.
      - GPU resources that need to be shared by multiple VMs in parallel.
      - Allocating 1-7 physical slices of a GPU to a single workload.
      - Workloads that need a secure, dedicated, and predictable level of performance.
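As a rough illustration of the bullet above about enabling MIG from the ESXi host, the following shell sketch first checks that the NVIDIA host vGPU manager VIB is present and then turns on MIG mode for one GPU. The GPU index and grep pattern are assumptions for illustration; the exact procedure (and any required host reboot) comes from the NVIDIA/VMware documentation rather than these notes.

```shell
# Hedged sketch, run in the ESXi shell. Assumes the NVIDIA host vGPU
# manager driver (VIB) is already installed.

# Confirm the NVIDIA VIB is present on the host.
esxcli software vib list | grep -i nvidia

# Enable MIG mode on GPU index 0 (the GPU must be idle; a GPU reset or
# host reboot is typically needed before the change takes effect).
nvidia-smi -i 0 -mig 1

# List the GPU instance profiles the device supports (up to 7 slices).
nvidia-smi mig -lgip
```

Once MIG is active, each slice surfaces as one of the MIG-backed vGPU profiles referenced later in these notes.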
**Component Stack**

- Apps and VMs
- NVIDIA compute driver (guest OS)
- NVIDIA host software (VIB)
- VMware vSphere
- NVIDIA GPU
- NVIDIA-certified system

**Workflow to Configure an NVIDIA GPU in VCF**

1. ESXi host configuration:
   - Add NVIDIA GPU PCIe device(s).
   - Enable SR-IOV.
   - Pre-image ESXi with the NVIDIA VIB.
   - Enable MIG mode if desired.
2. SDDC Manager configuration:
   - Commission host(s) into the VCF inventory.
   - Cluster assignment: assign host(s) to a workload domain cluster.
3. VM/TKG configuration:
   - Configure the vGPU profile: allocate vGPU resources (time sharing or MIG).
   - Configure the NVIDIA guest driver: install and configure the NVIDIA guest driver in the workload.

**Assigning a vGPU Profile to a VM -- Time Slicing**

- The default: equal shares of GPU resources based on preconfigured profiles.
- The profile name consists of 4 parts.

**Assigning a vGPU Profile to a VM -- MIG**

- 1-7 slices.
- The profile name consists of 5 parts.

**Creating a VM Class for a TKG Worker Node VM**

- To run a Tanzu Kubernetes Grid worker node VM with a GPU, you must create a VM class (a hedged sketch follows this section).
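The notes state that a VM class is required but do not reproduce one, so here is a minimal, hedged sketch of a vSphere with Tanzu VirtualMachineClass carrying a vGPU device. The class name and CPU/memory sizing are invented for illustration, and grid_a100-3-20c is only an example of the 5-part MIG profile naming mentioned above (a time-sliced name such as grid_a100-40c has 4 parts).

```shell
# Hedged sketch (assumed names/values): create a VM class for a GPU-enabled
# TKG worker node in vSphere with Tanzu.
kubectl apply -f - <<'EOF'
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: vmclass-a100-mig            # hypothetical class name
spec:
  hardware:
    cpus: 8                          # illustrative sizing
    memory: 64Gi
    devices:
      vgpuDevices:
        # MIG-backed, 5-part profile name: grid + board + slices + GB + type.
        # A time-sliced profile such as grid_a100-40c has 4 parts.
        - profileName: grid_a100-3-20c
EOF
```

After applying the class and adding it to the Supervisor namespace, a TKG cluster definition can reference it for its worker nodes; running nvidia-smi inside a node then confirms that the guest driver sees the vGPU.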
**NVIDIA GPUDirect RDMA**

- 10x performance.
- Direct communication between NVIDIA GPUs.
- Remote Direct Memory Access (RDMA) gives direct access to GPU memory.

**GPUs for Machine Learning**

- Graphics processing units (GPUs) are preferred over CPUs to accelerate computational workloads in modern high-performance computing (HPC) and machine learning or deep learning landscapes:
  - Latency versus throughput: CPUs are optimized to reduce latency for processing tasks in a serialized way; GPUs focus on high throughput volumes.
  - A GPU has significantly more cores than a CPU. These additional cores can be used for processing tasks in parallel.
  - The GPU architecture is tolerant of memory latency because it is designed for higher throughput.
  - A GPU works with fewer, relatively small memory cache layers because it has more components dedicated to computation.

**NVIDIA NVLink**

- A piece of hardware that allows a high-speed connection between multiple GPUs on the same server.
- Provides simplified device consumption with device groups:
  - Available in VCF 5.1.
  - Groups of multiple PCIe devices share a common PCIe switch or a direct interconnect (NVLink).
  - A device group is defined at the hardware layer and presented to vSphere.
  - It is added to a virtual machine as a single unit.

**NVIDIA NVSwitch**

- Connects multiple NVLinks to provide all-to-all GPU communication at full NVLink speed, both within a single node and between nodes.
- Increases the speed of GPU-to-GPU communication for larger AI/ML workloads.
- Up to 8 GPUs can be on the same host.
- All 8 GPUs (or a subset) can be allocated to the same VM with the vSphere device-group capability.
- Communication traffic and CPU overhead are significantly reduced.

**Private AI Foundation with NVIDIA Architecture and Components**

- A platform for provisioning AI workloads on ESXi hosts with NVIDIA GPUs.
- Configure and control access to AI- and machine-learning-optimized resources for on-demand developer access.
- Secure and manage the lifecycle of AI infrastructure using familiar tools, without the need to manage a disparate AI/ML silo.
- Use vSphere vMotion migration and DRS initial placement with NVIDIA-powered GPU workloads.
- Use cases:
  - Development: cloud and DevOps engineers provision AI workloads, including Retrieval-Augmented Generation (RAG), in the form of deep learning VMs; data scientists use the platform for AI development.
  - Production: cloud admins provide DevOps engineers with a Private AI Foundation with NVIDIA environment for production-ready AI workloads on Tanzu Kubernetes Grid clusters on vSphere with Tanzu.
- Components for AI workloads in Private AI Foundation with NVIDIA:
  - vSphere Lifecycle Manager:
    - All hosts in a cluster require the same GPU device and image.
    - NVIDIA AI Enterprise suite licensing is required.
  - VCF Tanzu Kubernetes Grid:
    - GPU-enabled TKG VMs must be manually powered off before vSphere Lifecycle Manager operations.
    - Re-instantiate the TKG worker node VM on another host.
  - VCF vSphere cluster:
    - GPU-enabled VMs must be manually powered off before vSphere Lifecycle Manager operations.
    - vMotion IS supported, but for maintenance operations ONLY (non-vSphere Lifecycle Manager operations).

**Terminology**

- DirectPath I/O -- passthrough mode in which the entire GPU is allocated to a specific VM-based workload (see Configuration Modes above).
- SR-IOV -- allows a single PCIe device under a single root port to appear as multiple separate physical devices to the hypervisor or guest OS.
- Time-slicing -- workloads operate in series; contention is not a priority (also called time sharing).
- Multi-Instance GPU (MIG) -- workloads operate in parallel; maximizes utilization of GPU devices and provides dynamic scalability by fractioning a physical GPU device into smaller instances.
- NVIDIA NVSwitch -- ties multiple GPUs together by connecting NVLinks for all-to-all GPU communication.
- NVIDIA AI Enterprise Suite -- a cloud-native suite of AI and data analytics software, optimized and certified, that includes key enabling technologies for rapid deployment, management, and scalability.
- NVIDIA GPUDirect RDMA -- a direct path for data exchange between the GPU and third-party peer devices using standard features of PCIe; examples include network interfaces and video or storage adapters.
- NVLink bridge -- hardware that allows connections between multiple GPUs on the SAME server; a point-to-point connection between any 2 GPUs (see the topology sketch below).
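To connect the NVLink and NVSwitch terminology above to a running host, the stock nvidia-smi tool can print the GPU interconnect topology. This is a generic NVIDIA driver command rather than anything specific to this guide:

```shell
# Print the GPU-to-GPU interconnect matrix. Entries such as NV1..NVn mean
# the pair is linked by that many NVLinks; PIX/PHB/SYS indicate the path
# goes through PCIe or the CPU/system instead.
nvidia-smi topo -m
```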