AI - Distributed AI in the Cloud - Joanna Huang
15 Questions

Questions and Answers

What is the main purpose of deep learning?

  • To learn complex patterns (correct)
  • To create large neural networks
  • To modify weights
  • To reduce latency

    True or false: Model parallelism is the most common type of distributed training.

    False

    True or false: Data parallelism does not require communication between the nodes during the processing of a mini batch.

    True

    What is the main difference between model parallelism and data parallelism?

    Model parallelism requires communication between hosts while data parallelism does not

    True or false: XPU is more efficient than CPU when it comes to model parallelism.

    False

    What is the main drawback of model parallelism?

    The bottleneck shifts from compute to communication

    True or false: Model parallelism is when a single device is used to train a large neural network.

    False

    What type of distributed training is the most common?

    Data parallelism

    True or false: Task parallelism divides the task of training a model into multiple subtasks that can be executed on a single worker.

    False

    What type of distributed training divides the task of training a model into multiple subtasks?

    Task parallelism

    What is the main benefit of using data parallelism?

    No communication between nodes during processing

    What type of computer resources are important to consider when using deep learning?

    All of the above (compute, memory, and communication resources)

    What type of distributed training uses multiple workers to train the same model simultaneously?

    Model parallelism

    What type of nodes are more sensitive to communication overhead?

    GPU nodes

    What type of nodes are more efficient?

    CPU nodes

    Study Notes

    • Deep learning is a technique that uses large neural networks to learn complex patterns.
    • A neural network is a model that contains many parameters called weights.
    • The training process involves many iterations in which the weights are slightly modified by adding an update, delta W (a minimal sketch of this update loop appears after these notes).
    • In distributed training, we need to take communication overhead into consideration.
    • The fully connected topology has the lowest latency.
    • Two major types of distributed training are model parallelism and data parallelism.
    • Model parallelism is used when a very large neural network cannot fit on a single device (see the model-parallel sketch after these notes).
    • In model parallelism, the communication between the hosts is intensive. PCIe transaction overhead can be high, and GPU cards must compete for PCIe bus bandwidth.
    • This is the main drawback: in some cases the GPU's compute resources are wasted, because the bottleneck shifts from compute to communication.
    • In data parallelism, there is no communication between the nodes while a mini batch is being processed. The nodes communicate only after a mini batch has been processed, when the model update, delta W, is broadcast (see the data-parallel sketch after these notes).
    • CPU, GPU, and XPU nodes are compared in the context of model parallelism.
    • GPU nodes are more sensitive to communication overhead, and CPU nodes are more efficient.
    • Deep learning requires a large amount of computation and communication between compute resources.
    • There are three main types of distributed training models: data parallelism, model parallelism, and task parallelism.
    • Data parallelism is the most common type of distributed training, where the training data is split between multiple workers.
    • Model parallelism uses multiple workers to train the same model simultaneously, while task parallelism divides the task of training a model into multiple subtasks, which can be executed on multiple workers.
    • The communication overhead in deep learning is high, which is why it is important to consider not only the compute power of the machines but also their memory and communication (network) resources.
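
    Below is a minimal Python/NumPy sketch of the weight-update loop described in the notes, using a toy linear model and plain gradient descent. The data, learning rate, and variable names are illustrative assumptions, not from the lesson; the point is only that each iteration computes an update, delta W, and adds it to the weights.

        import numpy as np

        # Toy linear-regression example (illustrative values, not from the lesson).
        rng = np.random.default_rng(0)
        X = rng.normal(size=(64, 3))              # 64 samples, 3 features
        true_W = np.array([[1.5], [-2.0], [0.5]])
        y = X @ true_W                            # targets from a known linear model

        W = np.zeros((3, 1))                      # the weights being learned
        lr = 0.1                                  # learning rate

        for step in range(200):
            grad = X.T @ (X @ W - y) / len(X)     # gradient of the mean squared error
            delta_W = -lr * grad                  # the update, delta W
            W = W + delta_W                       # weights are slightly modified each iteration

        print(W.ravel())                          # approaches [1.5, -2.0, 0.5]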
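
    The next sketch simulates data parallelism on a single machine, under the same toy-model assumptions: the mini batch is split across hypothetical workers, each worker computes its own delta W with no communication, and only at the end of the step are the local updates averaged (an all-reduce) and the resulting weights shared by every replica.

        import numpy as np

        # Simulated data parallelism: "workers" are just loop iterations here.
        rng = np.random.default_rng(1)
        num_workers = 4
        X = rng.normal(size=(64, 3))
        true_W = np.array([[1.5], [-2.0], [0.5]])
        y = X @ true_W

        W = np.zeros((3, 1))                          # replicated on every worker
        lr = 0.1

        for step in range(200):
            X_shards = np.array_split(X, num_workers)     # split the mini batch across workers
            y_shards = np.array_split(y, num_workers)
            deltas = []
            for Xs, ys in zip(X_shards, y_shards):        # each worker computes independently,
                grad = Xs.T @ (Xs @ W - ys) / len(Xs)     # with no communication during this phase
                deltas.append(-lr * grad)
            delta_W = np.mean(deltas, axis=0)             # all-reduce: average the local updates
            W = W + delta_W                               # broadcast: every replica gets the same W

        print(W.ravel())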
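
    Finally, a rough illustration of model parallelism under the same assumptions: the two layers of a small network are imagined to live on different devices, so the intermediate activations must be communicated inside every forward pass, which is where the communication bottleneck described in the notes comes from.

        import numpy as np

        # Hypothetical two-layer network split across two "devices".
        rng = np.random.default_rng(2)

        W1 = rng.normal(size=(3, 8))        # first layer, imagined to live on device 0
        W2 = rng.normal(size=(8, 1))        # second layer, imagined to live on device 1

        def forward(x):
            h = np.maximum(x @ W1, 0.0)     # computed on device 0 (ReLU activation)
            # in a real setup, h would now cross the PCIe bus or network to device 1
            return h @ W2                   # computed on device 1

        x = rng.normal(size=(4, 3))         # a small batch of inputs
        print(forward(x).shape)             # (4, 1)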


    Description

    Test your knowledge of distributed training in deep learning, including model parallelism, data parallelism, and communication overhead. Explore the challenges and considerations when using CPU, GPU, and XPU nodes for distributed training.
