CosmoFlow Application Performance on Google Cloud

TimelyForethought2760 avatar
TimelyForethought2760
·
·
Download

Start Quiz

Study Flashcards

11 Questions

What is the purpose of training a network in this context?

To predict physical parameters of the universe

What is TF-IO integrated with in Figure 9?

The DAOS libdfs I/O library

What benefit does the TF-IO integration provide?

Bypassing POSIX and operating system kernel inefficiencies

What was the read bandwidth achieved by DAOS?

96 GiB/s (768 Tbps)

What is notable about DAOS's IOP/s and latency?

Very high IOP/s and remarkably low latency

What is the approximate latency achieved by DAOS?

0.3ms

How does DAOS handle small file reads?

At a rate of 551K/sec

What is the significance of Figure 8?

It reports the performance results of DAOS

What is the random write speed achieved by DAOS?

825K/sec

How does DAOS handle file creation?

At a rate of 1.5M/sec (empty) and 689K/sec (3901 bytes)

What is the key differentiation of DAOS?

Very high IOP/s, MDop/s, and remarkably low latency

Study Notes

DAOS and Google Cloud HPC Performance

  • Dean Hildebrand, Technical Director in the Google Cloud Office of the CTO, praises DAOS' performance, stating it is rare to see such good performance from a single storage system across all four dimensions.
  • The CosmoFlow AI application, leveraging the TensorFlow framework with DAOS, demonstrated high performance during the SC'22 conference.
  • The CosmoFlow training application benchmark is part of the MLPerf HPC benchmark suite, involving the training of a 3D convolutional neural network for N-body cosmology simulation data.

DAOS Configuration and Performance

  • The DAOS configuration for the IO500 benchmark runs consisted of 32 DAOS clients, 17 DAOS servers, and a 102TB storage configuration.
  • The benchmark demonstrated high-bandwidth performance (even exceeding that of Lustre on some workloads) combined with ultra-low-latency storage access and tremendous scalability.
  • DAOS achieved extremely high efficiency, realizing over 94% of the published VM network and Local-SSD bandwidth.

DAOS Features and Benefits

  • DAOS uses a key-value architecture, which avoids many POSIX limitations and differentiates it from other storage solutions.
  • DAOS features low-latency, built-in data protections, and end-to-end data integrity, making it suitable for workloads where small file, small IO, and/or many metadata operations per second (MDop/s) performance is critical.
  • DAOS eliminates many metadata and locking issues of traditional POSIX-based filesystems, providing direct access to both data and metadata.

HPC-in-the-Cloud and Google Cloud HPC Toolkit

  • Google Cloud has the hardware capability to speed the most computationally intensive HPC workloads with fast processors and access to GPU and TPU accelerators.
  • The Google Cloud HPC Toolkit simplifies the process of deploying HPC workloads in the cloud, featuring DAOS as part of its integration.
  • DAOS is recommended for any workload where small file, small IO, and/or many metadata operations per second (MDop/s) performance is critical.

Use Cases and Benefits of DAOS and Cloud Storage

  • Use cases for DAOS and cloud storage include traditional HPC, HPDA, and AI/ML applications.
  • Users can leverage Google Cloud Storage (GCS) for data ingestion from sources across the globe and long-term retention of data at low cost.
  • DAOS can be used for high-performance analysis, drastically reducing the execution time and cost of high-performance applications.

The performance of CosmoFlow AI application on Google Cloud, demonstrated by Google and Intel teams at SC'22 conference, achieving rare TensorFlow-IO performance across all four dimensions.

Make Your Own Quizzes and Flashcards

Convert your notes into interactive study material.

Get started for free
Use Quizgecko on...
Browser
Browser