CosmoFlow Application Performance on Google Cloud

Study Notes

Dean Hildebrand, Technical Director in the Google Cloud Office of the CTO, praises DAOS' performance, stating it is rare to see such good performance from a single storage system across all four dimensions.
The CosmoFlow AI application, leveraging the TensorFlow framework with DAOS, demonstrated high performance during the SC'22 conference.
The CosmoFlow training application benchmark is part of the MLPerf HPC benchmark suite, involving the training of a 3D convolutional neural network for N-body cosmology simulation data.

The DAOS configuration for the IO500 benchmark runs consisted of 32 DAOS clients, 17 DAOS servers, and a 102TB storage configuration.
The benchmark demonstrated high-bandwidth performance (even exceeding that of Lustre on some workloads) combined with ultra-low-latency storage access and tremendous scalability.
DAOS achieved extremely high efficiency, realizing over 94% of the published VM network and Local-SSD bandwidth.

DAOS uses a key-value architecture, which avoids many POSIX limitations and differentiates it from other storage solutions.
DAOS features low-latency, built-in data protections, and end-to-end data integrity, making it suitable for workloads where small file, small IO, and/or many metadata operations per second (MDop/s) performance is critical.
DAOS eliminates many metadata and locking issues of traditional POSIX-based filesystems, providing direct access to both data and metadata.

Google Cloud has the hardware capability to speed the most computationally intensive HPC workloads with fast processors and access to GPU and TPU accelerators.
The Google Cloud HPC Toolkit simplifies the process of deploying HPC workloads in the cloud, featuring DAOS as part of its integration.
DAOS is recommended for any workload where small file, small IO, and/or many metadata operations per second (MDop/s) performance is critical.

Use cases for DAOS and cloud storage include traditional HPC, HPDA, and AI/ML applications.
Users can leverage Google Cloud Storage (GCS) for data ingestion from sources across the globe and long-term retention of data at low cost.
DAOS can be used for high-performance analysis, drastically reducing the execution time and cost of high-performance applications.

Podcast