GRACE: A Compressed Communication Framework for Distributed Machine Learning

Hang Xu, Chen-Yu Ho, Ahmed M. Abdelmoniem, Aritra Dutta, El Houcine Bergou, Konstantinos Karatsenidis, Marco Canini, Panos Kalnis

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Abstract

Powerful computer clusters are nowadays used to train complex deep neural networks (DNNs) on large datasets. As a result, distributed training is increasingly communication-bound, and many lossy compression techniques have been proposed to reduce the volume of transferred data. Unfortunately, it is difficult to reason about the behavior of compression methods, because existing work relies on inconsistent evaluation testbeds and largely ignores the performance impact of practical system configurations. In this paper, we present a comprehensive survey of the most influential compressed-communication methods for DNN training, together with an intuitive classification (i.e., quantization, sparsification, hybrid, and low-rank). Next, we propose GRACE, a unified framework and API that allows for consistent and easy implementation of compressed communication on popular machine learning toolkits. We instantiate GRACE on TensorFlow and PyTorch, and implement 16 such methods. Finally, we present a thorough quantitative evaluation with a variety of DNNs (convolutional and recurrent), datasets, and system configurations. We show that the DNN architecture affects the relative performance among methods. Interestingly, depending on the underlying communication library and the computational cost of compression/decompression, we demonstrate that some methods may be impractical. GRACE and the entire benchmarking suite are available as open source.
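
To make the surveyed classes concrete, below is a minimal sketch of one sparsification technique covered by the survey: top-k gradient sparsification with error-feedback (residual) memory, written in PyTorch. The class and method names (`TopKCompressor`, `compress`, `decompress`) are hypothetical illustrations and do not reflect GRACE's actual API; consult the open-source release for the real interface.

```python
# Minimal sketch of top-k gradient sparsification with error-feedback
# (residual) memory. Illustrative only: these names are hypothetical
# and do NOT reflect GRACE's actual API.
import torch


class TopKCompressor:
    def __init__(self, compress_ratio=0.01):
        self.compress_ratio = compress_ratio
        self.residual = {}  # per-tensor memory of dropped gradient mass

    def compress(self, name, grad):
        # Re-inject the error left over from the previous iteration.
        grad = grad + self.residual.get(name, torch.zeros_like(grad))
        flat = grad.flatten()
        k = max(1, int(flat.numel() * self.compress_ratio))
        # Keep only the k largest-magnitude coordinates.
        _, idx = torch.topk(flat.abs(), k)
        values = flat[idx]
        # Store what was dropped so it can be sent in a later round.
        residual = flat.clone()
        residual[idx] = 0
        self.residual[name] = residual.view_as(grad)
        return (values, idx), grad.shape

    def decompress(self, payload, shape):
        # Rebuild a dense tensor with zeros everywhere except the
        # transmitted coordinates.
        values, idx = payload
        flat = torch.zeros(shape, dtype=values.dtype,
                           device=values.device).flatten()
        flat[idx] = values
        return flat.view(shape)


# Usage: compress a gradient, ship (values, idx), decompress remotely.
comp = TopKCompressor(compress_ratio=0.25)
g = torch.randn(4, 4)
payload, shape = comp.compress("layer1.weight", g)
g_hat = comp.decompress(payload, shape)
```

In a distributed run, only the value/index pairs would be transmitted, cutting traffic roughly in proportion to the compression ratio; the residual memory re-injects the dropped coordinates in later rounds, which is what preserves convergence for many methods of this class.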
Original language: English
Title of host publication: 41st International Conference on Distributed Computing Systems (ICDCS)
Publisher: IEEE
Publication date: 2021
Pages: 561-572
ISBN (Electronic): 978-1-6654-4513-9, 978-1-6654-4514-6
DOIs
Publication status: Published - 2021
Externally published: Yes
Event: 41st International Conference on Distributed Computing Systems, United States
Duration: 7 Jul 2021 - 10 Jul 2021

Conference

Conference: 41st International Conference on Distributed Computing Systems
Country/Territory: United States
Period: 07/07/2021 - 10/07/2021

Bibliographical note

DBLP's bibliographic metadata records provided through http://dblp.org/search/publ/api are distributed under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Although the bibliographic metadata records are provided consistent with CC0 1.0 Dedication, the content described by the metadata records is not. Content may be subject to copyright, rights of privacy, rights of publicity and other restrictions.

Keywords

  • Benchmark
  • Deep Learning
  • Distributed Machine Learning
  • Gradient Compression
  • Survey
