




As more High-Performance Computing (HPC) and Deep Learning (DL) applications are adapting to scale using GPUs, the communication of GPU-resident data is becoming vital to end-to-end application performance. Among the available MPI operations in such applications, All-to-All is one of the most communication-intensive operations and becomes the bottleneck to efficiently scaling applications to larger GPU systems. Over the last decade, most research has focused on the optimization of large GPU-resident data transfers. However, for state-of-the-art GPU-aware MPI libraries, MPI_Alltoall communication for large GPU-resident data still suffers from poor performance due to the throughput limitation of commodity networks. Meanwhile, the development of GPU-based compression algorithms with high throughput can reduce the volume of data transferred, and recent research on point-to-point online compression with these algorithms has shown potential on modern GPU clusters.

In this paper, we redesign an MPI library to enable efficient collective-level online compression with an optimized host-staging scheme for All-to-All communication. We demonstrate that the proposed design achieves benefits at both the microbenchmark and application levels. At the microbenchmark level, the proposed design can reduce the All-to-All communication latency by up to 87%. For PSDNS, a traditional HPC application, our proposed design can reduce the All-to-All communication latency and total runtime by up to 29.2% and 21.8%, respectively, while ensuring data validation and not affecting the application convergence time. For Microsoft's DeepSpeed, a DL optimization library, the proposed design reduces the MPI_Alltoall runtime by up to 26.4% compared to a state-of-the-art MPI library with point-to-point compression, while ensuring data validation. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate MPI_Alltoall communication for HPC and DL applications.
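As a concrete illustration of the idea described above, the following minimal C/MPI/CUDA sketch shows one way a host-staged All-to-All with online compression could be organized: each per-destination block is compressed on the GPU, staged to the host, exchanged with MPI_Alltoallv (since compressed block sizes vary across destinations), and decompressed on the GPU at the receiver. The compress_on_gpu/decompress_on_gpu helpers are hypothetical placeholders, implemented here as identity copies so the sketch compiles; this is not the paper's actual MVAPICH2-based design and omits its overlap and error-handling optimizations.

/*
 * Minimal sketch of host-staged All-to-All with online GPU compression.
 * compress_on_gpu()/decompress_on_gpu() are hypothetical stand-ins for a
 * real GPU codec (e.g., a ZFP-style fixed-rate kernel).
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Placeholder codec: identity copy; a real compressor would return fewer bytes. */
static size_t compress_on_gpu(const float *d_in, size_t n, void *d_out) {
    cudaMemcpy(d_out, d_in, n * sizeof(float), cudaMemcpyDeviceToDevice);
    return n * sizeof(float);
}
static void decompress_on_gpu(const void *d_in, size_t nbytes, float *d_out) {
    cudaMemcpy(d_out, d_in, nbytes, cudaMemcpyDeviceToDevice);
}

void alltoall_host_staged_compressed(const float *d_send, float *d_recv,
                                     size_t block_elems, MPI_Comm comm) {
    int size;
    MPI_Comm_size(comm, &size);
    size_t block_bytes = block_elems * sizeof(float);

    /* Staging buffers sized for the worst case (no compression gain). */
    char *d_comp, *d_comp_recv;
    cudaMalloc((void **)&d_comp, block_bytes * size);
    cudaMalloc((void **)&d_comp_recv, block_bytes * size);
    char *h_send = malloc(block_bytes * size);
    char *h_recv = malloc(block_bytes * size);
    int *scnt = malloc(size * sizeof(int)), *rcnt = malloc(size * sizeof(int));
    int *sdsp = malloc(size * sizeof(int)), *rdsp = malloc(size * sizeof(int));

    /* 1. Compress each per-destination block on the GPU, then stage to host. */
    int off = 0;
    for (int i = 0; i < size; i++) {
        size_t nbytes = compress_on_gpu(d_send + i * block_elems, block_elems,
                                        d_comp + i * block_bytes);
        scnt[i] = (int)nbytes;
        sdsp[i] = off;
        cudaMemcpy(h_send + off, d_comp + i * block_bytes, nbytes,
                   cudaMemcpyDeviceToHost);
        off += (int)nbytes;
    }

    /* 2. Exchange compressed sizes, then the variable-sized payloads. */
    MPI_Alltoall(scnt, 1, MPI_INT, rcnt, 1, MPI_INT, comm);
    rdsp[0] = 0;
    for (int i = 1; i < size; i++) rdsp[i] = rdsp[i - 1] + rcnt[i - 1];
    MPI_Alltoallv(h_send, scnt, sdsp, MPI_BYTE,
                  h_recv, rcnt, rdsp, MPI_BYTE, comm);

    /* 3. Move each compressed block back to the GPU and decompress it. */
    for (int i = 0; i < size; i++) {
        cudaMemcpy(d_comp_recv + rdsp[i], h_recv + rdsp[i], rcnt[i],
                   cudaMemcpyHostToDevice);
        decompress_on_gpu(d_comp_recv + rdsp[i], rcnt[i],
                          d_recv + i * block_elems);
    }

    free(scnt); free(rcnt); free(sdsp); free(rdsp);
    free(h_send); free(h_recv);
    cudaFree(d_comp); cudaFree(d_comp_recv);
}

Because compressed sizes are only known at runtime, the sketch first exchanges per-block byte counts with an integer MPI_Alltoall before the MPI_Alltoallv of payloads; a production design would additionally overlap compression, host staging, and communication rather than serializing them as done here.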
