0 citations0 references

DeepSpeed- Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

2022pp. 1–15

Citations Over TimeTop 1% of 2022 papers

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Canbing Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, Yuxiong He

Abstract

The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware requirements, etc. With such diversity, designing a versatile inference system is challenging. DeepSpeed-Inference addresses these challenges by (1) a multi-GPU inference solution to minimize latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU/NVMe/GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. DeepSpeed-Inference reduces latency by 6.4× and increases throughput by 1.5 ×over the state-of-the-art. It enables trillion parameter scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can inference 25 ×larger models than with GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).

Related Papers

→ ESKVS: efficient and secure approach for keyframes-based video summarization framework(2024)9 cited
Using DataGrid Control to Realize DataBase of Querying in VB6.0(2000)
Susquehanna Chorale Spring Concert "Roots and Wings"(2017)
→ DETERMINING QUALITY REQUIREMENTS AT THE UNIVERSITIES TO IMPROVE THE QUALITY OF EDUCATION(2018)
→ ИСПОЛЬЗОВAНИЕ ПОТЕНЦИAЛA СОЦИAЛЬНЫХ ПAРТНЕРОВ В ПОДГОТОВКЕ БУДУЩИХ ПЕДAГОГОВ(2024)