Elemental Adaptation of Low-Rank Quantization-Aware Training for Large Language Models on Natural Language Processing Tasks
Abstract
The growing capabilities of Large Language Models are accompanied by high memory and computational demands, posing significant challenges for deployment in resource-constrained environments such as edge devices. Traditional Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), and Low-Rank Adaptation (LoRA) methodologies exhibit limitations in accuracy, memory efficiency, and inference efficiency, respectively. The recently proposed Low-Rank Quantization-Aware Training (LR-QAT) serves as a lightweight, memory- and inference-efficient, general extended-pretraining QAT method. This paper presents an elemental adaptation of LR-QAT that makes it feasible to train large models such as GPT-2 Medium and BERT-Base-Uncased on low-resource, consumer-grade GPUs. Our method freezes the pretrained weights, injects simulated low-bit quantization noise following LR-QAT, trains LoRA adapters and quantization parameters within this quantization grid, fuses the adapters into a single full-precision checkpoint, and finally emits BF16, INT8, and NF4 models from one training run. Experimental evaluations on benchmark Natural Language Processing tasks, namely the Stanford Sentiment Treebank (SST-2) and Question-Answering Natural Language Inference (QNLI) tasks from the General Language Understanding Evaluation (GLUE) benchmark, demonstrate that the proposed method edges past both traditional Post-Training Quantization and a customized Quantized LoRA (QLoRA) baseline in task performance, while maintaining comparable memory usage at reduced bit-widths.
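The per-layer mechanism summarized above can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name `LRQATLinear` and the `rank`/`n_bits` parameters are assumptions. The frozen pretrained weight W is mapped onto an integer grid with a learnable scale s, the LoRA product AB is added inside the rounding operation in the LR-QAT style, and a straight-through estimator lets gradients flow to the adapters A, B and the scale while W itself stays frozen.

```python
import torch
import torch.nn as nn


class LRQATLinear(nn.Module):
    """Illustrative LR-QAT-style linear layer (names/shapes are assumptions).

    The frozen weight W is fake-quantized with a learnable scale s, and the
    low-rank adapters live *inside* the integer grid:
        W_hat = s * clamp(round(W / s + A @ B), qmin, qmax)
    Only A, B, and the (log-)scale are trainable.
    """

    def __init__(self, weight: torch.Tensor, rank: int = 8, n_bits: int = 4):
        super().__init__()
        out_f, in_f = weight.shape
        self.register_buffer("W", weight)  # frozen pretrained weight
        # Standard LoRA init: A starts at zero so training begins at W_hat ~ Q(W)
        self.A = nn.Parameter(torch.zeros(out_f, rank))
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        # Learnable quantization scale, parameterized in log space for positivity
        init_s = weight.abs().max() / (2 ** (n_bits - 1) - 1)
        self.log_s = nn.Parameter(torch.log(init_s).reshape(1))
        self.qmin, self.qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.log_s.exp()
        z = self.W / s + self.A @ self.B  # adapters added on the integer grid
        # Straight-through estimator: round() in forward, identity in backward
        z_q = (z.round() - z).detach() + z
        w_hat = s * z_q.clamp(self.qmin, self.qmax)
        return x @ w_hat.t()
```

After training, fusing the adapters reduces to materializing `w_hat` once as a full-precision checkpoint, from which BF16, INT8, and NF4 artifacts can each be re-quantized.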