Unified Parameter-Efficient Adaptation Framework for Quantized Large Language Models

Authors

  • Yoona Li New York University

Abstract

Large language models (LLMs) such as GPT-like decoder-only transformers are typically trained with billions of parameters and then adapted to many downstream tasks. Directly fine-tuning all parameters is often infeasible for researchers with commodity hardware because (i) every parameter needs optimizer states, (ii) GPU memory must hold activations, gradients, and states at once, and (iii) multiple task-specific copies of the model would have to be stored. Parameter-Efficient Fine-Tuning (PEFT) offers a solution: freeze the pretrained backbone and learn only a tiny number of task-specific parameters. Low-Rank Adaptation (LoRA) [1] implements this by expressing the update to a frozen linear map as a rank-r factorization, reducing trainable parameters by up to 10,000× while preserving quality. LoRA+ [3] shows that the original LoRA update can be made better conditioned by using different learning rates for the two low-rank factors. QLoRA [2] demonstrates that this adaptation remains possible even when the backbone is stored in 4-bit precision, by backpropagating through the quantized model into the LoRA parameters, enabling fine-tuning of 33B–65B models on a single 48GB GPU. This paper unifies these ideas as a single constrained optimization problem, derives their parameter and memory costs, and provides figures for empirical plots.
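
To make the rank-r factorization and the LoRA+ learning-rate split concrete, the following is a minimal PyTorch sketch, not the paper's implementation; the class name, rank, scaling factor, and the 10× learning-rate ratio for the second factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable rank-r update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained backbone weight
        d_out, d_in = base.weight.shape
        # Rank-r factors: delta_W = B @ A has r*(d_in + d_out) trainable parameters,
        # versus d_in * d_out for the frozen weight it updates.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

# LoRA+-style optimizer groups: a larger learning rate for B than for A
# (the 10x ratio here is an assumed example, not a prescribed value).
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
optimizer = torch.optim.AdamW([
    {"params": [layer.A], "lr": 1e-4},
    {"params": [layer.B], "lr": 1e-3},
])
```

For d_in = d_out = 4096 and r = 8, the adapter trains r·(d_in + d_out) = 65,536 parameters against the 16.8M frozen weights of that layer, which illustrates how the aggregate reduction across a full model can reach the orders of magnitude cited above.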

Published

2025-11-17 — Updated on 2025-12-20