Unified Parameter-Efficient Adaptation Framework for Quantized Large Language Models
Abstract
Large language models (LLMs) such as GPT-like decoder-only transformers are typically trained with billions of parameters and then adapted to many downstream tasks. Directly fine-tuning all parameters is often infeasible for researchers with commodity hardware because (i) every parameter needs optimizer states, (ii) GPU memory must hold activations, gradients, and optimizer states at once, and (iii) multiple task-specific copies of the model would have to be stored. Parameter-Efficient Fine-Tuning (PEFT) offers a solution: freeze the pretrained backbone and learn only a tiny number of task-specific parameters. Low-Rank Adaptation (LoRA) [1] implements this by expressing the update to a frozen linear map as a rank-r factorization, reducing trainable parameters by up to 10,000× while preserving quality. LoRA+ [3] shows that the original LoRA update can be made better conditioned by using different learning rates for the two low-rank factors. QLoRA [2] demonstrates that this adaptation can be carried out even when the backbone is stored in 4-bit precision, by backpropagating through the quantized model into the LoRA parameters, enabling fine-tuning of 33B–65B models on a single 48 GB GPU. This paper unifies these ideas as a single constrained optimization problem, derives their parameter and memory costs, and includes figures for empirical plots.
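As a rough illustration of the LoRA update described in the abstract, the sketch below shows a frozen linear layer augmented with a trainable rank-r factorization, plus a LoRA+-style optimizer that assigns the two factors different learning rates. This is a minimal PyTorch-style sketch, not code from the paper; the module name LoRALinear, the rank and alpha values, and the learning-rate ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable rank-r update: W + (alpha / r) * B A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained backbone weights
        self.scaling = alpha / r
        # A starts small and random, B starts at zero, so the adapted model
        # is initially identical to the pretrained one.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight is W + scaling * B A; only A and B receive gradients.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# LoRA+-style optimization: the factor B gets a larger learning rate than A
# (the ratio of 16 here is purely illustrative).
layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16.0)
optimizer = torch.optim.AdamW([
    {"params": [layer.A], "lr": 1e-4},
    {"params": [layer.B], "lr": 1.6e-3},
])
```

Only the r*(in_features + out_features) adapter entries are trainable, which is the source of the parameter and memory savings discussed above; in a QLoRA-style setup the frozen base weights would additionally be stored in 4-bit precision.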
License
Copyright (c) 2025 International Journal of Natural Sciences and Engineering Innovations

This work is licensed under a Creative Commons Attribution 4.0 International License.