LoRA (Low-Rank Adaptation)

LoRA, short for Low-Rank Adaptation, is one of the most widely used methods for customizing a large model. Instead of retraining all of a model’s billions of parameters, LoRA freezes the original weights and inserts a small number of new, trainable parameters alongside them. Only those small additions are trained. On many tasks the result performs comparably to a fully fine-tuned model, while the training cost and storage are a small fraction of the full approach.

The method was introduced in the 2021 paper “LoRA: Low-Rank Adaptation of Large Language Models” (Edward J. Hu and colleagues at Microsoft, arXiv 2106.09685). The abstract states that, compared with GPT-3 175B fine-tuned with Adam, LoRA “can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.” The technical insight is that the change needed to adapt a model to a new task can be captured by a low-rank update, expressed as the product of two small matrices added to each adapted weight matrix, rather than by rewriting every weight.
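To make the low-rank idea concrete, here is a minimal sketch in plain Python. It assumes a single weight matrix W of shape d x k that stays frozen, plus two small trainable matrices B (d x r) and A (r x k), with the update scaled by alpha / r as in the paper. The function names (matvec, lora_forward, trainable_fraction) and the toy sizes are illustrative assumptions, not taken from any library.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x). W is read but never modified."""
    base = matvec(W, x)                    # frozen pretrained path
    update = matvec(B, matvec(A, x))       # passes through an r-dimensional bottleneck
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_fraction(d, k, r):
    """Share of one d x k layer's parameters that a rank-r adapter trains."""
    return (r * (d + k)) / (d * k)

# Toy sizes, chosen only for illustration.
d, k, r = 6, 4, 2
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]                                  # trainable, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

# Because B starts at zero, the adapted layer initially matches the base layer.
starts_identical = lora_forward(W, A, B, x, alpha=4, r=r) == matvec(W, x)

# For a hypothetical 4096 x 4096 layer with rank 8, the adapter trains
# 8 * (4096 + 4096) parameters out of 4096 * 4096, i.e. 1/256 of the layer.
fraction = trainable_fraction(4096, 4096, 8)
```

This per-layer ratio is not the paper's 10,000-times figure, which refers to a whole model in which only some weight matrices receive adapters. The sketch only shows why the trainable share shrinks as the rank r gets small relative to the layer dimensions.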

LoRA belongs to a broader family called parameter-efficient fine-tuning (PEFT), and it pairs naturally with quantization to fit customization onto modest hardware. The best-known combination is QLoRA, which trains LoRA adapters on top of a base model quantized to 4 bits. A practical consequence is that the trained adapter is small, often megabytes where a full model copy would take gigabytes, so an organization can keep one base model and swap in different lightweight adapters for different departments, tones, or compliance rules.
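The storage and swapping argument can be sketched with some back-of-the-envelope arithmetic and a tiny registry. Everything here is hypothetical: the model shape (64 adapted matrices of 4096 x 4096), the rank of 8, 2 bytes per parameter, and the AdapterRouter class are assumptions chosen for illustration, not a description of any particular product or library.

```python
def adapter_size_mb(num_matrices, d, k, r, bytes_per_param=2):
    """Approximate size in MB of rank-r adapters for num_matrices d x k matrices."""
    return num_matrices * r * (d + k) * bytes_per_param / 1e6

def full_size_mb(num_matrices, d, k, bytes_per_param=2):
    """Approximate size in MB of full copies of the same matrices."""
    return num_matrices * d * k * bytes_per_param / 1e6

class AdapterRouter:
    """Keeps one shared base model reference and many small named adapters."""

    def __init__(self, base):
        self.base = base
        self.adapters = {}

    def register(self, name, adapter):
        self.adapters[name] = adapter

    def retire(self, name):
        # Retiring a customization only drops the small adapter, never the base.
        self.adapters.pop(name, None)

    def resolve(self, name):
        """Return (base, adapter); adapter is None if the name is unknown."""
        return self.base, self.adapters.get(name)

# Hypothetical sizes: the adapter set is roughly 8.4 MB, the full copies roughly 2,147 MB.
small = adapter_size_mb(64, 4096, 4096, 8)
large = full_size_mb(64, 4096, 4096)

router = AdapterRouter(base="shared-base-model")      # placeholder for real weights
router.register("legal", {"rank": 8})                 # placeholder adapter payloads
router.register("support", {"rank": 8})
router.retire("support")
```

Under these assumptions the adapter set is 256 times smaller than the matrices it modifies, which is why keeping one adapter per department is cheap compared with keeping one model copy per department. Real adapter sizes vary with the rank, the number of adapted layers, and the numeric precision.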

Why business readers should care: LoRA is a large part of the reason customizing a high-quality model is within reach of ordinary budgets rather than requiring a data-center training run. It also changes the operational picture: you maintain small, portable adapters instead of full model copies, which lowers cost, speeds iteration, and makes it easier to retire or update a customization. The limit is that LoRA adapts an existing base model: it is well suited to adjusting behavior the model can already approximate, and a poor way to add knowledge or capability the base model fundamentally lacks.

Sources

Related