llama.cpp puts large language models on ordinary laptops

llama.cpp was created by Georgi Gerganov in March 2023, days after Meta’s LLaMA weights leaked and began circulating online. It is a plain C/C++ implementation of large-language-model inference that, in the project’s own words, has no dependencies. It is designed for minimal setup and high performance across a wide range of hardware. Apple silicon is treated as a first-class target, optimized with ARM NEON, Accelerate, and Metal.

The key trick is aggressive quantization: llama.cpp can run models in 2-bit through 8-bit integer formats instead of 16- or 32-bit floats. Memory use scales with the number of bits per weight, so a 7-billion-parameter model that needs about 14 GB at 16 bits needs only about 3.5 GB at 4 bits. That shrinks multi-billion-parameter models enough to fit in the RAM of a normal laptop and run on the CPU at usable speed. Models are stored in the GGUF format, and conversion scripts import weights from other formats. As a result, almost overnight, people could run capable language models entirely offline on hardware they already owned.
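
To make the memory arithmetic concrete, the following Python sketch estimates the weight footprint of a model at several bit widths and applies a simple symmetric block quantization. It is illustrative only: the function names are hypothetical, the 7-billion-parameter figure is just an example, and the scheme is a generic textbook one, not llama.cpp's actual quantization kernels or GGUF block layouts.

```python
# Illustrative sketch only: generic symmetric block quantization.
# This is NOT llama.cpp's real kernel code or any GGUF block layout.


def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_block(values, bits=4):
    """Map a block of floats to signed integers that share one float scale."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit values
    amax = max(abs(v) for v in values)          # largest magnitude in the block
    scale = amax / qmax if amax > 0 else 1.0    # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * x for x in q]


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at three bit widths.
    for bits in (16, 8, 4):
        print(f"7B parameters at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")

    block = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
    scale, q = quantize_block(block)
    restored = dequantize_block(scale, q)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("quantized:", q)
    print("worst-case error:", worst)
```

In this simplified scheme each weight costs 4 bits plus a share of one scale per block, and the rounding error of any value is at most half the scale. Real formats make different trade-offs in block size and scale precision, so their effective bits per weight differ from the nominal figure.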

llama.cpp became the engine of the local-LLM movement. Its ggml tensor library and GGUF format underpin a large ecosystem, including Ollama and many desktop apps, and it shifted expectations about where AI could run, away from data centers and onto personal devices.
