GGUF is the single-file format that packages models for local inference

GGUF (GGML Universal Format) is a binary file format designed for fast loading and saving of models for inference with the GGML library and tools like llama.cpp. Its defining property is that a single GGUF file is self-contained: it holds the model weights together with all the metadata and hyperparameters needed to load and run the model, so no separate config files or external dependencies are required.

The format succeeded the earlier GGML, GGMF, and GGJT formats, which lacked a flexible way to add new information. GGUF fixes this by storing hyperparameters as typed key-value metadata rather than a fixed list of untyped values, which means new fields can be added without breaking older files. It is also designed to be memory-mappable (mmap), so a model can be loaded quickly without copying the whole file into memory, and simple enough to read with minimal code.
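As a rough illustration of the typed key-value design, the sketch below reads the header and metadata block of a GGUF file using only Python's standard library. It assumes a little-endian file of version 2 or later, where the tensor and key-value counts are 64-bit. The names `read_gguf_metadata`, `_unpack`, `_read_string`, and `_read_value` are hypothetical helpers invented for this example, not part of llama.cpp or any GGUF library. The sketch stops after the metadata and does not read tensor descriptors or tensor data.

```python
import io
import struct

# Value-type codes from the GGUF specification, mapped to struct format
# characters. Codes 8 (string) and 9 (array) are handled separately.
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
            6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _unpack(stream, fmt):
    """Read one little-endian value with struct format `fmt`."""
    size = struct.calcsize("<" + fmt)
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack("<" + fmt, data)[0]


def _read_string(stream):
    """Read a GGUF string: a uint64 byte length followed by UTF-8 bytes."""
    length = _unpack(stream, "Q")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8")


def _read_value(stream, type_code):
    """Read one metadata value of the given type code."""
    if type_code in _SCALARS:
        return _unpack(stream, _SCALARS[type_code])
    if type_code == _STRING:
        return _read_string(stream)
    if type_code == _ARRAY:
        # An array stores its element type, then its length, then the items.
        item_type = _unpack(stream, "I")
        count = _unpack(stream, "Q")
        return [_read_value(stream, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {type_code}")


def read_gguf_metadata(stream):
    """Return (version, tensor_count, metadata dict) from a binary stream."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _unpack(stream, "I")
    if version < 2:
        # Assumption of this sketch: version 1 used 32-bit counts and
        # is not handled here.
        raise ValueError("GGUF version 1 is not supported by this sketch")
    tensor_count = _unpack(stream, "Q")
    kv_count = _unpack(stream, "Q")
    metadata = {}
    for _ in range(kv_count):
        key = _read_string(stream)
        type_code = _unpack(stream, "I")
        metadata[key] = _read_value(stream, type_code)
    return version, tensor_count, metadata
```

To try it on a real file, open the file in binary mode and pass the file object, for example `read_gguf_metadata(open("model.gguf", "rb"))`, where `model.gguf` is a placeholder path. Because every value carries its own type code, a reader like this can parse keys it has never seen before, which is what allows new fields to be added without changing the parser.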

GGUF has become the de facto distribution format for local large language models. Quantized open-weight models are shared as GGUF files on model hubs such as Hugging Face and run directly by llama.cpp, Ollama, and many desktop applications.
