Llama.cpp

From ArchWiki


LLM inference in C/C++.

Installation

llama.cpp is available in the AUR.

Note Ensure you have the appropriate Vulkan driver installed.

Usage

The primary executables are llama-cli and llama-server.

llama-cli

llama-cli is the command-line interface for running models:

$ llama-cli -m model.gguf
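A few commonly used flags, shown here with assumed example values (adjust for your model and hardware): -p sets the prompt, -n limits the number of tokens to generate, and -ngl sets how many layers to offload to the GPU:

$ llama-cli -m model.gguf -p "Hello" -n 128 -ngl 99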

llama-server

llama-server launches an OpenAI-compatible API server with a built-in web UI:

$ llama-server --host address --port port -m model.gguf
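Once running, the server can be queried over HTTP. A minimal example with curl, assuming the server's default host and port of 127.0.0.1:8080:

$ curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'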

Obtaining models

llama.cpp uses models in the GGUF format.

Download from Hugging Face

Download models from Hugging Face using the -hf flag:

$ llama-cli -hf org/model
Warning This may overwrite an existing model file without prompting.

Manual download

Models can also be downloaded manually with wget or curl; wget's -c flag resumes an interrupted download:

$ wget -c model.gguf

Tips and tricks

Model quantization

Quantization lowers model precision to reduce memory usage.

GGUF models use suffixes to indicate quantization level. Generally, lower numbers (e.g. Q4) use less memory but may reduce quality compared to higher numbers (e.g. Q8).
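The memory impact can be sketched with back-of-the-envelope arithmetic: file size scales roughly with bits per weight. A sketch for an assumed 7B-parameter model (real GGUF files add metadata and per-block scales on top of this):

```shell
# Rough GGUF size estimate: parameters * bits_per_weight / 8
# (the 7B parameter count is an assumed example, not a specific model)
params=7000000000
for spec in "Q4 4" "Q8 8" "F16 16"; do
  set -- $spec
  echo "$1: roughly $(( params * $2 / 8 / 1024 / 1024 / 1024 )) GiB"
done
```

So a Q4 quantization of a 7B model needs roughly a quarter of the memory of the unquantized F16 weights.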

Knowledge distillation

Knowledge distillation compresses a larger model into a smaller one by training the smaller model to reproduce the behavior of the larger model.

Distilled GGUF models are typically named using the teacher-Distill-student convention (for example, DeepSeek-R1-Distill-Qwen-7B), where:

  • teacher is the larger model whose behavior was distilled (here, DeepSeek-R1).
  • student is the smaller model that was trained (here, Qwen-7B).

Specifying context size

By default, llama.cpp reads the context size from the model and allocates memory for the entire context window up front.

If you run out of memory, specify a smaller context size with -c:

$ llama-cli -c 32000 -m model.gguf

Key-value cache quantization

For further memory savings, the key-value cache can be quantized:

$ llama-cli -ctk q8_0 -ctv q8_0 -m model.gguf

This, combined with a lower context size, can significantly reduce memory usage.
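The savings can be estimated with simple arithmetic. A sketch, assuming a hypothetical model shape (32 layers, 8 KV heads, head dimension 128; check your model's actual metadata) and treating q8_0 as roughly one byte per element:

```shell
# KV cache size ~= 2 (K and V) * layers * context * kv_heads * head_dim * bytes_per_element
# (model shape below is a hypothetical example, not taken from a specific model)
layers=32 kv_heads=8 head_dim=128 ctx=32768
f16=$(( 2 * layers * ctx * kv_heads * head_dim * 2 ))   # f16: 2 bytes per element
q8=$((  2 * layers * ctx * kv_heads * head_dim * 1 ))   # q8_0: ~1 byte per element
echo "f16: $(( f16 / 1024 / 1024 )) MiB  q8_0: ~$(( q8 / 1024 / 1024 )) MiB"
```

Halving the context size halves both figures again, since cache size scales linearly with context.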

Note
  • Aggressive quantization on keys reduces quality noticeably.
  • Aggressive quantization on values is usually better tolerated, but still risks degradation.

Monitoring GPU utilization

See Graphics processing unit#Monitoring.

See also