Llama.cpp
LLM inference in C/C++.
Installation
llama.cpp is available in the AUR:
- Install the llama.cpp AUR package for CPU inference.
- Install the llama.cpp-vulkan AUR package for GPU inference.
Usage
The primary executables are llama-cli and llama-server.
llama-cli
llama-cli runs inference directly from the command line:
$ llama-cli -m model.gguf
llama-server
llama-server launches an API server with a built-in WebUI:
$ llama-server --host address --port port -m model.gguf
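Besides the WebUI, llama-server exposes an OpenAI-compatible HTTP API. As a sketch, assuming the server is running on the default 127.0.0.1:8080, a chat completion can be requested with curl:

```shell
# Query the OpenAI-compatible chat endpoint of a running llama-server
# instance (assumes the default host/port of 127.0.0.1:8080).
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello"}
        ]
    }'
```

The response is returned as JSON in the same shape as the OpenAI chat completions API, so existing OpenAI client libraries can usually be pointed at this endpoint.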
Obtaining models
llama.cpp uses models in the GGUF format.
Download from Hugging Face
Download models from Hugging Face using the -hf flag:
$ llama-cli -hf org/model
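The -hf flag also accepts a quantization tag after a colon to select a specific GGUF file from the repository. A sketch, where org/model-GGUF and the Q4_K_M tag are placeholders for a real repository and one of its available quantizations:

```shell
# Download (and cache) the Q4_K_M quantization from a Hugging Face
# repository, then run it; repository name is a placeholder.
llama-cli -hf org/model-GGUF:Q4_K_M
```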
Manual download
Manually download models using wget or curl:
$ wget -c model.gguf
Tips and tricks
Model quantization
Quantization lowers model precision to reduce memory usage.
GGUF models use suffixes to indicate quantization level. Generally, lower numbers (e.g. Q4) use less memory but may reduce quality compared to higher numbers (e.g. Q8).
Knowledge distillation
Knowledge distillation compresses a larger model into a smaller model by training the smaller model to follow the behaviors of the larger model.
GGUF models indicate knowledge distillation using the student-teacher-distill notation, where:
- student represents the smaller model.
- teacher represents the larger model.
Specifying context size
llama.cpp loads the context size from the model by default, and it allocates memory for the whole context window.
If you run out of memory, specify a smaller context size:
$ llama-cli -c 32000 -m model.gguf
Key-value cache quantization
For further memory efficiency, you can quantize the key-value cache.
$ llama-cli -ctk q8_0 -ctv q8_0 -m model.gguf
This, combined with a lower context size, can significantly reduce memory usage.
- Aggressive quantization on keys reduces quality noticeably.
- Aggressive quantization on values is usually better tolerated, but still risks degradation.
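As a sketch combining these options: in most builds, quantizing the value cache requires flash attention to be enabled with -fa (whether flash attention is available, and whether the flag takes an argument, depends on your build and backend, so treat this invocation as an assumption to verify with llama-cli --help):

```shell
# Reduce memory usage: smaller context window plus a quantized KV cache.
# -fa enables flash attention, which is generally required for
# quantizing the value cache (-ctv); backend support is assumed.
llama-cli -c 16384 -fa -ctk q8_0 -ctv q8_0 -m model.gguf
```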
Monitoring GPU utilization
See Graphics processing unit#Monitoring.