# Vision Models with llama.cpp
Send images to a vision-capable AI model running entirely on your Android device. No internet connection, no cloud upload — image understanding happens on your hardware.
## How it works
Vision models in llama.cpp are multimodal — they can process both text and images. They consist of two parts: the base language model (a standard GGUF file) and a multimodal projector, sometimes called a CLIP or vision adapter.
The projector is a small auxiliary model that encodes images into a format the language model can understand. It is distributed as a separate file with the extension .mmproj or .gguf. Both files must be loaded together — the base model alone cannot process images.
Once both files are loaded, an image attachment button becomes active in the prompt input bar. You can attach one or more photos from your gallery, type an optional message, and the model will respond with an understanding of both the image and your text.
## What you need
You need two files for each vision model — a matched pair of base model and projector. They must correspond to the same model architecture; mismatched pairs will not work. Common vision-capable GGUF models and their projectors can be found on Hugging Face. Some examples include:
| Model family | Base GGUF | Projector file |
|---|---|---|
| LLaVA 1.5 | llava-1.5-*.gguf | mmproj-llava-1.5-*.gguf |
| LLaVA 1.6 / LLaVA-NeXT | llava-v1.6-*.gguf | mmproj-llava-v1.6-*.gguf |
| Moondream 2 | moondream2-*.gguf | moondream2-mmproj-*.gguf |
| Gemma 3 Vision | gemma-3-*-it-*.gguf | mmproj-gemma-3-*.gguf |
## Quick start

1. Tap Add Model File and select the base .gguf model file using the file picker. Once added, select it from the Model dropdown.
2. Tap Add Projector File and select the matching projector file (.mmproj or .gguf). The projector is automatically linked to the currently active model.

## Projector matching rules
Maid applies the following logic to decide whether a projector will be used:
- The projector key matches the selected model key exactly, or
- The model was loaded from a local file (its model key ends in (local)).
In practice this means that if you load both files using Add Model File and Add Projector File, they are always paired correctly: the projector is associated with the locally loaded model.
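The rules above amount to a simple predicate. A minimal sketch in Python (the function name is assumed for illustration, not Maid's actual code):

```python
def projector_applies(projector_key: str, model_key: str) -> bool:
    # Rule 1: the projector key matches the selected model key exactly.
    if projector_key == model_key:
        return True
    # Rule 2: the model was loaded from a local file,
    # indicated by a model key ending in "(local)".
    return model_key.endswith("(local)")
```

Either rule alone is sufficient, which is why two locally loaded files always pair up.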
## Attaching images
Once a compatible projector is loaded and vision is active:
- Tap the image icon to the left of the action button. Maid will request photo library access on first use.
- Select one or more images from your photo library.
- Selected images appear as thumbnails above the input field. Tap the × on any thumbnail to remove it before sending.
- Type your message (optional) and tap Send.
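Under the hood, multimodal chat APIs typically represent such a message as a list of text and image parts. The sketch below uses the OpenAI-style content-parts format that llama.cpp's server accepts; it is illustrative only, and Maid's internal representation may differ:

```python
import base64

def build_vision_message(text: str, images: list[bytes]) -> dict:
    """Build a user message carrying zero or more images plus optional
    text, in the OpenAI-style content-parts format (images embedded as
    base64 data URIs)."""
    content = [
        {
            "type": "image_url",
            "image_url": {
                "url": "data:image/jpeg;base64,"
                       + base64.b64encode(img).decode("ascii")
            },
        }
        for img in images
    ]
    if text:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}
```

Each attached thumbnail becomes one image part; removing a thumbnail before sending simply drops that part from the list.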
## Performance considerations
Vision models are more memory-intensive than text-only models of the same parameter count: the projector adds its own memory footprint, and image tokens consume context space. Keep the following in mind:
- Start with smaller models (1B–3B) and Q4_K_M quantization to keep RAM usage manageable.
- Image processing happens before text generation — expect a brief pause after sending an image.
- Close background apps before loading a vision model to maximise available RAM.
- Larger or higher-resolution images increase processing time and context token usage.