
Vision Models with llama.cpp

Send images to a vision-capable AI model running entirely on your Android device. No internet connection, no cloud upload — image understanding happens on your hardware.

How it works

Vision models in llama.cpp are multimodal — they can process both text and images. They consist of two parts: the base language model (a standard GGUF file) and a multimodal projector, sometimes called a CLIP or vision adapter.

The projector is a small auxiliary model that encodes images into a format the language model can understand. It is distributed as a separate file with the extension .mmproj or .gguf. Both files must be loaded together — the base model alone cannot process images.

Once both files are loaded, an image attachment button becomes active in the prompt input bar. You can attach one or more photos from your gallery, type an optional message, and the model will respond with an understanding of both the image and your text.

What you need

You need two files for each vision model — a matched pair of base model and projector. They must correspond to the same model architecture; mismatched pairs will not work. Common vision-capable GGUF models and their projectors can be found on Hugging Face. Some examples include:

| Model family | Base GGUF | Projector file |
|---|---|---|
| LLaVA 1.5 | llava-1.5-*.gguf | mmproj-llava-1.5-*.gguf |
| LLaVA 1.6 / LLaVA-NeXT | llava-v1.6-*.gguf | mmproj-llava-v1.6-*.gguf |
| Moondream 2 | moondream2-*.gguf | moondream2-mmproj-*.gguf |
| Gemma 3 Vision | gemma-3-*-it-*.gguf | mmproj-gemma-3-*.gguf |
Search Hugging Face for the model name followed by "GGUF" to find quantized versions. The projector file is usually in the same repository as the base model.

Quick start

  1. Download a matched vision model GGUF and its projector file to your Android device. Transfer them via USB, a file manager, or a cloud storage app.
  2. In Maid, go to Settings and select Llama from the API dropdown.
  3. Tap Add Model File and select the base .gguf model file using the file picker. Once added, select it from the Model dropdown.
  4. Tap Add Projector File and select the projector file (.mmproj or .gguf). The projector is automatically linked to the currently active model.
  5. Tap Load Model. Maid loads both the base model and projector. When vision is confirmed, the image icon becomes active in the prompt input bar.
  6. Return to the chat screen. Tap the image icon to select one or more photos from your gallery, type an optional message, and tap Send.

Projector matching rules

Maid applies the following logic to decide whether a projector will be used:

  • The projector key matches the selected model key exactly, or
  • The model was loaded from a local file (model key ends in (local)).

In practice this means that if you load both files using Add Model File and Add Projector File, they are always paired correctly, because the projector is associated with the locally loaded model.
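The two rules above amount to a single check. This is an illustrative sketch with hypothetical names, not Maid's actual source:

```python
def projector_is_used(projector_key: str, model_key: str) -> bool:
    """Mirror of the pairing rules described above (illustrative only).

    A projector is applied when its key matches the selected model's
    key exactly, or when the model was loaded from a local file —
    assumed here to be marked by a "(local)" suffix on the key.
    """
    return projector_key == model_key or model_key.endswith("(local)")
```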

Attaching images

Once a compatible projector is loaded and vision is active:

  1. Tap the image icon to the left of the action button. Maid will request photo library access on first use.
  2. Select one or more images from your photo library.
  3. Selected images appear as thumbnails above the input field. Tap the × on any thumbnail to remove it before sending.
  4. Type your message (optional) and tap Send.
Images are encoded as base64 and passed directly to the local model. They are never uploaded to any server or cloud service.
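For reference, base64-encoded images are conventionally wrapped in an OpenAI-style message with a data URI, which is the format llama.cpp-based chat stacks commonly accept. The field names below follow that common convention and are not a description of Maid's internal wire format:

```python
import base64

def image_message(image_bytes: bytes, prompt: str) -> dict:
    """Package an image and prompt as an OpenAI-style chat message.

    Illustrative of the common data-URI convention for local
    multimodal runtimes; Maid's internals may differ.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }
```

The whole payload stays on-device: the data URI is consumed directly by the local runtime rather than sent over the network.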

Performance considerations

Vision models are more memory-intensive than text-only models of the same parameter count, because the projector adds additional memory overhead and image tokens consume context space. Keep the following in mind:

  • Start with smaller models (1B–3B) and Q4_K_M quantization to keep RAM usage manageable.
  • Image processing happens before text generation — expect a brief pause after sending an image.
  • Close background apps before loading a vision model to maximise available RAM.
  • Larger or higher-resolution images increase processing time and context token usage.
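To get a feel for the context cost, here is a rough budget calculation. The 576 tokens-per-image figure is an assumption based on a LLaVA-1.5-style projector (a 24×24 patch grid at 336 px input); other architectures use different and sometimes much larger counts:

```python
def estimated_context_use(n_images: int, text_tokens: int,
                          tokens_per_image: int = 576) -> int:
    """Rough context budget for a multimodal prompt.

    576 tokens/image assumes a LLaVA-1.5-style projector (24x24
    patch grid). Treat this as an order-of-magnitude estimate only.
    """
    return n_images * tokens_per_image + text_tokens

# e.g. two photos plus a ~50-token question: 2 * 576 + 50 = 1202
# tokens consumed before the model writes a single token of reply.
```

Against a typical 4096-token context, two images already claim over a quarter of the budget, which is why attaching many photos at once shortens the usable conversation.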