# Run AI Locally with llama.cpp
Run GGUF language models entirely on your Android device — no internet connection, no API key, and no cost per query. Every token is generated on your hardware, so your conversations stay completely private.
## How it works
Maid uses llama.cpp — the widely-used open-source inference engine — compiled for Android via the llama.rn React Native library. When you select the Llama provider in Maid, the app loads a GGUF model file directly into memory and runs inference on the device CPU. No network requests are made at any point.
GGUF is the standard model format for llama.cpp. It packages the model weights, tokenizer, and metadata into a single file. Maid can load any GGUF file — either from its built-in catalogue or from a file you provide yourself.
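To make the single-file format concrete, here is a minimal sketch (in Python, purely illustrative — Maid itself does not do this in Python) of reading the fixed GGUF header: a 4-byte magic, a uint32 version, and two uint64 counts for tensors and metadata keys, per the GGUF specification:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (per the GGUF spec): magic,
    version, tensor count, and metadata key/value count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "kv_count": n_kv}

# Synthetic header for illustration: GGUF v3, 2 tensors, 5 metadata keys.
header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(header))  # → {'version': 3, 'tensor_count': 2, 'kv_count': 5}
```

Everything after this header (metadata key/value pairs, tensor descriptors, then the weights themselves) lives in the same file, which is why a single GGUF download is all Maid needs.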
Performance depends on your device's CPU and available RAM. On a modern flagship Android phone (e.g. Pixel 9 Pro, Samsung Galaxy S24) you can expect 10–20 tokens per second with a 1–3B parameter Q4 model. Smaller or more aggressively quantized models run faster; larger models require more RAM and may not load on lower-end devices.
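For a rough sense of what those rates mean in practice, a back-of-envelope calculation (the 15 tok/s figure is just an assumed mid-range value from the estimate above):

```python
def response_time_s(tokens: int, tokens_per_second: float) -> float:
    """Rough wall-clock time to generate a reply, ignoring prompt processing."""
    return tokens / tokens_per_second

# A 150-token reply at an assumed 15 tok/s on a flagship phone:
print(round(response_time_s(150, 15.0), 1))  # → 10.0
```

So a typical chat reply takes on the order of ten seconds on flagship hardware, longer on mid-range devices or with larger models.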
## Quick start
### Built-in model catalogue
Maid ships with a curated selection of GGUF models sourced from Hugging Face. These cover a range of sizes from 1B to 4B parameters, suitable for most Android hardware. The catalogue includes:
| Model | Parameters | Best for |
|---|---|---|
| LFM 2.5 1.2B Thinking | 1.2B | Reasoning tasks, fast devices |
| Qwen3 4B | 4.0B | General purpose, higher quality |
| Phi 3 Mini 4K Instruct | 3.8B | Instruction following |
| TinyLlama 1.1B Chat | 1.1B | Very low RAM, quick responses |
| Gemma 2 2B IT | 2.0B | Balanced speed and quality |
| Gemma 3 1B IT | 1.0B | Low-end devices |
| Gemmasutra Mini 2B v1 | 2.0B | Creative tasks |
| Gemmasutra Small 4B v1a | 4.0B | Creative tasks, higher quality |
| Qwen2.5 1.5B Instruct | 1.5B | Fast general assistant |
| Llama 3.2 1B Instruct | 1.0B | Lightweight Meta model |
| Llama 3.2 3B Instruct | 3.0B | Meta model, good quality |
| Tesslate Tessa T1 3B | 3.0B | Instruction following |
### Choosing a quantization
Each model is available in multiple quantizations. Quantization reduces the precision of the model weights to shrink the file size and memory footprint, at the cost of some output quality. Choosing the right quantization is mostly a trade-off between your device's available RAM and how much quality degradation you can accept.
| Quantization | Size | Quality | Recommendation |
|---|---|---|---|
| Q2_K / Q3_K | Smallest | Lowest | Only if RAM is very limited |
| Q4_K_M | Medium | Good | Best default for most devices |
| Q5_K_M / Q6_K | Larger | Better | Flagship devices with 8 GB+ RAM |
| Q8_0 | Large | Near-lossless | 12 GB+ RAM only |
| F16 / BF16 | Full size | Original | Not recommended on mobile |
Q4_K_M is the recommended starting point for most Android devices. It offers a good balance of response quality and file size, and will load comfortably on devices with 6 GB of RAM when using a 1–3B parameter model.
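If you want a rough feel for how quantization maps to file size, you can estimate it from the parameter count and the effective bits per weight. The bit figures below are assumed approximations for llama.cpp's quant types (actual sizes vary by model architecture), so treat this as a sketch:

```python
# Approximate effective bits per weight for common llama.cpp quant types
# (assumed rough figures; real files vary by architecture and metadata).
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in GB from parameter count and quant type."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

# A 3B model at Q4_K_M comes out around 1.8 GB:
print(round(file_size_gb(3.0, "Q4_K_M"), 2))  # → 1.8
```

Remember the model also needs working memory on top of the file itself (KV cache, activations), so leave headroom above the estimated size.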
### Loading a custom GGUF file
You can load any GGUF model file you have obtained elsewhere: a model downloaded directly from Hugging Face, one you converted yourself, or a fine-tune shared by the community.
## Performance tips
Local inference is CPU-bound on most Android devices. Here are a few things that help get the best performance:
- Close background apps before loading a model to free up RAM.
- Start with Q4_K_M before trying larger quantizations.
- 1B and 2B models will run noticeably faster than 3B–4B models on mid-range hardware.
- Keep the screen on while the model is generating — Android may throttle the CPU when the display is off.
- Plug in to charge if running longer sessions; sustained CPU load drains the battery quickly.
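To check what your own device actually achieves, a simple timing harness is enough. This is a sketch: `generate` stands in for whatever call produces the tokens, and is not a Maid or llama.rn API.

```python
import time

def measure_tps(generate, prompt: str, n_tokens: int) -> float:
    """Time one generation call and return observed tokens per second.
    `generate` is any callable that produces n_tokens tokens (placeholder)."""
    start = time.perf_counter()
    generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

Run it at least twice and keep the later number: the first call typically includes one-off warm-up cost such as loading weights into the CPU caches.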