# Run AI Locally with llama.cpp
Run GGUF language models entirely on your Android device — no internet connection, no API key, and no cost per query. Every token is generated on your hardware, so your conversations stay completely private.
## How it works
Maid uses llama.cpp — the widely-used open-source inference engine — compiled for Android via the llama.rn React Native library. When you select the Llama provider in Maid, the app loads a GGUF model file directly into memory and runs inference on the device CPU. No network requests are made at any point.
GGUF is the standard model format for llama.cpp. It packages the model weights, tokenizer, and metadata into a single file. Maid can load any GGUF file — either from its built-in catalogue or from a file you provide yourself.
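To make the single-file format concrete, here is a minimal sketch (in Python, purely illustrative — Maid itself does not do this in Python) of reading the fixed GGUF header: a 4-byte magic, a uint32 version, and two uint64 counts for tensors and metadata keys, per the GGUF specification:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (per the GGUF spec): magic,
    version, tensor count, and metadata key/value count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "kv_count": n_kv}

# Synthetic header for illustration: GGUF v3, 2 tensors, 5 metadata keys.
header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(header))  # → {'version': 3, 'tensor_count': 2, 'kv_count': 5}
```

Everything after this header (metadata key/value pairs, tensor descriptors, then the weights themselves) lives in the same file, which is why a single GGUF download is all Maid needs.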
Performance depends on your device's CPU and available RAM. On a modern flagship Android phone (e.g. Pixel 9 Pro, Samsung Galaxy S24) you can expect 10–20 tokens per second with a 1–3B parameter Q4 model. Smaller or more aggressively quantized models run faster; larger models require more RAM and may not load on lower-end devices.
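For a rough sense of what those rates mean in practice, a back-of-envelope calculation (the 15 tok/s figure is just an assumed mid-range value from the estimate above):

```python
def response_time_s(tokens: int, tokens_per_second: float) -> float:
    """Rough wall-clock time to generate a reply, ignoring prompt processing."""
    return tokens / tokens_per_second

# A 150-token reply at an assumed 15 tok/s on a flagship phone:
print(round(response_time_s(150, 15.0), 1))  # → 10.0
```

So a typical chat reply takes on the order of ten seconds on flagship hardware, longer on mid-range devices or with larger models.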
## Quick start
### Built-in model catalogue
Maid ships with a curated selection of GGUF models sourced from Hugging Face. These cover a range of sizes from 1B to 4B parameters, suitable for most Android hardware. The catalogue includes:
| Model | Parameters | Best for |
|---|---|---|
| LFM 2.5 1.2B Thinking | 1.2B | Reasoning tasks, fast devices |
| Qwen3 4B | 4.0B | General purpose, higher quality |
| Phi 3 Mini 4K Instruct | 3.8B | Instruction following |
| TinyLlama 1.1B Chat | 1.1B | Very low RAM, quick responses |
| Gemma 2 2B IT | 2.0B | Balanced speed and quality |
| Gemma 3 1B IT | 1.0B | Low-end devices |
| Gemmasutra Mini 2B v1 | 2.0B | Creative tasks |
| Gemmasutra Small 4B v1a | 4.0B | Creative tasks, higher quality |
| Qwen2.5 1.5B Instruct | 1.5B | Fast general assistant |
| Llama 3.2 1B Instruct | 1.0B | Lightweight Meta model |
| Llama 3.2 3B Instruct | 3.0B | Meta model, good quality |
| Tesslate Tessa T1 3B | 3.0B | Instruction following |
### Choosing a quantization
Each model is available in multiple quantizations. Quantization reduces the precision of the model weights to shrink the file size and memory footprint, at the cost of some output quality. Choosing the right quantization is mostly a trade-off between your device's available RAM and how much quality degradation you can accept.
| Quantization | Size | Quality | Recommendation |
|---|---|---|---|
| Q2_K / Q3_K | Smallest | Lowest | Only if RAM is very limited |
| Q4_K_M | Medium | Good | Best default for most devices |
| Q5_K_M / Q6_K | Larger | Better | Flagship devices with 8 GB+ RAM |
| Q8_0 | Large | Near-lossless | 12 GB+ RAM only |
| F16 / BF16 | Full size | Original | Not recommended on mobile |
Q4_K_M is the recommended starting point for most Android devices. It offers a good balance of response quality and file size, and will load comfortably on devices with 6 GB of RAM when using a 1–3B parameter model.
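If you want a rough feel for how quantization maps to file size, you can estimate it from the parameter count and the effective bits per weight. The bit figures below are assumed approximations for llama.cpp's quant types (actual sizes vary by model architecture), so treat this as a sketch:

```python
# Approximate effective bits per weight for common llama.cpp quant types
# (assumed rough figures; real files vary by architecture and metadata).
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in GB from parameter count and quant type."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

# A 3B model at Q4_K_M comes out around 1.8 GB:
print(round(file_size_gb(3.0, "Q4_K_M"), 2))  # → 1.8
```

Remember the model also needs working memory on top of the file itself (KV cache, activations), so leave headroom above the estimated size.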
### Loading a custom GGUF file
You can load any GGUF model file you have obtained elsewhere: a model downloaded directly from Hugging Face, one you converted yourself, or a fine-tune shared by the community.
## Performance tips
Local inference is CPU-bound on most Android devices. Here are a few things that help get the best performance:
- Close background apps before loading a model to free up RAM.
- Start with Q4_K_M before trying larger quantizations.
- 1B and 2B models will run noticeably faster than 3B–4B models on mid-range hardware.
- Keep the screen on while the model is generating — Android may throttle the CPU when the display is off.
- Plug in to charge if running longer sessions; sustained CPU load drains the battery quickly.
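To check what your own device actually achieves, a simple timing harness is enough. This is a sketch: `generate` stands in for whatever call produces the tokens, and is not a Maid or llama.rn API.

```python
import time

def measure_tps(generate, prompt: str, n_tokens: int) -> float:
    """Time one generation call and return observed tokens per second.
    `generate` is any callable that produces n_tokens tokens (placeholder)."""
    start = time.perf_counter()
    generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

Run it at least twice and keep the later number: the first call typically includes one-off warm-up cost such as loading weights into the CPU caches.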