
Run AI Locally with llama.cpp

Run GGUF language models entirely on your Android device — no internet connection, no API key, and no cost per query. Every token is generated on your hardware, so your conversations stay completely private.

How it works

Maid uses llama.cpp — the widely-used open-source inference engine — compiled for Android via the llama.rn React Native library. When you select the Llama provider in Maid, the app loads a GGUF model file directly into memory and runs inference on the device CPU. No network requests are made at any point.

GGUF is the standard model format for llama.cpp. It packages the model weights, tokenizer, and metadata into a single file. Maid can load any GGUF file — either from its built-in catalogue or from a file you provide yourself.
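As a concrete illustration of the format: every GGUF file begins with a small fixed header containing the ASCII magic `GGUF`, a version number, and counts of the tensors and metadata key/value pairs stored in the file. A minimal Python sketch (not part of Maid; field layout follows the GGUF v3 header) that parses those fields:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header fields from the first 24 bytes:
    4-byte magic, uint32 version, uint64 tensor count, uint64 KV count."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, = struct.unpack_from("<I", data, 4)
    tensor_count, kv_count = struct.unpack_from("<QQ", data, 8)
    return {"version": version, "tensors": tensor_count,
            "metadata_kvs": kv_count}

# Reading the header from a model file on disk:
# with open("model.gguf", "rb") as f:
#     print(read_gguf_header(f.read(24)))
```

A quick check like this is a convenient way to confirm a downloaded file really is GGUF before transferring it to your device.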

Performance depends on your device's CPU and available RAM. On a modern flagship Android phone (e.g. Pixel 9 Pro, Samsung Galaxy S24) you can expect 10–20 tokens per second with a 1–3B parameter Q4 model. Smaller or more aggressively quantized models run faster; larger models require more RAM and may not load on lower-end devices.
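A rough way to reason about those numbers: token generation mostly streams the full set of weights through memory once per token, so memory bandwidth divided by model size gives an upper bound on tokens per second. A hedged back-of-envelope sketch (the bandwidth and bits-per-weight figures are illustrative assumptions, and real devices fall well short of the bound):

```python
def estimate_tps_upper_bound(params_billion: float,
                             bits_per_weight: float,
                             mem_bandwidth_gbps: float) -> float:
    """Upper bound on decode speed: each generated token reads the
    full weight set from memory once, so speed <= bandwidth / size."""
    model_gb = params_billion * bits_per_weight / 8  # weights in GB
    return mem_bandwidth_gbps / model_gb

# Hypothetical 1.2B model at ~4.5 bits/weight on a phone with
# ~50 GB/s usable memory bandwidth (illustrative numbers):
print(round(estimate_tps_upper_bound(1.2, 4.5, 50), 1))  # ~74 tokens/s
```

Observed speeds of 10–20 tokens per second sit well below this ceiling, which reflects CPU compute limits and thermal throttling on phones rather than bandwidth alone.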

Quick start

1. Open Maid and tap the hamburger menu (≡) in the top-left corner to open the sidebar, then tap the settings icon.
2. At the top of the Settings screen, tap the API dropdown and select Llama.
3. Tap the Model dropdown. This opens the built-in model catalogue. Select a model and quantization, then confirm — the download starts automatically.
4. Once the download finishes, tap Load Model. Maid will initialize llama.cpp with the selected file. Loading can take a few seconds depending on model size.
5. Navigate back to the chat screen and start a conversation. The model runs entirely on your device — no internet connection is needed.

Built-in model catalogue

Maid ships with a curated selection of GGUF models sourced from Hugging Face. These cover a range of sizes from 1B to 4B parameters, suitable for most Android hardware. The catalogue includes:

Model                   | Parameters | Best for
LFM 2.5 1.2B Thinking   | 1.2B       | Reasoning tasks, fast devices
Qwen3 4B                | 4.0B       | General purpose, higher quality
Phi 3 Mini 4K Instruct  | 3.8B       | Instruction following
TinyLlama 1.1B Chat     | 1.1B       | Very low RAM, quick responses
Gemma 2 2B IT           | 2.0B       | Balanced speed and quality
Gemma 3 1B IT           | 1.0B       | Low-end devices
Gemmasutra Mini 2B v1   | 2.0B       | Creative tasks
Gemmasutra Small 4B v1a | 4.0B       | Creative tasks, higher quality
Qwen2.5 1.5B Instruct   | 1.5B       | Fast general assistant
Llama 3.2 1B Instruct   | 1.0B       | Lightweight Meta model
Llama 3.2 3B Instruct   | 3.0B       | Meta model, good quality
Tesslate Tessa T1 3B    | 3.0B       | Instruction following
If a model you want is not listed, you can load any GGUF file manually — see the section below.

Choosing a quantization

Each model is available in multiple quantizations. Quantization reduces the precision of the model weights to shrink the file size and memory footprint, at the cost of some output quality. Choosing the right quantization is mostly a trade-off between your device's available RAM and how much quality degradation you can accept.

Quantization  | Size      | Quality       | Recommendation
Q2_K / Q3_K   | Smallest  | Lowest        | Only if RAM is very limited
Q4_K_M        | Medium    | Good          | Best default for most devices
Q5_K_M / Q6_K | Larger    | Better        | Flagship devices with 8 GB+ RAM
Q8_0          | Large     | Near-lossless | 12 GB+ RAM only
F16 / BF16    | Full size | Original      | Not recommended on mobile

Q4_K_M is the recommended starting point for most Android devices. It offers a good balance of response quality and file size, and will load comfortably on devices with 6 GB of RAM when using a 1–3B parameter model.
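To sanity-check whether a quantization will fit on your device, you can estimate the file size from the parameter count and the approximate bits per weight of each scheme. A rough sketch (the bits-per-weight figures are approximations of typical llama.cpp quantization ratios, not exact values):

```python
# Approximate bits per weight for common llama.cpp quantization
# schemes (rounded; exact ratios vary slightly per model).
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def estimated_file_gb(params_billion: float, quant: str) -> float:
    """Estimated GGUF file size in GB. Actual RAM use is somewhat
    higher once the KV cache and runtime buffers are added."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

# A 3B model at Q4_K_M comes out around 1.8 GB:
print(round(estimated_file_gb(3.0, "Q4_K_M"), 1))  # ~1.8
```

As a rule of thumb, leave a couple of gigabytes of headroom above the estimated file size for the operating system and the model's runtime buffers.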

Loading a custom GGUF file

You can load any GGUF model file you have obtained externally — for example from Hugging Face, a self-converted model, or a fine-tuned model shared by the community.

1. Transfer the .gguf file to your Android device via USB, a file manager app, or a cloud storage service.
2. In Maid, go to Settings and confirm Llama is selected as the API.
3. Tap Add Model File and use the file picker to navigate to the .gguf file.
4. Once added, select the file from the Model dropdown and tap Load Model.

Performance tips

Local inference is CPU-bound on most Android devices. Here are a few things that help get the best performance:

  • Close background apps before loading a model to free up RAM.
  • Start with Q4_K_M before trying larger quantizations.
  • 1B and 2B models will run noticeably faster than 3B–4B models on mid-range hardware.
  • Keep the screen on while the model is generating — Android may throttle the CPU when the display is off.
  • Plug in to charge if running longer sessions; sustained CPU load drains the battery quickly.