Use Maise with Maid
Combine Maise's on-device TTS and ASR with Maid's local AI inference for a fully offline AI assistant. Local model, local voice, local transcription — no internet required at any step.
What this gives you
Maid runs AI language models directly on your Android device using llama.cpp. Maise handles voice — converting the AI's text responses into speech, and optionally transcribing your voice back to text. Together, they form a complete AI assistant pipeline where nothing leaves your device:
- You speak — Maise transcribes your voice using distil-Whisper (on-device).
- Maid thinks — The local LLM generates a response using llama.cpp (on-device).
- Maise speaks — The response is read aloud using a Kokoro voice (on-device).
No API keys, no subscriptions, no cloud. The entire loop runs on your phone.
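For developers curious how this pipeline fits together, the loop above can be sketched with Android's standard speech APIs. This is an illustrative sketch, not Maid's actual code: `LocalLlm` is a hypothetical stand-in for Maid's internal llama.cpp inference, and both `SpeechRecognizer` and `TextToSpeech` simply bind to whatever the system defaults are — which is Maise, if it's configured as in the steps below.

```kotlin
import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer
import android.speech.tts.TextToSpeech

// Hypothetical stand-in for Maid's llama.cpp inference; Maid runs the
// GGUF model internally — this interface exists only to show the loop.
interface LocalLlm {
    fun generate(prompt: String): String
}

class OfflineAssistant(context: Context, private val llm: LocalLlm) {

    // No engine package requested, so Android uses the system default
    // TTS engine — Maise, if selected in the TTS settings.
    private val tts = TextToSpeech(context) { /* onInit status ignored in sketch */ }

    // Bound to the default recognition service — Maise's Whisper-based
    // recognizer, if configured as the system speech recognizer.
    private val recognizer = SpeechRecognizer.createSpeechRecognizer(context)

    fun listenAndReply() {
        recognizer.setRecognitionListener(object : RecognitionListener {
            override fun onResults(results: Bundle) {
                val heard = results
                    .getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                    ?.firstOrNull() ?: return
                val reply = llm.generate(heard)  // Maid thinks (on-device)
                // Maise speaks (on-device)
                tts.speak(reply, TextToSpeech.QUEUE_FLUSH, null, "reply")
            }
            // Remaining callbacks left empty for brevity.
            override fun onReadyForSpeech(params: Bundle?) {}
            override fun onBeginningOfSpeech() {}
            override fun onRmsChanged(rmsdB: Float) {}
            override fun onBufferReceived(buffer: ByteArray?) {}
            override fun onEndOfSpeech() {}
            override fun onError(error: Int) {}
            override fun onPartialResults(partialResults: Bundle?) {}
            override fun onEvent(eventType: Int, params: Bundle?) {}
        })
        recognizer.startListening(Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH))
    }
}
```

Because both APIs resolve to system defaults at runtime, no special integration between the two apps is needed — setting the defaults once wires the whole loop.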
Requirements
- Maise installed and configured as the system TTS engine (see TTS Setup).
- Maid installed with a local GGUF model loaded via the Llama provider (see llama.cpp guide).
- Both apps installed on the same device.
Step 1 — Set up Maise as the system TTS engine
If you haven't already, follow the TTS Setup guide to make Maise the preferred engine in Android's TTS settings. Select a voice you want to hear for AI responses — a clear, natural-sounding voice like en-US-heart-kokoro or en-US-nova-kokoro works well for conversational text.
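If you want to verify the switch took effect programmatically, Android's `TextToSpeech` API can report which engine is the current system default. A minimal sketch (the `checkTtsEngine` helper name is ours; run it from any Activity or Service context):

```kotlin
import android.content.Context
import android.speech.tts.TextToSpeech
import android.util.Log

// Logs the package name of the system default TTS engine and speaks a
// short test phrase through it. If Maise is set as the preferred engine,
// its package name appears here and the test phrase uses your chosen voice.
fun checkTtsEngine(context: Context) {
    lateinit var tts: TextToSpeech
    tts = TextToSpeech(context) { status ->
        if (status == TextToSpeech.SUCCESS) {
            // defaultEngine is the engine Android uses when an app
            // does not request a specific one.
            Log.d("TTS", "Default engine: ${tts.defaultEngine}")
            tts.speak("Maise is ready.", TextToSpeech.QUEUE_FLUSH, null, "check")
        }
    }
}
```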
Step 2 — Enable voice in Maid
In Maid's settings, turn on spoken responses (the exact label varies by version — look for a text-to-speech or "read aloud" option). Maid uses the Android system TTS engine for speech output, so with Maise set as the default in Step 1, responses are read aloud in the Kokoro voice you selected.
Step 3 — Use voice input in Maid (optional)
Maid has a built-in microphone button in the chat input bar. Tapping it starts voice dictation using whatever speech recognition service is currently set as the Android default. If you've configured Maise as the system speech recognizer (see ASR Setup), Maid will use Maise's Whisper-based transcription automatically.
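Under the hood, a mic button like Maid's typically fires Android's generic recognition intent and lets the system route it to the default recognizer. A rough sketch of that pattern (Maid's actual implementation may differ; `REQUEST_SPEECH` and `startDictation` are illustrative names):

```kotlin
import android.app.Activity
import android.content.Intent
import android.speech.RecognizerIntent

// Arbitrary request code for matching the result in onActivityResult.
const val REQUEST_SPEECH = 1001

// Launches the system default speech recognizer — Maise's Whisper-based
// transcription, if Maise is configured as the default recognition service.
fun startDictation(activity: Activity) {
    val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(
            RecognizerIntent.EXTRA_LANGUAGE_MODEL,
            RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
        )
        putExtra(RecognizerIntent.EXTRA_PROMPT, "Speak your prompt")
    }
    activity.startActivityForResult(intent, REQUEST_SPEECH)
}

// In onActivityResult, the transcript arrives as:
// data?.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS)?.firstOrNull()
```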
The full offline loop
With both apps configured, here is what the full interaction looks like — entirely on-device, with zero network traffic:
1. You tap the microphone button in Maid and speak your prompt.
2. Maise's distil-Whisper model transcribes the audio to text on-device.
3. Maid runs the prompt through the local GGUF model via llama.cpp.
4. Maid hands the response to Maise, the system TTS engine, which reads it aloud with your selected Kokoro voice.
Choosing a model for voice interaction
For voice-based conversations, response latency matters more than it might for text-only use. Shorter responses feel more natural when spoken aloud, and faster models mean less waiting between speaking and hearing the reply.
- Best balance — Gemma 2 2B or Qwen2.5 1.5B at Q4_K_M. Fast enough for natural back-and-forth.
- Lower-end devices — TinyLlama 1.1B or Gemma 3 1B. Very fast, shorter responses.
- Flagship phones — Qwen3 4B or Llama 3.2 3B for better quality while still being usable in conversation.
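To gauge whether a model fits your phone's memory, a rough rule of thumb is that Q4_K_M averages about 4.8 bits per weight (an approximation — actual GGUF files vary, and the KV cache and runtime add overhead on top). A quick back-of-the-envelope calculator:

```kotlin
// Approximate GGUF file size for a Q4_K_M quantized model.
// bitsPerWeight = 4.8 is a rough average for Q4_K_M, not an exact figure.
fun approxGgufGiB(billionParams: Double, bitsPerWeight: Double = 4.8): Double =
    billionParams * 1e9 * bitsPerWeight / 8 / (1024.0 * 1024 * 1024)

fun main() {
    // Sizes for the parameter counts mentioned above (1.1B, 2B, 4B).
    for (b in listOf(1.1, 2.0, 4.0)) {
        println("${b}B params at Q4_K_M ~ %.1f GiB".format(approxGgufGiB(b)))
    }
}
```

As a sanity check, a 2B-parameter model lands near 1.1 GiB before runtime overhead — comfortable on most mid-range phones, while 4B-class models are better suited to devices with 8 GB of RAM or more.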