How Maise Works
A technical walkthrough of the TTS and ASR pipelines that run entirely on your Android device — from raw text and microphone input to spoken audio and transcribed text.
The runtime: ONNX Runtime
Both TTS and ASR in Maise are powered by ONNX Runtime, the open-source inference engine from Microsoft. ONNX Runtime runs neural network models described in the ONNX format, which Kokoro (TTS) and distil-Whisper (ASR) are both exported to.
On Android, ONNX Runtime uses the device CPU with NNAPI acceleration where available. No GPU-specific code paths are used, which means the models run on any ARM64 Android device regardless of GPU vendor. The minimum requirement is Android 8.0 (API 26) on an ARM64-v8a device.
Text-to-Speech pipeline
Converting text to speech involves four sequential stages. The entire pipeline runs on-device with no network requests.
Text normalization
Raw input text is cleaned and standardized before any other processing. Numbers are expanded to words ("42" → "forty-two"), common abbreviations are resolved, and punctuation is handled so the phonemizer receives well-formed prose. This step is critical for natural-sounding output — unnormalized text produces artifacts like digits read out one by one or skipped symbols.
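A minimal sketch of the digit-expansion step, assuming a single regex pass over the text; Maise's actual normalizer covers far more cases (ordinals, currency, dates, abbreviations):

```python
import re

# Minimal number-to-words expansion for illustration; a real normalizer
# handles much larger numbers, ordinals, currency, dates, and more.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    raise ValueError("sketch handles 0-99 only")

def normalize(text: str) -> str:
    # Replace each run of digits with its spelled-out form.
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
```

After this pass, "I have 42 apples" reaches the phonemizer as "I have forty-two apples".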
Phonemization
The normalized text is converted into a sequence of IPA phonemes by Open Phonemizer. Phonemes are the fundamental units of sound in a language — converting text to phonemes first allows the neural synthesis model to focus on acoustics rather than spelling, improving pronunciation accuracy across languages.
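To illustrate the shape of this step, a toy word-to-phoneme lookup; the mini-lexicon below is hypothetical and simplified (no stress marks), whereas Open Phonemizer produces full IPA for arbitrary text:

```python
# Toy word-to-phoneme lookup for illustration only. The entries are
# simplified; a real phonemizer handles arbitrary text and stress.
LEXICON = {
    "forty": ["f", "ɔ", "ɹ", "t", "i"],
    "two":   ["t", "u"],
}

def to_phonemes(words):
    # Concatenate each word's phoneme sequence in order.
    phonemes = []
    for word in words:
        phonemes.extend(LEXICON[word])
    return phonemes
```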
Neural synthesis
The phoneme sequence and a per-voice style embedding are passed to Kokoro, a multilingual neural TTS model running under ONNX Runtime. Kokoro maps the phoneme and style inputs directly to a raw PCM audio waveform. Each voice is defined by its style embedding — the 68 bundled voices represent different accents, characters, and speaking styles that can be applied to the same underlying model weights.
Streaming playback
Synthesis and playback run in parallel using a producer-consumer architecture. Kokoro generates audio in chunks, and each chunk is handed to the audio player as soon as it is ready. This means audio playback begins well before the full input has been synthesized — noticeably reducing perceived latency on longer texts. Audio output is 24 kHz mono 16-bit PCM, sent directly to the device speaker or any connected audio output.
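The handoff can be sketched with a bounded queue; the chunk contents and sizes below are placeholders rather than real Kokoro output, and the sink stands in for the platform audio player:

```python
import queue
import threading

# Producer-consumer sketch: synthesis pushes PCM chunks into a bounded
# queue while playback drains it concurrently.
def synthesize_chunks(text, out):
    for sentence in text.split(". "):
        pcm = b"\x00\x00" * 2400   # stand-in for 0.1 s of 24 kHz mono PCM
        out.put(pcm)               # playback can start on the first chunk
    out.put(None)                  # sentinel: synthesis finished

def play(out, sink):
    while (chunk := out.get()) is not None:
        sink.append(chunk)         # stand-in for writing to the audio device

q = queue.Queue(maxsize=4)         # bounded: caps memory, adds backpressure
played = []
producer = threading.Thread(target=synthesize_chunks, args=("Hello. World", q))
producer.start()
play(q, played)
producer.join()
```

The bounded queue is the key design choice: if playback falls behind, the producer blocks instead of buffering the whole utterance in memory.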
Automatic Speech Recognition pipeline
Transcribing spoken audio to text involves three stages, all running locally.
Audio capture and endpointing
Audio is captured from the microphone at 16 kHz mono 16-bit PCM — the format expected by Whisper. Rather than requiring you to manually stop recording, Maise uses the WebRTC VAD algorithm to detect when speech ends. When sustained silence is detected after a period of speech, recording stops automatically. You can also tap Stop Recording to end early. The maximum recording duration is 30 seconds.
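WebRTC VAD classifies short frames (10/20/30 ms) as speech or silence; the endpointing logic layered on top looks roughly like this. A naive amplitude threshold stands in for the real classifier here, and the silence-run length is an assumed value, not Maise's actual setting:

```python
# Endpointing sketch: stop once speech has been heard and is followed by
# a sustained run of silent frames. Frames are lists of 16-bit samples.
SILENCE_FRAMES_TO_STOP = 25   # ~750 ms of 30 ms frames (assumed value)

def is_speech(frame, threshold=500):
    # Naive stand-in for WebRTC VAD's per-frame speech classifier.
    return max(abs(s) for s in frame) > threshold

def should_stop(frames):
    heard_speech = False
    silent_run = 0
    for frame in frames:
        if is_speech(frame):
            heard_speech = True
            silent_run = 0
        else:
            silent_run += 1
            # Only stop after speech was heard, then silence sustained.
            if heard_speech and silent_run >= SILENCE_FRAMES_TO_STOP:
                return True
    return False
```

Requiring speech *before* the silence run is what prevents the recorder from stopping immediately when the user pauses before speaking.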
Feature extraction
The captured audio is converted into an 80-band log-mel spectrogram, the input representation used by the Whisper family. This transform converts the raw waveform into a time-frequency representation that the encoder processes far more efficiently than raw audio samples.
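A sketch of the transform using Whisper's standard parameters (25 ms window, 10 ms hop, 80 mel bands at 16 kHz), with a hand-rolled triangular filterbank in place of a DSP library; details like padding and normalization are omitted:

```python
import numpy as np

# Log-mel sketch with Whisper-style parameters: n_fft=400 (25 ms),
# hop=160 (10 ms), 80 mel bands, 16 kHz input.
N_FFT, HOP, N_MELS, SR = 400, 160, 80, 16000

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SR):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(0, hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel(audio):
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=N_FFT)) ** 2   # (frames, 201)
    mel = mel_filterbank() @ power.T                    # (80, frames)
    return np.log10(np.maximum(mel, 1e-10))
```

One second of 16 kHz audio yields 98 frames, so the output is an 80 × 98 matrix — roughly a 160× reduction over the 16 000 raw samples per second.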
Transcription
The spectrogram is fed into distil-whisper/distil-small.en, a distilled version of OpenAI's Whisper model trained by Hugging Face. The distilled variant is roughly 6× faster than the Whisper large-v2 model it was distilled from while retaining most of its accuracy on English speech. It uses an encoder-decoder Transformer architecture with greedy decoding (highest-probability token at each step) to produce the transcribed text. This model is English-only.
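The greedy-decoding loop can be sketched as follows; the stub decoder and token IDs are invented for illustration and stand in for distil-Whisper's decoder forward pass over the encoded spectrogram:

```python
# Greedy decoding sketch: at each step take the single highest-probability
# token and feed it back until the end-of-sequence token appears.
BOS, EOS = 0, 3   # hypothetical special-token IDs

def stub_decoder(tokens):
    # Hypothetical stand-in for one decoder pass: emits 7, then 5, then EOS.
    script = {1: 7, 2: 5, 3: EOS}
    logits = [-1.0] * 10
    logits[script[len(tokens)]] = 1.0
    return logits

def greedy_decode(decoder, max_len=32):
    tokens = [BOS]
    while len(tokens) < max_len:
        logits = decoder(tokens)
        # argmax: the single highest-scoring token, no sampling or beams.
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        if next_tok == EOS:
            break
        tokens.append(next_tok)
    return tokens[1:]   # strip BOS before detokenizing to text
```

Greedy decoding trades a little accuracy for speed and determinism versus beam search — a sensible fit for on-device latency budgets.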
System integration
Maise integrates with Android at the framework level rather than sitting in front of it. This is what makes it system-wide:
| Component | Android API | Effect |
|---|---|---|
| TTS engine | TextToSpeechService | Any app calling the TTS API uses Maise automatically |
| Speech recognizer | RecognitionService | Any app using SpeechRecognizer API uses distil-Whisper |
| Voice keyboard | InputMethodService | Microphone dictation button in any text field |
Because Maise registers at the framework level, it requires no integration work from other apps. Any app that already uses the standard Android TTS or ASR APIs gains on-device processing automatically.