How to run MLX Whisper locally on macOS without killing your battery
A deep dive into why we chose Apple's MLX framework for Saydrop, and how we keep Whisper resident in unified memory for sub-second push-to-talk dictation.
When building a privacy-first dictation tool for macOS, you face a core architectural choice: do you offload transcription to a cloud API (like OpenAI’s Whisper API), or do you run it entirely on-device?
We chose strictly on-device for Saydrop. But running a heavyweight transcription model smoothly on a user’s Mac — while they are actively compiling code or on a Zoom call — presents real memory and latency challenges. This post documents the decisions we made and what we learned from running mlx-whisper in production.
Why MLX over whisper.cpp?
While whisper.cpp is a fantastic community project, Apple’s MLX framework was built from the ground up for Apple Silicon’s unified memory architecture. It lets us:
- Load the model directly into unified memory without duplicating it across CPU and GPU RAM.
- Run half-precision inference (fp16) natively. The GPU cores and CPU share the same physical memory, so there is no copy-on-dispatch overhead.
- Stay in pure Python. Saydrop is built on rumps (a thin pyobjc wrapper over AppKit), so there is no Swift or native bridge to maintain.
The practical result: whisper.cpp peaks at 100% CPU during transcription. MLX offloads the matrix multiplications to the GPU cores in the same chip, keeping the CPU mostly free.
Getting started: installation
Saydrop bundles its own Python environment and downloads the model on first launch — you do not need to install anything manually. But if you want to experiment with mlx-whisper directly in your own project:
# Requires Python 3.10+ on Apple Silicon (no Intel support)
pip install mlx-whisper
# Download the model (cached to ~/.cache/huggingface/hub by default)
python -c "import numpy as np, mlx_whisper; mlx_whisper.transcribe(np.zeros(16000, dtype=np.float32), path_or_hf_repo='mlx-community/whisper-medium')"
The model download is roughly 1 GB. On memory-constrained Macs (8 GB unified RAM), choose a smaller model via SAYDROP_MODEL — the catalog includes whisper-small as a lower-footprint option.
In Saydrop, the model is controlled by SAYDROP_MODEL in your .env — the default is whisper-medium, benchmarked to deliver the best balance of latency and accuracy on Apple Silicon. whisper-large-v3-turbo is also available in the settings picker if you prefer it.
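To make the memory guidance concrete, here is a minimal sketch of a helper that suggests a model name from the amount of unified memory. The function names, the hard 8 GiB cutoff, and the use of `os.sysconf` to read physical memory are assumptions made for illustration; this is not Saydrop's own logic, and the exact value format that SAYDROP_MODEL expects may differ.

```python
import os

GIB = 1024 ** 3


def pick_model(total_bytes: int) -> str:
    """Suggest a model name based on total unified memory.

    Hypothetical rule: 8 GiB or less gets whisper-small, anything larger
    keeps the default whisper-medium.
    """
    return "whisper-small" if total_bytes <= 8 * GIB else "whisper-medium"


def physical_memory_bytes() -> int:
    """Best-effort physical memory size via POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

Calling `pick_model(physical_memory_bytes())` would then give a value to put in your .env by hand.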
Keeping the model hot in memory
The biggest latency win is not quantization — it is eliminating the load time entirely.
On first launch, Saydrop’s onboarding flow downloads the model from Hugging Face and then runs a warmup pass: it feeds one second of silence through the model before the user ever presses the hotkey. That forces all the model weights through MLX’s lazy evaluation and into the GPU’s active working set, so the first real dictation is just as fast as the hundredth.
def warmup(self) -> None:
    if self._warmed:
        return
    silence = np.zeros(CONFIG.sample_rate, dtype=np.float32)  # 1 s at 16 kHz
    self._run(silence)
    self._warmed = True
From that point, the model stays resident in the app process for as long as Saydrop is running. There is no separate inference server, no IPC, no restart overhead — the next hotkey press goes straight to inference.
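For readers who want to see the pattern end to end, here is a minimal sketch of how a warmup method like this might sit inside a long-lived transcriber object. The class name `ResidentTranscriber`, the injected `backend` callable, and the `SAMPLE_RATE` constant are assumptions that let the example run without MLX. In a real app the backend would wrap `mlx_whisper.transcribe`.

```python
from typing import Callable

import numpy as np

SAMPLE_RATE = 16_000  # Hz; stands in for CONFIG.sample_rate


class ResidentTranscriber:
    """Holds one inference backend for the lifetime of the process."""

    def __init__(self, backend: Callable[[np.ndarray], dict]) -> None:
        self._backend = backend  # e.g. a wrapper around mlx_whisper.transcribe
        self._warmed = False

    def _run(self, audio: np.ndarray) -> str:
        result = self._backend(audio)
        return (result.get("text") or "").strip()

    def warmup(self) -> None:
        """Push one second of silence through the backend, once per process."""
        if self._warmed:
            return
        self._run(np.zeros(SAMPLE_RATE, dtype=np.float32))
        self._warmed = True

    def transcribe(self, audio: np.ndarray) -> str:
        self.warmup()  # no-op after the first call
        return self._run(audio)


# Stand-in backend that records the size of every clip it receives.
calls = []


def fake_backend(audio: np.ndarray) -> dict:
    calls.append(audio.size)
    return {"text": f" {audio.size} samples "}


engine = ResidentTranscriber(fake_backend)
engine.warmup()
engine.warmup()  # second call does nothing
text = engine.transcribe(np.ones(8_000, dtype=np.float32))
```

The backend is called exactly twice here: once for the silent warmup clip and once for the real clip.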
The actual transcription call
Once recording stops, the pipeline passes the raw float32 audio array directly to mlx_whisper.transcribe. Audio is captured at 16 kHz by sounddevice — Whisper’s native sample rate — so no resampling is needed.
import mlx_whisper

result = mlx_whisper.transcribe(
    audio,                              # np.ndarray, float32, 16 kHz
    path_or_hf_repo="mlx-community/whisper-medium",
    language=forced_language,           # None → auto-detect per clip
    initial_prompt=vocabulary_prompt,   # personal dictionary hint
    fp16=True,
    verbose=False,
)
text = (result.get("text") or "").strip()
Three decisions worth explaining:
- fp16=True: half-precision inference is noticeably faster on Apple Silicon. For dictation use (short, conversational clips) we have not detected any quality difference vs. fp32.
- initial_prompt: Whisper accepts a short text string that biases its decoder toward specific vocabulary. Saydrop builds this from the user’s personal dictionary (proper nouns, technical terms, product names) so those words are recognised consistently without fine-tuning. The dictionary is re-read on every dictation, so edits take effect immediately.
- Language detection retry: when the user configures multiple languages, Whisper auto-detects per clip. If the detected language falls outside the allowed set (e.g. it detects Dutch when you have configured de,en), Saydrop retries that clip with the primary language forced. This keeps dictation within your chosen languages even when background noise confuses the detector.
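The dictionary prompt and the retry rule described above can be sketched in a few lines. Everything here is illustrative: the function names, the 200-character cap, and the `run(language)` callable are assumptions, and the sketch assumes the transcription result carries a "language" key alongside "text".

```python
from typing import Callable, Optional, Sequence


def build_vocabulary_prompt(terms: Sequence[str], max_chars: int = 200) -> Optional[str]:
    """Join dictionary terms into a short prompt, dropping blanks and duplicates."""
    seen = set()
    kept = []
    length = 0
    for term in terms:
        term = term.strip()
        if not term or term.lower() in seen:
            continue
        extra = len(term) + (2 if kept else 0)  # account for the ", " separator
        if length + extra > max_chars:
            break
        seen.add(term.lower())
        kept.append(term)
        length += extra
    return ", ".join(kept) or None


def transcribe_within_languages(
    run: Callable[[Optional[str]], dict],
    allowed: Sequence[str],
) -> dict:
    """Auto-detect when several languages are allowed; retry if detection strays.

    `run(language)` performs one transcription with that language forced
    (None means auto-detect) and returns a dict with "text" and "language".
    """
    if len(allowed) == 1:
        return run(allowed[0])  # a single configured language is always forced
    result = run(None)
    if allowed and result.get("language") not in allowed:
        result = run(allowed[0])  # retry with the primary language forced
    return result


def fake_run(language: Optional[str]) -> dict:
    # Pretends the detector hears Dutch unless a language is forced.
    return {"text": "hallo", "language": language or "nl"}


result = transcribe_within_languages(fake_run, ["de", "en"])
prompt = build_vocabulary_prompt(["Saydrop", "MLX", " ", "mlx", "pyobjc"])
```

With the stand-in detector, the first pass reports Dutch, so the clip is retried with German forced. The retry costs one extra inference pass, and only on clips where detection lands outside the allowed set.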
Performance on Apple Silicon
Benchmarked on an M1 Pro (10-core, 16 GB unified RAM) with whisper-medium (the default), 3 timed runs per clip after a warmup pass, TTS-generated audio at 16 kHz:
| Metric | Value |
|---|---|
| Transcription latency (~10 s clip) | 0.34 s (p50) |
| Transcription latency (~30 s clip) | 1.53 s (p50) |
| Real-time factor | 0.04–0.05× |
| Resident memory (fp16) | ~1 GB |
| CPU usage during transcription | low — MLX offloads matrix ops to GPU cores |
After the warmup call, subsequent transcriptions are consistently sub-second for normal dictation lengths (5–20 s). The warmup cost is paid once per app launch — not per dictation.
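For anyone reproducing these numbers, the sketch below shows one way to compute a p50 latency and a real-time factor (processing time divided by audio duration). The helper names are our own for this example, and the arithmetic uses the nominal 10 s and 30 s clip lengths. Because the real clips are only approximately that long, the results only roughly match the range in the table.

```python
import statistics
import time
from typing import Callable


def real_time_factor(latency_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return latency_s / audio_s


def p50_latency(fn: Callable[[], object], runs: int = 3) -> float:
    """Median wall-clock time of `fn` over `runs` calls, after one untimed warmup."""
    fn()  # warmup pass, not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Latencies from the table, divided by the nominal clip lengths.
rtf_short = round(real_time_factor(0.34, 10.0), 3)  # 0.034
rtf_long = round(real_time_factor(1.53, 30.0), 3)   # 0.051
```

In practice `fn` would be a closure that transcribes one fixed clip, so the warmup pass and the timed runs all see the same input.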
On machines with 8 GB unified RAM, switch to whisper-small via SAYDROP_MODEL to recover headroom. On 16 GB+ machines the medium model sits comfortably resident for the full session.
Silence rejection
One non-obvious problem: Whisper hallucinates on near-silent audio. If you tap the hotkey accidentally and release immediately, the model produces filler text (“Thank you for watching”, “Thanks for your time”, “Please subscribe”) with high confidence.
Saydrop gates every clip with two cheap checks before the audio reaches the model:
Duration gate: clips shorter than 200 ms are rejected outright, before the RMS check:
if audio.size < config.sample_rate * 0.2:  # < 200 ms
    return PipelineResult(reason="too_short")
RMS gate: clips of 200 ms or longer are checked for signal level:
rms = float(np.sqrt(np.mean(np.square(audio, dtype=np.float64))))
if rms < config.silence_rms:  # default 0.01
    return PipelineResult(reason="silent")
Both checks are microsecond-level NumPy operations. The GPU is never touched for rejected clips. The default RMS threshold of 0.01 works well in most environments; lower it with SAYDROP_SILENCE_RMS=0.005 in a noisy room, or set it to 0 to disable the gate entirely.
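Put together, the two gates fit in one small function. This is a self-contained sketch, not Saydrop's pipeline code: `GateConfig` and `reject_reason` are hypothetical names, and the function returns a plain string instead of a `PipelineResult`.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass(frozen=True)
class GateConfig:  # hypothetical stand-in for the app's config object
    sample_rate: int = 16_000
    min_seconds: float = 0.2
    silence_rms: float = 0.01


def reject_reason(audio: np.ndarray, config: GateConfig = GateConfig()) -> Optional[str]:
    """Return "too_short" or "silent" for clips to drop, or None to transcribe."""
    if audio.size < config.sample_rate * config.min_seconds:
        return "too_short"
    rms = float(np.sqrt(np.mean(np.square(audio, dtype=np.float64))))
    if rms < config.silence_rms:
        return "silent"
    return None


tap = np.zeros(1_600, dtype=np.float32)           # 100 ms accidental tap
quiet = np.full(16_000, 0.001, dtype=np.float32)  # 1 s of near silence
t = np.arange(16_000) / 16_000
tone = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)  # 1 s test tone
```

Here `tap` is rejected as too short, `quiet` as silent, and `tone` passes through. With `silence_rms=0.0` the RMS comparison can never be true, which is why a threshold of 0 disables that gate.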
Configuration reference
The most useful knobs for tuning performance:
| Variable | Default | Effect |
|---|---|---|
| SAYDROP_MODEL | mlx-community/whisper-medium | Switch to whisper-small on 8 GB Macs |
| SAYDROP_LANGUAGE | (blank, auto-detect) | Pin one language (de) or allow several (de,en) |
| SAYDROP_SILENCE_RMS | 0.01 | Lower in noisy rooms; 0 disables the gate |
| SAYDROP_CLEANUP | 1 | 0 disables the Gemma polish step, reduces latency ~0.5 s |
| SAYDROP_DEBUG | 0 | 1 logs full transcripts; off by default to protect sensitive speech |
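If you are wiring similar settings into your own tool, a sketch like the following reads these variables with the defaults from the table. The `Settings` dataclass and `load_settings` function are hypothetical names, and the sketch reads from a plain mapping such as `os.environ`; loading the .env file itself is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Settings:
    model: str
    languages: Tuple[str, ...]  # empty tuple means auto-detect
    silence_rms: float
    cleanup: bool
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the SAYDROP_* variables, falling back to the documented defaults."""
    raw_langs = env.get("SAYDROP_LANGUAGE", "")
    languages = tuple(c.strip().lower() for c in raw_langs.split(",") if c.strip())
    return Settings(
        model=env.get("SAYDROP_MODEL", "mlx-community/whisper-medium"),
        languages=languages,
        silence_rms=float(env.get("SAYDROP_SILENCE_RMS", "0.01")),
        cleanup=env.get("SAYDROP_CLEANUP", "1") == "1",
        debug=env.get("SAYDROP_DEBUG", "0") == "1",
    )


settings = load_settings({"SAYDROP_LANGUAGE": "de, en", "SAYDROP_CLEANUP": "0"})
```

Passing the environment in as a mapping keeps the parser easy to test without touching the real process environment.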
The result
By keeping the model resident and feeding audio directly as a NumPy array — no temp files, no subprocess calls, no network round-trip — we achieve consistent sub-second latency from hotkey release to text appearing in the focused app. Your voice never leaves your machine.
If you want to try it yourself, download Saydrop here. The first-launch onboarding handles model download and warmup automatically.