
Building Hands-Off Mode: Voice-Controlled Window Management That Actually Works

Arach Tchoupani · March 18, 2026

I’ve been building Lattices, a macOS workspace manager, for a while now. It tiles windows, manages tmux sessions, does OCR search across your desktop.

But I kept wanting to do things without reaching for a hotkey. Deep in code, thinking “I need Chrome next to this terminal” — I just wanted to say it.

So I built hands-off mode.

- ~1s first response time
- ~2,000 context tokens
- 97% test accuracy
- 5 AI providers

The idea

One hotkey. You speak. Things happen. No panel, no UI, no visual chrome.

Press Ctrl+Cmd+M, say “tile Chrome left and iTerm right”, press again. You hear a quick “Got it” followed by “Chrome left, iTerm right” and your windows slide into place. About 2 seconds total.

The harder case: “I have too many terminals, organize them” — the system looks at your desktop, sees 8 iTerm windows scattered across two monitors, and distributes them in a grid. Or “set up for a code review” puts the GitHub PR on the left and your terminal on the right.

Six iterations to get here

The early versions were slow. Even after moving past the initial prototypes, we were still at 6-7 seconds end-to-end. The user would speak, wait, and then their windows would silently rearrange. Not terrible, but not the instant-feeling interaction we wanted.

Several iterations later, it feels natural — about 1 second to first response, 2 seconds end-to-end.

Performance journey:

1. Naive approach: ~7s
2. Direct API (Vercel AI SDK): −3.5s → ~3.5s
3. Long-running worker: −1s → ~2.5s
4. Streaming TTS: −0.5s → ~1s to first audio
5. Pre-cached phrases: −2s on ack → <50ms ack
6. Speak first, then act: UX win → ~2s end-to-end

Total improvement: ~7s → ~2s, with first response at ~1s.

The biggest single win was switching to direct API calls instead of shelling out to a CLI. After that, it was death by a thousand optimizations: persistent processes, streaming audio, cached phrases, and a narrate-before-act UX pattern that makes the whole thing feel intentional rather than magic.

How a turn works

Every voice command follows the same pipeline. The key insight: the user should hear feedback before anything moves.

Anatomy of a voice turn:

- Hotkey press: 0ms
- Voice capture + STT (vox): ~400–500ms
- Ack sound (cached, plays in parallel): <50ms
- LLM inference: 500–1200ms
- TTS narration: starts at ~1s
- Execute: <1ms
- Done: <50ms

The phases overlap: vox ~500ms, LLM 500–1200ms, TTS stream + act ~1.5s. Audible feedback arrives at ~500ms; windows move at ~3s in the slowest case.

The cached ack sound is important. It plays in under 50ms — before inference even starts — so the user knows the system heard them. Then the AI narrates what it’s about to do, and only then do windows move. The user is never surprised.
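Stripped down to its ordering, a turn can be sketched as a small orchestration function. The names here are illustrative, not the actual worker API; the point is the sequence: cached ack first, narration before any window moves.

```typescript
// Hypothetical turn orchestration; the real logic lives in the bun worker.
type Turn = {
  playCachedAck: () => void;                 // <50ms, pre-rendered audio
  infer: (transcript: string) => Promise<{ spoken: string; actions: unknown[] }>;
  speak: (text: string) => Promise<void>;    // streaming TTS
  execute: (actions: unknown[]) => void;     // Swift side applies frames
};

async function handleTurn(transcript: string, t: Turn): Promise<void> {
  t.playCachedAck();                         // instant feedback, overlaps inference
  const { spoken, actions } = await t.infer(transcript);
  await t.speak(spoken);                     // narrate first...
  t.execute(actions);                        // ...then move windows
}
```

The ack is fire-and-forget, so it costs nothing on the critical path; narration gates execution, which is what makes the interaction feel intentional.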

The architecture

The system spans three layers: a native Swift app that owns the desktop, a persistent bun worker for inference and TTS, and cloud APIs for the heavy lifting.

System architecture:

- Vox (push-to-talk): voice capture + STT
- Swift app (menu bar + AX): desktop control
- Bun worker (stdin/stdout): inference + TTS
- System prompt: hot-reloaded .md files
- TTS cache: ~/.lattices/
- Groq (Llama 3.3 70B): ~600ms
- xAI (Grok): ~1.2s
- OpenAI (TTS-1): streaming PCM
- ffplay: PCM audio playback

Data flows over a WebSocket, JSON lines between the Swift app and the worker, a streaming TTS response, a PCM pipe into ffplay, and pre-cached audio from disk.

Swift and the worker communicate over stdin/stdout JSON lines. The worker stays alive between turns — no cold starts. System prompts are loaded from .md files on disk and hot-reload on save, so prompt iteration doesn’t require rebuilding.
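A minimal sketch of that protocol loop, assuming a message shape like `{ type, transcript, snapshot }` (the real wire format may differ):

```typescript
// One JSON object per line over stdin; the process stays alive between
// turns, so there are no cold starts. Message shapes are assumptions.
import * as readline from "node:readline";

type WorkerMsg =
  | { type: "turn"; transcript: string; snapshot: unknown }
  | { type: "reload-prompt" };

function parseLine(line: string): WorkerMsg | null {
  try {
    return JSON.parse(line) as WorkerMsg;
  } catch {
    return null; // ignore malformed lines rather than crashing the worker
  }
}

function listen(onMsg: (m: WorkerMsg) => void): void {
  const rl = readline.createInterface({ input: process.stdin });
  rl.on("line", (line) => {
    const msg = parseLine(line);
    if (msg) onMsg(msg);
  });
}
```

Swallowing malformed lines instead of throwing matters here: a single bad write from the Swift side shouldn't kill a worker whose whole value is staying warm.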

The inference layer wraps the Vercel AI SDK and supports five providers. Swapping models is one line:

const { text } = await infer("tile chrome left", {
  provider: "groq",
  model: "llama-3.3-70b-versatile",
  system: systemPrompt,
  tag: "hands-off",
});

For TTS, we stream OpenAI’s response directly into ffplay via stdin pipe — PCM format, no decoding overhead. Playback begins on the first audio chunk:

import { spawn } from "node:child_process";

// `res` is the fetch() Response from the TTS endpoint
const player = spawn("ffplay", [
  "-nodisp", "-autoexit", "-loglevel", "quiet",
  "-f", "s16le", "-ar", "24000", "-ch_layout", "mono", "-"
]);

const reader = res.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  player.stdin.write(value);
}
player.stdin.end(); // close the pipe so -autoexit can fire

The context problem

A voice assistant is only as good as the information it has. Our first snapshot sent window titles and positions. That’s it. Two iTerm windows both named “Claude Code” looked identical to the AI.

The fix: send everything.

What the AI sees: desktop-snapshot.json (v2), ~2,000 tokens

{
  "screens": [
    { "name": "Built-in", "res": "1728×1117", "primary": true },
    { "name": "LG UltraFine", "res": "2560×1440" }
  ],
  "windows": [
    { "app": "iTerm2", "wid": 423, "z": 0, "screen": 1,
      "cwd": "~/dev/lattices", "hasClaude": true, "tmux": "lattices" },
    { "app": "iTerm2", "z": 1, "screen": 1,
      "cwd": "~/dev/vox", "hasClaude": true, "tmux": "vox" },
    { "app": "Google Chrome", "z": 2, "screen": 2 },
    { "app": "Finder", "z": 3, "screen": 2 }
  ],
  "activeLayer": "L1"
}

“Focus on the lattices Claude Code” → correctly picks wid 423 (cwd: ~/dev/lattices)

A typical snapshot is about 2,000 tokens — nothing for a modern LLM. But the intelligence difference is dramatic. Every window gets its frame, Z-order, and on-screen status. Every terminal tab gets its working directory, running processes, tmux session name, and whether Claude Code is active. All screens with their resolutions. The active layer.

“Focus on the lattices Claude Code” goes from “I see several Claude Code windows” to correctly identifying the one in ~/dev/lattices and focusing it.
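The snapshot above suggests a shape like the following. Field names are inferred from the example, and `findClaudeWindow` is a hypothetical helper showing how enrichment turns disambiguation into a simple filter:

```typescript
// Sketch of the v2 snapshot types; not the canonical schema.
interface ScreenInfo {
  name: string;
  res: string;
  primary?: boolean;
}

interface WindowInfo {
  app: string;
  z: number;        // z-order, 0 = frontmost
  screen: number;
  wid?: number;     // window id used when targeting actions
  cwd?: string;     // terminal enrichment
  hasClaude?: boolean;
  tmux?: string;
}

interface DesktopSnapshot {
  screens: ScreenInfo[];
  windows: WindowInfo[];
  activeLayer: string;
}

// "the lattices Claude Code" reduces to: has Claude active, cwd matches.
function findClaudeWindow(snap: DesktopSnapshot, project: string): WindowInfo | undefined {
  return snap.windows.find((w) => w.hasClaude && w.cwd?.endsWith(project));
}
```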

Testing at scale

We built an automated test suite: 100 scenarios, 8 categories, 202 individual assertions. Each scenario feeds a transcript and desktop snapshot through the inference pipeline and checks the response against specific conditions.
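The scenario shape might look roughly like this (assumed names; the real suite lives in test/handsoff-tests.ts):

```typescript
// A check is a predicate over the parsed inference response.
type InferResult = { spoken: string; actions: { type: string }[] };
type Check = (res: InferResult) => boolean;

interface Scenario {
  transcript: string;
  snapshot: unknown;   // desktop snapshot fed alongside the transcript
  checks: Check[];
}

// Run one scenario's checks against a response and tally results.
function runScenario(
  s: Scenario,
  res: InferResult
): { passed: number; failed: number } {
  let passed = 0;
  let failed = 0;
  for (const check of s.checks) {
    check(res) ? passed++ : failed++;
  }
  return { passed, failed };
}
```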

Test results: 97% overall (196/202 checks)

- Awareness: 100% (30/30)
- Tiling: 100% (36/36)
- Layouts: 100% (30/30)
- Focus: 96% (23/24)
- Context: 94% (17/18)
- Intelligence: 100% (30/30)
- Error handling: 67% (8/12)
- Speech quality: 100% (22/22)

The weakest area, error handling at 67%, is a known issue. A typical failure: the model correctly says Photoshop isn’t running, but still sends a focus action for it. Function calling with typed parameters would fix this: the model calls tile_window() with a validated window reference, and we reject impossible actions before execution.
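A pre-execution validator along those lines could be as simple as the following (action shapes are assumed for illustration):

```typescript
// Reject actions that target apps not present in the snapshot, so a
// "focus Photoshop" hallucination never reaches the Swift side.
interface Action {
  type: "focus" | "tile";
  app: string;
}

function validateActions(
  actions: Action[],
  runningApps: Set<string>
): { valid: Action[]; rejected: Action[] } {
  const valid: Action[] = [];
  const rejected: Action[] = [];
  for (const a of actions) {
    (runningApps.has(a.app) ? valid : rejected).push(a);
  }
  return { valid, rejected };
}
```

With typed tool calls, this check moves into the schema itself; until then, a filter like this catches the "says no, acts anyway" class of bug.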

What matters most

Two next steps are tempting. The first is the dual-model pattern: fire a fast model (Groq, ~500ms) for a spoken ack while a smarter model (Grok, ~1.2s) generates the actual actions. By the time the ack finishes playing, the smart response is ready.
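In code, the pattern is just two overlapping promises (function names are placeholders, not the worker's API):

```typescript
// Start the smart request immediately; speaking the fast model's ack
// overlaps with the smart model's inference, hiding its latency.
async function dualModelTurn(
  fastAck: () => Promise<string>,          // e.g. Groq, ~500ms
  smartActions: () => Promise<unknown[]>,  // e.g. Grok, ~1.2s
  speak: (text: string) => Promise<void>
): Promise<unknown[]> {
  const actionsPromise = smartActions();   // kick off first, don't await yet
  await speak(await fastAck());            // speaking overlaps inference
  return actionsPromise;                   // usually resolved by now
}
```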

Stage Manager awareness is the other. We figured out how to detect stages, classify windows, and programmatically create new stages by simulating drag events; no other window manager does this. “Put Chrome and iTerm in a new stage” would just work.

But honestly, the thing that matters most is the prompt. Every improvement to the system prompt has a bigger impact than any architectural change. When we added the rule “if spoken says you’ll do something, actions must include it,” empty-actions bugs nearly vanished. When we added concrete JSON examples for each scenario type, accuracy jumped across the board.

The code is fast enough. The architecture is right. Now it’s about making the AI genuinely helpful.


Lattices is open source. The hands-off mode code lives in bin/handsoff-worker.ts (the worker), lib/infer.ts (the inference wrapper), and docs/prompts/hands-off-system.md (the system prompt). The test suite is at test/handsoff-tests.ts.