Building Hands-Off Mode: Voice-Controlled Window Management That Actually Works
I've been building Lattices for a while — it tiles windows, manages tmux sessions, does OCR search across the desktop. Standard workspace manager stuff.
But I kept wanting to do things without reaching for a hotkey. Deep in code, thinking "I need Chrome next to this terminal" — I just wanted to say it. So I built hands-off mode.
What it does
Press Ctrl+Cmd+M, say "tile Chrome left and iTerm right", press again. You hear a quick "Got it" followed by "Chrome left, iTerm right" and your windows slide into place. About 2 seconds total. No panel, no UI, no visual chrome.
The harder case: "I have too many terminals, organize them." The system looks at your desktop, sees 8 iTerm windows scattered across two monitors, and distributes them in a grid. Or "set up for a code review" puts the GitHub PR on the left and your terminal on the right.
It was slow for a long time
The early versions were brutal. 6-7 seconds end-to-end. You'd speak, wait, and then your windows would silently rearrange. It worked, but it felt like talking to someone through a wall.
The biggest single win was switching to direct API calls instead of shelling out to a CLI. After that it was a long tail of small wins: persistent processes, streaming audio, cached phrases, and a narrate-before-act pattern that makes the whole thing feel deliberate instead of magical.
We're at about 1 second to first response now, 2 seconds end-to-end.
The turn pipeline
Every voice command follows the same flow. The part I'm most happy with: the user hears feedback before anything moves.
A cached ack sound plays in under 50ms — before inference even starts — so you know the system heard you. Then the AI narrates what it's about to do, and only then do windows move. You're never surprised by something happening silently.
That narrate-before-act pattern was one of those things that sounds obvious in retrospect but took us a few iterations to land on. The silent version felt broken even when it was doing the right thing.
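Here is a minimal sketch of that ordering. The step names (`playAck`, `infer`, `speak`, `execute`) and the shape of the plan are placeholders for illustration, not the worker's real API, and the sketch assumes narration finishes before the first action runs.

```typescript
// Hypothetical step functions; the real worker's names and signatures may differ.
interface TurnSteps {
  playAck(): void; // cached sound, fire-and-forget
  infer(transcript: string): Promise<{ spoken: string; actions: string[] }>;
  speak(text: string): Promise<void>; // resolves once the narration has played
  execute(action: string): Promise<void>; // moves a window
}

async function runTurn(transcript: string, steps: TurnSteps): Promise<void> {
  steps.playAck(); // 1. immediate feedback, before inference starts
  const plan = await steps.infer(transcript); // 2. model decides what to say and do
  await steps.speak(plan.spoken); // 3. narrate first
  for (const action of plan.actions) {
    await steps.execute(action); // 4. only then do windows move
  }
}
```

Because the steps are injected, the ordering can be checked with fakes that record when each one was called.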
Three layers
A Swift app owns the desktop. A persistent Bun worker handles inference and TTS. Cloud APIs do the heavy lifting.
The interactive diagram is available in the live blog post.
Swift and the worker communicate over stdin/stdout JSON lines. The worker stays alive between turns — no cold starts. System prompts live in .md files on disk and hot-reload on save, so I can iterate on prompts without rebuilding anything.
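The JSON lines framing is simple enough to sketch. The message fields below (`id`, `type`, `payload`) are assumptions for illustration, since the real protocol isn't shown here. The one detail that matters is buffering: a chunk read from stdin can end in the middle of a line.

```typescript
// Hypothetical message shape; the real protocol's fields may differ.
type WorkerMessage = { id: number; type: string; payload?: unknown };

// Accumulates raw chunks and returns one parsed message per complete line.
class JsonLineDecoder {
  private buffer = "";

  push(chunk: string): WorkerMessage[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line for next time
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as WorkerMessage);
  }
}

// Serialize a reply as exactly one line so the reader can split on "\n".
function encode(message: WorkerMessage): string {
  return JSON.stringify(message) + "\n";
}
```

In a worker this would typically be fed from `process.stdin` (with the encoding set to UTF-8) and replies written with `process.stdout.write(encode(reply))`.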
The inference layer wraps the Vercel AI SDK and supports five providers. Swapping models is one line:
const { text } = await infer("tile chrome left", {
  provider: "groq",
  model: "llama-3.3-70b-versatile",
  system: systemPrompt,
  tag: "hands-off",
});
For TTS, we stream OpenAI's response directly into ffplay via stdin pipe — PCM, no decoding overhead. Playback starts on the first audio chunk:
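import { spawn } from "node:child_process";
// `res` is the streaming fetch Response from the TTS request (raw PCM body).
const player = spawn("ffplay", [
  "-nodisp", "-autoexit", "-loglevel", "quiet",
  "-f", "s16le", "-ar", "24000", "-ch_layout", "mono", "-"
]);
const reader = res.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  player.stdin.write(value); // ffplay starts playing as soon as the first chunk arrives
}
player.stdin.end(); // signal EOF so ffplay can drain its buffer and exit
Context makes or breaks it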
Our first snapshot sent window titles and positions. That's it. Two iTerm windows both named "Claude Code" looked identical to the AI. Useless.
The fix was sending everything. Compared to that v1 baseline, every window now ships with its frame, Z-order, and on-screen status. Every terminal tab gets its working directory, running processes, tmux session name, and whether Claude Code is active. The snapshot also lists all screens with their resolutions and the active layer. That is what lets a request like "focus on the lattices Claude Code" resolve to a specific window id.
A typical snapshot is about 2,000 tokens — nothing for a modern LLM. But the intelligence difference is dramatic. "Focus on the lattices Claude Code" goes from "I see several Claude Code windows" to correctly identifying the one in ~/dev/lattices and focusing it.
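To make that concrete, here is one way such a snapshot could be typed. Every field name below is an illustrative assumption rather than the real Lattices schema. The example data shows two windows with identical titles that only the terminal context can tell apart, and the helper function just demonstrates that the information is sufficient; in the actual system the model does this matching itself.

```typescript
// All field names are illustrative assumptions, not the real Lattices schema.
interface Frame { x: number; y: number; width: number; height: number }

interface TerminalTab {
  cwd: string; // working directory of the tab's shell
  processes: string[]; // running process names
  tmuxSession?: string;
  claudeCodeActive: boolean;
}

interface WindowInfo {
  id: number;
  app: string;
  title: string;
  frame: Frame;
  zOrder: number; // assumed convention: 0 is frontmost
  onScreen: boolean;
  tabs?: TerminalTab[]; // only present for terminal windows
}

interface DesktopSnapshot {
  screens: { id: number; width: number; height: number }[];
  activeLayer: string;
  windows: WindowInfo[];
}

// Two windows with the same title; only the tab context differs.
const snapshot: DesktopSnapshot = {
  screens: [{ id: 1, width: 2560, height: 1440 }],
  activeLayer: "default",
  windows: [
    {
      id: 101, app: "iTerm2", title: "Claude Code",
      frame: { x: 0, y: 0, width: 1280, height: 1440 }, zOrder: 0, onScreen: true,
      tabs: [{ cwd: "~/dev/lattices", processes: ["claude"], tmuxSession: "lattices", claudeCodeActive: true }],
    },
    {
      id: 102, app: "iTerm2", title: "Claude Code",
      frame: { x: 1280, y: 0, width: 1280, height: 1440 }, zOrder: 1, onScreen: true,
      tabs: [{ cwd: "~/dev/website", processes: ["claude"], claudeCodeActive: true }],
    },
  ],
};

// Match a spoken project name against the last path segment of each tab's cwd.
function claudeCodeWindowsIn(s: DesktopSnapshot, project: string): number[] {
  return s.windows
    .filter((w) =>
      (w.tabs ?? []).some((t) => t.claudeCodeActive && t.cwd.split("/").pop() === project),
    )
    .map((w) => w.id);
}
```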
100 scenarios, 202 assertions
We built a test suite that feeds transcripts and desktop snapshots through the inference pipeline and checks responses against specific conditions. 100 scenarios across 8 categories.
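A scenario in a suite like this can be modeled as a transcript plus a list of labeled checks. The shapes below are hypothetical, since the real format in tests/handsoff-tests.ts isn't reproduced here, and the snapshot input is left out to keep the sketch short.

```typescript
// Hypothetical shapes; the real suite's scenario and response formats may differ.
interface Action { type: string; app?: string; position?: string }
interface ModelResponse { spoken: string; actions: Action[] }

interface Assertion {
  label: string;
  check: (response: ModelResponse) => boolean;
}

interface Scenario {
  category: string;
  transcript: string;
  assertions: Assertion[];
}

// Run every assertion and report how many passed and which ones failed.
function evaluate(scenario: Scenario, response: ModelResponse) {
  const failed = scenario.assertions
    .filter((a) => !a.check(response))
    .map((a) => a.label);
  return { passed: scenario.assertions.length - failed.length, failed };
}

const tileTwoApps: Scenario = {
  category: "tiling",
  transcript: "tile Chrome left and iTerm right",
  assertions: [
    {
      label: "tiles Chrome left",
      check: (r) => r.actions.some((a) => a.type === "tile" && a.app === "Chrome" && a.position === "left"),
    },
    {
      label: "tiles iTerm right",
      check: (r) => r.actions.some((a) => a.type === "tile" && a.app === "iTerm" && a.position === "right"),
    },
  ],
};
```

Keeping each check labeled means a failure report names the exact condition that was missed instead of just marking the scenario red.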
The weakest area is error handling at 67%. The model says Photoshop isn't running (correct) but still sends a focus action (wrong). Function calling with typed parameters would fix this — the model calls tile_window() with a validated window reference, and we reject impossible actions before execution. Haven't shipped that yet.
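A rough sketch of that validation step, written without any schema library. Only the name `tile_window` comes from the plan above; the argument names, the set of positions, and the result type are assumptions. A focus call for an app that isn't running would be rejected the same way.

```typescript
// Hypothetical argument shape for a tile_window call.
const POSITIONS = ["left", "right", "top", "bottom"] as const;
type Position = (typeof POSITIONS)[number];

interface TileWindowArgs { windowId: number; position: Position }

type Validation =
  | { ok: true; args: TileWindowArgs }
  | { ok: false; reason: string };

// Check untrusted model output against the windows that actually exist.
function validateTileWindow(raw: unknown, openWindowIds: ReadonlySet<number>): Validation {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, reason: "arguments must be an object" };
  }
  const { windowId, position } = raw as Record<string, unknown>;
  if (typeof windowId !== "number" || !openWindowIds.has(windowId)) {
    return { ok: false, reason: `unknown window: ${String(windowId)}` };
  }
  if (typeof position !== "string" || !(POSITIONS as readonly string[]).includes(position)) {
    return { ok: false, reason: `invalid position: ${String(position)}` };
  }
  return { ok: true, args: { windowId, position: position as Position } };
}
```

The set of open window ids would come from the same desktop snapshot the model saw, so an action that refers to a window outside it never reaches execution.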
What actually matters
There's a dual-model pattern I keep thinking about: fire a fast model (Groq, 500ms) for a spoken ack while a smarter model (Grok, 1.2s) generates the actual actions. By the time the ack finishes playing, the smart response is ready.
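The idea fits in a few lines. This is a sketch, not shipped code: `fastAck`, `smartPlan`, and `speak` are hypothetical wrappers around the fast model, the smart model, and TTS.

```typescript
interface Plan { spoken: string; actions: string[] }

// Hypothetical dependencies, injected so the pattern is independent of any provider.
interface DualModelDeps {
  fastAck(transcript: string): Promise<string>; // quick spoken acknowledgement
  smartPlan(transcript: string): Promise<Plan>; // slower, produces the real actions
  speak(text: string): Promise<void>;
}

async function dualModelTurn(transcript: string, deps: DualModelDeps): Promise<Plan> {
  // Both requests start immediately; the ack is spoken while the plan is still generating.
  const [plan] = await Promise.all([
    deps.smartPlan(transcript),
    deps.fastAck(transcript).then((ack) => deps.speak(ack)),
  ]);
  return plan;
}
```

Using `Promise.all` means the turn takes as long as the slower of the two branches rather than their sum, and a failure in either branch surfaces as a single rejection.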
And we figured out Stage Manager awareness: how to detect stages, classify windows, and programmatically create new stages by simulating drag events. As far as I know, no other window manager does this. "Put Chrome and iTerm in a new stage" would just work.
But honestly, the thing that matters most is the prompt. Every improvement to the system prompt has had more impact than any architectural change. When we added the rule "if spoken says you'll do something, actions must include it," empty-action bugs nearly vanished. When we added concrete JSON examples for each scenario type, accuracy jumped across the board.
The code is fast enough. The architecture is right. Now it's about making the AI genuinely helpful — and that's all prompt work.
Lattices is open source. The hands-off mode code lives in bin/handsoff-worker.ts (the worker), bin/infer.ts (the inference wrapper), and docs/prompts/hands-off-system.md (the system prompt). The test suite is at tests/handsoff-tests.ts.