
Building Hands-Off Mode: Voice-Controlled Window Management That Actually Works

Arach Tchoupani · March 18, 2026

I’ve been building Lattices, a macOS workspace manager, for a while now. It tiles windows, manages tmux sessions, does OCR search across your desktop.

But I kept wanting to do things without reaching for a hotkey. Deep in code, thinking “I need Chrome next to this terminal” — I just wanted to say it.

So I built hands-off mode.

- ~1s first response time
- ~2,000 context tokens
- 97% test accuracy
- 5 AI providers

The idea

One hotkey. You speak. Things happen. No panel, no UI, no visual chrome.

Press Ctrl+Cmd+M, say “tile Chrome left and iTerm right”, press again. You hear a quick “Got it” followed by “Chrome left, iTerm right” and your windows slide into place. About 2 seconds total.

The harder case: “I have too many terminals, organize them” — the system looks at your desktop, sees 8 iTerm windows scattered across two monitors, and distributes them in a grid. Or “set up for a code review” puts the GitHub PR on the left and your terminal on the right.

Six iterations to get here

The early versions were slow. Even after moving past the initial prototypes, we were still at 6-7 seconds end-to-end. The user would speak, wait, and then their windows would silently rearrange. Not terrible, but not the instant-feeling interaction we wanted.

Several iterations later, it feels natural — about 1 second to first response, 2 seconds end-to-end.

Performance journey:

1. Naive approach: ~7s
2. Direct API (Vercel AI SDK): −3.5s → ~3.5s
3. Long-running worker: −1s → ~2.5s
4. Streaming TTS: −0.5s → ~1s to first audio
5. Pre-cached phrases: −2s on ack → <50ms ack
6. Speak first, then act: UX win → ~2s end-to-end

Total improvement: ~7s → ~2s, with first response at ~1s.

The biggest single win was switching to direct API calls instead of shelling out to a CLI. After that, it was death by a thousand optimizations: persistent processes, streaming audio, cached phrases, and a narrate-before-act UX pattern that makes the whole thing feel intentional rather than magic.

How a turn works

Every voice command follows the same pipeline. The key insight: the user should hear feedback before anything moves.

Anatomy of a voice turn:

- Hotkey press: 0ms
- Voice capture + STT (vox): ~400–500ms
- Ack sound (cached, plays in parallel): <50ms
- LLM inference: 500–1200ms
- TTS narration: starts at ~1s
- Execute: <1ms
- Done: <50ms

The phases overlap: vox ~500ms, LLM 500–1200ms, TTS stream + act ~1.5s. Audible feedback arrives at ~500ms; windows move at ~3s in the slowest case.

The cached ack sound is important. It plays in under 50ms — before inference even starts — so the user knows the system heard them. Then the AI narrates what it’s about to do, and only then do windows move. The user is never surprised.
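Stripped down to its ordering, a turn can be sketched as a small orchestration function. The names here are illustrative, not the actual worker API; the point is the sequence: cached ack first, narration before any window moves.

```typescript
// Hypothetical turn orchestration; the real logic lives in the bun worker.
type Turn = {
  playCachedAck: () => void;                 // <50ms, pre-rendered audio
  infer: (transcript: string) => Promise<{ spoken: string; actions: unknown[] }>;
  speak: (text: string) => Promise<void>;    // streaming TTS
  execute: (actions: unknown[]) => void;     // Swift side applies frames
};

async function handleTurn(transcript: string, t: Turn): Promise<void> {
  t.playCachedAck();                         // instant feedback, overlaps inference
  const { spoken, actions } = await t.infer(transcript);
  await t.speak(spoken);                     // narrate first...
  t.execute(actions);                        // ...then move windows
}
```

The ack is fire-and-forget, so it costs nothing on the critical path; narration gates execution, which is what makes the interaction feel intentional.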

The architecture

The system spans three layers: a native Swift app that owns the desktop, a persistent bun worker for inference and TTS, and cloud APIs for the heavy lifting.

System architecture:

- Vox (push-to-talk): voice capture + STT
- Swift app (menu bar + AX): desktop control
- Bun worker (stdin/stdout): inference + TTS
- System prompt: hot-reloaded .md files
- TTS cache: ~/.lattices/
- Groq (Llama 3.3 70B): ~600ms
- xAI (Grok): ~1.2s
- OpenAI (TTS-1): streaming PCM
- ffplay: PCM audio playback

Data flows over a WebSocket, JSON lines between the Swift app and the worker, a streaming TTS response, a PCM pipe into ffplay, and pre-cached audio from disk.

Swift and the worker communicate over stdin/stdout JSON lines. The worker stays alive between turns — no cold starts. System prompts are loaded from .md files on disk and hot-reload on save, so prompt iteration doesn’t require rebuilding.
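A minimal sketch of that protocol loop, assuming a message shape like `{ type, transcript, snapshot }` (the real wire format may differ):

```typescript
// One JSON object per line over stdin; the process stays alive between
// turns, so there are no cold starts. Message shapes are assumptions.
import * as readline from "node:readline";

type WorkerMsg =
  | { type: "turn"; transcript: string; snapshot: unknown }
  | { type: "reload-prompt" };

function parseLine(line: string): WorkerMsg | null {
  try {
    return JSON.parse(line) as WorkerMsg;
  } catch {
    return null; // ignore malformed lines rather than crashing the worker
  }
}

function listen(onMsg: (m: WorkerMsg) => void): void {
  const rl = readline.createInterface({ input: process.stdin });
  rl.on("line", (line) => {
    const msg = parseLine(line);
    if (msg) onMsg(msg);
  });
}
```

Swallowing malformed lines instead of throwing matters here: a single bad write from the Swift side shouldn't kill a worker whose whole value is staying warm.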

The inference layer wraps the Vercel AI SDK and supports five providers. Swapping models is one line:

const { text } = await infer("tile chrome left", {
  provider: "groq",
  model: "llama-3.3-70b-versatile",
  system: systemPrompt,
  tag: "hands-off",
});

For TTS, we stream OpenAI’s response directly into ffplay via stdin pipe — PCM format, no decoding overhead. Playback begins on the first audio chunk:

import { spawn } from "node:child_process";

// `res` is the fetch() Response from the TTS endpoint
const player = spawn("ffplay", [
  "-nodisp", "-autoexit", "-loglevel", "quiet",
  "-f", "s16le", "-ar", "24000", "-ch_layout", "mono", "-"
]);

const reader = res.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  player.stdin.write(value);
}
player.stdin.end(); // close the pipe so -autoexit can fire

The context problem

A voice assistant is only as good as the information it has. Our first snapshot sent window titles and positions. That’s it. Two iTerm windows both named “Claude Code” looked identical to the AI.

The fix: send everything.

What the AI sees: desktop-snapshot.json (v2), ~2,000 tokens

{
  "screens": [
    { "name": "Built-in", "res": "1728×1117", "primary": true },
    { "name": "LG UltraFine", "res": "2560×1440" }
  ],
  "windows": [
    { "app": "iTerm2", "wid": 423, "z": 0, "screen": 1,
      "cwd": "~/dev/lattices", "hasClaude": true, "tmux": "lattices" },
    { "app": "iTerm2", "z": 1, "screen": 1,
      "cwd": "~/dev/vox", "hasClaude": true, "tmux": "vox" },
    { "app": "Google Chrome", "z": 2, "screen": 2 },
    { "app": "Finder", "z": 3, "screen": 2 }
  ],
  "activeLayer": "L1"
}

“Focus on the lattices Claude Code” → correctly picks wid 423 (cwd: ~/dev/lattices)

A typical snapshot is about 2,000 tokens — nothing for a modern LLM. But the intelligence difference is dramatic. Every window gets its frame, Z-order, and on-screen status. Every terminal tab gets its working directory, running processes, tmux session name, and whether Claude Code is active. All screens with their resolutions. The active layer.

“Focus on the lattices Claude Code” goes from “I see several Claude Code windows” to correctly identifying the one in ~/dev/lattices and focusing it.
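The snapshot above suggests a shape like the following. Field names are inferred from the example, and `findClaudeWindow` is a hypothetical helper showing how enrichment turns disambiguation into a simple filter:

```typescript
// Sketch of the v2 snapshot types; not the canonical schema.
interface ScreenInfo {
  name: string;
  res: string;
  primary?: boolean;
}

interface WindowInfo {
  app: string;
  z: number;        // z-order, 0 = frontmost
  screen: number;
  wid?: number;     // window id used when targeting actions
  cwd?: string;     // terminal enrichment
  hasClaude?: boolean;
  tmux?: string;
}

interface DesktopSnapshot {
  screens: ScreenInfo[];
  windows: WindowInfo[];
  activeLayer: string;
}

// "the lattices Claude Code" reduces to: has Claude active, cwd matches.
function findClaudeWindow(snap: DesktopSnapshot, project: string): WindowInfo | undefined {
  return snap.windows.find((w) => w.hasClaude && w.cwd?.endsWith(project));
}
```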

Testing at scale

We built an automated test suite: 100 scenarios, 8 categories, 202 individual assertions. Each scenario feeds a transcript and desktop snapshot through the inference pipeline and checks the response against specific conditions.
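The scenario shape might look roughly like this (assumed names; the real suite lives in test/handsoff-tests.ts):

```typescript
// A check is a predicate over the parsed inference response.
type InferResult = { spoken: string; actions: { type: string }[] };
type Check = (res: InferResult) => boolean;

interface Scenario {
  transcript: string;
  snapshot: unknown;   // desktop snapshot fed alongside the transcript
  checks: Check[];
}

// Run one scenario's checks against a response and tally results.
function runScenario(
  s: Scenario,
  res: InferResult
): { passed: number; failed: number } {
  let passed = 0;
  let failed = 0;
  for (const check of s.checks) {
    check(res) ? passed++ : failed++;
  }
  return { passed, failed };
}
```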

Test results: 97% overall (196/202 checks)

- Awareness: 100% (30/30)
- Tiling: 100% (36/36)
- Layouts: 100% (30/30)
- Focus: 96% (23/24)
- Context: 94% (17/18)
- Intelligence: 100% (30/30)
- Error handling: 67% (8/12)
- Speech quality: 100% (22/22)

The weakest area, error handling at 67%, is a known issue. A typical failure: the model correctly says Photoshop isn’t running, but still sends a focus action for it. Function calling with typed parameters would fix this: the model calls tile_window() with a validated window reference, and we reject impossible actions before execution.
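A pre-execution validator along those lines could be as simple as the following (action shapes are assumed for illustration):

```typescript
// Reject actions that target apps not present in the snapshot, so a
// "focus Photoshop" hallucination never reaches the Swift side.
interface Action {
  type: "focus" | "tile";
  app: string;
}

function validateActions(
  actions: Action[],
  runningApps: Set<string>
): { valid: Action[]; rejected: Action[] } {
  const valid: Action[] = [];
  const rejected: Action[] = [];
  for (const a of actions) {
    (runningApps.has(a.app) ? valid : rejected).push(a);
  }
  return { valid, rejected };
}
```

With typed tool calls, this check moves into the schema itself; until then, a filter like this catches the "says no, acts anyway" class of bug.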

What matters most

Two next steps are tempting. The first is the dual-model pattern: fire a fast model (Groq, ~500ms) for a spoken ack while a smarter model (Grok, ~1.2s) generates the actual actions. By the time the ack finishes playing, the smart response is ready.
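In code, the pattern is just two overlapping promises (function names are placeholders, not the worker's API):

```typescript
// Start the smart request immediately; speaking the fast model's ack
// overlaps with the smart model's inference, hiding its latency.
async function dualModelTurn(
  fastAck: () => Promise<string>,          // e.g. Groq, ~500ms
  smartActions: () => Promise<unknown[]>,  // e.g. Grok, ~1.2s
  speak: (text: string) => Promise<void>
): Promise<unknown[]> {
  const actionsPromise = smartActions();   // kick off first, don't await yet
  await speak(await fastAck());            // speaking overlaps inference
  return actionsPromise;                   // usually resolved by now
}
```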

Stage Manager awareness is the other. We figured out how to detect stages, classify windows, and programmatically create new stages by simulating drag events; no other window manager does this. “Put Chrome and iTerm in a new stage” would just work.

But honestly, the thing that matters most is the prompt. Every improvement to the system prompt has a bigger impact than any architectural change. When we added the rule “if spoken says you’ll do something, actions must include it,” empty-actions bugs nearly vanished. When we added concrete JSON examples for each scenario type, accuracy jumped across the board.

The code is fast enough. The architecture is right. Now it’s about making the AI genuinely helpful.


Lattices is open source. The hands-off mode code lives in bin/handsoff-worker.ts (the worker), lib/infer.ts (the inference wrapper), and docs/prompts/hands-off-system.md (the system prompt). The test suite is at test/handsoff-tests.ts.