ELI5: voice has two homes. The desktop app owns live audio plumbing (transports, turn detection, microphones). Workflows own which agent speaks with which voice in multi-agent rooms. You declare the what; the app runs the how.

What the workflow author controls

In multi-agent rooms (roster), each agent can be given its own voice backend and voice name:
roster:
  - id: analyst
    runtime_key: openai-realtime-api
    voice: openai-realtime     # voice backend
    voice_name: cedar          # specific voice
  - id: narrator
    runtime_key: gemini-genai-sdk
    voice: gemini-tts
    voice_name: Kore
Single-agent workflows have no voice field today — a lone agent block is text-first; voice sessions start from the desktop surfaces.
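
As a rough illustration of how the roster above can be read, the sketch below maps each agent id to its declared voice backend and voice name. It is not the app's loader. The roster is written as plain Python data to avoid a YAML parser, the `voice_assignments` helper and the `scribe` agent are hypothetical, and treating an agent without a `voice` field as text-only is an assumption.

```python
# Roster mirroring the YAML above, as plain Python data (no YAML parser needed).
ROSTER = [
    {"id": "analyst", "runtime_key": "openai-realtime-api",
     "voice": "openai-realtime", "voice_name": "cedar"},
    {"id": "narrator", "runtime_key": "gemini-genai-sdk",
     "voice": "gemini-tts", "voice_name": "Kore"},
    # Hypothetical agent with no voice fields, to show the text-only case.
    {"id": "scribe", "runtime_key": "gemini-genai-sdk"},
]


def voice_assignments(roster):
    """Map each agent id to (voice, voice_name), or None if no voice is declared.

    Assumption: an agent without a `voice` field is text-only.
    """
    assignments = {}
    for agent in roster:
        voice = agent.get("voice")
        if voice is None:
            assignments[agent["id"]] = None
        else:
            assignments[agent["id"]] = (voice, agent.get("voice_name"))
    return assignments


if __name__ == "__main__":
    for agent_id, assignment in voice_assignments(ROSTER).items():
        print(agent_id, assignment)
```

The helper only reports what the workflow declares. Transport, modalities, and turn detection do not appear in it because the workflow does not set them.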

What the app controls (not workflow-configurable)

Transport (WebSocket vs WebRTC), input/output modalities, and turn detection (server VAD) are selected by the desktop runtime per session. SIP telephony and custom vendor voices are policy-blocked: they require a telephony trunk and vendor program access, respectively, which deeda does not hold.

What’s implemented (probed)

OpenAI Realtime: WebSocket + WebRTC transports, in-session function tools, remote MCP attachment, transcription-only sessions, input-audio transcription, server-VAD turn detection, and the full client-event set. Local voice: on-device STT with cloud-assisted TTS (Gemini TTS via gcloud auth). See OpenAI and Local for row-level status.

When to use what

Task | Route
Voice conversation with one assistant | Desktop app voice surface (local-voice / realtime); no workflow needed
Multi-expert spoken panel | roster with per-agent voice/voice_name
Transcribe audio only | openai-realtime-api transcription session
Offline/private voice | local-voice (on-device STT; TTS needs gcloud auth)

See also