What the workflow author controls
In multi-agent rooms (roster), each agent can speak:
agent block
is text-first; voice sessions start from the desktop surfaces.
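As a rough illustration of per-agent voice in a roster, the Python sketch below models each entry with a `voice` flag and a `voice_name`. Only those two field names come from this page. Their types, the `AgentEntry` class, the `speaking_agents` helper, and the `voice-a` / `voice-b` values are assumptions made for the example; the Workflow Schema page is the authority on the real roster field shapes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentEntry:
    """Hypothetical roster entry; only `voice` and `voice_name` echo names from this page."""
    name: str
    voice: bool = False               # assumed: opt-in flag that lets the agent speak
    voice_name: Optional[str] = None  # assumed: identifier of the voice to speak with


def speaking_agents(roster: list[AgentEntry]) -> dict[str, str]:
    """Map each voice-enabled agent to its voice name, rejecting incomplete entries."""
    voices: dict[str, str] = {}
    for agent in roster:
        if not agent.voice:
            continue  # agent stays text-only
        if not agent.voice_name:
            raise ValueError(f"agent {agent.name!r} has voice enabled but no voice_name")
        voices[agent.name] = agent.voice_name
    return voices


# Placeholder roster: two speaking agents and one text-only agent.
roster = [
    AgentEntry("moderator", voice=True, voice_name="voice-a"),
    AgentEntry("analyst", voice=True, voice_name="voice-b"),
    AgentEntry("note_taker"),
]
print(speaking_agents(roster))
```

Checking the roster up front like this would surface a missing `voice_name` before a session starts, rather than partway through a spoken panel.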
What the app controls (not workflow-configurable)
Transport (WebSocket vs WebRTC), input/output modalities, and turn detection (server VAD) are selected by the desktop runtime per session. SIP telephony and custom vendor voices are policy-blocked: they require a telephony trunk and vendor program access, which deeda does not hold.
What’s implemented (probed)
OpenAI Realtime: WebSocket + WebRTC transports, in-session function tools, remote MCP attachment, transcription-only sessions, input-audio transcription, server-VAD turn detection, and the full client-event set. Local voice: on-device STT with cloud-assisted TTS (Gemini TTS via gcloud auth). See OpenAI and Local for row-level status.
When to use what
| Task | Route |
|---|---|
| Voice conversation with one assistant | Desktop app voice surface (local-voice / realtime) — no workflow needed |
| Multi-expert spoken panel | roster with per-agent voice/voice_name |
| Transcribe audio only | openai-realtime-api transcription session |
| Offline/private voice | local-voice (on-device STT; TTS needs gcloud auth) |
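The table above can also be read as a simple lookup. The sketch below encodes it in Python. The route strings are copied from the table; the task keys and the `pick_route` helper are hypothetical names introduced for illustration and are not part of the product.

```python
# Routes copied from the table above; the dictionary keys are made-up task labels.
ROUTES = {
    "single_assistant_voice": "Desktop app voice surface (local-voice / realtime)",
    "multi_expert_panel": "roster with per-agent voice/voice_name",
    "transcribe_only": "openai-realtime-api transcription session",
    "offline_private_voice": "local-voice",
}


def pick_route(task: str) -> str:
    """Return the documented route for a task label, or fail with the valid labels."""
    try:
        return ROUTES[task]
    except KeyError:
        valid = ", ".join(sorted(ROUTES))
        raise ValueError(f"unknown task {task!r}; expected one of: {valid}") from None


print(pick_route("transcribe_only"))
```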
See also
- Runtimes — knobs and starter
- Workflow Schema — roster field shapes