# Voice integration guide
Add voice to an existing gecx-chat integration with one config line and one component. The same ChatSession carries both text and voice turns; identity, governance, and the message stream are shared.
## Quickstart

```tsx
'use client';

import { useChatSession, ChatThread, VoiceToggle } from 'gecx-chat/react';
import { createProxyTransport } from 'gecx-chat';

export default function SupportPage() {
  const chat = useChatSession({
    config: {
      transport: createProxyTransport({ endpoint: '/api/gecx-chat-proxy' }),
      voice: { tokenEndpoint: '/api/voice-token' },
    },
  });

  return (
    <>
      <ChatThread messages={chat.messages} onSend={chat.send} />
      <VoiceToggle voice={chat.voice} />
    </>
  );
}
```
That's it. The `voice` key on `useChatSession`'s config enables voice; `<VoiceToggle>` mounts the mic affordance and surfaces the listening/speaking state. The proxy's `/api/voice-token` route mints ephemeral tokens (one copy-paste from `apps/proxy-reference/src/server.ts`).
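The token route's job is small: mint a short-lived credential the browser can hand to the voice provider. Below is a hypothetical sketch of that shape — the authoritative version is the copy-paste from `apps/proxy-reference/src/server.ts`; `mintVoiceToken` and the `VoiceToken` interface are illustrative names, not SDK exports:

```ts
import { randomBytes } from 'node:crypto';

// Illustrative shape of an ephemeral voice token (not the SDK's real type).
interface VoiceToken {
  token: string;
  expiresAt: number; // epoch ms
}

// Mint a short-lived, random credential. In a real proxy this would be the
// body of the POST /api/voice-token handler.
function mintVoiceToken(ttlMs = 60_000): VoiceToken {
  return {
    token: randomBytes(32).toString('base64url'),
    expiresAt: Date.now() + ttlMs,
  };
}
```

Keep the TTL short: the token only needs to survive the handshake that opens the realtime session.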
`chat.voice` is a lazy getter — nothing (microphone, WebSocket, provider factory) happens until something first reads it. `<VoiceToggle>` does that read on mount and renders a mic-permission probe with inline remediation if access is blocked.
## Latency instrumentation

`VoiceSession` emits a `metrics` event with `firstAudioMs` — the latency from `start()` to the first audible frame, targeting < 400 ms. The showcase `/voice` page renders this as a live overlay. Pipe it into your own analytics sink if you want production tracking:

```ts
voice.on('metrics', (m) => {
  if (typeof m.firstAudioMs === 'number') {
    analytics.track('voice_first_audio', { ms: m.firstAudioMs });
  }
});
```
## Permissions

Microphone access flows through `PermissionManager`. `VoiceSession.start()` calls `permissions.ensure('microphone')` before opening the realtime session and forwards the captured `MediaStream` to the provider. If access is denied or HTTPS is unavailable, the appropriate `PERMISSION_*` error surfaces through `ChatSession.error`, and `<VoiceToggle>` renders inline guidance. See Permissions.
Even faster: `npx gecx scaffold --with-voice ./my-app` generates the page and the token route.
## Environment-driven defaults

If you don't want to spell out the provider, set `voice: 'auto'` and let the SDK pick by environment:

| Environment | Provider | Mic needed? | Token endpoint needed? |
|---|---|---|---|
| `test` | scripted `createMockVoiceProvider()` | no | no |
| `development` | `createWebAudioMockVoiceProvider()` | yes (real prompt) | no |
| `staging`, `production` | `createGeminiLiveProvider()` | yes — defaults to `/chat/voice-token` |

Wait — table above has four columns; corrected:

| Environment | Provider | Mic needed? | Token endpoint needed? |
|---|---|---|---|
| `test` | scripted `createMockVoiceProvider()` | no | no |
| `development` | `createWebAudioMockVoiceProvider()` | yes (real prompt) | no |
| `staging`, `production` | `createGeminiLiveProvider()` | yes | yes — defaults to `/chat/voice-token` |
```ts
useChatSession({
  config: {
    environment: 'production',
    transport: createProxyTransport({ endpoint: '/api/gecx-chat-proxy' }),
    voice: 'auto',
  },
});
```
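The selection table can be read as a small function of the environment. The sketch below is an illustrative mirror of the `'auto'` behavior, not the SDK's actual resolution code; `resolveAutoProvider` and the `Environment` type are made-up names:

```ts
type Environment = 'test' | 'development' | 'staging' | 'production';

interface AutoChoice {
  provider: string;
  needsMic: boolean;
  tokenEndpoint?: string;
}

// Illustrative mirror of the environment table above.
function resolveAutoProvider(env: Environment): AutoChoice {
  switch (env) {
    case 'test':
      return { provider: 'mock', needsMic: false };
    case 'development':
      return { provider: 'web-audio-mock', needsMic: true };
    default: // 'staging' and 'production' share the real provider
      return { provider: 'gemini-live', needsMic: true, tokenEndpoint: '/chat/voice-token' };
  }
}
```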
`createChatClient().validateConfig()` returns an error (not a warning) when an explicit `voice: { provider: 'gemini-live' }` is missing `tokenEndpoint` in production or staging. Run `gecx doctor --voice-token-endpoint <url>` to probe the live route.
## Lower-level API

If you need full control — for example, swapping providers mid-session, or using a partner adapter — drive `VoiceSession` directly. `<VoiceComposer>` accepts a `VoiceSession` instance and is what `<VoiceToggle>` wraps.
```tsx
import { VoiceSession, createMockVoiceProvider } from 'gecx-chat/voice';
import { VoiceComposer, TranscriptDisplay } from 'gecx-chat/react/voice';

const vs = new VoiceSession({
  provider: createMockVoiceProvider(),
  chatSession: chat.session!,
  mode: 'push-to-talk',
});

<VoiceComposer voiceSession={vs} />
<TranscriptDisplay messages={chat.messages} />
```
The high-level config path is built on top of this; both stay public.
## Latency metrics

`VoiceSession.metrics` exposes a rolling `firstAudioMs` — the time from the user finishing their turn (push-to-talk release or open-mic end-of-turn) to the first audio frame of the response. Subscribe to the `metrics` event for live updates:

```ts
voice.on('metrics', (m) => {
  console.log('first audio ms:', m.firstAudioMs);
});
```

p95 first-audio under 400 ms is the design target. The Voice Lab page in `apps/showcase` displays a live overlay; copy the pattern into your dev tools layer if useful.
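If you want to watch the p95 target yourself, a small rolling tracker is enough. This is an illustrative helper wired by you to the `metrics` event, not part of the SDK:

```ts
// Collects firstAudioMs samples and reports the p95 (nearest-rank method).
class LatencyTracker {
  private samples: number[] = [];

  record(ms: number): void {
    this.samples.push(ms);
  }

  p95(): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.95) - 1);
    return sorted[idx];
  }
}
```

Usage: call `tracker.record(m.firstAudioMs)` inside your `metrics` handler and alert when `tracker.p95()` drifts above 400.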
## Modes

- Push-to-talk. User presses (or holds Space) → mic captures → release sends. Default and recommended for most chat surfaces.
- Open-mic. Continuous capture with server VAD detecting turn boundaries. Lower friction; better for voice-first surfaces.

```ts
voice: { mode: 'open-mic' }
```
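The push-to-talk flow (press → capture, release → send) can be modeled as a two-state machine. This reducer is purely illustrative, not the SDK's internal implementation:

```ts
type PttState = 'idle' | 'listening';
type PttEvent = 'press' | 'release';

// Returns the next state plus whether this transition sends the captured turn.
function pttReducer(state: PttState, event: PttEvent): { state: PttState; send: boolean } {
  if (state === 'idle' && event === 'press') {
    return { state: 'listening', send: false }; // mic starts capturing
  }
  if (state === 'listening' && event === 'release') {
    return { state: 'idle', send: true }; // release sends the turn
  }
  return { state, send: false }; // ignore redundant events
}
```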
## Accessibility

- The mic button is a `<button>` with `aria-pressed` and `aria-label`.
- `<TranscriptDisplay>` uses ARIA live regions: `polite` for interim, `assertive` for final.
- The push-to-talk key is configurable via the `pushToTalkKey` prop on `<VoiceToggle>` and `<VoiceComposer>`.
- Reduced motion is respected — `<AudioMessage>` falls back to a static bar chart.
- `<VoiceToggle>` surfaces a `data-voice-state` attribute (`idle | listening | speaking | denied | …`) so apps can style transitions without re-implementing the state machine.
## Governance

Audio bytes are session-ephemeral by default. To retain them in `exportConversation` bundles:
```ts
import type { DataGovernancePolicy } from 'gecx-chat';

const policy: DataGovernancePolicy = {
  retention: { audio: 'session' },
  consent: { flags: ['voice_recording'] },
};
```
Grant `voice_recording` at runtime after the user agrees to a "record this conversation" prompt:
```ts
client.governance.grantConsentFlag('voice_recording', { source: 'modal' });
```
`forgetMe` always clears both transcripts and any retained audio.
## Barge-in

Server-VAD providers (Gemini Live, scripted mock) emit `audio-cue: speech-started` automatically. The SDK's barge-in arbiter aborts the in-flight assistant turn within 300 ms and emits `audio-cue: barge-in` so your UI can show the interruption. To trigger barge-in manually:
```ts
await voice.interrupt();
```
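Conceptually, barge-in amounts to cancelling whatever work is tied to the current assistant turn when speech is detected. A minimal sketch of that arbitration pattern using `AbortController` (illustrative only; the SDK's arbiter is not shown here and `createBargeInArbiter` is a made-up name):

```ts
// Tracks the in-flight assistant turn and aborts it when the user speaks.
function createBargeInArbiter() {
  let current: AbortController | null = null;
  return {
    // Call when an assistant turn starts; tie playback/streaming to the signal.
    beginAssistantTurn(): AbortSignal {
      current = new AbortController();
      return current.signal;
    },
    // Call on the speech-started cue; cancels the in-flight turn, if any.
    onSpeechStarted(): void {
      current?.abort();
      current = null;
    },
  };
}
```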
## Lazy construction

Configuring voice does not request mic permission or open any connection. The `VoiceSession` is constructed only when something first reads `session.voice` (or renders `<VoiceToggle>` against a non-null `voice`). The browser microphone prompt appears only when the user actually presses the mic. This is load-bearing: pages that include voice config but never expose voice (e.g. a settings page reusing the same client) do not surface a permission prompt.
## Gotchas

- `useChatSession` captures its `config` on mount. Changing `config.voice` afterwards has no effect — to switch providers, force a remount via `key`.
- `voice: 'auto'` in `production` will POST `/chat/voice-token`. If your proxy mounts the route elsewhere, pass `voice: { tokenEndpoint: '/your/path' }`.
- The `webRTCVoiceTransport` export is a stub; production voice flows through the Gemini Live provider over WebSocket.
## Common errors

| Error | Fix |
|---|---|
| `VOICE_DEVICE_PERMISSION_DENIED` | Serve over HTTPS and prompt for mic permission. `<VoiceToggle>` shows inline remediation when this is hard-denied. |
| `VOICE_CONSENT_REQUIRED` | Grant the `voice_recording` flag before requesting retention. |
| `VOICE_PROVIDER_UNAVAILABLE` | For Gemini Live, verify `/chat/voice-token` is wired and reachable. |
| `VOICE_TOKEN_INVALID` | Check proxy clock skew and token scopes. |
| `VOICE_TOKEN_ENDPOINT_MISSING` | Add `voice.tokenEndpoint` (or use `voice: 'auto'`). |
Full table: error-codes reference.
## Verification

```sh
pnpm size                    # voice subpath ≤ 15 KB gzip
pnpm test                    # unit tests (resolveVoiceConfig, lazy voice, permission probe)
gecx doctor --voice-token-endpoint http://localhost:8787/chat/voice-token
pnpm --filter showcase dev   # visit http://localhost:3000/voice
pnpm e2e --grep voice        # Playwright coverage
```