# Voice integration guide

Add voice to an existing `gecx-chat` integration with one config line and one component. The same `ChatSession` carries both text and voice turns; identity, governance, and the message stream are shared.

## Quickstart

```tsx
'use client';

// Note: ChatThread added to the import — it is used below.
import { useChatSession, ChatThread, VoiceToggle } from 'gecx-chat/react';
import { createProxyTransport } from 'gecx-chat';

export default function SupportPage() {
  const chat = useChatSession({
    config: {
      transport: createProxyTransport({ endpoint: '/api/gecx-chat-proxy' }),
      voice: { tokenEndpoint: '/api/voice-token' },
    },
  });

  return (
    <>
      <ChatThread messages={chat.messages} onSend={chat.send} />
      <VoiceToggle voice={chat.voice} />
    </>
  );
}
```

That's it. The `voice` key on `useChatSession`'s config enables voice; `<VoiceToggle>` mounts the mic affordance and surfaces the listening / speaking state. The proxy `/api/voice-token` route mints ephemeral tokens (one copy-paste from `apps/proxy-reference/src/server.ts`).

`chat.voice` is a lazy getter — nothing (microphone, WebSocket, provider factory) happens until something first reads it. `<VoiceToggle>` does that read on mount and renders a mic-permission probe with inline remediation if access is blocked.
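The deferral pattern can be pictured as a plain property getter that constructs on first read. A minimal self-contained sketch — `makeChat` and `createVoiceSession` are illustrative stand-ins, not SDK exports:

```typescript
// Sketch of a lazy `voice` getter: nothing is constructed until the
// property is first read; later reads reuse the cached instance.
type VoiceSessionLike = { start(): void };

function makeChat(createVoiceSession: () => VoiceSessionLike) {
  let voice: VoiceSessionLike | undefined;
  return {
    get voice(): VoiceSessionLike {
      // First read triggers construction; subsequent reads hit the cache.
      return (voice ??= createVoiceSession());
    },
  };
}

let constructed = 0;
const chat = makeChat(() => {
  constructed++;
  return { start() {} };
});

console.log(constructed); // 0 — config alone builds nothing
chat.voice;               // first read constructs
chat.voice;               // cached
console.log(constructed); // 1
```

This is why merely configuring voice never prompts for the microphone: the expensive wiring sits behind the first read.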

## Latency instrumentation

`VoiceSession` emits a `metrics` event with `firstAudioMs` — the latency from `start()` to first audible frame, targeting < 400 ms. The showcase `/voice` page renders this as a live overlay. Pipe it into your own analytics sink if you want production tracking:

```ts
voice.on('metrics', (m) => {
  if (typeof m.firstAudioMs === 'number') {
    analytics.track('voice_first_audio', { ms: m.firstAudioMs });
  }
});
```

## Permissions

Microphone access flows through `PermissionManager`. `VoiceSession.start()` calls `permissions.ensure('microphone')` before opening the realtime session and forwards the captured `MediaStream` to the provider. If access is denied or HTTPS is unavailable, the appropriate `PERMISSION_*` error surfaces through `ChatSession.error` and `<VoiceToggle>` renders inline guidance. See Permissions.
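The `ensure` step amounts to a cache around a single permission prompt plus a typed error on denial. A hypothetical sketch (class and error names are illustrative, not the real `PermissionManager` internals):

```typescript
// Hypothetical sketch of PermissionManager.ensure('microphone'):
// the underlying prompt runs at most once, a grant is cached, and a
// denial surfaces as a typed error the UI can match on.
class PermissionDeniedError extends Error {
  readonly code = 'PERMISSION_DENIED';
}

class PermissionCache {
  private granted = new Set<string>();

  // `prompt` stands in for the real browser permission request.
  constructor(private prompt: (name: string) => boolean) {}

  ensure(name: string): void {
    if (this.granted.has(name)) return; // cached grant — no re-prompt
    if (!this.prompt(name)) {
      throw new PermissionDeniedError(`${name} access denied`);
    }
    this.granted.add(name);
  }
}

let prompts = 0;
const permissions = new PermissionCache(() => (prompts++, true));
permissions.ensure('microphone');
permissions.ensure('microphone'); // cached — no second prompt
console.log(prompts); // 1
```

The typed `code` is what lets `<VoiceToggle>`-style components branch to inline remediation instead of a generic error state.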

Even faster: `npx gecx scaffold --with-voice ./my-app` generates the page and the token route.

## Environment-driven defaults

If you don't want to spell out the provider, set `voice: 'auto'` and let the SDK pick by environment:

| Environment | Provider | Mic needed? | Token endpoint needed? |
| --- | --- | --- | --- |
| `test` | scripted `createMockVoiceProvider()` | no | no |
| `development` | `createWebAudioMockVoiceProvider()` | yes (real prompt) | no |
| `staging`, `production` | `createGeminiLiveProvider()` | yes | yes — defaults to `/chat/voice-token` |
```ts
useChatSession({
  config: {
    environment: 'production',
    transport: createProxyTransport({ endpoint: '/api/gecx-chat-proxy' }),
    voice: 'auto',
  },
});
```

`createChatClient().validateConfig()` returns an error (not a warning) when an explicit `voice: { provider: 'gemini-live' }` is missing `tokenEndpoint` in production or staging. Run `gecx doctor --voice-token-endpoint <url>` to probe the live route.
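The rule reduces to a small predicate. A sketch of the check — `VoiceConfig` and the error shape here are simplified stand-ins for the SDK's real types:

```typescript
// Sketch of the validateConfig rule: gemini-live in staging/production
// without a tokenEndpoint is a hard error, not a warning.
type Environment = 'test' | 'development' | 'staging' | 'production';
type VoiceConfig = 'auto' | { provider?: string; tokenEndpoint?: string };

function validateVoice(
  env: Environment,
  voice: VoiceConfig,
): { code: 'VOICE_TOKEN_ENDPOINT_MISSING'; message: string } | null {
  if (voice === 'auto') return null; // 'auto' falls back to /chat/voice-token
  const tokenRequired = env === 'production' || env === 'staging';
  if (tokenRequired && voice.provider === 'gemini-live' && !voice.tokenEndpoint) {
    return {
      code: 'VOICE_TOKEN_ENDPOINT_MISSING',
      message: `voice.tokenEndpoint is required for gemini-live in ${env}`,
    };
  }
  return null; // no token needed, or one was provided
}
```

Development and test environments pass because their mock providers never contact a token endpoint.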

## Lower-level API

If you need full control — for example, swapping providers mid-session, or using a partner adapter — drive `VoiceSession` directly. `<VoiceComposer>` accepts a `VoiceSession` instance and is what `<VoiceToggle>` wraps.

```tsx
import { VoiceSession, createMockVoiceProvider } from 'gecx-chat/voice';
import { VoiceComposer, TranscriptDisplay } from 'gecx-chat/react/voice';

const vs = new VoiceSession({
  provider: createMockVoiceProvider(),
  chatSession: chat.session!,
  mode: 'push-to-talk',
});

<VoiceComposer voiceSession={vs} />
<TranscriptDisplay messages={chat.messages} />
```

The high-level config path is built on top of this; both stay public.

## Latency metrics

`VoiceSession.metrics` exposes a rolling `firstAudioMs` — the time from the user finishing their turn (push-to-talk release or open-mic end-of-turn) to the first audio frame of the response. Subscribe to the `'metrics'` event for live updates:

```ts
voice.on('metrics', ({ metrics }) => {
  console.log('first audio ms:', metrics.firstAudioMs);
});
```

A p95 first-audio latency under 400 ms is the design target. The Voice Lab page in `apps/showcase` displays a live overlay; copy the pattern into your dev-tools layer if useful.
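If you want to watch the p95 target yourself, a rolling window over the metric samples is enough. A minimal sketch — window size and wiring are your choice, nothing here is an SDK API:

```typescript
// Rolling nearest-rank percentile over the last `windowSize`
// firstAudioMs samples — enough to watch the p95 < 400 ms target.
class RollingPercentile {
  private samples: number[] = [];

  constructor(private windowSize = 100) {}

  push(ms: number): void {
    this.samples.push(ms);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  p(quantile: number): number | undefined {
    if (this.samples.length === 0) return undefined;
    const sorted = [...this.samples].sort((a, b) => a - b);
    // Nearest-rank: the smallest sample covering `quantile` of the window.
    return sorted[Math.max(0, Math.ceil(quantile * sorted.length) - 1)];
  }
}
```

Feed it from the `'metrics'` subscription above and alert when `p(0.95)` drifts past 400.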

## Modes

- **Push-to-talk.** User presses (or holds Space) → mic captures → release sends. Default and recommended for most chat surfaces.
- **Open-mic.** Continuous capture with server VAD detecting turn boundaries. Lower friction; better for voice-first surfaces.

```ts
voice: { mode: 'open-mic' }
```

## Accessibility

- The mic button is a `<button>` with `aria-pressed` and `aria-label`.
- `<TranscriptDisplay>` uses ARIA live regions: `polite` for interim, `assertive` for final.
- The push-to-talk key is configurable via the `pushToTalkKey` prop on `<VoiceToggle>` and `<VoiceComposer>`.
- Reduced motion is respected — `<AudioMessage>` falls back to a static bar chart.
- `<VoiceToggle>` surfaces a `data-voice-state` attribute (`idle | listening | speaking | denied | …`) so apps can style transitions without re-implementing the state machine.

## Governance

Audio bytes are session-ephemeral by default. To retain them in `exportConversation` bundles:

```ts
import type { DataGovernancePolicy } from 'gecx-chat';

const policy: DataGovernancePolicy = {
  retention: { audio: 'session' },
  consent: { flags: ['voice_recording'] },
};
```

Grant `voice_recording` at runtime after the user agrees to a "record this conversation" prompt:

```ts
client.governance.grantConsentFlag('voice_recording', { source: 'modal' });
```

`forgetMe` always clears both transcripts and any retained audio.

## Barge-in

Server-VAD providers (Gemini Live, scripted mock) emit `audio-cue: speech-started` automatically. The SDK's barge-in arbiter aborts the in-flight assistant turn within 300 ms and emits `audio-cue: barge-in` so your UI can show the interruption. To trigger barge-in manually:

```ts
await voice.interrupt();
```
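At its core the arbiter's job is to cancel the in-flight assistant turn when speech starts. A sketch using a standard `AbortController` (the real arbiter also coordinates audio playback; class and method names are illustrative):

```typescript
// Barge-in sketch: a speech-started cue aborts the in-flight
// assistant turn via AbortController and notifies listeners once.
class BargeInArbiter {
  private inFlight: AbortController | null = null;

  constructor(private emit: (event: string) => void) {}

  beginAssistantTurn(): AbortSignal {
    this.inFlight = new AbortController();
    return this.inFlight.signal; // hand to the playback/stream pipeline
  }

  onSpeechStarted(): void {
    if (!this.inFlight) return; // nothing to interrupt
    this.inFlight.abort();      // cut off the assistant turn
    this.inFlight = null;
    this.emit('barge-in');      // let the UI show the interruption
  }
}
```

A manual `interrupt()` would route through the same path as the VAD cue, which is why both produce the same `barge-in` event for the UI.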

## Lazy construction

Configuring voice does not request mic permission or open any connection. The `VoiceSession` is constructed only when something first reads `session.voice` (or renders `<VoiceToggle>` against a non-null `voice`). The browser microphone prompt only appears when the user actually presses the mic. This is load-bearing: pages that include voice config but never expose voice (e.g. a settings page reusing the same client) do not surface a permission prompt.

## Gotchas

- `useChatSession` captures its config on mount. Changing `config.voice` afterwards has no effect — to switch providers, force a remount via `key`.
- `voice: 'auto'` in production will POST `/chat/voice-token`. If your proxy mounts the route elsewhere, pass `voice: { tokenEndpoint: '/your/path' }`.
- The `webRTCVoiceTransport` export is a stub; production voice flows through the Gemini Live provider over WebSocket.

## Common errors

| Error | Fix |
| --- | --- |
| `VOICE_DEVICE_PERMISSION_DENIED` | Serve over HTTPS and prompt for mic permission. `<VoiceToggle>` shows inline remediation when this is hard-denied. |
| `VOICE_CONSENT_REQUIRED` | Grant the `voice_recording` flag before requesting retention. |
| `VOICE_PROVIDER_UNAVAILABLE` | For Gemini Live, verify `/chat/voice-token` is wired and reachable. |
| `VOICE_TOKEN_INVALID` | Check proxy clock skew and token scopes. |
| `VOICE_TOKEN_ENDPOINT_MISSING` | Add `voice.tokenEndpoint` (or use `voice: 'auto'`). |

Full table: error-codes reference.

## Verification

```shell
pnpm size                       # voice subpath ≤ 15 KB gzip
pnpm test                       # unit tests (resolveVoiceConfig, lazy voice, permission probe)
gecx doctor --voice-token-endpoint http://localhost:8787/chat/voice-token
pnpm --filter showcase dev      # visit http://localhost:3000/voice
pnpm e2e --grep voice           # Playwright coverage
```
Source: docs/guides/voice-integration.md