Evaluation

gecx eval runs scenario-replay tests against a ChatSession, scores them with deterministic and LLM-judge scorers, and prints a pass/fail table plus a JSON report. It supports a regression gate that compares the current run against a committed baseline.

Quick start

pnpm gecx eval ./apps/showcase/scenarios

You'll see a per-scenario pass/fail table, plus aggregate metrics (pass rate, deflection rate, p50/p95 latency, hallucination rate, tool-call accuracy) and which judge providers are configured.

For CI:

pnpm gecx eval ./apps/showcase/scenarios \
  --baseline ./apps/showcase/baseline.eval.json \
  --fail-on-regress \
  --json --output eval-report.json

Exit code 0 means every scenario passed and no regression tripped. Exit code 1 means at least one scenario failed or a regression threshold was exceeded.

Writing a scenario

Scenarios come in two formats with the same shape. The runner loads them by walking the target directory for *.scenario.ts, *.scenario.yaml, and *.scenario.yml files.

TypeScript

// apps/showcase/scenarios/welcome.scenario.ts
export default {
  id: 'welcome',
  name: 'Welcome flow greets the user',
  tags: ['support'],
  when: [{ user: 'hello' }],
  then: [
    { scorer: 'message-contains', args: { pattern: 'help' } },
    { scorer: 'no-handoff' },
    { scorer: 'no-error' },
  ],
};

YAML

# apps/showcase/scenarios/invoice-lookup.scenario.yaml
id: invoice-lookup
name: Invoice lookup calls the invoice tool
tags: [support, tool-call]
when:
  - user: Please pull invoice INV-5678
then:
  - scorer: tool-called
    args:
      name: lookup_invoice
  - scorer: no-handoff
  - scorer: no-error

Anatomy

  • id — unique within the suite.
  • name — human-readable label for the report.
  • tags — optional. Use --filter <tag> on the CLI to run a subset. Tag any scenario that uses an LLM-judge scorer with llm-judge so CI can run those scenarios only when a judge API key is available.
  • when — the conversation turns, replayed in order. Each turn is either { user: '...' } to send a user message, { approveToolCall: 'tag' } to approve the next pending tool call, or { denyToolCall: 'tag' } to deny it (see the sketch after this list).
  • then — expectations evaluated in order. Each one names a scorer plus its args.
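
Approval turns let a scenario exercise human-in-the-loop tools. A minimal sketch, assuming the showcase app has a refund tool named issue_refund that pauses for approval (the tool name and the 'refund' tag are hypothetical):

// apps/showcase/scenarios/refund-approval.scenario.ts (hypothetical)
export default {
  id: 'refund-approval',
  name: 'Refund flow waits for approval before calling the tool',
  tags: ['support', 'tool-call'],
  when: [
    { user: 'Please refund order ORD-1234' },
    // approve the tool call that is pending after the user turn
    { approveToolCall: 'refund' },
  ],
  then: [
    { scorer: 'tool-called', args: { name: 'issue_refund' } },
    { scorer: 'no-error' },
  ],
};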

Built-in scorers

  • message-contains { pattern }: assistant text contains the pattern (case-insensitive substring)
  • message-matches-regex { pattern, flags? }: regex match over the assistant text
  • final-message-equals { pattern }: the last assistant message equals the given string
  • tool-called { name }: a tool with that name was requested
  • tool-called-with-input { name, input }: same as tool-called, plus matching input keys
  • no-tool-called { name? }: no tool (or no tool with the given name) was requested
  • tool-call-accuracy { expected: [...] }: fraction of expected tools matched
  • handoff-triggered: handoff_status_changed was emitted with a non-none status
  • no-handoff: no handoff happened
  • error-code { code }: a ChatSdkError with that code was emitted
  • no-error: no errors were emitted
  • latency-p50-under { ms }: p50 of assistant_response_completed.durationMs is below the threshold
  • latency-p95-under { ms }: p95 of the same metric is below the threshold
  • llm-judge-not-hallucinating { groundingContext, provider? }: an LLM judges whether the output is grounded
  • llm-judge-helpfulness { criteria?, provider? }: an LLM judges whether the output is helpful
  • llm-judge-tone-matches { tone, provider? }: an LLM judges whether the tone matches

provider defaults to anthropic. Set it to openai or gemini to use a different judge.
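
Scorers compose within a single then block, so deterministic checks and LLM judges can run over the same turn. A sketch building on the invoice scenario above; the grounding text and the choice of openai as the judge are illustrative:

// apps/showcase/scenarios/invoice-grounding.scenario.ts (hypothetical)
export default {
  id: 'invoice-grounding',
  name: 'Invoice answer stays grounded in the looked-up record',
  tags: ['support', 'llm-judge'],
  when: [{ user: 'How much was invoice INV-5678 and when was it paid?' }],
  then: [
    // deterministic: the invoice tool must have been requested
    { scorer: 'tool-called', args: { name: 'lookup_invoice' } },
    // LLM judge: the answer must stay within the supplied grounding text
    {
      scorer: 'llm-judge-not-hallucinating',
      args: {
        groundingContext: 'Invoice INV-5678 totals $120.00 and was paid on 2024-03-02.',
        provider: 'openai',
      },
    },
  ],
};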

Custom scorers

import { defineScorer } from 'gecx-chat/eval';

const allCapsTitle = defineScorer<{ minLength: number }>({
  id: 'all-caps-title',
  describe: (args) => `assistant title is ALL CAPS and >= ${args.minLength} chars`,
  async run(ctx, args) {
    const first = ctx.record.messages.find((m) => m.role === 'agent');
    const text = first?.parts.map((p) => (p.type === 'text' ? p.text : '')).join('') ?? '';
    const pass = text === text.toUpperCase() && text.length >= args.minLength;
    return { status: pass ? 'passed' : 'failed', score: pass ? 1 : 0 };
  },
});

Register it into the runner by passing your own ScorerRegistry:

import { createScorerRegistry, registerBuiltinScorers, runEval } from 'gecx-chat/eval';

const registry = registerBuiltinScorers(createScorerRegistry());
registry.register(allCapsTitle);
await runEval({ dir: './scenarios', registry });
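
Scenarios can then reference the custom scorer by its id, exactly like a built-in one. A minimal sketch (the scenario id, name, and minLength value are illustrative):

// apps/showcase/scenarios/shouty-title.scenario.ts (hypothetical)
export default {
  id: 'shouty-title',
  name: 'Title card is rendered in all caps',
  when: [{ user: 'hello' }],
  then: [{ scorer: 'all-caps-title', args: { minLength: 3 } }],
};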

LLM-judge providers

  • Anthropic: env var ANTHROPIC_API_KEY, default model claude-sonnet-4-6
  • OpenAI: env var OPENAI_API_KEY, default model gpt-4o-mini
  • Gemini: env var GEMINI_API_KEY, default model gemini-2.5-flash

Each adapter posts directly to the provider's HTTP API — no SDK dependency. If a key isn't set, scorers needing that provider return status: 'skipped' (the scenario itself does not fail). Override the model per provider in the eval config:

// eval.config.json
{
  "providers": {
    "anthropic": { "model": "claude-sonnet-4-6" }
  }
}

Then run pnpm gecx eval ./scenarios --config eval.config.json.

Regression gates

--baseline <path> compares the current run to a previously committed report and applies thresholds. Defaults:

  • deflectionDropPp (default 5): trips when the deflection rate drops by more than 5 percentage points
  • p95LatencyIncreasePct (default 20): trips when p95 latency rises by more than 20%
  • ttftP95IncreasePct (default 20): trips when p95 TTFT rises by more than 20%
  • passRateDropPp (default 0): trips on any pass-rate drop
  • hallucinationRateIncreasePp (default 2): trips when the hallucination rate rises by more than 2 percentage points

Override in eval.config.json under regressionThresholds, or in a separate file passed as --regression-config <path>. New scenario failures (not present in baseline) always trip the gate regardless of metric thresholds.
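
For example, to tighten the latency and hallucination gates while keeping the other defaults (the values here are illustrative):

// eval.config.json
{
  "regressionThresholds": {
    "p95LatencyIncreasePct": 10,
    "hallucinationRateIncreasePp": 1
  }
}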

To update the baseline after intentional improvements:

pnpm gecx eval ./scenarios --update-baseline ./baseline.eval.json

CI integration

See docs/recipes/eval-github-action.md for a full GitHub Actions workflow that runs the eval on every PR, uploads the JSON report as an artifact, and gates on regressions.
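
The core of that workflow is a single gating step. A sketch is below; the secret name and the surrounding job setup are assumptions, and the recipe has the complete file:

# one step inside a GitHub Actions job that already has pnpm installed
- name: Run eval suite
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    pnpm gecx eval ./apps/showcase/scenarios \
      --baseline ./apps/showcase/baseline.eval.json \
      --fail-on-regress \
      --json --output eval-report.json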

Reference

  • CLI flag reference: docs/reference/eval-cli.md
  • JSON schemas: schemas/eval-scenario.schema.json, schemas/eval-report.schema.json, schemas/eval-config.schema.json