# GitHub Actions: gate PRs on eval regressions
Run `gecx eval` on every PR, upload the JSON report as an artifact, and block merges when a scenario fails or a regression threshold trips.
## Workflow
Copy this into `.github/workflows/evaluation.yml`:
```yaml
name: evaluation

on:
  pull_request:
    branches: [main]
  workflow_dispatch:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm build:packages
      - name: Run scenarios
        run: |
          pnpm gecx eval ./apps/showcase/scenarios \
            --baseline ./apps/showcase/baseline.eval.json \
            --fail-on-regress \
            --json --output eval-report.json
        env:
          # Judge keys are optional. Scorers needing a missing provider are skipped.
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json
```
## Gating LLM-judge scorers
LLM-judge scorers cost real API calls. Tag scenarios that use them with `llm-judge` and gate the workflow on a repo var:
```yaml
- name: Run scenarios (LLM judges only when enabled)
  if: ${{ vars.EVAL_LLM_JUDGE_ENABLED == 'true' || github.event_name == 'workflow_dispatch' }}
  run: pnpm gecx eval ./apps/showcase/scenarios --baseline ./apps/showcase/baseline.eval.json --fail-on-regress
- name: Run scenarios (deterministic only)
  if: ${{ vars.EVAL_LLM_JUDGE_ENABLED != 'true' && github.event_name == 'pull_request' }}
  run: pnpm gecx eval ./apps/showcase/scenarios --baseline ./apps/showcase/baseline.eval.json --fail-on-regress
```
(Deterministic scorers always run; the LLM-judge ones simply return `status: 'skipped'` when no key is set.)
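That skip behavior can be pictured as a thin wrapper around a scorer; a sketch under the assumption that a scorer is an async function returning `{ status, ... }` and that provider keys arrive via env vars (the `requireProvider` helper is hypothetical, not part of gecx):

```javascript
// Hypothetical wrapper: run an LLM-judge scorer only when its provider
// key is present; otherwise report 'skipped' instead of erroring.
function requireProvider(envVar, scorer) {
  return async (transcript) => {
    if (!process.env[envVar]) {
      return { status: 'skipped', reason: `${envVar} not set` };
    }
    return scorer(transcript);
  };
}
```

With this shape, a judge-backed scorer wrapped as `requireProvider('OPENAI_API_KEY', judge)` degrades to a skip in forks and PRs where the secret is unavailable, rather than failing the whole run.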
## Posting a PR comment summary
Append after the upload step:
```yaml
- name: Comment summary on PR
  if: github.event_name == 'pull_request'
  uses: actions/github-script@v7
  with:
    script: |
      const fs = require('fs');
      const report = JSON.parse(fs.readFileSync('eval-report.json', 'utf8'));
      const m = report.metrics;
      const body = [
        '### GECX eval results',
        '',
        `- Pass rate: ${(m.passRate * 100).toFixed(1)}% (${m.passedScenarios}/${m.totalScenarios})`,
        `- Deflection: ${(m.deflectionRate * 100).toFixed(1)}%`,
        `- Latency p95: ${Math.round(m.latencyP95Ms)}ms`,
        `- TTFT p95: ${Math.round(m.ttftP95Ms)}ms`,
        `- Hallucination rate: ${(m.hallucinationRate * 100).toFixed(1)}%`,
        `- Tool-call accuracy: ${(m.toolCallAccuracy * 100).toFixed(1)}%`,
      ].join('\n');
      await github.rest.issues.createComment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
        body,
      });
```
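As written, every push adds a fresh comment. To update a single comment in place instead, embed a marker in the body and look for it before creating; a sketch of the matching helper (the marker string is an assumption):

```javascript
// Hypothetical marker embedded in the comment body so later runs can find it.
const MARKER = '<!-- gecx-eval-report -->';

// Given the PR's comments, return the id of a previous eval comment, or null.
function findExistingComment(comments, marker = MARKER) {
  const hit = comments.find(
    (c) => typeof c.body === 'string' && c.body.includes(marker)
  );
  return hit ? hit.id : null;
}
```

In the github-script step, fetch comments with `github.rest.issues.listComments`, then call `github.rest.issues.updateComment` when `findExistingComment` returns an id, falling back to `createComment` otherwise.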
## Updating the baseline
When intentional improvements ship and you want to lock them in:
```bash
pnpm gecx eval ./apps/showcase/scenarios --update-baseline ./apps/showcase/baseline.eval.json
git add apps/showcase/baseline.eval.json
git commit -m "chore(eval): refresh baseline"
```
The committed baseline becomes the new floor for future PRs.
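What `--fail-on-regress` checks against that floor can be pictured as a per-metric comparison; a hedged sketch (the tolerance and the improvement direction assigned to each metric are assumptions, not the tool's documented rules):

```javascript
// Assumed directions: for the first group higher is better,
// for the second group lower is better.
const HIGHER_IS_BETTER = ['passRate', 'deflectionRate', 'toolCallAccuracy'];
const LOWER_IS_BETTER = ['latencyP95Ms', 'ttftP95Ms', 'hallucinationRate'];

// Return the names of metrics that moved past `tolerance` in the
// wrong direction relative to the committed baseline.
function regressions(baseline, current, tolerance = 0) {
  const out = [];
  for (const k of HIGHER_IS_BETTER) {
    if (current[k] < baseline[k] - tolerance) out.push(k);
  }
  for (const k of LOWER_IS_BETTER) {
    if (current[k] > baseline[k] + tolerance) out.push(k);
  }
  return out;
}
```

Refreshing the baseline after an intentional improvement tightens every one of these comparisons for the next PR.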
Source: `docs/recipes/eval-github-action.md`