Skip to main content

LLM Evaluation & Testing Architecture

Overview

LLM applications cannot be tested with traditional unit tests. The outputs are non-deterministic, quality is subjective, and regressions are subtle — a model update or prompt change can degrade answer quality without triggering any error. Production LLM systems require a dedicated evaluation architecture that measures output quality continuously, catches regressions before deployment, and provides confidence that prompt changes improve rather than harm performance.

This playbook covers the infrastructure for systematic LLM evaluation — from offline benchmark suites to online production monitoring, including automated scoring, human review workflows, and CI/CD integration that gates deployments on quality metrics.

Three evaluation dimensions require different approaches: correctness (does the output answer the question accurately), safety (does the output follow policies and avoid harmful content), and quality (is the output well-structured, concise, and useful). Each dimension needs its own evaluation methodology and metrics.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│ Evaluation Triggers │
│ ┌────────────┐ ┌──────────────┐ ┌──────────────────────────┐│
│ │ CI/CD │ │ Prompt │ │ Model Update ││
│ │ Pipeline │ │ Change PR │ │ (Provider Release) ││
│ └────────────┘ └──────────────┘ └──────────────────────────┘│
└────────────────────────────┬────────────────────────────────────┘

┌────────────────────────────▼────────────────────────────────────┐
│ Dataset Management │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────────────┐ │
│ │ Golden │ │ Production │ │ Adversarial │ │
│ │ Test Sets │ │ Samples │ │ Test Cases │ │
│ │ (curated) │ │ (sampled) │ │ (edge cases) │ │
│ └──────────────┘ └──────────────┘ └───────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘

┌────────────────────────────▼────────────────────────────────────┐
│ Evaluation Pipeline │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────────────┐ │
│ │ Run LLM │ │ Score │ │ Compare │ │
│ │ Inference │ │ Outputs │ │ Against Baseline │ │
│ └──────────────┘ └──────────────┘ └───────────────────────┘ │
│ │
│ Scoring Methods: │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────────────┐ │
│ │ LLM-as-Judge │ │ Heuristic │ │ Human Review │ │
│ │ (automated) │ │ (regex/code) │ │ (annotation UI) │ │
│ └──────────────┘ └──────────────┘ └───────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘

┌────────────────────────────▼────────────────────────────────────┐
│ Quality Gate │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────────────┐ │
│ │ Pass/Fail │ │ Regression │ │ Report │ │
│ │ Decision │ │ Detection │ │ Generation │ │
│ └──────────────┘ └──────────────┘ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Evaluation Triggers initiate evaluation runs. CI/CD pipelines run evaluations on every prompt/code change. Prompt change PRs trigger comparison against the current baseline. Provider model updates (e.g., OpenAI releases a new GPT-4 version) trigger regression checks.

Dataset Management maintains curated test sets. Golden test sets contain human-verified input/expected-output pairs for core use cases. Production samples are randomly sampled from live traffic to represent real-world distribution. Adversarial test cases cover edge cases, prompt injections, and known failure modes.

Evaluation Pipeline runs inference against test sets and scores outputs. Three scoring methods: LLM-as-Judge uses a separate LLM (typically GPT-4) to evaluate quality against rubrics. Heuristic scoring applies deterministic checks (JSON validity, keyword presence, length constraints). Human review provides ground-truth annotation for high-stakes decisions.

Quality Gate makes the deploy/no-deploy decision. It compares scores against baselines, detects regressions across evaluation dimensions, and generates reports for stakeholders.

Infrastructure Components

ComponentPurposeImplementation
Test dataset storeVersioned storage for evaluation datasetsS3 + DVC, PostgreSQL, Langfuse datasets
Evaluation runnerExecute LLM inference across test setsLangSmith evaluations, Langfuse, custom scripts
LLM-as-JudgeAutomated quality scoring with LLMGPT-4 with scoring rubrics, custom evaluators
Heuristic evaluatorsDeterministic output checksPython functions (regex, JSON schema, assertions)
Human review UIAnnotation interface for human scoringLangfuse annotation, Argilla, Label Studio
Baseline storeHistorical evaluation scores for comparisonPostgreSQL, S3 + Parquet
CI/CD integrationRun evaluations in pipeline, gate deploymentsGitHub Actions, GitLab CI, custom hooks
Report generatorEvaluation summary with regressions highlightedCustom templates, Langfuse dashboards
Production monitorOnline quality scoring on sampled trafficLangfuse online scoring, custom pipeline
Safety evaluatorTest for harmful, biased, or policy-violating outputsSlashLLM red teaming, custom safety suite

Evaluation Platforms

LayerRecommendedAlternative
Evaluation frameworkLangSmith — datasets, evaluators, comparison viewsLangfuse evaluations
Production tracing with scoringLangfuse — attach scores to production tracesLangSmith monitoring
Human annotationArgilla — open-source annotation platformLabel Studio, Langfuse annotation queue
Safety testingSlashLLM — red teaming and adversarial testingGarak, custom prompt injection suite

Evaluation Methods

MethodBest ForTrade-offs
LLM-as-JudgeScalable quality scoring, style assessmentCosts tokens, judge LLM has its own biases
Heuristic/codeFormat validation, keyword checks, lengthLimited to measurable properties
Human reviewGround-truth quality, subjective dimensionsExpensive, slow, does not scale
Reference comparisonFactual accuracy (compare to known answer)Requires curated reference answers
Pairwise comparisonComparing two model/prompt versionsRequires paired outputs, LLM judge

CI/CD Integration

LayerRecommendedAlternative
PipelineGitHub Actions with evaluation stepGitLab CI, Jenkins
Quality gateCustom script reading LangSmith/Langfuse scoresDeployment webhook
AlertingSlack/PagerDuty on regression detectionEmail reports

Deployment Workflow

Phase 1 — Build Evaluation Foundation (Week 1-2)

  1. Curate initial golden test set — 50-100 input/expected-output pairs covering core use cases
  2. Implement LLM-as-Judge evaluator with typed rubrics (correctness 1-5, helpfulness 1-5, safety pass/fail)
  3. Run first baseline evaluation and record scores in version-controlled baseline store
  4. Set up LangSmith or Langfuse evaluation project

Example LLM-as-Judge Rubric:

correctness_rubric = """
Score the response's factual correctness on a scale of 1-5:
5: Completely accurate, all facts verifiable
4: Mostly accurate, minor details may be imprecise
3: Partially accurate, mix of correct and incorrect
2: Mostly inaccurate, key facts wrong
1: Completely inaccurate or fabricated

Input: {input}
Expected: {expected_output}
Actual: {actual_output}

Score (1-5):
Explanation:
"""

Phase 2 — CI/CD Integration (Week 3-4)

  1. Add evaluation step to CI/CD pipeline — triggered on prompt changes and code changes
  2. Compare new evaluation scores against stored baseline
  3. Implement quality gate — block deployment if any dimension regresses beyond threshold
  4. Generate evaluation report as PR comment showing dimension-by-dimension comparison
  5. Add adversarial test cases — prompt injection attempts, edge cases, boundary inputs

GitHub Actions Integration Example:

- name: Run LLM Evaluation
run: |
python eval/run_evaluation.py \
--dataset eval/golden_test_set.jsonl \
--baseline eval/baseline_scores.json \
--output eval/results.json

- name: Check Quality Gate
run: |
python eval/quality_gate.py \
--results eval/results.json \
--threshold-correctness 3.8 \
--threshold-safety 1.0 \
--max-regression 0.2

Phase 3 — Production Quality Monitoring (Month 2+)

  1. Sample 1-5% of production traffic for automated evaluation
  2. Apply LLM-as-Judge scoring to sampled traces in Langfuse
  3. Build quality dashboard showing daily scores across evaluation dimensions
  4. Set up alerts when production quality drops below evaluation thresholds
  5. Implement human review queue — route low-scoring production outputs to annotators
  6. Build feedback loop — add corrected production samples to golden test set

Phase 4 — Advanced Evaluation (Month 3+)

  1. Implement pairwise evaluation for A/B testing prompt versions
  2. Build RAG-specific evaluators: retrieval relevance, answer faithfulness, context utilization
  3. Add safety evaluation suite using SlashLLM red teaming capabilities
  4. Create domain-specific evaluators for specialized use cases
  5. Build evaluation leaderboard comparing model/prompt versions across all dimensions

Security Considerations

  • Evaluation data security — Golden test sets may contain sensitive data from production. Encrypt datasets at rest and restrict access to the evaluation pipeline.
  • LLM-as-Judge manipulation — If evaluation inputs are derived from user data, adversaries could craft inputs that score artificially high with the judge LLM. Use diverse judges and heuristic cross-checks.
  • Safety evaluation coverage — Ensure evaluation includes adversarial prompts, prompt injection attacks, and policy-violating inputs. Safety evaluation is not optional — it gates production deployment.
  • Evaluation cost — LLM-as-Judge evaluations consume tokens. Budget for evaluation costs (typically 5-10% of production LLM spend) and use caching for repeated evaluations.
  • Bias in evaluation — LLM judges can have systematic biases (preferring verbose responses, specific formatting). Calibrate judges with human-annotated reference scores and monitor for drift.