LangSmith

LLM application development platform — tracing, debugging, testing, evaluation, and monitoring from the LangChain team.

Overview

LangSmith is the development and operations platform built by the LangChain team for debugging, testing, evaluating, and monitoring LLM applications. While Langfuse focuses on production observability, LangSmith is tightly integrated with the LangChain/LangGraph ecosystem and emphasizes the development lifecycle: from prototyping prompts to running evaluation suites to monitoring production deployments.

LangSmith addresses the core challenge of LLM development: outputs are non-deterministic, making traditional debugging and testing approaches insufficient. LangSmith provides trace-level visibility into chain execution, automated evaluation frameworks, and dataset management for regression testing.

Architecture

┌──────────────────────────────────────────────────────┐
│                   LangSmith Platform                  │
│                                                       │
│  ┌─────────────────────────────────────────────────┐ │
│  │                 Tracing Engine                    │ │
│  │                                                   │ │
│  │  • Auto-instrumentation for LangChain/LangGraph  │ │
│  │  • Manual tracing via SDK (@traceable decorator)  │ │
│  │  • Nested spans (chain → retriever → LLM → tool) │ │
│  │  • Token usage, latency, input/output capture     │ │
│  └────────────────────────┬────────────────────────┘ │
│                           │                           │
│  ┌────────────┐  ┌────────┴────────┐  ┌───────────┐ │
│  │ Playground │  │ Evaluation      │  │ Dataset   │ │
│  │            │  │ Framework       │  │ Management│ │
│  │ • Prompt   │  │                 │  │           │ │
│  │   testing  │  │ • Custom evals  │  │ • Golden  │ │
│  │ • Model    │  │ • LLM-as-judge  │  │   datasets│ │
│  │   compare  │  │ • Heuristic     │  │ • Version │ │
│  │ • Variable │  │   evaluators    │  │   control │ │
│  │   sweeps   │  │ • CI/CD         │  │ • Sampling│ │
│  │            │  │   integration   │  │   from    │ │
│  │            │  │                 │  │   traces  │ │
│  └────────────┘  └─────────────────┘  └───────────┘ │
│                                                       │
│  ┌─────────────────────────────────────────────────┐ │
│  │              Monitoring & Alerting               │ │
│  │                                                   │ │
│  │  • Production trace monitoring                    │ │
│  │  • Feedback collection (user thumbs up/down)      │ │
│  │  • Regression detection                           │ │
│  │  • Cost and latency dashboards                    │ │
│  └─────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘

Core Components

Component	Function
Tracing	Capture every step of LLM chain execution with inputs, outputs, tokens, latency
Playground	Interactive prompt testing with model comparison and variable sweeps
Datasets	Manage golden test datasets; sample interesting traces into datasets
Evaluators	Run automated evaluations (correctness, relevance, faithfulness)
Monitoring	Production dashboards for latency, cost, error rates, and user feedback
Hub	Community prompt repository for sharing and versioning prompts

Use Cases

LLM Application Debugging

Trace the full execution of a chain to identify where failures occur:

from langsmith import traceable

@traceable
def rag_pipeline(query: str) -> str:
    """Each function call becomes a nested span in LangSmith."""
    docs = retrieve(query)
    context = format_context(docs)
    response = generate(query, context)
    return response

@traceable
def retrieve(query: str) -> list:
    # Retrieval step traced with inputs/outputs
    return vector_store.similarity_search(query, k=5)

@traceable
def generate(query: str, context: str) -> str:
    # LLM call traced with prompt, response, tokens
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")

Result: In the LangSmith UI, see the full chain execution tree with timing, inputs/outputs, and token usage at every step.

Automated Evaluation in CI/CD

Run evaluation suites against test datasets as part of CI/CD:

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Define evaluator
def correctness_evaluator(run, example):
    """Check if the output contains the expected answer."""
    predicted = run.outputs["output"]
    expected = example.outputs["expected_answer"]
    
    # Use LLM-as-judge for semantic comparison
    score = llm_judge(predicted, expected)
    return {"score": score, "key": "correctness"}

# Run evaluation against dataset
results = evaluate(
    rag_pipeline,
    data="rag-golden-dataset",  # Dataset in LangSmith
    evaluators=[correctness_evaluator],
    experiment_prefix="v2.3-prompt-update",
)

# Assert quality threshold in CI
assert results.aggregate_metrics["correctness"]["mean"] >= 0.8

Prompt Iteration and Testing

Use the LangSmith Playground to test prompt variations:

Select a trace from production
Modify the prompt in the Playground
Run against multiple models simultaneously
Compare outputs side-by-side
Save winning prompt version to Hub

Production Monitoring

Monitor deployed LLM applications:

Track latency trends, error rates, and cost over time
Collect user feedback (thumbs up/down) linked to specific traces
Sample traces with low feedback scores for debugging
Set up alerts for quality degradation or cost spikes

Pros and Cons

Pros

Deep LangChain integration — Auto-instrumentation for LangChain and LangGraph applications
Evaluation framework — Built-in support for custom evaluators, LLM-as-judge, and CI/CD integration
Dataset management — Create golden datasets from production traces for regression testing
Playground — Interactive prompt testing with model comparison
Tracing depth — Full chain execution tree with nested spans

Cons

LangChain-centric — Best experience requires LangChain/LangGraph; manual instrumentation needed for other frameworks
SaaS-primary — Self-hosted option is limited; most features require LangSmith cloud
Pricing — Free tier is limited; enterprise features require paid plan
Overlap with Langfuse — Similar capabilities; teams must choose one tracing platform
Less mature monitoring — Production monitoring features are newer compared to dedicated observability tools

Deployment Patterns

Pattern 1: Development + Evaluation (Most Common)

Use LangSmith during development for debugging and evaluation; use a different tool (Langfuse, Datadog) for production monitoring:

Development → LangSmith (tracing, playground, evals)
Production  → Langfuse / Datadog (monitoring, alerting)

Pattern 2: Full Lifecycle (LangChain-Heavy Teams)

Use LangSmith across development and production for teams fully committed to the LangChain ecosystem:

Development → LangSmith (tracing, playground, evals)
CI/CD       → LangSmith (automated evaluation pipeline)
Production  → LangSmith (monitoring, feedback, alerting)

Pattern 3: Evaluation-Only

Use LangSmith only for its evaluation framework, regardless of what tracing tool you use:

# LangSmith evaluation works with any LLM application
from langsmith.evaluation import evaluate

results = evaluate(
    my_custom_pipeline,  # Any callable
    data="my-test-dataset",
    evaluators=[my_evaluator],
)

Integration with AI Infrastructure

CI/CD Pipelines: LangSmith evaluations integrate into DevOps CI/CD for AI systems as quality gates
Observability Stack: Complements the AI observability stack — LangSmith for development, Langfuse for production
LLM Orchestration: Native integration with LangChain chains and agents
Monitoring: Tracing data feeds into LLM monitoring and tracing architecture

LangSmith vs Langfuse

Dimension	LangSmith	Langfuse
Best for	LangChain-native development + eval	Framework-agnostic production monitoring
Tracing	Auto for LangChain; manual for others	Manual instrumentation (any framework)
Evaluation	Built-in framework + CI/CD	Scores API (more manual)
Playground	Interactive prompt testing	Prompt management (versioning)
Self-hosted	Limited	Full self-hosted option
Open Source	Partially (tracing SDK)	Yes (MIT license)

For a detailed comparison, see Langfuse vs Arize Phoenix →.

Overview​

Architecture​

Core Components​

Use Cases​

LLM Application Debugging​

Automated Evaluation in CI/CD​

Prompt Iteration and Testing​

Production Monitoring​

Pros and Cons​

Pros​

Cons​

Deployment Patterns​

Pattern 1: Development + Evaluation (Most Common)​

Pattern 2: Full Lifecycle (LangChain-Heavy Teams)​

Pattern 3: Evaluation-Only​

Integration with AI Infrastructure​

LangSmith vs Langfuse​

Related​