
Building RAG Systems

Retrieval-Augmented Generation (RAG) grounds LLM responses with your own data, reducing hallucinations and enabling domain-specific knowledge.

RAG Architecture

RAG Pipeline

Query ──▶ Embedding Model ──▶ Vector Search (Top-K) ──▶ Retrieved Docs
                                                              │
Response ◀── LLM ◀── Prompt + Context ◀──────────────────────┘

Vector Database Options

Database    Type                  Best For                     Scaling
Chroma      Embedded              Prototyping, small datasets  Single node
Pinecone    Managed               Production, zero-ops         Fully managed
Weaviate    Self-hosted/Cloud     Hybrid search                Horizontal
Qdrant      Self-hosted/Cloud     Performance-critical         Horizontal
pgvector    PostgreSQL extension  Existing Postgres users      Vertical

Building a RAG Pipeline

Step 1: Document Loading and Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader

# Load documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")

Step 2: Create Embeddings and Store

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Store in vector database
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="devops_docs",
)
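
The from_documents pattern is the same across most of the stores in the comparison table, so switching backends usually only changes this step. As a minimal sketch, assuming the langchain_community Qdrant integration and an in-memory instance (requires the qdrant-client package):

from langchain_community.vectorstores import Qdrant

# Same chunks and embeddings as above; only the vector store backend changes
qdrant_store = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings,
    location=":memory:",  # use url="http://localhost:6333" for a real deployment
    collection_name="devops_docs",
)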

Step 3: Retrieval and Generation

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance: balances relevance and diversity
    search_kwargs={"k": 5, "fetch_k": 10},
)

# Custom prompt
prompt = PromptTemplate.from_template("""
You are a DevOps expert assistant. Answer the question based only
on the provided context. If the context doesn't contain the answer,
say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:
""")

# Build RAG chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

# Query
result = qa_chain.invoke({"query": "How do I set up Prometheus monitoring?"})
print(result["result"])
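
Because return_source_documents=True, the result dictionary also carries the retrieved chunks, which makes it easy to show citations next to the answer. A small illustrative snippet:

# List the source files (and a preview) behind the generated answer
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), "-", doc.page_content[:80])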

Chunking Strategies

Strategy    Best For                      Chunk Size
Fixed size  General purpose               500-1000 tokens
Recursive   Structured docs (Markdown)    500-1500 tokens
Semantic    Dense technical content       Variable
Document    Short docs (FAQs, tickets)    Full document

Overlap Guidelines

  • 100-200 tokens overlap for technical documentation
  • 50-100 tokens for conversational content
  • More overlap = better context continuity, but higher storage cost
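
Note that the sizes and overlaps above are in tokens, while the Step 1 splitter counts characters. A minimal sketch of sizing chunks by tokens instead, assuming tiktoken is installed:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Measure chunk length in tokens (cl100k_base is the tokenizer used by the
# text-embedding-3 models) rather than characters
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)
token_chunks = token_splitter.split_documents(documents)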

Evaluation Metrics

Metric                Measures                    Target
Retrieval Precision   Relevant docs retrieved     > 80%
Answer Faithfulness   Answer matches context      > 90%
Answer Relevancy      Answer addresses question   > 85%
Context Recall        Required info retrieved     > 75%
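
These metrics can be scored automatically. A minimal sketch, assuming the ragas library (0.1-style API) and a small hand-labeled evaluation set; ragas' context_precision and context_recall roughly correspond to the retrieval metrics above:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# Placeholder strings stand in for real queries, chain outputs, and references
eval_set = Dataset.from_dict({
    "question": ["How do I set up Prometheus monitoring?"],
    "answer": ["..."],        # what the RAG chain returned
    "contexts": [["..."]],    # the retrieved chunks used for that answer
    "ground_truth": ["..."],  # a human-written reference answer
})

scores = evaluate(
    eval_set,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)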

Production Considerations

  1. Hybrid search — combine vector similarity with keyword search (BM25); see the sketch after this list
  2. Re-ranking — use a cross-encoder to re-rank retrieved documents
  3. Caching — cache frequent queries and embeddings
  4. Monitoring — track retrieval quality, latency, and user feedback
  5. Incremental updates — add new documents without re-indexing everything
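
For hybrid search (item 1), a minimal sketch that blends the vector retriever from Step 3 with BM25, assuming LangChain's EnsembleRetriever and the rank_bm25 package:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword retriever over the same chunks (requires rank_bm25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Blend keyword and vector results; the weights are a starting point to tune
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, retriever],
    weights=[0.4, 0.6],
)

docs = hybrid_retriever.invoke("How do I set up Prometheus monitoring?")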

Next Steps