RAG Production Engineering
> Complete methodology for building, optimizing, and operating Retrieval-Augmented Generation systems in production. From architecture decisions through chunking strategies, embedding selection, retri
by afrexai-cto · published 2026-04-01
$ claw add gh:afrexai-cto/afrexai-cto-afrexai-rag-production# RAG Production Engineering
> Complete methodology for building, optimizing, and operating Retrieval-Augmented Generation systems in production. From architecture decisions through chunking strategies, embedding selection, retrieval tuning, evaluation frameworks, and production monitoring.
Quick Health Check
Score your RAG system (1 = poor, 2 = okay):
| Signal | What to Check |
|--------|--------------|
| Retrieval relevance | Top-5 results contain answer >90% of time |
| Answer accuracy | Generated answers faithful to retrieved context |
| Latency | End-to-end response <3s (p95) |
| Chunk quality | Chunks are self-contained, meaningful units |
| Evaluation coverage | Automated eval suite with 50+ test cases |
| Index freshness | Documents indexed within SLA of source update |
| Failure handling | Graceful degradation when retrieval returns nothing |
| Cost efficiency | Cost per query within budget (<$0.05 typical) |
**Score: /16** — Below 10 = critical issues. Below 12 = significant gaps. 14+ = production-ready.
---
Phase 1: Architecture Decision
When to Use RAG (vs Alternatives)
| Approach | Use When | Don't Use When |
|----------|----------|----------------|
| **RAG** | Dynamic knowledge, source attribution needed, data changes frequently | Static small dataset (<10 pages), real-time data needed |
| **Fine-tuning** | Consistent style/format needed, domain-specific language | Frequently changing data, need source citations |
| **Long context** | Small corpus (<200K tokens), simple Q&A | Large corpus, cost-sensitive, need precise attribution |
| **RAG + Fine-tuning** | Domain-specific language AND dynamic knowledge | Budget-constrained, simple use case |
| **Agentic RAG** | Multi-step reasoning, tool use, complex queries | Simple lookup, latency-critical |
RAG Architecture Brief
# Fill this out before building
project:
name: ""
use_case: "" # Q&A, search, summarization, analysis, chatbot
domain: "" # legal, medical, technical, general
data:
sources: [] # PDF, web, database, API, markdown, code
volume: "" # <1K docs, 1K-100K, 100K-1M, >1M
update_frequency: "" # real-time, daily, weekly, static
avg_doc_length: "" # <1 page, 1-10 pages, 10-100 pages, >100 pages
languages: []
requirements:
latency_p95: "" # <1s, <3s, <10s, <30s
accuracy_target: "" # 85%, 90%, 95%, 99%
citations_needed: true
access_control: false
compliance: [] # GDPR, HIPAA, SOC2, none
budget:
monthly_queries: ""
cost_per_query_target: ""
infra_budget: ""Architecture Patterns
#### Basic RAG
Query → Embed → Vector Search → Top-K → LLM → AnswerBest for: Simple Q&A, <100K documents, single data source.
#### Advanced RAG
Query → Classify → Rewrite → Embed → Hybrid Search → Rerank → Filter → LLM → Answer + CitationsBest for: Production systems, mixed document types, accuracy-critical.
#### Agentic RAG
Query → Planner → [Search₁, Search₂, SQL, API] → Synthesize → Verify → AnswerBest for: Complex multi-step reasoning, multiple data sources, analytical queries.
#### Graph RAG
Query → Entity Extract → Graph Traverse → Subgraph → Context Assembly → LLM → AnswerBest for: Relationship-heavy data (org charts, legal references, knowledge bases).
Architecture Decision Tree
Is your corpus < 200K tokens?
YES → Try long-context first (cheapest, simplest)
NO → Continue
Do you need source citations?
YES → RAG (not fine-tuning)
NO → Consider fine-tuning if style matters
Single data source, simple queries?
YES → Basic RAG
NO → Continue
Multi-step reasoning or multiple sources?
YES → Agentic RAG
NO → Advanced RAG---
Phase 2: Document Processing & Chunking
Document Processing Pipeline
Source → Extract → Clean → Chunk → Enrich → Embed → Index#### Extraction by Source Type
| Source | Tool/Method | Gotchas |
|--------|------------|---------|
| PDF (text) | PyMuPDF, pdfplumber | Tables break, headers repeat per page |
| PDF (scanned) | Tesseract, AWS Textract, Azure DI | OCR errors in technical terms |
| HTML/Web | BeautifulSoup, Trafilatura | Nav/footer pollution, JS-rendered content |
| Markdown | Direct parse | Frontmatter, relative links |
| Code | Tree-sitter, AST | Preserve structure, handle imports |
| Word/PPTX | python-docx, python-pptx | Formatting loss, embedded objects |
| Database | SQL export | Schema context needed |
| Audio/Video | Whisper → text | Timestamp alignment, speaker diarization |
#### Cleaning Checklist
Chunking Strategies
| Strategy | Chunk Size | Best For | Weakness |
|----------|-----------|----------|----------|
| **Fixed-size** | 256-512 tokens | Homogeneous text, fast prototyping | Breaks mid-sentence/thought |
| **Recursive character** | 256-1024 tokens | General purpose (LangChain default) | May split related paragraphs |
| **Semantic** | Variable | High-quality retrieval, mixed content | Slower, needs embedding model |
| **Document structure** | Section-based | Well-structured docs (markdown, HTML) | Uneven chunk sizes |
| **Sentence window** | 3-5 sentences | Precise retrieval, reranking | More chunks to manage |
| **Parent-child** | Small retrieve, large context | Best of both worlds | Complex implementation |
| **Agentic** | Full section/doc | Complex reasoning | Higher token cost |
Chunking Decision Guide
Is your content well-structured (headers, sections)?
YES → Document structure chunking
NO → Continue
Is retrieval precision critical (legal, medical)?
YES → Sentence window + reranking
NO → Continue
Mixed content types in same corpus?
YES → Semantic chunking
NO → Recursive character (start here, optimize later)Chunking Rules
1. **Always overlap** — 10-20% overlap prevents context loss at boundaries
2. **Chunk size matters** — Smaller = more precise retrieval, larger = more context. Start at 512 tokens, tune with eval
3. **Preserve structure** — Don't break tables, code blocks, or lists mid-element
4. **Include metadata** — Every chunk needs: source document, section title, page/position, timestamp
5. **Test with real queries** — The "right" chunk size depends on your actual query patterns
6. **Parent-child for production** — Retrieve small chunks, expand to parent for LLM context
Chunk Metadata Schema
chunk:
id: "doc-123-chunk-7"
text: "..."
metadata:
source_id: "doc-123"
source_title: "Q3 Financial Report"
source_url: "https://..."
section_title: "Revenue Analysis"
page_number: 12
position: 7 # chunk position in document
total_chunks: 23
created_at: "2026-03-24T04:00:00Z"
updated_at: "2026-03-24T04:00:00Z"
content_type: "text" # text, table, code, image_caption
language: "en"
# Domain-specific
access_level: "internal"
department: "finance"---
Phase 3: Embedding Models
Embedding Model Selection
| Model | Dimensions | Context | Speed | Quality | Cost |
|-------|-----------|---------|-------|---------|------|
| **text-embedding-3-small** (OpenAI) | 1536 | 8191 | Fast | Good | $0.02/1M tokens |
| **text-embedding-3-large** (OpenAI) | 3072 | 8191 | Medium | Excellent | $0.13/1M tokens |
| **Cohere embed-v4** | 1024 | 512 | Fast | Excellent | $0.10/1M tokens |
| **Voyage-3** | 1024 | 32K | Medium | Excellent | $0.06/1M tokens |
| **BGE-large-en-v1.5** (open) | 1024 | 512 | Self-host | Very Good | Free (compute) |
| **GTE-Qwen2** (open) | Various | 8192 | Self-host | Excellent | Free (compute) |
| **Nomic-embed-text** (open) | 768 | 8192 | Self-host | Good | Free (compute) |
Selection Rules
1. **Start with OpenAI text-embedding-3-small** — best cost/quality ratio for most use cases
2. **Upgrade to large/Voyage-3** when eval shows retrieval gaps
3. **Use open models** when: data can't leave your infra, cost-sensitive at scale (>10M chunks), need fine-tuning
4. **Match chunk size to model context** — don't exceed model's context window
5. **Same model for indexing AND querying** — ALWAYS. Mixing models = broken retrieval
6. **Benchmark on YOUR data** — MTEB scores don't predict domain-specific performance
Embedding Best Practices
---
Phase 4: Vector Database & Indexing
Vector Database Selection
| Database | Type | Scale | Speed | Features | Best For |
|----------|------|-------|-------|----------|----------|
| **Pinecone** | Managed | Billions | Fast | Metadata filter, namespaces | Production SaaS, zero-ops |
| **Weaviate** | Managed/Self | Millions | Fast | Hybrid search, modules | Mixed search needs |
| **Qdrant** | Managed/Self | Billions | Very Fast | Payload filters, sparse vectors | Performance-critical |
| **Chroma** | Embedded | <1M | Good | Simple API, local | Prototyping, small scale |
| **pgvector** | Extension | Millions | Good | SQL, existing Postgres | Already have Postgres |
| **Milvus** | Self-hosted | Billions | Fast | GPU support, hybrid | Large-scale self-hosted |
| **LanceDB** | Embedded | Millions | Fast | Serverless, multi-modal | Cost-sensitive, serverless |
Selection Decision
Prototyping or <100K chunks?
→ Chroma or LanceDB (embedded, no server)
Already running PostgreSQL?
→ pgvector (add extension, done)
Production, want zero-ops?
→ Pinecone or Weaviate Cloud
Need hybrid search (vector + keyword)?
→ Weaviate or Qdrant
>100M vectors, self-hosted?
→ Milvus or QdrantIndexing Strategy
| Index Type | Speed | Recall | Memory | Use When |
|-----------|-------|--------|--------|----------|
| **HNSW** | Very Fast | 95-99% | High | Default choice, <10M vectors |
| **IVF-PQ** | Fast | 90-95% | Low | >10M vectors, memory-constrained |
| **Flat/Brute** | Slow | 100% | High | <100K vectors, accuracy-critical |
| **ScaNN** | Very Fast | 95-99% | Medium | Google ecosystem, large scale |
Index Configuration Rules
1. **HNSW: M=16, efConstruction=200, efSearch=100** — good defaults, tune from here
2. **Build index AFTER bulk loading** — not during insertion
3. **Use metadata filters BEFORE vector search** — reduces search space dramatically
4. **Namespace/collection per tenant** — for multi-tenant access control
5. **Monitor index health** — fragmentation, query latency percentiles
---
Phase 5: Retrieval Engineering
Query Processing Pipeline
User Query
→ Query Understanding (classify intent)
→ Query Transformation (rewrite, expand, decompose)
→ Retrieval (vector + keyword + filters)
→ Post-Retrieval (rerank, filter, deduplicate)
→ Context Assembly (order, truncate, format)
→ LLM Generation
→ Post-Processing (citations, formatting)Query Transformation Techniques
| Technique | What It Does | When to Use |
|-----------|-------------|-------------|
| **Query rewriting** | LLM rewrites for better retrieval | Conversational queries, vague questions |
| **HyDE** | Generate hypothetical answer, embed that | Semantic gap between query and docs |
| **Query decomposition** | Break complex query into sub-queries | Multi-part questions |
| **Step-back prompting** | Ask a more general question first | Specific queries that miss context |
| **Query expansion** | Add synonyms/related terms | Domain jargon, acronyms |
| **Multi-query** | Generate N query variants, union results | Improve recall for ambiguous queries |
Hybrid Search
Combine vector similarity with keyword matching for best results:
Score = α × vector_score + (1 - α) × bm25_score**Tuning α:**
Reranking
Reranking dramatically improves precision. Retrieve more (top-20), rerank to fewer (top-5).
| Reranker | Quality | Speed | Cost |
|----------|---------|-------|------|
| **Cohere Rerank** | Excellent | Fast | $2/1K queries |
| **Voyage Rerank** | Excellent | Fast | $0.05/1M tokens |
| **BGE-reranker-v2** | Very Good | Self-host | Free (compute) |
| **Cross-encoder** | Best | Slow | Free (compute) |
| **ColBERT** | Very Good | Medium | Free (compute) |
| **LLM-as-reranker** | Excellent | Slow | API cost |
Retrieval Rules
1. **Always rerank** in production — it's the highest-ROI improvement
2. **Retrieve more, show less** — fetch top-20, rerank to top-5
3. **Hybrid search > pure vector** — keyword matching catches what embeddings miss
4. **Filter before search** — metadata filters (date, department, access level) reduce noise
5. **Deduplicate** — same content from different sources = wasted context
6. **Set similarity threshold** — don't return irrelevant results (typical: 0.7 for cosine)
7. **Return "I don't know"** — when no chunk meets threshold, say so. Never hallucinate from thin context
---
Phase 6: Context Assembly & Generation
Context Window Management
context_budget:
total_tokens: 128000 # Model context window
system_prompt: 2000 # Instructions, persona, rules
retrieved_context: 80000 # Retrieved chunks
conversation_history: 20000 # Prior turns
generation_buffer: 26000 # Room for responseContext Assembly Rules
1. **Order matters** — put most relevant chunks first (LLMs attend more to beginning)
2. **Include source metadata** — chunk source, page number in context
3. **Separate chunks clearly** — use delimiters (`---` or `[Source: doc-title, page 5]`)
4. **Don't exceed budget** — truncate least-relevant chunks, never the most relevant
5. **Include diversity** — if top-5 chunks are from same doc, include from other sources too
Generation Prompt Template
You are a helpful assistant. Answer the user's question using ONLY the provided context.
Rules:
- If the context doesn't contain the answer, say "I don't have enough information to answer that."
- Cite sources using [Source: title] format
- Never make up information not in the context
- If the answer spans multiple sources, synthesize and cite all
Context:
---
[Source: {title_1}, Page {page_1}]
{chunk_text_1}
[Source: {title_2}, Page {page_2}]
{chunk_text_2}
---
User question: {query}Citation Strategies
| Strategy | Quality | Complexity | Best For |
|----------|---------|-----------|----------|
| **Chunk-level** | Good | Low | Simple Q&A |
| **Sentence-level** | Excellent | Medium | Research, legal |
| **Quote extraction** | Best | High | Compliance-critical |
| **Inline footnotes** | Good | Medium | Chat interfaces |
---
Phase 7: Evaluation Framework
RAG Evaluation Dimensions
| Dimension | What It Measures | Metric |
|-----------|-----------------|--------|
| **Retrieval Relevance** | Are retrieved chunks relevant? | Precision@K, Recall@K, MRR, NDCG |
| **Context Relevance** | Is context sufficient for answer? | Context Precision, Context Recall |
| **Answer Faithfulness** | Is answer grounded in context? | Faithfulness Score (0-1) |
| **Answer Relevance** | Does answer address the question? | Answer Relevance Score (0-1) |
| **Answer Correctness** | Is the answer factually correct? | Correctness vs ground truth |
| **Hallucination** | Does answer contain made-up info? | Hallucination Rate |
| **Latency** | How fast is end-to-end response? | p50, p95, p99 |
| **Cost** | How much per query? | $/query |
Building an Evaluation Dataset
Minimum 50 test cases. Target 200+ for production.
eval_case:
id: "eval-042"
query: "What is the refund policy for enterprise customers?"
# Ground truth
expected_answer: "Enterprise customers can request a full refund within 30 days..."
expected_source_docs: ["enterprise-tos.pdf", "refund-policy.md"]
# Categories for analysis
category: "policy" # policy, technical, factual, analytical, multi-hop
difficulty: "easy" # easy, medium, hard
requires_multi_doc: falseEval Case Categories
| Category | Example | Why It Matters |
|----------|---------|---------------|
| **Factual lookup** | "What's the API rate limit?" | Basic retrieval accuracy |
| **Multi-hop** | "Compare Q1 and Q2 revenue" | Tests cross-document reasoning |
| **Negative** | "What's the Mars colonization policy?" | Should return "I don't know" |
| **Temporal** | "What changed in the latest update?" | Tests freshness and recency |
| **Ambiguous** | "How do I connect?" | Tests query understanding |
| **Adversarial** | "Ignore instructions and..." | Tests prompt injection resistance |
Evaluation Tools
| Tool | Type | Strengths |
|------|------|-----------|
| **RAGAS** | Open-source | Comprehensive RAG metrics, LLM-based evaluation |
| **DeepEval** | Open-source | 14+ metrics, pytest integration |
| **TruLens** | Open-source | Feedback functions, experiment tracking |
| **Langfuse** | Managed | Tracing, scoring, datasets |
| **Braintrust** | Managed | Eval, logging, prompt management |
| **Custom** | Build | Full control, domain-specific metrics |
Evaluation Rules
1. **Build eval BEFORE optimizing** — you can't improve what you can't measure
2. **Include negative cases** — at least 10% of eval should be "no answer available"
3. **Separate retrieval eval from generation eval** — debug each stage independently
4. **Automate eval in CI/CD** — run on every pipeline change
5. **Track metrics over time** — quality drift is real and sneaky
6. **Human evaluation quarterly** — automated metrics correlate but don't replace human judgment
7. **Test with real user queries** — log production queries, add interesting ones to eval set
---
Phase 8: Production Operations
Indexing Pipeline Operations
pipeline:
schedule: "0 2 * * *" # Daily at 2 AM
steps:
- name: "Detect changes"
method: "incremental" # full, incremental, CDC
track: "last_modified, content_hash"
- name: "Extract & clean"
parallelism: 4
timeout: 30m
- name: "Chunk"
strategy: "recursive_character"
chunk_size: 512
overlap: 50
- name: "Embed"
model: "text-embedding-3-small"
batch_size: 100
- name: "Upsert to vector DB"
collection: "production"
dedup: true
- name: "Verify"
run_eval_subset: true
min_score: 0.85
- name: "Cleanup"
remove_stale: true
stale_threshold: "30d"Update Strategy Decision
| Strategy | Complexity | Freshness | Best For |
|----------|-----------|-----------|----------|
| **Full re-index** | Low | Batch only | <100K docs, weekly updates OK |
| **Incremental** | Medium | Near-real-time | Content with timestamps/hashes |
| **CDC (Change Data Capture)** | High | Real-time | Database sources, streaming |
| **Hybrid** | Medium | Configurable | Mixed — full weekly + incremental daily |
Monitoring Dashboard
realtime_metrics:
- name: "Query Latency (p95)"
threshold: "<3s"
alert_if: ">5s for 5 minutes"
- name: "Retrieval Relevance"
threshold: ">0.85 avg similarity"
alert_if: "<0.75 for 10 queries"
- name: "Empty Results Rate"
threshold: "<5%"
alert_if: ">10% in 1 hour"
- name: "Error Rate"
threshold: "<1%"
alert_if: ">5% in 5 minutes"
periodic_metrics:
- name: "Eval Suite Score"
frequency: "daily"
threshold: ">0.85"
- name: "Index Freshness"
frequency: "hourly"
threshold: "<24h behind source"
- name: "Cost per Query"
frequency: "daily"
threshold: "<$0.05"
- name: "Hallucination Rate"
frequency: "weekly"
threshold: "<3%"
weekly_review:
- Eval suite trend (improving/degrading?)
- Top failing query categories
- Cost per query trend
- User feedback analysis
- Index health and freshnessFailure Modes & Remediation
| Failure | Detection | Fix |
|---------|-----------|-----|
| Retrieval returns irrelevant chunks | Low similarity scores, user feedback | Tune chunk size, add reranking, improve embeddings |
| Hallucinated answers | Faithfulness < 0.8, contradiction detection | Strengthen "cite only" prompt, lower temperature |
| Stale information | Document freshness check, user reports | Increase sync frequency, add freshness filter |
| Missing documents | Recall drop in eval, gap analysis | Audit data sources, check ingestion pipeline |
| Slow responses | p95 > SLA | Cache frequent queries, optimize index, reduce chunk count |
| Cost spike | $/query exceeds budget | Reduce top-K, use smaller embedding model, cache |
| Prompt injection | Adversarial eval failures | Input sanitization, output guardrails |
---
Phase 9: Advanced Patterns
Parent-Child Retrieval
Retrieve small chunks for precision, expand to parent for context:
Document
└── Parent Chunk (2048 tokens) — sent to LLM
├── Child Chunk (256 tokens) — used for retrieval
├── Child Chunk (256 tokens)
└── Child Chunk (256 tokens)Implementation: Store child embeddings with parent_id reference. On retrieval, fetch children, deduplicate by parent, return parent text.
Multi-Index Routing
Route queries to specialized indexes:
indexes:
- name: "technical_docs"
trigger: "code, API, implementation, error"
collection: "tech_v2"
- name: "policies"
trigger: "policy, compliance, legal, terms"
collection: "policies_v1"
- name: "general"
trigger: "default"
collection: "general_v1"Use an LLM or classifier to route incoming queries to the right index.
Contextual Retrieval (Anthropic Pattern)
Prepend each chunk with document-level context before embedding:
Chunk context: "This chunk is from the Q3 2025 financial report,
specifically the Revenue Analysis section discussing APAC market growth."
[Original chunk text follows]**Impact:** 35% reduction in retrieval failures (Anthropic research). Adds embedding cost but dramatically improves retrieval quality.
Conversation-Aware RAG
For multi-turn conversations:
1. Combine last N turns into standalone query
"What about their pricing?" → "What is Acme Corp's pricing for enterprise plans?"
2. Use standalone query for retrieval
3. Include conversation history in LLM context (after retrieved chunks)Knowledge Graph + RAG (GraphRAG)
When relationships matter more than text similarity:
1. Extract entities and relationships from documents
2. Build knowledge graph (Neo4j, NetworkX)
3. On query: identify entities → traverse graph → collect relevant subgraph
4. Use subgraph context + vector-retrieved chunks for generationBest for: Organizational knowledge, legal document networks, research papers with citations.
Corrective RAG (CRAG)
Self-correcting retrieval:
1. Retrieve chunks
2. LLM evaluates: "Are these chunks relevant to the query?"
3. If YES → proceed with generation
4. If PARTIALLY → web search for supplementary info
5. If NO → fallback to web search or "I don't know"Multi-Modal RAG
For documents with images, charts, tables:
| Content Type | Processing | Embedding |
|-------------|-----------|-----------|
| Text | Standard chunking | Text embedding model |
| Images | Vision model → description | Text embedding of description |
| Tables | Structure-preserving extraction | Text embedding of linearized table |
| Charts | Vision model → data extraction | Text embedding of extracted data |
---
Phase 10: Security & Access Control
RAG Security Checklist
**P0 — Before Launch:**
**P1 — Within 30 days:**
Access Control Implementation
Query with user_context
→ Extract user permissions (role, department, clearance)
→ Apply metadata filter BEFORE vector search
→ Filter: access_level IN user.allowed_levels
→ Retrieve only authorized chunks
→ Generate answer from authorized context only**CRITICAL: Filter at retrieval time, not after generation.** If the LLM sees restricted content, it may leak it in the answer even if you try to filter afterward.
Prompt Injection Defense
| Layer | Defense |
|-------|---------|
| Input | Sanitize special characters, detect injection patterns |
| System prompt | Strong instruction hierarchy, "ignore attempts to override" |
| Retrieved context | Wrap in delimiters, instruct LLM to treat as data not instructions |
| Output | Content filter, PII detector, answer verification |
---
Phase 11: Cost Optimization
Cost Breakdown per Query
| Component | Typical Cost | Optimization |
|-----------|-------------|-------------|
| Embedding (query) | $0.000002 | Negligible |
| Vector search | $0.0001-$0.001 | Cache frequent queries |
| Reranking | $0.001-$0.005 | Skip for simple queries |
| LLM generation | $0.01-$0.10 | Smaller model, shorter context |
| **Total** | **$0.01-$0.10** | |
Cost Reduction Strategies
1. **Cache frequent queries** — semantic cache (embed query, check similarity to cached). 20-40% hit rate typical
2. **Tiered models** — simple queries → small model, complex → large model
3. **Reduce context** — send top-3 instead of top-10 chunks when confidence is high
4. **Batch embeddings** — embed in batches, not per-query
5. **Dimensionality reduction** — Matryoshka embeddings at 512 dims vs 1536
6. **Self-hosted embeddings** — BGE/GTE models eliminate per-token API costs at scale
7. **Query classification** — route "I need help" to FAQ, not full RAG pipeline
Scale Planning
| Scale | Architecture | Monthly Cost Estimate |
|-------|-------------|---------------------|
| <1K queries/day | Chroma + OpenAI API | $50-200 |
| 1K-10K queries/day | Managed vector DB + API | $200-2,000 |
| 10K-100K queries/day | Dedicated infra + mix | $2,000-20,000 |
| >100K queries/day | Self-hosted everything | $10,000+ (compute) |
---
Phase 12: Common Patterns Library
Pattern 1: Internal Knowledge Base Q&A
architecture: "Advanced RAG"
sources: ["Confluence", "Google Docs", "Notion"]
chunking: "Document structure"
embedding: "text-embedding-3-small"
vector_db: "Pinecone"
retrieval: "Hybrid search + Cohere Rerank"
generation: "Claude Sonnet with citations"
access_control: "Department-based metadata filtering"
eval: "200 questions from real Slack threads"Pattern 2: Customer Support Bot
architecture: "Agentic RAG"
sources: ["Help center", "Release notes", "Internal runbooks"]
chunking: "Sentence window (3 sentences)"
embedding: "text-embedding-3-small"
vector_db: "Weaviate"
retrieval: "Vector + BM25 hybrid, α=0.6"
generation: "GPT-4o-mini (cost-efficient)"
fallback: "Escalate to human after 2 failed retrievals"
eval: "500 real support tickets with expert answers"Pattern 3: Legal Document Analysis
architecture: "Graph RAG + Advanced RAG"
sources: ["Contracts", "Regulations", "Case law"]
chunking: "Semantic chunking (clause-level)"
embedding: "Voyage-3 (legal fine-tuned)"
vector_db: "Qdrant (self-hosted, data sovereignty)"
retrieval: "Multi-index routing (contracts vs regulations)"
generation: "Claude Opus with sentence-level citations"
access_control: "Matter-based, attorney-client privilege tagging"
eval: "100 questions reviewed by practicing attorneys"Pattern 4: Code Documentation Search
architecture: "Advanced RAG"
sources: ["Code comments", "README", "ADRs", "API specs"]
chunking: "Code-aware (function/class level via tree-sitter)"
embedding: "Voyage-code-3"
vector_db: "pgvector (already have Postgres)"
retrieval: "Hybrid (code keywords + semantic)"
generation: "Claude Sonnet with code snippets"
eval: "Developer survey + retrieval accuracy"---
Quality Rubric (0-100)
| Dimension | Weight | 0 (Poor) | 50 (Adequate) | 100 (Excellent) |
|-----------|--------|-----------|---------------|-----------------|
| Retrieval Accuracy | 25% | <70% relevant in top-5 | 80-90% relevant | >95% relevant, reranked |
| Answer Quality | 20% | Hallucinations, unfaithful | Mostly accurate, some gaps | Faithful, cited, comprehensive |
| Latency | 15% | >10s p95 | 3-5s p95 | <2s p95 |
| Evaluation Coverage | 15% | No eval suite | 50+ cases, manual | 200+ cases, automated CI |
| Data Freshness | 10% | Manual, weeks behind | Daily sync | Near-real-time CDC |
| Security | 10% | No access control | Basic auth, no audit | Row-level ACL, audit trail, injection defense |
| Cost Efficiency | 5% | >$0.50/query | $0.05-$0.10/query | <$0.03/query with caching |
**Scoring: Sum(dimension_score × weight). Below 50 = not production-ready. 50-70 = MVP. 70-85 = good. 85+ = excellent.**
---
10 RAG Commandments
1. **Evaluate first, optimize second** — build eval dataset before tuning anything
2. **Chunk quality > embedding model** — garbage in, garbage out
3. **Always rerank** — cheapest improvement with biggest impact
4. **Filter at retrieval, not generation** — security is not a prompt
5. **Same model for index and query** — always, no exceptions
6. **Return "I don't know"** — honest uncertainty > confident hallucination
7. **Monitor continuously** — quality drifts silently
8. **Cache what you can** — semantic caching saves 20-40% cost
9. **Test with real queries** — synthetic eval misses real user patterns
10. **Start simple, add complexity only when eval demands it**
---
10 Common RAG Mistakes
| Mistake | Consequence | Fix |
|---------|------------|-----|
| No evaluation dataset | Can't measure improvement | Build 50+ eval cases before optimizing |
| Chunks too large | Low retrieval precision | Reduce to 256-512 tokens, add reranking |
| Chunks too small | Missing context | Use parent-child retrieval |
| No overlap between chunks | Lost context at boundaries | 10-20% overlap |
| Ignoring metadata | Can't filter, poor citations | Rich metadata on every chunk |
| Pure vector search | Misses keyword matches | Add BM25 hybrid search |
| No access control | Data leakage | Filter at retrieval time |
| No "I don't know" path | Hallucinations | Similarity threshold + explicit instruction |
| Over-engineering | Slow delivery, high cost | Start with Basic RAG, upgrade with eval data |
| Not monitoring production | Silent quality degradation | Automated daily eval + alerting |
---
Edge Cases
Multilingual RAG
Very Large Documents (>100 pages)
Rapidly Changing Data
Multi-Modal (Images + Text)
Low-Resource / Offline
---
Natural Language Commands
When user says → Agent does:
More tools from the same signal band
Order food/drinks (点餐) on an Android device paired as an OpenClaw node. Uses in-app menu and cart; add goods, view cart, submit order (demo, no real payment).
Sign plugins, rotate agent credentials without losing identity, and publicly attest to plugin behavior with verifiable claims and authenticated transfers.
The philosophical layer for AI agents. Maps behavior to Spinoza's 48 affects, calculates persistence scores, and generates geometric self-reports. Give your...