RAG Strategy Lab
Compare retrieval strategies on diverse datasets: from portfolio questions to recipe searches to movie analysis
What is this?
An interactive comparison of RAG (Retrieval-Augmented Generation) strategies across three diverse knowledge bases (63 portfolio docs, 381 recipes, 4K movie reviews). Compare quality, speed, and cost trade-offs.
Educational tool: Compare Basic (vector only), Hybrid (vector + keywords), Re-ranking (LLM judgment), and No RAG baseline.
Best viewed on desktop for full side-by-side comparisons and metrics tables.
Learn about strategies below, then try live comparisons.
This reveals how RAG systems work and when each strategy excels!
My RAG Journey: Started with the AI Assistant on my homepage (Basic RAG), then built this lab to compare strategies and understand trade-offs.
Select Knowledge Base
Select Strategies to Compare (2-4 allowed)
Counterintuitive Finding (Real Data)
Basic RAG wins consistently on this knowledge base - not because it's "better engineered," but because it's *small and focused* (~63 documents about one person's career).
Production RAG wisdom says "Hybrid wins 80% of the time" - but that's for *large, diverse* knowledge bases (thousands of documents across many topics). Small, domain-focused knowledge bases naturally have cleaner semantic clusters (see the Embedding Visualization). At scale (like Google), keyword matching becomes essential to disambiguate similar concepts.
When Does Each Strategy Win?
Industry best practices - click to see side-by-side comparison
Basic RAG
🎯 Winning Here!
Best for:
- ✅ Semantic/conceptual queries: "Tell me about healthcare experience"
- ✅ Natural language questions: "How did she scale CM-SHARE?"
- ✅ Small, focused knowledge bases: Single domain, clear boundaries
Why it wins: With only ~65 documents about one person's career, semantic clusters are naturally clean and distinct. There's no ambiguity - "healthcare" always means her healthcare work, not general healthcare topics. Small scale = simple strategies work great.
Examples: NotebookLM (personal notes), focused chatbots, specialized documentation sites
When scope is contained, semantic search alone is often sufficient. Check the Embedding Visualization to see how the 65 documents form clean, separated clusters.
Hybrid RAG
Production (80%)
Best for:
- ✅ Exact keyword matches: Company names, acronyms (FHIR, CM-SHARE)
- ✅ Technical terms: "HL7", "FDA compliance"
- ✅ Large, diverse knowledge bases: Thousands of documents across many topics
Why it wins: At scale (Google, enterprise wikis), semantic similarity alone creates ambiguity. "Healthcare AI" could match hundreds of documents - keyword boost helps pinpoint the exact concept you need.
Examples: Perplexity AI (web-scale search), GitHub Copilot (millions of repos), Elastic/Algolia with vector search
Most production RAG at scale uses hybrid. Imagine 10,000+ documents: the clean clusters you see now would overlap heavily. At that scale, keyword precision becomes essential.
Re-ranking
⚡ Premium Quality
Best for:
- ✅ Subjective queries: "Best for...", "Recommend..."
- ✅ Complex context: Multi-part questions, synthesis needed
- ✅ High-stakes decisions: Worth the extra cost/latency
Why it wins: The LLM understands query intent and provides nuanced relevance judgments. Cast a wide net with retrieval, then re-rank intelligently with AI.
Examples: You.com, Perplexity Pro mode, premium search features
Used for the ~20% of queries where LLM judgment adds value beyond simple similarity. Higher cost (~3x) but the best quality for complex queries.
No RAG
Baseline Only
Best for:
- ✅ General knowledge: Questions within LLM training data
- ✅ Baseline testing: Measuring RAG value
- ✅ Demonstrating hallucination: Shows why RAG is needed
Why it wins: It doesn't. This is for comparison only.
Examples: Proof-of-concept demos, A/B testing, educational examples
Not production-ready for domain-specific knowledge. LLMs hallucinate confidently when lacking context. Always use RAG for factual, up-to-date, or specialized information.
Quick decision: Small dataset? → Basic RAG. Large dataset or technical terms? → Hybrid RAG. High-stakes queries? → Re-ranking. Proving RAG value? → Compare with No RAG.
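As a rough illustration, the same heuristic can be written as a tiny helper. This is a sketch under my own assumptions (the function name and the ~100-document threshold are not from the lab's code):

```python
def pick_strategy(doc_count: int, needs_exact_terms: bool, high_stakes: bool) -> str:
    """Illustrative rule of thumb for choosing a starting RAG strategy."""
    if high_stakes:
        return "re-ranking"   # worth the extra cost/latency for complex, subjective queries
    if doc_count > 100 or needs_exact_terms:
        return "hybrid"       # keyword precision matters at scale or for acronyms like FHIR
    return "basic"            # small, focused knowledge bases cluster cleanly


# Example: a ~65-document portfolio knowledge base with a semantic question
print(pick_strategy(doc_count=65, needs_exact_terms=False, high_stakes=False))  # -> basic
```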
🎯 Try It Yourself:
Use the sample questions below - notice how different strategies perform on the Portfolio Knowledge base!
- ✅ Beginner - Semantic queries: Basic RAG excels (small, focused domain)
- ✅ Intermediate - Keyword matching: Hybrid wins for exact terms (FHIR, HL7, company names)
- ✅ Advanced - Complex synthesis: Re-ranking excels for subjective queries across all knowledge bases
Product Evolution
Basic RAG → Hybrid Search → Re-ranking
Incremental improvements with measurable ROI
Basic RAG
Vector Search Only
Semantic similarity via embeddings. Simple, fast, and effective for most queries.
Foundation for everything
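For the curious, the core of Basic RAG is just a similarity search over precomputed embeddings. A minimal sketch (the function names are mine; the lab's actual retrieval code and embedding model aren't shown here):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Standard cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def basic_rag_retrieve(query_emb: np.ndarray, doc_embs: list[np.ndarray],
                       docs: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Rank documents purely by semantic similarity to the query embedding."""
    scored = [(cosine_similarity(query_emb, emb), doc) for emb, doc in zip(doc_embs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```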
Hybrid Search
Vector + Keyword
Combines semantic similarity with keyword matching using weighted scoring (80/20 split).
Worth it for production scale
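In code, the 80/20 split is just a weighted blend of two normalized scores. A hedged sketch (the toy keyword scorer below stands in for whatever lexical ranker, e.g. BM25, a real system would use):

```python
def keyword_overlap(query: str, doc_text: str) -> float:
    """Toy lexical score: fraction of query terms that appear in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(doc_text.lower().split())
    return len(query_terms & doc_terms) / max(len(query_terms), 1)

def hybrid_score(vector_score: float, keyword_score: float,
                 vector_weight: float = 0.8) -> float:
    """Blend semantic and keyword relevance with an 80/20 weighted sum."""
    return vector_weight * vector_score + (1 - vector_weight) * keyword_score
```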
Re-ranking
LLM-based Scoring
Claude scores each retrieved document for true semantic relevance. Cast a wide net, then re-rank intelligently.
For complex queries requiring judgment
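A hedged sketch of the pattern using the Anthropic Python SDK; the prompt wording, model alias, scoring scale, and truncation length are assumptions, not the lab's exact implementation:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score each retrieved document with Claude, then keep the highest-rated ones."""
    scored = []
    for doc in candidates:
        response = client.messages.create(
            model="claude-3-5-haiku-latest",  # assumed model alias
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    "Rate how relevant this document is to the query on a 0-10 scale. "
                    "Reply with only the number.\n\n"
                    f"Query: {query}\n\nDocument: {doc[:1000]}"
                ),
            }],
        )
        score = float(response.content[0].text.strip())
        scored.append((score, doc))
    return [doc for _, doc in sorted(scored, key=lambda p: p[0], reverse=True)[:top_k]]
```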
💡 PM Lessons Learned
Best Strategy for Personal Sites & AI Chatbots
For portfolios, personal websites, and small knowledge bases (under 100 docs): Basic RAG is usually your best choice.
Why Basic RAG wins here:
- ✅ Small knowledge bases naturally cluster well - with ~65 docs about one person's career, semantic similarity creates clean separations
- ✅ Single domain expertise - no ambiguity when all content is healthcare PM work
- ✅ Cost-effective - 70-80% query coverage at ~$0.0003 per query vs re-ranking at 10x the cost (see the cost sketch after this list)
- ✅ Fast responses - 1-2s latency keeps chatbot conversations fluid
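To make the cost gap concrete, here's a back-of-envelope comparison using the per-query figures above (the monthly query volume is an assumed example):

```python
monthly_queries = 1_000                              # assumption for illustration
basic_cost_per_query = 0.0003                        # ~$0.0003 per query (Basic RAG)
rerank_cost_per_query = basic_cost_per_query * 10    # re-ranking quoted above at ~10x

print(f"Basic RAG:  ${monthly_queries * basic_cost_per_query:.2f}/month")   # $0.30
print(f"Re-ranking: ${monthly_queries * rerank_cost_per_query:.2f}/month")  # $3.00
```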
At scale (enterprise wikis, Google Search): Hybrid and Re-ranking become essential because semantic similarity creates ambiguity across thousands of documents. But for personal sites? Basic RAG handles 70-80% of queries beautifully.
🎯 Bottom line: Start with Basic RAG for your portfolio chatbot. Only add complexity (Hybrid/Re-ranking) if you measure specific gaps.
Meta Discovery: Build to Debug
Plot twist: After implementing re-ranking, it performed WORSE than basic strategies (40% vs 44%). Testing revealed bugs: weak scoring prompts, insufficient context window (500 chars), and LLM giving "AI/ML Projects" higher scores than "Clinical Experience" for a clinical query.
The fix: Increased context to 1000 characters, improved the prompt to emphasize query intent, and added debug logging. Result? Re-ranking jumped from 40% → 92% quality (a 130% improvement!).
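For illustration only, the shape of those fixes in code might look something like this; the prompt wording, constant, and function names are my assumptions, not the lab's actual implementation:

```python
import logging

logger = logging.getLogger("reranker")

CONTEXT_CHARS = 1000  # raised from 500: over-aggressive truncation hid the relevant passage

def build_scoring_prompt(query: str, doc_text: str) -> str:
    """Emphasize query intent so the scorer rewards topical relevance, not keyword overlap."""
    return (
        "You are scoring retrieval results. Judge whether the document answers the "
        "intent of the query, not whether it shares surface-level buzzwords.\n\n"
        f"Query: {query}\n"
        f"Document excerpt: {doc_text[:CONTEXT_CHARS]}\n\n"
        "Score relevance from 0 to 10 and reply with only the number."
    )

def log_score(doc_id: str, score: float) -> None:
    """Debug logging makes mis-scored documents (e.g. 'AI/ML Projects' beating
    'Clinical Experience' on a clinical query) visible during testing."""
    logger.debug("rerank score=%.1f doc=%s", score, doc_id)
```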
PM moment: Building features is easy. Building features that actually work well requires testing, debugging, and iteration. Your demo becomes your debugger.
RAG Tutorial by LangChain
Comprehensive guide to building RAG applications with practical examples and best practices
RAG Deep Dive by Pinecone
Technical exploration of retrieval strategies, vector databases, and semantic search fundamentals
Anthropic's RAG Documentation
Official guide on implementing RAG with Claude, including prompt engineering and context optimization
Original RAG Paper (arXiv)
Foundational research paper introducing Retrieval-Augmented Generation for knowledge-intensive NLP tasks