
RAG Strategy Lab

Compare retrieval strategies on diverse datasets: from portfolio questions to recipe searches to movie analysis

Claude Haiku 4.5 · Vector Search · Hybrid (BM25) · Re-ranking

What is this?

An interactive comparison of RAG (Retrieval-Augmented Generation) strategies across three diverse knowledge bases (63 portfolio docs, 381 recipes, 4K movie reviews). Compare quality, speed, and cost trade-offs.

Educational tool: Compare Basic (vector only), Hybrid (vector + keywords), Re-ranking (LLM judgment), and No RAG baseline.

Best viewed on desktop for full side-by-side comparisons and metrics tables.

→ Learn about strategies below, then try live comparisons.

This reveals how RAG systems work and when each strategy excels!
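
At its core the lab runs the same query through each selected strategy and records the outcome. A minimal sketch of that loop in Python, with hypothetical strategy callables (`basic_rag`, `hybrid_rag`, `rerank_rag`) standing in for the real implementations:

```python
import time

def compare_strategies(query, strategies):
    """Run the same query through every strategy and record wall-clock latency."""
    results = {}
    for name, answer_fn in strategies.items():
        start = time.perf_counter()
        answer = answer_fn(query)  # each strategy: query -> answer string
        results[name] = {
            "answer": answer,
            "latency_s": round(time.perf_counter() - start, 2),
        }
    return results

# Usage (with placeholder strategy callables):
# compare_strategies("How did she scale CM-SHARE?",
#                    {"basic": basic_rag, "hybrid": hybrid_rag, "rerank": rerank_rag})
```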

My RAG Journey: Started with the AI Assistant on my homepage (Basic RAG), then built this lab to compare strategies and understand trade-offs.


Counterintuitive Finding · Real Data

Basic RAG wins consistently on this knowledge base - not because it's "better engineered," but because it's *small and focused* (~63 documents about one person's career).

Production RAG wisdom says "Hybrid wins 80% of the time" - but that's for *large, diverse* knowledge bases (thousands of documents across many topics). Small, domain-focused knowledge bases naturally have cleaner semantic clusters (see the Embedding Visualization). At scale (like Google), keyword matching becomes essential to disambiguate similar concepts.

When Does Each Strategy Win?

Industry best practices - click to see side-by-side comparison

Basic RAG

🎯 Winning Here!

Best for:

  • ✓ Semantic/conceptual queries: "Tell me about healthcare experience"
  • ✓ Natural language questions: "How did she scale CM-SHARE?"
  • ✓ Small, focused knowledge bases: Single domain, clear boundaries

Speed: 5/5 · Cost: 1/5 · Quality: 3/5

Why it wins: With only ~65 documents about one person's career, semantic clusters are naturally clean and distinct. There's no ambiguity - "healthcare" always means her healthcare work, not general healthcare topics. Small scale = simple strategies work great.

Examples: NotebookLM (personal notes), focused chatbots, specialized documentation sites

When scope is contained, semantic search alone is often sufficient. The Embedding Visualization shows how these ~65 documents form clean, separated clusters.
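
For intuition, here's a minimal sketch of vector-only retrieval: embed the query, rank documents by cosine similarity, keep the top few. The `embed()` callable and the precomputed `doc_embeddings` are assumptions standing in for whatever embedding model and store the lab actually uses.

```python
import numpy as np

def basic_retrieve(query, doc_texts, doc_embeddings, embed, top_k=3):
    """Rank documents by cosine similarity between query and document embeddings."""
    q = np.asarray(embed(query), dtype=float)        # query vector, shape (dim,)
    docs = np.asarray(doc_embeddings, dtype=float)   # document matrix, shape (n_docs, dim)
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:top_k]             # indices of the most similar docs
    return [(doc_texts[i], float(sims[i])) for i in top]
```

With a small, single-domain corpus the top hits are usually the right ones, which is why this simple pipeline wins here.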

Hybrid RAG

🏭 Production (80%)

Best for:

  • ✓ Exact keyword matches: Company names, acronyms (FHIR, CM-SHARE)
  • ✓ Technical terms: "HL7", "FDA compliance"
  • ✓ Large, diverse knowledge bases: Thousands of documents across many topics

Speed: 4/5 · Cost: 2/5 · Quality: 4/5

Why it wins: At scale (Google, enterprise wikis), semantic similarity alone creates ambiguity. "Healthcare AI" could match hundreds of documents - keyword boost helps pinpoint the exact concept you need.

Examples: Perplexity AI (web-scale search), GitHub Copilot (millions of repos), Elastic/Algolia with vector

Most production RAG at scale uses hybrid. Imagine 10,000+ documents: the clean clusters you see now would overlap heavily. At that scale, keyword precision becomes essential.
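
As a rough sketch, hybrid scoring just blends the two signals with the 80/20 weighting mentioned in the Product Evolution section below. The keyword score here is a crude term-overlap stand-in for BM25, not the lab's actual scorer:

```python
def keyword_score(query, doc_text):
    """Fraction of query terms that appear in the document (crude BM25 stand-in)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc_text.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def hybrid_score(query, doc_text, vector_score, w_vector=0.8, w_keyword=0.2):
    """Blend semantic similarity with keyword overlap so exact terms like 'FHIR' get a boost."""
    return w_vector * vector_score + w_keyword * keyword_score(query, doc_text)
```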

Re-ranking

⚡ Premium Quality

Best for:

  • ✓ Subjective queries: "Best for...", "Recommend..."
  • ✓ Complex context: Multi-part questions, synthesis needed
  • ✓ High-stakes decisions: Worth the extra cost/latency

Speed: 2/5 · Cost: 4/5 · Quality: 5/5

Why it wins: LLM understands query intent and provides nuanced relevance judgment. Cast a wide net with retrieval, then intelligently re-rank with AI.

Examples: You.com, Perplexity Pro mode, premium search features

Used ~20% of the time for queries where LLM judgment adds value beyond simple similarity. Higher cost (3x) but best quality for complex queries.
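
A minimal sketch of that over-retrieve-then-score pattern with the Anthropic Python SDK is below; the prompt wording, 0-10 scale, and model alias are illustrative assumptions rather than the lab's actual implementation:

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def rerank(query, candidates, top_k=3, model="claude-3-5-haiku-latest"):
    """Ask the LLM to score each candidate 0-10 for relevance, then keep the best top_k."""
    scored = []
    for doc in candidates:
        prompt = (
            f"Query: {query}\n\nDocument:\n{doc[:1000]}\n\n"
            "On a scale of 0-10, how relevant is this document to the query's intent? "
            "Reply with a single number."
        )
        msg = client.messages.create(
            model=model,  # placeholder model alias
            max_tokens=5,
            messages=[{"role": "user", "content": prompt}],
        )
        try:
            score = float(msg.content[0].text.strip())
        except ValueError:
            score = 0.0  # unparseable reply: rank it last
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

One LLM call per candidate is what drives the extra latency and the ~3x cost.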

No RAG

📊 Baseline Only

Best for:

  • ✓ General knowledge: Questions within LLM training data
  • ✓ Baseline testing: Measuring RAG value
  • ✓ Demonstrating hallucination: Shows why RAG is needed

Speed: 5/5 · Cost: 3/5 · Quality: 1/5

Why it wins: It doesn't. This is for comparison only.

Examples: Proof-of-concept demos, A/B testing, educational examples

Not production-ready for domain-specific knowledge. LLMs hallucinate confidently when lacking context. Always use RAG for factual, up-to-date, or specialized information.

Quick decision: Small dataset? → Basic RAG. Large dataset or technical terms? → Hybrid RAG. High-stakes queries? → Re-ranking. Proving RAG value? → Compare with No RAG.
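
That rule is small enough to write down; a toy encoding, with illustrative (not measured) thresholds:

```python
def pick_strategy(num_docs, has_exact_terms=False, high_stakes=False):
    """Toy version of the quick-decision rule above; thresholds are illustrative."""
    if high_stakes:
        return "re-ranking"      # worth the extra cost and latency
    if num_docs > 1000 or has_exact_terms:
        return "hybrid"          # keyword precision matters at scale
    return "basic"               # small, focused KBs: vector search is enough

# pick_strategy(65)                            -> "basic"
# pick_strategy(10_000, has_exact_terms=True)  -> "hybrid"
```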


🎯 Try It Yourself:

Use the sample questions below and notice how the different strategies perform on the Portfolio Knowledge base!

  • Beginner - Semantic queries: Basic RAG excels (small, focused domain)
  • Intermediate - Keyword matching: Hybrid wins for exact terms (FHIR, HL7, company names)
  • Advanced - Complex synthesis: Re-ranking excels for subjective queries across all databases

Try sample questions for Portfolio Knowledge:

  • Beginner - Semantic Queries
  • Intermediate - Keyword Matches
  • Advanced - Complex Synthesis

Product Evolution

Basic RAG → Hybrid Search → Re-ranking
Incremental improvements with measurable ROI

Phase 1 🚀

Basic RAG

Vector Search Only

Semantic similarity via embeddings. Simple, fast, and effective for most queries.

Works for 70-80% of queries
~1-2s latency
Lowest cost per query
Misses exact keyword matches
✓ Shipped

Foundation for everything

Phase 2 ⚡

Hybrid Search

Vector + Keyword

Combines semantic similarity with keyword matching using weighted scoring (80/20 split).

Catches exact terms (FHIR, HL7)
Covers 85-90% of queries
~2-3s latency (acceptable)
Marginal gains for small KB
✓ Deployed

Worth it for production scale

Phase 3 🧠

Re-ranking

LLM-based Scoring

Claude scores each retrieved document for true semantic relevance. Cast wide net, then intelligently re-rank.

LLM understands query intent
Nuanced relevance judgment
~8-12s latency (several times slower than Basic)
3x cost increase
✓ Shipped

For complex queries requiring judgment

💡 PM Lessons Learned

1. Ship the simplest thing first - Basic RAG worked for 70-80% of queries. That's good enough to validate the concept.
2. Measure before optimizing - Built this playground to quantify the improvement. Hybrid adds 10-15% coverage for ~50% latency cost.
3. Don't assume, test at scale - Re-ranking shines for ambiguous queries requiring LLM judgment, not for simple lookups on small KBs.
4. Make trade-offs visible - This playground shows cost, latency, and quality side-by-side. Data-driven decisions beat opinions.
💡 Best Strategy for Personal Sites & AI Chatbots

For portfolios, personal websites, and small knowledge bases (under 100 docs): Basic RAG is usually your best choice.

Why Basic RAG wins here:

  • ✓ Small knowledge bases naturally cluster well - With ~65 docs about one person's career, semantic similarity creates clean separations
  • ✓ Single domain expertise - No ambiguity when all content is healthcare PM work
  • ✓ Cost-effective - 70-80% query coverage at ~$0.0003 per query vs re-ranking at 10x cost
  • ✓ Fast responses - 1-2s latency keeps chatbot conversations fluid

At scale (enterprise wikis, Google Search): Hybrid and Re-ranking become essential because semantic similarity creates ambiguity across thousands of documents. But for personal sites? Basic RAG handles 70-80% of queries beautifully.

🎯 Bottom line: Start with Basic RAG for your portfolio chatbot. Only add complexity (Hybrid/Re-ranking) if you measure specific gaps.

πŸ›

Meta Discovery: Build to Debug

Plot twist: after implementing re-ranking, it performed WORSE than the basic strategies (40% vs 44%). Testing revealed bugs: weak scoring prompts, an insufficient context window (500 chars), and the LLM giving "AI/ML Projects" higher scores than "Clinical Experience" for a clinical query.

The fix: Increased the context window to 1000 characters, improved the prompt to emphasize query intent, and added debug logging. Result? Re-ranking jumped from 40% → 92% quality (a 130% improvement!).
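
For illustration, the fix amounts to something like the sketch below: a wider context slice, a prompt that stresses query intent, and per-document debug logging. The constant name, prompt wording, and logger setup are hypothetical, not copied from the lab's code.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("rerank")

CONTEXT_CHARS = 1000  # was 500: too little context led to misleading scores

def build_score_prompt(query, doc):
    """Prompt that emphasizes query intent and passes a wider document excerpt."""
    return (
        f"Query: {query}\n\n"
        f"Document excerpt:\n{doc[:CONTEXT_CHARS]}\n\n"
        "Score 0-10 for how well this document answers the intent of the query, "
        "not just keyword overlap. Reply with a single number."
    )

def log_scores(query, scored_docs):
    """Log every (score, doc) pair so bad rankings are visible during debugging."""
    for score, doc in scored_docs:
        log.debug("query=%r score=%.1f doc=%r", query, score, doc[:60])
```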

PM moment: Building features is easy. Building features that actually work well requires testing, debugging, and iteration. Your demo becomes your debugger.