AI Guardrails
I discovered 5 critical gaps in my own AI system. Here's what I learned about protecting AI in production.
What you'll learn
Real failures: companies that learned the hard way
5 categories: a universal framework for any AI system
My journey: gaps I found and fixed in my own AI
Assessment: find out what YOUR AI needs
Honest narrative: I shipped with gaps, found them through research, and fixed them. This page documents that journey.
These Actually Happened
Real companies. Real consequences. Each could have been prevented.
The 5 Categories
Every AI system needs protection at these 5 points. The strictness varies by risk.
Guardrails protect at every stage of the AI interaction pipeline
Input Guardrails
Protect your system from what users send IN. Filter, validate, and sanitize before the AI ever sees it.
Critical for: Any user-facing AI.
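As a minimal sketch of an input guardrail, the check below combines a length limit, an empty-input check, and a small pattern list for common injection phrasings. The function name, the character limit, and the patterns are illustrative assumptions, not the page's actual implementation; a production filter would use a maintained detector rather than a handful of regexes.

```python
import re

MAX_INPUT_CHARS = 2000  # hypothetical limit; tune per product

# Illustrative patterns only; real injection detection needs far broader coverage.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"reveal your system prompt",
]

def validate_input(text: str) -> tuple[bool, str]:
    """Return (ok, reason). Reject oversized, empty, or suspicious input."""
    if not text.strip():
        return False, "empty input"
    if len(text) > MAX_INPUT_CHARS:
        return False, "input too long"
    lowered = text.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return False, "possible prompt injection"
    return True, "ok"
```

The key design point: validation happens before the text is ever placed in a prompt, so rejected input never reaches the model.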
Output Guardrails
Validate what the AI sends OUT. Catch harmful, inaccurate, or off-brand responses before users see them.
Critical for: Customer-facing AI, legal/financial contexts, or anywhere reputation matters.
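A sketch of the output side, under assumed checks: cap response length and refuse to echo obvious sensitive data (here, a US SSN pattern) back to the user. The fallback message and thresholds are hypothetical; real output validation would layer on content-safety and relevance checks as well.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US Social Security number shape

def validate_output(response: str, max_chars: int = 4000) -> str:
    """Return the response, or a safe fallback if any check fails."""
    fallback = "Sorry, I can't share that response. Please rephrase your question."
    if not response or len(response) > max_chars:
        return fallback
    if SSN_RE.search(response):  # never echo an SSN back to the user
        return fallback
    return response
```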
Operational Guardrails
Keep your system stable and your costs controlled. Prevent runaway usage, and degrade gracefully when dependencies fail.
Critical for: Any production AI.
Agent Guardrails
Control what AI agents can DO. Limit their tools, actions, and scope to prevent autonomous chaos.
Critical for: Any AI with tool access, code execution, or ability to take real-world actions.
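One simple agent guardrail is a deny-by-default tool allowlist: the agent can only invoke tools that were explicitly approved. The tool names and registry shape below are assumptions for illustration, not a specific framework's API.

```python
# Hypothetical allowlist; everything not listed here is denied.
ALLOWED_TOOLS = {"search_docs", "summarize"}

def dispatch_tool(name: str, registry: dict, **kwargs):
    """Run a tool only if it is explicitly allow-listed; deny everything else."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{name}' is not allow-listed")
    return registry[name](**kwargs)
```

The same pattern extends to scoping: even an allowed tool can be wrapped to restrict paths, hosts, or argument ranges before execution.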
Human-in-the-Loop
Keep humans involved for high-stakes decisions. Some things shouldn't be fully automated.
Critical for: Financial decisions, healthcare, legal, hiring, or any irreversible actions.
What I Learned
I built two AI systems. One had comprehensive guardrails. The other... didn't.
AI Assistant
I used AI to help identify what guardrails were needed. 11 guardrails implemented from day one.
RAG Playground
I assumed protections would carry over. They didn't. 5 critical gaps discovered through research.
"I got caught in vibe coding and forgot the most critical thing: protecting my corner of the internet."
AI Assistant
11/11
100% coverage
RAG Playground
9/11
82% coverage
| Guardrail | AI Assistant | RAG Playground |
|---|---|---|
| **Rate Limiting**: prevents abuse by limiting requests per user over time | 10 requests per minute with Redis-backed tracking | 5 requests per minute with in-memory tracking |
| **Cost Caps**: prevents runaway API costs with tiered spending limits | 3-tier caps: $1/hour, $10/day, $100/month | 3-tier caps: $0.50/hour, $5/day, $50/month |
| **Circuit Breaker**: graceful degradation when external services fail | 3 services protected: OpenAI, Pinecone, Redis | Claude API protected with fail-fast pattern |
| **Input Validation**: validates and sanitizes user input before processing | Length limits, encoding checks, sanitization | Basic length validation only |
| **Prompt Injection Detection**: detects and blocks attempts to manipulate the AI | 4 detection types: role hijack, instruction override, context manipulation, system prompt extraction | 4 detection types: override, extraction, manipulation, jailbreak |
| **PII Detection**: detects and handles personally identifiable information | 7 PII types: email, phone, SSN, credit card, IP, zip code, date of birth | Blocks critical PII (SSN, credit cards) before processing |
| **Content Moderation**: screens input for harmful or inappropriate content | OpenAI Moderation API integration | OpenAI Moderation API integration |
| **Output Validation**: validates AI responses before sending to users | 5 checks: length, format, content safety, relevance, quality | Quality scoring with fallback for low-confidence responses |
| **Context Sufficiency Check**: ensures the AI has enough context to answer accurately | Semantic similarity threshold for RAG responses | Optional |
| **Semantic Cache**: caches similar queries to reduce costs and latency | Vector-based similarity matching for cache hits | Optional |
| **Graceful Error Handling**: user-friendly error messages that don't expose internals | Sanitized error messages with logging | Basic try-catch with generic messages |
2 Remaining Optimizations
These are performance optimizations, not security guardrails. Here's why they're deprioritized:
Context Sufficiency Check
Requires tuning similarity thresholds specific to the knowledge base. The RAG Playground uses multiple document types (resume, projects, work history) which need different thresholds. The current output validation catches low-quality responses, making this an optimization rather than a necessity.
To implement: Per-collection similarity thresholds + user feedback data to calibrate
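The per-collection threshold idea can be sketched as a lookup plus a comparison against the best retrieval score. The collection names and threshold values below are placeholders; the whole point of the deferral above is that real values would come from feedback-driven calibration.

```python
# Hypothetical per-collection similarity thresholds, pending calibration.
THRESHOLDS = {"resume": 0.80, "projects": 0.72, "work_history": 0.75}
DEFAULT_THRESHOLD = 0.75

def context_is_sufficient(collection: str, top_score: float) -> bool:
    """Answer only when the best retrieved chunk clears that collection's bar."""
    return top_score >= THRESHOLDS.get(collection, DEFAULT_THRESHOLD)
```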
Semantic Cache
At current traffic levels (~100 queries/day), the cost savings don't justify the added infrastructure complexity. Semantic caching requires vector storage for cache keys, similarity matching on every request, and cache invalidation logic. The cost caps already protect against runaway spending.
To implement: Higher traffic volume (1000+ queries/day) to justify infrastructure investment
PM lesson: Not every optimization is worth the complexity. These features would add infrastructure overhead with minimal security benefit at current scale.
All 5 gaps now fixed
The PM Lesson
Knowledge gaps are only a problem if you don't address them.
Ship → Learn → Fix → Repeat.
What Does Your AI Need?
Answer a few questions to get personalized guardrail recommendations.
How to use: Check each statement that applies to your AI system. Your recommendations update as you select.
Go Deeper
Curated resources for implementing your own guardrails.
Understand the Problem
Learn what can go wrong and why guardrails matter
Choose Your Approach
Official documentation and best practice guides
Claude Safety Documentation
Anthropic's official guide to building safe AI applications with Claude.
OpenAI Safety Best Practices
OpenAI's recommendations for building responsible AI applications.
CSA AI Safety Guidelines
Enterprise-focused AI safety guidelines from the Cloud Security Alliance.
AWS Responsible AI
AWS's framework for building responsible AI systems at scale.
Implement
Frameworks, tools, and APIs to build your guardrails
OpenAI Moderation API
Free API for detecting harmful content. Easy to integrate, highly accurate.
NVIDIA NeMo Guardrails
Open-source toolkit for adding guardrails to LLM applications. Production-ready.
Guardrails AI
Framework for validating LLM outputs with structured validators.
LangChain Safety
Safety features built into the LangChain framework for production LLM apps.
LLM Guard
Open-source toolkit for LLM security scanning and protection.
Rebuff - Prompt Injection Detector
Self-hardening prompt injection detector that learns from attacks.
Microsoft Presidio
Context-aware PII detection and anonymization. Supports 50+ entity types with customizable recognizers.
Llama Guard
Meta's LLM-based safety classifier for detecting unsafe content in conversations.
Datadog LLM Observability Guide
Comprehensive guide to monitoring and observing LLM applications in production.
Go Deeper
Advanced research and specialized topics
These resources are curated from my own learning journey. Start with Step 1 and work your way through.