
AI Guardrails

I discovered 5 critical gaps in my own AI system. Here's what I learned about protecting AI in production.

OWASP LLM Top 10 · 5 Categories · Real Incidents

What you'll learn

1. Real failures – companies that learned the hard way
2. 5 categories – a universal framework for any AI system
3. My journey – gaps I found and fixed in my own AI
4. Assessment – find out what YOUR AI needs

Honest narrative: I shipped with gaps, found them through research, and fixed them. This page documents that journey.

These Actually Happened

Real companies. Real consequences. Each could have been prevented.

🔓

Samsung

2023

Source Code Leaked to ChatGPT

Samsung engineers pasted proprietary source code and internal meeting notes into ChatGPT for debugging help.

Confidential semiconductor data left Samsung's control, with the risk of being retained for model training. Samsung banned ChatGPT company-wide.

Missing: PII/Sensitive Data Detection
🚗

Chevrolet

2023

Chatbot Sold Car for $1

A dealership chatbot was manipulated through prompt injection to agree to sell a Chevy Tahoe for $1.

Viral embarrassment. Demonstrated how easily AI can be manipulated without proper boundaries.

Missing: Prompt Injection Detection
✈️

Air Canada

2024

Chatbot Made Unauthorized Promises

Air Canada's chatbot promised a bereavement discount policy that didn't exist. A customer sued and won.

A Canadian tribunal ruled Air Canada liable for its chatbot's statements. The airline paid damages plus legal costs.

Missing: Output Validation

The 5 Categories

Every AI system needs protection at these 5 points. The strictness varies by risk.

Users → INPUT → 🧠 YOUR AI → OUTPUT → Users

Guardrails protect at every stage of the AI interaction pipeline

🛡️

Input Guardrails

Protect your system from what users send IN. Filter, validate, and sanitize before the AI ever sees it.

Prompt injection detection
PII blocking/redaction
Input length limits

Critical for: Any user-facing AI.
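The input checks above can be sketched as a pre-flight gate that runs before the model ever sees the text. The patterns and length limit below are illustrative assumptions, not any system's actual rules; real deployments pair regexes with an ML classifier, since patterns alone are easy to evade.

```python
import re

# Hypothetical patterns covering common injection styles: instruction
# override, role hijack, system-prompt extraction, jailbreak personas.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"you are now\b", re.I),                  # role hijack
    re.compile(r"(reveal|show|print).{0,30}system prompt", re.I),
    re.compile(r"\bdo anything now\b|\bDAN\b"),          # jailbreak persona
]

MAX_INPUT_CHARS = 4000  # assumed input length limit

def check_input(text: str) -> tuple[bool, str]:
    """Return (allowed, reason) before the text reaches the model."""
    if len(text) > MAX_INPUT_CHARS:
        return False, "input_too_long"
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return False, "possible_prompt_injection"
    return True, "ok"
```

A blocked input would return `(False, "possible_prompt_injection")`, letting the caller reject the request with a generic message instead of forwarding it.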

🔍

Output Guardrails

Validate what the AI sends OUT. Catch harmful, inaccurate, or off-brand responses before users see them.

Content moderation
Fact-checking against sources
Brand voice validation

Critical for: Customer-facing AI, legal/financial contexts, or anywhere reputation matters.
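A minimal sketch of the output side: a post-generation check for length, off-brand phrasing, and grounding. The thresholds, banned phrases, and the `[source: …]` citation convention are all assumptions for illustration, not the checks any specific system uses.

```python
def validate_output(response: str, min_len: int = 20, max_len: int = 2000,
                    banned_phrases: tuple = ("as an ai language model",)) -> dict:
    """Run simple checks on a model response before it reaches the user."""
    issues = []
    if not (min_len <= len(response) <= max_len):
        issues.append("length")
    lowered = response.lower()
    if any(p in lowered for p in banned_phrases):
        issues.append("off_brand")
    # Grounding check (assumed convention): require a citation marker so
    # unsourced claims get flagged rather than shipped.
    if "[source:" not in lowered:
        issues.append("unsourced_claim")
    return {"ok": not issues, "issues": issues}
```

A response that fails any check can be regenerated, replaced with a fallback, or escalated to a human rather than sent as-is.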

⚙️

Operational Guardrails

Keep your system stable and your costs controlled. Prevent runaway usage and degrade gracefully when dependencies fail.

Rate limiting
Cost caps (hourly/daily/monthly)
Circuit breakers

Critical for: Any production AI.

🤖

Agent Guardrails

Control what AI agents can DO. Limit their tools, actions, and scope to prevent autonomous chaos.

Tool access controls
Action allowlists
Iteration limits

Critical for: Any AI with tool access, code execution, or ability to take real-world actions.
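An agent loop with the first and third controls above might look like this. The tool names and the `call_model` stub are hypothetical; the point is that the allowlist and iteration cap are enforced outside the model, so a manipulated model still can't exceed them.

```python
# Assumed allowlist: read-only tools, no file writes or code execution.
ALLOWED_TOOLS = {"search_docs", "summarize"}
MAX_ITERATIONS = 5  # hard cap on agent steps

def run_agent(task: str, call_model) -> str:
    """call_model(task) returns {"tool": name, "args": ...}; 'finish' ends the loop."""
    for _ in range(MAX_ITERATIONS):
        action = call_model(task)
        if action["tool"] == "finish":
            return action["args"]
        if action["tool"] not in ALLOWED_TOOLS:
            raise PermissionError(f"tool {action['tool']!r} is not allowlisted")
        task = f"{task}\n[result of {action['tool']}]"  # feed result back in
    raise RuntimeError("iteration limit reached without finishing")
```

Both failure modes surface as exceptions, so a supervising layer can log the attempt and fall back instead of letting the agent run unchecked.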

👤

Human-in-the-Loop

Keep humans involved for high-stakes decisions. Some things shouldn't be fully automated.

Approval workflows
Escalation triggers
Confidence thresholds

Critical for: Financial decisions, healthcare, legal, hiring, or any irreversible actions.
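A confidence-gated router ties these three mechanisms together. The action names and the 0.85 threshold are illustrative assumptions; the invariant is that high-stakes actions always escalate, regardless of how confident the model claims to be.

```python
# Assumed set of actions that must never be fully automated.
HIGH_STAKES_ACTIONS = {"refund", "account_deletion", "contract_change"}

def route(action: str, confidence: float, threshold: float = 0.85) -> str:
    """Return 'auto' or 'human_review' for a proposed AI action."""
    if action in HIGH_STAKES_ACTIONS:
        return "human_review"      # approval workflow, no exceptions
    if confidence < threshold:
        return "human_review"      # low confidence triggers escalation
    return "auto"
```

Anything routed to `human_review` lands in an approval queue; only routine, high-confidence actions execute unattended.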

What I Learned

I built two AI systems. One had comprehensive guardrails. The other... didn't.

AI Assistant

I used AI to help identify what guardrails were needed. 11 guardrails implemented from day one.

RAG Playground

I assumed protections would carry over. They didn't. 5 critical gaps discovered through research.

“I got caught up in vibe coding and forgot the most critical thing: protecting my corner of the internet.”

AI Assistant

11/11

100% coverage

RAG Playground

9/11

82% coverage

| Guardrail | AI Assistant | RAG Playground |
| --- | --- | --- |
| Rate Limiting: prevents abuse by limiting requests per user over time | 10 requests/minute with Redis-backed tracking | 5 requests/minute with in-memory tracking |
| Cost Caps: prevents runaway API costs with tiered spending limits | 3-tier caps: $1/hour, $10/day, $100/month | 3-tier caps: $0.50/hour, $5/day, $50/month |
| Circuit Breaker: graceful degradation when external services fail | 3 services protected: OpenAI, Pinecone, Redis | Claude API protected with fail-fast pattern |
| Input Validation: validates and sanitizes user input before processing | Length limits, encoding checks, sanitization | Basic length validation only |
| Prompt Injection Detection: detects and blocks attempts to manipulate the AI | 4 detection types: role hijack, instruction override, context manipulation, system prompt extraction | 4 detection types: override, extraction, manipulation, jailbreak |
| PII Detection: detects and handles personally identifiable information | 7 PII types: email, phone, SSN, credit card, IP, zip code, date of birth | Blocks critical PII (SSN, credit cards) before processing |
| Content Moderation: screens input for harmful or inappropriate content | OpenAI Moderation API integration | OpenAI Moderation API integration |
| Output Validation: validates AI responses before sending to users | 5 checks: length, format, content safety, relevance, quality | Quality scoring with fallback for low-confidence responses |
| Context Sufficiency Check: ensures the AI has enough context to answer accurately | Semantic similarity threshold for RAG responses | Optional |
| Semantic Cache: caches similar queries to reduce costs and latency | Vector-based similarity matching for cache hits | Optional |
| Graceful Error Handling: user-friendly error messages that don't expose internals | Sanitized error messages with logging | Basic try-catch with generic messages |
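To make the PII detection entry concrete, here is a minimal regex-based scanner with a block-versus-redact split. The patterns are deliberately simplified assumptions (the SSN and credit-card forms will miss edge cases); production systems use dedicated PII libraries with far broader format coverage.

```python
import re

# Illustrative patterns for three of the PII types listed above.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
CRITICAL = {"ssn", "credit_card"}  # blocked outright; others may be redacted

def scan_pii(text: str) -> dict:
    """Report which PII types appear and whether the request must be blocked."""
    found = {name for name, pat in PII_PATTERNS.items() if pat.search(text)}
    return {"found": sorted(found), "block": bool(found & CRITICAL)}
```

Non-critical matches (like an email address) can be masked before the text is sent onward, while critical matches stop the request entirely.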

2 Remaining Optimizations

These are performance optimizations, not security guardrails. Here's why they're deprioritized:

Context Sufficiency Check

Requires tuning similarity thresholds specific to the knowledge base. The RAG Playground uses multiple document types (resume, projects, work history) which need different thresholds. The current output validation catches low-quality responses, making this an optimization rather than a necessity.

To implement: Per-collection similarity thresholds + user feedback data to calibrate

Semantic Cache

At current traffic levels (~100 queries/day), the cost savings don't justify the added infrastructure complexity. Semantic caching requires vector storage for cache keys, similarity matching on every request, and cache invalidation logic. The cost caps already protect against runaway spending.

To implement: Higher traffic volume (1000+ queries/day) to justify infrastructure investment

PM lesson: Not every optimization is worth the complexity. These features would add infrastructure overhead with minimal security benefit at current scale.

All 5 gaps now fixed

Content Moderation (OpenAI API)
Prompt Injection Detection
PII Detection & Blocking
Circuit Breaker Pattern
Output Quality Validation
💡

The PM Lesson

Knowledge gaps are only a problem if you don't address them.
Ship → Learn → Fix → Repeat.

What Does Your AI Need?

Answer a few questions to get personalized guardrail recommendations.

How to use: Check each statement that applies to your AI system. Your recommendations update as you select.

Go Deeper

Curated resources for implementing your own guardrails.


These resources are curated from my own learning journey. Start with Step 1 and work your way through.