Category: System Design | Difficulty: unknown | First seen: 2026-03-13
$ cat problem.md
At OpenAI, a system design interview for an AI Chatbot often begins with an intentionally ambiguous problem statement to see if you can proactively gather requirements and handle extreme scale.
Sample Problem Statement
"Design an end-to-end AI Chatbot system capable of providing intelligent, real-time responses to millions of concurrent users. The system must maintain context across long-running conversations, handle diverse inputs like text and document uploads (PDFs/CSVs), and ensure content safety while minimizing latency."
Key Requirements & Constraints to Address
During the interview, you are expected to clarify both functional and non-functional requirements:
Functional Requirements:
Prompt/Response: User sends a text prompt; AI provides an intelligent response.
Context Management: Persist chat history to maintain conversation threads.
External Knowledge: Ability to search the web or private documents for up-to-date info (RAG).
Safety: Implement checks to filter toxic content.
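The functional requirements above imply a simple data model: each conversation is an ordered list of turns that must be persisted and replayed as context. A minimal sketch of that shape (the class and field names here are illustrative assumptions, not part of any stated schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str

@dataclass
class Conversation:
    """One conversation thread; messages are replayed as LLM context."""
    conversation_id: str
    messages: List[Message] = field(default_factory=list)

    def add_turn(self, prompt: str, response: str) -> None:
        # Append the user prompt and the model's reply as an ordered pair,
        # so later requests can include the full history.
        self.messages.append(Message("user", prompt))
        self.messages.append(Message("assistant", response))

conv = Conversation("c1")
conv.add_turn("Hi", "Hello! How can I help?")
assert len(conv.messages) == 2
```

In a real system this record would live behind the memory layer discussed below, not in process memory.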
Non-Functional Requirements:
Scale: Support ~1,000 concurrent requests or 10M+ daily active users.
Latency: Streaming responses in real-time, with full responses ideally under 3 seconds at P95.
Availability: High uptime (99.9%+) to ensure constant access.
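Interviewers typically expect a quick back-of-envelope check that the scale numbers are self-consistent. A sketch, assuming roughly 10 messages per user per day and a 5x peak-to-average ratio (both assumptions, not given in the problem statement):

```python
# Rough capacity estimate from the stated 10M+ daily active users.
dau = 10_000_000            # from the non-functional requirements
msgs_per_user_per_day = 10  # assumption
seconds_per_day = 86_400

avg_qps = dau * msgs_per_user_per_day / seconds_per_day
peak_qps = avg_qps * 5      # assumed 5x peak-to-average ratio

print(round(avg_qps))   # ~1157 requests/second on average
print(round(peak_qps))  # ~5787 requests/second at peak
```

Note that the average works out near the ~1,000 concurrent requests figure above, which is why both numbers can appear in the same requirement.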
Expected Components for a High-Level Design
Frontend/API Gateway: Handles WebSocket/Server-Sent Events (SSE) for streaming tokens back to the user.
Orchestration Layer: Coordinates between the LLM, memory, and external tools.
Conversational Memory: Uses a caching layer (e.g., Redis) or database to store and retrieve recent message context.
RAG Pipeline: Includes a Vector Database (e.g., Pinecone, Weaviate) to store and search document embeddings for factual grounding.
Safety Guardrails: A dedicated service to scan prompts and generated text for policy violations.
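The RAG component above boils down to nearest-neighbor search over embeddings. A brute-force, in-memory sketch of what a vector store like Pinecone or Weaviate does under the hood (the `TinyVectorStore` class and document IDs are hypothetical; production systems use approximate-nearest-neighbor indexes, not a linear scan):

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class TinyVectorStore:
    """Brute-force stand-in for a vector database."""
    def __init__(self) -> None:
        self._docs: Dict[str, List[float]] = {}

    def upsert(self, doc_id: str, embedding: List[float]) -> None:
        self._docs[doc_id] = embedding

    def query(self, embedding: List[float], top_k: int = 3) -> List[Tuple[str, float]]:
        # Score every stored document and return the top_k best matches.
        scored = [(d, cosine(embedding, v)) for d, v in self._docs.items()]
        return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]

store = TinyVectorStore()
store.upsert("pricing.pdf", [1.0, 0.0])
store.upsert("faq.md", [0.0, 1.0])
hits = store.query([0.9, 0.1], top_k=1)
assert hits[0][0] == "pricing.pdf"
```

The retrieved documents are then injected into the LLM prompt by the orchestration layer for factual grounding.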
Common "Deep Dive" Scenarios
OpenAI interviewers often "push" your design after the initial high-level draft:
100x Scale: What happens if the user base jumps from 10 million to 1 billion? What breaks first (database writes, message fanout, or GPU availability)?
Inference Costs: How would you balance accuracy versus computational cost using a process-reward model or smaller "mini" models?
Stale Knowledge: How does the system handle a "hallucination" where the model overrides freshly retrieved API data with its outdated internal (parametric) knowledge?
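For the inference-cost question, one common answer is tiered model routing: send cheap, simple prompts to a small model and escalate the rest. A minimal sketch; the model names and the word-count threshold are illustrative assumptions, not real model identifiers:

```python
# Hypothetical cost-aware router: short prompts with no tool use go to a
# cheap "mini" model; long or tool-using prompts escalate to the large one.
def route_model(prompt: str, needs_tools: bool) -> str:
    if needs_tools or len(prompt.split()) > 200:
        return "large-model"
    return "mini-model"

assert route_model("What is 2+2?", needs_tools=False) == "mini-model"
assert route_model("Summarize this document", needs_tools=True) == "large-model"
```

In practice the routing signal might come from a learned classifier or a reward model rather than a length heuristic, but the cost/accuracy trade-off is the same.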
Would you like a detailed breakdown of the database schema or the specific caching strategies used to optimize LLM inference?