Design a production-grade Retrieval-Augmented Generation (RAG) system that lets employees ask natural-language questions across a constantly updating corpus of PDF, Word, HTML, and internal wiki pages, and receive concise, trustworthy answers with inline citations.

The service must handle 5,000 QPS with sub-second (P99 < 1 s) end-to-end latency, index incremental document updates within 5 minutes of publication, guarantee that answers are generated only from documents the user is authorized to read, and never exceed the hosted LLM's 128k-token context limit.

Outline the full data flow from raw files to streamed response, the embedding and re-ranking strategy, the hybrid retrieval stack, the prompt template that enforces grounded generation, the access-control enforcement point, and the evaluation pipeline you would run before every release. Be prepared to justify your chunk size, top-k, choice of re-ranker, and fallback behavior when no relevant documents are retrieved.
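To make the moving parts concrete, here is a minimal, illustrative sketch of three of the components the prompt asks about: fusing lexical and vector rankings (shown here with Reciprocal Rank Fusion, one common choice), filtering by access control before generation, and assembling a grounded prompt with a no-answer fallback. All names (`rrf_fuse`, `build_prompt`, the 4-characters-per-token budget estimate) and the choice of RRF are assumptions for illustration, not a prescribed implementation.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str
    acl: frozenset  # groups permitted to read this document

def rrf_fuse(bm25_ranking, vector_ranking, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def acl_filter(doc_ids, corpus, user_groups):
    """Enforce access control on retrieved candidates, before any text
    reaches the LLM -- never by post-filtering the generated answer."""
    return [d for d in doc_ids if corpus[d].acl & user_groups]

def build_prompt(question, doc_ids, corpus, max_context_tokens=128_000):
    """Grounded template with inline citation markers and a fallback
    when retrieval returns nothing the user may read."""
    if not doc_ids:
        return ('No relevant documents were found. Reply exactly: '
                '"I could not find this in the approved corpus."')
    # Rough budget: ~4 characters per token (a coarse assumption; a real
    # system would use the model's tokenizer).
    budget = max_context_tokens * 4
    chunks, used = [], 0
    for d in doc_ids:
        text = corpus[d].text
        if used + len(text) > budget:
            break  # stay under the context limit rather than truncating mid-doc
        chunks.append(f"[{d}] {text}")
        used += len(text)
    context = "\n".join(chunks)
    return ("Answer ONLY from the sources below; cite each claim as [doc_id]. "
            "If the sources do not contain the answer, say so.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}")
```

A realistic system would replace the in-memory `corpus` with a vector store plus a lexical index, and insert a cross-encoder re-ranking stage between fusion and prompt assembly; the sketch only fixes the order of operations (retrieve, fuse, authorize, budget, generate) that the requirements imply.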