Design a Retrieval-Augmented Generation (RAG) system that combines large language models with external knowledge retrieval to generate accurate, up-to-date, and contextually relevant responses. The system must retrieve relevant documents or passages from a knowledge base, then use those as context for an LLM to generate answers.
This problem is increasingly common in interviews at AI companies and tests understanding of both information retrieval and LLM integration. The system must handle challenges such as semantic search quality, context window limits, consistent retrieval relevance, and inference latency.
Real-world examples include ChatGPT with web search, customer support chatbots with company documentation, and code assistants with codebase context.
Document ingestion -- process and index large collections of documents from various sources (PDFs, web pages, databases)
Semantic search -- retrieve relevant documents based on semantic similarity to user queries, not just keyword matching
Context preparation -- select and format retrieved documents to fit within LLM context windows
Response generation -- use LLM to generate answers grounded in retrieved context
Citation tracking -- attribute generated responses to source documents for verification
Query understanding -- handle ambiguous queries, follow-up questions, and multi-turn conversations
Low latency -- end-to-end response time should be acceptable for interactive applications (typically under 5 seconds)
High retrieval quality -- the system must retrieve relevant documents consistently to avoid hallucinations
Cost efficiency -- minimize LLM inference costs while maintaining quality
Freshness -- support updating the knowledge base without full system rebuild
Accuracy -- generated responses should be factually correct and grounded in retrieved context
Based on real interview experiences, these are the areas interviewers probe most deeply:
How you convert documents and queries into vectors and retrieve relevant matches is foundational.
Embedding models to use (OpenAI embeddings, sentence transformers, domain-specific models)
Vector database selection (Pinecone, Weaviate, Milvus, FAISS) based on scale and latency requirements
Chunking strategy for long documents (fixed size, semantic splitting, overlapping windows)
Hybrid search combining dense embeddings with sparse keyword search (BM25)
Re-ranking retrieved documents using cross-encoder models for higher precision
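The overlapping-window chunking strategy above can be sketched in a few lines. This is a minimal illustration, not a production implementation: tokens are approximated by whitespace-separated words, whereas a real system would count tokens with the embedding model's tokenizer, and semantic splitting (breaking on section or sentence boundaries) is omitted.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks with overlapping windows.

    Tokens are approximated as whitespace-separated words here;
    a production system would use the embedding model's tokenizer.
    The overlap keeps context that spans a chunk boundary retrievable
    from both neighboring chunks.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

Tuning `chunk_size` and `overlap` trades retrieval granularity against index size: smaller chunks localize answers better but multiply the number of vectors to store and search.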
Interviewers want to see a complete end-to-end architecture from document ingestion to response generation.
Document processing pipeline for extracting text, generating embeddings, and indexing
Query processing flow including embedding generation and retrieval
LLM integration layer for prompt construction and inference
Caching strategies at multiple levels (embedding cache, retrieval cache, response cache)
Asynchronous processing for document ingestion to handle large-scale updates
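The query processing flow (embed the query, search the index, cache where possible) can be sketched with an in-memory toy index. The `embed` function below is a bag-of-words stand-in for a real embedding model call, and `RagIndex` is a hypothetical class name; in practice the vector store would be a dedicated system such as FAISS or Pinecone, and the query cache would avoid repeated embedding-model calls.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a real model call."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class RagIndex:
    def __init__(self):
        self.docs = []          # (doc_id, text, vector)
        self._query_cache = {}  # query text -> embedding (embedding cache)

    def add(self, doc_id, text):
        self.docs.append((doc_id, text, embed(text)))

    def retrieve(self, query, k=3):
        """Embed the query (with caching) and return the top-k documents."""
        qvec = self._query_cache.setdefault(query, embed(query))
        scored = [(cosine(qvec, vec), doc_id, text)
                  for doc_id, text, vec in self.docs]
        scored.sort(reverse=True)
        return scored[:k]
```

The same shape scales up by swapping `embed` for a model endpoint and `self.docs` for an approximate-nearest-neighbor index; ingestion (the `add` path) is typically run asynchronously in batches.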
Fitting retrieved documents into LLM context windows and constructing effective prompts is critical.
Fixed top-k selection vs. dynamic selection based on a relevance threshold
Summarization or extraction of key passages from long documents
Prompt template design with clear instructions and formatted context
Handling context overflow when retrieved documents exceed token limits
Preserving conversation history in multi-turn interactions
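Context packing and prompt construction can be sketched as a greedy fill against a token budget. This is one simple policy among many, with tokens again approximated as words (use the LLM's tokenizer in practice); the prompt template and the choice to skip overflowing chunks rather than truncate them are illustrative assumptions.

```python
def build_prompt(question, ranked_chunks, token_budget=1000):
    """Greedily pack the highest-ranked chunks into a token budget,
    then fill a simple instruction template with numbered sources
    so the model can cite them as [n]."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude token estimate
        if used + cost > token_budget:
            continue  # skip chunks that would overflow; could also truncate
        selected.append(chunk)
        used += cost
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(selected))
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

For multi-turn conversations, the same budget would also have to cover serialized history, which is why summarizing or truncating older turns becomes necessary.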
Ensuring the system retrieves the right documents and generates accurate responses requires careful evaluation.
Evaluation metrics for retrieval quality (precision@k, recall@k, MRR, NDCG)
End-to-end evaluation metrics for generated responses (factual accuracy, relevance, citation quality)
Test datasets with ground truth question-answer pairs
A/B testing framework for comparing retrieval or generation strategies
Monitoring for retrieval failures and hallucinations in production
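The retrieval metrics listed above have short, standard definitions; a minimal sketch of precision@k, recall@k, and MRR over ground-truth relevance labels:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant)

def mean_reciprocal_rank(results):
    """results: list of (retrieved_ids, relevant_id_set), one per query.
    Averages 1/rank of the first relevant hit; 0 if none is retrieved."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(results)
```

These run offline against a labeled test set; NDCG additionally weights graded relevance by rank and is worth adding when relevance is not binary.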
Interviewers push on how to make the system fast and cost-effective at scale.
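One common cost lever is the response cache mentioned earlier: if an identical (or near-identical) question was already answered, skip retrieval and LLM inference entirely. A minimal exact-match LRU sketch, with normalization limited to case and whitespace; production systems often extend this to semantic caching, matching new queries to cached ones by embedding similarity.

```python
from collections import OrderedDict

class ResponseCache:
    """Exact-match LRU cache for (query -> generated answer)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._store = OrderedDict()

    @staticmethod
    def _key(query):
        # Normalize case and whitespace so trivially different
        # phrasings of the same query share one entry.
        return " ".join(query.lower().split())

    def get(self, query):
        key = self._key(query)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None

    def put(self, query, answer):
        key = self._key(query)
        self._store[key] = answer
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```

Cache entries must be invalidated when the underlying knowledge base changes, which ties this directly to the freshness requirement above.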
Ask about the knowledge domain (general vs specialized), scale (number of documents, query volume), latency requirements, acceptable accuracy levels, and whether the system needs to support real-time updates to the knowledge base.