RAG-Based Chatbot System
System Design
Problem Statement
Design an intelligent chatbot system that uses Retrieval-Augmented Generation (RAG) to answer user queries. The system should be similar to enterprise AI assistants like Glean, which combine information retrieval with language model capabilities to provide contextually relevant responses.
This is not just a basic RAG implementation -- it is a complete chatbot system with additional complexity layers including multi-user access, permission management, data privacy, quality assurance, and performance optimization.
Key Requirements
Functional
- Query answering -- accept user questions in natural language and return accurate, contextual answers
- Document retrieval -- retrieve relevant documents/passages from a large corpus to ground LLM responses
- Source citation -- every answer must cite the specific documents/passages it drew from
- Multi-turn conversations -- maintain conversation context and history across multiple exchanges
- Document ingestion -- ingest and process documents from multiple sources (PDFs, wikis, Slack, email, etc.)
- Permission-aware search -- users only see results from documents they are authorized to access
Non-Functional
- Response quality -- answers must be factual, relevant, and minimize hallucinations
- Low latency -- end-to-end response time under a few seconds
- Scalability -- handle millions of documents and thousands of concurrent users
- Data privacy -- enforce document-level access controls throughout the pipeline
- Freshness -- newly ingested documents are searchable within minutes
What Interviewers Focus On
Based on real interview patterns for this question, interviewers expect you to cover both the ML pipeline and the system architecture, and they probe each area with deep follow-up questions.
1. Embedding Strategy and Chunking
How do you convert raw documents into searchable representations?
Hints to consider:
- Chunking strategy: split documents into passages of 256-512 tokens with overlap (e.g., 50-token sliding window)
- Why chunk size matters: too large = diluted relevance signal; too small = lost context
- Embedding model selection: trade-off between quality (e.g., OpenAI text-embedding-3-large) and cost/latency (smaller open-source models like e5-large)
- Metadata preservation: attach source document ID, section title, page number, permissions to each chunk
- Hierarchical chunking: paragraph-level chunks for retrieval, document-level summaries for context
- Re-embedding pipeline: when the embedding model changes, all documents need re-processing
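The sliding-window chunking described above can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: `chunk_tokens` and its parameters are hypothetical names, and real systems would count tokens with the embedding model's own tokenizer rather than a pre-split list.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=50):
    """Split a token list into overlapping chunks via a sliding window.

    Consecutive chunks share `overlap` tokens so that context spanning
    a chunk boundary is not lost.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final window already covers the tail
    return chunks


# Each chunk would then be stored with its metadata (illustrative shape):
def to_records(doc_id, chunks, acl):
    return [
        {"doc_id": doc_id, "chunk_ix": i, "tokens": c, "acl": acl}
        for i, c in enumerate(chunks)
    ]
```

Note how the last 50 tokens of one chunk reappear as the first 50 of the next, which is exactly the 50-token sliding window mentioned above.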
2. Vector Database and Retrieval
How do you store embeddings and efficiently retrieve relevant passages?
Hints to consider:
- Vector database options: Pinecone (managed, simple), Weaviate (hybrid search), Chroma (lightweight), pgvector (if already using PostgreSQL)
- Approximate nearest neighbor (ANN) algorithms: HNSW (high recall, more memory) vs. IVF (lower memory, tunable recall)
- Hybrid search: combine vector similarity (semantic) with keyword search (BM25) for better recall
- Reranking: retrieve top-50 with fast ANN, then rerank to top-5 with a cross-encoder model for precision
- Index sharding: partition by tenant/workspace for permission isolation and horizontal scaling
- Filtering: pre-filter by metadata (permissions, date range, source type) before vector search
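One common way to combine the vector and keyword result lists in hybrid search is Reciprocal Rank Fusion (RRF), which needs only the two ranked ID lists, not comparable scores. A minimal sketch (the function name and `k=60` default are conventional choices, not from any specific library):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc IDs.

    Each document scores 1 / (k + rank) per list it appears in;
    documents ranked highly by multiple retrievers float to the top.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Usage: fuse a semantic ranking with a BM25 ranking.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = rrf_fuse([vector_hits, keyword_hits])
```

In the full pipeline this fused list would then feed the cross-encoder reranker, which trims top-50 down to the top-5 passages actually placed in the prompt.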
3. Prompt Engineering and LLM Integration
How do you construct prompts that incorporate retrieved context and produce high-quality answers?
Hints to consider:
- Prompt structure: system instructions + retrieved passages (with source labels) + conversation history + user query
- Context window management: prioritize most relevant passages; truncate or summarize if exceeding token limits
- Source attribution: label each passage in the prompt (e.g., [Source A], [Source B]) and instruct the LLM to cite them
- Hallucination mitigation: instruct the model to say "I don't know" when retrieved context does not support an answer
- Streaming: stream LLM output token-by-token for perceived low latency (SSE or WebSocket)
- Model selection: trade-off between capability (GPT-4 class) and cost/speed (GPT-3.5 class) -- consider routing based on query complexity
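The prompt structure above can be made concrete with a small assembly function. This is an illustrative sketch (the wording of the system instruction and the `build_prompt` name are assumptions, not a prescribed template):

```python
def build_prompt(passages, history, query):
    """Assemble a RAG prompt: instructions + labeled sources + history + query."""
    labels = [chr(ord("A") + i) for i in range(len(passages))]
    sources = "\n".join(
        f"[Source {label}] {text}" for label, text in zip(labels, passages)
    )
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "You are a helpful assistant. Answer ONLY from the sources below "
        "and cite them as [Source X]. If the sources do not contain the "
        "answer, say you don't know.\n\n"
        f"{sources}\n\n{turns}\nuser: {query}\nassistant:"
    )
```

The explicit "say you don't know" instruction is the hallucination-mitigation lever from the hints: without it, models tend to answer from parametric memory when the retrieved context is insufficient.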
4. Multi-Turn Conversation Management
How do you handle follow-up questions and maintain conversation state?
Hints to consider:
- Conversation history: store prior turns and include relevant history in the prompt
- Query rewriting: rewrite follow-up queries to be self-contained (e.g., "What about their pricing?" -> "What is Acme Corp's pricing?") using the LLM
- History summarization: for long conversations, summarize earlier turns to stay within context limits
- Session management: persist conversations per user, allow resumption
- Context carryover: retrieved documents from prior turns may still be relevant -- cache and reuse
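History summarization aside, the simplest context-window discipline is to keep only the most recent turns that fit a token budget, dropping the oldest first. A minimal sketch, assuming a whitespace token count as a stand-in for a real tokenizer:

```python
def trim_history(history, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the most recent (role, text) turns that fit within max_tokens.

    Walks the history backwards so recent turns survive; older turns are
    dropped (or, in a fuller system, summarized) once the budget is spent.
    """
    kept, used = [], 0
    for role, text in reversed(history):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break
        kept.append((role, text))
        used += cost
    return list(reversed(kept))
```

A production system would pair this with query rewriting, so that even when an early turn is dropped, the rewritten self-contained query still carries the needed referents.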
5. Security and Permission-Aware Search
Enterprise RAG must enforce document access controls at every layer.
Hints to consider:
- Permission model: inherit permissions from source systems (Google Drive ACLs, Confluence spaces, Slack channels)
- Index-time tagging: store permission metadata with each chunk in the vector database
- Query-time filtering: filter vector search results by the requesting user's permissions before passing to the LLM
- Why not post-filter only: the LLM must never see unauthorized content, even in intermediate steps
- Permission sync: background job to keep permission metadata current as source system ACLs change
- Tenant isolation: in multi-tenant deployments, physically or logically separate indexes per tenant
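The query-time pre-filter can be sketched as a set intersection on the permission metadata tagged at index time. The record shape and function name here are illustrative assumptions; the key property is that unauthorized chunks are excluded before similarity search, so they can never reach the LLM:

```python
def permitted_chunks(chunks, user_groups):
    """Pre-filter chunks by ACL overlap with the requesting user's groups.

    Runs BEFORE vector search, so unauthorized content is never scored,
    never reranked, and never enters the prompt.
    """
    user_groups = set(user_groups)
    return [c for c in chunks if user_groups & set(c["acl"])]


# Usage with index-time permission tags:
index = [
    {"doc_id": "wiki-1", "acl": ["eng", "pm"], "text": "Deploy runbook"},
    {"doc_id": "hr-9", "acl": ["hr"], "text": "Salary bands"},
]
visible = permitted_chunks(index, {"eng"})
```

In a real vector database this intersection is expressed as a metadata filter pushed into the ANN query; the point of sketching it separately is the ordering guarantee (filter first), not the mechanism.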