RAG-Based Chatbot System
System Design
Problem Statement
Design an intelligent chatbot system that uses Retrieval-Augmented Generation (RAG) to answer user queries. The system should be similar to enterprise AI assistants like Glean, which combine information retrieval with language model capabilities to provide contextually relevant responses.
This is not just a basic RAG implementation -- it is a complete chatbot system with additional complexity layers including multi-user access, permission management, data privacy, quality assurance, and performance optimization.
Key Requirements
Functional
- Query answering -- accept user questions in natural language and return accurate, contextual answers
- Document retrieval -- retrieve relevant documents/passages from a large corpus to ground LLM responses
- Source citation -- every answer must cite the specific documents/passages it drew from
- Multi-turn conversations -- maintain conversation context and history across multiple exchanges
- Document ingestion -- ingest and process documents from multiple sources (PDFs, wikis, Slack, email, etc.)
- Permission-aware search -- users only see results from documents they are authorized to access
Non-Functional
- Response quality -- answers must be factual, relevant, and minimize hallucinations
- Low latency -- end-to-end response time under a few seconds
- Scalability -- handle millions of documents and thousands of concurrent users
- Data privacy -- enforce document-level access controls throughout the pipeline
- Freshness -- newly ingested documents are searchable within minutes
What Interviewers Focus On
Based on real interview patterns for this question, interviewers expect you to cover both the ML pipeline and the system architecture, and they probe each area with deep follow-up questions.
1. Embedding Strategy and Chunking
How do you convert raw documents into searchable representations?
Hints to consider:
- Chunking strategy: split documents into passages of 256-512 tokens with overlap (e.g., 50-token sliding window)
- Why chunk size matters: too large = diluted relevance signal; too small = lost context
- Embedding model selection: trade-off between quality (e.g., OpenAI text-embedding-3-large) and cost/latency (smaller open-source models like e5-large)
- Metadata preservation: attach source document ID, section title, page number, permissions to each chunk
- Hierarchical chunking: paragraph-level chunks for retrieval, document-level summaries for context
- Re-embedding pipeline: when the embedding model changes, all documents need re-processing
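The sliding-window chunking described above can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: `chunk_tokens` and its parameters are hypothetical names, and real systems would count tokens with the embedding model's own tokenizer rather than a pre-split list.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=50):
    """Split a token list into overlapping chunks via a sliding window.

    Consecutive chunks share `overlap` tokens so that context spanning
    a chunk boundary is not lost.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final window already covers the tail
    return chunks


# Each chunk would then be stored with its metadata (illustrative shape):
def to_records(doc_id, chunks, acl):
    return [
        {"doc_id": doc_id, "chunk_ix": i, "tokens": c, "acl": acl}
        for i, c in enumerate(chunks)
    ]
```

Note how the last 50 tokens of one chunk reappear as the first 50 of the next, which is exactly the 50-token sliding window mentioned above.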
2. Vector Database and Retrieval
How do you store embeddings and efficiently retrieve relevant passages?
Hints to consider:
- Vector database options: Pinecone (managed, simple), Weaviate (hybrid search), Chroma (lightweight), pgvector (if already using PostgreSQL)
- Approximate nearest neighbor (ANN) algorithms: HNSW (high recall, more memory) vs. IVF (lower memory, tunable recall)
- Hybrid search: combine vector similarity (semantic) with keyword search (BM25) for better recall
- Reranking: retrieve top-50 with fast ANN, then rerank to top-5 with a cross-encoder model for precision
- Index sharding: partition by tenant/workspace for permission isolation and horizontal scaling
- Filtering: pre-filter by metadata (permissions, date range, source type) before vector search
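One common way to combine the vector and keyword result lists in hybrid search is Reciprocal Rank Fusion (RRF), which needs only the two ranked ID lists, not comparable scores. A minimal sketch (the function name and `k=60` default are conventional choices, not from any specific library):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc IDs.

    Each document scores 1 / (k + rank) per list it appears in;
    documents ranked highly by multiple retrievers float to the top.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Usage: fuse a semantic ranking with a BM25 ranking.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = rrf_fuse([vector_hits, keyword_hits])
```

In the full pipeline this fused list would then feed the cross-encoder reranker, which trims top-50 down to the top-5 passages actually placed in the prompt.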
3. Prompt Engineering and LLM Integration
How do you construct prompts that incorporate retrieved context and produce high-quality answers?
Hints to consider:
- Prompt structure: system instructions + retrieved passages (with source labels) + conversation history + user query
- Context window management: prioritize most relevant passages; truncate or summarize if exceeding token limits
- Source attribution: label each passage in the prompt (e.g., [Source A], [Source B]) and instruct the LLM to cite them
- Hallucination mitigation: instruct the model to say "I don't know" when retrieved context does not support an answer
- Streaming: stream LLM output token-by-token for perceived low latency (SSE or WebSocket)
- Model selection: trade-off between capability (GPT-4 class) and cost/speed (GPT-3.5 class) -- consider routing based on query complexity
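The prompt structure above can be made concrete with a small assembly function. This is an illustrative sketch (the wording of the system instruction and the `build_prompt` name are assumptions, not a prescribed template):

```python
def build_prompt(passages, history, query):
    """Assemble a RAG prompt: instructions + labeled sources + history + query."""
    labels = [chr(ord("A") + i) for i in range(len(passages))]
    sources = "\n".join(
        f"[Source {label}] {text}" for label, text in zip(labels, passages)
    )
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "You are a helpful assistant. Answer ONLY from the sources below "
        "and cite them as [Source X]. If the sources do not contain the "
        "answer, say you don't know.\n\n"
        f"{sources}\n\n{turns}\nuser: {query}\nassistant:"
    )
```

The explicit "say you don't know" instruction is the hallucination-mitigation lever from the hints: without it, models tend to answer from parametric memory when the retrieved context is insufficient.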
4. Multi-Turn Conversation Management
How do you handle follow-up questions and maintain conversation state?
Hints to consider:
- Conversation history: store prior turns and include relevant history in the prompt
- Query rewriting: rewrite follow-up queries to be self-contained (e.g., "What about their pricing?" -> "What is Acme Corp's pricing?") using the LLM
- History summarization: for long conversations, summarize earlier turns to stay within context limits
- Session management: persist conversations per user, allow resumption
- Context carryover: retrieved documents from prior turns may still be relevant -- cache and reuse
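History summarization aside, the simplest context-window discipline is to keep only the most recent turns that fit a token budget, dropping the oldest first. A minimal sketch, assuming a whitespace token count as a stand-in for a real tokenizer:

```python
def trim_history(history, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the most recent (role, text) turns that fit within max_tokens.

    Walks the history backwards so recent turns survive; older turns are
    dropped (or, in a fuller system, summarized) once the budget is spent.
    """
    kept, used = [], 0
    for role, text in reversed(history):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break
        kept.append((role, text))
        used += cost
    return list(reversed(kept))
```

A production system would pair this with query rewriting, so that even when an early turn is dropped, the rewritten self-contained query still carries the needed referents.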
5. Security and Permission-Aware Search
Enterprise RAG must enforce document access controls at every layer.
Hints to consider:
- Permission model: inherit permissions from source systems (Google Drive ACLs, Confluence spaces, Slack channels)
- Index-time tagging: store permission metadata with each chunk in the vector database
- Query-time filtering: filter vector search results by the requesting user's permissions before passing to the LLM
- Why not post-filter only: the LLM must never see unauthorized content, even in intermediate steps
- Permission sync: background job to keep permission metadata current as source system ACLs change
- Tenant isolation: in multi-tenant deployments, physically or logically separate indexes per tenant
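The query-time pre-filter can be sketched as a set intersection on the permission metadata tagged at index time. The record shape and function name here are illustrative assumptions; the key property is that unauthorized chunks are excluded before similarity search, so they can never reach the LLM:

```python
def permitted_chunks(chunks, user_groups):
    """Pre-filter chunks by ACL overlap with the requesting user's groups.

    Runs BEFORE vector search, so unauthorized content is never scored,
    never reranked, and never enters the prompt.
    """
    user_groups = set(user_groups)
    return [c for c in chunks if user_groups & set(c["acl"])]


# Usage with index-time permission tags:
index = [
    {"doc_id": "wiki-1", "acl": ["eng", "pm"], "text": "Deploy runbook"},
    {"doc_id": "hr-9", "acl": ["hr"], "text": "Salary bands"},
]
visible = permitted_chunks(index, {"eng"})
```

In a real vector database this intersection is expressed as a metadata filter pushed into the ANN query; the point of sketching it separately is the ordering guarantee (filter first), not the mechanism.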