Design a Retrieval-Augmented Generation (RAG) platform that lets users upload documents, process them through an embedding pipeline, and query a knowledge base to receive AI-generated answers grounded in the retrieved context. Users connect various content sources such as PDFs, Word documents, and wiki pages. The system chunks each document, generates vector embeddings, and indexes them so that natural-language queries retrieve the most relevant passages. Those passages are then fed as context to a large language model, which produces cited, accurate responses.
The design must handle multi-tenant workloads where each organization's data is strictly isolated. Ingestion pipelines are asynchronous and potentially long-running, so the system must provide visibility into processing status and handle retries gracefully. On the query side, low latency and high relevance are paramount. Interviewers look for end-to-end thinking across data ingestion, indexing quality, retrieval strategy, LLM prompt construction, cost control, and production observability.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Ingestion involves extraction, chunking, embedding, and indexing, each of which can fail independently. Interviewers want to see a durable, observable, and resumable workflow.
The quality of retrieval depends heavily on how documents are split and embedded. Interviewers probe whether you understand the tradeoffs in chunk size, overlap, and embedding model selection.
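To make the chunk-size and overlap tradeoff concrete, here is a minimal fixed-window splitter. It is a sketch, not a production chunker (real systems often split on sentence or section boundaries first); the function name and parameters are illustrative. The overlap ensures a sentence straddling a boundary still appears whole in at least one chunk.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Overlapping windows reduce the chance that a passage relevant to a
    query is cut in half at a chunk boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # each window starts `overlap` chars before the previous one ends
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last window already covers the tail of the text
    return chunks
```

Larger chunks give the LLM more context per retrieved passage but dilute the embedding's focus; smaller chunks retrieve more precisely but may lose surrounding context. Stating that tradeoff, and that overlap mitigates boundary loss, is usually what the interviewer is probing for.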
Pure vector search can miss exact keyword matches, while pure keyword search misses semantic meaning. Interviewers expect a discussion of hybrid approaches.
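A common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. This sketch assumes each retriever returns an ordered list of document IDs; the constant `k = 60` is the value commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d)).

    Documents ranked highly by multiple retrievers (vector and keyword)
    float to the top, without needing to normalize incompatible scores.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive in an interview answer precisely because vector similarity and BM25 scores live on different scales; fusing by rank sidesteps the score-calibration problem entirely.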
Mixing tenants in the same index without strict isolation risks cross-tenant data leakage through prompts. Interviewers treat this as a critical security concern.
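One concrete mitigation is to build every vector-store query server-side from the authenticated context, never from the request body. The sketch below assumes a Pinecone-style request shape (`namespace`, `filter`, `top_k`); the exact fields are illustrative, but the principle — tenant scoping derived from the verified token, with a metadata filter as defense in depth — carries across stores.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AuthContext:
    tenant_id: str  # derived from the verified auth token, never from the request body

def build_vector_query(auth: AuthContext, query_embedding: list[float], top_k: int = 10) -> dict:
    """Scope every vector search to the caller's tenant namespace.

    Because the namespace and filter come from AuthContext, a malicious
    request payload cannot widen the search into another tenant's data.
    """
    return {
        "namespace": f"tenant-{auth.tenant_id}",
        "vector": query_embedding,
        "top_k": top_k,
        "filter": {"tenant_id": auth.tenant_id},  # defense in depth: filter even within the namespace
    }
```

Per-tenant namespaces (or per-tenant indexes for high-compliance customers) keep leakage from becoming a single-bug failure: even if the metadata filter is dropped, the namespace boundary still holds.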
Calling an LLM on every query is expensive and introduces latency and reliability concerns. Interviewers want to see practical strategies for managing this dependency.
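Two of the most practical levers — a normalized-query cache and per-tenant token budgets — can be sketched in a toy gateway. This uses in-process dicts as stand-ins for Redis; `LLMGateway` and its callable `llm(prompt) -> (answer, tokens_used)` interface are hypothetical names for illustration.

```python
import hashlib
import time

class LLMGateway:
    """Toy gateway: normalized-query cache plus a per-tenant token budget."""

    def __init__(self, llm, daily_token_budget: int = 100_000, ttl_s: int = 3600):
        self.llm = llm                       # callable: prompt -> (answer, tokens_used)
        self.budget = daily_token_budget
        self.ttl_s = ttl_s
        self.cache: dict[str, tuple[float, str]] = {}  # key -> (cached_at, answer)
        self.usage: dict[str, int] = {}                # tenant_id -> tokens used

    def answer(self, tenant_id: str, query: str) -> str:
        # Normalize before hashing so trivially different phrasings share a cache entry.
        key = hashlib.sha256(f"{tenant_id}:{query.strip().lower()}".encode()).hexdigest()
        hit = self.cache.get(key)
        if hit and time.time() - hit[0] < self.ttl_s:
            return hit[1]                    # cache hit: no LLM call, no token spend
        if self.usage.get(tenant_id, 0) >= self.budget:
            raise RuntimeError("tenant token budget exhausted")
        answer, tokens = self.llm(query)
        self.usage[tenant_id] = self.usage.get(tenant_id, 0) + tokens
        self.cache[key] = (time.time(), answer)
        return answer
```

In production the cache key would also include the retrieved chunk IDs (so a changed corpus invalidates stale answers), and budget exhaustion would degrade to a smaller model rather than hard-fail.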
Confirm the expected document volume, average document size, and number of tenants. Ask whether real-time ingestion is needed or if batch processing on a schedule is acceptable. Clarify the LLM provider (self-hosted versus managed API) and any latency or cost constraints. Determine whether the system should support conversational follow-up queries with session context or only single-turn Q&A. Ask about compliance requirements such as data residency or right-to-delete.
Sketch the major components: an API gateway for document uploads and query requests, a job orchestrator (Temporal, SQS plus Lambda, or similar) managing the ingestion pipeline stages, a document store (S3) for raw files, a chunk store (PostgreSQL) for extracted text and metadata, a vector index (Pinecone, pgvector, or Elasticsearch with dense vectors) for embedding-based retrieval, a keyword index (Elasticsearch BM25) for sparse retrieval, a retrieval service that merges results from both indexes, an LLM gateway that constructs prompts and streams responses, and a caching layer (Redis) for repeated queries. Show Kafka connecting the upload API to the ingestion pipeline and propagating status updates.
Walk through the lifecycle of a document upload. The user uploads a PDF via the API, which stores the raw file in S3 and publishes an ingestion job to Kafka. A worker picks up the job, extracts text using a parser (Apache Tika or similar), splits it into semantic chunks with overlap, and writes the chunks to PostgreSQL with metadata. A second stage reads the chunks, generates embeddings via a model-serving endpoint, and upserts the vectors into the index with the tenant namespace. A final stage updates the document status to INDEXED in the metadata store. If any stage fails, the job enters a retry queue with exponential backoff; after maximum retries it moves to a dead-letter queue and the document status is set to FAILED. Users see status transitions in near real-time via polling or server-sent events.
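The retry-and-dead-letter behavior in that lifecycle can be sketched as a single stage runner. This is a simplified in-process model (function names are illustrative); a real orchestrator like Temporal or SQS would re-enqueue with a delay rather than sleep, but the control flow — bounded retries with exponential backoff, then DLQ plus FAILED status — is the same.

```python
import time

MAX_RETRIES = 3

def run_stage(job: dict, stage_fn, dead_letter: list, base_delay_s: float = 1.0) -> bool:
    """Run one ingestion stage (extract, chunk, embed, or index) with retries.

    Returns True on success; on exhausting retries, parks the job on the
    dead-letter queue and marks the document FAILED, returning False.
    """
    for attempt in range(MAX_RETRIES):
        try:
            stage_fn(job)
            return True
        except Exception:
            # Exponential backoff: 1s, 2s, 4s, ... (in-process sleep stands in
            # for re-enqueueing the job with a visibility delay).
            time.sleep(base_delay_s * (2 ** attempt))
    dead_letter.append(job)
    job["status"] = "FAILED"
    return False
```

Keeping each stage idempotent (upserts keyed by chunk ID, status transitions guarded by compare-and-set) is what makes this retry loop safe to run.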
Cover query flow: the retrieval service executes parallel vector and keyword searches, fuses results, applies a re-ranker, selects the top-K chunks, constructs a prompt with system instructions and retrieved context, and streams the LLM response to the client with citations. Discuss cost control: cache frequent queries, enforce per-tenant token budgets, and use smaller models for low-priority requests. Address reprocessing: when a new embedding model is deployed, enqueue all existing chunks for re-embedding without re-extraction, using a versioned index that swaps atomically once complete. Mention monitoring: track ingestion throughput and failure rates, query latency broken down by retrieval and generation phases, cache hit rates, and LLM token usage per tenant. Briefly cover disaster recovery: replicate the vector index across availability zones, back up PostgreSQL and S3, and design for index rebuild from stored chunks if the vector store is lost.
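The prompt-construction step above can be sketched as follows. The system-instruction wording and the `doc_title`/`text` chunk fields are illustrative; the key idea is numbering the retrieved passages so the model's citations ([1], [2], ...) map back to source documents.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: instructions, numbered context, question.

    Each retrieved chunk is labeled [n] with its source title, so the model
    can cite passages and the client can resolve citations to documents.
    """
    context = "\n\n".join(
        f"[{i}] (source: {c['doc_title']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "You are a helpful assistant. Answer ONLY from the context below.\n"
        "Cite passages by their [number]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Instructing the model to admit when the context is insufficient is a cheap but effective hedge against hallucination, and it gives you a signal (unanswerable queries) worth tracking in the monitoring you describe.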