Design a Retrieval-Augmented Generation (RAG) platform that lets users upload documents, process them through an embedding pipeline, and query a knowledge base to receive AI-generated answers grounded in the retrieved context. Users connect various content sources such as PDFs, Word documents, and wiki pages. The system chunks each document, generates vector embeddings, and indexes them so that natural-language queries retrieve the most relevant passages. Those passages are then fed as context to a large language model, which produces cited, accurate responses.
The design must handle multi-tenant workloads where each organization's data is strictly isolated. Ingestion pipelines are asynchronous and potentially long-running, so the system must provide visibility into processing status and handle retries gracefully. On the query side, low latency and high relevance are paramount. Interviewers look for end-to-end thinking across data ingestion, indexing quality, retrieval strategy, LLM prompt construction, cost control, and production observability.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Ingestion involves extraction, chunking, embedding, and indexing, each of which can fail independently. Interviewers want to see a durable, observable, and resumable workflow.
The quality of retrieval depends heavily on how documents are split and embedded. Interviewers probe whether you understand the tradeoffs in chunk size, overlap, and embedding model selection.
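To make the chunk-size and overlap tradeoff concrete, here is a minimal fixed-window splitter. It is a sketch, not a production chunker (real systems often split on sentence or section boundaries first); the function name and parameters are illustrative. The overlap ensures a sentence straddling a boundary still appears whole in at least one chunk.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Overlapping windows reduce the chance that a passage relevant to a
    query is cut in half at a chunk boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # each window starts `overlap` chars before the previous one ends
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last window already covers the tail of the text
    return chunks
```

Larger chunks give the LLM more context per retrieved passage but dilute the embedding's focus; smaller chunks retrieve more precisely but may lose surrounding context. Stating that tradeoff, and that overlap mitigates boundary loss, is usually what the interviewer is probing for.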
Pure vector search can miss exact keyword matches, while pure keyword search misses semantic meaning. Interviewers expect a discussion of hybrid approaches.
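A common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. This sketch assumes each retriever returns an ordered list of document IDs; the constant `k = 60` is the value commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d)).

    Documents ranked highly by multiple retrievers (vector and keyword)
    float to the top, without needing to normalize incompatible scores.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive in an interview answer precisely because vector similarity and BM25 scores live on different scales; fusing by rank sidesteps the score-calibration problem entirely.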
Mixing tenants in the same index without strict isolation risks cross-tenant data leakage through prompts. Interviewers treat this as a critical security concern.
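One concrete mitigation is to build every vector-store query server-side from the authenticated context, never from the request body. The sketch below assumes a Pinecone-style request shape (`namespace`, `filter`, `top_k`); the exact fields are illustrative, but the principle — tenant scoping derived from the verified token, with a metadata filter as defense in depth — carries across stores.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AuthContext:
    tenant_id: str  # derived from the verified auth token, never from the request body

def build_vector_query(auth: AuthContext, query_embedding: list[float], top_k: int = 10) -> dict:
    """Scope every vector search to the caller's tenant namespace.

    Because the namespace and filter come from AuthContext, a malicious
    request payload cannot widen the search into another tenant's data.
    """
    return {
        "namespace": f"tenant-{auth.tenant_id}",
        "vector": query_embedding,
        "top_k": top_k,
        "filter": {"tenant_id": auth.tenant_id},  # defense in depth: filter even within the namespace
    }
```

Per-tenant namespaces (or per-tenant indexes for high-compliance customers) keep leakage from becoming a single-bug failure: even if the metadata filter is dropped, the namespace boundary still holds.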
Calling an LLM on every query is expensive and introduces latency and reliability concerns. Interviewers want to see practical strategies for managing this dependency.
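Two of the most practical levers — a normalized-query cache and per-tenant token budgets — can be sketched in a toy gateway. This uses in-process dicts as stand-ins for Redis; `LLMGateway` and its callable `llm(prompt) -> (answer, tokens_used)` interface are hypothetical names for illustration.

```python
import hashlib
import time

class LLMGateway:
    """Toy gateway: normalized-query cache plus a per-tenant token budget."""

    def __init__(self, llm, daily_token_budget: int = 100_000, ttl_s: int = 3600):
        self.llm = llm                       # callable: prompt -> (answer, tokens_used)
        self.budget = daily_token_budget
        self.ttl_s = ttl_s
        self.cache: dict[str, tuple[float, str]] = {}  # key -> (cached_at, answer)
        self.usage: dict[str, int] = {}                # tenant_id -> tokens used

    def answer(self, tenant_id: str, query: str) -> str:
        # Normalize before hashing so trivially different phrasings share a cache entry.
        key = hashlib.sha256(f"{tenant_id}:{query.strip().lower()}".encode()).hexdigest()
        hit = self.cache.get(key)
        if hit and time.time() - hit[0] < self.ttl_s:
            return hit[1]                    # cache hit: no LLM call, no token spend
        if self.usage.get(tenant_id, 0) >= self.budget:
            raise RuntimeError("tenant token budget exhausted")
        answer, tokens = self.llm(query)
        self.usage[tenant_id] = self.usage.get(tenant_id, 0) + tokens
        self.cache[key] = (time.time(), answer)
        return answer
```

In production the cache key would also include the retrieved chunk IDs (so a changed corpus invalidates stale answers), and budget exhaustion would degrade to a smaller model rather than hard-fail.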
Confirm the expected document volume, average document size, and number of tenants. Ask whether real-time ingestion is needed or if batch processing on a schedule is acceptable. Clarify the LLM provider (self-hosted versus managed API) and any latency or cost constraints. Determine whether the system should support conversational follow-up queries with session context or only single-turn Q&A. Ask about compliance requirements such as data residency or right-to-delete.
Sketch the major components: an API gateway for document uploads and query requests, a job orchestrator (Temporal, SQS plus Lambda, or similar) managing the ingestion pipeline stages, a document store (S3) for raw files, a chunk store (PostgreSQL) for extracted text and metadata, a vector index (Pinecone, pgvector, or Elasticsearch with dense vectors) for embedding-based retrieval, a keyword index (Elasticsearch BM25) for sparse retrieval, a retrieval service that merges results from both indexes, an LLM gateway that constructs prompts and streams responses, and a caching layer (Redis) for repeated queries. Show Kafka connecting the upload API to the ingestion pipeline and propagating status updates.
Walk through the lifecycle of a document upload. The user uploads a PDF via the API, which stores the raw file in S3 and publishes an ingestion job to Kafka. A worker picks up the job, extracts text using a parser (Apache Tika or similar), splits it into semantic chunks with overlap, and writes the chunks to PostgreSQL with metadata. A second stage reads the chunks, generates embeddings via a model-serving endpoint, and upserts the vectors into the index with the tenant namespace. A final stage updates the document status to INDEXED in the metadata store. If any stage fails, the job enters a retry queue with exponential backoff; after maximum retries it moves to a dead-letter queue and the document status is set to FAILED. Users see status transitions in near real-time via polling or server-sent events.
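The retry-and-dead-letter behavior in that lifecycle can be sketched as a single stage runner. This is a simplified in-process model (function names are illustrative); a real orchestrator like Temporal or SQS would re-enqueue with a delay rather than sleep, but the control flow — bounded retries with exponential backoff, then DLQ plus FAILED status — is the same.

```python
import time

MAX_RETRIES = 3

def run_stage(job: dict, stage_fn, dead_letter: list, base_delay_s: float = 1.0) -> bool:
    """Run one ingestion stage (extract, chunk, embed, or index) with retries.

    Returns True on success; on exhausting retries, parks the job on the
    dead-letter queue and marks the document FAILED, returning False.
    """
    for attempt in range(MAX_RETRIES):
        try:
            stage_fn(job)
            return True
        except Exception:
            # Exponential backoff: 1s, 2s, 4s, ... (in-process sleep stands in
            # for re-enqueueing the job with a visibility delay).
            time.sleep(base_delay_s * (2 ** attempt))
    dead_letter.append(job)
    job["status"] = "FAILED"
    return False
```

Keeping each stage idempotent (upserts keyed by chunk ID, status transitions guarded by compare-and-set) is what makes this retry loop safe to run.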
Cover query flow: the retrieval service executes parallel vector and keyword searches, fuses results, applies a re-ranker, selects the top-K chunks, constructs a prompt with system instructions and retrieved context, and streams the LLM response to the client with citations. Discuss cost control: cache frequent queries, enforce per-tenant token budgets, and use smaller models for low-priority requests. Address reprocessing: when a new embedding model is deployed, enqueue all existing chunks for re-embedding without re-extraction, using a versioned index that swaps atomically once complete. Mention monitoring: track ingestion throughput and failure rates, query latency broken down by retrieval and generation phases, cache hit rates, and LLM token usage per tenant. Briefly cover disaster recovery: replicate the vector index across availability zones, back up PostgreSQL and S3, and design for index rebuild from stored chunks if the vector store is lost.
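The prompt-construction step above can be sketched as follows. The system-instruction wording and the `doc_title`/`text` chunk fields are illustrative; the key idea is numbering the retrieved passages so the model's citations ([1], [2], ...) map back to source documents.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: instructions, numbered context, question.

    Each retrieved chunk is labeled [n] with its source title, so the model
    can cite passages and the client can resolve citations to documents.
    """
    context = "\n\n".join(
        f"[{i}] (source: {c['doc_title']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "You are a helpful assistant. Answer ONLY from the context below.\n"
        "Cite passages by their [number]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Instructing the model to admit when the context is insufficient is a cheap but effective hedge against hallucination, and it gives you a signal (unanswerable queries) worth tracking in the monitoring you describe.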