Design an ANN index system
Problem Statement
You are asked to design an approximate nearest neighbor (ANN) search system that operates over a corpus of 10 billion high-dimensional vectors. The system powers features like semantic search, recommendation, and retrieval-augmented generation where queries must return the top-K most similar vectors within strict latency bounds.
At this scale, exact nearest neighbor search is computationally infeasible. You must select an appropriate ANN algorithm, decide how to shard and replicate the index across a cluster, and define a strategy for keeping the index fresh as new vectors are continuously ingested. The design must balance recall quality, query latency, memory consumption, and indexing throughput.
Consider how the system handles the tension between index build time (which can take hours for billions of vectors) and the need to surface newly added or updated vectors within minutes. A base-plus-delta index architecture, where a periodically rebuilt base index is supplemented by a smaller real-time delta index, is a common pattern worth exploring.
Key Requirements
Functional
- Top-K similarity search -- Given a query vector, return the K most similar vectors (by cosine or inner product) with configurable K and a minimum recall target of 95%.
- Continuous ingestion -- Accept new vectors in real time and make them searchable within minutes, not hours.
- Index management -- Support full index rebuilds, incremental updates, and rollback to a previous index version without downtime.
- Filtered search -- Allow queries to include metadata filters (e.g., category, timestamp range) that are applied alongside the vector similarity search.
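The filtered-search requirement interacts with ANN retrieval in a non-obvious way: applying a metadata filter after the vector search can leave fewer than K survivors, so engines typically over-fetch candidates and then filter (or pre-filter with inverted lists). A minimal post-filter sketch, where the brute-force scoring stands in for an ANN call and the over-fetch factor is an assumed tuning knob, not any specific engine's API:

```python
import heapq

def filtered_top_k(query, corpus, metadata, predicate, k, overfetch=4):
    """Post-filter strategy: retrieve overfetch*k candidates by inner
    product, then drop those failing the metadata predicate.
    corpus: list of vectors; metadata: parallel list of dicts."""
    # Stand-in for an ANN call: exact top-(overfetch*k) by inner product.
    candidates = heapq.nlargest(
        overfetch * k,
        range(len(corpus)),
        key=lambda i: sum(q * x for q, x in zip(query, corpus[i])),
    )
    hits = [i for i in candidates if predicate(metadata[i])]
    return hits[:k]

# Toy corpus: 2-D vectors with a category tag.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
meta = [{"cat": "a"}, {"cat": "b"}, {"cat": "a"}, {"cat": "a"}]
print(filtered_top_k([1.0, 0.0], corpus, meta, lambda m: m["cat"] == "a", k=2))
# → [0, 3]
```

The tradeoff worth discussing: for highly selective filters, over-fetching wastes work and can still come up short, which is when pre-filtering (restricting the index traversal to matching IDs) wins.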
Non-Functional
- Scalability -- Index and serve 10 billion vectors, each 768 dimensions, with horizontal scaling as the corpus grows.
- Latency -- P99 query latency under 50 milliseconds for top-100 retrieval across the full 10B corpus.
- Availability -- 99.99% uptime with zero-downtime index deployments and graceful degradation under partial failures.
- Cost efficiency -- Minimize memory footprint through quantization and compression while maintaining recall above 95%.
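To ground the cost-efficiency requirement: 10 billion float32 vectors at 768 dimensions occupy roughly 30 TB uncompressed, which is why quantization is non-negotiable. A back-of-the-envelope calculation (the 64-byte PQ code size is an assumed configuration, e.g. 64 sub-quantizers at 1 byte each):

```python
N = 10_000_000_000        # corpus size
DIM = 768                 # vector dimensionality
raw_bytes = N * DIM * 4   # float32 = 4 bytes per dimension
pq_bytes = N * 64         # assumed PQ code: 64 sub-quantizers x 1 byte each

print(f"raw: {raw_bytes / 1e12:.1f} TB")    # raw: 30.7 TB
print(f"pq:  {pq_bytes / 1e12:.1f} TB")     # pq:  0.6 TB
print(f"ratio: {raw_bytes // pq_bytes}x")   # ratio: 48x
```

Even compressed, ~640 GB of codes (plus graph or inverted-list overhead) still exceeds a single machine's practical serving budget, which motivates the sharding discussion below.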
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. ANN Algorithm Selection and Tradeoffs
Interviewers expect you to compare multiple ANN approaches and justify your choice based on the specific constraints of the problem. They probe whether you understand the recall-latency-memory tradeoff triangle.
Hints to consider:
- Compare HNSW (high recall, high memory) with IVF+PQ (lower memory via product quantization, but requires a training step)
- Think about how product quantization compresses 768-dimensional vectors from 3 KB to under 100 bytes and what that means for 10B vectors
- Consider a hybrid approach: use IVF+PQ for the base index to fit in memory, and HNSW for the smaller delta index where memory is less constrained
- Explore how the number of IVF clusters and PQ sub-quantizers affect recall and latency
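The PQ compression mentioned in these hints works by splitting each vector into m sub-vectors and replacing each sub-vector with the index of its nearest centroid in a per-subspace codebook. A toy encode/decode sketch with hand-picked codebooks (real systems train codebooks with k-means over a sample of the corpus; all values here are illustrative):

```python
def pq_encode(vec, codebooks):
    """Encode a vector as one codebook index per sub-vector.
    codebooks[s] is the list of centroids for subspace s."""
    m = len(codebooks)
    d_sub = len(vec) // m
    code = []
    for s in range(m):
        sub = vec[s * d_sub:(s + 1) * d_sub]
        # Nearest centroid by squared L2 distance.
        best = min(
            range(len(codebooks[s])),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, codebooks[s][c])),
        )
        code.append(best)
    return code

def pq_decode(code, codebooks):
    """Reconstruct an approximate vector from its PQ code."""
    out = []
    for s, c in enumerate(code):
        out.extend(codebooks[s][c])
    return out

# 4-dim vectors, m=2 subspaces, 2 centroids each: the code is 2 small ints.
books = [
    [[0.0, 0.0], [1.0, 1.0]],   # codebook for dims 0-1
    [[0.0, 1.0], [1.0, 0.0]],   # codebook for dims 2-3
]
code = pq_encode([0.9, 1.1, 0.1, 0.9], books)
print(code)                      # → [1, 0]
print(pq_decode(code, books))    # → [1.0, 1.0, 0.0, 1.0]
```

Scaling the same idea up: 768 dims split into 64 sub-quantizers of 256 centroids each gives one byte per sub-vector, i.e. a 64-byte code, consistent with the sub-100-byte figure above.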
2. Sharding and Query Fanout
With 10 billion vectors, no single machine can hold the full index. Interviewers want to see your sharding strategy and how you aggregate results from multiple shards.
Hints to consider:
- Think about whether you shard by vector ID (random distribution) or by cluster assignment (locality-aware sharding)
- Consider the fanout-merge pattern: query all shards in parallel, each returns local top-K, then merge and re-rank at the aggregator
- Explore how you handle stragglers — one slow shard can dominate P99 latency
- Discuss replication within each shard for both availability and load balancing
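The fanout-merge pattern from these hints reduces to: ask every shard for its local top-K, then take the global top-K of the union of local results. A sketch in Python (shard contents and the inner-product scoring are made up; a real aggregator issues the shard calls in parallel over RPC, with a straggler timeout or hedged requests):

```python
import heapq

def search_shard(shard, query, k):
    """Local top-k on one shard as (score, vector_id) pairs, best first.
    Stand-in for an RPC to a shard server."""
    scored = [(sum(q * x for q, x in zip(query, vec)), vid) for vid, vec in shard]
    return heapq.nlargest(k, scored)

def fanout_search(shards, query, k):
    """Query every shard, then merge local top-k lists into a global top-k."""
    local_results = [search_shard(s, query, k) for s in shards]  # parallel in practice
    return heapq.nlargest(k, (hit for res in local_results for hit in res))

shards = [
    [(0, [1.0, 0.0]), (1, [0.5, 0.5])],   # shard A: (vector_id, vector)
    [(2, [0.9, 0.1]), (3, [0.0, 1.0])],   # shard B
]
print(fanout_search(shards, [1.0, 0.0], k=2))   # → [(1.0, 0), (0.9, 2)]
```

Note the correctness property that makes this work: with random (ID-based) sharding, each shard's local top-K is guaranteed to contain any of its vectors that belong in the global top-K, so merging local top-Ks loses nothing.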
3. Base-Plus-Delta Index Architecture
Keeping 10B vectors searchable while continuously ingesting new data is a core challenge. Interviewers probe how you balance index freshness with build cost.
Hints to consider:
- Think about a base index that is rebuilt periodically (e.g., every 6 hours) from a snapshot in Cassandra, plus a delta index that holds vectors ingested since the last build
- Consider how queries merge results from both base and delta indexes and how you handle vectors that exist in both (updates)
- Explore how Kafka serves as the ingestion pipeline, feeding both the delta index builder and the base index snapshot store
- Discuss the cutover process when a new base index is deployed and the delta index is reset
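The query path implied by these hints can be sketched as: search the base and delta indexes, let delta entries shadow base entries with the same vector ID (updates), drop IDs deleted since the last base build, and merge. The function names, tombstone mechanism, and hit format below are illustrative assumptions, not a specific system's API:

```python
import heapq

def merge_base_and_delta(base_hits, delta_hits, tombstones, k):
    """Merge (score, vector_id) hits from the base and delta indexes.
    Delta entries shadow base entries with the same id (updates);
    tombstones mark ids deleted since the last base build."""
    delta_ids = {vid for _, vid in delta_hits}
    merged = list(delta_hits)
    merged += [
        (score, vid) for score, vid in base_hits
        if vid not in delta_ids and vid not in tombstones
    ]
    return heapq.nlargest(k, merged)

base = [(0.95, "v1"), (0.90, "v2"), (0.80, "v3")]
delta = [(0.99, "v2"), (0.85, "v9")]   # v2 was updated after the base build
print(merge_base_and_delta(base, delta, tombstones={"v3"}, k=3))
# → [(0.99, 'v2'), (0.95, 'v1'), (0.85, 'v9')]
```

At cutover, the new base index absorbs everything up to its snapshot point, so the delta index and tombstone set are reset to cover only ingestion after that point; getting that boundary exactly right (no dropped or double-counted vectors) is a good detail to raise in the interview.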