Practice/Amazon/Design Post Search

Design Post Search

System DesignMust

Problem Statement

Design a social media post search system that allows users to create text posts and search through existing posts using keywords and boolean operators (AND/OR). The system must handle billions of posts, support near real-time indexing so new posts become searchable within seconds, and deliver results sorted by recency or engagement within low-latency SLAs.

Interviewers ask this because it forces you to design the core of a search system, often without relying on Elasticsearch or Solr. They want to see if you can build an inverted index, plan for boolean query evaluation, handle hot keys, meet low-latency targets, and balance freshness against write amplification. The problem is intentionally focused so you can reason about ingestion, indexing, query planning, caching, and pagination under realistic load.

Key Requirements

Functional

Post creation -- users create text posts that become searchable shortly after creation (within 5-10 seconds)
Boolean keyword search -- users search posts by keywords using AND and OR operators
Result sorting -- users sort search results by recency or by engagement metrics (like count)
Paginated results -- users page through results deterministically without duplicates across pages

Non-Functional

Scalability -- handle billions of indexed posts with millions of new posts per day and thousands of search queries per second
Reliability -- ensure no post is permanently lost from the index; tolerate node failures without data loss
Latency -- return search results within 200ms at p95 for typical queries
Consistency -- eventual consistency acceptable for search freshness; new posts visible in search within 10 seconds

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Inverted Index Design and Boolean Evaluation

The core technical challenge is building an efficient search index and evaluating boolean queries against it. Interviewers want to see if you understand inverted indexes, postings lists, and set operations.

Hints to consider:

Build an inverted index that maps each term to a sorted list of post IDs (postings list) with metadata like timestamps and engagement scores
Evaluate AND queries by intersecting postings lists (start with the shortest list for efficiency) and OR queries by merging them
Store positional information if phrase queries are needed, or use bigram indexes for common phrases
Shard the index by term hash to distribute load, with scatter-gather query execution across shards

2. Write Path and Near Real-Time Indexing

New posts must become searchable within seconds, creating a tension between write throughput and index freshness. Interviewers expect a pipeline that decouples post creation from indexing.

Hints to consider:

Use Kafka to buffer post-creation events, decoupling the write API from indexing workers
Indexing workers consume events, tokenize text, and update the inverted index in batches
Maintain a "recent posts" buffer (in-memory or Redis) for very fresh content that hasn't been indexed yet, merging at query time
Use append-only index segments with periodic compaction to maintain write throughput

3. Hot Key and Query Performance

Some terms appear in millions of posts, creating hot shards and expensive queries. Interviewers look for strategies to handle skewed workloads.

Hints to consider:

Cap postings list length for extremely common terms and use approximate results with quality signals for ranking
Cache frequently queried terms' top results in Redis with short TTLs
Implement early termination for queries that have enough high-quality results
Use backpressure and timeout mechanisms to protect the system from expensive wildcard or broad OR queries

4. Ranking and Pagination

Simple boolean matching returns too many results. Interviewers want to see how you rank results and handle stable pagination.

Hints to consider:

Score results using a combination of recency (time decay), engagement metrics (likes, shares), and optionally term frequency relevance
Pre-compute sorted segments by recency and by engagement score to serve different sort modes efficiently
Use cursor-based pagination encoding the last seen (score, post_id) to enable deterministic page-through
Cache query result pages with short TTLs to serve repeated paginations efficiently

Suggested Approach

Step 1: Clarify Requirements

Confirm scope with your interviewer. Ask whether you should design the inverted index from scratch or can use Elasticsearch. Clarify the expected post volume and query rate. Determine if the search supports only keywords or also phrases, wildcards, and filters (by author, date range). Establish latency targets and acceptable index freshness. Ask about result ranking requirements and whether personalization is in scope.

Practice/Amazon/Design Post Search

Design Post Search

System DesignMust

Problem Statement

Key Requirements

Functional

Post creation -- users create text posts that become searchable shortly after creation (within 5-10 seconds)
Boolean keyword search -- users search posts by keywords using AND and OR operators
Result sorting -- users sort search results by recency or by engagement metrics (like count)
Paginated results -- users page through results deterministically without duplicates across pages

Non-Functional

Scalability -- handle billions of indexed posts with millions of new posts per day and thousands of search queries per second
Reliability -- ensure no post is permanently lost from the index; tolerate node failures without data loss
Latency -- return search results within 200ms at p95 for typical queries
Consistency -- eventual consistency acceptable for search freshness; new posts visible in search within 10 seconds

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Inverted Index Design and Boolean Evaluation

Hints to consider:

Build an inverted index that maps each term to a sorted list of post IDs (postings list) with metadata like timestamps and engagement scores
Evaluate AND queries by intersecting postings lists (start with the shortest list for efficiency) and OR queries by merging them
Store positional information if phrase queries are needed, or use bigram indexes for common phrases
Shard the index by term hash to distribute load, with scatter-gather query execution across shards

2. Write Path and Near Real-Time Indexing

New posts must become searchable within seconds, creating a tension between write throughput and index freshness. Interviewers expect a pipeline that decouples post creation from indexing.

Hints to consider:

Use Kafka to buffer post-creation events, decoupling the write API from indexing workers
Indexing workers consume events, tokenize text, and update the inverted index in batches
Maintain a "recent posts" buffer (in-memory or Redis) for very fresh content that hasn't been indexed yet, merging at query time
Use append-only index segments with periodic compaction to maintain write throughput

3. Hot Key and Query Performance

Some terms appear in millions of posts, creating hot shards and expensive queries. Interviewers look for strategies to handle skewed workloads.

Hints to consider:

Cap postings list length for extremely common terms and use approximate results with quality signals for ranking
Cache frequently queried terms' top results in Redis with short TTLs
Implement early termination for queries that have enough high-quality results
Use backpressure and timeout mechanisms to protect the system from expensive wildcard or broad OR queries

4. Ranking and Pagination

Simple boolean matching returns too many results. Interviewers want to see how you rank results and handle stable pagination.

Hints to consider:

Score results using a combination of recency (time decay), engagement metrics (likes, shares), and optionally term frequency relevance
Pre-compute sorted segments by recency and by engagement score to serve different sort modes efficiently
Use cursor-based pagination encoding the last seen (score, post_id) to enable deterministic page-through
Cache query result pages with short TTLs to serve repeated paginations efficiently