Practice/Amazon/Design Post Search
Design Post Search
System DesignMust
Problem Statement
Design a social media post search system that allows users to create text posts and search through existing posts using keywords and boolean operators (AND/OR). The system must handle billions of posts, support near real-time indexing so new posts become searchable within seconds, and deliver results sorted by recency or engagement within low-latency SLAs.
Interviewers ask this because it forces you to design the core of a search system, often without relying on Elasticsearch or Solr. They want to see if you can build an inverted index, plan for boolean query evaluation, handle hot keys, meet low-latency targets, and balance freshness against write amplification. The problem is intentionally focused so you can reason about ingestion, indexing, query planning, caching, and pagination under realistic load.
Key Requirements
Functional
- Post creation -- users create text posts that become searchable shortly after creation (within 5-10 seconds)
- Boolean keyword search -- users search posts by keywords using AND and OR operators
- Result sorting -- users sort search results by recency or by engagement metrics (like count)
- Paginated results -- users page through results deterministically without duplicates across pages
Non-Functional
- Scalability -- handle billions of indexed posts with millions of new posts per day and thousands of search queries per second
- Reliability -- ensure no post is permanently lost from the index; tolerate node failures without data loss
- Latency -- return search results within 200ms at p95 for typical queries
- Consistency -- eventual consistency acceptable for search freshness; new posts visible in search within 10 seconds
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Inverted Index Design and Boolean Evaluation
The core technical challenge is building an efficient search index and evaluating boolean queries against it. Interviewers want to see if you understand inverted indexes, postings lists, and set operations.
Hints to consider:
- Build an inverted index that maps each term to a sorted list of post IDs (postings list) with metadata like timestamps and engagement scores
- Evaluate AND queries by intersecting postings lists (start with the shortest list for efficiency) and OR queries by merging them
- Store positional information if phrase queries are needed, or use bigram indexes for common phrases
- Shard the index by term hash to distribute load, with scatter-gather query execution across shards
2. Write Path and Near Real-Time Indexing
New posts must become searchable within seconds, creating a tension between write throughput and index freshness. Interviewers expect a pipeline that decouples post creation from indexing.
Hints to consider:
- Use Kafka to buffer post-creation events, decoupling the write API from indexing workers
- Indexing workers consume events, tokenize text, and update the inverted index in batches
- Maintain a "recent posts" buffer (in-memory or Redis) for very fresh content that hasn't been indexed yet, merging at query time
- Use append-only index segments with periodic compaction to maintain write throughput
3. Hot Key and Query Performance
Some terms appear in millions of posts, creating hot shards and expensive queries. Interviewers look for strategies to handle skewed workloads.
Hints to consider:
- Cap postings list length for extremely common terms and use approximate results with quality signals for ranking
- Cache frequently queried terms' top results in Redis with short TTLs
- Implement early termination for queries that have enough high-quality results
- Use backpressure and timeout mechanisms to protect the system from expensive wildcard or broad OR queries
4. Ranking and Pagination
Simple boolean matching returns too many results. Interviewers want to see how you rank results and handle stable pagination.
Hints to consider:
- Score results using a combination of recency (time decay), engagement metrics (likes, shares), and optionally term frequency relevance
- Pre-compute sorted segments by recency and by engagement score to serve different sort modes efficiently
- Use cursor-based pagination encoding the last seen (score, post_id) to enable deterministic page-through
- Cache query result pages with short TTLs to serve repeated paginations efficiently
Suggested Approach
Step 1: Clarify Requirements
Confirm scope with your interviewer. Ask whether you should design the inverted index from scratch or can use Elasticsearch. Clarify the expected post volume and query rate. Determine if the search supports only keywords or also phrases, wildcards, and filters (by author, date range). Establish latency targets and acceptable index freshness. Ask about result ranking requirements and whether personalization is in scope.