Practice/Apple/Design a Meta post storage and query system
Design a Meta post storage and query system
System DesignOptional
Problem Statement
Design a system that stores social media posts with associated tags and keywords, and lets users search for posts using Boolean operations on those tags. For example, a user might search for posts tagged with both #sports AND #basketball, or posts tagged with either #cooking OR #baking. Think of Instagram hashtag search or Facebook keyword filtering, but with support for combining tags.
The challenge is building and scaling an inverted index that maps tags to posts, executing set operations (union, intersection) efficiently over potentially millions of results, and keeping the index consistent with the primary data store as posts are created, updated, and deleted.
Key Requirements
Functional
- Post creation with tags -- users create posts with text/media and attach one or more tags or keywords
- Boolean tag queries -- users search posts by combining tags with AND (intersection) and OR (union) operations
- Pagination -- results are paginated with stable ordering (cursor-based) sorted by recency
- Tag updates -- users can add/remove tags from existing posts; changes are reflected in search results promptly
Non-Functional
- Scalability -- support billions of posts with millions of unique tags and thousands of queries per second
- Latency -- tag queries return in under 500ms at P95, even for popular tags with millions of matching posts
- Consistency -- new posts appear in tag search results within seconds (eventual consistency is acceptable)
- Availability -- search remains available even if the indexing pipeline falls behind
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Inverted Index Design
The inverted index maps each tag to a sorted list of post IDs (a posting list). Efficient Boolean operations on these lists are the foundation of the system.
Hints to consider:
- For each tag, maintain a posting list sorted by timestamp (descending) for recency-based results
- AND (intersection): merge two sorted posting lists keeping only IDs present in both — use two pointers for O(n+m)
- OR (union): merge two sorted lists keeping all IDs — also two pointers
- Compress posting lists using delta encoding and variable-byte encoding to save storage and improve cache performance
2. Hot Tag Problem
Some tags are extremely popular (#love has billions of posts). Queries involving hot tags create massive read and compute loads.
Hints to consider:
- Shard posting lists by time range: instead of one giant list for
#love, have one per day/week
- For queries, only scan recent time ranges first and stop once enough results fill the page
- Cache the first page of results for popular tag combinations with short TTLs
- Consider pre-computing results for the most common tag queries (top 1000 combinations)
3. Keeping the Index Consistent
When posts are created, deleted, or re-tagged, the inverted index must be updated. Doing this synchronously on the write path adds latency and couples systems.
Hints to consider:
- Use an async indexing pipeline: write the post to the primary store, publish an event to Kafka, and a consumer updates the index
- Make index updates idempotent using post version numbers — if the index already has a newer version, skip the update
- Handle deletions with tombstones: mark a post as deleted in the index rather than removing it immediately, clean up in background
- Discuss how to backfill or rebuild the index if it gets corrupted or a new tag field is added