Design Meta News Feed
Problem Statement
Build a system that allows multiple users to simultaneously edit the same document with real-time synchronization, similar to Google Docs or Notion. When one user types, deletes, or formats text, all other active collaborators should see those changes reflected on their screens within a couple of hundred milliseconds. The system must handle documents ranging from small notes to large technical specifications with embedded media, merge concurrent edits without losing any user's changes, maintain complete edit history, and gracefully handle network partitions and reconnections.
The core challenge lies in reconciling concurrent edits from distributed users while maintaining document integrity and ensuring eventual consistency. Your design must support hundreds of thousands of active editing sessions simultaneously. A typical document has 2-50 concurrent editors, though some popular documents may have 100+ simultaneous collaborators. The system should propagate changes with sub-200ms latency and let users go offline and reconnect without data loss.
Key Requirements
Functional
- Concurrent editing -- Multiple users can edit the same document simultaneously, with all changes propagated to other active editors in real-time
- Conflict-free merges -- When users make overlapping edits, the system automatically resolves conflicts deterministically so all users converge to the same final state
- Presence awareness -- Users can see who else is currently viewing or editing the document, including live cursor positions and text selections
- Edit history and versioning -- The system maintains a complete audit trail of all changes with the ability to revert to any previous state
- Offline editing -- Users can continue editing when disconnected, with changes automatically synchronized and merged when connectivity is restored
Non-Functional
- Scalability -- Support 500,000 concurrent editing sessions across 10 million active documents, with individual documents handling up to 200 simultaneous editors
- Reliability -- Ensure 99.95% uptime with zero data loss; when components fail, the system should degrade gracefully rather than block all edits
- Latency -- Propagate changes to all active collaborators within 150ms at p95, with local edits reflecting instantly on the user's own screen
- Consistency -- Guarantee strong eventual consistency where all users converge to identical document states, even after complex concurrent edit sequences
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Operational Transformation or CRDT Strategy
Interviewers want to see that you understand the fundamental challenge of merging concurrent edits. Simply broadcasting changes creates conflicts when users edit the same location simultaneously. You need a mathematically sound approach to conflict resolution.
Hints to consider:
- Operational Transformation (OT) rewrites each operation against the concurrent operations it has not yet seen; in practice it relies on a central server to establish a single operation order
- Conflict-free Replicated Data Types (CRDTs), as implemented by libraries such as Yjs or Automerge, allow peer-to-peer merging because their operations commute and can be applied in any order
- Consider trade-offs: OT typically offers smaller payloads, but its correctness depends on the server ordering and transforming every operation, while CRDTs carry more per-element metadata but enable offline-first architectures and reduce server coordination
- Character-wise CRDTs keep metadata for every character, including tombstones for deleted text, so they can grow large over time; discuss tombstone garbage collection and periodic document snapshotting
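To make the OT idea concrete, here is a minimal sketch that handles only concurrent inserts into a plain string. The `Insert` type, the `site` tie-breaker, and the `transform` function are illustrative assumptions, not the API of any particular library; a production system would also need delete and formatting operations and a server-assigned order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int   # index at which the text is inserted
    text: str
    site: int  # hypothetical replica id, used only to break position ties


def apply(doc: str, op: Insert) -> str:
    """Apply an insert to a plain-text document."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has already been applied.

    If `against` landed before `op` (or at the same position with a lower
    site id), `op` must shift right by the length of the inserted text.
    """
    if against.pos < op.pos or (against.pos == op.pos and against.site < op.site):
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


# Two users insert at the same position concurrently.
base = "abc"
a = Insert(1, "X", site=1)
b = Insert(1, "Y", site=2)

# Replica A applies a first, then b transformed against a.
doc_a = apply(apply(base, a), transform(b, a))
# Replica B applies b first, then a transformed against b.
doc_b = apply(apply(base, b), transform(a, b))

print(doc_a, doc_b)  # both replicas converge on the same text
```

The tie-breaker on `site` is what makes the result deterministic when both inserts target the same index; without it, the two replicas could order "X" and "Y" differently.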
2. WebSocket Connection Management at Scale
Real-time collaboration requires persistent bidirectional connections, creating unique scalability challenges compared to stateless HTTP APIs. You must design for connection lifecycle, load balancing, and graceful failover.
Hints to consider:
- Maintain WebSocket connections through a stateful gateway tier that routes messages between clients and document servers based on document ID
- Use consistent hashing to route all editors of the same document to the same backend server, enabling in-memory state and reducing coordination
- Implement heartbeat mechanisms to detect dead connections and clean up presence information promptly
- Design reconnection logic with exponential backoff and session tokens to resume editing without full document reload
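The routing and reconnection hints can be sketched as follows. This is a simplified illustration under stated assumptions: the `HashRing` class, the server names, the virtual-node count, and the `backoff_delay` parameters are all hypothetical, and the backoff uses the common "full jitter" variant, which is one choice among several.

```python
import bisect
import hashlib
import random


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes (sketch, not production code)."""

    def __init__(self, servers, vnodes=100):
        # Each server owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def server_for(self, doc_id: str) -> str:
        # The first ring point clockwise from the document's hash owns it.
        idx = bisect.bisect_right(self._keys, _hash(doc_id)) % len(self._ring)
        return self._ring[idx][1]


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


servers = ["doc-server-1", "doc-server-2", "doc-server-3"]  # hypothetical names
ring = HashRing(servers)
print(ring.server_for("doc-42"))  # every editor of doc-42 is routed to this server
```

The property worth calling out in an interview is that removing a server only remaps the documents that server owned; documents on the surviving servers keep their assignment, so their in-memory state stays valid.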
3. Handling Hot Documents and Celebrity Usage
When a widely-shared document has 100+ simultaneous editors, the broadcast fan-out becomes a bottleneck. A naive design will overwhelm a single server or create network congestion.
Hints to consider:
- Shard document servers by document ID, but detect when a single document exceeds capacity and migrate to a dedicated high-capacity instance
- Implement operation batching where the server collects multiple rapid changes and broadcasts them as a single message every 50-100ms
- Consider a hierarchical broadcast tree where edge servers handle subsets of clients and relay aggregated changes to reduce central server load
- Rate-limit cursor position updates separately from content changes since presence data has lower consistency requirements
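One way to picture batching and presence rate-limiting together is a per-document buffer that is drained on a fixed window. The `Batcher` class below is a hypothetical sketch: it takes timestamps as arguments so it stays deterministic, and it assumes the server runs a timer loop that calls `poll` periodically. Content operations are kept in order, while cursor updates are coalesced so only each user's latest position is broadcast.

```python
class Batcher:
    """Collects edits for one document and emits one message per time window."""

    def __init__(self, window_ms: int = 50):
        self.window_ms = window_ms
        self._ops = []            # content operations, order preserved
        self._cursors = {}        # user id -> latest cursor position only
        self._window_start = None

    def _touch(self, now_ms: int) -> None:
        # The window opens when the first item arrives after a flush.
        if self._window_start is None:
            self._window_start = now_ms

    def add_op(self, op, now_ms: int) -> None:
        self._touch(now_ms)
        self._ops.append(op)

    def set_cursor(self, user: str, pos: int, now_ms: int) -> None:
        self._touch(now_ms)
        self._cursors[user] = pos  # older positions are simply overwritten

    def poll(self, now_ms: int):
        """Return a batch to broadcast if the window has elapsed, else None."""
        if self._window_start is None or now_ms - self._window_start < self.window_ms:
            return None
        message = {"ops": self._ops, "cursors": self._cursors}
        self._ops, self._cursors, self._window_start = [], {}, None
        return message


batcher = Batcher(window_ms=50)
batcher.add_op("insert 'a' at 0", now_ms=0)
batcher.set_cursor("user-1", 3, now_ms=10)
batcher.set_cursor("user-1", 4, now_ms=20)   # replaces the earlier position
batcher.add_op("insert 'b' at 1", now_ms=30)
early = batcher.poll(now_ms=40)              # window still open
message = batcher.poll(now_ms=50)            # one broadcast for the whole window
print(message)
```

With 100+ editors, this turns many small broadcasts into at most one message per window per document, at the cost of adding up to one window of latency to each change.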
4. Persistence and Snapshot Strategy
Every keystroke could theoretically be persisted, but that creates enormous write volume. You need an efficient strategy to durably store changes while enabling fast document loads and history reconstruction.
Hints to consider:
- Store a base snapshot of the document plus a log of operations; periodically compact old operations into new base snapshots to bound log size
- Batch writes to durable storage every few seconds rather than on every keystroke; hold recent changes in memory, backed by a write-ahead log so that a server crash does not lose acknowledged edits
- Use append-only storage for operation logs to maximize write throughput; leverage Kafka or a similar event log for durability and replication
- For document loading, serve the most recent snapshot plus any operations since that snapshot, allowing clients to reconstruct current state efficiently
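The snapshot-plus-log idea can be sketched with an in-memory store. Everything here is an illustrative assumption: the `Op` shape (position, delete count, inserted text), the `DocumentStore` class, and the `compact_every` threshold. A real system would write the log to durable append-only storage and archive compacted operations rather than discard them, since the requirements call for full edit history.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Op:
    pos: int      # where the edit starts
    delete: int   # number of characters removed at pos
    insert: str   # text inserted at pos after the deletion


def replay(snapshot: str, ops) -> str:
    """Rebuild the current text from a snapshot plus the ops recorded since it."""
    text = snapshot
    for op in ops:
        text = text[:op.pos] + op.insert + text[op.pos + op.delete:]
    return text


@dataclass
class DocumentStore:
    compact_every: int = 100          # hypothetical threshold for compaction
    snapshot: str = ""
    snapshot_version: int = 0         # number of ops folded into the snapshot
    log: list = field(default_factory=list)

    @property
    def version(self) -> int:
        return self.snapshot_version + len(self.log)

    def append(self, op: Op) -> int:
        """Append an op; fold the log into a new snapshot once it gets long."""
        self.log.append(op)
        if len(self.log) >= self.compact_every:
            self.compact()
        return self.version

    def compact(self) -> None:
        self.snapshot = replay(self.snapshot, self.log)
        self.snapshot_version += len(self.log)
        self.log = []

    def load(self):
        """What a client needs to open the document: snapshot plus trailing ops."""
        return self.snapshot, list(self.log)


store = DocumentStore(compact_every=3)
store.append(Op(0, 0, "Hello"))
store.append(Op(5, 0, " world"))
before = replay(*store.load())        # rebuilt from an empty snapshot and two ops
store.append(Op(0, 5, "Howdy"))       # third op triggers compaction
store.append(Op(11, 0, "!"))
after = replay(*store.load())         # rebuilt from the new snapshot and one op
print(before, "|", after)
```

The load path never replays more than `compact_every` operations, which is what bounds document open time regardless of how long the document has been edited.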
Suggested Approach
Step 1: Clarify Requirements
Begin by confirming the scope and constraints with your interviewer. Ask about expected document sizes (small text files vs. large documents with images), how many concurrent editors per document on average and at peak, whether the system needs to support rich formatting and media beyond plain text, and what consistency guarantees are required (can users temporarily see different states?). Clarify whether offline editing is in scope, as this significantly impacts architecture choices. Also confirm latency targets -- sub-second, sub-200ms, or sub-100ms will drive different design decisions. Finally, ask whether you should design for a single data center deployment or global distribution across regions.