Design Meta News Feed

System Design · Must

Problem Statement

Build a system that allows multiple users to simultaneously edit the same document with real-time synchronization, similar to Google Docs or Notion. When one user types, deletes, or formats text, all other active collaborators should see those changes on their screens almost immediately. The system must handle documents ranging from small notes to large technical specifications with embedded media, merge concurrent edits without losing or corrupting changes, maintain complete edit history, and gracefully handle network partitions and reconnections.

The core challenge lies in reconciling concurrent edits from distributed users while maintaining document integrity and ensuring eventual consistency. Your design must support hundreds of thousands of active editing sessions simultaneously. Most documents have 2-50 concurrent editors, though some popular documents may have 100+ simultaneous collaborators. The system should propagate changes with sub-200ms latency and handle users going offline and reconnecting without data loss.

Key Requirements

Functional

Concurrent editing -- Multiple users can edit the same document simultaneously, with all changes propagated to other active editors in real-time
Conflict-free merges -- When users make overlapping edits, the system automatically resolves conflicts deterministically so all users converge to the same final state
Presence awareness -- Users can see who else is currently viewing or editing the document, including live cursor positions and text selections
Edit history and versioning -- The system maintains a complete audit trail of all changes with the ability to revert to any previous state
Offline editing -- Users can continue editing when disconnected, with changes automatically synchronized and merged when connectivity is restored

Non-Functional

Scalability -- Support 500,000 concurrent editing sessions across 10 million active documents, with individual documents handling up to 200 simultaneous editors
Reliability -- Ensure 99.95% uptime with zero data loss; the system should degrade gracefully when components fail rather than blocking all edits
Latency -- Propagate changes to all active collaborators within 150ms at p95, with local edits reflecting instantly on the user's own screen
Consistency -- Guarantee strong eventual consistency where all users converge to identical document states, even after complex concurrent edit sequences

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Operational Transformation or CRDT Strategy

Interviewers want to see that you understand the fundamental challenge of merging concurrent edits. Simply broadcasting changes creates conflicts when users edit the same location simultaneously. You need a mathematically sound approach to conflict resolution.

Hints to consider:

Operational Transformation (OT) rewrites each incoming operation against the concurrent operations it has not yet seen; practical deployments usually rely on a central server to establish a single operation order
Conflict-free Replicated Data Types (CRDTs), as implemented by libraries like Yjs or Automerge, allow peer-to-peer merging because their operations commute and can be applied in any order
Consider trade-offs: OT typically carries less per-operation metadata but depends on the server to order and transform operations, while CRDTs carry more metadata per element but enable offline-first architectures and reduce server coordination
Character-wise CRDTs can grow large over time; discuss tombstone garbage collection and periodic document snapshotting
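
To make the OT hint concrete, here is a minimal sketch of transforming two concurrent inserts so both replicas converge. It covers insert-versus-insert only (deletes and formatting are omitted), and the names Insert, transform, apply_op, and the site tie-breaker are hypothetical choices for illustration, not part of any particular OT library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text is inserted
    text: str
    site: int   # hypothetical client id, used only to break position ties


def transform(op: Insert, against: Insert) -> Insert:
    """Shift `op` so it can be applied after `against` has already been applied."""
    earlier = against.pos < op.pos
    tie_lost = against.pos == op.pos and against.site < op.site
    if earlier or tie_lost:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


def apply_op(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


base = "helo"
a = Insert(3, "l", site=1)   # client A fixes the typo
b = Insert(4, "!", site=2)   # client B appends punctuation at the same time

# Each replica applies its own op first, then the other op after transforming it.
replica_a = apply_op(apply_op(base, a), transform(b, a))
replica_b = apply_op(apply_op(base, b), transform(a, b))
```

Both replicas end up with the same text even though they applied the operations in different orders. The site comparison is what keeps the result deterministic when two users insert at exactly the same position.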

2. WebSocket Connection Management at Scale

Real-time collaboration requires persistent bidirectional connections, creating unique scalability challenges compared to stateless HTTP APIs. You must design for connection lifecycle, load balancing, and graceful failover.

Hints to consider:

Maintain WebSocket connections through a stateful gateway tier that routes messages between clients and document servers based on document ID
Use consistent hashing to route all editors of the same document to the same backend server, enabling in-memory state and reducing coordination
Implement heartbeat mechanisms to detect dead connections and clean up presence information promptly
Design reconnection logic with exponential backoff and session tokens to resume editing without full document reload
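
The consistent-hashing hint can be sketched as a small hash ring that maps a document ID to a backend server. This is an illustrative sketch only: the HashRing class, the vnodes parameter, and the server names are assumptions, and a production gateway would also need health checks and state handoff when ownership changes.

```python
import bisect
import hashlib


class HashRing:
    """Maps document IDs to servers; each server owns several points on the ring."""

    def __init__(self, servers, vnodes=64):
        # Place `vnodes` virtual points per server to even out the distribution.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def server_for(self, doc_id: str) -> str:
        # First ring point clockwise from the document's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, self._hash(doc_id)) % len(self._keys)
        return self._ring[idx][1]


servers = ["doc-server-1", "doc-server-2", "doc-server-3"]  # hypothetical names
ring = HashRing(servers)
owner = ring.server_for("doc-42")
```

Every gateway that builds the ring from the same server list routes a given document to the same backend. When a server is removed, only the documents it owned are reassigned; documents on the remaining servers stay where they are.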

3. Handling Hot Documents and Celebrity Usage

When a widely-shared document has 100+ simultaneous editors, the broadcast fan-out becomes a bottleneck. A naive design will overwhelm a single server or create network congestion.
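
A rough back-of-envelope calculation shows why fan-out dominates. The sketch below uses the 200-editor ceiling from the non-functional requirements; the rate of 3 operations per second per editor and the 100ms flush interval are assumptions chosen for illustration, not figures from the problem statement.

```python
def naive_fanout_per_s(editors: int, ops_per_editor_per_s: float) -> float:
    """Outbound messages/s if every op is sent individually to every other editor."""
    return editors * ops_per_editor_per_s * (editors - 1)


def batched_fanout_per_s(editors: int, flush_interval_ms: float) -> float:
    """Upper bound on outbound messages/s with one batch per interval per editor."""
    return editors * 1000 / flush_interval_ms


EDITORS = 200          # per-document ceiling from the requirements
ASSUMED_OP_RATE = 3    # ops/s per editor; an assumption, not a measured value

naive = naive_fanout_per_s(EDITORS, ASSUMED_OP_RATE)
batched = batched_fanout_per_s(EDITORS, flush_interval_ms=100)
```

Under these assumptions the unbatched design sends 119,400 messages per second for one document, while batching caps it at 2,000. The naive figure grows quadratically with editor count; the batched one grows linearly.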

Hints to consider:

Shard document servers by document ID, but detect when a single document exceeds capacity and migrate to a dedicated high-capacity instance
Implement operation batching where the server collects multiple rapid changes and broadcasts them as a single message every 50-100ms
Consider a hierarchical broadcast tree where edge servers handle subsets of clients and relay aggregated changes to reduce central server load
Rate-limit cursor position updates separately from content changes since presence data has lower consistency requirements
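
The batching hint can be sketched as a small buffer that collects operations and releases them as one message once a time window has elapsed. OpBatcher and its parameters are hypothetical names, and the clock is injectable so the behavior can be tested. A real server would also flush from a timer so that a lone operation is not held until the next one arrives.

```python
import time


class OpBatcher:
    """Collects ops for one document and emits them as a single batch per window."""

    def __init__(self, flush_interval_s=0.05, clock=time.monotonic):
        self.flush_interval_s = flush_interval_s
        self._clock = clock
        self._pending = []
        self._window_start = None

    def add(self, op):
        """Buffer an op; return the batch to broadcast if the window elapsed, else None."""
        now = self._clock()
        if self._window_start is None:
            self._window_start = now   # first op opens a new window
        self._pending.append(op)
        if now - self._window_start >= self.flush_interval_s:
            return self.flush()
        return None

    def flush(self):
        """Return all buffered ops and reset the window."""
        batch, self._pending = self._pending, []
        self._window_start = None
        return batch
```

The same structure could be reused with a longer interval for cursor updates, keeping only the latest position per user instead of every intermediate one.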

4. Persistence and Snapshot Strategy

Every keystroke could theoretically be persisted, but that creates enormous write volume. You need an efficient strategy to durably store changes while enabling fast document loads and history reconstruction.

Hints to consider:

Store a base snapshot of the document plus a log of operations; periodically compact old operations into new base snapshots to bound log size
Batch writes to the primary document store every few seconds rather than on every keystroke; to avoid losing the in-memory tail on a crash, append each operation to a write-ahead log before acknowledging it
Use append-only storage for operation logs to maximize write throughput; leverage Kafka or a similar event log for durability and replication
For document loading, serve the most recent snapshot plus any operations since that snapshot, allowing clients to reconstruct current state efficiently
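
The snapshot-plus-log idea can be sketched with an in-memory stand-in for the two stores. DocumentStore, the (pos, delete_count, text) operation shape, and the compact_after threshold are assumptions made for illustration. Because the requirements call for full edit history, a real system would archive compacted operations rather than discard them as this sketch does.

```python
class DocumentStore:
    """In-memory stand-in for a snapshot store plus an append-only operation log."""

    def __init__(self, compact_after=100):
        self.snapshot = ""           # latest base snapshot text
        self.snapshot_version = 0    # number of ops already folded into the snapshot
        self.log = []                # ops since the snapshot: (pos, delete_count, text)
        self.compact_after = compact_after

    @staticmethod
    def _apply(text, op):
        pos, delete_count, inserted = op
        return text[:pos] + inserted + text[pos + delete_count:]

    def append(self, op):
        """Append an op to the log and compact once the log reaches the threshold."""
        self.log.append(op)
        if len(self.log) >= self.compact_after:
            self.compact()

    def load(self):
        """Rebuild current state: snapshot plus every op logged since it was taken."""
        text = self.snapshot
        for op in self.log:
            text = self._apply(text, op)
        return text, self.snapshot_version + len(self.log)

    def compact(self):
        """Fold the logged ops into a new base snapshot to bound replay cost."""
        self.snapshot, self.snapshot_version = self.load()
        self.log = []
```

Loading cost is bounded by compact_after operations regardless of how long the document has existed, which is the point of periodic compaction.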

Suggested Approach

Step 1: Clarify Requirements

Begin by confirming the scope and constraints with your interviewer. Ask about expected document sizes (small text files vs. large documents with images), how many concurrent editors per document on average and at peak, whether the system needs to support rich formatting and media beyond plain text, and what consistency guarantees are required (can users temporarily see different states?). Clarify whether offline editing is in scope, as this significantly impacts architecture choices. Also confirm latency targets -- sub-second, sub-200ms, or sub-100ms will drive different design decisions. Finally, ask whether you should design for a single data center deployment or global distribution across regions.
