Practice/Microsoft/Design a Distributed Linked List System
Design a Distributed Linked List System
System DesignMust
Problem Statement
Design a distributed system for a collaborative document editor similar to Google Docs or Notion, where multiple users can simultaneously edit the same document and see each other's changes in real time. The system must handle thousands of concurrent editors on popular documents, resolve conflicting edits without losing user intent, and maintain a consistent view across all clients even during network partitions or server failures.
The core challenge lies in achieving both low-latency updates (users expect to see changes within milliseconds) and strong eventual consistency (all replicas must converge to the same state). Your design must handle operational transforms or conflict-free replicated data types (CRDTs), broadcast changes efficiently, persist document state reliably, and provide features like presence indicators, cursor positions, and revision history.
Key Requirements
Functional
- Real-time collaborative editing -- multiple users can edit the same document simultaneously with changes appearing to others within 100-200ms
- Conflict resolution -- concurrent edits at the same position must be resolved deterministically without losing user intent
- Presence and awareness -- users can see who else is viewing or editing the document and where their cursors are located
- Document persistence -- all changes must be durably stored with complete revision history for rollback and audit
- Rich text support -- handle formatted text including bold, italic, lists, headings, and embedded media with proper CRDT semantics
Non-Functional
- Scalability -- support 10M active documents with up to 500 concurrent editors per popular document
- Reliability -- 99.95% availability with no data loss; graceful degradation during network partitions or server failures
- Latency -- p99 edit propagation under 200ms; document load time under 500ms for documents up to 10MB
- Consistency -- strong eventual consistency where all clients converge to identical state; causal ordering of operations
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Conflict Resolution Strategy
The heart of any collaborative editor is how it handles concurrent, conflicting edits. Interviewers want to see if you understand the difference between Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDTs), and can reason about their tradeoffs.
Hints to consider:
- OT requires a central server to serialize operations but offers simpler client logic; CRDTs allow peer-to-peer sync but require more complex data structures
- Character-level CRDTs like Yjs or Automerge use position identifiers that grow over time and need periodic compaction
- Consider how deletions interact with concurrent insertions at the same position, and how to preserve user intent
- Version vectors or Lamport timestamps help establish causal ordering of operations across distributed clients
2. Broadcasting and Connection Management
With hundreds of concurrent editors, naive broadcast strategies (sending every keystroke to every user) will overwhelm the network. Interviewers expect thoughtful discussion of batching, throttling, and connection topologies.
Hints to consider:
- WebSocket connections allow bidirectional streaming but create server load; consider connection pooling and load balancing strategies
- Batch operations over short windows (50-100ms) to reduce message volume while maintaining responsiveness
- Use presence channels or pub/sub patterns to broadcast changes only to active editors of a document
- Implement exponential backoff for reconnection attempts and queue operations locally during temporary disconnections
3. Storage and Durability Model
Documents must be persisted efficiently while supporting fast reads, incremental updates, and complete history. The choice between storing full snapshots versus operation logs has profound implications for performance and storage cost.
Hints to consider:
- Store periodic snapshots (e.g., every 100 operations) plus incremental operation logs to balance read performance and storage growth
- Use append-only logs for operations to ensure durability and enable easy replication; operations are immutable once committed
- Consider blob storage for large media embeddings referenced by document structure, not inline
- Partition documents by ID for horizontal scaling; hot documents may need dedicated shards or caching layers
4. Handling Scale and Hotspots
Popular documents with hundreds of simultaneous editors create hotspots that can overwhelm single servers. Interviewers look for strategies to distribute load while maintaining consistency guarantees.
Hints to consider:
- Employ a hierarchical broadcast topology where edge servers fan out changes to clients, reducing load on the origin server
- Cache document state at edge locations with invalidation on writes; eventual consistency is acceptable for read-heavy viewers
- Rate-limit operations per user to prevent abuse while allowing legitimate fast typing (e.g., 100 ops/second per user)
- Use separate infrastructure for presence/cursors (ephemeral, high-frequency updates) versus document edits (durable, lower frequency)