Practice/Microsoft/Design a Distributed Linked List System

Design a Distributed Linked List System

System DesignMust

Problem Statement

Design a distributed system for a collaborative document editor similar to Google Docs or Notion, where multiple users can simultaneously edit the same document and see each other's changes in real time. The system must handle thousands of concurrent editors on popular documents, resolve conflicting edits without losing user intent, and maintain a consistent view across all clients even during network partitions or server failures.

The core challenge lies in achieving both low-latency updates (users expect to see changes within milliseconds) and strong eventual consistency (all replicas must converge to the same state). Your design must handle operational transforms or conflict-free replicated data types (CRDTs), broadcast changes efficiently, persist document state reliably, and provide features like presence indicators, cursor positions, and revision history.

Key Requirements

Functional

Real-time collaborative editing -- multiple users can edit the same document simultaneously with changes appearing to others within 100-200ms
Conflict resolution -- concurrent edits at the same position must be resolved deterministically without losing user intent
Presence and awareness -- users can see who else is viewing or editing the document and where their cursors are located
Document persistence -- all changes must be durably stored with complete revision history for rollback and audit
Rich text support -- handle formatted text including bold, italic, lists, headings, and embedded media with proper CRDT semantics

Non-Functional

Scalability -- support 10M active documents with up to 500 concurrent editors per popular document
Reliability -- 99.95% availability with no data loss; graceful degradation during network partitions or server failures
Latency -- p99 edit propagation under 200ms; document load time under 500ms for documents up to 10MB
Consistency -- strong eventual consistency where all clients converge to identical state; causal ordering of operations

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Conflict Resolution Strategy

The heart of any collaborative editor is how it handles concurrent, conflicting edits. Interviewers want to see if you understand the difference between Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDTs), and can reason about their tradeoffs.

Hints to consider:

OT requires a central server to serialize operations but offers simpler client logic; CRDTs allow peer-to-peer sync but require more complex data structures
Character-level CRDTs like Yjs or Automerge use position identifiers that grow over time and need periodic compaction
Consider how deletions interact with concurrent insertions at the same position, and how to preserve user intent
Version vectors or Lamport timestamps help establish causal ordering of operations across distributed clients

2. Broadcasting and Connection Management

With hundreds of concurrent editors, naive broadcast strategies (sending every keystroke to every user) will overwhelm the network. Interviewers expect thoughtful discussion of batching, throttling, and connection topologies.

Hints to consider:

WebSocket connections allow bidirectional streaming but create server load; consider connection pooling and load balancing strategies
Batch operations over short windows (50-100ms) to reduce message volume while maintaining responsiveness
Use presence channels or pub/sub patterns to broadcast changes only to active editors of a document
Implement exponential backoff for reconnection attempts and queue operations locally during temporary disconnections

3. Storage and Durability Model

Documents must be persisted efficiently while supporting fast reads, incremental updates, and complete history. The choice between storing full snapshots versus operation logs has profound implications for performance and storage cost.

Hints to consider:

Store periodic snapshots (e.g., every 100 operations) plus incremental operation logs to balance read performance and storage growth
Use append-only logs for operations to ensure durability and enable easy replication; operations are immutable once committed
Consider blob storage for large media embeddings referenced by document structure, not inline
Partition documents by ID for horizontal scaling; hot documents may need dedicated shards or caching layers

4. Handling Scale and Hotspots

Popular documents with hundreds of simultaneous editors create hotspots that can overwhelm single servers. Interviewers look for strategies to distribute load while maintaining consistency guarantees.

Hints to consider:

Employ a hierarchical broadcast topology where edge servers fan out changes to clients, reducing load on the origin server
Cache document state at edge locations with invalidation on writes; eventual consistency is acceptable for read-heavy viewers
Rate-limit operations per user to prevent abuse while allowing legitimate fast typing (e.g., 100 ops/second per user)
Use separate infrastructure for presence/cursors (ephemeral, high-frequency updates) versus document edits (durable, lower frequency)

Practice/Microsoft/Design a Distributed Linked List System

Design a Distributed Linked List System

System DesignMust

Problem Statement

Key Requirements

Functional

Real-time collaborative editing -- multiple users can edit the same document simultaneously with changes appearing to others within 100-200ms
Conflict resolution -- concurrent edits at the same position must be resolved deterministically without losing user intent
Presence and awareness -- users can see who else is viewing or editing the document and where their cursors are located
Document persistence -- all changes must be durably stored with complete revision history for rollback and audit
Rich text support -- handle formatted text including bold, italic, lists, headings, and embedded media with proper CRDT semantics

Non-Functional

Scalability -- support 10M active documents with up to 500 concurrent editors per popular document
Reliability -- 99.95% availability with no data loss; graceful degradation during network partitions or server failures
Latency -- p99 edit propagation under 200ms; document load time under 500ms for documents up to 10MB
Consistency -- strong eventual consistency where all clients converge to identical state; causal ordering of operations

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Conflict Resolution Strategy

Hints to consider:

OT requires a central server to serialize operations but offers simpler client logic; CRDTs allow peer-to-peer sync but require more complex data structures
Character-level CRDTs like Yjs or Automerge use position identifiers that grow over time and need periodic compaction
Consider how deletions interact with concurrent insertions at the same position, and how to preserve user intent
Version vectors or Lamport timestamps help establish causal ordering of operations across distributed clients

2. Broadcasting and Connection Management

Hints to consider:

WebSocket connections allow bidirectional streaming but create server load; consider connection pooling and load balancing strategies
Batch operations over short windows (50-100ms) to reduce message volume while maintaining responsiveness
Use presence channels or pub/sub patterns to broadcast changes only to active editors of a document
Implement exponential backoff for reconnection attempts and queue operations locally during temporary disconnections

3. Storage and Durability Model

Hints to consider:

Store periodic snapshots (e.g., every 100 operations) plus incremental operation logs to balance read performance and storage growth
Use append-only logs for operations to ensure durability and enable easy replication; operations are immutable once committed
Consider blob storage for large media embeddings referenced by document structure, not inline
Partition documents by ID for horizontal scaling; hot documents may need dedicated shards or caching layers

4. Handling Scale and Hotspots

Hints to consider:

Employ a hierarchical broadcast topology where edge servers fan out changes to clients, reducing load on the origin server
Cache document state at edge locations with invalidation on writes; eventual consistency is acceptable for read-heavy viewers
Rate-limit operations per user to prevent abuse while allowing legitimate fast typing (e.g., 100 ops/second per user)
Use separate infrastructure for presence/cursors (ephemeral, high-frequency updates) versus document edits (durable, lower frequency)