Design a web-based collaborative document editor similar to Google Docs that allows multiple users to edit the same document simultaneously in real time. Users should see each other's changes as they type, with cursor positions and text selections visible. The system needs to handle conflicts gracefully when multiple users edit the same section of text, maintain document history for version control, and support rich text formatting including bold, italic, lists, and headings.
The application should scale to support documents with thousands of concurrent editors and millions of total users. Focus on the core editing experience, conflict resolution mechanisms, and the infrastructure needed to synchronize changes across clients with minimal latency.
Based on real interview experiences, these are the areas interviewers probe most deeply:
The heart of collaborative editing is managing concurrent edits. Interviewers want to see if you understand the fundamental approaches to conflict-free synchronization.
Hints to consider:
Real-time collaboration requires persistent connections. Interviewers expect discussion of how to handle many concurrent connections efficiently.
Hints to consider:
How you represent the document internally affects performance, conflict resolution, and feature support.
Hints to consider:
When users lose connectivity, the system must handle divergent edits gracefully.
Hints to consider:
Start by confirming the scope and priorities with your interviewer:
Sketch the major components:
Client Application: Web browser running a rich text editor component that captures keystrokes, applies local changes optimistically, and sends operations to the server while receiving and applying remote operations.
WebSocket Gateway: Stateful servers maintaining persistent connections to clients, routing messages between clients editing the same document, and handling connection lifecycle.
Collaboration Service: Manages active editing sessions, applies operational transformation or CRDT logic, broadcasts changes to all connected clients, and coordinates with storage.
Document Storage: Persists document snapshots and operation logs, typically using a database that supports fast writes and range queries for loading edit history.
Presence Service: Tracks which users are currently viewing or editing each document, manages cursor position updates, and handles session timeouts.
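To make the Presence Service's responsibilities concrete, here is a minimal in-memory sketch. The class name `PresenceTracker`, the heartbeat model, and the 30-second timeout are all assumptions for illustration; a real deployment would more likely keep this state in a shared store with key expiry.

```python
from typing import Dict, Tuple


class PresenceTracker:
    """Hypothetical in-memory presence tracker (illustrative only)."""

    def __init__(self, timeout_s: float = 30.0):
        self._timeout = timeout_s
        # (doc_id, user_id) -> (cursor offset, time of last heartbeat)
        self._sessions: Dict[Tuple[str, str], Tuple[int, float]] = {}

    def heartbeat(self, doc_id: str, user_id: str, cursor: int, now: float) -> None:
        # Each cursor update doubles as a liveness signal.
        self._sessions[(doc_id, user_id)] = (cursor, now)

    def active_cursors(self, doc_id: str, now: float) -> Dict[str, int]:
        # Evict sessions that have been silent for longer than the timeout.
        expired = [k for k, (_, seen) in self._sessions.items()
                   if now - seen > self._timeout]
        for k in expired:
            del self._sessions[k]
        return {user: cur for (d, user), (cur, _) in self._sessions.items()
                if d == doc_id}


presence = PresenceTracker(timeout_s=30.0)
presence.heartbeat("doc-1", "alice", cursor=10, now=0.0)
presence.heartbeat("doc-1", "bob", cursor=15, now=20.0)
```

Passing `now` explicitly keeps the sketch deterministic; production code would read a monotonic clock instead.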
Walk through exactly how concurrent edits are handled:
When User A types "hello" at position 10 while User B types "world" at position 15, both clients immediately update their local view (optimistic updates). Each client tags its operation with a logical timestamp or the document version it was based on, so the server can tell which operations were concurrent, and sends it to the server.
The server receives both operations in whatever order the network delivers them and serializes them into a single sequence. Suppose User A's operation is sequenced first. Using Operational Transformation, the server transforms User B's operation to account for User A's insertion, adjusting the position from 15 to 20 (since 5 characters were inserted before it). The transformed operations are then broadcast to all clients.
Each client receives remote operations, transforms them against any concurrent local operations not yet acknowledged by the server, and applies them to the document. This ensures all clients converge to the same state: "hello" at position 10 and "world" at position 20.
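The "hello"/"world" walkthrough can be sketched in a few lines. This covers only the insert-versus-insert case; the `Insert` type, the `site` tie-breaker, and the 30-character placeholder document are assumptions made for the example, not part of any particular OT library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text goes
    text: str
    site: str   # hypothetical client id, used only to break ties


def transform_insert(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has been applied."""
    shifts = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    if shifts:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


doc = "x" * 30                      # placeholder document content
a = Insert(10, "hello", "A")
b = Insert(15, "world", "B")

# Server order: A first, then B transformed against A (15 -> 20).
b_prime = transform_insert(b, a)
server_doc = apply(apply(doc, a), b_prime)

# Client B applied its own edit first, then receives A transformed against B.
a_prime = transform_insert(a, b)
client_b_doc = apply(apply(doc, b), a_prime)
```

Both application orders yield the same text, which is the convergence property the walkthrough describes. The tie-breaker on `site` matters only when two inserts target the same position.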
For position tracking, use a combination of absolute positions for new operations and transformation functions that adjust positions based on preceding insertions/deletions. Discuss the complexity of handling delete operations that remove ranges being concurrently edited.
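Deletes are where the transformation rules get subtle. The sketch below shows one possible policy, assumed here for illustration: an insert that lands inside a concurrently deleted range survives and collapses to the start of that range, while the delete is split around the inserted text. All function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Delete:
    start: int    # index of the first deleted character
    length: int   # number of characters removed


def transform_pos_against_delete(pos: int, d: Delete) -> int:
    """Adjust an insert position for a delete that was applied first."""
    if pos <= d.start:
        return pos
    if pos >= d.start + d.length:
        return pos - d.length
    return d.start  # position fell inside the deleted range


def transform_delete_against_insert(d: Delete, ins_pos: int, ins_len: int) -> List[Delete]:
    """Adjust a delete for an insert that was applied first."""
    end = d.start + d.length
    if ins_pos <= d.start:
        return [Delete(d.start + ins_len, d.length)]
    if ins_pos >= end:
        return [d]
    # Insert landed strictly inside the range: split so the new text survives.
    left = Delete(d.start, ins_pos - d.start)
    right = Delete(ins_pos + ins_len, end - ins_pos)
    return [right, left]  # right part first, so the left indices stay valid


def insert(doc: str, pos: int, text: str) -> str:
    return doc[:pos] + text + doc[pos:]


def delete(doc: str, d: Delete) -> str:
    return doc[:d.start] + doc[d.start + d.length:]


doc = "0123456789abcdefghij"
d = Delete(start=5, length=6)       # removes "56789a"
ins_pos, ins_text = 8, "XY"         # concurrent insert inside that range

# Replica 1: delete first, then the transformed insert.
r1 = insert(delete(doc, d), transform_pos_against_delete(ins_pos, d), ins_text)

# Replica 2: insert first, then the transformed (split) delete.
r2 = insert(doc, ins_pos, ins_text)
for part in transform_delete_against_insert(d, ins_pos, len(ins_text)):
    r2 = delete(r2, part)
```

Other policies are defensible (for example, dropping the insert along with the deleted range); what matters is that every replica applies the same rule so the results converge.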
Discuss how the system handles growth and failures:
Horizontal Scaling: Use consistent hashing to assign documents to WebSocket servers. Clients connect to the server hosting their document. When a server becomes overloaded, migrate some documents to other servers (requires reconnecting affected clients).
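A minimal consistent-hashing ring might look like the following. The server names (`ws-1` and so on), the use of SHA-256, and 100 virtual nodes per server are assumptions chosen for the sketch.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(key: str) -> int:
    # First 8 bytes of SHA-256 give a stable 64-bit position on the ring.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers: Iterable[str] = (), vnodes: int = 100):
        self._vnodes = vnodes
        self._points: List[Tuple[int, str]] = []  # sorted (hash, server)
        for s in servers:
            self.add(s)

    def add(self, server: str) -> None:
        # Virtual nodes spread each server around the ring for better balance.
        for i in range(self._vnodes):
            bisect.insort(self._points, (_hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._points = [p for p in self._points if p[1] != server]

    def server_for(self, doc_id: str) -> str:
        if not self._points:
            raise LookupError("no servers in ring")
        # First ring point at or after the document's hash, wrapping around.
        i = bisect.bisect_left(self._points, (_hash(doc_id), ""))
        return self._points[i % len(self._points)][1]


ring = HashRing(["ws-1", "ws-2", "ws-3"])
docs = [f"doc-{i}" for i in range(500)]
before = {d: ring.server_for(d) for d in docs}

ring.add("ws-4")
after = {d: ring.server_for(d) for d in docs}
moved = [d for d in docs if before[d] != after[d]]
```

Every document in `moved` now maps to `ws-4`; documents on the other servers stay put, so only the clients of the migrated documents need to reconnect.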
Data Persistence: Write operations to a durable append-only log (such as Kafka or a database write-ahead log) before broadcasting to clients. This ensures operations aren't lost if a server crashes mid-broadcast. Periodically snapshot the document state so that loading a document means replaying only the operations recorded after the latest snapshot, rather than thousands of operations from the beginning.
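The log-plus-snapshot idea can be illustrated with an in-memory stand-in. Here a Python list plays the role of the durable log, operations are inserts only, and the snapshot interval of 3 is deliberately tiny; all of these are assumptions for the example.

```python
from typing import List, Tuple

Op = Tuple[int, str]  # (position, inserted text); inserts only, for brevity


def apply_op(text: str, op: Op) -> str:
    pos, s = op
    return text[:pos] + s + text[pos:]


class DocumentStore:
    """Hypothetical store: append-only op log plus periodic snapshots."""

    def __init__(self, snapshot_every: int = 3):
        self.log: List[Op] = []
        self.snapshot_seq = 0      # number of ops folded into the snapshot
        self.snapshot_text = ""
        self.snapshot_every = snapshot_every

    def append(self, op: Op) -> int:
        self.log.append(op)        # the durable write happens first
        seq = len(self.log)
        if seq - self.snapshot_seq >= self.snapshot_every:
            self.snapshot_text = self.load()
            self.snapshot_seq = seq
        return seq                 # only after this is it safe to broadcast

    def load(self) -> str:
        # Start from the snapshot and replay only the ops recorded after it.
        text = self.snapshot_text
        for op in self.log[self.snapshot_seq:]:
            text = apply_op(text, op)
        return text


store = DocumentStore(snapshot_every=3)
for op in [(0, "a"), (1, "b"), (2, "c"), (3, "d")]:
    store.append(op)
```

After four appends the snapshot covers the first three operations, so `load()` replays a single operation to rebuild the text.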
Network Failures: Implement exponential backoff for reconnections. When reconnecting, clients send their last acknowledged operation ID so the server can send any missed operations. If the gap is too large, fetch a fresh document snapshot instead of replaying operations.
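Both halves of the reconnection flow can be sketched briefly. The jitter strategy, the 0.5-second base, the 30-second cap, and the `max_replay` threshold are assumptions for illustration; the source only calls for exponential backoff and a replay-or-snapshot decision.

```python
import random
from typing import Callable, Dict, List


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Exponential backoff with full jitter.

    The ceiling doubles each attempt up to `cap`; the actual delay is a
    random fraction of that ceiling so clients do not reconnect in lockstep.
    """
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


def catch_up(log: List[str], last_acked_seq: int, max_replay: int,
             current_text: str) -> Dict[str, object]:
    """Decide what a reconnecting client should receive.

    Operations are numbered from 1, so `log[last_acked_seq:]` is exactly
    the set of operations the client has not yet acknowledged.
    """
    missed = log[last_acked_seq:]
    if len(missed) > max_replay:
        # Gap too large: cheaper to resend the whole document.
        return {"type": "snapshot", "seq": len(log), "text": current_text}
    return {"type": "ops", "ops": missed}


# Ceilings only (jitter factor pinned to 1.0) to show the doubling and the cap.
ceilings = backoff_delays(8, rng=lambda: 1.0)

log = ["op1", "op2", "op3", "op4", "op5"]
small_gap = catch_up(log, last_acked_seq=3, max_replay=10, current_text="doc text")
large_gap = catch_up(log, last_acked_seq=0, max_replay=2, current_text="doc text")
```

A client that receives a snapshot should discard or rebase any unacknowledged local operations against the returned `seq`, since its local history no longer lines up with the server's.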
Monitoring: Track key metrics like operation latency (time from client send to receiving acknowledgment), conflict rate (how often transformations are needed), and WebSocket connection churn (frequent disconnects may indicate network issues).