Design a collaborative document editing service similar to Google Docs where multiple users can create, edit, and share documents in real-time with concurrent editing capabilities. Users see each other's cursors, changes merge automatically, and access is controlled through sharing and permissions.
The core challenges are real-time updates, high-contention writes, conflict resolution using Operational Transformation (OT) or CRDTs, stateful WebSocket connection management, storage and versioning, and permission-aware search. You need to make pragmatic trade-offs under latency, consistency, and scalability constraints, especially for "hot" documents with many concurrent editors. Expect to clarify scope, define core SLAs for latency, availability, and durability, then drill into collaboration algorithms, transport, sharding, and indexing while keeping security and cost in check.
Based on real interview experiences, these are the areas interviewers probe most deeply:
This is the heart of any collaborative editor. Interviewers want to see that you can explain how concurrent edits are merged deterministically without losing intent or creating divergent states. Simply naming OT or CRDT is not sufficient -- you need to walk through how operations are represented, transformed, and applied.
Collaborative editing requires bidirectional, low-latency communication channels. Interviewers expect you to design a scalable WebSocket infrastructure that handles connection management, message routing, and failover gracefully.
Documents have unusual access patterns: frequent small updates during editing, occasional full reads on open, and historical queries for version browsing. Your storage layer must optimize for all three.
Shared documents often have complex sharing needs: per-user and per-group roles, link sharing, and access that can be revoked at any time. Interviewers probe how you enforce permissions during search and real-time collaboration.
Start by confirming scope and priorities. Ask about the types of content the editor must support (plain text, rich formatting, embedded objects) and expected document sizes. Clarify maximum concurrent collaborators per document and geographic distribution of users. Determine whether offline editing is a hard requirement or a future enhancement. Confirm how far back version history must extend and whether the system needs full-text search across documents.
Sketch the core components: an API Gateway for authentication, authorization, and WebSocket upgrade handling; a Collaboration Service cluster where each instance manages active editing sessions for a subset of documents; a Document Store (PostgreSQL) for persisted document metadata, snapshots, and permissions; an Operation Log Store for the append-only sequence of edits per document; a Presence Service that tracks online users and cursor positions; and a Search Index (Elasticsearch) for full-text document search with ACL filtering.
Show the data flow for a user edit: the client sends an operation via WebSocket to the collaboration server, which validates and transforms the operation against any concurrent operations, assigns a sequence number, broadcasts the transformed operation to all connected clients, appends to the operation log, and asynchronously updates the document snapshot in storage.
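The edit path above can be sketched as a single handler. This is a minimal in-memory sketch under stated assumptions, not a production design: `Session`, `handle_edit`, and the injected `can_edit` and `transform` callables are hypothetical names, the `outbox` list stands in for the WebSocket broadcast, and the `log` list stands in for the Operation Log Store.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Session:
    """Hypothetical in-memory state for one open document."""
    doc_id: str
    seq: int = 0  # last assigned sequence number
    log: List[Tuple[int, Any]] = field(default_factory=list)          # stand-in for the operation log
    outbox: List[Tuple[int, str, Any]] = field(default_factory=list)  # stand-in for the broadcast
    snapshot_stale: bool = False  # flag an asynchronous snapshot job would pick up


def handle_edit(
    session: Session,
    user_id: str,
    op: Any,
    base_seq: int,
    can_edit: Callable[[str, str], bool],
    transform: Callable[[Any, Any], Any],
) -> int:
    # 1. Validate the sender and the revision the operation was based on.
    if not can_edit(user_id, session.doc_id):
        raise PermissionError(f"{user_id} may not edit {session.doc_id}")
    if not 0 <= base_seq <= session.seq:
        raise ValueError("base_seq is outside the known history")
    # 2. Transform against every operation the client had not yet seen.
    #    The entry with sequence number n lives at index n - 1.
    for _, applied in session.log[base_seq:]:
        op = transform(op, applied)
    # 3. Assign the next sequence number.
    session.seq += 1
    # 4. Broadcast the transformed operation to connected clients.
    session.outbox.append((session.seq, user_id, op))
    # 5. Append to the operation log.
    session.log.append((session.seq, op))
    # 6. Mark the snapshot as stale; the snapshot itself is written elsewhere.
    session.snapshot_stale = True
    return session.seq
```

The handler follows the order given in the text. Passing `transform` in as a parameter keeps the pipeline independent of whichever OT rules the editor actually uses.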
Walk through the OT algorithm with a concrete example. User A inserts "hello" at position 10 while User B simultaneously deletes 3 characters starting at position 5. Both operations arrive at the server, which processes them in arrival order: it applies User A's insert, then transforms User B's delete against it. User B's delete covers positions 5 through 7, entirely before the insert point at position 10, so the insert did not shift any of the text being deleted and the delete applies unchanged. The reverse direction does need adjustment: when User A's insert reaches User B's client, which has already removed 3 characters, it must be shifted left to position 7. If the delete range and the insert position overlapped, the transform function would adjust offsets to preserve both users' intent.
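The example can be checked with a small sketch of position-based transform functions for plain text. The names `Insert`, `Delete`, `apply_op`, `transform_delete`, and `transform_insert` are illustrative assumptions, and the overlap rule (split the delete around text inserted inside its range) is one common choice rather than the only valid one.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int
    length: int


def apply_op(doc: str, op: Union[Insert, Delete]) -> str:
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + op.length:]


def transform_delete(d: Delete, i: Insert) -> List[Delete]:
    """Rewrite delete d so it can be applied after insert i."""
    if i.pos <= d.pos:
        # Insert landed before the range: shift the whole delete right.
        return [Delete(d.pos + len(i.text), d.length)]
    if i.pos >= d.pos + d.length:
        # Insert landed at or after the end of the range: nothing to adjust.
        return [d]
    # Insert landed inside the range: delete around the inserted text.
    left = i.pos - d.pos
    return [Delete(d.pos, left), Delete(d.pos + len(i.text), d.length - left)]


def transform_insert(i: Insert, d: Delete) -> Insert:
    """Rewrite insert i so it can be applied after delete d."""
    if i.pos <= d.pos:
        return i
    if i.pos >= d.pos + d.length:
        return Insert(i.pos - d.length, i.text)
    # Insert point was deleted: fall back to the start of the deleted range.
    return Insert(d.pos, i.text)


doc = "0123456789abcdefghij"
a = Insert(10, "hello")   # User A
b = Delete(5, 3)          # User B

# Server order: A first, then B transformed against A.
server = apply_op(doc, a)
for part in transform_delete(b, a):
    server = apply_op(server, part)

# User B's client: B first, then A transformed against B.
client_b = apply_op(apply_op(doc, b), transform_insert(a, b))
```

Both `server` and `client_b` end up as `"0123489helloabcdefghij"`: the delete is unchanged on the server, and the insert moves from position 10 to position 7 on User B's client.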
The collaboration server assigns monotonically increasing sequence numbers. Each client tracks the last sequence number it has seen. When a client sends an operation, it includes the sequence number of the last server operation it acknowledged. The server transforms the incoming operation against all operations applied since that sequence number. Cursor positions are treated as lightweight operations transformed alongside document edits.
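Treating a cursor as a lightweight operation means it goes through the same catch-up step as an edit: it is adjusted against every operation with a sequence number above the one the client last acknowledged. The sketch below assumes a simplified `Op` record and hypothetical helpers `shift_cursor` and `rebase_cursor`. The tie-break for a remote insert exactly at the cursor (the cursor stays put) is an assumption, not something the design dictates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    seq: int      # server-assigned sequence number
    kind: str     # "insert" or "delete"
    pos: int
    length: int   # number of characters inserted or deleted


def shift_cursor(cursor: int, op: Op) -> int:
    """Adjust a cursor position for one operation applied by the server."""
    if op.kind == "insert":
        # Only text inserted strictly before the cursor pushes it right.
        return cursor + op.length if op.pos < cursor else cursor
    if op.pos >= cursor:
        # Deletion starts at or after the cursor: no effect.
        return cursor
    # Deletion starts before the cursor: move left, but never past the
    # start of the deleted range (covers a cursor inside the range).
    return max(op.pos, cursor - op.length)


def rebase_cursor(cursor: int, base_seq: int, history: List[Op]) -> int:
    """Transform a cursor reported at base_seq against all later operations."""
    for op in history:
        if op.seq > base_seq:
            cursor = shift_cursor(cursor, op)
    return cursor


history = [
    Op(1, "insert", 0, 5),
    Op(2, "delete", 2, 3),
    Op(3, "insert", 20, 4),
]
print(rebase_cursor(10, 0, history))  # client had seen nothing: 12
print(rebase_cursor(10, 1, history))  # client had already seen op 1: 7
```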
Cover WebSocket failures and reconnections: clients maintain a local operation buffer and track sequence numbers. On reconnection, the client sends its last known sequence number, and the server replays any missed operations. If the gap is too large, the server sends a full snapshot instead.
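The reconnection decision can be sketched as one function on the server side. `plan_resync`, the dictionary response shape, and the `MAX_REPLAY` cut-off of 200 are assumptions for illustration; the text only says the server falls back to a snapshot when the gap is "too large". The sketch also falls back to a snapshot when the retained log no longer covers the gap.

```python
from typing import Any, Dict, List, Tuple

MAX_REPLAY = 200  # hypothetical cut-off for "gap is too large"


def plan_resync(
    last_seen_seq: int,
    head_seq: int,
    retained_log: List[Tuple[int, Any]],
    current_text: str,
    max_replay: int = MAX_REPLAY,
) -> Dict[str, Any]:
    """Decide what to send to a client that reconnects at last_seen_seq."""
    gap = head_seq - last_seen_seq
    if gap < 0:
        raise ValueError("client reports a sequence number the server never issued")
    # The log may have been trimmed; find the oldest entry still available.
    oldest = retained_log[0][0] if retained_log else head_seq + 1
    log_covers_gap = gap == 0 or oldest <= last_seen_seq + 1
    if gap > max_replay or not log_covers_gap:
        return {"type": "snapshot", "seq": head_seq, "text": current_text}
    missed = [entry for entry in retained_log if entry[0] > last_seen_seq]
    return {"type": "replay", "seq": head_seq, "ops": missed}


log = [(i, "op%d" % i) for i in range(1, 11)]
print(plan_resync(7, 10, log, "doc")["type"])                # replay
print(plan_resync(2, 10, log, "doc", max_replay=5)["type"])  # snapshot
```

After either response, the client would still need to transform the operations in its local buffer against what it just received and resend them; that step is omitted here.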
Discuss the snapshot strategy: every 100 operations, a background job reads the operation log, applies the operations recorded since the last snapshot, and writes a new snapshot. As long as the job keeps up, opening a document means loading the latest snapshot and replaying at most about 100 operations. For version history, users browse snapshots and the operations between them.
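A minimal sketch of the snapshot job and the matching load path, assuming insert-only operations stored as `(seq, (pos, text))` tuples and a snapshot stored as `(text, seq)`. The helper names `compact` and `load` are hypothetical; only the interval of 100 comes from the text.

```python
from typing import List, Tuple

SNAPSHOT_INTERVAL = 100  # interval stated in the text

LogEntry = Tuple[int, Tuple[int, str]]  # (seq, (pos, inserted_text))
Snapshot = Tuple[str, int]              # (document_text, seq_of_last_applied_op)


def apply_insert(text: str, op: Tuple[int, str]) -> str:
    pos, inserted = op
    return text[:pos] + inserted + text[pos:]


def tail(log: List[LogEntry], after_seq: int) -> List[LogEntry]:
    """Log entries newer than the given sequence number."""
    return [entry for entry in log if entry[0] > after_seq]


def compact(snapshot: Snapshot, log: List[LogEntry],
            interval: int = SNAPSHOT_INTERVAL) -> Snapshot:
    """Background job: fold pending operations into a new snapshot."""
    text, seq = snapshot
    pending = tail(log, seq)
    if len(pending) < interval:
        return snapshot  # not enough new operations yet
    for op_seq, op in pending:
        text = apply_insert(text, op)
        seq = op_seq
    return (text, seq)


def load(snapshot: Snapshot, log: List[LogEntry]) -> Tuple[str, int]:
    """Open a document: start from the snapshot and replay the tail."""
    text, seq = snapshot
    pending = tail(log, seq)
    for _, op in pending:
        text = apply_insert(text, op)
    return text, len(pending)


log = [(i, (i - 1, "x")) for i in range(1, 151)]     # 150 appends
snap = compact(("", 0), log)                         # folds all 150
log += [(i, (i - 1, "y")) for i in range(151, 181)]  # 30 more appends
text, replayed = load(snap, log)                     # replays only the 30
```

In this run `snap` carries sequence number 150 and `replayed` is 30, so the cost of opening the document depends on the tail since the last snapshot rather than on the full history.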
Address monitoring: track WebSocket connection counts per server, operation latency distributions, conflict rates, and storage growth. For security, validate permissions at the WebSocket gateway on every connection and re-check periodically during long sessions.
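The periodic permission re-check can be sketched as a guard that runs on each incoming message. `Connection`, `authorize_message`, and the 60-second `RECHECK_SECONDS` interval are assumptions for illustration. The current time is passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass
from typing import Callable

RECHECK_SECONDS = 60.0  # hypothetical re-check interval


@dataclass
class Connection:
    user_id: str
    doc_id: str
    checked_at: float  # time of the last successful permission check


def authorize_message(
    conn: Connection,
    now: float,
    can_edit: Callable[[str, str], bool],
    recheck_after: float = RECHECK_SECONDS,
) -> bool:
    """Return False when the caller should reject the message and close the socket."""
    if now - conn.checked_at >= recheck_after:
        if not can_edit(conn.user_id, conn.doc_id):
            return False
        conn.checked_at = now
    return True


editors = {("alice", "doc-1")}
can_edit = lambda user, doc: (user, doc) in editors

conn = Connection("alice", "doc-1", checked_at=0.0)  # checked at connect time
ok_early = authorize_message(conn, 10.0, can_edit)   # within the interval
editors.clear()                                      # access revoked
ok_stale = authorize_message(conn, 30.0, can_edit)   # still inside the interval
ok_late = authorize_message(conn, 61.0, can_edit)    # re-check sees the revocation
```

The `ok_stale` case shows the trade-off: a revoked user can keep editing until the next re-check, so the interval bounds how long a revocation takes to bite. A shorter interval, or pushing revocation events to the collaboration servers, narrows that window at the cost of more permission lookups.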