Design Slack
Problem Statement
Design a collaborative document editing platform similar to Google Docs or Notion that allows multiple users to edit the same document simultaneously. The system must provide real-time synchronization of changes across all active editors, conflict resolution when users modify the same content, rich text formatting support, and persistent storage of document history with version control. Users expect to see cursor positions and selections of other collaborators, receive instant updates as others type, and be able to revert to previous versions of the document.
This question evaluates your understanding of distributed state synchronization, operational transformation or conflict-free replicated data types (CRDTs), real-time bidirectional communication, and how to maintain consistency guarantees in a system where network partitions and concurrent edits are the norm rather than the exception. You'll need to balance low-latency updates with correctness, handle varying network conditions gracefully, and design for horizontal scalability as documents and concurrent editors grow.
Key Requirements
Functional
- Real-time collaborative editing -- multiple users can edit the same document simultaneously with changes appearing instantly for all participants
- Conflict resolution -- system must automatically merge concurrent edits without losing user intent or creating inconsistent states
- Rich text formatting -- support for text styling, headings, lists, links, images, and other formatting elements
- Presence awareness -- users can see who else is viewing or editing the document and track their cursor positions
- Version history -- complete edit history with ability to view past versions and restore previous states
- Access control -- documents can be private, shared with specific users, or made publicly accessible with appropriate permissions
Non-Functional
- Latency -- edits should propagate to other clients within 100-200 ms under normal network conditions
- Scalability -- support documents up to 100 MB in size with 50+ concurrent editors per document and millions of total documents
- Availability -- system should remain operational even if some backend services fail; clients should queue operations during temporary outages
- Consistency -- all clients must eventually converge to the same document state; conflicting edits must be deterministically resolved
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Operational Transformation vs. CRDT
The core technical challenge is maintaining consistency across concurrent edits. Interviewers want to see if you understand different approaches to conflict resolution and can articulate tradeoffs.
Hints to consider:
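- Operational Transformation (OT), as commonly deployed, relies on a central server to impose a single order on operations; it is the more mature approach and tends to be more memory-efficient for large documents
- CRDTs enable peer-to-peer synchronization and simpler reasoning about convergence, but may have larger memory overhead because each element typically carries a unique identifier and deleted elements often linger as tombstones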
- Consider how character positions shift when concurrent insertions/deletions occur at different locations
- Discuss tombstones for deletions, unique identifiers for operations, and how to handle out-of-order delivery
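To make the position-shifting hint concrete, here is a minimal sketch of an OT transform function for single-character inserts and deletes on a plain string. The `Insert`, `Delete`, `apply`, `transform`, and `converge` names are hypothetical, and the tie-break by `site` id is one assumed convention among several. It ignores rich text, multi-character operations, and server-side ordering, so treat it as an illustration of the idea rather than a production algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    char: str
    site: int  # assumed tie-breaker: the lower site id goes first on equal positions


@dataclass(frozen=True)
class Delete:
    pos: int


def apply(doc, op):
    """Apply one operation to a string; None means the op became a no-op."""
    if op is None:
        return doc
    if isinstance(op, Insert):
        return doc[:op.pos] + op.char + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op, other):
    """Rewrite `op` so it can be applied after the concurrent op `other`."""
    if isinstance(op, Insert) and isinstance(other, Insert):
        if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
            return Insert(op.pos + 1, op.char, op.site)
        return op
    if isinstance(op, Insert) and isinstance(other, Delete):
        if other.pos < op.pos:
            return Insert(op.pos - 1, op.char, op.site)
        return op
    if isinstance(op, Delete) and isinstance(other, Insert):
        if other.pos <= op.pos:
            return Delete(op.pos + 1)
        return op
    # Delete vs. Delete
    if other.pos < op.pos:
        return Delete(op.pos - 1)
    if other.pos == op.pos:
        return None  # both sides removed the same character
    return op


def converge(doc, a, b):
    """Apply a and b in both orders, transforming the second op each time."""
    left = apply(apply(doc, a), transform(b, a))
    right = apply(apply(doc, b), transform(a, b))
    return left, right
```

Calling `converge("abc", Insert(1, "X", 1), Insert(1, "Y", 2))` returns the same string for both application orders, which is the convergence property an interviewer will expect you to argue for.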
2. Real-Time Connection Management
With potentially thousands of documents being edited simultaneously, managing WebSocket connections efficiently becomes critical.
Hints to consider:
- Not every backend server should hold connections for every document; design a routing layer that maps document IDs to connection servers
- Handle reconnection gracefully with sequence numbers or vector clocks to replay missed operations
- Consider heartbeat mechanisms to detect dead connections and clean up presence information
- Discuss when to multiplex several open documents over a shared WebSocket connection versus opening a new connection per document
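The reconnection hint can be sketched with per-document sequence numbers and a bounded in-memory buffer. The `DocumentChannel` class, its method names, and the buffer size are assumptions made for illustration; a real connection server would likely back this with a durable log and handle many documents at once.

```python
from collections import deque


class DocumentChannel:
    """Assigns sequence numbers to ops and keeps a bounded replay buffer."""

    def __init__(self, buffer_size=1000):
        self.next_seq = 1
        self.recent = deque(maxlen=buffer_size)  # (seq, op) pairs, oldest first

    def publish(self, op):
        """Record an op and return the sequence number assigned to it."""
        seq = self.next_seq
        self.next_seq += 1
        self.recent.append((seq, op))
        return seq

    def replay_since(self, last_seen_seq):
        """Return ops the client missed, or None if it must reload the document.

        None signals that the buffer no longer covers the gap, so the client
        should fetch a fresh snapshot instead of replaying.
        """
        oldest = self.recent[0][0] if self.recent else self.next_seq
        if last_seen_seq + 1 < oldest:
            return None
        return [(seq, op) for seq, op in self.recent if seq > last_seen_seq]
```

On reconnect the client sends the last sequence number it applied. A list response is replayed in order, while `None` tells it to fall back to a full reload.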
3. Storage and Versioning
Documents need durable persistence while also supporting efficient retrieval of historical versions and fast reads of the current state.
Hints to consider:
- Store operations as an append-only log for complete history, but maintain a compacted current state for fast loading
- Partition version history by time ranges or operation count to bound query costs
- Consider snapshotting strategies to avoid replaying millions of operations on load
- Discuss indexing for full-text search and how to keep search indices synchronized with document changes
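The log-plus-snapshot idea from the hints above could look roughly like the following in-memory sketch. The `DocumentStore` class, the tuple-based op format, and the `snapshot_every` parameter are all hypothetical; a real system would persist the log and snapshots in a database or object store.

```python
def apply_op(text, op):
    """Apply ("ins", pos, string) or ("del", pos, length) to a string."""
    kind, pos, arg = op
    if kind == "ins":
        return text[:pos] + arg + text[pos:]
    return text[:pos] + text[pos + arg:]


class DocumentStore:
    """Append-only op log with periodic snapshots to bound replay cost."""

    def __init__(self, snapshot_every=100):
        self.log = []             # log[i] is the op that produced version i + 1
        self.snapshots = {0: ""}  # version -> full text at that version
        self.snapshot_every = snapshot_every

    def append(self, op):
        """Append an op, snapshot if due, and return the new version number."""
        self.log.append(op)
        version = len(self.log)
        if version % self.snapshot_every == 0:
            self.snapshots[version] = self.state_at(version)
        return version

    def state_at(self, version):
        """Rebuild any version from the nearest earlier snapshot."""
        base = max(v for v in self.snapshots if v <= version)
        text = self.snapshots[base]
        for op in self.log[base:version]:
            text = apply_op(text, op)
        return text
```

With this layout, loading the current state replays at most `snapshot_every - 1` operations, and restoring an old version uses the same `state_at` path.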
4. Presence and Awareness Features
Showing cursor positions, selections, and active users requires frequent low-latency broadcasts that must not overwhelm the system.
Hints to consider:
- Presence updates can be fire-and-forget; they don't require the same reliability guarantees as document edits
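- Use a separate fast path for ephemeral presence data; WebSocket runs over TCP, so "UDP-like" here means application-level semantics: no acknowledgments, no retries, and stale updates are dropped rather than queued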
- Throttle cursor position broadcasts on the client side to avoid flooding (e.g., send at most every 50ms)
- Store active user lists in memory with TTLs rather than durable storage
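As a rough sketch of the last two hints, the snippet below pairs a client-side throttle with a server-side in-memory presence table that expires entries by TTL. `PresenceTracker`, `CursorThrottle`, and the injectable `clock` parameter are assumptions introduced for illustration (the clock makes the behavior easy to test); the 50 ms interval comes from the hint above, while the 10-second TTL is an arbitrary placeholder.

```python
import time


class PresenceTracker:
    """In-memory presence table; entries expire if no heartbeat arrives in time."""

    def __init__(self, ttl=10.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.last_seen = {}  # user_id -> (timestamp, cursor position)

    def heartbeat(self, user_id, cursor):
        """Record the latest cursor; older values are simply overwritten."""
        self.last_seen[user_id] = (self.clock(), cursor)

    def active_users(self):
        """Drop expired entries and return {user_id: cursor} for the rest."""
        now = self.clock()
        self.last_seen = {
            user: entry for user, entry in self.last_seen.items()
            if now - entry[0] <= self.ttl
        }
        return {user: entry[1] for user, entry in self.last_seen.items()}


class CursorThrottle:
    """Client-side gate that lets through at most one send per interval."""

    def __init__(self, min_interval=0.05, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_sent = None

    def should_send(self):
        now = self.clock()
        if self.last_sent is None or now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False
```

Because presence is fire-and-forget, a dropped cursor update needs no retry: the next heartbeat overwrites it, and a silent client simply ages out of `active_users()`.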
5. Handling Large Documents and Media
As documents grow with embedded images, videos, or large amounts of formatted text, naive approaches break down.
Hints to consider:
- Split large documents into blocks or pages that can be loaded lazily as users scroll
- Store media assets separately in object storage with references in the document structure
- Consider differential synchronization where only changed blocks are transmitted
- Discuss pagination strategies for version history when documents have thousands of revisions
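One way to picture block-level differential sync is to compare per-block content hashes and send only the blocks that differ. The block IDs, the dictionary layout, and the `block_hashes` and `changed_blocks` helpers below are hypothetical, and the sketch assumes the client reports the hashes it currently holds.

```python
import hashlib


def block_hashes(blocks):
    """Map each block id to a SHA-256 digest of its text content."""
    return {
        block_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for block_id, text in blocks.items()
    }


def changed_blocks(client_hashes, server_blocks):
    """Return (updated, removed) relative to what the client already has.

    updated: {block_id: text} for blocks that are new or whose content differs
    removed: block ids the client holds that no longer exist on the server
    """
    server_hashes = block_hashes(server_blocks)
    updated = {
        block_id: server_blocks[block_id]
        for block_id, digest in server_hashes.items()
        if client_hashes.get(block_id) != digest
    }
    removed = [block_id for block_id in client_hashes if block_id not in server_blocks]
    return updated, removed
```

The payload then scales with the number of edited blocks rather than with total document size, which matters most for documents near the upper size limit.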