Design Post Search
Problem Statement
Design a collaborative document editing system similar to Google Docs or Notion where multiple users can simultaneously edit the same document and see each other's changes in real time. The system must handle concurrent edits gracefully, prevent data loss, and provide a smooth editing experience even when users have varying network conditions.
The core challenge is merging conflicting edits from multiple users without locks that would freeze the document. All users must eventually see the same document state, merges must preserve each user's intent, and the editor must stay responsive: local edits feel instantaneous and remote changes arrive within a second or two. The system should scale to 100,000+ concurrent documents with up to 50 active editors per document.
Key Requirements
Functional
- Real-time collaborative editing -- multiple users can edit the same document simultaneously and see each other's changes within 1-2 seconds
- Conflict-free merge -- concurrent edits merge automatically without data loss, including edits at the same position or to overlapping ranges
- Edit history and undo -- users can view document history and undo/redo their own changes or revert to previous versions
- Presence and cursors -- users see who else is editing and where other users' cursors are positioned
- Rich text formatting -- support for bold, italic, lists, headings, and other basic formatting operations
Non-Functional
- Scalability -- support 100,000+ concurrent documents with up to 50 simultaneous editors per document
- Reliability -- zero data loss even during network partitions or server failures; automatic recovery from crashes
- Latency -- local edits appear instantly; remote changes propagate within 1-2 seconds under normal network conditions
- Consistency -- all users eventually converge to the same document state (strong eventual consistency)
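A quick back-of-envelope check helps turn these targets into load figures. The sketch below uses the document and editor counts from the requirements above; the per-editor rate of 4 operations per second is an assumed typing rate for illustration, not a figure from the requirements, and the function name is hypothetical.

```python
def estimate_load(documents: int, editors_per_doc: int, ops_per_editor_per_sec: int) -> dict:
    """Worst-case load if every document has the maximum number of active editors."""
    connections = documents * editors_per_doc
    # Operations arriving at the server for one document.
    inbound_per_doc = editors_per_doc * ops_per_editor_per_sec
    # Each operation is relayed to every editor except its author.
    outbound_per_doc = inbound_per_doc * (editors_per_doc - 1)
    return {
        "connections": connections,
        "inbound_ops_per_doc": inbound_per_doc,
        "outbound_msgs_per_doc": outbound_per_doc,
    }


# 100,000 documents and 50 editors come from the requirements;
# 4 ops/sec per editor is an assumption.
load = estimate_load(documents=100_000, editors_per_doc=50, ops_per_editor_per_sec=4)
print(load)
```

Under these assumptions the worst case is 5 million open connections, and a fully loaded document receives 200 operations per second but sends 9,800 messages per second, so fan-out rather than ingest dominates the per-document cost.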
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Operational Transformation vs CRDT Choice
Interviewers want to see if you understand the fundamental algorithms that power collaborative editing. You need to explain how you'll handle concurrent edits when two users type at the same position or delete overlapping ranges. This is the heart of the problem.
Hints to consider:
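- Operational Transformation (OT) rewrites each operation against the concurrent operations it has not yet seen; practical deployments typically rely on a central server to impose a single order on operations
- Conflict-free Replicated Data Types (CRDTs), as implemented by libraries such as Yjs or Automerge, merge replicas deterministically without a central coordinator, which also makes peer-to-peer sync possible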
- Consider whether you need a central source of truth or can tolerate peer-to-peer synchronization
- Discuss how you'll represent positions in the document as users insert and delete text concurrently
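To make the "two users type at the same position" case concrete, here is a minimal sketch of an OT-style transform for two concurrent inserts. It is an illustration under simplifying assumptions, not a full OT implementation: it covers only insert against insert, and the `Insert` type, the `site` field, and the tie-break rule (lower site ID goes first) are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int   # index in the document where the text is inserted
    text: str
    site: str  # hypothetical client ID, used only to break ties


def apply(doc: str, op: Insert) -> str:
    """Apply an insert to a plain-string document."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has already been applied."""
    # `op` shifts right if `against` landed before it, or at the same
    # position with a lower site ID (the assumed tie-break rule).
    shifts = against.pos < op.pos or (against.pos == op.pos and against.site < op.site)
    if shifts:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


base = "hello world"
a = Insert(5, ",", site="alice")
b = Insert(5, "!", site="bob")

# Replica 1 sees a first, then b; replica 2 sees b first, then a.
r1 = apply(apply(base, a), transform(b, a))
r2 = apply(apply(base, b), transform(a, b))
print(r1, "|", r2)
```

Both replicas end with the same text and both insertions survive, which is the convergence and intent-preservation property the interviewer is looking for. A CRDT reaches the same outcome differently, by giving each character a stable unique identifier instead of shifting integer positions.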
2. Real-Time Communication Architecture
Your choice of communication protocol directly impacts latency and scalability. Interviewers will push you to justify whether you need persistent connections or can work with polling, and how you'll handle connection failures and reconnection storms.
Hints to consider:
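- WebSockets provide full-duplex communication over a long-lived connection; because each connection is stateful, you need document-aware routing (often sticky sessions at the load balancer) and careful connection management
- Server-Sent Events (SSE) offer simpler one-way, server-to-client streaming with automatic reconnection, but client edits must then travel over separate HTTP requests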
- Long polling works as a fallback but creates more overhead at scale
- Consider using a dedicated connection server layer separate from your application servers to handle millions of concurrent connections
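One common way to soften reconnection storms is exponential backoff with jitter on the client, so that thousands of clients dropped by the same server failure do not all retry at the same instant. The sketch below is a minimal illustration; the function name and the default `base` and `cap` values are assumptions chosen for the example.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Uses "full jitter": a uniform random delay between 0 and an
    exponentially growing ceiling that is capped at `cap` seconds.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Example: the delay ceiling grows 0.5s, 1s, 2s, 4s, ... up to 30s.
for attempt in range(5):
    print(attempt, round(reconnect_delay(attempt), 3))
```

The randomness matters as much as the exponential growth: without it, clients that disconnected together would retry together on every round.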
3. State Synchronization and Conflict Resolution
Interviewers want to see how you'll keep the document state consistent across all clients and the server. You need to handle cases where users go offline, make edits, and reconnect, or when the network delivers operations out of order.
Hints to consider:
- Use vector clocks or logical timestamps to track causality and detect concurrent operations
- Implement a merge algorithm that preserves user intent (e.g., if two users insert at position 5, both insertions should appear)
- Consider maintaining both the current document state and a log of operations for recovery and time-travel
- Design an efficient delta synchronization protocol so reconnecting clients don't need to download the entire document
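The first and last hints can be sketched in a few lines. The example below shows one way to compare vector clocks to decide whether two operations are causally ordered or concurrent, plus a very simple delta catch-up over a server-side operation log. All names (`compare`, `merge`, `ops_since`) and the log layout are assumptions made for illustration.

```python
from typing import Dict, List

VectorClock = Dict[str, int]  # client ID -> number of operations seen from that client


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'equal', 'before', 'after', or 'concurrent' for clock a relative to b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither clock dominates: the edits need merging


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, taken after a replica has applied both histories."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def ops_since(log: List[str], last_seen_version: int) -> List[str]:
    """Delta sync: log[i] is assumed to hold the operation that produced version i + 1."""
    return log[last_seen_version:]


print(compare({"alice": 1}, {"alice": 1, "bob": 1}))  # alice's edit happened before
print(compare({"alice": 2}, {"bob": 1}))              # neither saw the other
print(ops_since(["op1", "op2", "op3"], 1))            # client at version 1 catches up
```

A reconnecting client sends the last version it applied and receives only the missing operations; if the log has been compacted past that version, the server would fall back to sending a full snapshot.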
4. Scaling Write Throughput and Hot Documents
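A document with 50 concurrent editors can generate hundreds of operations per second (for example, 50 editors at a few keystrokes per second each), and every operation must then be fanned out to the other 49 editors. Interviewers will probe how you'll prevent a single popular document from overwhelming your system and how you'll partition load across servers.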
Hints to consider:
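- Route all connections for a document (or a range of documents) to the same WebSocket server within a dedicated pool, for example by hashing the document ID, so load is spread across servers while each document's editors share one in-memory state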
- Implement operation batching and compression to reduce network overhead
- Consider caching recent operations in memory and using pub/sub (Redis, Kafka) to fan out changes
- For viral documents, implement backpressure or rate limiting to prevent cascading failures
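Operation batching can be as simple as coalescing a user's consecutive keystrokes into one insert before sending. The sketch below is a minimal, assumed illustration: it handles only contiguous inserts from a single client, and `InsertOp` and `coalesce` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InsertOp:
    pos: int   # index where the text is inserted
    text: str


def coalesce(ops: List[InsertOp]) -> List[InsertOp]:
    """Merge runs of inserts where each one starts exactly where the previous one ended."""
    batched: List[InsertOp] = []
    for op in ops:
        last = batched[-1] if batched else None
        if last is not None and op.pos == last.pos + len(last.text):
            # Contiguous typing: extend the previous operation instead of adding a new one.
            batched[-1] = InsertOp(last.pos, last.text + op.text)
        else:
            batched.append(InsertOp(op.pos, op.text))
    return batched


# Typing "cat" at position 10, then jumping to the start of the document.
keystrokes = [InsertOp(10, "c"), InsertOp(11, "a"), InsertOp(12, "t"), InsertOp(0, "x")]
print(coalesce(keystrokes))
```

Here four keystrokes become two messages. Flushing the batch on a short timer (tens of milliseconds, as an assumed starting point) trades a small amount of latency for far fewer messages on hot documents.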