Design a Real-Time Collaborative Document Editor
System Design · Must
Problem Statement
Design a distributed system that enables multiple users to simultaneously edit the same document in real time, similar to Google Docs or Notion. When one user types, all other viewers should see those changes appear on their screens within milliseconds, without conflicts or data loss.
The system must handle hundreds of concurrent editors on a single popular document while maintaining consistency across all clients. You need to address operational transformation or conflict-free data structures, WebSocket connection management at scale, presence indicators showing who's editing where, revision history for undo/redo, and graceful handling of network partitions. Interviewers use this question to evaluate your understanding of real-time data synchronization, distributed consensus, event-driven architectures, and the trade-offs between strong consistency and user experience.
Key Requirements
Functional
- Real-time collaborative editing -- multiple users can edit the same document simultaneously with changes propagating to all connected clients within 200ms
- Conflict resolution -- concurrent edits to the same section must merge deterministically without losing data or requiring manual conflict resolution
- Presence awareness -- users can see active collaborators, their cursor positions, and which sections they're currently editing
- Revision history -- every change is versioned, allowing users to view past document states and revert to previous versions
- Offline support -- users can continue editing while disconnected, with changes synchronized when connectivity is restored
Non-Functional
- Scalability -- support 100+ simultaneous editors on a single document, with 10 million total active documents across the platform
- Reliability -- no data loss even during server failures or network partitions; 99.95% uptime SLA
- Latency -- sub-200ms end-to-end propagation of edits from one client to all others under normal network conditions
- Consistency -- eventual consistency is acceptable, but all clients must converge to the same document state; causality must be preserved
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Conflict-Free Merge Strategy
The core technical challenge is ensuring that when Alice types "hello" at position 5 while Bob simultaneously deletes characters 3-7, both operations can be applied in any order and all clients converge to the same final state. Naive approaches like last-write-wins lose data.
Hints to consider:
- Operational Transformation (OT) algorithms that transform operations based on concurrent changes to maintain intent
- Conflict-Free Replicated Data Types (CRDTs) like YJS or Automerge that guarantee convergence through commutative operations
- Character-level vs block-level granularity trade-offs between bandwidth and merge complexity
- Tombstone markers for deletions to preserve positional integrity across concurrent operations
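To make the convergence guarantee concrete, here is a minimal sketch of a sequence CRDT using fractional positions and tombstones. The class and operation names are illustrative, not the API of a real library like Yjs or Automerge; the point is that the two concurrent operations commute, so both replicas reach the same text regardless of arrival order.

```python
# Minimal convergent sequence CRDT sketch (illustrative, not a real library).
# Each character has a globally unique id and a fractional position; deletes
# keep a tombstone so that delete and insert of the same id commute.

class Doc:
    def __init__(self):
        # id -> [position, char, deleted]; id = (site, counter) breaks ties.
        self.chars = {}

    def apply(self, op):
        kind, cid, pos, ch = op
        if kind == "insert":
            self.chars[cid] = [pos, ch, False]
        else:  # "delete": record a tombstone even if the insert hasn't arrived yet
            self.chars.setdefault(cid, [pos, ch, False])[2] = True

    def text(self):
        visible = [(p, cid, ch) for cid, (p, ch, dead) in self.chars.items() if not dead]
        return "".join(ch for _, _, ch in sorted(visible))

# Both replicas start from the same document "hi":
a, b = Doc(), Doc()
for d in (a, b):
    d.apply(("insert", ("init", 1), 0.25, "h"))
    d.apply(("insert", ("init", 2), 0.75, "i"))

# Two operations issued concurrently at different sites:
op_a = ("insert", ("alice", 1), 0.5, "X")   # Alice inserts "X" between h and i
op_b = ("delete", ("init", 1), 0.25, "h")   # Bob deletes the "h"

a.apply(op_a); a.apply(op_b)   # one arrival order...
b.apply(op_b); b.apply(op_a)   # ...and the reverse
assert a.text() == b.text() == "Xi"   # both replicas converge
```

Because every operation targets a stable unique id rather than a mutable index, applying them is commutative, which is exactly what last-write-wins fails to provide.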
2. WebSocket Connection Management at Scale
With millions of documents and hundreds of potential editors per document, managing persistent bidirectional connections is non-trivial. A single server can't handle all connections, and clients must reconnect gracefully when servers fail.
Hints to consider:
- Connection servers organized by document ID sharding to route all editors of the same document to the same server cluster
- Sticky sessions via consistent hashing to maintain affinity between clients and connection servers
- Heartbeat mechanisms and exponential backoff for reconnection when detecting stale connections
- Separating the connection layer (stateful WebSocket servers) from business logic (stateless API servers)
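The document-ID sharding and sticky-session ideas above can be sketched with a small consistent-hash ring. Server names and the virtual-node count are assumptions for illustration; the property that matters is that every client editing the same document is routed to the same connection server, and removing a server only remaps a fraction of documents.

```python
import bisect
import hashlib

# Illustrative consistent-hash ring for routing document IDs to
# WebSocket connection servers. Names ("ws-1", ...) and vnode count
# are assumptions, not a real deployment layout.

class Ring:
    def __init__(self, servers, vnodes=100):
        # Each server gets `vnodes` points on the ring to smooth the load.
        self._ring = sorted(
            (self._h(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, doc_id):
        # Walk clockwise to the first ring point at or after the doc's hash.
        i = bisect.bisect(self._keys, self._h(doc_id)) % len(self._keys)
        return self._ring[i][1]

ring = Ring(["ws-1", "ws-2", "ws-3"])
# Every client editing document "abc123" lands on the same server:
assert ring.server_for("abc123") == ring.server_for("abc123")
```

In practice the ring lives in a routing tier or service-discovery layer, and the stateful WebSocket servers behind it stay decoupled from the stateless API servers.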
3. Data Model for Operations and Snapshots
Storing every single keystroke as an individual operation creates massive write volume and makes document reconstruction slow. You need a hybrid approach balancing granular history with efficient reads.
Hints to consider:
- Periodic snapshot compaction where every N operations (e.g., every 100 edits) the full document state is saved
- Operation log as an append-only stream with eventual archival to cold storage after snapshot boundaries
- Efficient delta encoding for operations to minimize storage (e.g., store operation type, position offset, and changed content only)
- Separate hot storage (recent operations in memory or fast SSD) from cold storage (historical revisions in object storage)
4. Handling Network Partitions and Offline Edits
When a user loses connectivity or a region becomes partitioned, they must continue working seamlessly. Upon reconnection, their changes need to merge with updates from other users without conflicts.
Hints to consider:
- Client-side operation queue that buffers changes locally and flushes on reconnection
- Vector clocks or hybrid logical clocks to track causality and detect concurrent vs sequential edits
- Server-side conflict resolution service that applies OT or CRDT merge logic when clients reconnect with divergent states
- Rate limiting and validation on reconnection to prevent abuse from malicious clients submitting fabricated operation histories
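The causality check that decides the reconnection path can be sketched with a vector-clock comparison. The clock keys and the reconnection scenario are illustrative assumptions; the point is that comparing the client's clock against the server's distinguishes a safe fast-forward append from a genuinely concurrent divergence that must go through the OT/CRDT merge path.

```python
# Illustrative vector-clock comparison for reconnection handling.
# Keys ("server", "client") are hypothetical site identifiers.

def compare(vc_a, vc_b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

# The client went offline having seen server state 5, then queued two
# local edits; meanwhile another user advanced the server to state 7:
client_clock = {"server": 5, "client": 2}
server_clock = {"server": 7}

assert compare(client_clock, server_clock) == "concurrent"
# -> the reconnecting client's buffered queue must go through the
#    conflict-resolution service, not a plain fast-forward append.
```

Had the server not advanced (`{"server": 5}`), the comparison would return `"after"` and the client's queued operations could simply be appended.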
5. Scalability of the Broadcasting Fanout
When one user types, that edit must be pushed to potentially hundreds of connected clients. Broadcasting to all subscribers synchronously would create latency spikes and single points of failure.
Hints to consider:
- Pub/sub messaging systems like Redis Pub/Sub or Kafka for decoupling operation ingestion from fanout delivery
- Hierarchical fanout trees where a single operation is sent to regional edge nodes, which then broadcast to their local clients
- Selective subscription where clients register interest in specific document regions to reduce unnecessary traffic for large documents
- Back-pressure handling when slow clients lag behind, either buffering with bounded memory or dropping old operations in favor of periodic snapshots
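The bounded-buffer back-pressure strategy from the last hint can be sketched as follows. The buffer size and class names are illustrative assumptions; the behavior shown is that a slow client's queue drops the oldest operations once full and flags the client for a snapshot resync, so one laggard never forces the broadcaster to buffer unboundedly.

```python
from collections import deque

# Illustrative fanout with bounded per-subscriber buffers. A deque with
# maxlen silently discards from the opposite end when full; we detect
# the overflow and mark the client for a snapshot resync instead of
# trying to replay a gapped operation stream.

class Subscriber:
    def __init__(self, maxlen=8):
        self.queue = deque(maxlen=maxlen)
        self.needs_snapshot = False

    def push(self, op):
        if len(self.queue) == self.queue.maxlen:
            # Overflow: this client is too far behind to replay ops reliably.
            self.needs_snapshot = True
        self.queue.append(op)  # oldest op is dropped automatically

class DocumentChannel:
    def __init__(self):
        self.subscribers = []

    def broadcast(self, op):
        for sub in self.subscribers:
            sub.push(op)

chan = DocumentChannel()
fast, slow = Subscriber(), Subscriber()
chan.subscribers = [fast, slow]

for i in range(20):
    chan.broadcast(("insert", i))
    fast.queue.clear()  # the fast client drains immediately; the slow one never does

assert not fast.needs_snapshot
assert slow.needs_snapshot       # slow client fell behind and must resync
assert len(slow.queue) == 8      # only the most recent operations retained
```

Dropping old operations is safe precisely because the snapshot mechanism from the data-model section gives a lagging client a cheaper way to catch up than replaying every missed edit.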