Design Instagram
System Design | Must
Problem Statement
Design a real-time collaborative whiteboard application where multiple users can simultaneously draw, add shapes, and insert text on a shared canvas. The system must support dozens of concurrent users on a single board, streaming changes with minimal latency so collaborators see each other's edits within 200 ms. Users expect smooth interactions even when hundreds of drawing operations occur per second, and they should be able to view historical versions of the board.
The primary challenge is balancing real-time propagation of granular drawing events with the need to persist a consistent, recoverable state. Unlike traditional documents with discrete edits, a whiteboard generates a continuous stream of coordinate data, so you must optimize for low-latency broadcasting while managing storage efficiently. Interviewers want to see how you resolve concurrent edits with operational transformation (OT) or conflict-free replicated data types (CRDTs), how you choose between WebSockets and polling, and how you design a snapshotting strategy that supports history without overwhelming storage.
Key Requirements
Functional
- Real-time drawing -- Users can draw freehand paths, add shapes (rectangles, circles), and insert text blocks that appear instantly for all collaborators on the same board
- Presence indicators -- Display live cursors showing each user's current position and activity status
- Persistent history -- Save board state so users can reload a session days later and optionally view past snapshots
- Permission controls -- Board owners can grant view-only or edit access to specific users
Non-Functional
- Scalability -- Support 50 concurrent editors per board and 10,000 active boards at peak
- Reliability -- Ensure no drawing strokes are lost even if a user's connection drops temporarily
- Latency -- Propagate drawing operations to all participants within 200 ms under normal network conditions
- Consistency -- Guarantee eventual consistency so all users converge to the same board state despite concurrent edits
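A quick back-of-envelope calculation shows what these targets imply for connection count and fan-out. The sketch below uses the 50-editor and 10,000-board figures from the requirements; the per-editor message rate is an assumption made for illustration, and the totals are an upper bound that treats every editor as drawing at the same time.

```python
# Figures taken from the non-functional requirements above.
EDITORS_PER_BOARD = 50
ACTIVE_BOARDS = 10_000

# Assumption (not from the requirements): each editor sends this many
# batched messages per second while actively drawing.
ASSUMED_MSGS_PER_EDITOR_PER_SEC = 10


def capacity_estimate(editors=EDITORS_PER_BOARD,
                      boards=ACTIVE_BOARDS,
                      rate=ASSUMED_MSGS_PER_EDITOR_PER_SEC):
    """Return worst-case connection and message-rate figures."""
    connections = editors * boards
    inbound_per_board = editors * rate
    # Each inbound message is relayed to every other editor on the board.
    outbound_per_board = inbound_per_board * (editors - 1)
    return {
        "connections": connections,
        "inbound_per_board": inbound_per_board,
        "outbound_per_board": outbound_per_board,
        "outbound_total": outbound_per_board * boards,
    }


estimate = capacity_estimate()
for name, value in estimate.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the system holds 500,000 concurrent connections, and fan-out (not ingest) dominates the message volume, which is why batching and throttling matter so much in this design.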
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Real-Time Communication Protocol
Interviewers want to understand how you'll stream drawing operations between clients within the 200 ms latency target and how you'll handle network interruptions.
Hints to consider:
- Evaluate WebSockets for bidirectional streaming versus HTTP long-polling or server-sent events
- Discuss message batching to reduce overhead when users generate hundreds of mouse-move events per second
- Explain how clients buffer operations locally and reconcile after reconnecting
- Consider sticky sessions or pub/sub to route messages efficiently when boards span multiple servers
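The batching and reconnect hints above can be sketched as a small client-side outbox. This is a minimal illustration, not a prescribed design: the `Outbox` class, the `send` hook, and the per-client sequence numbers are all hypothetical names, and the sketch assumes the server acknowledges the highest sequence number it has stored and ignores duplicates it has already seen.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Outbox:
    """Buffers operations locally and resends unacknowledged ones after reconnect."""
    send: Callable[[List[dict]], None]  # hypothetical transport hook (e.g. a WebSocket write)
    connected: bool = True
    next_seq: int = 1
    pending: List[dict] = field(default_factory=list)  # recorded, not yet sent
    unacked: List[dict] = field(default_factory=list)  # sent, awaiting server ack

    def record(self, op: dict) -> None:
        """Queue one drawing operation with a per-client sequence number."""
        self.pending.append({"seq": self.next_seq, "op": op})
        self.next_seq += 1

    def flush(self) -> None:
        """Send everything queued so far as a single batch."""
        if not self.connected or not self.pending:
            return
        batch, self.pending = self.pending, []
        self.unacked.extend(batch)
        self.send(batch)

    def ack(self, seq: int) -> None:
        """Server confirmed everything up to and including `seq`."""
        self.unacked = [m for m in self.unacked if m["seq"] > seq]

    def on_disconnect(self) -> None:
        self.connected = False

    def on_reconnect(self) -> None:
        """Replay unacknowledged operations first, then anything drawn offline."""
        self.connected = True
        self.pending = self.unacked + self.pending
        self.unacked = []
        self.flush()


sent_batches = []
outbox = Outbox(send=sent_batches.append)
outbox.record({"type": "point", "x": 1, "y": 1})
outbox.record({"type": "point", "x": 2, "y": 2})
outbox.flush()          # one batch carrying seq 1 and 2
outbox.ack(1)           # server stored seq 1 before the connection dropped
outbox.on_disconnect()
outbox.record({"type": "point", "x": 3, "y": 3})
outbox.on_reconnect()   # resends seq 2, then sends seq 3
```

A client would typically call `flush` on a timer (for example once per animation frame) so that many mouse-move events travel in one message.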
2. Conflict Resolution and Operational Transformation
Concurrent edits can create conflicts when two users modify the same region simultaneously. Interviewers probe how you'll merge changes deterministically.
Hints to consider:
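- Introduce CRDTs, whose operations commute and converge without a central lock, or operational transformation (OT), which rewrites concurrent operations against each other and usually relies on a central server to order them
- Assign Lamport timestamps to get a deterministic total order that respects causality, or vector clocks when you also need to detect which events are truly concurrent
- Discuss tradeoffs between OT's implementation complexity and CRDTs' larger per-operation metadata and message payloads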
- Show how you'll handle edge cases like overlapping shapes or text insertion at the same coordinate
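One of the simplest CRDT-style options is a last-writer-wins map keyed by shape ID, ordered by Lamport timestamp with the replica ID as a tie-breaker. The sketch below is an illustration under that assumption; `LwwBoard` and the operation fields are hypothetical names, and last-writer-wins is deliberately lossy (one of two concurrent edits to the same shape is discarded), so it suits whole-shape updates better than collaborative text.

```python
from typing import Any, Dict, Tuple

Stamp = Tuple[int, str]  # (Lamport counter, replica id); the replica id breaks ties


class LwwBoard:
    """Last-writer-wins map of shape id -> value, one instance per replica."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.clock = 0
        self.shapes: Dict[str, Tuple[Stamp, Any]] = {}

    def local_set(self, shape_id: str, value: Any) -> dict:
        """Apply a local edit and return the operation to broadcast."""
        self.clock += 1
        op = {"shape_id": shape_id, "value": value,
              "stamp": (self.clock, self.replica_id)}
        self.apply(op)
        return op

    def apply(self, op: dict) -> None:
        """Merge a local or remote operation; safe to call in any order, any number of times."""
        stamp = tuple(op["stamp"])
        # Lamport rule: never let the local clock fall behind a stamp we have seen.
        self.clock = max(self.clock, stamp[0])
        current = self.shapes.get(op["shape_id"])
        if current is None or stamp > current[0]:
            self.shapes[op["shape_id"]] = (stamp, op["value"])

    def state(self) -> Dict[str, Any]:
        return {sid: value for sid, (_, value) in self.shapes.items()}


# Two users move the same rectangle at the same time.
alice, bob = LwwBoard("alice"), LwwBoard("bob")
op_a = alice.local_set("rect-1", {"x": 10, "y": 10})
op_b = bob.local_set("rect-1", {"x": 99, "y": 99})
alice.apply(op_b)
bob.apply(op_a)
print(alice.state() == bob.state())  # both replicas hold the same winner
```

Both stamps carry counter 1, so the replica ID decides the winner and both replicas keep Bob's position. Deletions would need a tombstone value rather than removing the key, otherwise a late-arriving older update could resurrect the shape.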
3. State Persistence and Snapshotting
A board session may last hours and accumulate millions of small operations. Interviewers look for a practical storage strategy that supports both live editing and historical playback.
Hints to consider:
- Use an event-sourcing pattern where you log every drawing operation, then periodically compact them into snapshots
- Store snapshots in object storage (S3) with metadata pointing to the last operation ID included
- Replay events on top of the most recent snapshot to reconstruct current state on page load
- Define a retention policy (e.g., keep hourly snapshots for 30 days) to balance cost and auditability
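The snapshot-plus-replay idea above can be sketched with an in-memory stand-in for the operation log and the snapshot store. Everything here is illustrative: `BoardStore`, `apply_op`, and the `upsert`/`delete` operation format are hypothetical, and a real system would keep the log in a durable store and the snapshots in object storage.

```python
from typing import Dict, List, Optional


def apply_op(state: Dict[str, dict], op: dict) -> None:
    """Apply one operation to a board state (hypothetical op format)."""
    if op["type"] == "upsert":
        state[op["shape_id"]] = op["shape"]
    elif op["type"] == "delete":
        state.pop(op["shape_id"], None)


class BoardStore:
    """Append-only operation log plus periodic snapshots for one board."""

    def __init__(self):
        self.log: List[dict] = []        # op ids are 1, 2, 3, ... in append order
        self.snapshots: List[dict] = []  # each: {"last_op_id": int, "state": dict}

    def append(self, op: dict) -> int:
        op_id = len(self.log) + 1
        self.log.append({**op, "op_id": op_id})
        return op_id

    def load(self, up_to: Optional[int] = None) -> Dict[str, dict]:
        """Rebuild the board as of op id `up_to` (default: the latest op)."""
        if up_to is None:
            up_to = len(self.log)
        # Start from the newest snapshot that does not go past the target.
        base = {"last_op_id": 0, "state": {}}
        for snap in self.snapshots:
            if base["last_op_id"] < snap["last_op_id"] <= up_to:
                base = snap
        state = dict(base["state"])
        # Replay only the operations recorded after that snapshot.
        for entry in self.log[base["last_op_id"]:up_to]:
            apply_op(state, entry)
        return state

    def snapshot(self) -> None:
        """Compact everything logged so far into a new snapshot."""
        self.snapshots.append({"last_op_id": len(self.log), "state": self.load()})


store = BoardStore()
store.append({"type": "upsert", "shape_id": "a", "shape": {"kind": "rect"}})
store.append({"type": "upsert", "shape_id": "b", "shape": {"kind": "circle"}})
store.snapshot()  # covers op ids 1-2
store.append({"type": "delete", "shape_id": "a"})
store.append({"type": "upsert", "shape_id": "c", "shape": {"kind": "text"}})
current = store.load()            # snapshot plus ops 3-4
as_of_op_2 = store.load(up_to=2)  # historical view
```

This sketch keeps the full log so any past op ID can be reconstructed. Under a retention policy, log entries older than the oldest retained snapshot could be deleted, at the cost of limiting history to snapshot granularity before that point.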
4. Handling High-Frequency Input
Drawing generates far more events than typing. Interviewers want to see how you'll throttle, batch, and optimize for this traffic pattern.
Hints to consider:
- Implement client-side throttling (e.g., emit at most one position update every 16 ms, roughly 60 Hz) to cap message rate
- Differentiate between transient events (cursor moves) and durable ones (stroke completion) to reduce database writes
- Use binary encoding (Protocol Buffers or MessagePack) instead of JSON to shrink payload sizes
- Discuss how you'll prioritize critical updates (shape placement) over non-critical ones (ephemeral cursor trails)
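The throttling and transient-versus-durable hints above might look like the following sketch. The `Throttle` class, the `route` function, and the event type names are hypothetical, and timestamps are passed in explicitly so the behavior is deterministic; a real client would read them from a monotonic clock.

```python
from typing import List, Optional

# Assumed event names for illustration: these must never be dropped.
DURABLE_TYPES = {"stroke_end", "shape_add", "text_insert"}


class Throttle:
    """Lets at most one event through per interval."""

    def __init__(self, interval_ms: float = 16.0):
        self.interval_ms = interval_ms
        self.last_emit_ms: Optional[float] = None

    def allow(self, now_ms: float) -> bool:
        if self.last_emit_ms is None or now_ms - self.last_emit_ms >= self.interval_ms:
            self.last_emit_ms = now_ms
            return True
        return False


def route(event: dict, now_ms: float, cursor_throttle: Throttle,
          broadcast: List[dict], persist: List[dict]) -> None:
    """Durable events are always persisted and broadcast; transient ones are throttled."""
    if event["type"] in DURABLE_TYPES:
        persist.append(event)
        broadcast.append(event)
    elif cursor_throttle.allow(now_ms):
        broadcast.append(event)  # transient: broadcast only, never written to storage


broadcast, persist = [], []
throttle = Throttle(interval_ms=16.0)
for t in (0, 5, 10, 16, 20, 32):
    route({"type": "cursor_move", "t": t}, t, throttle, broadcast, persist)
route({"type": "stroke_end", "stroke_id": "s1"}, 33, throttle, broadcast, persist)
```

Of the six cursor moves, only those at 0, 16, and 32 ms are broadcast, while the stroke completion bypasses the throttle and is the only event persisted. A fuller version would also send the final cursor position once the user stops moving, since a plain throttle can drop it.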
Suggested Approach
Step 1: Clarify Requirements
Confirm the scope of drawing primitives (freehand only, or also shapes and text?). Ask about expected board size (infinite canvas or fixed dimensions), whether undo/redo is required, and whether users can edit the same object simultaneously or whether optimistic locking is acceptable. Verify the latency target (200 ms end-to-end?) and whether mobile clients must be supported. Clarify retention needs for historical versions (full history forever, or time-boxed snapshots?). Establish expected concurrency per board (10 users, 50, or more?) and the total number of active boards at peak. These answers will guide protocol choices, snapshot frequency, and whether you need a CRDT or simpler last-write-wins semantics.