Design a distributed whiteboard system that enables multiple users to draw, sketch, and annotate collaboratively in real time. The system must support sessions with up to 100 concurrent participants, handle thousands of drawing strokes per second, and synchronize changes across all clients with minimal perceived latency (under 100ms). Users should see cursor positions, drawing operations, and object manipulations from other participants as they happen, similar to platforms like Miro, Figma, or Microsoft Whiteboard.
The core challenge lies in maintaining consistent state across distributed clients while dealing with network partitions, out-of-order messages, and conflicting concurrent edits. The system must efficiently transmit fine-grained drawing data, resolve conflicts deterministically, and scale to support thousands of active whiteboard sessions simultaneously without degrading user experience.
Real-time stroke synchronization -- Users must see drawing strokes, shape manipulations, and cursor movements from all participants as they occur, with minimal visible lag.
Persistent whiteboard state -- All drawing data, objects, and annotations must be durably stored and recoverable, allowing users to leave and rejoin sessions without data loss.
Conflict resolution -- When multiple users edit the same region simultaneously, the system must deterministically resolve conflicts and converge to a consistent state across all clients.
Session management -- Support creation of whiteboard sessions with access control, invitation mechanisms, and the ability to handle both temporary and long-lived collaborative spaces.
Scalability -- Handle 10,000 concurrent whiteboard sessions, with up to 100 participants per session and 5,000 drawing operations per second across all active sessions.
Reliability -- Ensure 99.9% uptime with automatic recovery from server failures; no client should lose uncommitted drawing data during transient network issues.
Latency -- Achieve end-to-end propagation latency under 100ms for drawing operations from one client to all others in the same session under normal network conditions.
Consistency -- Provide eventual consistency with deterministic conflict resolution; all clients viewing the same whiteboard must converge to an identical state within seconds of the last operation.
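A quick back-of-envelope calculation helps turn these targets into capacity numbers. The sketch below uses only the figures stated above, plus one labeled assumption: an average encoded operation size of 200 bytes, which is a placeholder and not a number from the requirements. It also treats every operation as if it occurred in a full 100-person session, so the fan-out figure is an upper bound.

```python
# Back-of-envelope capacity estimate derived from the stated targets.
SESSIONS = 10_000          # concurrent sessions (stated target)
MAX_PARTICIPANTS = 100     # participants per session (stated target)
OPS_PER_SECOND = 5_000     # drawing operations/sec across all sessions (stated target)
BYTES_PER_OP = 200         # ASSUMPTION: average encoded operation size in bytes

# Upper bound on simultaneously open client connections.
max_connections = SESSIONS * MAX_PARTICIPANTS

# Each operation is relayed to every other participant in its session.
# Worst case: every operation happens in a full session.
worst_case_fanout = OPS_PER_SECOND * (MAX_PARTICIPANTS - 1)

# Outbound bandwidth implied by that fan-out under the size assumption.
egress_bytes_per_second = worst_case_fanout * BYTES_PER_OP

print(f"max connections:       {max_connections:,}")
print(f"outbound messages/sec: {worst_case_fanout:,}")
print(f"egress MB/sec:         {egress_bytes_per_second / 1_000_000:.0f}")
```

Under these assumptions the inbound rate is modest, while fan-out dominates: roughly half a million outbound messages per second in the worst case, which is why broadcast cost usually drives the design more than ingest does.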
Based on real interview experiences, these are the areas interviewers probe most deeply:
The most critical design decision is how you propagate drawing operations between clients with minimal latency while maintaining consistent state. This tests your understanding of operational transformation, CRDTs, and real-time messaging protocols.
Evaluate a centralized server that broadcasts all operations against peer-to-peer synchronization, and compare the latency implications of each approach
Consider how WebSocket connections per session scale and whether you need message multiplexing or connection pooling strategies
Discuss operational transformation (OT) versus Conflict-free Replicated Data Types (CRDTs) for resolving concurrent edits and the tradeoffs in complexity and consistency guarantees
Think about optimistic UI updates on the client versus waiting for server acknowledgment and how to handle rollback scenarios
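To make the CRDT option concrete, here is a minimal sketch of one of the simplest CRDT-style structures: a last-writer-wins (LWW) register per whiteboard object. It assumes each operation carries a Lamport timestamp and a client ID, and the `Op` and `LwwBoard` names are hypothetical. Real whiteboards typically need richer types (for example, ordered sequences for z-order), so treat this as an illustration of deterministic convergence rather than a complete design.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Op:
    """A property update for one whiteboard object (hypothetical shape)."""
    object_id: str
    value: Any        # e.g. position or style payload
    lamport: int      # sender's logical clock
    client_id: str    # tie-breaker when clocks are equal

    @property
    def tag(self) -> Tuple[int, str]:
        # Total order over operations: clock first, then client ID.
        return (self.lamport, self.client_id)


class LwwBoard:
    """Keeps, per object, the operation with the highest (lamport, client_id)."""

    def __init__(self) -> None:
        self._state: Dict[str, Op] = {}

    def apply(self, op: Op) -> bool:
        """Apply an op; return True if it changed the visible state."""
        current = self._state.get(op.object_id)
        if current is None or op.tag > current.tag:
            self._state[op.object_id] = op
            return True
        return False  # stale or duplicate op: safely ignored

    def snapshot(self) -> Dict[str, Any]:
        return {oid: op.value for oid, op in self._state.items()}


# Two replicas receive the same concurrent ops in different orders.
op_alice = Op("rect-1", {"x": 10}, lamport=3, client_id="alice")
op_bob = Op("rect-1", {"x": 42}, lamport=3, client_id="bob")

replica_a, replica_b = LwwBoard(), LwwBoard()
for op in (op_alice, op_bob):
    replica_a.apply(op)
for op in (op_bob, op_alice):
    replica_b.apply(op)

# Both replicas converge on the same winner regardless of delivery order.
assert replica_a.snapshot() == replica_b.snapshot() == {"rect-1": {"x": 42}}
```

Because `apply` is commutative and idempotent, out-of-order and duplicated messages do not break convergence. The cost is that one of two concurrent edits to the same object is silently discarded, which is usually acceptable for properties like position but not for text.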
Whiteboard operations generate massive amounts of fine-grained data. Interviewers want to see how you minimize bandwidth while maintaining fidelity and responsiveness.
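One common way to shrink stroke payloads, offered here as an assumption since no specific technique is named above, is to delta-encode the points of a stroke: send the first point in full and every later point as an offset from its predecessor. Small offsets compress well with a variable-length integer encoding. The function names below are hypothetical, and the sketch assumes integer canvas coordinates.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def encode_deltas(points: List[Point]) -> List[Point]:
    """Keep the first point absolute; store the rest as (dx, dy) offsets."""
    if not points:
        return []
    encoded = [points[0]]
    for (prev_x, prev_y), (x, y) in zip(points, points[1:]):
        encoded.append((x - prev_x, y - prev_y))
    return encoded


def decode_deltas(encoded: List[Point]) -> List[Point]:
    """Rebuild absolute points by accumulating the offsets."""
    if not encoded:
        return []
    points = [encoded[0]]
    for dx, dy in encoded[1:]:
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points


stroke = [(1000, 2000), (1002, 2001), (1005, 2003), (1009, 2006)]
packed = encode_deltas(stroke)
print(packed)  # [(1000, 2000), (2, 1), (3, 2), (4, 3)]

# The encoding is lossless: decoding restores the original stroke.
assert decode_deltas(packed) == stroke
```

Delta encoding preserves fidelity exactly. It pairs naturally with batching several points into one message, which trades a few milliseconds of latency for far fewer frames on the wire.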
Users expect their work to be saved and accessible after network disruptions or intentional session exits. This area tests your understanding of durable storage and snapshot strategies.
Compare an append-only, versioned event log of operations with periodic full snapshots of whiteboard state, and weigh the tradeoffs in recovery time
Consider how to handle large whiteboards with thousands of objects and whether lazy loading or spatial indexing is needed for fast session initialization
Discuss strategies for compaction and garbage collection of historical operations while maintaining the ability to replay for new joiners
Think about offline mode support where local changes are queued and merged when connectivity is restored
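The event-log and snapshot options above are often combined: append every operation to a log, periodically fold the log into a snapshot, and let new joiners load the snapshot plus the remaining tail. The in-memory sketch below illustrates that flow. The `Event` and `BoardStore` names and the event shape are assumptions for illustration, and a real system would back both parts with durable storage.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    seq: int                 # monotonically increasing version number
    object_id: str
    data: Optional[dict]     # None means the object was deleted


@dataclass
class BoardStore:
    log: List[Event] = field(default_factory=list)
    snapshot_state: Dict[str, dict] = field(default_factory=dict)
    snapshot_seq: int = 0    # last sequence number folded into the snapshot

    def append(self, object_id: str, data: Optional[dict]) -> Event:
        """Append-only write path: assign the next sequence number."""
        last_seq = self.log[-1].seq if self.log else self.snapshot_seq
        event = Event(last_seq + 1, object_id, data)
        self.log.append(event)
        return event

    def load(self) -> Dict[str, dict]:
        """What a new joiner sees: snapshot plus replay of the log tail."""
        state = dict(self.snapshot_state)
        for event in self.log:
            if event.data is None:
                state.pop(event.object_id, None)
            else:
                state[event.object_id] = event.data
        return state

    def compact(self) -> None:
        """Fold the log into the snapshot so replay stays short."""
        self.snapshot_state = self.load()
        if self.log:
            self.snapshot_seq = self.log[-1].seq
        self.log.clear()


store = BoardStore()
store.append("stroke-1", {"color": "red"})
store.append("stroke-2", {"color": "blue"})
store.compact()                       # snapshot now covers seq 1..2
store.append("stroke-1", None)        # delete after the snapshot
store.append("stroke-3", {"color": "green"})

assert store.snapshot_seq == 2
assert store.load() == {"stroke-2": {"color": "blue"}, "stroke-3": {"color": "green"}}
```

Compacting more often shortens recovery and join time at the cost of more snapshot writes. Note that this compaction discards history, so undo or time-travel features would need the old log segments archived rather than cleared.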
Supporting thousands of concurrent sessions requires careful resource allocation and session routing. Interviewers look for your ability to design for multi-tenancy and horizontal scaling.
Consider session affinity and routing strategies to ensure all participants in a session connect to the same server instance or cluster partition
Discuss sharding approaches based on session ID and how to handle hot sessions that exceed single-server capacity
Think about rate limiting and fairness mechanisms to prevent a single high-activity session from starving others
Evaluate memory management strategies for holding active session state in-memory versus hybrid approaches with tiered storage
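For session affinity, one option is rendezvous (highest-random-weight) hashing on the session ID: every gateway computes the same owner for a session without shared state, and removing a server only moves the sessions that server owned. This is a sketch of that technique under assumed names (`route_session`, the `ws-*` server labels). It does not address hot sessions, which would need an extra mechanism such as splitting fan-out across relay nodes.

```python
import hashlib
from typing import Iterable


def _score(server: str, session_id: str) -> int:
    """Deterministic pseudo-random weight for a (server, session) pair."""
    digest = hashlib.sha256(f"{server}|{session_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def route_session(session_id: str, servers: Iterable[str]) -> str:
    """Pick the server with the highest score for this session.

    Raises ValueError if `servers` is empty.
    """
    return max(servers, key=lambda server: _score(server, session_id))


servers = ["ws-1", "ws-2", "ws-3"]
owner = route_session("board-123", servers)

# Routing does not depend on the order in which servers are listed.
assert owner == route_session("board-123", reversed(servers))

# Removing a server that is not the owner leaves the session where it is.
others = [s for s in servers if s != owner]
assert route_session("board-123", [owner, others[0]]) == owner
```

Compared with `hash(session_id) % len(servers)`, this avoids reshuffling most sessions whenever the server pool changes, which matters when each move forces up to 100 clients to reconnect.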