Design Multi-player Figma
System Design · Must
Problem Statement
Design a collaborative design platform like Figma that allows multiple users to edit documents simultaneously with real-time updates, conflict resolution, and WebSocket-based synchronization. Users open a shared canvas, manipulate vector shapes, text layers, and component instances, and see every collaborator's cursor, selection, and edits appear within milliseconds. The platform must also support version history, undo/redo, and file sharing with granular permissions.
The core technical challenge is reconciling concurrent edits from geographically distributed users without data loss or visible glitches. Two designers might drag the same rectangle at the same time, resize overlapping frames, or paste content into the same artboard simultaneously. Your system must resolve these conflicts deterministically so all participants converge on an identical document state. Beyond editing, the system stores large binary assets (images, fonts, illustrations) referenced by the design file, and must deliver these efficiently without blocking the real-time operation stream.
At MongoDB, a mid-level candidate was asked to "Design Google Docs," which shares the same real-time collaboration fundamentals. Expect interviewers to probe deeply into your concurrency model, WebSocket scaling approach, and the operational trade-offs between Operational Transformation and CRDTs.
Key Requirements
Functional
- Real-time collaborative editing -- multiple users edit the same design file simultaneously and see each other's changes within 100-200ms
- Live presence -- each participant sees real-time cursor positions, text selections, active layers, and who is currently viewing the file
- Sharing and permissions -- file owners invite collaborators with view-only or edit access, and can revoke permissions at any time
- Version history -- users browse a timeline of previous file states, compare snapshots, and restore any earlier version
Non-Functional
- Scalability -- support hundreds of thousands of concurrent editing sessions with 2-50 participants each across global regions
- Latency -- propagate edit operations to all session participants within 150ms at p95; presence updates (cursors, selections) within 50ms
- Consistency -- guarantee eventual convergence across all clients with causal ordering of dependent operations and idempotent application
- Reliability -- zero data loss on server failures; automatic reconnection with state reconciliation after network interruptions
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Concurrency Model: OT vs. CRDT
The defining technical decision in this system is how you handle concurrent edits. Interviewers want to see that you understand both Operational Transformation and Conflict-free Replicated Data Types and can reason about the trade-offs.
Hints to consider:
- Operational Transformation transforms incoming operations against concurrent ones to maintain consistency, but requires a central server to establish a canonical operation order
- CRDTs (like Yjs or Automerge) guarantee convergence through commutative operations and logical timestamps, enabling peer-to-peer and offline-first editing
- OT has lower per-operation metadata overhead but introduces server round-trip latency; CRDTs accumulate tombstones that require periodic compaction
- For a Figma-like product with structured objects (shapes, frames, text) rather than free-form text, a hybrid approach using per-object last-writer-wins with causal ordering can simplify the concurrency model significantly
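The per-object last-writer-wins idea in the last hint can be sketched as a Lamport-clocked register per property: concurrent edits to different properties both survive, and concurrent edits to the same property resolve deterministically. This is a minimal illustration under those assumptions, not Figma's actual implementation; all class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LWWRegister:
    """Last-writer-wins register for a single object property."""
    value: object = None
    clock: int = 0       # Lamport timestamp of the last accepted write
    client_id: str = ""  # tie-breaker when clocks are equal

    def apply(self, value, clock, client_id):
        # Accept the write only if it is logically newer; ties are broken by
        # client id so every replica converges on the same value regardless
        # of the order in which operations arrive.
        if (clock, client_id) > (self.clock, self.client_id):
            self.value, self.clock, self.client_id = value, clock, client_id
            return True
        return False


class ShapeObject:
    """One canvas object. Each property is an independent LWW register, so a
    concurrent move (x) and recolor (fill) by two users both take effect."""

    def __init__(self):
        self.props = {}

    def apply_op(self, prop, value, clock, client_id):
        reg = self.props.setdefault(prop, LWWRegister())
        return reg.apply(value, clock, client_id)
```

For example, if alice and bob both drag the same rectangle at Lamport time 5, every replica applies both operations (in either order) and converges on the same winner.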
2. WebSocket Scaling and Session Routing
Real-time collaboration depends on persistent WebSocket connections with low-latency message routing. Interviewers probe how you partition sessions across servers, handle reconnections, and prevent message loss.
Hints to consider:
- Route all participants of a single editing session to the same WebSocket server (or small cluster) using consistent hashing on session ID to minimize cross-server coordination
- If participants span multiple gateway servers, use Redis Pub/Sub to fan out operations between servers within the same session
- Implement heartbeat mechanisms with exponential backoff reconnection so clients detect disconnections within seconds and recover without losing buffered operations
- Plan for graceful draining during deployments: pause accepting new connections, wait for active sessions to close or migrate, then shut down
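The consistent-hashing routing from the first hint can be sketched as a small hash ring: every participant hashing the same session ID lands on the same gateway, and adding or removing a gateway remaps only a fraction of sessions. Server names and the virtual-node count below are illustrative assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps a session ID to a WebSocket gateway via consistent hashing."""

    def __init__(self, servers, vnodes=100):
        # Each server gets many virtual nodes so load spreads evenly.
        self._ring = []  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def route(self, session_id):
        # First ring entry at or after the session's hash, wrapping around.
        # All participants of one session resolve to the same gateway, so
        # fan-out within a session never needs cross-server coordination.
        idx = bisect.bisect(self._ring, (self._hash(session_id), "")) % len(self._ring)
        return self._ring[idx][1]
```

A lookup is O(log n); when gateways span a session anyway (e.g., during rebalancing), the Redis Pub/Sub fan-out from the second hint covers the gap.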
3. Operation Persistence and Version History
Every edit must be durable for undo/redo, version history, and crash recovery. Interviewers want to see how you balance the real-time hot path with durable storage.
Hints to consider:
- Persist every operation to an append-only log (e.g., Kafka, or a DynamoDB table consumed via DynamoDB Streams) in the order the server applied it, giving you a complete audit trail
- Create periodic snapshots of the full document state (every N operations or every T seconds) so recovery does not require replaying the entire history
- Store snapshots in object storage (S3) and index them by (file_id, version_number) for fast retrieval during version history browsing
- Implement garbage collection that removes old operation log entries after a snapshot covers them, controlling storage growth
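The snapshot-plus-log pattern above can be sketched with in-memory stand-ins for the log (Kafka) and snapshot store (S3). Recovery starts from the newest snapshot and replays only the log tail; compaction drops entries a snapshot already covers. All names and the snapshot interval are illustrative, and a real system would persist deep copies rather than in-memory references.

```python
class DocumentStore:
    SNAPSHOT_INTERVAL = 100  # snapshot every N operations

    def __init__(self):
        self.op_log = []     # append-only (seq, op) entries
        self.snapshots = {}  # seq -> document state at that point

    def append(self, op, apply_fn, state):
        seq = len(self.op_log)
        self.op_log.append((seq, op))
        state = apply_fn(state, op)
        if (seq + 1) % self.SNAPSHOT_INTERVAL == 0:
            self.snapshots[seq] = state  # durable checkpoint
        return state

    def recover(self, apply_fn, initial_state):
        # Start from the newest snapshot, replay only operations after it.
        if self.snapshots:
            base_seq = max(self.snapshots)
            state = self.snapshots[base_seq]
        else:
            base_seq, state = -1, initial_state
        for seq, op in self.op_log:
            if seq > base_seq:
                state = apply_fn(state, op)
        return state

    def compact(self):
        # GC: drop log entries already covered by the newest snapshot.
        if self.snapshots:
            base_seq = max(self.snapshots)
            self.op_log = [e for e in self.op_log if e[0] > base_seq]
```

With a counter as the "document" and `+op` as the apply function, 250 appends leave snapshots at sequence 99 and 199, and recovery replays only the last 50 operations.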
4. Large Asset Handling
Design files reference images, icons, and fonts that can be megabytes in size. Interviewers want to see that you keep heavy binary data off the real-time operation channel.
Hints to consider:
- Upload assets directly to object storage via presigned URLs and reference them in operations by content-addressed hash, keeping the operation payload small
- Serve assets through a CDN with long cache TTLs since content-addressed URLs are immutable
- Deduplicate assets across files and teams using content hashing so the same image uploaded by different users is stored only once
- Lazy-load assets on the client: fetch only the assets visible in the current viewport and prefetch nearby ones based on scroll position
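Content-addressed storage with deduplication can be sketched as follows. The in-memory dict stands in for object storage, and the CDN URL format is a made-up example: because the URL is derived from the bytes, identical uploads dedupe for free and the URL is immutable, so a CDN can cache it indefinitely.

```python
import hashlib


class AssetStore:
    """Content-addressed asset storage: the key is the hash of the bytes."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes (stand-in for S3)

    def upload(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # setdefault: if another user already uploaded identical bytes,
        # nothing new is stored -- dedup across files and teams.
        self.blobs.setdefault(digest, data)
        # Immutable, cacheable reference; operations carry only this string,
        # keeping binary data off the real-time channel.
        return f"https://cdn.example.com/assets/{digest}"

    def stored_count(self) -> int:
        return len(self.blobs)
```

An edit operation then references the asset by this URL (or bare hash), so the WebSocket stream only ever carries small JSON payloads.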