Design Google Docs — Fountain

Reference Answer

For a full example answer with detailed architecture diagrams and deep dives, see our Design Slack guide. While the Slack guide focuses on messaging, many of the same patterns around WebSocket infrastructure, real-time fanout, and presence tracking apply directly to collaborative document editing.

Also review the Caching, Message Queues, and Databases building blocks for background on in-memory state management, event-driven change propagation, and durable storage for operation logs.

Problem Statement

Design a collaborative document editing service similar to Google Docs where multiple users can create, edit, and share documents in real-time with concurrent editing capabilities. Users should see each other's cursors, changes should merge automatically, and access should be controlled through sharing and permissions.

The core challenge is handling real-time updates with high-contention writes and conflict resolution. When multiple users edit the same paragraph simultaneously, the system must guarantee that all clients converge to the same document state without losing any edits. You need to reason about collaboration algorithms (operational transformation or CRDTs), stateful WebSocket connection management, storage and versioning of document content, and search with permission enforcement. Expect to design for sub-200ms propagation of edits between collaborators while maintaining durability and consistency for the document history.

Key Requirements

Functional

Collaborative editing -- multiple users can edit the same document simultaneously with real-time synchronization of changes, cursor positions, and text selections
Sharing and permissions -- users can share documents with view, comment, or edit access, and the system enforces these permissions consistently across all entry points
Version history -- the system maintains a complete history of changes, allowing users to browse past versions and revert to any previous state
Full-text search -- users can search across their documents by title and content, with results filtered to only include documents they have permission to access

Non-Functional

Scalability -- support millions of active documents, each with up to 50 simultaneous editors
Reliability -- no data loss even during server failures or network partitions, with 99.9% uptime for editing services
Latency -- edits propagate to all collaborators within 200ms under normal network conditions
Consistency -- eventual consistency for collaborative edits with guaranteed convergence; strong consistency for permissions and sharing metadata

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Conflict Resolution Strategy

Interviewers expect you to choose a concrete algorithm for merging concurrent edits rather than hand-waving "we'll handle conflicts." Simply mentioning operational transformation or CRDTs is not enough -- you need to explain how operations are represented, transformed, and applied.

Hints to consider:

Represent edits as operations (insert at position N, delete M characters at position P) rather than full document snapshots to minimize bandwidth
Operational transformation, as commonly deployed, relies on a central server to assign sequence numbers and transform concurrent operations against each other; CRDTs allow peer-to-peer convergence but use more memory for tombstones and per-character metadata
Walk through a concrete example: if User A inserts "hello" at position 5 while User B deletes 3 characters starting at position 2, show how position indexes are adjusted after transformation
Cursor positions and selections must be transformed alongside document operations so collaborators see accurate positions
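
The worked example in the hints above can be sketched in a few lines of Python. This is a minimal illustration for plain text, not a production OT engine: the names `Insert`, `Delete`, `transform_insert`, `transform_delete`, and `transform_cursor` are made up for this sketch, and only the insert-versus-delete pairing from the example is covered.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int
    length: int


Op = Union[Insert, Delete]


def apply(doc: str, op: Op) -> str:
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + op.length:]


def transform_insert(ins: Insert, applied: Delete) -> Insert:
    """Rewrite an insert so it can run after a concurrent delete was applied."""
    if ins.pos <= applied.pos:
        return ins  # insert sits at or left of the deleted range: unaffected
    if ins.pos >= applied.pos + applied.length:
        return Insert(ins.pos - applied.length, ins.text)  # shift left
    return Insert(applied.pos, ins.text)  # target was deleted: collapse to range start


def transform_delete(dele: Delete, applied: Insert) -> List[Delete]:
    """Rewrite a delete so it can run after a concurrent insert was applied."""
    if applied.pos <= dele.pos:
        return [Delete(dele.pos + len(applied.text), dele.length)]  # shift right
    if applied.pos >= dele.pos + dele.length:
        return [dele]  # insert landed to the right: unaffected
    # Insert landed inside the deleted range: delete around the inserted text.
    left = applied.pos - dele.pos
    return [Delete(dele.pos, left),
            Delete(dele.pos + len(applied.text), dele.length - left)]


def transform_cursor(cursor: int, op: Op) -> int:
    """Move a collaborator's cursor so it stays on the same logical spot."""
    if isinstance(op, Insert):
        # Policy choice in this sketch: an insert at the cursor pushes it right.
        return cursor + len(op.text) if op.pos <= cursor else cursor
    if cursor <= op.pos:
        return cursor
    return max(op.pos, cursor - op.length)


doc = "0123456789"
a = Insert(5, "hello")   # User A
b = Delete(2, 3)         # User B

# Replica that applied A first, then B transformed against A.
via_a = apply(doc, a)
for d in transform_delete(b, a):
    via_a = apply(via_a, d)

# Replica that applied B first, then A transformed against B.
via_b = apply(apply(doc, b), transform_insert(a, b))

print(via_a, via_b)  # both replicas print 01hello56789
```

A's insert moves from position 5 to position 2 on the replica that already removed three characters, while B's delete is unchanged on the other replica because the inserted text lands to its right.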

2. Real-Time Communication Architecture

Collaborative editing demands persistent, bidirectional, low-latency connections between clients and servers. Interviewers want to see how you scale WebSocket infrastructure and handle connection lifecycle events.

Hints to consider:

Route all editors of the same document to the same collaboration server instance to avoid distributed consensus on operation ordering
Use Redis pub/sub or a similar mechanism to broadcast operations if editors span multiple server instances
Implement heartbeat-based liveness detection with buffering of operations during brief disconnections, falling back to full resync for longer outages
Separate presence data (cursor positions, who is online) from document operations to reduce message volume on the critical editing path
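
One way to picture the "buffer for brief disconnections, resync for longer ones" hint is a bounded window of recent operations per document. The sketch below is an assumption-laden illustration: `RecentOps`, its `capacity`, and the string payloads are hypothetical, and heartbeat detection itself is left out.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

SeqOp = Tuple[int, str]  # (server sequence number, serialized operation)


@dataclass
class RecentOps:
    """Bounded in-memory window of the latest sequenced operations for one document."""
    capacity: int = 1000  # assumed window size
    ops: Deque[SeqOp] = field(default_factory=deque)

    def append(self, seq: int, payload: str) -> None:
        self.ops.append((seq, payload))
        if len(self.ops) > self.capacity:
            self.ops.popleft()  # oldest operation falls out of the window

    def catch_up(self, last_seen_seq: int) -> Tuple[str, List[SeqOp]]:
        """Decide how a reconnecting client catches up.

        Returns ("replay", missed_ops) when the gap is still in the window,
        otherwise ("resync", []) so the client reloads the full document.
        """
        if not self.ops:
            return "resync", []  # nothing buffered: cannot prove the client is current
        if last_seen_seq >= self.ops[-1][0]:
            return "replay", []  # client missed nothing
        oldest_seq = self.ops[0][0]
        if last_seen_seq < oldest_seq - 1:
            return "resync", []  # some missed operations were already evicted
        return "replay", [(s, p) for s, p in self.ops if s > last_seen_seq]


buf = RecentOps(capacity=3)
for seq in range(1, 6):
    buf.append(seq, f"op{seq}")   # window now holds sequence numbers 3, 4, 5

print(buf.catch_up(4))  # ('replay', [(5, 'op5')])
print(buf.catch_up(1))  # ('resync', [])
```

The window size trades server memory against how long a client can be offline before paying for a full reload.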

3. Storage, Versioning, and Search

Documents need durable storage with efficient versioning and permission-aware search. Interviewers probe whether you understand the tradeoffs between storing full snapshots versus operation logs.

Hints to consider:

Store periodic snapshots of the full document plus an append-only log of operations between snapshots; this enables fast loading without replaying the entire history
Partition operation logs by document ID and compact or archive old operations on a schedule
Index document content in Elasticsearch with ACL metadata so search queries can filter by permission without leaking private content
Propagate permission changes to the search index promptly to prevent stale access after a document is unshared
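
The snapshot-plus-log hint can be illustrated with a small in-memory sketch. The `Splice` and `Snapshot` types and the `load_version` helper are hypothetical names; a real system would read both from durable storage, and this version assumes plain text with snapshots and log entries sorted by sequence number.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Splice:
    seq: int      # server-assigned sequence number
    pos: int      # where the edit starts
    remove: int   # number of characters deleted at pos
    text: str     # text inserted at pos (may be empty)


@dataclass(frozen=True)
class Snapshot:
    seq: int      # last operation already folded into content
    content: str


def load_version(snapshots: List[Snapshot], log: List[Splice], target_seq: int) -> str:
    """Rebuild the document as of target_seq from the nearest earlier snapshot."""
    idx = bisect_right([s.seq for s in snapshots], target_seq) - 1
    if idx < 0:
        raise ValueError("no snapshot at or before the requested version")
    base = snapshots[idx]
    doc = base.content
    for op in log:
        if base.seq < op.seq <= target_seq:  # replay only the gap after the snapshot
            doc = doc[:op.pos] + op.text + doc[op.pos + op.remove:]
    return doc


log = [
    Splice(1, 0, 0, "hello"),
    Splice(2, 5, 0, " world"),
    Splice(3, 0, 5, "howdy"),
    Splice(4, 11, 0, "!"),
]
snapshots = [Snapshot(0, ""), Snapshot(2, "hello world")]

print(load_version(snapshots, log, 4))  # howdy world!
print(load_version(snapshots, log, 1))  # hello
```

Loading version 4 replays only operations 3 and 4 on top of the snapshot at sequence 2, which is the point of the strategy: load cost is bounded by the snapshot interval rather than by the length of the full history.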

4. Permission Enforcement in Real Time

Sharing and permission changes can happen while a document is being actively edited. Interviewers want to see how you handle revocation during live sessions.

Hints to consider:

Enforce permissions at the WebSocket subscription layer: when a user's access is revoked, close their editing connection and stop broadcasting operations to them
Cache ACLs in the collaboration server's memory for fast checks, but subscribe to a change feed so revocations take effect within seconds
Search results, file listings, and link previews must all respect permissions consistently to avoid information leaks
Model permissions as owner, editor, commenter, and viewer roles with inheritance for shared folders
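
A rough sketch of the cached-ACL idea follows, assuming the four roles form a simple ordering and that some change feed delivers (document, user, new role) events. `AclCache`, `on_acl_change`, and the `close_session` hook are hypothetical names for illustration; folder inheritance is omitted.

```python
from enum import IntEnum
from typing import Callable, Dict, List, Optional, Tuple


class Role(IntEnum):
    VIEWER = 1
    COMMENTER = 2
    EDITOR = 3
    OWNER = 4


class AclCache:
    """In-memory ACL copy on a collaboration server, updated from a change feed."""

    def __init__(self, close_session: Callable[[str, str], None]) -> None:
        self._roles: Dict[Tuple[str, str], Role] = {}
        # Hypothetical hook that drops the user's WebSocket for that document.
        self._close_session = close_session

    def on_acl_change(self, doc_id: str, user_id: str, role: Optional[Role]) -> None:
        """Handle one change-feed event; role=None means access was revoked."""
        key = (doc_id, user_id)
        if role is None:
            self._roles.pop(key, None)
            self._close_session(doc_id, user_id)  # stop broadcasting immediately
        else:
            self._roles[key] = role  # covers upgrades and downgrades

    def allows(self, doc_id: str, user_id: str, needed: Role) -> bool:
        return self._roles.get((doc_id, user_id), 0) >= needed


closed: List[Tuple[str, str]] = []
cache = AclCache(lambda doc_id, user_id: closed.append((doc_id, user_id)))

cache.on_acl_change("doc1", "bob", Role.EDITOR)
can_edit_before = cache.allows("doc1", "bob", Role.EDITOR)   # True

cache.on_acl_change("doc1", "bob", None)                     # owner unshares
can_view_after = cache.allows("doc1", "bob", Role.VIEWER)    # False
```

With this shape, a downgrade from editor to viewer keeps the session open but causes `allows(..., Role.EDITOR)` to reject further edit operations, while a full revocation closes the connection.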

Suggested Approach

Step 1: Clarify Requirements

Confirm the scope and constraints. Ask about the expected number of concurrent editors per document and whether the platform needs to support rich formatting or just plain text. Clarify whether offline editing and conflict resolution on reconnection are in scope. Verify latency expectations for edit propagation and whether version history must support fine-grained or only snapshot-level browsing. Establish the consistency model: can collaborators briefly see different states, or must every edit be globally ordered?

Step 2: High-Level Architecture

Sketch the core components:

API Gateway -- authentication and REST operations (create, share, search)
Collaboration Servers -- a fleet that maintains WebSocket connections and runs the OT/CRDT engine
Document Store (PostgreSQL) -- metadata, permissions, and snapshots
Operation Log Store (append-only) -- the sequence of edits per document
Search Index (Elasticsearch) -- permission-aware full-text search
Presence Service (backed by Redis) -- online users and cursor positions

Show how clients connect via WebSocket to the collaboration server assigned to their document, which durably appends each operation to the operation log and broadcasts the transformed operation to all other connected editors.

Step 3: Deep Dive on Collaborative Editing

Walk through the end-to-end flow of two users editing the same paragraph. User A types "hello" at position 10 while User B deletes 3 characters at position 5. Each client sends its operation tagged with the document version it was based on. The collaboration server receives both operations, assigns global sequence numbers, and transforms each operation against any concurrent operations sequenced ahead of it. If User B's delete is sequenced first, User A's insert shifts from position 10 to position 7; if User A's insert is sequenced first, User B's delete needs no adjustment because it lies entirely before the insertion point. The server then broadcasts the transformed operations, and each client transforms incoming operations against its own unacknowledged edits before applying them, so both clients reach the same final document state. Explain that the server is the authority for ordering, which simplifies correctness at the cost of making the server a single point of coordination per document. Discuss how you mitigate this by assigning one collaboration server per document (or small set of documents) and failing over to a standby if the primary crashes.
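
The server side of this flow can be sketched as follows. This is a simplified illustration rather than a complete OT implementation: `DocumentSession`, `submit`, and `shift` are hypothetical names, text is plain, and `shift` only handles edits that do not overlap (overlapping cases raise `NotImplementedError` instead of guessing).

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class Op:
    kind: str          # "ins" or "del"
    pos: int
    text: str = ""     # inserted text for "ins"
    length: int = 0    # deleted character count for "del"
    seq: int = 0       # filled in by the server


def shift(op: Op, earlier: Op) -> Op:
    """Adjust op so it applies after `earlier`, which its sender had not seen."""
    if earlier.kind == "ins":
        if earlier.pos <= op.pos:
            return replace(op, pos=op.pos + len(earlier.text))
        if op.kind == "del" and earlier.pos < op.pos + op.length:
            raise NotImplementedError("insert inside a concurrent delete")
        return op
    # earlier is a delete
    if earlier.pos + earlier.length <= op.pos:
        return replace(op, pos=op.pos - earlier.length)
    if op.kind == "ins" and op.pos <= earlier.pos:
        return op
    if op.kind == "del" and op.pos + op.length <= earlier.pos:
        return op
    raise NotImplementedError("overlapping edits")


@dataclass
class DocumentSession:
    """Single authority for one document: orders, transforms, and applies ops."""
    content: str
    log: List[Op] = field(default_factory=list)

    def submit(self, op: Op, base_version: int) -> Op:
        # Operations 1..base_version were seen by the sender; transform past the rest.
        for earlier in self.log[base_version:]:
            op = shift(op, earlier)
        op = replace(op, seq=len(self.log) + 1)
        if op.kind == "ins":
            self.content = self.content[:op.pos] + op.text + self.content[op.pos:]
        else:
            self.content = self.content[:op.pos] + self.content[op.pos + op.length:]
        self.log.append(op)  # a real server would append durably, then broadcast
        return op


insert_a = Op("ins", 10, text="hello")   # User A, based on version 0
delete_b = Op("del", 5, length=3)        # User B, based on version 0

a_first = DocumentSession("abcdefghijklmnop")
a_first.submit(insert_a, base_version=0)
a_first.submit(delete_b, base_version=0)

b_first = DocumentSession("abcdefghijklmnop")
b_first.submit(delete_b, base_version=0)
sequenced_a = b_first.submit(insert_a, base_version=0)

print(a_first.content, b_first.content)  # both print abcdeijhelloklmnop
print(sequenced_a.pos)                   # 7
```

Whichever operation arrives first, the document ends in the same state; only the operation sequenced second is rewritten, and in this example that rewrite is visible when the insert follows the delete.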

Step 4: Address Secondary Concerns

Cover storage by explaining the snapshot-plus-delta strategy: every N operations (or every few minutes), persist a full document snapshot, then compact or archive the operation log prefix that the snapshot covers. Keep older snapshots, because version history queries load the nearest earlier snapshot and replay subsequent operations; how much of the fine-grained log you retain determines whether history is browsable per edit or only per snapshot. Discuss search: use change data capture from the document store to keep Elasticsearch updated, embedding ACL metadata in the index so queries filter by user permissions. Address scaling: horizontally partition documents across collaboration servers using consistent hashing, with Redis pub/sub for cross-server communication when needed. Mention monitoring: track operation propagation latency, WebSocket connection health, operation log growth, and search index freshness. Finally, cover disaster recovery by replicating the operation log and snapshots across availability zones.
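
For the scaling point, a consistent-hash ring is one plausible way to map document IDs to collaboration servers. The sketch below is illustrative only: `HashRing`, the server names, and the choice of 64 virtual nodes per server are assumptions, and a production system would also need membership changes and session handoff.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, servers: List[str], vnodes: int = 64) -> None:
        self._points: List[int] = []
        self._owner: Dict[int, str] = {}
        for server in servers:
            for i in range(vnodes):  # virtual nodes smooth out the distribution
                point = _hash(f"{server}#{i}")
                self._owner[point] = server
                bisect.insort(self._points, point)

    def server_for(self, doc_id: str) -> str:
        """Return the first server point clockwise from the document's hash."""
        idx = bisect.bisect_right(self._points, _hash(doc_id)) % len(self._points)
        return self._owner[self._points[idx]]


ring3 = HashRing(["collab-1", "collab-2", "collab-3"])
ring4 = HashRing(["collab-1", "collab-2", "collab-3", "collab-4"])

docs = [f"doc-{i}" for i in range(1000)]
moved = [d for d in docs if ring3.server_for(d) != ring4.server_for(d)]

# Every document that changes owner moves to the newly added server.
print(all(ring4.server_for(d) == "collab-4" for d in moved))  # True
```

The property checked at the end is why consistent hashing suits stateful collaboration servers: adding capacity relocates only the documents claimed by the new server, so most live editing sessions stay where they are.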

Real Interview Quotes

"Design a system that allows multiple users to read and write to a single file simultaneously."

Related Learning

Slack -- real-time messaging architecture with WebSocket fanout and presence tracking
Message Queues -- event-driven operation broadcasting and change propagation
Caching -- in-memory document state and ACL caching for low-latency collaboration
Databases -- PostgreSQL for metadata, snapshots, and append-only operation logs
Search Systems -- Elasticsearch for permission-aware full-text document search