Design Google Docs

Problem Statement

Design a collaborative document editing service similar to Google Docs where multiple users can create, edit, and share documents in real-time. Users working on the same document should see each other's changes appear within milliseconds, with proper conflict resolution when two people edit overlapping text simultaneously. The system must support documents ranging from brief meeting notes to lengthy reports with rich formatting, embedded images, and tables, while handling anywhere from a handful of editors on a private document to hundreds of concurrent collaborators on a company-wide announcement.

The core difficulty is maintaining consistency across distributed clients that can make edits at any time, including during network partitions. You need to choose a concrete conflict resolution algorithm (Operational Transformation or CRDTs), design a WebSocket infrastructure that scales to millions of active sessions, and build a storage layer that supports both real-time collaboration and complete version history. Bloomberg interviews emphasize this problem because real-time collaborative tools mirror the low-latency, high-throughput data systems central to the Bloomberg Terminal ecosystem.

Key Requirements

Functional

Real-time collaborative editing -- multiple users can edit the same document simultaneously, with changes propagating to all participants within 200 milliseconds under normal conditions
Conflict resolution -- when users edit overlapping content concurrently, the system must merge changes deterministically without losing either user's intent
Presence and cursors -- users can see who else is viewing or editing the document, including live cursor positions and text selections
Version history and restore -- maintain a complete edit history so users can browse past versions, compare changes between versions, and restore to any previous state
Permissions and sharing -- document owners can set access levels (view, comment, edit) and share via links or direct invitations with role-based access control

Non-Functional

Scalability -- support 10 million daily active users across 100 million documents, with up to 1,000 concurrent editors on popular documents
Reliability -- 99.9 percent uptime with automatic failover and no loss of acknowledged edits during server failures or network partitions
Latency -- sub-200ms edit propagation to all active collaborators; p99 latency under 500ms for geographically distributed users
Consistency -- eventual consistency with causal ordering guarantees, ensuring all clients converge to an identical document state within seconds

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Conflict Resolution Algorithm

This is the heart of the problem. Interviewers want to see that you can explain how concurrent edits are merged deterministically. Simply naming "OT" or "CRDT" is not sufficient -- you need to walk through how operations are represented, transformed, and applied.

Hints to consider:

With Operational Transformation, a central server serializes operations, assigns sequence numbers, and transforms concurrent operations against each other before broadcasting; show a concrete example of two users inserting text at the same position
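With sequence CRDTs such as those in Yjs or Automerge, each inserted element gets a globally unique ID (typically a replica ID plus a logical counter) and records its position relative to neighboring elements; concurrent inserts at the same position are ordered deterministically by comparing IDs, enabling peer-to-peer sync without a central serializer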
OT typically produces smaller payloads and keeps per-character metadata off the client, but requires a reliable central server for ordering; CRDTs handle offline editing and network partitions more gracefully but carry larger metadata overhead
Address how undo/redo operations work when interleaved with other users' edits, since reverting a local change must not undo someone else's work
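
The same-position insert case mentioned above can be sketched in a few lines. This is a minimal illustration, not a production OT engine: it handles only inserts, and the `Insert` type, the `site` field, and the tie-breaking rule (lower site ID goes first) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text is inserted
    text: str   # text to insert
    site: str   # hypothetical client ID, used only to break ties


def apply(doc: str, op: Insert) -> str:
    """Apply an insert to a document string."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `other` has already been applied.

    If `other` landed before `op` (or at the same index with a lower site ID),
    shift `op` right by the length of the inserted text.
    """
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


base = "Hello world"
a = Insert(5, ",", "alice")   # Alice inserts at index 5
b = Insert(5, "!", "bob")     # Bob concurrently inserts at index 5

# Alice's replica applies her own op first, then Bob's op transformed against hers.
doc_alice = apply(apply(base, a), transform(b, a))
# Bob's replica applies his own op first, then Alice's op transformed against his.
doc_bob = apply(apply(base, b), transform(a, b))

print(doc_alice, "|", doc_bob)
```

Both replicas end with the same string even though they applied the operations in different orders. A real system would also need transform rules for deletes and formatting changes, plus server-assigned sequence numbers to decide which operations count as concurrent.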

2. WebSocket Architecture and Connection Management

Maintaining thousands of concurrent WebSocket connections per document server requires careful resource management. Interviewers look for awareness of connection routing, session affinity, and graceful degradation.

Hints to consider:

Route all editors of the same document to the same collaboration server instance (sticky routing by document ID) to avoid distributed consensus for operation ordering
Use a connection manager service that maps document IDs to active collaboration servers and handles rebalancing when servers are added or removed
Implement exponential backoff reconnection logic on clients; buffer operations locally during brief disconnections and replay them on reconnection using sequence numbers
Separate presence updates (cursor positions, online status) from document operations to reduce message volume, sending presence at a lower frequency (every 500ms)
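
The reconnection and buffering hints can be sketched as follows. This is an illustrative outline under stated assumptions: the function and class names are hypothetical, operations are represented as plain strings, and the jitter strategy (a random delay up to the exponential ceiling) is one common choice rather than something the problem prescribes.

```python
import random
from collections import deque


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class PendingOps:
    """Client-side buffer of operations the server has not yet acknowledged."""

    def __init__(self) -> None:
        self._next_seq = 1
        self._pending = deque()   # entries are (client_seq, op)

    def record(self, op: str) -> int:
        """Buffer a local operation and return its client sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._pending.append((seq, op))
        return seq

    def ack(self, seq: int) -> None:
        """Drop every buffered op with a client sequence number <= seq."""
        while self._pending and self._pending[0][0] <= seq:
            self._pending.popleft()

    def to_replay(self) -> list:
        """Operations to resend, in order, after reconnecting."""
        return list(self._pending)


buf = PendingOps()
buf.record("insert(0, 'H')")
buf.record("insert(1, 'i')")
buf.record("insert(2, '!')")
buf.ack(1)                      # server confirmed op 1 before the connection dropped
replay = buf.to_replay()        # ops 2 and 3 still need to be resent
```

On reconnect the client would resend `replay` along with the last server sequence number it saw, so the server can transform the replayed operations against anything the client missed while offline.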

3. Storage Strategy for Documents and Operation Logs

Documents have unusual access patterns: frequent small updates during editing, occasional full reads on open, and historical queries for version browsing. Your storage layer must optimize for all three.

Hints to consider:

Store the append-only operation log in a durable database (PostgreSQL) partitioned by document ID, providing a complete audit trail and crash recovery source
Create periodic snapshots (every 100 operations or every 5 minutes) so document loading does not require replaying the entire operation history from the beginning
Use a separate blob storage service for embedded media (images, attachments) with content-addressed URLs, keeping the operation log lightweight
Shard the document store by document ID using consistent hashing; partition operation logs by time ranges within each document for efficient historical queries
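
To make the log-plus-snapshot idea concrete, here is a small in-memory sketch. It assumes insert-only operations and keeps everything in Python lists; the `DocumentStore` class and its method names are hypothetical stand-ins for whatever the database layer would provide.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    snapshot_every: int = 100
    log: list = field(default_factory=list)                      # (seq, pos, text)
    snapshots: list = field(default_factory=lambda: [(0, "")])   # (seq, content)

    def append(self, pos: int, text: str) -> int:
        """Append an insert to the log; snapshot every `snapshot_every` ops."""
        seq = len(self.log) + 1
        self.log.append((seq, pos, text))
        if seq % self.snapshot_every == 0:
            self.snapshots.append((seq, self.load(seq)))
        return seq

    def load(self, at_seq=None) -> str:
        """Rebuild the document as of `at_seq` (default: latest).

        Starts from the newest snapshot at or before the target and replays
        only the operations logged after that snapshot.
        """
        target = len(self.log) if at_seq is None else at_seq
        snap_seq, content = max(
            (s for s in self.snapshots if s[0] <= target), key=lambda s: s[0]
        )
        for _, pos, text in self.log[snap_seq:target]:
            content = content[:pos] + text + content[pos:]
        return content


store = DocumentStore(snapshot_every=3)
for i, ch in enumerate("abcdef"):
    store.append(i, ch)

latest = store.load()       # replays nothing: a snapshot exists at seq 6
as_of_4 = store.load(4)     # starts from the seq-3 snapshot, replays one op
```

The same `load(at_seq)` path serves both document open and version-history browsing, which is why snapshot frequency is a trade-off between storage cost and replay time.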

4. Handling Hot Documents and Rate Limiting

A company-wide announcement or viral template can attract thousands of simultaneous viewers and editors, creating extreme load on a single collaboration server. Interviewers want to see scalability thinking beyond the average case.

Hints to consider:

Separate read-only viewers from active editors: viewers receive a periodic snapshot via CDN or cache, while only editors maintain WebSocket connections to the collaboration server
Batch rapid successive edits from the same user (e.g., keystrokes arriving within 50ms of each other) into a single operation before broadcasting, reducing network and processing overhead
Apply per-user and per-document rate limits to prevent abuse and protect the collaboration server from denial-of-service scenarios
For extremely hot documents, consider splitting the document into independently editable sections (paragraphs or pages) that can be processed by different server instances
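
Two of the hints above, keystroke batching and per-user rate limiting, are easy to sketch. The code below is illustrative only: the tuple layout for keystrokes, the `coalesce` and `TokenBucket` names, and the token-bucket algorithm itself are assumptions chosen for the example, and timestamps are passed in explicitly so the behavior is deterministic.

```python
def coalesce(keystrokes, window_ms=50):
    """Merge one user's consecutive inserts into batches.

    `keystrokes` is a list of (timestamp_ms, pos, text). An insert joins the
    current batch if it arrives within `window_ms` of the previous keystroke
    and continues exactly where the batch's text ends.
    """
    batches = []   # entries are (last_timestamp_ms, start_pos, text)
    for ts, pos, text in keystrokes:
        if batches:
            last_ts, start, buf = batches[-1]
            if ts - last_ts <= window_ms and pos == start + len(buf):
                batches[-1] = (ts, start, buf + text)
                continue
        batches.append((ts, pos, text))
    return [(start, text) for _, start, text in batches]


class TokenBucket:
    """Simple token-bucket limiter; one instance per user or per document."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


typed = [(0, 0, "h"), (20, 1, "i"), (45, 2, "!"), (300, 3, " "), (310, 4, "x")]
ops = coalesce(typed)   # five keystrokes become two operations

bucket = TokenBucket(rate_per_s=10, burst=2)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.5)]
```

Batching happens on the client before sending, while the rate limiter would sit on the collaboration server, where rejected operations can be answered with a retry hint instead of being broadcast.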

Suggested Approach

Step 1: Clarify Requirements

Start by confirming scope and priorities. Ask about the types of content the editor must support (plain text only, rich formatting, embedded objects) and expected document sizes. Clarify maximum concurrent collaborators per document and geographic distribution of users. Determine whether offline editing is a hard requirement or a future enhancement. Confirm how far back version history must extend and whether the system needs full-text search across documents. Ask whether read-only viewers count toward concurrency limits and what consistency guarantees are acceptable during partitions.
