Design a file storage and synchronization service like Dropbox where users upload files, organize them in folders, and access or share them from any device. The core experience is seamless: place a file in one location and it becomes available everywhere, even after going offline and coming back.
The central engineering challenges are separating the control plane (metadata, authentication, sync coordination) from the data plane (large file transfers), supporting resumable uploads for multi-gigabyte files over unreliable networks, propagating file changes across multiple devices in near real time, and resolving conflicts when the same file is edited on two devices while disconnected. Your design must handle hundreds of millions of users, petabytes of stored data, and bursty upload/download traffic while keeping costs manageable through efficient storage tiering and CDN distribution.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see a clear architectural boundary between the metadata path and the file transfer path. Routing multi-gigabyte uploads through application servers is a costly anti-pattern that limits throughput and ties control-plane availability to data transfer spikes.
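One way to make the control-plane/data-plane split concrete is to have the Metadata Service hand out time-limited signed URLs so file bytes flow straight to object storage. This is a minimal sketch using a hand-rolled HMAC signature; the key, hostnames, and URL format are illustrative assumptions (a real system would use the storage provider's own pre-signing API, e.g. S3 pre-signed URLs).

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SIGNING_KEY = b"demo-secret"  # hypothetical key shared with the storage front end


def presign_upload_url(bucket: str, object_key: str, ttl_seconds: int = 900) -> str:
    """Control-plane helper: issue a time-limited URL so the client sends
    file bytes directly to storage, bypassing the application servers."""
    expires = int(time.time()) + ttl_seconds
    payload = f"PUT\n{bucket}\n{object_key}\n{expires}".encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": signature})
    return f"https://{bucket}.storage.example.com/{object_key}?{query}"


url = presign_upload_url("user-files", "alice/report.pdf")
```

The control plane only signs a small string per request; the multi-gigabyte transfer never touches it, so metadata availability is decoupled from data-transfer spikes.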
Large files over unreliable networks will fail without resumability. Interviewers expect you to design for multipart uploads, checksums, and idempotent upload sessions as table-stakes features.
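The resumable-upload idea can be sketched as client-side chunking with per-part checksums plus an idempotent resume step. The 8 MiB part size and the dict shapes here are assumptions for illustration, not a prescribed wire format.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # assumed part size: 8 MiB


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file into fixed-size parts, each with its own SHA-256.
    Per-part hashes let the server verify integrity independently,
    so a failed part can be retried without re-sending the file."""
    parts = []
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        chunk = data[offset:offset + chunk_size]
        parts.append({
            "index": index,
            "size": len(chunk),
            "sha256": hashlib.sha256(chunk).hexdigest(),
        })
    return parts


def resume_plan(parts, acked_indexes):
    """Idempotent resume: after reconnecting, upload only the parts the
    server has not already acknowledged for this upload session."""
    return [p for p in parts if p["index"] not in acked_indexes]
```

Because the plan is derived from server-acknowledged state, retrying the whole sequence after a network failure is safe: already-stored parts are simply skipped.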
Keeping files consistent across phones, laptops, and desktops requires a robust synchronization protocol. Interviewers probe how you handle offline edits, detect divergent changes, and present conflicts to users.
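Divergence detection is commonly done with version vectors: each device keeps a counter of its own edits, and two versions conflict exactly when neither vector dominates the other. A minimal comparison sketch (device IDs and dict representation are assumptions):

```python
def compare_version_vectors(local: dict, remote: dict) -> str:
    """Compare per-device edit counters. Returns 'equal', 'local_ahead',
    'remote_ahead', or 'conflict' (concurrent offline edits)."""
    devices = set(local) | set(remote)
    local_ahead = any(local.get(d, 0) > remote.get(d, 0) for d in devices)
    remote_ahead = any(remote.get(d, 0) > local.get(d, 0) for d in devices)
    if local_ahead and remote_ahead:
        return "conflict"  # neither history contains the other
    if local_ahead:
        return "local_ahead"
    if remote_ahead:
        return "remote_ahead"
    return "equal"
```

On a "conflict" result the service would keep both versions, renaming one as a conflicted copy rather than silently overwriting either edit.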
Real-time sync means devices need to learn about changes quickly without resorting to expensive polling. Interviewers evaluate your approach to maintaining persistent connections and efficiently distributing change notifications.
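The push model can be sketched as a fan-out gateway that maps each user to their connected devices and sends a thin "something changed" ping rather than the change payload itself. This is an in-memory toy with queues standing in for WebSocket connections; the class and method names are illustrative.

```python
from collections import defaultdict


class NotificationGateway:
    """Toy fan-out: maps user -> connected device queues. A production
    gateway would hold WebSocket connections and route across instances
    via a shared registry (e.g. Redis)."""

    def __init__(self):
        self.connections = defaultdict(dict)  # user_id -> {device_id: queue}

    def connect(self, user_id: str, device_id: str) -> list:
        self.connections[user_id][device_id] = []
        return self.connections[user_id][device_id]

    def notify(self, user_id: str, event: dict, exclude_device: str = None):
        # Push a lightweight ping; recipients then fetch details from the
        # Metadata Service. The originating device is excluded to avoid echo.
        for device_id, queue in self.connections[user_id].items():
            if device_id != exclude_device:
                queue.append(event)
```

Keeping notifications thin means a burst of changes costs only small messages on the persistent connections, and devices that miss a ping can reconcile by asking the Metadata Service for changes since their last cursor.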
At scale, naive storage of every file version becomes prohibitively expensive. Interviewers look for strategies to reduce storage costs without sacrificing user experience.
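A common cost lever is content-addressed chunk storage: identical chunks hash to the same key, so re-uploads, shared files, and unchanged chunks of new versions are stored once and reference-counted. A minimal in-memory sketch (the class is hypothetical; real systems back this with object storage plus a hash index):

```python
import hashlib


class DedupChunkStore:
    """Content-addressed store: identical chunks (same SHA-256) are kept
    once and reference-counted, so duplicates cost no extra space."""

    def __init__(self):
        self.blobs = {}      # hash -> bytes
        self.refcounts = {}  # hash -> int

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = chunk  # only the first copy pays for storage
        self.refcounts[digest] = self.refcounts.get(digest, 0) + 1
        return digest

    def release(self, digest: str):
        """Decrement on file deletion; reclaim the blob at zero refs."""
        self.refcounts[digest] -= 1
        if self.refcounts[digest] == 0:
            del self.blobs[digest], self.refcounts[digest]

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blobs.values())
```

The same structure makes version history cheap: a new version stores only the chunks that actually changed, while unchanged chunks just gain a reference.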
Confirm the scope with your interviewer. Ask about maximum file sizes and whether the system must handle very large files (10+ GB). Clarify whether real-time collaboration (simultaneous editing of the same file) is in scope or just asynchronous sync. Determine how many devices per user and whether mobile, desktop, and web clients all need full sync capability. Establish version history depth requirements and whether deleted files should be recoverable. Ask about sharing scale -- can a single folder be shared with thousands of collaborators?
Sketch the major components: client applications (desktop sync agent, mobile app, web interface) that maintain a local file index and communicate with the backend; an API gateway handling authentication, rate limiting, and routing; a Metadata Service backed by PostgreSQL for file and folder records, permissions, and version history; an object storage layer (S3) for actual file content accessed via pre-signed URLs; a Sync Service that maintains a change log in Kafka and pushes notifications to connected devices; a Notification Gateway managing WebSocket connections; and a Redis cache for hot metadata, upload session state, and connection routing. Show the separation between the metadata path (client to API to Metadata Service to Postgres) and the data path (client directly to S3 via pre-signed URL).
Walk through an end-to-end file upload. The client computes a content hash for the file, splits it into chunks, and requests an upload session from the Metadata Service. The service creates a pending file record in Postgres and returns pre-signed URLs for each chunk. The client uploads chunks directly to S3 in parallel, retrying failures for individual chunks. After all chunks succeed, the client calls a finalize endpoint. The Metadata Service verifies all chunks are present, assembles the file manifest, transitions the file record to "committed," and publishes a change event to Kafka. The Sync Service consumes this event, looks up all devices associated with the user in Redis, and pushes a lightweight notification through the WebSocket gateway. Other devices receive the notification, query the Metadata Service for details, and download the new file chunks from S3 using pre-signed URLs.
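The finalize step described above can be sketched as a verification-then-commit function. The session/manifest shapes are assumptions; the key properties shown are that finalize refuses to commit until every expected chunk checksum matches, and that retrying finalize on an already-committed session is a harmless no-op.

```python
def finalize_upload(session: dict, uploaded_parts: list) -> dict:
    """Finalize sketch: verify every expected chunk arrived with a matching
    checksum before committing the file record. Idempotent: re-finalizing
    a committed session changes nothing, so client retries are safe."""
    if session["state"] == "committed":
        return {"state": "committed", "changed": False}  # safe retry
    by_index = {p["index"]: p for p in uploaded_parts}
    for expected in session["manifest"]:
        got = by_index.get(expected["index"])
        if got is None or got["sha256"] != expected["sha256"]:
            return {"state": "pending", "missing_index": expected["index"]}
    session["state"] = "committed"
    # Here the real service would update Postgres transactionally and
    # publish the change event to Kafka.
    return {"state": "committed", "changed": True}
```

Idempotent finalization matters because the client may time out after the server commits but before the response arrives; the retry must not double-publish the change event.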
Cover conflict resolution by explaining the version vector comparison on upload finalization and the "conflicted copy" strategy for divergent offline edits. Discuss sharing by maintaining an ACL table in Postgres with row-level security, propagating share events through Kafka for real-time notification to collaborators. Address search by indexing file names and metadata in Elasticsearch with CDC from Postgres. Touch on monitoring: track upload success rates, sync lag (time between change commit and device notification), storage utilization, and chunk deduplication ratios. Discuss scaling: shard Postgres by user ID, partition Kafka by user ID for ordered per-user change streams, and auto-scale WebSocket gateway instances based on active connection count.
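The shard-by-user-ID scheme can be illustrated with deterministic hash routing; the shard count is an assumed constant (real deployments pair this with a directory service or consistent hashing to allow resharding).

```python
import hashlib

NUM_SHARDS = 16  # assumed fixed shard count for illustration


def shard_for_user(user_id: str) -> int:
    """Deterministic routing: hash the user ID so all of one user's
    metadata lands on a single Postgres shard, and the same key choice
    for Kafka partitioning preserves per-user ordering of change events."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Because both the database shard and the Kafka partition are derived from the same key, every event for a given user is processed in order on one stream, which is what the sync protocol relies on.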