For a walkthrough of designing a distributed file system with sync semantics, see our Design File System guide. It covers metadata management, chunked storage, and multi-device synchronization patterns that form the foundation for this problem.
Also review the Blob Storage, Databases, Message Queues, and CDN building blocks.
Design a file storage and synchronization service like Dropbox that lets users upload, download, and sync files across multiple devices with real-time updates. The core experience is seamless: save a file on one device and it appears on every other linked device within seconds, even if some were offline when the change occurred.
This problem tests your ability to separate a control plane (metadata, authentication, sync coordination) from a data plane (large file transfers). You must handle resumable uploads over unreliable networks, real-time change propagation across devices, conflict resolution for concurrent offline edits, and cost-efficient storage at petabyte scale. Interviewers use this to evaluate whether you can decompose a deceptively simple product into well-bounded services with clear consistency, durability, and scalability guarantees.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Routing multi-gigabyte file uploads through application servers is the most common and costly mistake. Interviewers immediately check whether you understand the need to keep metadata operations and bulk data transfers on separate paths.
Users regularly upload files that are hundreds of megabytes or larger over unreliable connections. Without chunking, a network blip near the end of a large upload forces the user to restart from scratch.
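The chunking approach can be sketched as a client-side manifest builder. This is an illustrative sketch, not a real client: the 4 MiB chunk size and the `chunk_manifest` helper are assumptions, and in practice the chunk size is a tuning decision that trades resumability granularity against per-chunk overhead.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB; a common choice, but an assumption here

def chunk_manifest(path):
    """Split a file into fixed-size chunks and hash each one.

    Returns a list of (index, sha256_hex, length) entries. On a resumed
    upload, the client re-sends only the chunks the server reports as
    missing, so a network blip near the end costs at most one chunk.
    """
    manifest = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            manifest.append((index, hashlib.sha256(chunk).hexdigest(), len(chunk)))
            index += 1
    return manifest
```

Because chunk identity is content-derived, the same manifest also drives deduplication: two users uploading the same file produce identical hashes.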
When a file changes on one device, all other linked devices must learn about it quickly. Pure polling wastes bandwidth and adds latency; pure push is fragile at scale.
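A common resolution is a hybrid: the server pushes a tiny "something changed" ping over a persistent connection, and each device then pulls the change log itself. The sketch below is an in-memory stand-in (the `Notifier` class and callback-as-WebSocket shape are assumptions, not a real transport):

```python
class Notifier:
    """Fan out lightweight change pings to a user's connected devices.

    Pings carry no payload; each device pulls the change log on its own,
    so a dropped ping only delays sync until the next poll or reconnect
    rather than losing data. Disconnected devices catch up via the same
    pull path when they come back.
    """
    def __init__(self):
        self.connections = {}  # device_id -> send callback (stand-in for a WebSocket)

    def connect(self, device_id, send):
        self.connections[device_id] = send

    def disconnect(self, device_id):
        self.connections.pop(device_id, None)

    def notify(self, origin_device):
        for device_id, send in self.connections.items():
            if device_id != origin_device:   # don't echo back to the editing device
                send({"type": "ping"})
```

Keeping the push channel payload-free means it never has to be reliable: correctness lives entirely in the pull path.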
Two devices may edit the same file while disconnected. When both reconnect, the system needs a deterministic strategy to reconcile divergent histories without losing work.
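One deterministic strategy (the Dropbox-style "conflicted copy") can be sketched as follows. The version-record shape and the `reconcile` helper are assumptions for illustration:

```python
def reconcile(local, remote):
    """Deterministically reconcile two versions of one file.

    A version is a dict: {'id', 'parent', 'mtime', 'device'}. If one
    version descends from the other, fast-forward to it. Otherwise the
    histories diverged: the version with the later (mtime, device) tuple
    wins the canonical path, and the loser is returned so the caller can
    preserve it as a conflicted copy. No edit is ever silently dropped,
    and every replica computes the same answer.
    """
    if remote["parent"] == local["id"]:
        return remote, None               # remote already includes local's edit
    if local["parent"] == remote["id"]:
        return local, None                # local already includes remote's edit
    # True conflict: pick a deterministic winner, keep the loser.
    winner, loser = sorted(
        [local, remote],
        key=lambda v: (v["mtime"], v["device"]),
        reverse=True,
    )
    return winner, loser                  # caller saves loser as "name (conflicted copy).ext"
```

The tiebreak on `(mtime, device)` matters: without the device ID, two edits with identical timestamps could resolve differently on different replicas.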
At petabyte scale, storage cost dominates the operating budget. Interviewers expect you to think beyond naive "store everything" approaches.
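The biggest lever is content-addressed chunk storage with reference counting, so identical chunks are stored once across all users and versions. A minimal in-memory sketch (the `ChunkStore` class is illustrative; in production the blobs live in S3 and the refcounts in the metadata database):

```python
import hashlib

class ChunkStore:
    """Content-addressed chunk store with reference counting.

    Identical chunks uploaded by any user are stored once; a chunk is
    deleted only when no file version references it. This is the core
    of the cross-user deduplication that cuts storage cost.
    """
    def __init__(self):
        self.blobs = {}      # sha256 hex -> bytes (stand-in for S3 objects)
        self.refcounts = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:        # store only truly new content
            self.blobs[key] = data
        self.refcounts[key] = self.refcounts.get(key, 0) + 1
        return key

    def release(self, key):
        self.refcounts[key] -= 1
        if self.refcounts[key] == 0:     # no version references it: garbage-collect
            del self.blobs[key]
            del self.refcounts[key]
```

Versioning then becomes cheap: a new version of a large file that changes one chunk adds one blob, not a full copy.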
Confirm the expected file size distribution (documents, photos, videos, archives), whether real-time collaboration within files is needed or just file-level sync, how many devices a typical user links, and whether the system serves consumers or enterprises. Ask about version history retention, maximum file size, and sharing model (internal teams versus public links). Clarify offline behavior: does the client sync all files or only user-selected ones?
Sketch the core components: a metadata service backed by Postgres for folder trees, permissions, and file version records; an object storage layer (S3) for file chunks; a sync service that maintains the per-user change log and pushes notifications to connected devices via WebSocket; an upload service that generates pre-signed URLs, tracks chunk progress, and triggers post-upload processing; a background pipeline for thumbnail generation, virus scanning, and search indexing; and a CDN layer for accelerating downloads of popular shared files. Show the separation between metadata API calls and direct-to-S3 data transfers.
Walk through the full upload path. The client splits the file into chunks, computes a SHA-256 hash per chunk, and sends the list to the metadata service. The service checks which chunks already exist (deduplication), returns pre-signed upload URLs for new chunks, and the client uploads them directly to S3. Once all chunks land, the client notifies the metadata service, which atomically creates a new file version record and appends an entry to the change log. The sync service pushes a lightweight notification to all connected devices for that user. Each device fetches the change log from its last cursor, discovers the new version, downloads new or changed chunks via pre-signed URLs, and reconstructs the file locally.
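The negotiate/upload/commit handshake above can be sketched with an in-memory stand-in for the metadata service (the `UploadCoordinator` class and its method names are assumptions; real chunk uploads would go to pre-signed S3 URLs, not to this object):

```python
class UploadCoordinator:
    """In-memory sketch of the upload negotiation described above.

    The client sends its chunk-hash manifest; the server answers with the
    hashes it still needs (everything else is deduplicated away). commit()
    succeeds only once every chunk has landed, which makes retried and
    resumed uploads idempotent: re-running any step is harmless.
    """
    def __init__(self):
        self.stored = set()       # chunk hashes already in object storage
        self.versions = []        # committed file version records

    def negotiate(self, manifest):
        return [h for h in manifest if h not in self.stored]

    def receive_chunk(self, h):
        self.stored.add(h)        # in reality: the client PUTs to a pre-signed URL

    def commit(self, path, manifest):
        if any(h not in self.stored for h in manifest):
            raise RuntimeError("upload incomplete; retry missing chunks")
        self.versions.append({"path": path, "chunks": list(manifest)})
```

Note the commit is the only step that mutates metadata; everything before it is safely repeatable, which is what makes the protocol robust over flaky networks.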
Cover reliability by relying on S3's built-in durability for file data and on a replicated Postgres deployment with automatic failover for metadata. Discuss search by extracting text from uploaded files and indexing it in Elasticsearch via a Kafka pipeline. Address sharing through ACLs stored in the metadata database, checked on every API call and on every pre-signed URL generation. Mention monitoring: track sync lag per device, upload success rates, deduplication ratios, and storage growth trends. If time allows, discuss multi-region deployment with metadata replicated globally and file chunks cached at edge locations via CDN.
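The ACL check gating pre-signed URL generation can be sketched in a few lines. The `AclStore` class and its role names are illustrative assumptions:

```python
class AclStore:
    """Minimal ACL check gating metadata calls and pre-signed URL issuance.

    Grants map (user, file) -> role. Every request passes through check(),
    so a revoked share stops working immediately for new requests; the
    short expiry on already-issued pre-signed URLs bounds the remaining
    exposure window.
    """
    ROLES = {"viewer": {"read"}, "editor": {"read", "write"}}

    def __init__(self):
        self.grants = {}   # (user_id, file_id) -> role

    def grant(self, user, file_id, role):
        self.grants[(user, file_id)] = role

    def revoke(self, user, file_id):
        self.grants.pop((user, file_id), None)

    def check(self, user, file_id, action):
        role = self.grants.get((user, file_id))
        return role is not None and action in self.ROLES[role]
```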
Candidates at Pinterest report that interviewers focused heavily on strong consistency and ACID properties for metadata operations, particularly for file sharing and permission management. The synchronization mechanism across multiple devices was a deep-dive topic, with the interviewer pushing hard on the versioning solution and how conflicts are detected. One interviewer specifically asked about the OneDrive-style offline experience: how the system behaves when the user goes offline and comes back online. Be ready to explain exactly how per-device cursors track sync state and how the change log enables reliable catch-up after disconnection.
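The cursor mechanism is worth being able to write down. A minimal sketch, assuming an in-memory per-user log (the `ChangeLog` class and sequence-number scheme are illustrative; in production the log is a durable table and cursors persist per device):

```python
from collections import defaultdict

class ChangeLog:
    """Per-user append-only change log with per-device cursors.

    Each device remembers only the next sequence number it has not yet
    read. A device that was offline for hours simply pulls from its
    cursor and receives every change it missed, in order; a device that
    is up to date gets an empty list. Catch-up needs no special case.
    """
    def __init__(self):
        self.entries = []                 # position in list = sequence number
        self.cursors = defaultdict(int)   # device_id -> next sequence to read

    def append(self, change):
        self.entries.append(change)
        return len(self.entries)          # new head sequence, usable in the push ping

    def pull(self, device_id):
        start = self.cursors[device_id]
        new = self.entries[start:]
        self.cursors[device_id] = len(self.entries)
        return new
```

This is also the answer to the offline question: reconnection is just a `pull` from a stale cursor, so push delivery never has to be reliable.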