Problem Statement
Design a cloud file storage and synchronization service like Dropbox that lets users upload files, organize them in folders, and access or share them from any device. Files placed on one device should appear on all others, even after periods of offline work.
The fundamental architecture challenge is cleanly separating the control plane (metadata management, authentication, sync coordination, permissions) from the data plane (storing and transferring potentially multi-gigabyte files). You must design for resumable uploads over unreliable networks, real-time propagation of file changes to connected devices, conflict resolution when the same file is edited offline on two machines, and cost-efficient storage at petabyte scale. Strong answers demonstrate thoughtful requirements scoping, a scalable component architecture, and practical trade-offs around reliability, consistency, and user experience.
Key Requirements
Functional
- File upload and download -- users can upload files of any size with pause, resume, and integrity verification; downloads support range requests for partial retrieval
- Cross-device sync -- file creates, updates, renames, and deletes replicate automatically to all of a user's devices, including after offline periods
- File browsing and search -- users can navigate folder trees, view metadata (size, modified date, version), and search by file name or content type
- Sharing and permissions -- users can share files or folders via direct invitations or public links with view-only or edit access controls
Non-Functional
- Scalability -- support hundreds of millions of users, petabytes of stored data, and tens of thousands of concurrent file operations
- Reliability -- zero data loss for committed uploads; 99.95% availability for both metadata and file retrieval
- Latency -- metadata operations within 200ms; sync notifications to other devices within 1 second; upload throughput limited only by client bandwidth
- Consistency -- strong consistency for metadata mutations; eventual consistency for sync propagation across devices with convergence guarantees
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Control Plane and Data Plane Separation
Routing large file uploads through application servers is an expensive anti-pattern. Interviewers expect a clear split where app servers handle only metadata while file bytes flow directly between clients and object storage.
Hints to consider:
- The metadata service generates pre-signed URLs (or equivalent tokens) so clients transfer file data directly to/from object storage without touching application servers
- This separation allows independent scaling: metadata servers scale based on API request rate while storage scales based on capacity and throughput
- A finalization callback from object storage (or a client-initiated completion call) triggers the metadata service to update the file record and publish a sync event
- Rate limiting, authentication, and quota enforcement happen in the metadata layer, keeping the data path clean and simple
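The signed-URL idea above can be sketched in a few lines. This is a hand-rolled HMAC scheme for illustration only -- a real deployment would use the object store's native mechanism (e.g. S3 pre-signed URLs) -- and names like `SIGNING_KEY` and the URL format are assumptions, not part of any real API:

```python
import hashlib
import hmac
import time

# Illustrative shared secret held by the metadata service (assumption:
# production systems delegate signing to the object store itself).
SIGNING_KEY = b"metadata-service-secret"

def presign(object_key: str, ttl_seconds: int = 900) -> str:
    """Mint a short-lived URL the client uses to move bytes directly
    to/from object storage, bypassing the application servers."""
    expires = int(time.time()) + ttl_seconds
    payload = f"{object_key}:{expires}".encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return f"https://storage.example.com/{object_key}?expires={expires}&sig={sig}"

def verify(object_key: str, expires: int, sig: str) -> bool:
    """Storage front end validates the signature and expiry locally --
    no callback to the metadata service is needed on the data path."""
    if time.time() > expires:
        return False
    payload = f"{object_key}:{expires}".encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

The key property is that the data path needs no per-request call to the metadata layer: authentication happened once, when the URL was minted.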
2. Chunked Uploads and Data Integrity
Files can be gigabytes in size and uploaded over mobile networks that drop connections frequently. Without resumability and integrity checks, large transfers will fail repeatedly and frustrate users.
Hints to consider:
- Split files into fixed-size chunks (4-8 MB), upload each independently with a per-chunk content hash, and track upload progress server-side
- Content-addressable storage (keyed by hash) enables chunk-level deduplication: identical blocks shared across users or file versions are stored only once
- After all chunks arrive, a finalization step reassembles the file manifest, verifies the overall content hash, and atomically commits the metadata record
- Upload sessions are identified by a unique ID so clients can resume by querying which chunks are still missing
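The chunking and resume logic above can be sketched as follows -- a minimal in-memory model, assuming SHA-256 content hashes and a server-side session object (class and method names are illustrative):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, at the low end of the 4-8 MB range

def chunk_file(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file into fixed-size chunks, each keyed by its SHA-256
    hash -- the key used for content-addressable storage and dedup."""
    return [
        (hashlib.sha256(data[i:i + chunk_size]).hexdigest(),
         data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]

class UploadSession:
    """Server-side record of which chunks have arrived, so a client
    that loses its connection can query the missing set and resume."""
    def __init__(self, session_id: str, expected_hashes: list[str]):
        self.session_id = session_id
        self.expected = set(expected_hashes)
        self.received: set[str] = set()

    def accept_chunk(self, chunk: bytes) -> bool:
        h = hashlib.sha256(chunk).hexdigest()
        if h not in self.expected:
            return False  # corrupted or unexpected chunk; client retries
        self.received.add(h)
        return True

    def missing(self) -> set[str]:
        return self.expected - self.received

    def complete(self) -> bool:
        return not self.missing()
```

Because each chunk is verified by its own hash on arrival, a corrupted transfer costs one chunk retry, not a whole-file restart.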
3. Real-Time Sync and Change Propagation
Users expect files to appear on other devices within seconds of being saved. This requires a push-based notification system rather than periodic polling.
Hints to consider:
- Maintain a single long-lived connection per device (WebSocket or server-sent events) for push notifications; the notification carries only a change hint, not the full payload
- Publish file change events to a Kafka topic partitioned by user ID; a notification service consumes events and routes them to connected devices via the gateway
- Each device tracks a cursor (sequence number) representing the last change it has processed; on reconnection, the device catches up from its cursor position
- The sync API returns a batch of changes since the cursor, allowing efficient delta synchronization rather than full directory scans
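The cursor mechanism can be sketched with a per-user append-only log -- an in-memory stand-in for the Kafka-backed change log, with illustrative names:

```python
from dataclasses import dataclass

@dataclass
class Change:
    seq: int        # monotonically increasing per-user sequence number
    file_id: str
    op: str         # e.g. "create", "update", "delete"

class ChangeLog:
    """Per-user append-only change log. Each device remembers the last
    seq it processed and fetches everything after it on reconnect."""
    def __init__(self):
        self._log: list[Change] = []

    def append(self, file_id: str, op: str) -> Change:
        change = Change(seq=len(self._log) + 1, file_id=file_id, op=op)
        self._log.append(change)
        return change

    def changes_since(self, cursor: int, limit: int = 100) -> list[Change]:
        """Delta sync: return a batch of changes after `cursor`, so a
        reconnecting device never needs a full directory scan."""
        return [c for c in self._log if c.seq > cursor][:limit]
```

The push notification only tells the device "you are behind"; the device then pulls deltas from its cursor, which makes missed notifications harmless.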
4. Offline Editing and Conflict Resolution
Users edit files on laptops while disconnected. When they reconnect, changes from multiple devices may conflict. Interviewers want a deterministic, user-friendly resolution strategy.
Hints to consider:
- The client queues local file operations in a journal while offline, each tagged with the file's last-known version hash
- On reconnection, the client replays its journal to the server, which compares the base version hash against the current server state
- If versions match, the update applies cleanly; if they diverge (another device modified the file in the interim), the server creates a "conflicted copy" alongside the original
- Notify the user about conflicts so they can manually merge changes; automatic merge is only feasible for specific file types like plain text
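The version-hash comparison at the heart of this strategy can be sketched as a single server-side function -- a simplified model where `store` stands in for the metadata and content layers, and the conflicted-copy naming is illustrative:

```python
import hashlib

def apply_update(store: dict, file_id: str,
                 base_hash: str, new_content: bytes) -> str:
    """Apply one queued offline edit. Returns the file_id the edit
    landed on: the original (clean apply) or a conflicted copy."""
    current_hash = hashlib.sha256(store[file_id]).hexdigest()
    if current_hash == base_hash:
        store[file_id] = new_content   # fast path: no concurrent edit
        return file_id
    # Another device changed the file since this client last synced:
    # preserve both versions rather than silently overwriting either.
    conflict_id = f"{file_id} (conflicted copy)"
    store[conflict_id] = new_content
    return conflict_id
```

The deterministic rule -- never discard either side, surface the conflict to the user -- is what makes the behavior predictable across any interleaving of offline edits.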
5. Storage Tiering and Cost Management
At petabyte scale, storing every version of every file in hot storage is prohibitively expensive. Interviewers probe your cost optimization strategies.
Hints to consider:
- Use content-addressable chunk storage to deduplicate identical blocks across all users, significantly reducing total storage volume
- Apply lifecycle policies: keep recent versions in hot storage, move older versions to infrequent-access tiers, and archive or delete versions past their retention window
- Delta sync (transmitting only changed chunks) reduces bandwidth costs for incremental file edits
- Expose configurable version retention settings so enterprise customers can balance compliance requirements against storage costs
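The deduplication savings come from reference-counting chunks in a content-addressable store. A minimal sketch, assuming SHA-256 keys and an in-memory blob map (a real system would hold refcounts in the metadata database and blobs in S3):

```python
import hashlib

class ChunkStore:
    """Content-addressable store: identical chunks (same SHA-256) are
    stored once and reference-counted across files, versions, and users."""
    def __init__(self):
        self.blobs: dict[str, bytes] = {}
        self.refs: dict[str, int] = {}

    def put(self, chunk: bytes) -> str:
        h = hashlib.sha256(chunk).hexdigest()
        if h not in self.blobs:
            self.blobs[h] = chunk      # only the first copy pays for storage
        self.refs[h] = self.refs.get(h, 0) + 1
        return h

    def release(self, h: str) -> None:
        """Called when a file version expires under the retention policy;
        the blob is deleted only once no version references it."""
        self.refs[h] -= 1
        if self.refs[h] == 0:
            del self.blobs[h], self.refs[h]

    def bytes_stored(self) -> int:
        return sum(len(b) for b in self.blobs.values())
```

Retention policies then translate into `release` calls on expired versions, and only genuinely unreferenced chunks ever leave storage.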
Suggested Approach
Step 1: Clarify Requirements
Confirm the scope with your interviewer. Ask about maximum file sizes and whether very large files (50+ GB) are common. Determine whether real-time collaborative editing (like Google Docs) is in scope or just asynchronous file sync. Clarify the number of devices per user and supported platforms. Ask about version history depth and deleted file recovery. Confirm sharing requirements -- maximum collaborators per shared folder and whether public link sharing needs analytics.
Step 2: High-Level Architecture
Draw the major components: client-side sync agents that maintain a local file index and journal; an API Gateway for authentication and rate limiting; a Metadata Service backed by PostgreSQL for file records, folder trees, permissions, and version history; object storage (S3) for file chunks accessed via pre-signed URLs; a Sync Service reading a Kafka change log and pushing notifications through a WebSocket Gateway; and Redis for caching hot metadata, upload session state, and connection routing. Show the two distinct paths: metadata path (client to API to Metadata Service to PostgreSQL) and data path (client directly to S3 via signed URLs).
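The core entities the Metadata Service persists might look like the following -- a hedged sketch using dataclasses as a stand-in for the PostgreSQL schema, with all field names illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class FileVersion:
    version_id: int
    content_hash: str          # hash of the full file, for integrity checks
    chunk_hashes: list[str]    # ordered manifest of chunk keys in S3
    created_at: datetime

@dataclass
class FileRecord:
    file_id: str
    owner_id: str
    parent_folder_id: str      # folder tree modeled via parent pointers
    name: str
    current_version: int
    versions: list[FileVersion] = field(default_factory=list)

@dataclass
class Permission:
    file_id: str               # or a folder id, inherited by children
    user_id: str
    role: str                  # "viewer" or "editor"
```

Note that file bytes never appear in the metadata model: a version is just a manifest of chunk hashes, which is what keeps the metadata and data planes independent.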
Step 3: Deep Dive on Upload and Sync
Walk through the complete flow. User saves a file. The sync agent detects the change, computes chunk hashes, and calls the Metadata Service to initiate an upload session. The service checks for duplicate chunks (content-addressable lookup), creates a pending file version, and returns pre-signed upload URLs only for new chunks. The agent uploads chunks to S3 in parallel. After all uploads complete, the agent calls the finalization endpoint. The Metadata Service verifies chunk presence, commits the new version, and publishes a change event to Kafka. The Sync Service consumes the event, looks up connected devices for this user in Redis, and pushes a lightweight notification. Other devices receive the notification, query the Metadata Service for the change details, and download new chunks from S3 using pre-signed URLs.
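The dedup-aware session start and finalization in the flow above can be sketched as two server-side functions -- a simplification in which `stored_hashes` stands in for the content-addressable lookup, and the URL format is illustrative:

```python
def initiate_upload(stored_hashes: set[str],
                    manifest: list[str]) -> dict[str, str]:
    """Session start: compare the client's chunk manifest against
    chunks already in storage, and hand back upload URLs only for
    genuinely new chunks -- known chunks are skipped entirely."""
    return {
        h: f"https://storage.example.com/chunks/{h}"
        for h in manifest
        if h not in stored_hashes
    }

def finalize(stored_hashes: set[str], manifest: list[str]) -> bool:
    """Commit the new version only if every chunk in the manifest is
    present; otherwise the client calls initiate_upload again and
    resumes with whatever is still missing."""
    return all(h in stored_hashes for h in manifest)
```

An unchanged document re-saved on a second device uploads zero bytes: every chunk hash is already known, so `initiate_upload` returns nothing and finalization commits immediately.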
Step 4: Address Secondary Concerns
Cover conflict resolution: version hash comparison at finalization, conflicted-copy creation, and user notification. Discuss sharing: ACL records in PostgreSQL, share invitation flow, real-time notifications to collaborators via Kafka events. Address search: index file names and metadata in Elasticsearch with CDC from PostgreSQL. Touch on monitoring: upload success rates, sync notification latency, chunk deduplication ratio, storage utilization by tier. Discuss scaling: shard PostgreSQL by user ID, partition Kafka by user ID, auto-scale WebSocket gateways based on active connections, and lean on object storage's native horizontal scalability for the data path, since pre-signed URLs let every client transfer directly against S3.
Related Learning
Deepen your understanding of the patterns used in this problem:
- Slack -- persistent connection management and real-time push notification patterns applicable to device sync
- Blob Storage -- object storage patterns including pre-signed URLs, multipart uploads, and content-addressable storage
- Message Queues -- Kafka for durable change logs with cursor-based consumption for device sync
- Databases -- relational metadata modeling for file trees, versions, and permission management
- Caching -- Redis for hot metadata caching, upload session tracking, and connection routing
- CDN -- edge distribution for frequently downloaded shared files