Design a file storage and synchronization service like Dropbox where users upload files, organize them in folders, and access or share them from any device. The core experience is seamless: place a file in one location and it becomes available everywhere, even after going offline and coming back.
The central engineering challenges are separating the control plane (metadata, authentication, sync coordination) from the data plane (large file transfers), supporting resumable uploads for multi-gigabyte files over unreliable networks, propagating file changes across multiple devices in near real time, and resolving conflicts when the same file is edited on two devices while disconnected. Your design must handle hundreds of millions of users, petabytes of stored data, and bursty upload/download traffic while keeping costs manageable through efficient storage tiering and CDN distribution.
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see a clear architectural boundary between the metadata path and the file transfer path. Routing multi-gigabyte uploads through application servers is a costly anti-pattern that limits throughput and ties control-plane availability to data transfer spikes.
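One way to make the control-plane/data-plane split concrete is to have the Metadata Service hand out time-limited signed URLs so file bytes flow straight to object storage. This is a minimal sketch using a hand-rolled HMAC signature; the key, hostnames, and URL format are illustrative assumptions (a real system would use the storage provider's own pre-signing API, e.g. S3 pre-signed URLs).

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SIGNING_KEY = b"demo-secret"  # hypothetical key shared with the storage front end


def presign_upload_url(bucket: str, object_key: str, ttl_seconds: int = 900) -> str:
    """Control-plane helper: issue a time-limited URL so the client sends
    file bytes directly to storage, bypassing the application servers."""
    expires = int(time.time()) + ttl_seconds
    payload = f"PUT\n{bucket}\n{object_key}\n{expires}".encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": signature})
    return f"https://{bucket}.storage.example.com/{object_key}?{query}"


url = presign_upload_url("user-files", "alice/report.pdf")
```

The control plane only signs a small string per request; the multi-gigabyte transfer never touches it, so metadata availability is decoupled from data-transfer spikes.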
Large files over unreliable networks will fail without resumability. Interviewers expect you to design for multipart uploads, checksums, and idempotent upload sessions as table-stakes features.
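The resumable-upload idea can be sketched as client-side chunking with per-part checksums plus an idempotent resume step. The 8 MiB part size and the dict shapes here are assumptions for illustration, not a prescribed wire format.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # assumed part size: 8 MiB


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file into fixed-size parts, each with its own SHA-256.
    Per-part hashes let the server verify integrity independently,
    so a failed part can be retried without re-sending the file."""
    parts = []
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        chunk = data[offset:offset + chunk_size]
        parts.append({
            "index": index,
            "size": len(chunk),
            "sha256": hashlib.sha256(chunk).hexdigest(),
        })
    return parts


def resume_plan(parts, acked_indexes):
    """Idempotent resume: after reconnecting, upload only the parts the
    server has not already acknowledged for this upload session."""
    return [p for p in parts if p["index"] not in acked_indexes]
```

Because the plan is derived from server-acknowledged state, retrying the whole sequence after a network failure is safe: already-stored parts are simply skipped.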
Keeping files consistent across phones, laptops, and desktops requires a robust synchronization protocol. Interviewers probe how you handle offline edits, detect divergent changes, and present conflicts to users.
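Divergence detection is commonly done with version vectors: each device keeps a counter of its own edits, and two versions conflict exactly when neither vector dominates the other. A minimal comparison sketch (device IDs and dict representation are assumptions):

```python
def compare_version_vectors(local: dict, remote: dict) -> str:
    """Compare per-device edit counters. Returns 'equal', 'local_ahead',
    'remote_ahead', or 'conflict' (concurrent offline edits)."""
    devices = set(local) | set(remote)
    local_ahead = any(local.get(d, 0) > remote.get(d, 0) for d in devices)
    remote_ahead = any(remote.get(d, 0) > local.get(d, 0) for d in devices)
    if local_ahead and remote_ahead:
        return "conflict"  # neither history contains the other
    if local_ahead:
        return "local_ahead"
    if remote_ahead:
        return "remote_ahead"
    return "equal"
```

On a "conflict" result the service would keep both versions, renaming one as a conflicted copy rather than silently overwriting either edit.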
Real-time sync means devices need to learn about changes quickly without resorting to expensive polling. Interviewers evaluate your approach to maintaining persistent connections and efficiently distributing change notifications.
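The push model can be sketched as a fan-out gateway that maps each user to their connected devices and sends a thin "something changed" ping rather than the change payload itself. This is an in-memory toy with queues standing in for WebSocket connections; the class and method names are illustrative.

```python
from collections import defaultdict


class NotificationGateway:
    """Toy fan-out: maps user -> connected device queues. A production
    gateway would hold WebSocket connections and route across instances
    via a shared registry (e.g. Redis)."""

    def __init__(self):
        self.connections = defaultdict(dict)  # user_id -> {device_id: queue}

    def connect(self, user_id: str, device_id: str) -> list:
        self.connections[user_id][device_id] = []
        return self.connections[user_id][device_id]

    def notify(self, user_id: str, event: dict, exclude_device: str = None):
        # Push a lightweight ping; recipients then fetch details from the
        # Metadata Service. The originating device is excluded to avoid echo.
        for device_id, queue in self.connections[user_id].items():
            if device_id != exclude_device:
                queue.append(event)
```

Keeping notifications thin means a burst of changes costs only small messages on the persistent connections, and devices that miss a ping can reconcile by asking the Metadata Service for changes since their last cursor.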
At scale, naive storage of every file version becomes prohibitively expensive. Interviewers look for strategies to reduce storage costs without sacrificing user experience.
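A common cost lever is content-addressed chunk storage: identical chunks hash to the same key, so re-uploads, shared files, and unchanged chunks of new versions are stored once and reference-counted. A minimal in-memory sketch (the class is hypothetical; real systems back this with object storage plus a hash index):

```python
import hashlib


class DedupChunkStore:
    """Content-addressed store: identical chunks (same SHA-256) are kept
    once and reference-counted, so duplicates cost no extra space."""

    def __init__(self):
        self.blobs = {}      # hash -> bytes
        self.refcounts = {}  # hash -> int

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = chunk  # only the first copy pays for storage
        self.refcounts[digest] = self.refcounts.get(digest, 0) + 1
        return digest

    def release(self, digest: str):
        """Decrement on file deletion; reclaim the blob at zero refs."""
        self.refcounts[digest] -= 1
        if self.refcounts[digest] == 0:
            del self.blobs[digest], self.refcounts[digest]

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blobs.values())
```

The same structure makes version history cheap: a new version stores only the chunks that actually changed, while unchanged chunks just gain a reference.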
Confirm the scope with your interviewer. Ask about maximum file sizes and whether the system must handle very large files (10+ GB). Clarify whether real-time collaboration (simultaneous editing of the same file) is in scope or just asynchronous sync. Determine how many devices per user and whether mobile, desktop, and web clients all need full sync capability. Establish version history depth requirements and whether deleted files should be recoverable. Ask about sharing scale -- can a single folder be shared with thousands of collaborators?
Sketch the major components: client applications (desktop sync agent, mobile app, web interface) that maintain a local file index and communicate with the backend; an API gateway handling authentication, rate limiting, and routing; a Metadata Service backed by PostgreSQL for file and folder records, permissions, and version history; an object storage layer (S3) for actual file content accessed via pre-signed URLs; a Sync Service that maintains a change log in Kafka and pushes notifications to connected devices; a Notification Gateway managing WebSocket connections; and a Redis cache for hot metadata, upload session state, and connection routing. Show the separation between the metadata path (client to API to Metadata Service to Postgres) and the data path (client directly to S3 via pre-signed URL).
Walk through an end-to-end file upload. The client computes a content hash for the file, splits it into chunks, and requests an upload session from the Metadata Service. The service creates a pending file record in Postgres and returns pre-signed URLs for each chunk. The client uploads chunks directly to S3 in parallel, retrying failures for individual chunks. After all chunks succeed, the client calls a finalize endpoint. The Metadata Service verifies all chunks are present, assembles the file manifest, transitions the file record to "committed," and publishes a change event to Kafka. The Sync Service consumes this event, looks up all devices associated with the user in Redis, and pushes a lightweight notification through the WebSocket gateway. Other devices receive the notification, query the Metadata Service for details, and download the new file chunks from S3 using pre-signed URLs.
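The finalize step described above can be sketched as a verification-then-commit function. The session/manifest shapes are assumptions; the key properties shown are that finalize refuses to commit until every expected chunk checksum matches, and that retrying finalize on an already-committed session is a harmless no-op.

```python
def finalize_upload(session: dict, uploaded_parts: list) -> dict:
    """Finalize sketch: verify every expected chunk arrived with a matching
    checksum before committing the file record. Idempotent: re-finalizing
    a committed session changes nothing, so client retries are safe."""
    if session["state"] == "committed":
        return {"state": "committed", "changed": False}  # safe retry
    by_index = {p["index"]: p for p in uploaded_parts}
    for expected in session["manifest"]:
        got = by_index.get(expected["index"])
        if got is None or got["sha256"] != expected["sha256"]:
            return {"state": "pending", "missing_index": expected["index"]}
    session["state"] = "committed"
    # Here the real service would update Postgres transactionally and
    # publish the change event to Kafka.
    return {"state": "committed", "changed": True}
```

Idempotent finalization matters because the client may time out after the server commits but before the response arrives; the retry must not double-publish the change event.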
Cover conflict resolution by explaining the version vector comparison on upload finalization and the "conflicted copy" strategy for divergent offline edits. Discuss sharing by maintaining an ACL table in Postgres with row-level security, propagating share events through Kafka for real-time notification to collaborators. Address search by indexing file names and metadata in Elasticsearch with CDC from Postgres. Touch on monitoring: track upload success rates, sync lag (time between change commit and device notification), storage utilization, and chunk deduplication ratios. Discuss scaling: shard Postgres by user ID, partition Kafka by user ID for ordered per-user change streams, and auto-scale WebSocket gateway instances based on active connection count.
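The shard-by-user-ID scheme can be illustrated with deterministic hash routing; the shard count is an assumed constant (real deployments pair this with a directory service or consistent hashing to allow resharding).

```python
import hashlib

NUM_SHARDS = 16  # assumed fixed shard count for illustration


def shard_for_user(user_id: str) -> int:
    """Deterministic routing: hash the user ID so all of one user's
    metadata lands on a single Postgres shard, and the same key choice
    for Kafka partitioning preserves per-user ordering of change events."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Because both the database shard and the Kafka partition are derived from the same key, every event for a given user is processed in order on one stream, which is what the sync protocol relies on.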