Design a Distributed File System
System Design · Must
Problem Statement
Design a cloud-based object storage service similar to Amazon S3 or Google Cloud Storage that allows millions of users to store and retrieve files of any size over the internet. Your system must handle everything from small configuration files to multi-terabyte video archives, serving concurrent requests from applications worldwide while maintaining a data durability guarantee of 99.999999999% (11 nines).
The service should support standard object storage operations including uploading objects to named buckets, retrieving entire objects or byte ranges, listing bucket contents, and deleting objects. You'll need to design for tens of exabytes of total storage across millions of buckets, handle peak traffic of millions of requests per second, and ensure that data survives multiple concurrent hardware failures without loss. Consider how you'll organize metadata, replicate data across geographic regions, optimize costs through storage tiers, and provide strong consistency guarantees for critical operations.
Key Requirements
Functional
- Object upload and storage -- Users must be able to upload objects (files) ranging from a few bytes to multiple terabytes into named buckets under unique keys
- Object retrieval -- Support fetching complete objects or specific byte ranges with high throughput and low latency
- Bucket and object listing -- Users should be able to list all buckets they own and enumerate objects within a bucket, optionally filtered by key prefix
- Object deletion -- Support immediate deletion of individual objects or batch deletion of thousands of objects in a single request
- Metadata management -- Each object should support custom key-value metadata tags and standard attributes like content type, creation time, and size
Non-Functional
- Scalability -- Handle exabytes of storage, billions of objects, and millions of concurrent read/write operations per second across global regions
- Reliability -- Achieve 11 nines of data durability through redundant storage and 99.99% availability with automatic failover
- Latency -- Provide sub-100ms response time for metadata operations and support multi-gigabit throughput for large object transfers
- Consistency -- Guarantee read-after-write consistency for new object uploads and eventual consistency for bucket listings and replicated objects
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Metadata and Data Path Separation
Interviewers want to see if you understand that metadata operations (listing, permissions, versioning) must be separated from the data storage path to achieve independent scaling and performance optimization.
Hints to consider:
- Design a dedicated metadata service using a distributed database that can handle billions of object records partitioned by bucket
- Keep the data path simple by storing large objects directly on chunk servers or blob stores, returning signed URLs or chunk manifests to clients
- Cache frequently accessed metadata (bucket listings, object manifests) in memory tiers to reduce database load and improve p99 latency
- Consider how to handle hot buckets where thousands of clients list or write simultaneously without creating metadata bottlenecks
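The separation above can be sketched in a few lines of Python. This is a toy model, not any real system's API: `MetadataService`, `ObjectRecord`, and the partition/cache layout are illustrative assumptions standing in for a sharded distributed database with a memory cache in front of it. The key point is that the service returns only a chunk manifest; object bytes never flow through it.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ObjectRecord:
    key: str
    size: int
    chunk_ids: list  # ordered chunk IDs; the bytes live on chunk servers

class MetadataService:
    """Toy metadata tier: records are partitioned by bucket so each
    partition (a shard of a distributed DB in practice) scales
    independently of the data path."""

    def __init__(self, num_partitions=16):
        self.partitions = [dict() for _ in range(num_partitions)]
        self.cache = {}  # hot-manifest cache in front of the "database"

    def _partition(self, bucket):
        h = int(hashlib.md5(bucket.encode()).hexdigest(), 16)
        return self.partitions[h % len(self.partitions)]

    def put(self, bucket, record):
        self._partition(bucket).setdefault(bucket, {})[record.key] = record
        self.cache.pop((bucket, record.key), None)  # invalidate on write

    def get_manifest(self, bucket, key):
        """Return only chunk IDs; the client fetches bytes from chunk
        servers directly, keeping large transfers off this service."""
        if (bucket, key) in self.cache:
            return self.cache[(bucket, key)]
        rec = self._partition(bucket).get(bucket, {}).get(key)
        manifest = rec.chunk_ids if rec else None
        self.cache[(bucket, key)] = manifest
        return manifest
```

In an interview, the follow-up is usually what happens when one bucket's partition gets hot; per-bucket caching and sub-partitioning by key prefix are common answers.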
2. Chunking and Replication Strategy
Large objects cannot be stored as single monolithic blobs, and you need to balance durability, cost, and read performance across different object sizes and access patterns.
Hints to consider:
- Split objects larger than a threshold (e.g., 64MB) into fixed-size chunks that can be stored, replicated, and garbage collected independently
- Use different replication strategies based on access patterns: 3-way replication for hot data and frequently accessed objects, erasure coding (8+3 or similar) for cold archival data to reduce storage costs
- Implement checksumming at the chunk level using algorithms like CRC32C or MD5 to detect bit rot and validate data integrity during reads
- Design a manifest structure that maps object keys to ordered lists of chunk IDs, enabling efficient range reads by translating byte offsets to specific chunks
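The manifest-based range read in the last hint reduces to simple offset arithmetic. A minimal sketch, assuming fixed-size 64 MB chunks and a manifest that is just an ordered list of chunk IDs (the function name and signature are illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed-size 64 MB chunks, per the threshold above

def chunks_for_range(chunk_ids, start, end):
    """Translate an inclusive byte range [start, end] into the chunks
    that hold it, plus the byte slice to read within each chunk."""
    assert 0 <= start <= end < len(chunk_ids) * CHUNK_SIZE
    first = start // CHUNK_SIZE
    last = end // CHUNK_SIZE
    plan = []
    for i in range(first, last + 1):
        chunk_start = i * CHUNK_SIZE
        lo = max(start, chunk_start) - chunk_start
        hi = min(end, chunk_start + CHUNK_SIZE - 1) - chunk_start
        plan.append((chunk_ids[i], lo, hi))
    return plan
```

A range that straddles a chunk boundary yields two reads, one against the tail of the first chunk and one against the head of the next, which the client can issue in parallel to different chunk servers.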
3. Handling Multi-Region and Cross-Region Replication
Global users expect low latency access regardless of location, and enterprises require geographic redundancy for disaster recovery without sacrificing consistency guarantees.
Hints to consider:
- Implement a regional architecture where each region has independent metadata and storage clusters, with buckets optionally replicated across regions based on user configuration
- Use a global namespace service or consistent hashing to route requests to the home region for a bucket, then apply async replication to secondary regions
- Provide strong consistency within a region using consensus protocols like Raft or Paxos for critical metadata operations while accepting eventual consistency for cross-region replicas
- Address conflict resolution for rare cases where writes occur in multiple regions simultaneously, potentially using last-write-wins timestamps, vector clocks to detect conflicting versions, or restricting each bucket to single-region writes
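The "home region per bucket" routing in the second hint is often built on a consistent-hash ring. A small sketch under assumed names (`RegionRouter` and the virtual-node count are illustrative; production systems typically use a global namespace service backed by a replicated database rather than pure hashing):

```python
import bisect
import hashlib

class RegionRouter:
    """Consistent-hash ring mapping each bucket to a home region.
    Writes go to the home region; async replication fans out to the
    bucket's configured secondary regions."""

    def __init__(self, regions, vnodes=100):
        # Virtual nodes smooth out the load across regions on the ring.
        self.ring = sorted(
            (self._hash(f"{region}#{v}"), region)
            for region in regions
            for v in range(vnodes)
        )

    @staticmethod
    def _hash(s):
        return int(hashlib.sha256(s.encode()).hexdigest(), 16)

    def home_region(self, bucket):
        h = self._hash(bucket)
        # First ring entry clockwise from the bucket's hash, wrapping around.
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]
```

Because the mapping is deterministic, any gateway in any region can compute a bucket's home region locally, with no lookup on the hot path.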
4. Optimizing Large Object Uploads
Uploading terabyte-scale objects over unreliable internet connections requires resumability, parallelism, and efficient commit semantics to avoid wasting bandwidth and storage.
Hints to consider:
- Support multipart uploads where clients break large objects into parts (5MB to 5GB each), upload them independently in parallel, then commit all parts atomically with a final manifest
- Store uncommitted parts in temporary storage with TTL-based expiration to automatically clean up abandoned uploads after 7 days
- Generate pre-signed upload URLs that allow clients to write chunks directly to storage nodes without proxying through API servers, reducing gateway bottlenecks
- Implement atomic commit by writing the object manifest only after verifying all chunks exist and their checksums match, ensuring no partial objects are visible to readers
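The commit semantics above can be made concrete with a toy in-memory version. This is a sketch only: `MultipartUpload` and its methods are invented for illustration, and real systems stage parts on disk with TTLs rather than in a dict. What it does show is the ordering that matters, namely that every part's checksum is verified before the manifest (here, `committed`) is written, so readers never observe a partial object.

```python
import hashlib

class MultipartUpload:
    """Toy multipart commit: parts accumulate in temporary storage and
    the object becomes visible only once the final manifest is written,
    after every part's checksum has been verified."""

    def __init__(self):
        self.pending = {}    # part_number -> (data, client-claimed MD5)
        self.committed = None  # stands in for the object manifest

    def upload_part(self, part_number, data, claimed_md5):
        self.pending[part_number] = (data, claimed_md5)

    def complete(self):
        # Verify every part before making anything visible.
        for n, (data, claimed) in self.pending.items():
            if hashlib.md5(data).hexdigest() != claimed:
                raise ValueError(f"checksum mismatch on part {n}")
        # Writing the manifest is the single atomic commit point.
        self.committed = [self.pending[n][0] for n in sorted(self.pending)]
        return b"".join(self.committed)
```

If `complete` is never called, the pending parts are exactly the abandoned state that the TTL-based cleanup in the second hint reclaims.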
5. Cost Optimization and Storage Tiering
At exabyte scale, storage costs dominate operational expenses, so you must provide multiple storage classes with different durability, availability, and pricing characteristics.
Hints to consider:
- Offer hot, warm, and cold storage tiers where hot data uses fast SSDs with 3-way replication, warm data uses HDDs with erasure coding, and cold archival uses tape or glacier-style deep storage
- Implement lifecycle policies that automatically transition objects between tiers based on age or access patterns, moving infrequently accessed data to cheaper storage after 30/90/365 days
- Design a retrieval system for cold data that trades latency (minutes to hours) for cost, pre-warming caches when users request archived objects
- Consider thin provisioning and deduplication at the chunk level to reduce storage for objects with identical content blocks across different keys or buckets
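A lifecycle policy like the one described reduces to a small decision function. The tier names and thresholds below are assumptions chosen to match the 30/90/365-day example above, not any provider's actual policy:

```python
# Illustrative tier labels: replication strategy encoded in the name.
HOT, WARM, COLD = "hot-ssd-3x", "warm-hdd-ec", "cold-archive"

def target_tier(age_days, days_since_access):
    """Pick a storage tier from object age and access recency.
    Real systems evaluate rules like these per bucket configuration,
    typically in a background scanner rather than on the read path."""
    if age_days >= 365 or days_since_access >= 180:
        return COLD
    if age_days >= 90 or days_since_access >= 30:
        return WARM
    return HOT
```

A background job periodically compares each object's current tier against `target_tier` and enqueues migrations, so tier transitions never block client reads or writes.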