Design a cloud-based object storage service similar to Amazon S3 or Google Cloud Storage that allows millions of users to store and retrieve files of any size over the internet. Your system must handle everything from small configuration files to multi-terabyte video archives, serving concurrent requests from applications worldwide while maintaining 99.999999999% (11 nines) data durability and high availability.
The service should support standard object storage operations including uploading objects to named buckets, retrieving entire objects or byte ranges, listing bucket contents, and deleting objects. You'll need to design for tens of exabytes of total storage across millions of buckets, handle peak traffic of millions of requests per second, and ensure that data survives multiple concurrent hardware failures without loss. Consider how you'll organize metadata, replicate data across geographic regions, optimize costs through storage tiers, and provide strong consistency guarantees for critical operations.
Object upload and storage -- Users must be able to upload objects (files) of any size from bytes to terabytes into named buckets with unique keys
Object retrieval -- Support fetching complete objects or specific byte ranges with high throughput and low latency
Bucket and object listing -- Users should be able to list all buckets they own and enumerate objects within a bucket, optionally filtered by key prefix
Object deletion -- Support immediate deletion of individual objects or batch deletion of thousands of objects in a single request
Metadata management -- Each object should support custom key-value metadata tags and standard attributes like content type, creation time, and size
Scalability -- Handle exabytes of storage, billions of objects, and millions of concurrent read/write operations per second across global regions
Reliability -- Achieve 11 nines of data durability through redundant storage and 99.99% availability with automatic failover
Latency -- Provide sub-100ms response time for metadata operations and support multi-gigabit throughput for large object transfers
Consistency -- Guarantee read-after-write consistency for new object uploads, while accepting eventual consistency for bucket listings and cross-region replicas
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see if you understand that metadata operations (listing, permissions, versioning) must be separated from the data storage path to achieve independent scaling and performance optimization.
Design a dedicated metadata service using a distributed database that can handle billions of object records partitioned by bucket
Keep the data path simple by storing large objects directly on chunk servers or blob stores, returning signed URLs or chunk manifests to clients
Cache frequently accessed metadata (bucket listings, object manifests) in memory tiers to reduce database load and improve p99 latency
Consider how to handle hot buckets where thousands of clients list or write simultaneously without creating metadata bottlenecks
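The separation above can be sketched in a few lines: route each bucket's records to a fixed metadata partition, and serve hot reads through a cache-aside layer so listing-heavy buckets don't hammer the database. This is a minimal in-memory stand-in; the partition count, TTL, and class names are illustrative assumptions, not a real service API.

```python
import hashlib
import time

NUM_PARTITIONS = 64  # hypothetical number of metadata DB partitions


def partition_for(bucket: str) -> int:
    """Route all of a bucket's object records to one partition so
    prefix-ordered listings stay within a single shard."""
    digest = hashlib.md5(bucket.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


class MetadataService:
    """Cache-aside reads over partitioned object metadata (in-memory stand-in)."""

    def __init__(self):
        # partition index -> {(bucket, key): record}
        self.partitions = [dict() for _ in range(NUM_PARTITIONS)]
        self.cache = {}       # (bucket, key) -> (record, cached_at)
        self.cache_ttl = 5.0  # seconds; a short TTL bounds staleness on hot buckets

    def put(self, bucket, key, record):
        self.partitions[partition_for(bucket)][(bucket, key)] = record
        self.cache.pop((bucket, key), None)  # invalidate on write

    def get(self, bucket, key):
        entry = self.cache.get((bucket, key))
        if entry and time.time() - entry[1] < self.cache_ttl:
            return entry[0]  # cache hit: no database round trip
        record = self.partitions[partition_for(bucket)].get((bucket, key))
        if record is not None:
            self.cache[(bucket, key)] = (record, time.time())
        return record
```

Note that writes invalidate rather than update the cache entry, a common choice that avoids serving a stale record if two writers race on the same key.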
Large objects cannot be stored as single monolithic blobs, so you must split them into chunks and balance durability, cost, and read performance across different object sizes and access patterns.
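The chunking idea can be sketched as follows: split an object into fixed-size chunks, checksum each one so corruption is detectable on read, and record the ordered checksums as the object's manifest. The 4 MiB chunk size is an illustrative assumption; real systems tune it per storage tier.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB chunk size


def split_into_chunks(data: bytes):
    """Split an object into fixed-size chunks, each paired with a content
    checksum. The ordered (checksum, chunk) list forms the object manifest."""
    chunks = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunks.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return chunks


def reassemble(chunks):
    """Verify each chunk against its checksum, then concatenate in order."""
    out = bytearray()
    for checksum, chunk in chunks:
        if hashlib.sha256(chunk).hexdigest() != checksum:
            raise ValueError("chunk corrupted")
        out.extend(chunk)
    return bytes(out)
```

Because each chunk is independently addressable, reads of byte ranges only fetch the covering chunks, and chunks can be replicated or erasure-coded independently of the object they belong to.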
Global users expect low latency access regardless of location, and enterprises require geographic redundancy for disaster recovery without sacrificing consistency guarantees.
Implement a regional architecture where each region has independent metadata and storage clusters, with buckets optionally replicated across regions based on user configuration
Use a global namespace service or consistent hashing to route requests to the home region for a bucket, then apply async replication to secondary regions
Provide strong consistency within a region using consensus protocols like Raft or Paxos for critical metadata operations while accepting eventual consistency for cross-region replicas
Address conflict resolution for rare cases where writes occur in multiple regions simultaneously, potentially using last-writer-wins based on timestamps, vector clocks to detect conflicting versions, or prompting users to restrict a bucket to single-region writes
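The home-region routing described above is commonly built on a consistent-hash ring, so adding or removing a region only remaps a small fraction of buckets. A minimal sketch, where the region names and virtual-node count are illustrative assumptions:

```python
import bisect
import hashlib


class RegionRing:
    """Consistent-hash ring mapping each bucket to its home region."""

    def __init__(self, regions, vnodes=100):
        # Each region gets many virtual nodes to smooth out load across the ring.
        self.ring = []  # sorted list of (hash, region)
        for region in regions:
            for i in range(vnodes):
                h = int(hashlib.md5(f"{region}#{i}".encode()).hexdigest(), 16)
                self.ring.append((h, region))
        self.ring.sort()

    def home_region(self, bucket: str) -> str:
        """First ring entry at or after the bucket's hash, wrapping around."""
        h = int(hashlib.md5(bucket.encode()).hexdigest(), 16)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

A request for any bucket is first routed to `home_region(bucket)` for the strongly consistent write path; asynchronous replication then fans out to the secondary regions configured on the bucket.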
Uploading terabyte-scale objects over unreliable internet connections requires resumability, parallelism, and efficient commit semantics to avoid wasting bandwidth and storage.
Support multipart uploads where clients break large objects into parts (5MB to 5GB each), upload them independently in parallel, then commit all parts atomically with a final manifest
Store uncommitted parts in temporary storage with TTL-based expiration to automatically clean up abandoned uploads (for example, after 7 days)
Generate pre-signed upload URLs that allow clients to write chunks directly to storage nodes without proxying through API servers, reducing gateway bottlenecks
Implement atomic commit by writing the object manifest only after verifying all chunks exist and their checksums match, ensuring no partial objects are visible to readers
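The commit semantics above can be sketched as a small state machine: parts are staged independently, and the object only becomes visible when `complete()` verifies every part's checksum and writes the manifest. The class and method names here are illustrative, not the actual S3 API.

```python
import hashlib


class MultipartUpload:
    """Parts are staged in temporary storage; the object becomes visible only
    after complete() verifies every part and commits the manifest."""

    def __init__(self, expected_parts: int):
        self.expected_parts = expected_parts
        self.staged = {}  # part_number -> (checksum, data)

    def upload_part(self, part_number: int, data: bytes) -> str:
        checksum = hashlib.md5(data).hexdigest()
        self.staged[part_number] = (checksum, data)
        return checksum  # client compares this against its own hash

    def complete(self, client_manifest: dict) -> bytes:
        """client_manifest maps part_number -> checksum. Atomic commit:
        verify all parts exist and match before any reader sees the object."""
        if set(client_manifest) != set(range(1, self.expected_parts + 1)):
            raise ValueError("missing parts")
        for n, checksum in client_manifest.items():
            if self.staged.get(n, (None,))[0] != checksum:
                raise ValueError(f"checksum mismatch on part {n}")
        # Only now is the manifest written; assemble parts in order.
        return b"".join(self.staged[n][1] for n in sorted(self.staged))
```

Because verification happens before the manifest write, a crash or a corrupted part leaves no partially visible object, only staged parts that the TTL cleanup will eventually reclaim.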
At exabyte scale, storage costs dominate operational expenses, so you must provide multiple storage classes with different durability, availability, and pricing characteristics.
Offer hot, warm, and cold storage tiers where hot data uses fast SSDs with 3-way replication, warm data uses HDDs with erasure coding, and cold archival uses tape or glacier-style deep storage
Implement lifecycle policies that automatically transition objects between tiers based on age or access patterns, moving infrequently accessed data to cheaper storage after 30/90/365 days
Design a retrieval system for cold data that trades latency (minutes to hours) for cost, pre-warming caches when users request archived objects
Consider thin provisioning and deduplication at the chunk level to reduce storage for objects with identical content blocks across different keys or buckets
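Chunk-level deduplication is usually implemented as a content-addressed store: a chunk's identity is the hash of its bytes, so identical chunks under different keys or buckets are stored once and reference-counted for safe deletion. A minimal sketch, with illustrative names:

```python
import hashlib


class DedupChunkStore:
    """Content-addressed chunk store with reference counting."""

    def __init__(self):
        self.chunks = {}    # sha256 hex digest -> chunk bytes
        self.refcount = {}  # sha256 hex digest -> number of references

    def put(self, chunk: bytes) -> str:
        h = hashlib.sha256(chunk).hexdigest()
        if h not in self.chunks:
            self.chunks[h] = chunk  # only the first copy pays the storage cost
        self.refcount[h] = self.refcount.get(h, 0) + 1
        return h  # object manifests reference chunks by this digest

    def release(self, h: str):
        self.refcount[h] -= 1
        if self.refcount[h] == 0:  # last reference gone: reclaim the space
            del self.chunks[h]
            del self.refcount[h]
```

The trade-off to mention in an interview: dedup saves capacity for workloads with repeated content (VM images, backups) but adds a hash lookup to every write and makes deletion a reference-counting problem rather than a simple unlink.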