Design a block storage system
Problem Statement
You are asked to design a distributed block storage system similar to AWS EBS or Google Persistent Disk. The system exposes fixed-size volumes that virtual machines mount as raw block devices. Every read must return the most recently acknowledged write, making strict read-after-write consistency a non-negotiable requirement.
The architecture must cleanly separate the control plane (volume lifecycle, placement, snapshot management) from the data plane (actual block I/O). The control plane handles provisioning, attaching volumes to VMs, and orchestrating snapshots, while the data plane must deliver predictable, low-latency I/O regardless of neighboring workloads on the same physical infrastructure.
A production-grade system must also support point-in-time snapshots for backup and cloning, replication across failure domains for durability, and quality-of-service mechanisms that prevent one noisy tenant from degrading another tenant's performance.
Key Requirements
Functional
- Volume CRUD and attachment -- Create, resize, delete, and attach/detach fixed-size block volumes to virtual machines with strict ownership semantics.
- Read-after-write consistency -- Any read issued after a write acknowledgment must return the written data, even if the read is served by a different replica.
- Point-in-time snapshots -- Capture a consistent snapshot of a volume at any moment, enabling restore and cross-region clone operations.
- Replication across failure domains -- Each block is replicated to multiple storage nodes across different racks or availability zones to survive hardware failures.
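The replication requirement above boils down to a placement decision: each replica of a volume must land in a distinct failure domain. A minimal sketch of that rule in Python (the `place_replicas` function and its node-map input are illustrative, not a real scheduler; a production placer would also weight by free capacity and load):

```python
def place_replicas(nodes, replication_factor=3):
    """Spread a volume's replicas across distinct failure domains.

    `nodes` maps node_id -> failure_domain (rack or AZ). This is a toy
    sketch: it takes one node per domain, ignoring capacity and load.
    """
    by_domain = {}
    for node, domain in sorted(nodes.items()):
        by_domain.setdefault(domain, []).append(node)
    if len(by_domain) < replication_factor:
        raise ValueError("not enough failure domains for the requested replication")
    # One node from each of the first `replication_factor` domains,
    # so no two replicas share a rack or availability zone.
    return [members[0] for _, members in sorted(by_domain.items())[:replication_factor]]
```

The key invariant is that the chosen replicas map to `replication_factor` distinct domains, so any single rack or zone failure leaves a majority of copies intact.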
Non-Functional
- Scalability -- Support millions of volumes across a fleet of storage nodes, with individual volumes ranging from 1 GB to 64 TB.
- Latency -- Sub-millisecond P50 and single-digit millisecond P99 latency for 4 KB random reads and writes on the data path.
- Availability -- 99.999% volume availability; a single storage node failure must not cause any volume downtime.
- Isolation -- IOPS and throughput guarantees per volume must hold even when co-located volumes experience heavy load (noisy neighbor protection).
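The isolation requirement is usually enforced with per-volume rate limiting; a common building block is a token bucket that refills at the volume's provisioned IOPS and allows short bursts. A minimal sketch (class and method names are illustrative):

```python
import time

class TokenBucket:
    """Per-volume IOPS limiter: refills at `rate` tokens/sec up to `burst`.

    Illustrative sketch of noisy-neighbor protection, not a real QoS layer.
    """
    def __init__(self, rate, burst):
        self.rate = rate          # sustained IOPS guarantee
        self.burst = burst        # short-term burst allowance
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        # Refill proportionally to elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False              # caller queues or rejects the I/O
```

A data plane node would hold one bucket per attached volume, so a tenant issuing 100k IOPS against one volume cannot starve a co-located volume with a 1k IOPS guarantee.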
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Consistency Model and Replication Protocol
Interviewers want you to articulate exactly how you guarantee read-after-write consistency across replicas. They probe whether you use chain replication, quorum writes, or a primary-backup scheme and how each choice affects tail latency.
Hints to consider:
- Compare chain replication (strong ordering, higher write latency) with quorum-based approaches (more flexible, but requiring careful read routing)
- Think about how the primary is elected and what happens during failover — does the system block writes or risk stale reads?
- Consider how you handle in-flight writes during a replica failure and ensure no acknowledged write is lost
- Explore how ZooKeeper or a similar coordination service tracks the authoritative replica set for each volume
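To make the quorum option in the hints concrete, here is a minimal in-memory sketch of the overlap argument: with N replicas, acknowledging writes at W copies and reading R copies guarantees read-after-write whenever W + R > N, because every read quorum intersects every write quorum. The class and its versioning scheme are illustrative, not a real replication protocol:

```python
class QuorumVolume:
    """Quorum replication sketch: N replicas, W write acks, R read responses.

    W + R > N forces the quorums to overlap, so at least one replica in any
    read quorum holds the latest acknowledged write.
    """
    def __init__(self, n=3, w=2, r=2):
        assert w + r > n, "quorums must overlap for read-after-write"
        self.replicas = [dict() for _ in range(n)]  # block -> (version, data)
        self.n, self.w, self.r = n, w, r
        self.version = 0

    def write(self, block, data):
        self.version += 1
        # Acknowledge once W replicas persist; the rest catch up asynchronously.
        for replica in self.replicas[: self.w]:
            replica[block] = (self.version, data)

    def read(self, block):
        # Query R replicas and return the highest-versioned copy; the overlap
        # with the write quorum is what makes this copy fresh.
        answers = [rep.get(block) for rep in self.replicas[-self.r:]]
        answers = [a for a in answers if a is not None]
        return max(answers)[1] if answers else None
```

Note the trade-off the hints point at: a larger W tightens durability but raises write tail latency, while chain replication gets the same guarantee by ordering all writes through the chain instead of versioned quorums.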
2. Control Plane vs. Data Plane Separation
A clean separation is critical for both reliability and performance. Interviewers look for evidence that a control plane outage does not affect active I/O and that the data plane can operate autonomously.
Hints to consider:
- Think about what metadata the data plane needs locally to serve I/O without contacting the control plane on every request
- Consider how volume placement decisions flow from the control plane to data plane nodes
- Explore how the control plane detects unhealthy data plane nodes and triggers re-replication without disrupting active I/O
- Discuss how Kafka or an event bus can decouple control plane commands from data plane execution
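The first two hints above can be sketched as a data plane node that caches placement metadata locally and applies asynchronous control-plane updates with epoch numbers, so stale or reordered messages can never roll state back and a control plane outage never blocks live I/O. All names here (`apply_placement`, `serve_read`, the map layout) are assumptions for illustration:

```python
class DataPlaneNode:
    """Sketch of a data-plane node serving I/O from locally cached metadata."""
    def __init__(self):
        self.volume_map = {}   # volume_id -> {"epoch": int, "replicas": [...]}
        self.blocks = {}       # (volume_id, block_index) -> data

    def apply_placement(self, volume_id, epoch, replicas):
        # Placement updates arrive asynchronously (e.g. via an event bus).
        # Ignore stale epochs so out-of-order delivery cannot regress state.
        current = self.volume_map.get(volume_id)
        if current is None or epoch > current["epoch"]:
            self.volume_map[volume_id] = {"epoch": epoch, "replicas": replicas}

    def serve_write(self, volume_id, block_index, data):
        # I/O needs only the local cache -- no control-plane round trip.
        if volume_id not in self.volume_map:
            raise KeyError("unknown volume: placement not yet received")
        self.blocks[(volume_id, block_index)] = data

    def serve_read(self, volume_id, block_index):
        return self.blocks.get((volume_id, block_index))
```

The epoch check is the essential piece: it lets the control plane re-replicate or move volumes while the data plane keeps serving, and converges once the newest placement message arrives.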
3. Snapshot Implementation
Point-in-time snapshots on a live volume require careful coordination. Interviewers probe whether you use copy-on-write, redirect-on-write, or a log-based approach and how snapshots interact with ongoing writes.
Hints to consider:
- Consider copy-on-write at the block level: before overwriting a block, copy the original to a snapshot region
- Think about how snapshot metadata tracks which blocks differ from the base volume
- Explore incremental snapshots that only store changed blocks relative to the previous snapshot
- Discuss how snapshot reads are served and what the read amplification cost looks like
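The copy-on-write hints above can be made concrete with a small sketch: each snapshot owns a region that receives a block's original data the first time that block is overwritten after the snapshot, and snapshot reads walk forward through later regions before falling through to the live volume. The class is illustrative, not a real snapshot engine:

```python
class COWVolume:
    """Copy-on-write snapshot sketch at block granularity."""
    def __init__(self):
        self.blocks = {}      # block -> data (live volume state)
        self.snapshots = []   # per-snapshot COW region: {block: original_data}

    def snapshot(self):
        self.snapshots.append({})       # new, initially empty COW region
        return len(self.snapshots) - 1  # snapshot id

    def write(self, block, data):
        # First overwrite since the latest snapshot: preserve the old data
        # in that snapshot's region before replacing it.
        if self.snapshots and block not in self.snapshots[-1] and block in self.blocks:
            self.snapshots[-1][block] = self.blocks[block]
        self.blocks[block] = data

    def read_snapshot(self, snap_id, block):
        # The first preserved copy found from `snap_id` onward is the value
        # at snapshot time; unmodified blocks fall through to live data.
        for region in self.snapshots[snap_id:]:
            if block in region:
                return region[block]
        return self.blocks.get(block)
```

This also shows the read-amplification cost the last hint asks about: a snapshot read may probe every later region before reaching the live volume, which is why real systems index regions or merge old snapshots.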