Design a block storage system
Problem Statement
You are asked to design a distributed block storage system similar to AWS EBS or Google Persistent Disk. The system exposes fixed-size volumes that virtual machines mount as raw block devices. Every read must return the most recently acknowledged write, making strict read-after-write consistency a non-negotiable requirement.
The architecture must cleanly separate the control plane (volume lifecycle, placement, snapshot management) from the data plane (actual block I/O). The control plane handles provisioning, attaching volumes to VMs, and orchestrating snapshots, while the data plane must deliver predictable, low-latency I/O regardless of neighboring workloads on the same physical infrastructure.
A production-grade system must also support point-in-time snapshots for backup and cloning, replication across failure domains for durability, and quality-of-service mechanisms that prevent one noisy tenant from degrading another tenant's performance.
Key Requirements
Functional
- Volume CRUD and attachment -- Create, resize, delete, and attach/detach fixed-size block volumes to virtual machines with strict ownership semantics.
- Read-after-write consistency -- Any read issued after a write acknowledgment must return the written data, even if the read is served by a different replica.
- Point-in-time snapshots -- Capture a consistent snapshot of a volume at any moment, enabling restore and cross-region clone operations.
- Replication across failure domains -- Each block is replicated to multiple storage nodes across different racks or availability zones to survive hardware failures.
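The replication requirement above boils down to a placement decision: each replica of a volume must land in a distinct failure domain. A minimal sketch of that rule in Python (the `place_replicas` function and its node-map input are illustrative, not a real scheduler; a production placer would also weight by free capacity and load):

```python
def place_replicas(nodes, replication_factor=3):
    """Spread a volume's replicas across distinct failure domains.

    `nodes` maps node_id -> failure_domain (rack or AZ). This is a toy
    sketch: it takes one node per domain, ignoring capacity and load.
    """
    by_domain = {}
    for node, domain in sorted(nodes.items()):
        by_domain.setdefault(domain, []).append(node)
    if len(by_domain) < replication_factor:
        raise ValueError("not enough failure domains for the requested replication")
    # One node from each of the first `replication_factor` domains,
    # so no two replicas share a rack or availability zone.
    return [members[0] for _, members in sorted(by_domain.items())[:replication_factor]]
```

The key invariant is that the chosen replicas map to `replication_factor` distinct domains, so any single rack or zone failure leaves a majority of copies intact.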
Non-Functional
- Scalability -- Support millions of volumes across a fleet of storage nodes, with individual volumes ranging from 1 GB to 64 TB.
- Latency -- Sub-millisecond P50 and single-digit millisecond P99 latency for 4 KB random reads and writes on the data path.
- Availability -- 99.999% volume availability; a single storage node failure must not cause any volume downtime.
- Isolation -- IOPS and throughput guarantees per volume must hold even when co-located volumes experience heavy load (noisy neighbor protection).
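The isolation requirement is usually enforced with per-volume rate limiting; a common building block is a token bucket that refills at the volume's provisioned IOPS and allows short bursts. A minimal sketch (class and method names are illustrative):

```python
import time

class TokenBucket:
    """Per-volume IOPS limiter: refills at `rate` tokens/sec up to `burst`.

    Illustrative sketch of noisy-neighbor protection, not a real QoS layer.
    """
    def __init__(self, rate, burst):
        self.rate = rate          # sustained IOPS guarantee
        self.burst = burst        # short-term burst allowance
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        # Refill proportionally to elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False              # caller queues or rejects the I/O
```

A data plane node would hold one bucket per attached volume, so a tenant issuing 100k IOPS against one volume cannot starve a co-located volume with a 1k IOPS guarantee.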
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Consistency Model and Replication Protocol
Interviewers want you to articulate exactly how you guarantee read-after-write consistency across replicas. They probe whether you use chain replication, quorum writes, or a primary-backup scheme and how each choice affects tail latency.
Hints to consider:
- Compare chain replication (strong ordering, higher write latency) with quorum-based approaches (more flexible, but requiring careful read routing)
- Think about how the primary is elected and what happens during failover — does the system block writes or risk stale reads?
- Consider how you handle in-flight writes during a replica failure and ensure no acknowledged write is lost
- Explore how ZooKeeper or a similar coordination service tracks the authoritative replica set for each volume
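To make the quorum option in the hints concrete, here is a minimal in-memory sketch of the overlap argument: with N replicas, acknowledging writes at W copies and reading R copies guarantees read-after-write whenever W + R > N, because every read quorum intersects every write quorum. The class and its versioning scheme are illustrative, not a real replication protocol:

```python
class QuorumVolume:
    """Quorum replication sketch: N replicas, W write acks, R read responses.

    W + R > N forces the quorums to overlap, so at least one replica in any
    read quorum holds the latest acknowledged write.
    """
    def __init__(self, n=3, w=2, r=2):
        assert w + r > n, "quorums must overlap for read-after-write"
        self.replicas = [dict() for _ in range(n)]  # block -> (version, data)
        self.n, self.w, self.r = n, w, r
        self.version = 0

    def write(self, block, data):
        self.version += 1
        # Acknowledge once W replicas persist; the rest catch up asynchronously.
        for replica in self.replicas[: self.w]:
            replica[block] = (self.version, data)

    def read(self, block):
        # Query R replicas and return the highest-versioned copy; the overlap
        # with the write quorum is what makes this copy fresh.
        answers = [rep.get(block) for rep in self.replicas[-self.r:]]
        answers = [a for a in answers if a is not None]
        return max(answers)[1] if answers else None
```

Note the trade-off the hints point at: a larger W tightens durability but raises write tail latency, while chain replication gets the same guarantee by ordering all writes through the chain instead of versioned quorums.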
2. Control Plane vs. Data Plane Separation
A clean separation is critical for both reliability and performance. Interviewers look for evidence that a control plane outage does not affect active I/O and that the data plane can operate autonomously.
Hints to consider:
- Think about what metadata the data plane needs locally to serve I/O without contacting the control plane on every request
- Consider how volume placement decisions flow from the control plane to data plane nodes
- Explore how the control plane detects unhealthy data plane nodes and triggers re-replication without disrupting active I/O
- Discuss how Kafka or an event bus can decouple control plane commands from data plane execution
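The first two hints above can be sketched as a data plane node that caches placement metadata locally and applies asynchronous control-plane updates with epoch numbers, so stale or reordered messages can never roll state back and a control plane outage never blocks live I/O. All names here (`apply_placement`, `serve_read`, the map layout) are assumptions for illustration:

```python
class DataPlaneNode:
    """Sketch of a data-plane node serving I/O from locally cached metadata."""
    def __init__(self):
        self.volume_map = {}   # volume_id -> {"epoch": int, "replicas": [...]}
        self.blocks = {}       # (volume_id, block_index) -> data

    def apply_placement(self, volume_id, epoch, replicas):
        # Placement updates arrive asynchronously (e.g. via an event bus).
        # Ignore stale epochs so out-of-order delivery cannot regress state.
        current = self.volume_map.get(volume_id)
        if current is None or epoch > current["epoch"]:
            self.volume_map[volume_id] = {"epoch": epoch, "replicas": replicas}

    def serve_write(self, volume_id, block_index, data):
        # I/O needs only the local cache -- no control-plane round trip.
        if volume_id not in self.volume_map:
            raise KeyError("unknown volume: placement not yet received")
        self.blocks[(volume_id, block_index)] = data

    def serve_read(self, volume_id, block_index):
        return self.blocks.get((volume_id, block_index))
```

The epoch check is the essential piece: it lets the control plane re-replicate or move volumes while the data plane keeps serving, and converges once the newest placement message arrives.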
3. Snapshot Implementation
Point-in-time snapshots on a live volume require careful coordination. Interviewers probe whether you use copy-on-write, redirect-on-write, or a log-based approach and how snapshots interact with ongoing writes.
Hints to consider:
- Consider copy-on-write at the block level: before overwriting a block, copy the original to a snapshot region
- Think about how snapshot metadata tracks which blocks differ from the base volume
- Explore incremental snapshots that only store changed blocks relative to the previous snapshot
- Discuss how snapshot reads are served and what the read amplification cost looks like
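The copy-on-write hints above can be made concrete with a small sketch: each snapshot owns a region that receives a block's original data the first time that block is overwritten after the snapshot, and snapshot reads walk forward through later regions before falling through to the live volume. The class is illustrative, not a real snapshot engine:

```python
class COWVolume:
    """Copy-on-write snapshot sketch at block granularity."""
    def __init__(self):
        self.blocks = {}      # block -> data (live volume state)
        self.snapshots = []   # per-snapshot COW region: {block: original_data}

    def snapshot(self):
        self.snapshots.append({})       # new, initially empty COW region
        return len(self.snapshots) - 1  # snapshot id

    def write(self, block, data):
        # First overwrite since the latest snapshot: preserve the old data
        # in that snapshot's region before replacing it.
        if self.snapshots and block not in self.snapshots[-1] and block in self.blocks:
            self.snapshots[-1][block] = self.blocks[block]
        self.blocks[block] = data

    def read_snapshot(self, snap_id, block):
        # The first preserved copy found from `snap_id` onward is the value
        # at snapshot time; unmodified blocks fall through to live data.
        for region in self.snapshots[snap_id:]:
            if block in region:
                return region[block]
        return self.blocks.get(block)
```

This also shows the read-amplification cost the last hint asks about: a snapshot read may probe every later region before reaching the live volume, which is why real systems index regions or merge old snapshots.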