Design and implement a Circuit Breaker pattern for distributed systems that prevents cascading failures by automatically detecting service degradation and blocking requests to failing services. The system must manage three states (CLOSED, OPEN, HALF_OPEN), handle concurrent access safely, and make intelligent decisions about when to attempt recovery.
This problem tests your understanding of concurrency control, state machines, fault tolerance patterns, and the trade-offs between different synchronization mechanisms. Interviewers will probe deeply on thread safety, atomic operations, and how your design prevents race conditions in high-concurrency environments.
Three-state management -- CLOSED (normal operation), OPEN (blocking requests), HALF_OPEN (testing recovery)
Automatic state transitions -- CLOSED to OPEN on threshold failures, OPEN to HALF_OPEN after timeout, HALF_OPEN to CLOSED on success
Failure counting -- track consecutive failures and trigger circuit opening at configurable threshold
Recovery timeout -- wait specified duration before attempting to test service recovery
Generic operation wrapping -- execute any operation through the circuit breaker with type-safe return values
State inspection -- allow monitoring systems to query current circuit state
Thread safety -- support concurrent calls from hundreds or thousands of threads without data corruption
Lock-free design -- prefer atomic operations over explicit locking for better performance under high concurrency
Low overhead -- circuit breaker should add minimal latency to successful operations (microseconds, not milliseconds)
Fail-fast -- when circuit is OPEN, reject requests immediately without attempting the operation
Configurable thresholds -- support different failure thresholds and recovery timeouts for different services
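A minimal sketch tying the requirements above together -- three states, threshold-based opening, fail-fast rejection, and generic type-safe wrapping. All names (CircuitBreaker, execute, threshold) are illustrative, not a definitive implementation:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative sketch of the requirements list; names are assumptions.
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private final AtomicInteger failures = new AtomicInteger(0);
    private final AtomicReference<Instant> lastFailure = new AtomicReference<>(Instant.EPOCH);
    private final int threshold;            // configurable failure threshold
    private final Duration recoveryTimeout; // configurable recovery timeout

    CircuitBreaker(int threshold, Duration recoveryTimeout) {
        this.threshold = threshold;
        this.recoveryTimeout = recoveryTimeout;
    }

    State state() { return state.get(); }   // state inspection for monitoring

    // Generic operation wrapping: any Supplier<T> runs through the breaker.
    <T> T execute(Supplier<T> operation) {
        if (state.get() == State.OPEN) {
            boolean timedOut = Duration.between(lastFailure.get(), Instant.now())
                    .compareTo(recoveryTimeout) > 0;
            // Fail fast while OPEN; after the timeout, exactly one caller
            // wins the CAS and probes the service in HALF_OPEN.
            if (!timedOut || !state.compareAndSet(State.OPEN, State.HALF_OPEN)) {
                throw new IllegalStateException("circuit open: failing fast");
            }
        }
        try {
            T result = operation.get();
            failures.set(0);
            state.set(State.CLOSED);        // success in HALF_OPEN closes the circuit
            return result;
        } catch (RuntimeException e) {
            lastFailure.set(Instant.now());
            // Any failure in HALF_OPEN reopens; in CLOSED, open at threshold.
            if (state.get() == State.HALF_OPEN || failures.incrementAndGet() >= threshold) {
                state.set(State.OPEN);
            }
            throw e;
        }
    }
}
```

Note the low-overhead happy path: a successful call costs one atomic read and two atomic writes, no locks.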
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see you choose appropriate concurrency primitives and explain why.
AtomicReference for state: lock-free state transitions using compare-and-set (CAS)
AtomicInteger for failure count: lock-free increment and reset operations
Why atomic operations over synchronized: no thread blocking, better performance at high concurrency
Race condition in OPEN to HALF_OPEN transition: only one thread should attempt recovery test
CAS pattern for state transitions: state.compareAndSet(OPEN, HALF_OPEN) ensures only one thread succeeds
Alternative: ReadWriteLock when locking is acceptable -- state reads vastly outnumber transitions, so a read-write lock keeps the common path cheap
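The CAS pattern above can be sketched as follows; only one of many racing threads wins the OPEN to HALF_OPEN transition. The enum and class names are illustrative:

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative names; the point is the single-winner CAS transition.
enum CircuitState { CLOSED, OPEN, HALF_OPEN }

class StateHolder {
    final AtomicReference<CircuitState> state =
            new AtomicReference<>(CircuitState.OPEN);

    // Lock-free: no thread ever blocks here. Exactly one caller sees `true`
    // and performs the recovery probe; all losers see `false` and fail fast.
    boolean tryHalfOpen() {
        return state.compareAndSet(CircuitState.OPEN, CircuitState.HALF_OPEN);
    }
}
```

This is how the design prevents a thundering herd of recovery probes: the exclusivity comes from the hardware-level compare-and-swap, not from a lock.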
How to handle concurrent state changes correctly without lost updates or inconsistent state.
Different approaches to counting failures and their trade-offs.
Consecutive failures: simpler, but one success resets count (can mask gradual degradation)
Sliding window: count failures in last N requests (more accurate but higher memory usage)
Time-based window: failures per time period (handles varying traffic rates better)
Reset on success: failure count resets to zero on successful call in CLOSED state
HALF_OPEN behavior: any failure immediately reopens circuit, success closes it
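The consecutive-failure strategy with reset-on-success can be sketched with a single AtomicInteger (class and method names are assumptions for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the consecutive-failure strategy: reset on success, trip at threshold.
class FailureCounter {
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);
    private final int threshold;

    FailureCounter(int threshold) { this.threshold = threshold; }

    // Returns true when the threshold is reached and the circuit should open.
    boolean recordFailure() {
        return consecutiveFailures.incrementAndGet() >= threshold;
    }

    // One success wipes the count -- the simplicity vs. accuracy trade-off
    // noted above: gradual degradation with occasional successes is masked.
    void recordSuccess() {
        consecutiveFailures.set(0);
    }

    int count() { return consecutiveFailures.get(); }
}
```

A sliding-window or time-based variant would replace the single counter with a ring buffer of recent outcomes or per-bucket counts, trading memory for accuracy.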
How to determine when to attempt recovery and handle timeout expiry correctly.
Store last failure timestamp using AtomicReference for thread-safe access
On each request during OPEN state, compare current time with last failure time
If Duration.between(lastFailure, now).compareTo(recoveryTimeout) > 0, attempt the transition to HALF_OPEN (Java's Duration has no > operator, so use compareTo)
Only one thread should perform the recovery test (use CAS for exclusive transition)
If recovery test fails, immediately return to OPEN with new timestamp
Exponential backoff for repeated failures: increase timeout after multiple failed recovery attempts
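The recovery-timeout logic above, including the exponential backoff on repeated failed probes, might be sketched like this (all names are illustrative assumptions):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of timeout expiry with exponential backoff.
class RecoveryTimer {
    private final AtomicReference<Instant> lastFailure =
            new AtomicReference<>(Instant.EPOCH);
    private final Duration baseTimeout;
    private volatile int failedRecoveries = 0;

    RecoveryTimer(Duration baseTimeout) { this.baseTimeout = baseTimeout; }

    void onFailure(Instant now) {
        lastFailure.set(now);
    }

    // A failed HALF_OPEN probe returns to OPEN with a fresh timestamp
    // and lengthens the next wait.
    void onFailedRecovery(Instant now) {
        failedRecoveries++;
        lastFailure.set(now);
    }

    void onSuccessfulRecovery() {
        failedRecoveries = 0;
    }

    // Timeout doubles per failed recovery attempt, capped at 2^10 x base.
    Duration effectiveTimeout() {
        return baseTimeout.multipliedBy(1L << Math.min(failedRecoveries, 10));
    }

    // Duration has no > operator in Java; compareTo does the comparison.
    boolean shouldAttemptRecovery(Instant now) {
        Duration elapsed = Duration.between(lastFailure.get(), now);
        return elapsed.compareTo(effectiveTimeout()) > 0;
    }
}
```

When shouldAttemptRecovery returns true, the caller still needs the CAS on the state to ensure only one thread actually performs the probe.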
How to instrument the circuit breaker for production monitoring.
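One lightweight approach, sketched below with illustrative names, is to keep atomic counters that a metrics library (Micrometer, Dropwizard Metrics, a StatsD client) can scrape alongside the state-inspection method:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative counters for production monitoring; names are assumptions.
class CircuitBreakerMetrics {
    final AtomicLong successes = new AtomicLong();
    final AtomicLong failures = new AtomicLong();
    final AtomicLong rejectedWhileOpen = new AtomicLong();
    final AtomicLong stateTransitions = new AtomicLong();

    // Failure rate over all completed calls, useful for alerting dashboards.
    double failureRate() {
        long total = successes.get() + failures.get();
        return total == 0 ? 0.0 : (double) failures.get() / total;
    }
}
```

Counting rejections separately from failures matters: a spike in rejectedWhileOpen tells operators the breaker is doing its job, not that the service got worse.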
Before designing, ask clarifying questions:
Expected concurrency level -- 10 threads vs 10,000 threads changes which synchronization primitives make sense
Workload shape -- whether to optimize for read-heavy or balanced workloads
Failure criteria -- whether configurable failure criteria are needed, since not all exceptions count as failures
Observability -- whether metrics/monitoring integration is expected