Design a library that enables services to collect and aggregate metrics. This is a foundational component used by many production systems to track events, measure performance, and power dashboards and alerting.
Your library should provide a simple interface for services to record and query metric values. Services need to increment counters for events like API calls, errors, and user actions; query metrics over specific time windows such as last minute, last hour, or last day; and support tagging metrics with dimensions like endpoint, region, or status code. This problem tests your ability to design clean APIs, manage memory-bounded data structures, and reason about concurrency in a high-throughput library context.
Increment counters -- record occurrences of events (e.g., API calls, errors, user actions) with an optional value parameter
Time-windowed aggregation -- query metrics over specific time windows (last minute, last hour, last day)
Multi-dimensional metrics -- support tagging metrics with key-value dimensions (e.g., endpoint, region, status code)
Query by tags -- retrieve aggregated counts filtered by specific tag combinations
Low overhead -- metric recording must add negligible latency to the host service; an increment call should complete in well under a microsecond
Memory bounded -- memory usage must be bounded even with multiple time windows and high-cardinality tags
Thread safety -- correctness under concurrent access from multiple threads recording metrics simultaneously
Flush reliability -- metrics must be exported to a central aggregation system without data loss during flush failures
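The requirements above might map to a recording-and-query interface along these lines. This is a hypothetical sketch for discussion; the names (`MetricsRegistry`, `Window`) and signatures are assumptions, not a prescribed API.

```java
import java.util.Map;

// Hypothetical API sketch; all names and signatures are illustrative assumptions.
interface MetricsRegistry {
    // Record one occurrence of an event, optionally with tag dimensions.
    void increment(String name, Map<String, String> tags);

    // Record an occurrence with an explicit value (e.g., bytes sent).
    void increment(String name, long value, Map<String, String> tags);

    // Aggregate count for a metric over a trailing window, filtered by tags.
    long query(String name, Window window, Map<String, String> tagFilter);

    enum Window { LAST_MINUTE, LAST_HOUR, LAST_DAY }
}
```

Keeping tags as an explicit parameter (rather than baked into the metric name) makes tag-filtered queries and cardinality enforcement easier later.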
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers want to see how you implement sliding vs. tumbling windows and the tradeoffs between them. This is the core algorithmic challenge of the problem.
Sliding windows give smoother results but require more memory and bookkeeping
Tumbling (fixed) windows are simpler but can distort results at window boundaries -- a burst straddling a boundary is split across two windows, so neither window reflects it fully
A ring buffer of fixed-size buckets (e.g., 60 one-second buckets for a one-minute window) is a common practical approach
Consider how you expire old buckets -- lazy cleanup on read vs. background cleanup thread
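The ring-buffer approach with lazy expiry can be sketched as below. This is a simplified illustration (class and method names are my own); each bucket records the window-cycle it belongs to, and a stale bucket is reset the first time it is touched in a new cycle, so no background cleanup thread is needed.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch: a sliding window approximated by N fixed-size buckets in a ring,
// e.g., 60 one-second buckets for a one-minute window. Lazy expiry on
// write/read; the reset race below is benign enough for a sketch.
class RingBufferWindow {
    private final int numBuckets;
    private final long bucketMillis;
    private final AtomicLongArray counts;
    private final AtomicLongArray bucketStarts; // start time each slot was last reset to

    RingBufferWindow(int numBuckets, long bucketMillis) {
        this.numBuckets = numBuckets;
        this.bucketMillis = bucketMillis;
        this.counts = new AtomicLongArray(numBuckets);
        this.bucketStarts = new AtomicLongArray(numBuckets);
    }

    void increment(long nowMillis) {
        int idx = bucketIndex(nowMillis);
        long bucketStart = nowMillis - (nowMillis % bucketMillis);
        long seen = bucketStarts.get(idx);
        // Lazy expiry: if this slot still holds an older cycle, reset it first.
        if (seen != bucketStart && bucketStarts.compareAndSet(idx, seen, bucketStart)) {
            counts.set(idx, 0);
        }
        counts.incrementAndGet(idx);
    }

    long sum(long nowMillis) {
        long windowStart = nowMillis - (long) numBuckets * bucketMillis;
        long total = 0;
        for (int i = 0; i < numBuckets; i++) {
            // Skip buckets whose last reset predates the window (stale slots).
            if (bucketStarts.get(i) > windowStart) total += counts.get(i);
        }
        return total;
    }

    private int bucketIndex(long nowMillis) {
        return (int) ((nowMillis / bucketMillis) % numBuckets);
    }
}
```

Passing the timestamp in (rather than reading the clock inside) keeps the structure testable and leaves the choice of clock -- wall vs. monotonic -- to the caller.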
Since multiple threads will call increment concurrently, interviewers probe your approach to lock-free or low-contention data structures.
AtomicLong (or LongAdder) counters held in a ConcurrentHashMap for low-contention, mostly lock-free updates
Striped locks or per-bucket locking to reduce contention
Compare-and-swap (CAS) loops for atomic updates without global locks
Read-write lock separation: frequent writes (increments) vs. infrequent reads (queries)
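A minimal low-contention counter map combining two of these ideas might look as follows (class name is illustrative). `LongAdder` internally stripes its value across cells, so concurrent increments rarely contend on the same cache line, and `computeIfAbsent` only synchronizes the first time a key is seen.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch: striped counters keyed by metric name. Writes (increments) are
// cheap and contention-free in the common case; reads sum the stripes.
class ConcurrentCounters {
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    void increment(String key) {
        // Only the first increment for a key takes the map's internal lock.
        counters.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    long get(String key) {
        LongAdder adder = counters.get(key);
        return adder == null ? 0 : adder.sum();
    }
}
```

`LongAdder.sum()` is not a point-in-time snapshot under concurrent writes, which is usually acceptable for metrics -- a good tradeoff to call out explicitly in the interview.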
How do you handle metrics with many unique tag combinations (e.g., per-user metrics) without unbounded memory growth?
Set a maximum cardinality limit per metric name and drop or aggregate overflow tags
Use approximate data structures (e.g., HyperLogLog, Count-Min Sketch) for high-cardinality dimensions
Distinguish between low-cardinality tags (status code, region) and high-cardinality tags (user ID) at registration time
Pre-aggregate locally before flushing to reduce the volume sent to backends
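The "limit and aggregate overflow" idea can be sketched like this (names, including the `__overflow__` series, are assumptions). Once a metric exceeds its cap on distinct tag combinations, new combinations collapse into a single overflow series rather than being dropped, so totals stay correct while memory stays bounded.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch: per-metric cardinality guard. The size check is racy under heavy
// concurrency (the map may briefly exceed the cap by a few entries), which
// is an acceptable tradeoff for avoiding a global lock on the hot path.
class BoundedTagCounters {
    static final String OVERFLOW = "__overflow__";
    private final int maxTagSets;
    private final ConcurrentHashMap<String, LongAdder> series = new ConcurrentHashMap<>();

    BoundedTagCounters(int maxTagSets) { this.maxTagSets = maxTagSets; }

    void increment(String tagSet) {
        LongAdder existing = series.get(tagSet);
        if (existing != null) { existing.increment(); return; }
        if (series.size() >= maxTagSets) {
            // Over the limit: aggregate into the overflow series instead of dropping.
            series.computeIfAbsent(OVERFLOW, k -> new LongAdder()).increment();
        } else {
            series.computeIfAbsent(tagSet, k -> new LongAdder()).increment();
        }
    }

    long get(String tagSet) {
        LongAdder a = series.get(tagSet);
        return a == null ? 0 : a.sum();
    }
}
```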
How and when does the library export metrics to a central system like Prometheus, StatsD, or a custom aggregation service?
Periodic flush on a background thread (e.g., every 10 seconds)
Double-buffering: swap the active buffer with a flush buffer to avoid blocking writers
Retry logic with exponential backoff on flush failures
Ensure partial flush failures do not result in data loss -- keep failed batches for the next cycle
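The buffer-swap and failure-retention points can be combined in a sketch like the one below (all names, and the boolean-returning `Exporter`, are assumptions; backoff scheduling is omitted). Writers always increment the active map; the flusher atomically swaps in a fresh map so writers never block, and any batch the exporter rejects is kept for the next cycle.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.LongAdder;

// Sketch: double-buffered flush with retention of failed batches.
class FlushingCollector {
    interface Exporter { boolean export(Map<String, Long> batch); } // true on success

    private final AtomicReference<ConcurrentHashMap<String, LongAdder>> active =
            new AtomicReference<>(new ConcurrentHashMap<>());
    private final List<Map<String, Long>> pending = new ArrayList<>(); // failed batches
    private final Exporter exporter;

    FlushingCollector(Exporter exporter) { this.exporter = exporter; }

    void increment(String key) {
        active.get().computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Called from a background thread, e.g., every 10 seconds.
    synchronized void flush() {
        // Swap buffers: writers move to the fresh map without blocking.
        ConcurrentHashMap<String, LongAdder> drained =
                active.getAndSet(new ConcurrentHashMap<>());
        Map<String, Long> batch = new HashMap<>();
        drained.forEach((k, v) -> batch.put(k, v.sum()));
        pending.add(batch);
        // Retry everything still pending; whatever fails stays for next cycle.
        pending.removeIf(b -> b.isEmpty() || exporter.export(b));
    }

    synchronized int pendingBatches() { return pending.size(); }
}
```

A production version would cap `pending` (old batches eventually dropped or merged) so a long backend outage cannot itself cause unbounded memory growth.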
Interviewers may ask how you handle time-based windows when system clocks drift or NTP corrections occur.
Use a monotonic clock for bucket placement rather than wall-clock time
Handle backward clock jumps gracefully without corrupting window boundaries
Consider what happens when the service is paused (e.g., GC pause) and resumes with stale bucket pointers
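The monotonic-clock point can be made concrete with a small sketch (class name is illustrative). `System.nanoTime()` is monotonic and unaffected by NTP corrections, so deriving bucket indices from it keeps window boundaries stable even when wall-clock time jumps; after a long GC pause, the bucket sequence simply skips forward rather than moving backward.

```java
// Sketch: derive bucket placement from a monotonic clock instead of
// wall-clock time, so NTP corrections cannot move buckets backward.
class MonotonicBucketClock {
    private final long originNanos;
    private final long bucketNanos;

    MonotonicBucketClock(long bucketMillis) {
        this.originNanos = System.nanoTime();
        this.bucketNanos = bucketMillis * 1_000_000L;
    }

    // Nondecreasing bucket sequence number, safe across clock adjustments.
    long currentBucket() {
        return (System.nanoTime() - originNanos) / bucketNanos;
    }
}
```

One caveat worth raising: query windows expressed in wall-clock terms ("last minute") still need a one-time mapping between the monotonic origin and wall time, refreshed if drift matters for reporting.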