Practice/Stripe/Metric Counter Library
Metric Counter Library
CodingMust
Problem Statement
Design a library that enables services to collect and aggregate metrics. This is a foundational component used by many production systems to track events, measure performance, and power dashboards and alerting.
Your library should provide a simple interface for services to record and query metric values. Services need to increment counters for events like API calls, errors, and user actions; query metrics over specific time windows such as last minute, last hour, or last day; and support tagging metrics with dimensions like endpoint, region, or status code. This problem tests your ability to design clean APIs, manage memory-bounded data structures, and reason about concurrency in a high-throughput library context.
Key Requirements
Functional
- Increment counters -- record occurrences of events (e.g., API calls, errors, user actions) with an optional value parameter
- Time-windowed aggregation -- query metrics over specific time windows (last minute, last hour, last day)
- Multi-dimensional metrics -- support tagging metrics with key-value dimensions (e.g., endpoint, region, status code)
- Query by tags -- retrieve aggregated counts filtered by specific tag combinations
Non-Functional
- Low overhead -- metric recording must not impact the host service's latency; sub-microsecond
increment calls
- Memory bounded -- memory usage must be bounded even with multiple time windows and high-cardinality tags
- Thread safety -- correctness under concurrent access from multiple threads recording metrics simultaneously
- Flush reliability -- metrics must be exported to a central aggregation system without data loss during flush failures
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Time Window Implementation (Most Emphasized)
Interviewers want to see how you implement sliding vs. tumbling windows and the tradeoffs between them. This is the core algorithmic challenge of the problem.
Hints to consider:
- Sliding windows give smoother results but require more memory and bookkeeping
- Tumbling (fixed) windows are simpler but can miss events at boundary edges
- A ring buffer of fixed-size buckets (e.g., 60 one-second buckets for a one-minute window) is a common practical approach
- Consider how you expire old buckets -- lazy cleanup on read vs. background cleanup thread
2. Concurrency and Thread Safety
Since multiple threads will call increment concurrently, interviewers probe your approach to lock-free or low-contention data structures.
Hints to consider:
AtomicLong or ConcurrentHashMap for lock-free counters
- Striped locks or per-bucket locking to reduce contention
- Compare-and-swap (CAS) loops for atomic updates without global locks
- Read-write lock separation: frequent writes (increments) vs. infrequent reads (queries)
3. High Cardinality Tag Handling
How do you handle metrics with many unique tag combinations (e.g., per-user metrics) without unbounded memory growth?
Hints to consider:
- Set a maximum cardinality limit per metric name and drop or aggregate overflow tags
- Use approximate data structures (e.g., HyperLogLog, Count-Min Sketch) for high-cardinality dimensions
- Distinguish between low-cardinality tags (status code, region) and high-cardinality tags (user ID) at registration time
- Pre-aggregate locally before flushing to reduce the volume sent to backends
4. Batching, Flushing, and Backend Integration
How and when does the library export metrics to a central system like Prometheus, StatsD, or a custom aggregation service?
Hints to consider:
- Periodic flush on a background thread (e.g., every 10 seconds)
- Double-buffering: swap the active buffer with a flush buffer to avoid blocking writers
- Retry logic with exponential backoff on flush failures
- Ensure partial flush failures do not result in data loss -- keep failed batches for the next cycle
5. Clock Skew and Edge Cases
Interviewers may ask how you handle time-based windows when system clocks drift or NTP corrections occur.
Hints to consider:
- Use a monotonic clock for bucket placement rather than wall-clock time
- Handle backward clock jumps gracefully without corrupting window boundaries
- Consider what happens when the service is paused (e.g., GC pause) and resumes with stale bucket pointers
Suggested Approach
Step 1: Clarify Requirements
Confirm the API contract, supported time windows, expected throughput (events per second), and whether approximate counting is acceptable. Ask whether the library must support gauge and histogram metric types or only counters.