System Design - Payment Gateway System — Databricks

Problem Statement

Build a distributed system that enables advertisers to launch campaigns with defined spending limits across multiple geographic markets, tracks user interactions with advertisements in real time, and bills based on validated engagement events. Your platform must serve millions of ad requests per second globally while preventing budget overruns and ensuring billing accuracy.

The core challenge lies in managing high-contention budget counters that are updated concurrently from thousands of edge servers, propagating budget exhaustion signals to distributed ad-serving nodes within seconds, and guaranteeing that duplicate events don't result in double-billing. The system must remain available during partial failures while maintaining strict financial correctness—advertisers should never be charged twice for the same click, nor should campaigns overspend their allocated budgets by more than a small tolerance threshold.

Key Requirements

Functional

Campaign Management -- Advertisers create campaigns with total and daily spending caps, geographic targeting rules, and scheduling parameters that determine when ads are eligible for display
Real-Time Ad Serving -- The system evaluates incoming ad requests against active campaigns, selects eligible ads based on budget availability and targeting criteria, and records impressions when ads are displayed
Event-Based Billing -- Track user clicks on served ads, screen events for fraud, apply charges against campaign budgets, and ensure each billable event is processed exactly once
Budget Enforcement -- Monitor spending in near real-time across all regions, automatically pause campaigns when budgets are exhausted, and support manual pause/resume operations with immediate propagation

Non-Functional

Scalability -- Handle 10 million ad requests per second globally, with 100K active campaigns and a 2% click-through rate generating 200K billing events per second
Reliability -- Maintain 99.95% uptime for ad serving with graceful degradation, ensure zero data loss for billing events, and support multi-region failover for critical-path components
Latency -- Ad selection decisions must complete within 50ms end-to-end, budget checks should add less than 5ms overhead, and budget exhaustion signals must reach all serving nodes within 10 seconds
Consistency -- Guarantee exactly-once billing semantics for clicks, tolerate temporary budget overspend of up to 5% during propagation delays, and provide eventually consistent reporting views
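
As a quick sanity check on the scale figures above, the arithmetic can be written out. The per-campaign average and the count of clicks arriving during the 10-second propagation window are derived here for illustration only, and the average assumes load is spread evenly across campaigns, which real traffic will not be.

```python
# Back-of-envelope check of the stated scale targets.
AD_REQUESTS_PER_SEC = 10_000_000  # from the scalability requirement
CTR_PERCENT = 2                   # 2% click-through rate
ACTIVE_CAMPAIGNS = 100_000
PROPAGATION_WINDOW_SEC = 10       # budget exhaustion signal deadline

# Billable click events per second, using integer arithmetic.
clicks_per_sec = AD_REQUESTS_PER_SEC * CTR_PERCENT // 100

# Assumption: uniform load across campaigns (illustrative only).
avg_clicks_per_campaign = clicks_per_sec / ACTIVE_CAMPAIGNS

# Clicks that can arrive globally while an exhaustion signal is in flight.
clicks_in_window = clicks_per_sec * PROPAGATION_WINDOW_SEC

print(clicks_per_sec)           # 200000
print(avg_clicks_per_campaign)  # 2.0
print(clicks_in_window)         # 2000000
```

An average of two clicks per second per campaign hides the skew that makes the problem hard: a few hot campaigns will see far higher rates, and those are the ones whose counters become contended.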

Based on real interview experiences, these are the areas interviewers probe most deeply:

Multiple ad servers simultaneously checking and decrementing the same campaign budget creates a classic distributed counter problem with severe contention. Interviewers want to see how you handle atomic updates without creating bottlenecks.

Partition budgets into sharded counters that can be updated independently, then aggregate for reporting
Use optimistic locking or compare-and-swap operations to handle races without distributed locks
Consider implementing local budget allocations where each serving node gets a quota to spend autonomously
Discuss tradeoffs between strict consistency (slower) and allowing bounded overspend (faster)
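
The local-allocation idea above can be sketched as follows. This is a minimal single-process illustration, not a production design: `BudgetLedger`, `ServingNode`, and the lease size are hypothetical names and values, and the lock stands in for whatever atomic operation (such as a compare-and-swap) a shared store would provide.

```python
import threading


class BudgetLedger:
    """Central record of one campaign's unallocated budget, in cents.

    The lock stands in for an atomic update on a shared store. It is
    taken once per lease, not once per ad, which is what cuts contention.
    """

    def __init__(self, total_cents: int) -> None:
        self._remaining = total_cents
        self._lock = threading.Lock()

    def lease(self, requested_cents: int) -> int:
        """Grant up to requested_cents; returns 0 once the budget is gone."""
        with self._lock:
            granted = min(requested_cents, self._remaining)
            self._remaining -= granted
            return granted


class ServingNode:
    """Spends from a local quota and contacts the ledger only to refill it."""

    def __init__(self, ledger: BudgetLedger, lease_cents: int) -> None:
        self._ledger = ledger
        self._lease_cents = lease_cents
        self._local = 0
        self.spent = 0

    def try_charge(self, cost_cents: int) -> bool:
        if self._local < cost_cents:
            self._local += self._ledger.lease(self._lease_cents)
        if self._local < cost_cents:
            return False  # out of budget from this node's point of view
        self._local -= cost_cents
        self.spent += cost_cents
        return True


def serve_until_exhausted(node: ServingNode) -> None:
    while node.try_charge(cost_cents=25):
        pass


ledger = BudgetLedger(total_cents=10_000)
nodes = [ServingNode(ledger, lease_cents=500) for _ in range(4)]
threads = [threading.Thread(target=serve_until_exhausted, args=(n,)) for n in nodes]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_spent = sum(n.spent for n in nodes)
print(total_spent)  # 10000
```

Because every lease is carved out of the remaining budget, total spend cannot exceed the cap. The cost is that quota held by an idle or crashed node is stranded until it is reclaimed, so a real design would need lease expiry, and smaller leases trade more ledger traffic for less stranded budget.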

Ad impressions and clicks flow through unreliable networks with at-least-once delivery guarantees, creating duplicate events. Billing the same click twice violates financial correctness.
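
One common way to make billing idempotent under at-least-once delivery is to key each charge on a unique event ID and let a uniqueness constraint reject replays. The sketch below uses an in-memory SQLite database purely for illustration; the table names, the `bill_click` helper, and the assumption that every click carries a stable `event_id` are not from the original problem statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges ("
    "event_id TEXT PRIMARY KEY, campaign_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE budgets (campaign_id TEXT PRIMARY KEY, spent_cents INTEGER NOT NULL)"
)
conn.execute("INSERT INTO budgets VALUES ('c1', 0)")
conn.commit()


def bill_click(event_id: str, campaign_id: str, amount_cents: int) -> bool:
    """Charge at most once per event_id.

    Returns True if the charge was applied, False if the event was a duplicate.
    """
    try:
        with conn:  # one transaction: both writes commit, or neither does
            conn.execute(
                "INSERT INTO charges VALUES (?, ?, ?)",
                (event_id, campaign_id, amount_cents),
            )
            conn.execute(
                "UPDATE budgets SET spent_cents = spent_cents + ? WHERE campaign_id = ?",
                (amount_cents, campaign_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key on event_id rejected a replayed event.
        return False


first = bill_click("evt-1", "c1", 50)
replay = bill_click("evt-1", "c1", 50)  # same event delivered again
spent = conn.execute(
    "SELECT spent_cents FROM budgets WHERE campaign_id = 'c1'"
).fetchone()[0]
print(first, replay, spent)  # True False 50
```

Recording the charge and updating the budget in the same transaction is what keeps a retry after a mid-operation crash from either double-charging or losing the charge.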

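When a campaign exhausts its budget in one region, edge servers worldwide must stop serving that campaign within seconds to prevent massive overspend. Polling alone struggles here: each node learns of a change up to one full polling interval late, and shortening the interval multiplies read load across thousands of edge servers.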

Implement a push-based notification system using pub-sub patterns to broadcast state changes
Cache campaign eligibility decisions at edge nodes with TTL-based invalidation plus active push
Use eventual consistency with compensating actions rather than waiting for global coordination
Decide whether each circuit breaker should fail open (allow serving) or fail closed (block serving) based on the criticality of what it protects
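
The combination of TTL-based caching, active push, and an explicit fail-open or fail-closed choice can be sketched as a small edge-side cache. Everything named here (`EdgeEligibilityCache`, `on_push`, the `fetch` callback, the injected clock) is hypothetical; a real deployment would wire `on_push` to a pub-sub subscription and `fetch` to a budget service.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeEligibilityCache:
    """Per-node cache of campaign eligibility with TTL expiry and push updates."""

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_sec: float,
        fail_open: bool = False,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_sec
        self._fail_open = fail_open
        self._clock = clock
        self._entries: Dict[str, Tuple[bool, float]] = {}  # id -> (eligible, expires_at)

    def is_eligible(self, campaign_id: str) -> bool:
        entry = self._entries.get(campaign_id)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]  # fresh cached decision, no remote call
        try:
            eligible = self._fetch(campaign_id)
        except ConnectionError:
            # Source unreachable: serve (fail open) or block (fail closed).
            return self._fail_open
        self._store(campaign_id, eligible)
        return eligible

    def on_push(self, campaign_id: str, eligible: bool) -> None:
        """Pub-sub handler: apply a pushed state change immediately."""
        self._store(campaign_id, eligible)

    def _store(self, campaign_id: str, eligible: bool) -> None:
        self._entries[campaign_id] = (eligible, self._clock() + self._ttl)


# Demo with a fake clock and an in-memory "source of truth".
now = [0.0]
source = {"c1": True, "c2": True}
cache = EdgeEligibilityCache(lambda cid: source[cid], ttl_sec=5.0, clock=lambda: now[0])

first = cache.is_eligible("c1")       # cache miss, fetched from source
cache.is_eligible("c2")               # warm the cache for c2 as well
source["c1"] = source["c2"] = False   # both budgets are exhausted upstream
stale = cache.is_eligible("c1")       # still cached as eligible
cache.on_push("c1", False)            # push arrives for c1 only
after_push = cache.is_eligible("c1")
now[0] = 6.0                          # c2's push was lost; its TTL expires
after_ttl = cache.is_eligible("c2")
print(first, stale, after_push, after_ttl)  # True True False False
```

The push path gives fast propagation in the common case, while the TTL bounds how long a node can keep serving if a push message is lost. The `fail_open` flag makes the availability-versus-overspend choice explicit for the case where the source cannot be reached.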

The path from impression to final charge involves multiple stages: serve ad, record impression, detect click, validate for fraud, reserve budget, settle charge, handle refunds. Each step can fail independently.

Model the workflow as a saga with compensating transactions for each stage
Store intermediate states durably (pending, reserved, charged, refunded) with timeout-based reconciliation
Use event sourcing to maintain an audit trail of all state transitions for debugging and compliance
Design retry policies that distinguish transient failures (retry) from permanent failures (compensate)
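
A compact way to see how these pieces fit together is a small state machine that retries transient failures and compensates on permanent ones. This sketch covers only the reserve and settle stages; the `REJECTED` and `RELEASED` states, the exception classes, and the callback names are assumptions added for illustration, and refunds and timeout-based reconciliation are left out.

```python
from enum import Enum
from typing import Callable, List


class State(Enum):
    PENDING = "pending"
    RESERVED = "reserved"
    CHARGED = "charged"
    REJECTED = "rejected"  # hypothetical: reservation failed, nothing to undo
    RELEASED = "released"  # hypothetical: reservation compensated


class TransientError(Exception):
    """Worth retrying, for example a timeout."""


class PermanentError(Exception):
    """Retrying will not help; compensate instead."""


def run_step(action: Callable[[], None], max_attempts: int = 3) -> None:
    """Retry transient failures; escalate to permanent once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return
        except TransientError:
            # A real system would back off before retrying.
            if attempt == max_attempts:
                raise PermanentError("retries exhausted")


class ClickBillingSaga:
    def __init__(self, reserve, settle, release) -> None:
        self._reserve, self._settle, self._release = reserve, settle, release
        self.state = State.PENDING
        self.log: List[State] = [State.PENDING]  # append-only audit trail

    def _move(self, new_state: State) -> None:
        self.state = new_state
        self.log.append(new_state)

    def run(self) -> State:
        try:
            run_step(self._reserve)
        except PermanentError:
            self._move(State.REJECTED)
            return self.state
        self._move(State.RESERVED)
        try:
            run_step(self._settle)
        except PermanentError:
            self._release()  # compensating action for the reservation
            self._move(State.RELEASED)
            return self.state
        self._move(State.CHARGED)
        return self.state


# Case 1: settle times out twice, then succeeds on the third attempt.
calls = {"settle": 0}

def flaky_settle() -> None:
    calls["settle"] += 1
    if calls["settle"] < 3:
        raise TransientError("timeout")

ok = ClickBillingSaga(reserve=lambda: None, settle=flaky_settle, release=lambda: None)
print(ok.run().value)  # charged

# Case 2: settle fails permanently, so the reservation is released.
released = []

def broken_settle() -> None:
    raise PermanentError("campaign deleted")

bad = ClickBillingSaga(lambda: None, broken_settle, lambda: released.append("c1"))
print(bad.run().value, released)  # released ['c1']
```

In a durable version, each `_move` would be persisted before the next step runs, so a reconciler could pick up sagas stuck in `pending` or `reserved` after a crash.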

Campaigns target specific regions but are managed globally, creating tension between centralizing budget logic and distributing it for low latency.