Practice/Amazon/Design a Real-Time EC2 Spot Instance Bidding System
Design a Real-Time EC2 Spot Instance Bidding System
System DesignOptional
Problem Statement
Design a marketplace system that allows cloud customers to bid on unused compute capacity at discounted prices. Users submit bids specifying their maximum price, desired instance type, availability zone, and quantity. The platform continuously matches bids against available capacity at market-clearing prices, and when demand exceeds supply or spot prices rise above a user's bid, the system reclaims instances with a guaranteed two-minute advance notice.
The challenge lies in operating a high-throughput auction engine that balances marketplace efficiency with operational reliability. You must prevent double-allocation of scarce resources, deliver interruption warnings within tight SLA windows, and implement reclamation policies that minimize disruption to running workloads. The system handles thousands of concurrent bids, processes capacity changes in real time, and maintains strict ordering guarantees for fairness across regional markets with independent pricing.
Key Requirements
Functional
- Bid management -- users submit bids specifying instance type, region, availability zone, maximum price, quantity, and duration; they can modify or cancel pending bids before allocation
- Real-time allocation -- system matches bids to available capacity using market-clearing algorithms, provisions instances when bids clear, and updates users on allocation status
- Interruption handling -- platform delivers guaranteed two-minute termination warnings through multiple channels (webhooks, SQS, polling API), tracks acknowledgment, and executes graceful shutdown workflows
- Market transparency -- users view current spot prices per instance type and AZ, see historical pricing trends, monitor active instances, and receive capacity availability signals
Non-Functional
- Scalability -- support 100K active bids and 50K running instances per region, process 10K bid submissions per second during peak hours across 20+ availability zones
- Reliability -- guarantee 99.95% delivery of two-minute interruption notices, prevent double-allocation through strong consistency, recover from partial failures without losing bid state
- Latency -- match and allocate instances within 500ms of bid submission, propagate spot price updates within 1 second, deliver interruption notices within 5 seconds of reclaim decision
- Consistency -- ensure at-most-once allocation semantics per capacity unit, maintain strict ordering of bids within priority tiers, provide read-after-write consistency for bid status queries
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Market Segmentation and Partitioning Strategy
How you isolate different markets to avoid cross-market contention and enable independent pricing is critical for horizontal scalability.
Hints to consider:
- Partition by (instance-type, availability-zone) pairs to create independent order books that scale horizontally
- Use consistent hashing to distribute markets across matcher services, ensuring each market has a single writer to prevent allocation races
- Consider separate price-calculation engines per market that aggregate supply/demand signals without coordinating across boundaries
- Design for hot market handling when popular instance types see 10x normal bid volume during demand spikes
2. Atomic Allocation Under High Concurrency
Preventing double-allocation when multiple users bid for the same scarce capacity unit.
Hints to consider:
- Implement optimistic concurrency control using versioned capacity records with compare-and-swap operations on each allocation attempt
- Use idempotency tokens on bid submissions so retries do not create duplicate allocations, storing token-to-allocation mappings with expiry
- Consider a two-phase protocol: reserve capacity tentatively, provision the instance, then commit the reservation or release on failure
- Maintain allocation audit logs with timestamps and correlation IDs to debug inconsistencies during post-mortems
3. Guaranteed Interruption Notification Pipeline
Delivering two-minute warnings reliably when instances must be reclaimed.
Hints to consider:
- Use a durable event queue (Kafka) to publish interruption events, fan out to multiple notification channels (webhook, SQS, metadata service)
- Implement acknowledgment tracking where users must confirm receipt; retry with exponential backoff for unacknowledged notices until timeout
- Schedule termination jobs in a distributed delay queue with exactly-once semantics, so instances are not killed prematurely even if schedulers fail
- Provide metrics dashboards showing notification latency distribution and acknowledgment rates to meet SLA visibility requirements
4. Price Discovery and Market Clearing Algorithm
Calculating spot prices that balance supply and demand while maintaining fairness.
Hints to consider:
- Run a periodic (every 5-10 seconds) auction clearing process that sorts bids by price, matches against available capacity, and sets the market-clearing price
- Consider first-price vs uniform-price auction semantics: uniform pricing reduces strategic gaming but may waste user budget
- Handle partial fills when a user requests 10 instances but only 3 are available, requiring atomic multi-unit allocation logic
- Design surge pricing dampers to prevent price spikes from cascading failures or flash crowds
5. Reclamation Strategy Minimizing User Impact
Choosing which instances to interrupt when capacity must be reclaimed requires a principled selection algorithm.
Hints to consider:
- Prioritize reclaiming instances from users with lowest bids first, implementing a min-heap of active allocations sorted by bid price
- Consider instance age as a tie-breaker: prefer interrupting recently launched instances to minimize wasted compute
- Allow users to specify interruption-preference signals (checkpoint-friendly workloads vs long-running batch) that influence reclaim decisions
- Batch reclamations to avoid thrashing: if you need 100 instances back, issue all 100 interruption notices simultaneously