Practice/Amazon/Design a Real-Time EC2 Spot Instance Bidding System

Design a Real-Time EC2 Spot Instance Bidding System

System DesignOptional

Problem Statement

Design a marketplace system that allows cloud customers to bid on unused compute capacity at discounted prices. Users submit bids specifying their maximum price, desired instance type, availability zone, and quantity. The platform continuously matches bids against available capacity at market-clearing prices, and when demand exceeds supply or spot prices rise above a user's bid, the system reclaims instances with a guaranteed two-minute advance notice.

The challenge lies in operating a high-throughput auction engine that balances marketplace efficiency with operational reliability. You must prevent double-allocation of scarce resources, deliver interruption warnings within tight SLA windows, and implement reclamation policies that minimize disruption to running workloads. The system handles thousands of concurrent bids, processes capacity changes in real time, and maintains strict ordering guarantees for fairness across regional markets with independent pricing.

Key Requirements

Functional

Bid management -- users submit bids specifying instance type, region, availability zone, maximum price, quantity, and duration; they can modify or cancel pending bids before allocation
Real-time allocation -- system matches bids to available capacity using market-clearing algorithms, provisions instances when bids clear, and updates users on allocation status
Interruption handling -- platform delivers guaranteed two-minute termination warnings through multiple channels (webhooks, SQS, polling API), tracks acknowledgment, and executes graceful shutdown workflows
Market transparency -- users view current spot prices per instance type and AZ, see historical pricing trends, monitor active instances, and receive capacity availability signals

Non-Functional

Scalability -- support 100K active bids and 50K running instances per region, process 10K bid submissions per second during peak hours across 20+ availability zones
Reliability -- guarantee 99.95% delivery of two-minute interruption notices, prevent double-allocation through strong consistency, recover from partial failures without losing bid state
Latency -- match and allocate instances within 500ms of bid submission, propagate spot price updates within 1 second, deliver interruption notices within 5 seconds of reclaim decision
Consistency -- ensure at-most-once allocation semantics per capacity unit, maintain strict ordering of bids within priority tiers, provide read-after-write consistency for bid status queries

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Market Segmentation and Partitioning Strategy

How you isolate different markets to avoid cross-market contention and enable independent pricing is critical for horizontal scalability.

Hints to consider:

Partition by (instance-type, availability-zone) pairs to create independent order books that scale horizontally
Use consistent hashing to distribute markets across matcher services, ensuring each market has a single writer to prevent allocation races
Consider separate price-calculation engines per market that aggregate supply/demand signals without coordinating across boundaries
Design for hot market handling when popular instance types see 10x normal bid volume during demand spikes

2. Atomic Allocation Under High Concurrency

Preventing double-allocation when multiple users bid for the same scarce capacity unit.

Hints to consider:

Implement optimistic concurrency control using versioned capacity records with compare-and-swap operations on each allocation attempt
Use idempotency tokens on bid submissions so retries do not create duplicate allocations, storing token-to-allocation mappings with expiry
Consider a two-phase protocol: reserve capacity tentatively, provision the instance, then commit the reservation or release on failure
Maintain allocation audit logs with timestamps and correlation IDs to debug inconsistencies during post-mortems

3. Guaranteed Interruption Notification Pipeline

Delivering two-minute warnings reliably when instances must be reclaimed.

Hints to consider:

Use a durable event queue (Kafka) to publish interruption events, fan out to multiple notification channels (webhook, SQS, metadata service)
Implement acknowledgment tracking where users must confirm receipt; retry with exponential backoff for unacknowledged notices until timeout
Schedule termination jobs in a distributed delay queue with exactly-once semantics, so instances are not killed prematurely even if schedulers fail
Provide metrics dashboards showing notification latency distribution and acknowledgment rates to meet SLA visibility requirements

4. Price Discovery and Market Clearing Algorithm

Calculating spot prices that balance supply and demand while maintaining fairness.

Hints to consider:

Run a periodic (every 5-10 seconds) auction clearing process that sorts bids by price, matches against available capacity, and sets the market-clearing price
Consider first-price vs uniform-price auction semantics: uniform pricing reduces strategic gaming but may waste user budget
Handle partial fills when a user requests 10 instances but only 3 are available, requiring atomic multi-unit allocation logic
Design surge pricing dampers to prevent price spikes from cascading failures or flash crowds

5. Reclamation Strategy Minimizing User Impact

Choosing which instances to interrupt when capacity must be reclaimed requires a principled selection algorithm.

Hints to consider:

Prioritize reclaiming instances from users with lowest bids first, implementing a min-heap of active allocations sorted by bid price
Consider instance age as a tie-breaker: prefer interrupting recently launched instances to minimize wasted compute
Allow users to specify interruption-preference signals (checkpoint-friendly workloads vs long-running batch) that influence reclaim decisions
Batch reclamations to avoid thrashing: if you need 100 instances back, issue all 100 interruption notices simultaneously

Practice/Amazon/Design a Real-Time EC2 Spot Instance Bidding System

Design a Real-Time EC2 Spot Instance Bidding System

System DesignOptional

Problem Statement

Key Requirements

Functional

Bid management -- users submit bids specifying instance type, region, availability zone, maximum price, quantity, and duration; they can modify or cancel pending bids before allocation
Real-time allocation -- system matches bids to available capacity using market-clearing algorithms, provisions instances when bids clear, and updates users on allocation status
Interruption handling -- platform delivers guaranteed two-minute termination warnings through multiple channels (webhooks, SQS, polling API), tracks acknowledgment, and executes graceful shutdown workflows
Market transparency -- users view current spot prices per instance type and AZ, see historical pricing trends, monitor active instances, and receive capacity availability signals

Non-Functional

Scalability -- support 100K active bids and 50K running instances per region, process 10K bid submissions per second during peak hours across 20+ availability zones
Reliability -- guarantee 99.95% delivery of two-minute interruption notices, prevent double-allocation through strong consistency, recover from partial failures without losing bid state
Latency -- match and allocate instances within 500ms of bid submission, propagate spot price updates within 1 second, deliver interruption notices within 5 seconds of reclaim decision
Consistency -- ensure at-most-once allocation semantics per capacity unit, maintain strict ordering of bids within priority tiers, provide read-after-write consistency for bid status queries

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Market Segmentation and Partitioning Strategy

How you isolate different markets to avoid cross-market contention and enable independent pricing is critical for horizontal scalability.

Hints to consider:

Partition by (instance-type, availability-zone) pairs to create independent order books that scale horizontally
Use consistent hashing to distribute markets across matcher services, ensuring each market has a single writer to prevent allocation races
Consider separate price-calculation engines per market that aggregate supply/demand signals without coordinating across boundaries
Design for hot market handling when popular instance types see 10x normal bid volume during demand spikes

2. Atomic Allocation Under High Concurrency

Preventing double-allocation when multiple users bid for the same scarce capacity unit.

Hints to consider:

Implement optimistic concurrency control using versioned capacity records with compare-and-swap operations on each allocation attempt
Use idempotency tokens on bid submissions so retries do not create duplicate allocations, storing token-to-allocation mappings with expiry
Consider a two-phase protocol: reserve capacity tentatively, provision the instance, then commit the reservation or release on failure
Maintain allocation audit logs with timestamps and correlation IDs to debug inconsistencies during post-mortems

3. Guaranteed Interruption Notification Pipeline

Delivering two-minute warnings reliably when instances must be reclaimed.

Hints to consider:

Use a durable event queue (Kafka) to publish interruption events, fan out to multiple notification channels (webhook, SQS, metadata service)
Implement acknowledgment tracking where users must confirm receipt; retry with exponential backoff for unacknowledged notices until timeout
Schedule termination jobs in a distributed delay queue with exactly-once semantics, so instances are not killed prematurely even if schedulers fail
Provide metrics dashboards showing notification latency distribution and acknowledgment rates to meet SLA visibility requirements

4. Price Discovery and Market Clearing Algorithm

Calculating spot prices that balance supply and demand while maintaining fairness.

Hints to consider:

Run a periodic (every 5-10 seconds) auction clearing process that sorts bids by price, matches against available capacity, and sets the market-clearing price
Consider first-price vs uniform-price auction semantics: uniform pricing reduces strategic gaming but may waste user budget
Handle partial fills when a user requests 10 instances but only 3 are available, requiring atomic multi-unit allocation logic
Design surge pricing dampers to prevent price spikes from cascading failures or flash crowds

5. Reclamation Strategy Minimizing User Impact

Choosing which instances to interrupt when capacity must be reclaimed requires a principled selection algorithm.

Hints to consider:

Prioritize reclaiming instances from users with lowest bids first, implementing a min-heap of active allocations sorted by bid price
Consider instance age as a tie-breaker: prefer interrupting recently launched instances to minimize wasted compute
Allow users to specify interruption-preference signals (checkpoint-friendly workloads vs long-running batch) that influence reclaim decisions
Batch reclamations to avoid thrashing: if you need 100 instances back, issue all 100 interruption notices simultaneously