Design a real-time event classification system
Problem Statement
Design a system that monitors financial transactions in real time and detects potentially fraudulent activity by evaluating each transaction against a set of dynamic detection rules. The system must process tens of thousands of transactions per second from payment processors, banking APIs, and e-commerce platforms, classify each transaction as legitimate or suspicious within milliseconds, and trigger appropriate actions such as blocking transactions, flagging for review, or sending alerts to fraud analysts.
The core challenge is building a low-latency streaming pipeline that can evaluate complex rules (single transaction thresholds, velocity checks across time windows, behavioral anomalies, and multi-step patterns) while maintaining high availability and accuracy. The system must handle bursty traffic during peak shopping periods, support frequent rule updates without downtime, and provide audit trails for regulatory compliance. You'll need to balance processing speed with detection accuracy while managing infrastructure costs at scale.
Key Requirements
Functional
- Transaction ingestion -- Accept transactions from multiple payment processors and platforms with schema validation and acknowledgment within 100ms
- Rule evaluation -- Apply configurable fraud detection rules including amount thresholds, geographic anomalies, velocity checks, and sequence patterns to each transaction
- Real-time decisioning -- Return allow/block/review decisions fast enough to prevent fraudulent transactions while minimizing false positives that harm user experience
- Rule management -- Enable fraud analysts to create, test, version, and deploy new detection rules with rollback capability and gradual rollout
- Alert and action routing -- Notify fraud teams, block suspicious transactions, and trigger step-up authentication based on risk scores
- Historical analysis -- Support replaying past transactions against new rules to validate effectiveness before production deployment
Non-Functional
- Scalability -- Process 50,000 transactions per second during peak load with horizontal scaling capability to handle 5x traffic spikes
- Reliability -- Ensure 99.99% uptime with no transaction loss, automatic failover, and graceful degradation when downstream services are unavailable
- Latency -- Complete rule evaluation and return decisions within 200ms at p99 to avoid timeout failures in payment flows
- Consistency -- Maintain exactly-once processing guarantees to prevent duplicate charges or missed fraud, with eventual consistency for analytics data
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Stateful Stream Processing Architecture
Fraud detection requires maintaining state across time windows (transactions per user in the last hour, geographic location changes, spending patterns). Interviewers want to see how you design stateful processing that handles out-of-order events, manages state size, and recovers quickly from failures.
Hints to consider:
- Use event time with watermarks rather than processing time to handle delayed transactions correctly
- Partition state by user ID or account ID to enable parallel processing and limit state scope per worker
- Consider how to expire old state entries and compact stateful stores to prevent unbounded growth
- Design checkpointing strategies that balance recovery speed with overhead during normal operation
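The event-time and state-expiry hints above can be sketched as a minimal per-user velocity counter. This is an illustrative toy, not a real framework API: it keeps sorted event timestamps per user, advances a watermark to reject very late events, and compacts state outside the window; the window size and lateness bound are assumed values.

```python
import bisect
from collections import defaultdict

WINDOW_SECONDS = 3600     # velocity window: transactions in the last hour (assumed)
ALLOWED_LATENESS = 300    # watermark lag tolerated for out-of-order events (assumed)

class VelocityState:
    """Tracks per-user transaction timestamps in event time, with expiry."""

    def __init__(self):
        self.events = defaultdict(list)  # user_id -> sorted list of event timestamps
        self.watermark = 0

    def observe(self, user_id, event_ts):
        # Advance the watermark from event time, not arrival time.
        self.watermark = max(self.watermark, event_ts - ALLOWED_LATENESS)
        if event_ts < self.watermark:
            return None  # too late: a real pipeline would route this to a side output
        bisect.insort(self.events[user_id], event_ts)
        self._expire(user_id, event_ts)
        return len(self.events[user_id])  # count inside the current window

    def _expire(self, user_id, now_ts):
        # Compact state: drop timestamps outside the window to bound growth.
        cutoff = now_ts - WINDOW_SECONDS
        ts = self.events[user_id]
        idx = bisect.bisect_right(ts, cutoff)
        if idx:
            self.events[user_id] = ts[idx:]
```

Partitioning this state by user ID (as the hints suggest) lets each worker hold only its shard, and the expiry pass keeps per-key state bounded.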
2. Rule Engine Design and Dynamic Configuration
The rule engine must evaluate hundreds of rules per transaction with different complexity levels (from simple thresholds to multi-event sequences) while supporting real-time rule updates. Keeping the rule-management control plane separate from the high-throughput evaluation data plane is critical here.
Hints to consider:
- Store rules in a versioned configuration service with caching layers to avoid rule-fetch becoming a bottleneck
- Design a rule representation that supports both simple filters and complex temporal patterns without forcing all rules through expensive evaluation paths
- Implement rule compilation or optimization phases so frequently-evaluated rules execute faster
- Use rule shadowing or A/B testing to validate new rules on production traffic before full activation
- Consider how to handle rule conflicts when multiple rules match the same transaction
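One way to make these ideas concrete is a versioned rule set with a shadow mode, sketched below. The rule fields, score thresholds, and decision bands are all assumptions for illustration, not a real product's schema; conflict handling here is simply additive scoring.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rule:
    rule_id: str
    version: int
    score: int                         # risk score contributed on match
    predicate: Callable[[dict], bool]  # compiled check against a transaction dict
    shadow: bool = False               # shadow rules log matches but don't affect decisions

@dataclass
class RuleSet:
    version: int
    rules: list = field(default_factory=list)

    def evaluate(self, txn: dict):
        score, shadow_hits = 0, []
        for rule in self.rules:
            if rule.predicate(txn):
                if rule.shadow:
                    shadow_hits.append(rule.rule_id)  # candidate rule: observe only
                else:
                    score += rule.score  # additive resolution when multiple rules match
        # Assumed decision bands; in practice these would be configurable.
        decision = "block" if score >= 100 else "review" if score >= 50 else "allow"
        return decision, score, shadow_hits
```

A new rule ships first with `shadow=True` so analysts can compare its matches against production traffic, then flips to active once validated; bumping `RuleSet.version` on each deploy gives the rollback point the hints call for.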
3. Handling Hotspots and Skewed Traffic
Some users generate far more transactions than others (large merchants, frequent shoppers), and some rules are more expensive to evaluate. Interviewers expect you to address partition skew and prevent hot partitions from becoming bottlenecks.
Hints to consider:
- Use composite partition keys (user + random suffix) to split high-volume accounts across multiple partitions
- Implement per-tenant or per-user rate limiting with token bucket algorithms to prevent abuse
- Design separate fast and slow evaluation paths where simple rules execute first to short-circuit expensive checks
- Consider pre-aggregating metrics for hot users in a cache layer to speed up velocity checks
- Use priority queues or separate processing lanes for different transaction value tiers
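The composite-key idea can be sketched as a salted partitioner: known hot accounts fan out across several partitions while ordinary users keep a single stable partition. The partition count, fan-out, and function names are illustrative assumptions.

```python
import hashlib

NUM_PARTITIONS = 32       # assumed partition count
HOT_ACCOUNT_FANOUT = 4    # partitions a hot account may spread across (assumed)

def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def partition_for(user_id: str, txn_id: str, is_hot: bool) -> int:
    if is_hot:
        # Derive the salt from the transaction ID so retries of the same
        # transaction deterministically land on the same partition.
        suffix = _hash(txn_id) % HOT_ACCOUNT_FANOUT
        key = f"{user_id}#{suffix}"
    else:
        key = user_id  # normal users keep all their state on one partition
    return _hash(key) % NUM_PARTITIONS
```

The trade-off: a hot account's velocity state is now split across up to `HOT_ACCOUNT_FANOUT` partitions, so per-user aggregates for those accounts need a merge step or a pre-aggregated cache.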
4. Ensuring Correctness Under Failures
Financial systems cannot lose transactions or produce duplicate fraud decisions. Interviewers probe how you guarantee exactly-once semantics, idempotency, and correct recovery from partial failures across distributed components.
Hints to consider:
- Design idempotent decision APIs using transaction IDs so retries don't cause duplicate blocks or alerts
- Use distributed transactions or saga patterns when fraud decisions must coordinate with payment processing and notification systems
- Implement deduplication windows with bloom filters or distributed caches to catch duplicate transaction submissions
- Store decision audit logs in durable storage before acknowledging transaction processing
- Plan for partial failures where rule evaluation completes but downstream actions (alerts, blocks) fail
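The idempotency hint can be reduced to a small pattern: key the decision by transaction ID and record it before acknowledging, so a retry returns the stored decision instead of re-running evaluation (and re-triggering alerts). A minimal sketch, with an in-memory dict standing in for the durable key-value store a real system would use:

```python
class DecisionStore:
    """Idempotent decision API keyed by transaction ID."""

    def __init__(self):
        self._decisions = {}  # txn_id -> decision; a durable KV store in production

    def decide(self, txn_id: str, evaluate) -> str:
        # Retry path: same transaction ID, same answer, no side effects re-fired.
        if txn_id in self._decisions:
            return self._decisions[txn_id]
        decision = evaluate()
        # Persist the audit record BEFORE acknowledging, so a crash after this
        # point replays to the recorded decision rather than a duplicate one.
        self._decisions[txn_id] = decision
        return decision
```

Downstream actions (blocks, alerts) hang off the stored decision, which also gives partial failures a clean retry point: redeliver the recorded decision to the failed action rather than re-evaluating the transaction.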