Design a real-time event classification system
Problem Statement
Design a system that monitors financial transactions in real time and detects potentially fraudulent activity by evaluating each transaction against a set of dynamic detection rules. The system must process tens of thousands of transactions per second from payment processors, banking APIs, and e-commerce platforms, classify each transaction as legitimate or suspicious within milliseconds, and trigger appropriate actions such as blocking transactions, flagging for review, or sending alerts to fraud analysts.
The core challenge is building a low-latency streaming pipeline that can evaluate complex rules (single transaction thresholds, velocity checks across time windows, behavioral anomalies, and multi-step patterns) while maintaining high availability and accuracy. The system must handle bursty traffic during peak shopping periods, support frequent rule updates without downtime, and provide audit trails for regulatory compliance. You'll need to balance processing speed with detection accuracy while managing infrastructure costs at scale.
Key Requirements
Functional
- Transaction ingestion -- Accept transactions from multiple payment processors and platforms with schema validation and acknowledgment within 100ms
- Rule evaluation -- Apply configurable fraud detection rules including amount thresholds, geographic anomalies, velocity checks, and sequence patterns to each transaction
- Real-time decisioning -- Return allow/block/review decisions fast enough to prevent fraudulent transactions while minimizing false positives that harm user experience
- Rule management -- Enable fraud analysts to create, test, version, and deploy new detection rules with rollback capability and gradual rollout
- Alert and action routing -- Notify fraud teams, block suspicious transactions, and trigger step-up authentication based on risk scores
- Historical analysis -- Support replaying past transactions against new rules to validate effectiveness before production deployment
Non-Functional
- Scalability -- Process 50,000 transactions per second during peak load with horizontal scaling capability to handle 5x traffic spikes
- Reliability -- Ensure 99.99% uptime with no transaction loss, automatic failover, and graceful degradation when downstream services are unavailable
- Latency -- Complete rule evaluation and return decisions within 200ms at p99 to avoid timeout failures in payment flows
- Consistency -- Maintain exactly-once processing guarantees to prevent duplicate charges or missed fraud, with eventual consistency for analytics data
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Stateful Stream Processing Architecture
Fraud detection requires maintaining state across time windows (transactions per user in the last hour, geographic location changes, spending patterns). Interviewers want to see how you design stateful processing that handles out-of-order events, manages state size, and recovers quickly from failures.
Hints to consider:
- Use event time with watermarks rather than processing time to handle delayed transactions correctly
- Partition state by user ID or account ID to enable parallel processing and limit state scope per worker
- Consider how to expire old state entries and compact stateful stores to prevent unbounded growth
- Design checkpointing strategies that balance recovery speed with overhead during normal operation
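The event-time and state-expiry hints above can be sketched as a minimal per-user velocity counter. This is an illustrative toy, not a real framework API: it keeps sorted event timestamps per user, advances a watermark to reject very late events, and compacts state outside the window; the window size and lateness bound are assumed values.

```python
import bisect
from collections import defaultdict

WINDOW_SECONDS = 3600     # velocity window: transactions in the last hour (assumed)
ALLOWED_LATENESS = 300    # watermark lag tolerated for out-of-order events (assumed)

class VelocityState:
    """Tracks per-user transaction timestamps in event time, with expiry."""

    def __init__(self):
        self.events = defaultdict(list)  # user_id -> sorted list of event timestamps
        self.watermark = 0

    def observe(self, user_id, event_ts):
        # Advance the watermark from event time, not arrival time.
        self.watermark = max(self.watermark, event_ts - ALLOWED_LATENESS)
        if event_ts < self.watermark:
            return None  # too late: a real pipeline would route this to a side output
        bisect.insort(self.events[user_id], event_ts)
        self._expire(user_id, event_ts)
        return len(self.events[user_id])  # count inside the current window

    def _expire(self, user_id, now_ts):
        # Compact state: drop timestamps outside the window to bound growth.
        cutoff = now_ts - WINDOW_SECONDS
        ts = self.events[user_id]
        idx = bisect.bisect_right(ts, cutoff)
        if idx:
            self.events[user_id] = ts[idx:]
```

Partitioning this state by user ID (as the hints suggest) lets each worker hold only its shard, and the expiry pass keeps per-key state bounded.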
2. Rule Engine Design and Dynamic Configuration
The rule engine must evaluate hundreds of rules per transaction with different complexity levels (from simple thresholds to multi-event sequences) while supporting real-time rule updates. Keeping the rule-management control plane separate from the high-throughput evaluation data plane is critical here.
Hints to consider:
- Store rules in a versioned configuration service with caching layers to avoid rule-fetch becoming a bottleneck
- Design a rule representation that supports both simple filters and complex temporal patterns without forcing all rules through expensive evaluation paths
- Implement rule compilation or optimization phases so frequently-evaluated rules execute faster
- Use rule shadowing or A/B testing to validate new rules on production traffic before full activation
- Consider how to handle rule conflicts when multiple rules match the same transaction
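One way to make these ideas concrete is a versioned rule set with a shadow mode, sketched below. The rule fields, score thresholds, and decision bands are all assumptions for illustration, not a real product's schema; conflict handling here is simply additive scoring.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rule:
    rule_id: str
    version: int
    score: int                         # risk score contributed on match
    predicate: Callable[[dict], bool]  # compiled check against a transaction dict
    shadow: bool = False               # shadow rules log matches but don't affect decisions

@dataclass
class RuleSet:
    version: int
    rules: list = field(default_factory=list)

    def evaluate(self, txn: dict):
        score, shadow_hits = 0, []
        for rule in self.rules:
            if rule.predicate(txn):
                if rule.shadow:
                    shadow_hits.append(rule.rule_id)  # candidate rule: observe only
                else:
                    score += rule.score  # additive resolution when multiple rules match
        # Assumed decision bands; in practice these would be configurable.
        decision = "block" if score >= 100 else "review" if score >= 50 else "allow"
        return decision, score, shadow_hits
```

A new rule ships first with `shadow=True` so analysts can compare its matches against production traffic, then flips to active once validated; bumping `RuleSet.version` on each deploy gives the rollback point the hints call for.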
3. Handling Hotspots and Skewed Traffic
Some users generate far more transactions than others (large merchants, frequent shoppers), and some rules are more expensive to evaluate. Interviewers expect you to address partition skew and prevent hot partitions from becoming bottlenecks.
Hints to consider:
- Use composite partition keys (user + random suffix) to split high-volume accounts across multiple partitions
- Implement per-tenant or per-user rate limiting with token bucket algorithms to prevent abuse
- Design separate fast and slow evaluation paths where simple rules execute first to short-circuit expensive checks
- Consider pre-aggregating metrics for hot users in a cache layer to speed up velocity checks
- Use priority queues or separate processing lanes for different transaction value tiers
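The composite-key idea can be sketched as a salted partitioner: known hot accounts fan out across several partitions while ordinary users keep a single stable partition. The partition count, fan-out, and function names are illustrative assumptions.

```python
import hashlib

NUM_PARTITIONS = 32       # assumed partition count
HOT_ACCOUNT_FANOUT = 4    # partitions a hot account may spread across (assumed)

def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def partition_for(user_id: str, txn_id: str, is_hot: bool) -> int:
    if is_hot:
        # Derive the salt from the transaction ID so retries of the same
        # transaction deterministically land on the same partition.
        suffix = _hash(txn_id) % HOT_ACCOUNT_FANOUT
        key = f"{user_id}#{suffix}"
    else:
        key = user_id  # normal users keep all their state on one partition
    return _hash(key) % NUM_PARTITIONS
```

The trade-off: a hot account's velocity state is now split across up to `HOT_ACCOUNT_FANOUT` partitions, so per-user aggregates for those accounts need a merge step or a pre-aggregated cache.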
4. Ensuring Correctness Under Failures
Financial systems cannot lose transactions or produce duplicate fraud decisions. Interviewers probe how you guarantee exactly-once semantics, idempotency, and correct recovery from partial failures across distributed components.
Hints to consider:
- Design idempotent decision APIs using transaction IDs so retries don't cause duplicate blocks or alerts
- Use distributed transactions or saga patterns when fraud decisions must coordinate with payment processing and notification systems
- Implement deduplication windows with bloom filters or distributed caches to catch duplicate transaction submissions
- Store decision audit logs in durable storage before acknowledging transaction processing
- Plan for partial failures where rule evaluation completes but downstream actions (alerts, blocks) fail
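The idempotency hint can be reduced to a small pattern: key the decision by transaction ID and record it before acknowledging, so a retry returns the stored decision instead of re-running evaluation (and re-triggering alerts). A minimal sketch, with an in-memory dict standing in for the durable key-value store a real system would use:

```python
class DecisionStore:
    """Idempotent decision API keyed by transaction ID."""

    def __init__(self):
        self._decisions = {}  # txn_id -> decision; a durable KV store in production

    def decide(self, txn_id: str, evaluate) -> str:
        # Retry path: same transaction ID, same answer, no side effects re-fired.
        if txn_id in self._decisions:
            return self._decisions[txn_id]
        decision = evaluate()
        # Persist the audit record BEFORE acknowledging, so a crash after this
        # point replays to the recorded decision rather than a duplicate one.
        self._decisions[txn_id] = decision
        return decision
```

Downstream actions (blocks, alerts) hang off the stored decision, which also gives partial failures a clean retry point: redeliver the recorded decision to the failed action rather than re-evaluating the transaction.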