Problem Statement
Design a machine learning system to detect fraudulent transactions for a payments platform. The system must process transactions in real time, assigning a fraud risk score before the payment is authorized, while maintaining low latency and high throughput.
This problem combines both ML system design and traditional system design elements. You should be prepared to discuss not only the ML components -- model architecture, feature engineering, and evaluation -- but also the infrastructure required to support fraud detection at scale, including streaming pipelines, feature stores, and system reliability.
Key Requirements
Functional
- Real-time fraud scoring -- assign a risk score to each transaction before authorization, within a strict latency budget
- Feature engineering pipeline -- collect, transform, and serve features from raw transaction data, user history, and device signals
- Model training and retraining -- train on labeled fraud data with support for periodic retraining as fraud patterns evolve
- Class imbalance handling -- address the extreme skew between legitimate and fraudulent transactions in training and evaluation
- Rules and model ensemble -- combine ML model scores with deterministic rules for known fraud patterns and regulatory requirements
Non-Functional
- Low latency -- end-to-end scoring must complete within tens of milliseconds to avoid degrading the checkout experience
- High throughput -- handle millions of transactions per day with consistent performance under peak load
- High availability -- the scoring service must be fault-tolerant; downtime means either blocking all payments or letting fraud through
- Scalability -- all components (feature store, model serving, data pipeline) should scale horizontally
- Monitoring and observability -- track model performance, data drift, and system health in production
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. End-to-End System Architecture (Most Emphasized)
Interviewers expect you to draw a complete system diagram showing how all components interact -- from raw transaction ingestion through feature computation, model scoring, and the decision engine. They want to see databases, message queues, model serving infrastructure, and how they connect.
Hints to consider:
- Sketch a streaming pipeline (e.g., Kafka) that ingests transaction events in real time
- Include a feature store with both online (low-latency lookups) and offline (batch training) paths
- Show where the model serving layer sits relative to the payment authorization flow
- Include a rules engine alongside the ML model for hard-coded fraud rules
- Discuss fallback behavior when the model service is unavailable
2. Model Design and Feature Engineering
Interviewers probe deeply on which models suit fraud detection and how you engineer discriminative features from raw data.
Hints to consider:
- Gradient-boosted trees (XGBoost, LightGBM) are common baselines; discuss why they work well for tabular fraud data
- Deep learning approaches (e.g., sequence models for transaction history) can add lift but come with latency trade-offs
- Features to consider: transaction amount, merchant category, time since last transaction, device fingerprint, geolocation velocity, historical fraud rates
- Aggregated features (e.g., rolling 24-hour transaction count) require a real-time feature computation layer
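The rolling-window aggregation mentioned above can be sketched in a few lines. This is a single-process, in-memory illustration of the semantics only; a production layer would use a stream processor (e.g., Flink) writing to an online store, and the class and method names here are hypothetical.

```python
from collections import deque

class RollingCountFeature:
    """Per-user rolling transaction count over a fixed time window.

    In-memory sketch of the real-time aggregation a feature computation
    layer would provide; not a production implementation.
    """

    def __init__(self, window_seconds: int = 24 * 3600):
        self.window = window_seconds
        self.events: dict[str, deque] = {}

    def update_and_get(self, user_id: str, ts: float) -> int:
        """Record a transaction at time `ts` and return the count in-window."""
        q = self.events.setdefault(user_id, deque())
        q.append(ts)
        # Evict events that have fallen out of the window.
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q)
```

The same pattern generalizes to rolling sums (e.g., dollar volume) by storing (timestamp, amount) pairs instead of bare timestamps.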
3. Handling Class Imbalance
Fraud is rare -- often less than 0.1 percent of transactions. Interviewers want to see that you understand the implications for training and evaluation.
Hints to consider:
- Oversampling (SMOTE), undersampling, or cost-sensitive learning during training
- Use precision-recall curves and PR-AUC rather than accuracy or ROC-AUC alone
- Discuss the business trade-off between false positives (blocking legitimate users) and false negatives (missing fraud)
- Stratified sampling for train/test splits to preserve class distribution
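One concrete form of cost-sensitive learning is reweighting the minority class by the negative-to-positive ratio, which is the common heuristic behind XGBoost's `scale_pos_weight` parameter. A minimal sketch of the arithmetic, assuming a 0.1 percent fraud rate:

```python
def scale_pos_weight(n_negative: int, n_positive: int) -> float:
    """Heuristic class weight for imbalanced binary training: weight
    positives by the negative/positive ratio so both classes contribute
    comparably to the loss."""
    return n_negative / n_positive

# With a 0.1 percent fraud rate over 1,000,000 transactions:
# 999,000 legitimate, 1,000 fraudulent -> weight of 999 per fraud example.
weight = scale_pos_weight(999_000, 1_000)
```

In practice this value is a starting point, not an answer; the final weight (or sampling ratio) should be tuned against the business metric, such as precision at the target recall.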
4. Evaluation Metrics and Thresholding
Choosing the right metrics and operating point is critical for a fraud system.
Hints to consider:
- Precision at a fixed recall level (e.g., 95 percent recall) is a common business metric
- Dollar-weighted metrics may matter more than transaction-count metrics
- A/B testing framework for comparing model versions in production
- Shadow mode deployment before full rollout
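The "precision at a fixed recall" metric above can be computed by sweeping score thresholds from highest to lowest. A pure-Python sketch (in practice `sklearn.metrics.precision_recall_curve` does this work):

```python
def precision_at_recall(labels: list[int], scores: list[float],
                        target_recall: float) -> float:
    """Best precision achievable at or above `target_recall`.

    Ranks examples by score (descending) and evaluates every threshold;
    labels are 1 for fraud, 0 for legitimate.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    best = 0.0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        recall = tp / total_pos
        if recall >= target_recall:
            best = max(best, tp / (tp + fp))
    return best
```

A dollar-weighted variant would accumulate transaction amounts instead of counts in `tp` and `fp`.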
5. Scalability and Reliability
Interviewers push on how you handle peak load and failure modes in a system that sits in the critical payment path.
Hints to consider:
- Horizontal scaling of model serving with load balancing
- Caching frequently accessed features (e.g., user profile features) to reduce latency
- Circuit breakers and graceful degradation if the model service is slow or down
- Separate read replicas for feature store queries versus training data exports
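The circuit-breaker idea above can be sketched as follows. This is an illustration of the pattern, not a production library; the failure and reset parameters are illustrative, and the fallback would typically be a rules-only decision.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker around a model-scoring call.

    After `max_failures` consecutive errors the circuit opens and calls
    short-circuit to the fallback for `reset_seconds`, then one trial
    call is allowed through (half-open).
    """

    def __init__(self, max_failures: int = 3, reset_seconds: float = 30.0,
                 now=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.now = now
        self.failures = 0
        self.opened_at = None

    def call(self, score_fn, fallback):
        if self.opened_at is not None:
            if self.now() - self.opened_at < self.reset_seconds:
                return fallback()      # circuit open: skip the model entirely
            self.opened_at = None      # half-open: allow one trial call
            self.failures = 0
        try:
            result = score_fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.now()
            return fallback()
```

Injecting the clock (`now`) keeps the breaker testable; in the payment path, the fallback keeps authorizations flowing on rules alone rather than blocking checkout.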
Suggested Approach
Step 1: Clarify Requirements
Ask about the scale (transactions per second), latency budget, acceptable false positive rate, and whether you should focus on card-not-present fraud, account-level fraud, or both. Confirm whether the interviewer wants an ML-heavy or infrastructure-heavy deep dive.
Step 2: High-Level Architecture
Sketch these core components:
- Event Ingestion Layer -- receives transaction events via streaming (Kafka or similar) and routes them to feature computation and scoring
- Feature Store -- dual-layer store with online serving (Redis or DynamoDB) for real-time lookups and offline store (S3 or data warehouse) for training
- Feature Computation -- stream processor (Flink or Spark Streaming) that computes real-time aggregated features
- Model Serving Service -- stateless service that loads the trained model and returns fraud scores with low latency
- Rules Engine -- applies deterministic business rules alongside ML scores for the final decision
- Decision Service -- combines model score and rules output to produce an approve/decline/review decision
- Training Pipeline -- batch pipeline that pulls labeled data from the offline store, trains new models, and registers them for deployment
- Monitoring and Alerting -- tracks prediction distributions, latency, and data drift
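The Decision Service's combination logic can be sketched simply. The thresholds and rule names below are hypothetical placeholders; in a real system they are tuned offline against the business cost of false positives and false negatives.

```python
def decide(score: float, rule_hits: list[str],
           decline_threshold: float = 0.9,
           review_threshold: float = 0.6) -> str:
    """Combine a model fraud score with rules-engine output into a
    final approve/review/decline decision (illustrative thresholds)."""
    if rule_hits:                      # deterministic rules override the model
        return "decline"
    if score >= decline_threshold:
        return "decline"
    if score >= review_threshold:
        return "review"
    return "approve"
```

Keeping this logic separate from model serving lets the thresholds be adjusted (or rules added for regulatory reasons) without redeploying the model.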
Step 3: Deep Dive on Model Design
Discuss your model choice, feature set, and how you handle class imbalance. Walk through the training pipeline: label acquisition (chargebacks, manual reviews), feature engineering, model selection, hyperparameter tuning, and offline evaluation.
Tie the offline evaluation back to the metrics from the previous sections: report precision at the target recall on a time-split holdout (training on older data, evaluating on newer) so the evaluation reflects how the model will face evolving fraud patterns in production.
Step 4: Deep Dive on Serving Infrastructure
This is where Stripe interviews diverge from pure ML interviews. Discuss how the model serving layer integrates into the payment authorization flow, latency optimizations, caching strategies, and what happens when the model service is degraded.
Step 5: Post-Deployment Monitoring
Discuss data drift detection, model performance decay, feedback loops (fraudsters adapting to the model), and your retraining cadence.
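One standard drift signal is the Population Stability Index (PSI) between the feature or score distribution at training time and in production. A minimal sketch over pre-binned distributions; the 0.2 alert threshold mentioned in the comment is a common heuristic, not a universal rule:

```python
import math

def population_stability_index(expected: list[float],
                               actual: list[float],
                               eps: float = 1e-6) -> float:
    """PSI between two binned distributions (fractions summing to 1).

    A common heuristic treats PSI > 0.2 as meaningful drift worth
    investigating; teams calibrate their own thresholds.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)   # clamp to avoid log(0) on empty bins
        a = max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

Running this per-feature (and on the model score itself) on a schedule, with alerts feeding the retraining cadence, closes the monitoring loop described above.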
Real Interview Quotes
"The problem distinctively combines ML system design with traditional system design concepts. For the ML portion, standard preparation applies -- data processing, model design, evaluation metrics, deployment strategies. What makes it unique is the emphasis on system architecture."
"Candidates were asked to draw a complete system diagram showing different components including databases and servers. Questions about scalability, reliability, and throughput were central to the discussion."
"The interviewer was helpful in guiding candidates through the system architecture portion. Even without extensive preparation on traditional system design, candidates found they could work through the requirements with the interviewer's guidance."
"Be ready to go deep on how you handle the extreme class imbalance -- they want to see you understand why accuracy is misleading and how you pick the right threshold for the business."