Problem Statement
Design a machine learning system to detect fraudulent transactions for a payments platform. The system must process transactions in real time, assigning a fraud risk score before the payment is authorized, while maintaining low latency and high throughput.
This problem combines both ML system design and traditional system design elements. You should be prepared to discuss not only the ML components -- model architecture, feature engineering, and evaluation -- but also the infrastructure required to support fraud detection at scale, including streaming pipelines, feature stores, and system reliability.
Key Requirements
Functional
- Real-time fraud scoring -- assign a risk score to each transaction before authorization, within a strict latency budget
- Feature engineering pipeline -- collect, transform, and serve features from raw transaction data, user history, and device signals
- Model training and retraining -- train on labeled fraud data with support for periodic retraining as fraud patterns evolve
- Class imbalance handling -- address the extreme skew between legitimate and fraudulent transactions in training and evaluation
- Rules and model ensemble -- combine ML model scores with deterministic rules for known fraud patterns and regulatory requirements
Non-Functional
- Low latency -- end-to-end scoring must complete within tens of milliseconds to avoid degrading the checkout experience
- High throughput -- handle millions of transactions per day with consistent performance under peak load
- High availability -- the scoring service must be fault-tolerant; downtime means either blocking all payments or letting fraud through
- Scalability -- all components (feature store, model serving, data pipeline) should scale horizontally
- Monitoring and observability -- track model performance, data drift, and system health in production
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. End-to-End System Architecture (Most Emphasized)
Interviewers expect you to draw a complete system diagram showing how all components interact -- from raw transaction ingestion through feature computation, model scoring, and the decision engine. They want to see databases, message queues, model serving infrastructure, and how they connect.
Hints to consider:
- Sketch a streaming pipeline (e.g., Kafka) that ingests transaction events in real time
- Include a feature store with both online (low-latency lookups) and offline (batch training) paths
- Show where the model serving layer sits relative to the payment authorization flow
- Include a rules engine alongside the ML model for hard-coded fraud rules
- Discuss fallback behavior when the model service is unavailable
2. Model Design and Feature Engineering
Interviewers probe deeply on which models suit fraud detection and how you engineer discriminative features from raw data.
Hints to consider:
- Gradient-boosted trees (XGBoost, LightGBM) are common baselines; discuss why they work well for tabular fraud data
- Deep learning approaches (e.g., sequence models for transaction history) can add lift but come with latency trade-offs
- Features to consider: transaction amount, merchant category, time since last transaction, device fingerprint, geolocation velocity, historical fraud rates
- Aggregated features (e.g., rolling 24-hour transaction count) require a real-time feature computation layer
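The rolling-window aggregation mentioned above can be sketched in a few lines. This is a single-process, in-memory illustration of the semantics only; a production layer would use a stream processor (e.g., Flink) writing to an online store, and the class and method names here are hypothetical.

```python
from collections import deque

class RollingCountFeature:
    """Per-user rolling transaction count over a fixed time window.

    In-memory sketch of the real-time aggregation a feature computation
    layer would provide; not a production implementation.
    """

    def __init__(self, window_seconds: int = 24 * 3600):
        self.window = window_seconds
        self.events: dict[str, deque] = {}

    def update_and_get(self, user_id: str, ts: float) -> int:
        """Record a transaction at time `ts` and return the count in-window."""
        q = self.events.setdefault(user_id, deque())
        q.append(ts)
        # Evict events that have fallen out of the window.
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q)
```

The same pattern generalizes to rolling sums (e.g., dollar volume) by storing (timestamp, amount) pairs instead of bare timestamps.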
3. Handling Class Imbalance
Fraud is rare -- often less than 0.1 percent of transactions. Interviewers want to see that you understand the implications for training and evaluation.
Hints to consider:
- Oversampling (SMOTE), undersampling, or cost-sensitive learning during training
- Use precision-recall curves and PR-AUC rather than accuracy or ROC-AUC alone
- Discuss the business trade-off between false positives (blocking legitimate users) and false negatives (missing fraud)
- Stratified sampling for train/test splits to preserve class distribution
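One concrete form of cost-sensitive learning is reweighting the minority class by the negative-to-positive ratio, which is the common heuristic behind XGBoost's `scale_pos_weight` parameter. A minimal sketch of the arithmetic, assuming a 0.1 percent fraud rate:

```python
def scale_pos_weight(n_negative: int, n_positive: int) -> float:
    """Heuristic class weight for imbalanced binary training: weight
    positives by the negative/positive ratio so both classes contribute
    comparably to the loss."""
    return n_negative / n_positive

# With a 0.1 percent fraud rate over 1,000,000 transactions:
# 999,000 legitimate, 1,000 fraudulent -> weight of 999 per fraud example.
weight = scale_pos_weight(999_000, 1_000)
```

In practice this value is a starting point, not an answer; the final weight (or sampling ratio) should be tuned against the business metric, such as precision at the target recall.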
4. Evaluation Metrics and Thresholding
Choosing the right metrics and operating point is critical for a fraud system.
Hints to consider:
- Precision at a fixed recall level (e.g., 95 percent recall) is a common business metric
- Dollar-weighted metrics may matter more than transaction-count metrics
- A/B testing framework for comparing model versions in production
- Shadow mode deployment before full rollout
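The "precision at a fixed recall" metric above can be computed by sweeping score thresholds from highest to lowest. A pure-Python sketch (in practice `sklearn.metrics.precision_recall_curve` does this work):

```python
def precision_at_recall(labels: list[int], scores: list[float],
                        target_recall: float) -> float:
    """Best precision achievable at or above `target_recall`.

    Ranks examples by score (descending) and evaluates every threshold;
    labels are 1 for fraud, 0 for legitimate.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    best = 0.0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        recall = tp / total_pos
        if recall >= target_recall:
            best = max(best, tp / (tp + fp))
    return best
```

A dollar-weighted variant would accumulate transaction amounts instead of counts in `tp` and `fp`.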
5. Scalability and Reliability
Interviewers push on how you handle peak load and failure modes in a system that sits in the critical payment path.
Hints to consider:
- Horizontal scaling of model serving with load balancing
- Caching frequently accessed features (e.g., user profile features) to reduce latency
- Circuit breakers and graceful degradation if the model service is slow or down
- Separate read replicas for feature store queries versus training data exports
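The circuit-breaker idea above can be sketched as follows. This is an illustration of the pattern, not a production library; the failure and reset parameters are illustrative, and the fallback would typically be a rules-only decision.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker around a model-scoring call.

    After `max_failures` consecutive errors the circuit opens and calls
    short-circuit to the fallback for `reset_seconds`, then one trial
    call is allowed through (half-open).
    """

    def __init__(self, max_failures: int = 3, reset_seconds: float = 30.0,
                 now=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.now = now
        self.failures = 0
        self.opened_at = None

    def call(self, score_fn, fallback):
        if self.opened_at is not None:
            if self.now() - self.opened_at < self.reset_seconds:
                return fallback()      # circuit open: skip the model entirely
            self.opened_at = None      # half-open: allow one trial call
            self.failures = 0
        try:
            result = score_fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.now()
            return fallback()
```

Injecting the clock (`now`) keeps the breaker testable; in the payment path, the fallback keeps authorizations flowing on rules alone rather than blocking checkout.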
Suggested Approach
Step 1: Clarify Requirements
Ask about the scale (transactions per second), latency budget, acceptable false positive rate, and whether you should focus on card-not-present fraud, account-level fraud, or both. Confirm whether the interviewer wants an ML-heavy or infrastructure-heavy deep dive.
Step 2: High-Level Architecture
Sketch these core components:
- Event Ingestion Layer -- receives transaction events via streaming (Kafka or similar) and routes them to feature computation and scoring
- Feature Store -- dual-layer store with online serving (Redis or DynamoDB) for real-time lookups and offline store (S3 or data warehouse) for training
- Feature Computation -- stream processor (Flink or Spark Streaming) that computes real-time aggregated features
- Model Serving Service -- stateless service that loads the trained model and returns fraud scores with low latency
- Rules Engine -- applies deterministic business rules alongside ML scores for the final decision
- Decision Service -- combines model score and rules output to produce an approve/decline/review decision
- Training Pipeline -- batch pipeline that pulls labeled data from the offline store, trains new models, and registers them for deployment
- Monitoring and Alerting -- tracks prediction distributions, latency, and data drift
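The Decision Service's combination logic can be sketched simply. The thresholds and rule names below are hypothetical placeholders; in a real system they are tuned offline against the business cost of false positives and false negatives.

```python
def decide(score: float, rule_hits: list[str],
           decline_threshold: float = 0.9,
           review_threshold: float = 0.6) -> str:
    """Combine a model fraud score with rules-engine output into a
    final approve/review/decline decision (illustrative thresholds)."""
    if rule_hits:                      # deterministic rules override the model
        return "decline"
    if score >= decline_threshold:
        return "decline"
    if score >= review_threshold:
        return "review"
    return "approve"
```

Keeping this logic separate from model serving lets the thresholds be adjusted (or rules added for regulatory reasons) without redeploying the model.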
Step 3: Deep Dive on Model Design
Discuss your model choice, feature set, and how you handle class imbalance. Walk through the training pipeline: label acquisition (chargebacks, manual reviews), feature engineering, model selection, hyperparameter tuning, and offline evaluation.
Tie the offline evaluation back to the metrics from the previous sections: report precision at the target recall on a time-split holdout (training on older data, evaluating on newer) so the evaluation reflects how the model will face evolving fraud patterns in production.
Step 4: Deep Dive on Serving Infrastructure
This is where Stripe interviews diverge from pure ML interviews. Discuss how the model serving layer integrates into the payment authorization flow, latency optimizations, caching strategies, and what happens when the model service is degraded.
Step 5: Post-Deployment Monitoring
Discuss data drift detection, model performance decay, feedback loops (fraudsters adapting to the model), and your retraining cadence.
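One standard drift signal is the Population Stability Index (PSI) between the feature or score distribution at training time and in production. A minimal sketch over pre-binned distributions; the 0.2 alert threshold mentioned in the comment is a common heuristic, not a universal rule:

```python
import math

def population_stability_index(expected: list[float],
                               actual: list[float],
                               eps: float = 1e-6) -> float:
    """PSI between two binned distributions (fractions summing to 1).

    A common heuristic treats PSI > 0.2 as meaningful drift worth
    investigating; teams calibrate their own thresholds.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)   # clamp to avoid log(0) on empty bins
        a = max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

Running this per-feature (and on the model score itself) on a schedule, with alerts feeding the retraining cadence, closes the monitoring loop described above.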
Real Interview Quotes
"The problem distinctively combines ML system design with traditional system design concepts. For the ML portion, standard preparation applies -- data processing, model design, evaluation metrics, deployment strategies. What makes it unique is the emphasis on system architecture."
"Candidates were asked to draw a complete system diagram showing different components including databases and servers. Questions about scalability, reliability, and throughput were central to the discussion."
"The interviewer was helpful in guiding candidates through the system architecture portion. Even without extensive preparation on traditional system design, candidates found they could work through the requirements with the interviewer's guidance."
"Be ready to go deep on how you handle the extreme class imbalance -- they want to see you understand why accuracy is misleading and how you pick the right threshold for the business."