Design a machine learning system to predict account takeover (ATO) risk for a payments API platform. Account takeover occurs when bad actors gain unauthorized access to legitimate user accounts through stolen credentials, session hijacking, or identity fraud.
The problem is intentionally open-ended about the specific use case. Before diving into the solution, you should clarify with the interviewer whether the focus is on login-time detection, session-level anomaly detection, or transaction-level risk scoring -- each leads to a different system design. The system must score risk in real time while balancing security (catching compromised accounts) against user experience (avoiding false lockouts of legitimate users).
Real-time risk scoring -- evaluate ATO risk at login, session activity, or transaction time and return a score within a strict latency budget
Multi-signal feature engineering -- combine behavioral signals (login patterns, device fingerprints, IP reputation, geolocation) with historical account data
Adaptive model retraining -- support periodic and triggered retraining as attacker tactics evolve over time
Tiered response actions -- map risk scores to graduated responses such as allow, step-up authentication (MFA), temporary lock, or manual review
Feedback loop integration -- incorporate user-confirmed ATO reports and false positive feedback to improve the model continuously
Low latency -- scoring must complete in tens of milliseconds to avoid degrading the login or transaction experience
High availability -- the risk scoring service must be always-on; failing closed blocks all logins, while failing open lets attackers through
Scalability -- handle spikes in login volume (e.g., credential stuffing attacks generating millions of attempts)
Privacy compliance -- handle sensitive user data (IP addresses, device info, location) in compliance with data retention and privacy regulations
Monitoring and drift detection -- track model accuracy, feature distributions, and attacker pattern shifts in production
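The tiered-response requirement above can be sketched as a simple score-to-action mapping. The threshold values and action names here are illustrative assumptions, not a tuned policy:

```python
# Sketch of mapping a model risk score to a graduated response.
# Thresholds and action names are illustrative, not tuned values.

ACTION_TIERS = [  # (minimum risk score, action), checked from highest to lowest
    (0.95, "temporary_lock"),
    (0.80, "manual_review"),
    (0.50, "step_up_mfa"),
    (0.00, "allow"),
]

def choose_action(risk_score: float) -> str:
    """Map a model risk score in [0, 1] to a graduated response."""
    for threshold, action in ACTION_TIERS:
        if risk_score >= threshold:
            return action
    return "allow"
```

In practice the thresholds would be set from the precision-recall tradeoff and revisited as attack patterns shift.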
Based on real interview experiences, these are the areas interviewers probe most deeply:
Interviewers spend roughly 20 minutes on feature engineering and model selection. They want to see a rich, well-structured feature set and a thoughtful model choice.
Behavioral features: login time of day, login frequency, time since last login, typical session duration
Device and network features: device fingerprint, IP address reputation, geolocation, IP-to-account velocity
Historical features: rolling count of failed logins, number of distinct devices in the past 30 days, average transaction amount
Consider both numerical and categorical features, and discuss preprocessing (missing value handling, standardization, categorical encoding)
Gradient-boosted trees are a strong baseline for tabular data; discuss when you might add a sequence model for session-level patterns
Address risks during development: data imbalance, feature redundancy, overfitting on historical attack patterns
Interviewers spend roughly 15 minutes probing the end-to-end system architecture, with particular attention to the real-time serving path.
Interviewers spend roughly 15 minutes connecting the ML system to business outcomes.
Reducing the probability of successful account takeovers directly impacts user trust and platform revenue
False positives (legitimate users locked out or forced through extra MFA) degrade user experience and increase support costs
Discuss how you set the risk threshold: too aggressive a threshold locks out good users, too lenient a threshold lets attackers in
Propose metrics for business impact: ATO rate reduction, false lockout rate, user friction index
Consider tiered responses (step-up auth vs. hard block) to reduce friction for borderline cases
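One way to make the threshold discussion concrete is to pick an operating point off the precision-recall curve. The synthetic scores and the 0.90 precision target below are assumptions for illustration; in practice this would run on a held-out validation set:

```python
# Sketch of choosing an operating threshold from the precision-recall tradeoff.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(1)
y_true = rng.random(5000) < 0.02                       # ~2% ATO rate
# Simulate a model that scores true ATOs higher on average
scores = np.where(y_true, rng.beta(5, 2, 5000), rng.beta(2, 5, 5000))

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# Among thresholds meeting the precision target, take the lowest one
# (maximizing recall), e.g. precision >= 0.90 for a hard-block tier.
ok = precision[:-1] >= 0.90
block_threshold = thresholds[ok].min() if ok.any() else 1.0
```

Lower tiers (step-up MFA) can use a looser precision target, since the cost of a false positive there is friction rather than a lockout.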
Interviewers frequently ask follow-up questions about train/test split strategies for imbalanced data.
Oversampling the minority class (e.g., with SMOTE) or undersampling the majority class
Stratified sampling to preserve class distribution in train/test splits
Cost-sensitive learning where misclassifying an ATO is penalized more heavily
Evaluate with precision-recall curves and PR-AUC rather than accuracy
Discuss how label quality affects imbalance -- many ATOs go unreported, leading to noisy negative labels
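A minimal sketch of cost-sensitive learning evaluated with PR-AUC, on synthetic data. The class-weight scheme and the choice of logistic regression are illustrative assumptions:

```python
# Cost-sensitive learning sketch: penalize missed ATOs more via class weights,
# and evaluate with PR-AUC (average precision) rather than accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 4))
# Synthetic labels: ~2% positives, separable along the first feature
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 2.2).astype(int)

# class_weight="balanced" reweights errors inversely to class frequency;
# an explicit dict such as {0: 1, 1: 50} would encode a business cost ratio
clf = LogisticRegression(class_weight="balanced").fit(X, y)
scores = clf.predict_proba(X)[:, 1]
pr_auc = average_precision_score(y, scores)  # robust to class imbalance
```

Accuracy on this data would look excellent even for a model that never flags an ATO, which is exactly why PR-AUC is the better headline metric here.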
The open-ended nature of this problem is itself a test. Interviewers watch whether you ask clarifying questions before jumping to a solution.
Ask: Is the focus on login-time risk, session anomaly detection, or post-login transaction risk?
Ask: What response actions are available (block, MFA challenge, flag for review)?
Ask: What labeled data is available and how is ATO ground truth established?
Scoping the problem well signals senior-level thinking