Design a Basic Auth System
Problem Statement
Design a scalable authentication and authorization platform that handles user registration, login, session management, and access control for a suite of applications and microservices. Think of platforms like Auth0, Okta, or AWS Cognito that centralize identity management so individual services do not need to implement their own authentication logic.
The system must support standard credential-based login with optional multi-factor authentication, issue and validate tokens or sessions that persist across requests, enable single sign-on across multiple applications, and enforce role-based access control policies. At peak, the platform needs to handle 100,000 authentication requests per second, maintain millions of concurrent sessions, and validate tokens with sub-100ms latency on every API call across all integrated services.
The core tension in this design is between security and performance. Stateless tokens (like JWTs) give you fast validation without network round-trips but make instant revocation difficult. Server-side sessions support immediate revocation but require a fast distributed store that becomes a critical dependency. Interviewers want to see you navigate these trade-offs while also addressing abuse prevention, multi-region deployment, and integration patterns that make adoption simple for downstream services.
Key Requirements
Functional
- User registration and verification -- Support account creation with email or phone verification, prevent duplicate accounts, and handle verification token expiration
- Credential authentication -- Validate username and password combinations with support for multi-factor authentication via TOTP apps, SMS codes, or hardware keys
- Session and token management -- Issue secure tokens after authentication, support refresh flows for long-lived sessions, and allow users to view and revoke active sessions across devices
- Single sign-on -- Allow users to authenticate once and access multiple integrated applications without re-entering credentials
- Role-based access control -- Enforce hierarchical permissions so services can authorize requests based on user roles, scopes, and resource ownership
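A minimal sketch of the role-based access control requirement, assuming a hypothetical three-level role hierarchy and made-up permission strings such as `doc:read`. The policy shown (the token scope must cover the permission, then either resource ownership or an inherited role permission grants access) is one possible interpretation, not a prescribed design.

```python
# Hypothetical role hierarchy: each role inherits from the roles it lists.
ROLE_PARENTS = {"admin": ["editor"], "editor": ["viewer"], "viewer": []}

# Hypothetical permissions granted directly by each role.
ROLE_PERMISSIONS = {
    "viewer": {"doc:read"},
    "editor": {"doc:write"},
    "admin": {"doc:delete", "user:manage"},
}


def effective_permissions(role, seen=None):
    """Returns the permissions of a role plus everything it inherits."""
    seen = set() if seen is None else seen
    if role in seen:  # guard against cycles in the hierarchy
        return set()
    seen.add(role)
    perms = set(ROLE_PERMISSIONS.get(role, ()))
    for parent in ROLE_PARENTS.get(role, ()):
        perms |= effective_permissions(parent, seen)
    return perms


def is_allowed(user_id, roles, token_scopes, permission, resource_owner=None):
    """Checks scope first, then ownership, then role-derived permissions."""
    if permission not in token_scopes:
        return False  # the token was never granted this scope
    if resource_owner is not None and resource_owner == user_id:
        return True  # assumed policy: owners may act on their own resources
    return any(permission in effective_permissions(r) for r in roles)
```

Keeping the scope check separate from the role check means a token issued to a limited client cannot exceed its scopes even when the user holds a powerful role.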
Non-Functional
- Scalability -- Handle 100,000+ authentication requests per second and 10 million concurrent sessions with horizontal scaling
- Latency -- Token validation must complete in under 100ms at p99; login flows should complete in under 500ms including MFA challenges
- Availability -- Maintain 99.99% uptime with graceful degradation; authentication failures must not cascade to dependent services
- Security -- Protect against credential stuffing, brute force attacks, session hijacking, and token replay; support compliance requirements like SOC 2 and PCI-DSS
- Consistency -- Strong consistency for credential updates and session revocations; eventual consistency acceptable for permission propagation with bounded staleness under 1 second
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Token Architecture and Revocation Strategy
The choice between stateless JWTs and server-side sessions fundamentally shapes the system's performance, security, and operational complexity. Interviewers want to see you reason through this decision and address the revocation problem that arises with stateless tokens.
Hints to consider:
- Use short-lived JWTs (15-minute expiry) as access tokens for fast validation at API gateways without network calls, paired with longer-lived opaque refresh tokens stored server-side
- Maintain a revocation list in Redis that API gateways consult on each request -- the list stays small because it holds only revoked tokens that have not yet expired, and gateways cache it locally with a sub-second TTL so most checks avoid a network call
- Rotate refresh tokens on each use so a stolen refresh token works at most once; a second presentation of an already-rotated token signals theft and should revoke the session
- Publish the public verification keys for JWT signatures through a JWKS endpoint so services can validate tokens independently
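The hints above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not a production implementation: it signs with HS256 and a shared demo secret instead of the asymmetric keys a JWKS endpoint would publish, and in-memory sets and dicts stand in for Redis and the refresh-token store. All names (`issue_access_token`, `rotate_refresh_token`, and so on) are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time

SECRET = b"demo-signing-secret"   # assumption: shared HMAC key, demo only
ACCESS_TTL_SECONDS = 15 * 60      # 15-minute access tokens

revoked_jtis = set()       # stand-in for the Redis revocation list
refresh_store = {}         # refresh token -> {"user", "family", "used"}
revoked_families = set()   # refresh-token families revoked after reuse


def _b64(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_access_token(user_id, now=None):
    now = int(time.time()) if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "exp": now + ACCESS_TTL_SECONDS,
              "jti": secrets.token_hex(8)}
    payload = _b64(json.dumps(claims).encode())
    sig = hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64(sig)}"


def verify_access_token(token, now=None):
    """Returns the claims, or None if forged, expired, or revoked."""
    now = int(time.time()) if now is None else now
    try:
        header, payload, sig = token.split(".")
        expected = hmac.new(SECRET, f"{header}.{payload}".encode(),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _unb64(sig)):
            return None
        claims = json.loads(_unb64(payload))
    except (ValueError, TypeError):
        return None
    if claims["exp"] <= now or claims["jti"] in revoked_jtis:
        return None
    return claims


def issue_refresh_token(user_id, family=None):
    token = secrets.token_urlsafe(32)  # opaque, stored server-side
    refresh_store[token] = {"user": user_id,
                            "family": family or secrets.token_hex(8),
                            "used": False}
    return token


def rotate_refresh_token(token):
    """Returns (access_token, new_refresh_token), or None if rejected."""
    record = refresh_store.get(token)
    if record is None or record["family"] in revoked_families:
        return None
    if record["used"]:
        # Reuse of a rotated token suggests theft: revoke the whole family.
        revoked_families.add(record["family"])
        return None
    record["used"] = True
    new_refresh = issue_refresh_token(record["user"], record["family"])
    return issue_access_token(record["user"]), new_refresh
```

Tracking a token family is one common way to make reuse detection actionable: once a rotated token reappears, every descendant of the original login is rejected, so both the attacker and the legitimate user must re-authenticate.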
2. Abuse Prevention and Rate Limiting
Login endpoints are prime targets for credential stuffing and brute force attacks. Interviewers expect layered defenses that stop abuse without creating false positives for legitimate users.
Hints to consider:
- Implement multi-tier rate limiting: per-IP (using Redis counters with sliding windows), per-account (tracking failed attempts), and global (protecting downstream capacity)
- Trigger adaptive challenges like CAPTCHAs based on risk signals such as unusual IP, device fingerprint mismatch, or velocity of failed attempts
- Lock accounts temporarily after N consecutive failed attempts, but implement unlock mechanisms that prevent denial-of-service against legitimate users
- Use distributed rate limiting across multiple auth service instances by centralizing counters in Redis with atomic increment operations
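As a rough sketch of the per-IP and per-account tiers, the example below uses an in-memory sliding window in place of Redis counters. The class names, limits, and lock duration are assumptions chosen for illustration; a distributed deployment would move the same logic into atomic Redis operations.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `limit` events per key within the last `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.events = defaultdict(deque)  # key -> timestamps of accepted events

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.events[key]
        while q and q[0] <= now - self.window:  # drop events outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


class LoginGuard:
    """Combines per-IP rate limiting with temporary per-account lockout."""

    def __init__(self, max_failures=5, lock_seconds=300):
        self.max_failures = max_failures
        self.lock_seconds = lock_seconds
        self.per_ip = SlidingWindowLimiter(limit=20, window=60)  # assumed limits
        self.failures = defaultdict(int)
        self.locked_until = {}

    def check(self, ip, account, now=None):
        now = time.monotonic() if now is None else now
        if not self.per_ip.allow(ip, now):
            return "rate_limited"
        if self.locked_until.get(account, 0) > now:
            return "locked"
        return "ok"

    def record_result(self, account, success, now=None):
        now = time.monotonic() if now is None else now
        if success:
            self.failures.pop(account, None)
            self.locked_until.pop(account, None)
            return
        self.failures[account] += 1
        if self.failures[account] >= self.max_failures:
            # Temporary lock that expires on its own, limiting denial-of-service.
            self.locked_until[account] = now + self.lock_seconds
            self.failures[account] = 0
```

The lock expires automatically rather than requiring manual unlock, which bounds how long an attacker can keep a legitimate user out by deliberately failing logins.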
3. Multi-Region Deployment and Data Replication
A global auth platform must serve users with low latency regardless of location while ensuring that credential changes and session revocations propagate reliably. This tension between latency and consistency reveals deep distributed systems understanding.
Hints to consider:
- Route authentication requests to the nearest region for low latency, with credential data replicated across regions via database multi-region capabilities
- Designate a home region per user for authoritative credential writes, with read replicas everywhere for validation
- Replicate JWT key material to all regions so tokens can be issued and validated locally (validation needs only the public keys), but centralize key rotation operations
- Handle network partitions by allowing read-path authentication to continue with cached data while queuing write-path changes for reconciliation
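The routing and partition-handling hints can be pictured with a small sketch. The `RegionRouter` class, region names, and the shape of a queued change are all hypothetical; the point is only to show reads served from the nearest reachable region, writes directed to a user's home region, and writes queued when that region is unreachable.

```python
from dataclasses import dataclass, field


@dataclass
class RegionRouter:
    home_region: dict                  # user_id -> authoritative region
    reachable: set                     # regions currently reachable
    pending_writes: list = field(default_factory=list)

    def route_read(self, user_id, nearest):
        """Serves reads from the nearest region, falling back to the home region."""
        if nearest in self.reachable:
            return nearest
        return self.home_region[user_id]

    def route_write(self, user_id, change):
        """Returns the home region, or None if the write was queued instead."""
        home = self.home_region[user_id]
        if home in self.reachable:
            return home
        self.pending_writes.append((home, user_id, change))
        return None

    def drain(self):
        """Returns queued writes whose home region is reachable again."""
        ready = [w for w in self.pending_writes if w[0] in self.reachable]
        self.pending_writes = [w for w in self.pending_writes
                               if w[0] not in self.reachable]
        return ready
```

In an interview it is worth calling out which writes are safe to queue: a profile update can wait for reconciliation, whereas a password change or session revocation arguably should fail loudly instead, given the strong-consistency requirement for those operations.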
4. Service Integration and Enforcement Architecture
Authentication is only effective if every service enforces it consistently. Interviewers want to see a clear architectural boundary that makes correct integration easy and incorrect integration difficult.
Hints to consider:
- Position an API gateway or service mesh sidecar as the single enforcement point that validates tokens and extracts identity claims before requests reach application code
- Design the gateway to cache token validation results and permission decisions for the token's remaining lifetime to reduce calls to the auth service
- Provide an authorization decision API for fine-grained permission checks that services call for resource-level access control beyond what the gateway handles
- Use webhooks or event streams to push critical security events (mass revocation, policy updates) to all integrated services in near real-time
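One way to combine gateway-side caching with pushed security events is sketched below. `GatewayTokenCache` and its `validate` callback are hypothetical names; the callback stands in for whatever performs full token validation (local signature check or a call to the auth service) and is assumed to return claims containing `sub` and `exp`, or `None`.

```python
import time


class GatewayTokenCache:
    """Caches validated claims until token expiry; evicts on revocation events."""

    def __init__(self, validate):
        self.validate = validate  # assumed: callable(token) -> claims dict or None
        self.cache = {}           # token -> claims

    def authenticate(self, token, now=None):
        now = time.time() if now is None else now
        claims = self.cache.get(token)
        if claims is not None:
            if claims["exp"] > now:
                return claims      # cache hit: no call to the validator
            del self.cache[token]  # expired entry
        claims = self.validate(token)
        if claims is None or claims["exp"] <= now:
            return None
        self.cache[token] = claims
        return claims

    def on_revocation_event(self, user_id):
        """Handles a pushed event by evicting every cached token of the user."""
        for token in [t for t, c in self.cache.items() if c["sub"] == user_id]:
            del self.cache[token]
```

Eviction only forces the next request back through the validator, so the validator itself must also reject the revoked token; the cache and the event stream reduce load and propagation delay but are not the source of truth.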