Design a URL shortening service similar to TinyURL or Bitly that allows users to convert long URLs into short, shareable links and manage their shortened URLs. A user pastes a long URL, receives a compact code like https://sho.rt/Ab3Cd, and anyone visiting that short link is immediately redirected to the original address.
The service must handle creating short links, redirecting users with sub-50ms latency at the edge, managing links through an authenticated dashboard (view, disable, delete, update destination), and delivering basic analytics including total clicks, time-series trends, referrer sources, and geographic breakdowns. While the problem appears straightforward, it exercises core distributed systems skills: globally unique ID generation without collisions, extreme read-path scaling for redirects, edge-level serving via CDN, write-heavy analytics event capture, abuse prevention, and thoughtful data modeling.
At Zscaler scale, interviewers use this question to evaluate whether you can define crisp requirements, estimate scale realistically, select the right storage and caching strategy, and make pragmatic trade-offs around availability, consistency, and cost.
Based on real interview experiences, these are the areas interviewers probe most deeply:
The ID generation strategy determines much of the system's scalability and correctness. A naive single auto-increment counter creates contention and a single point of failure.
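One common scheme is base62-encoding IDs drawn from pre-allocated per-node counter ranges, so no two nodes can mint the same code. A minimal sketch; `RangeAllocator` here is an in-process stand-in for a coordination service (e.g. ZooKeeper or a database sequence) that hands out disjoint blocks:

```python
# Base62 alphabet: URL-safe, no padding or special characters.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

class RangeAllocator:
    """Hands each node a disjoint block of IDs so nodes never collide.

    A real deployment would persist `next_block` in a strongly
    consistent store; this in-memory version only illustrates the idea.
    """
    def __init__(self, block_size: int = 1000):
        self.block_size = block_size
        self.next_block = 0

    def allocate(self) -> range:
        start = self.next_block
        self.next_block += self.block_size
        return range(start, start + self.block_size)
```

Each application server burns through its local range with no coordination, requesting a new block only when exhausted, so counter contention is amortized over thousands of creations.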
Hints to consider: base62 encoding of a counter avoids base64's + and / characters, which are not URL-safe and require percent-encoding.

Redirect traffic is overwhelmingly read-heavy. Hammering the primary datastore on every redirect will miss latency targets and inflate costs.
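The read path can be sketched as a tiered lookup that touches the database only on a cache miss. This toy version uses one dict in place of Redis and another in place of DynamoDB; the point is the lookup order and the cache-fill on miss:

```python
class TieredLookup:
    """Resolve short codes cache-first; fall through to the store on a miss."""

    def __init__(self, db: dict):
        self.db = db        # authoritative store (stand-in for DynamoDB)
        self.cache = {}     # hot-path cache (stand-in for Redis / CDN edge)
        self.hits = 0
        self.misses = 0

    def resolve(self, code: str):
        if code in self.cache:
            self.hits += 1
            return self.cache[code]
        self.misses += 1
        url = self.db.get(code)
        if url is not None:
            self.cache[code] = url   # populate cache for future requests
        return url
```

With a realistic 100:1 read-to-write ratio and a high hit rate, the database sees a tiny fraction of redirect traffic; the same pattern extends naturally to CDN-before-Redis-before-origin.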
Coupling analytics writes to the redirect path raises tail latency and creates failure correlation. An analytics outage should never break the redirect experience.
URL shorteners are frequent targets for spam, phishing, and denial-of-service attacks. Interviewers want to see you think about operational safety beyond the happy path.
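Per-client rate limiting is one concrete defense against bulk spam creation and redirect floods. A token-bucket sketch (rate and capacity values are illustrative, not prescribed by the problem):

```python
import time

class TokenBucket:
    """Classic token bucket: refill at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice this runs in the API Gateway keyed by API key or client IP; pair it with destination-URL scanning against phishing blocklists before a link ever goes live.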
The core mapping from short code to URL is a classic key-value problem, but the full data model includes user ownership, metadata, expiration, and analytics.
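One possible record shape, assuming a single KV table keyed by short code (field names are illustrative, not fixed by the problem):

```python
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class LinkRecord:
    """One item in the code-to-URL table; `code` is the partition key."""
    code: str
    long_url: str
    owner_id: str                        # supports the authenticated dashboard
    created_at: float = field(default_factory=time.time)
    expires_at: Optional[float] = None   # maps to a TTL attribute in DynamoDB
    disabled: bool = False               # soft-disable without deleting history

    def is_servable(self, now: Optional[float] = None) -> bool:
        """A redirect may be served only for enabled, unexpired links."""
        now = now if now is not None else time.time()
        return not self.disabled and (self.expires_at is None or now < self.expires_at)
```

Analytics counters live in a separate store keyed by code plus time bucket, keeping the hot redirect item small and cache-friendly.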
Confirm the expected scale: how many link creations per day, how many redirects per day, and what read-to-write ratio to design for. Ask whether custom aliases are supported (users choosing their own short code). Clarify the analytics depth: just click counts, or full breakdowns by time, geography, and referrer. Confirm whether links can expire and whether the system needs to support bulk creation via API. Ask about the geographic distribution of users to inform CDN and multi-region decisions.
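A quick back-of-envelope pass makes those questions concrete. The numbers below are assumptions for illustration, to be replaced by whatever the interviewer confirms:

```python
# Assumed scale (illustrative): 1M creations/day, 100M redirects/day
# gives a 100:1 read-to-write ratio.
creations_per_day = 1_000_000
redirects_per_day = 100_000_000
seconds_per_day = 86_400

write_qps = creations_per_day / seconds_per_day        # ~12 QPS
read_qps = redirects_per_day / seconds_per_day         # ~1,157 QPS
peak_read_qps = read_qps * 3                           # assume 3x peak factor

# Storage: ~500 bytes per record (code, URL, owner, timestamps).
bytes_per_record = 500
storage_5y_tb = creations_per_day * 365 * 5 * bytes_per_record / 1e12  # ~0.91 TB
```

The takeaway to state out loud: writes and storage are trivial at this scale; the entire design pressure sits on the redirect read path and the analytics event firehose.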
Sketch the core components: an API Gateway for authentication and rate limiting, a Link Service handling creation and management, a Redirect Service optimized for speed, a Cache Layer (Redis plus CDN), a primary datastore (DynamoDB) for the code-to-URL mapping and metadata, a message queue (Kafka) for analytics event ingestion, and an Analytics Service backed by a time-series store. Show two distinct paths: the write path (create link, generate code, persist, return short URL) and the read path (CDN check, Redis check, database lookup, serve redirect, emit analytics event). Emphasize that the redirect path is the hot path and must stay as thin as possible.
Walk through the full lifecycle of a redirect request. A client hits the CDN with a short URL. If the edge has a cached redirect, it responds immediately in under 10ms. On a cache miss, the request reaches the Redirect Service, which checks Redis and then DynamoDB if needed. The mapping is served as a 302 redirect and cached at both the Redis and CDN layers for future requests. Simultaneously, a click event containing the short code, timestamp, referrer, and client geography is published to Kafka. For code generation, explain how a distributed counter with pre-allocated ranges per node (or a hash-based scheme with collision detection) avoids hotspots and single points of failure. Show how conditional writes in DynamoDB guarantee uniqueness without requiring distributed locking.
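The hash-based variant with conditional writes can be sketched as follows. `put_if_absent` mimics DynamoDB's `attribute_not_exists` condition on `PutItem`, and a salt retry handles the rare prefix collision; the helper names are ours, not a real client API:

```python
import hashlib

class KVStore:
    """In-memory stand-in for DynamoDB conditional PutItem."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, key: str, value: str) -> bool:
        """Atomic insert-if-new; False means the condition failed."""
        if key in self.items:
            return False
        self.items[key] = value
        return True

def shorten(store: KVStore, long_url: str, length: int = 7) -> str:
    """Derive a code from a hash; on collision, extend the salt and retry."""
    salt = 0
    while True:
        digest = hashlib.sha256(f"{long_url}:{salt}".encode()).hexdigest()
        code = digest[:length]
        if store.put_if_absent(code, long_url):
            return code                       # fresh code claimed atomically
        if store.items[code] == long_url:
            return code                       # same URL shortened before: reuse
        salt += 1                             # true collision: retry with new salt
```

Because the conditional write is atomic at the storage layer, two concurrent creations of colliding codes cannot both succeed, and no distributed lock is needed.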
Cover the analytics pipeline: Kafka consumers aggregate click events into per-link counters and time-series buckets, stored in a columnar database for efficient dashboard queries. Discuss cache invalidation when a user disables or deletes a link: invalidate both CDN and Redis entries and serve a 404 or 410 Gone response. Address link expiration using DynamoDB TTL to trigger automatic cleanup of expired entries, followed by cache eviction. Cover monitoring: track redirect latency percentiles, cache hit rates, link creation throughput, and Kafka consumer lag. Discuss cost optimization: the vast majority of redirects should be served from CDN edge caches, minimizing origin hits and database reads to keep infrastructure costs proportional to actual creation volume rather than redirect volume.
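The consumer-side aggregation reduces to a fold over click events into per-link totals and hourly time buckets. A sketch; a real consumer would additionally checkpoint Kafka offsets and flush batches to the columnar store:

```python
from collections import defaultdict

def aggregate(events, bucket_seconds: int = 3600):
    """Fold raw click events into per-link totals and time-series buckets.

    Each event is a dict with at least "code" and "ts" (epoch seconds).
    Returns (totals, series) where series maps (code, bucket_start) -> clicks.
    """
    totals = defaultdict(int)
    series = defaultdict(int)
    for ev in events:
        totals[ev["code"]] += 1
        bucket = int(ev["ts"] // bucket_seconds) * bucket_seconds
        series[(ev["code"], bucket)] += 1
    return totals, series
```

The same fold extends to referrer and geography dimensions by widening the series key, which is exactly the shape a columnar store queries efficiently for dashboards.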
Deepen your understanding of the patterns used in this problem: distributed ID generation, tiered caching, asynchronous event pipelines, and key-value data modeling recur across many system design questions.