Design an Anti-Phishing System
Problem Statement
Web browsers protect billions of users from phishing attacks by checking every URL they visit against a database of known malicious sites. When a user navigates to a flagged URL, the browser displays a warning page before the site can load. This protection must work seamlessly — users should experience no perceptible delay on safe URLs, and the system must function even when the device is offline or on a poor connection.
The scale of the problem is enormous. Hundreds of millions of browser instances need access to a blocklist that grows by thousands of entries daily. Distributing this list naively would create thundering herd problems during updates, consume excessive bandwidth on mobile networks, and expose user browsing history to the backend if every URL check requires a server round-trip.
You need to design the backend infrastructure that curates the phishing blocklist, distributes it efficiently to every browser instance worldwide, and provides a real-time lookup fallback for URLs not covered by the local dataset — all while preserving user privacy and minimizing bandwidth consumption.
Key Requirements
Functional
- Blocklist Curation -- Ingest phishing reports from automated crawlers, user submissions, and partner feeds; classify and validate URLs before adding them to the canonical blocklist.
- Client-Side Local Database -- Distribute a compact, indexed representation of the blocklist to every browser instance so that most URL checks happen locally without any network request.
- Incremental Updates -- Push daily (or more frequent) delta updates to clients so they do not need to re-download the full blocklist each time.
- Real-Time Lookup Fallback -- Provide a server-side API for checking URLs that are not conclusively resolved by the local database, using privacy-preserving techniques like hash-prefix matching.
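The fallback lookup in the last requirement can be sketched end to end. This is a minimal illustration of hash-prefix matching, not a production protocol: the client sends only the first 4 bytes of the SHA-256 hash, the server returns every full hash sharing that prefix, and the final comparison happens on the client, so the server never learns the URL. The URLs and the in-process "server" are hypothetical stand-ins.

```python
import hashlib

PREFIX_LEN = 4  # bytes of the SHA-256 hash revealed to the server


def url_hash(url: str) -> bytes:
    # In a real system the URL would be canonicalized first; omitted here.
    return hashlib.sha256(url.encode()).digest()


# --- Server side (sketch): index full hashes by their 4-byte prefix ---
BLOCKLIST = {url_hash(u) for u in ["http://evil.example/login",
                                   "http://phish.test/"]}
PREFIX_INDEX = {}
for h in BLOCKLIST:
    PREFIX_INDEX.setdefault(h[:PREFIX_LEN], []).append(h)


def server_lookup(prefix: bytes) -> list:
    # The server sees only a short prefix, never the URL or full hash.
    return PREFIX_INDEX.get(prefix, [])


# --- Client side: send the prefix, compare full hashes locally ---
def is_phishing(url: str) -> bool:
    h = url_hash(url)
    return h in server_lookup(h[:PREFIX_LEN])
```

Because a 4-byte prefix covers ~4 billion values, a prefix match on a safe URL is rare; the occasional collision costs one extra round-trip, not a false warning.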
Non-Functional
- Scalability -- Serve incremental updates and real-time lookups to hundreds of millions of browser instances across all platforms.
- Latency -- Local checks complete in under 1 millisecond; server-side fallback lookups complete in under 50 milliseconds.
- Privacy -- The backend must not learn which URLs a user is visiting; the lookup protocol must not reveal the full URL to the server.
- Bandwidth Efficiency -- Daily updates consume minimal bandwidth (kilobytes, not megabytes) to remain practical on metered mobile connections.
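The "kilobytes, not megabytes" target can be sanity-checked with a back-of-envelope calculation. Assuming the blocklist grows by roughly 5,000 entries per day (the growth rate is an assumption consistent with the "thousands of entries daily" in the problem statement) and each entry ships as a 4-byte hash prefix:

```python
# Back-of-envelope: daily delta size before compression (assumed numbers).
new_entries_per_day = 5_000   # assumed, per "thousands of entries daily"
prefix_bytes = 4              # one 4-byte hash prefix per entry

daily_delta_bytes = new_entries_per_day * prefix_bytes
print(daily_delta_bytes)      # 20,000 bytes ~= 20 KB, well within budget
```

Even before compression the daily delta is on the order of tens of kilobytes, which is why hash prefixes (rather than full URLs or full 32-byte hashes) make the bandwidth requirement achievable.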
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Client-Side Data Structure Design
Fitting millions of blocklist entries into a compact, queryable structure on the client is a core challenge. Interviewers want to see trade-off analysis between different representations.
Hints to consider:
- Think about using hash prefixes (e.g., the first 4 bytes of a SHA-256 hash of the canonicalized URL) stored in a sorted array or Bloom filter for fast local lookups.
- Consider the false-positive rate implications — a shorter prefix saves space but increases the number of URLs that require a server-side verification round-trip.
- Evaluate how you handle URL canonicalization (lowercasing the scheme and host, percent-decoding, stripping fragments and tracking parameters) consistently between client and server — any mismatch means the client hashes a different string than the server did and the lookup silently fails.
- Think about how you version the local database so the client knows which delta updates it needs to apply.
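The first two hints above can be combined into a small sketch: canonicalize the URL, hash it, and binary-search a sorted array of fixed-width prefixes. This is an illustrative simplification (real canonicalizers handle many more cases, such as repeated percent-decoding and path normalization), and the class and function names are my own, not a standard API.

```python
import bisect
import hashlib
from urllib.parse import urlsplit, urlunsplit

PREFIX_LEN = 4  # bytes per stored prefix


def canonicalize(url: str) -> str:
    # Simplified: lowercase scheme and host, drop the fragment and port.
    # Production canonicalizers also percent-decode and normalize paths.
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def prefix_of(url: str) -> bytes:
    return hashlib.sha256(canonicalize(url).encode()).digest()[:PREFIX_LEN]


class LocalBlocklist:
    """Sorted array of fixed-width hash prefixes; O(log n) lookup."""

    def __init__(self, urls):
        self.prefixes = sorted(prefix_of(u) for u in urls)

    def maybe_blocked(self, url: str) -> bool:
        # True means "possibly phishing" -- confirm with the server.
        # False is definitive: the URL is not on the blocklist.
        p = prefix_of(url)
        i = bisect.bisect_left(self.prefixes, p)
        return i < len(self.prefixes) and self.prefixes[i] == p
```

With 4-byte prefixes, a million entries fit in ~4 MB as a raw sorted array; a Bloom filter can shrink this further at the cost of a tunable false-positive rate, which is exactly the space-vs-round-trip trade-off the second hint asks about.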
2. Efficient Blocklist Distribution
Pushing updates to hundreds of millions of clients simultaneously is a distribution challenge. Interviewers probe how you avoid thundering herd effects.
Hints to consider:
- Consider serving update bundles as static files from a CDN, with clients polling at randomized intervals within a configurable window.
- Think about how you structure delta updates — an append-only log of additions and removals since a specific version, or a binary diff of the full dataset.
- Evaluate the trade-off between update frequency (fresher data, more bandwidth) and staleness (less bandwidth, longer exposure window for new phishing sites).
- Consider how you handle clients that are many versions behind — at what point do you force a full re-download rather than applying hundreds of deltas?
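The delta-update hints above can be sketched as a simple client-side routine: poll at a jittered interval, apply each missed version's additions and removals in order, and fall back to a full re-download when too many versions have accumulated. The threshold, delta format, and function names are assumptions for illustration, not a prescribed design.

```python
import random

FULL_RESYNC_THRESHOLD = 30  # assumed: beyond this, deltas cost more than a full download


def poll_delay(base_seconds: float = 3600, jitter: float = 0.5) -> float:
    # Randomized polling spreads clients across a window so a new
    # version's publication doesn't trigger a thundering herd.
    return base_seconds * (1 + random.uniform(-jitter, jitter))


def apply_deltas(local_version: int, prefixes: set,
                 server_version: int, deltas: dict):
    """deltas maps version -> {"add": [...], "remove": [...]}.

    Returns the updated prefix set, or None to signal that the client
    should fetch the full database instead of replaying deltas.
    """
    if server_version - local_version > FULL_RESYNC_THRESHOLD:
        return None  # too far behind: force a full re-download
    for v in range(local_version + 1, server_version + 1):
        d = deltas[v]
        prefixes -= set(d.get("remove", []))
        prefixes |= set(d.get("add", []))
    return prefixes
```

Serving each version's delta as an immutable static file from a CDN keeps the origin out of the hot path entirely; clients only need to know the latest version number to work out which files to fetch.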