Practice/Apple/Design a Text-Based Search System
Design a Text-Based Search System
System DesignOptional
Problem Statement
Design a web search engine that crawls pages, builds a searchable index, and returns relevant results when users type queries — like a simplified Google. Users enter keywords and get a ranked list of pages with titles, URLs, and snippets, with results appearing in under a second.
This problem tests end-to-end systems thinking: building a multi-stage pipeline (crawl → parse → index → serve), handling billions of documents, ranking results by relevance, and keeping content fresh. You need to reason about crawl politeness, inverted index design, sharding, and how to keep queries fast under heavy load.
Key Requirements
Functional
- Keyword search -- users enter queries and receive a ranked list of relevant web pages
- Result details -- each result shows title, URL, and a snippet highlighting matching terms
- Content freshness -- recently updated pages appear in results within hours
- Filtering -- users can filter by domain or time range
Non-Functional
- Scalability -- index billions of pages and handle thousands of queries per second
- Latency -- return search results in under 500ms at P95
- Availability -- the search serving layer must be highly available; crawl pipeline can tolerate brief interruptions
- Relevance -- results should be ranked by a combination of text match quality and page authority
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Crawl Pipeline Design
The crawler must systematically discover and fetch web pages at scale without overwhelming target sites or wasting resources on duplicate content.
Hints to consider:
- Use a URL frontier (priority queue) that schedules which URLs to crawl next, respecting per-host rate limits and robots.txt
- Deduplicate URLs before crawling (canonicalize URLs) and deduplicate content after fetching (fingerprint/simhash)
- Prioritize crawling: popular/authoritative sites first, stale content before new content for freshness
- Discuss politeness: limit concurrent requests per domain, use exponential backoff for errors, and respect crawl-delay directives
2. Inverted Index Architecture
The inverted index maps each word to the list of documents containing it. This is the data structure that makes keyword search fast.
Hints to consider:
- For each term, store a posting list: sorted list of (document_id, term_frequency, positions)
- Shard the index by document ID so each shard holds a complete index of its document subset (document-partitioned)
- Alternatively, shard by term (term-partitioned) for certain workloads — discuss trade-offs
- Compress posting lists (variable-byte encoding, delta encoding) to reduce storage and improve cache hit rates
3. Query Processing and Ranking
When a user searches "best pizza NYC," the system must find documents matching all terms, score them, and return the top results quickly.
Hints to consider:
- Use TF-IDF or BM25 as the baseline text relevance score
- Combine text relevance with page authority signals (like PageRank) for the final ranking
- For multi-word queries, intersect posting lists efficiently using skip pointers
- Use a scatter-gather pattern: send the query to all index shards in parallel, merge top-K results from each shard
4. Keeping the Index Fresh
The web changes constantly. Users expect recent content to be searchable without waiting days.
Hints to consider:
- Separate the index into segments: a large base index (refreshed less often) and a small incremental index (updated frequently)
- Merge the incremental index into the base periodically
- Use a change detection system: re-crawl pages at intervals proportional to their update frequency
- Discuss near-real-time indexing: newly crawled pages are added to the incremental index within minutes