Practice/Apple/Design a Text-Based Search System

Design a Text-Based Search System

System DesignOptional

Problem Statement

Design a web search engine that crawls pages, builds a searchable index, and returns relevant results when users type queries — like a simplified Google. Users enter keywords and get a ranked list of pages with titles, URLs, and snippets, with results appearing in under a second.

This problem tests end-to-end systems thinking: building a multi-stage pipeline (crawl → parse → index → serve), handling billions of documents, ranking results by relevance, and keeping content fresh. You need to reason about crawl politeness, inverted index design, sharding, and how to keep queries fast under heavy load.

Key Requirements

Functional

Keyword search -- users enter queries and receive a ranked list of relevant web pages
Result details -- each result shows title, URL, and a snippet highlighting matching terms
Content freshness -- recently updated pages appear in results within hours
Filtering -- users can filter by domain or time range

Non-Functional

Scalability -- index billions of pages and handle thousands of queries per second
Latency -- return search results in under 500ms at P95
Availability -- the search serving layer must be highly available; crawl pipeline can tolerate brief interruptions
Relevance -- results should be ranked by a combination of text match quality and page authority

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Crawl Pipeline Design

The crawler must systematically discover and fetch web pages at scale without overwhelming target sites or wasting resources on duplicate content.

Hints to consider:

Use a URL frontier (priority queue) that schedules which URLs to crawl next, respecting per-host rate limits and robots.txt
Deduplicate URLs before crawling (canonicalize URLs) and deduplicate content after fetching (fingerprint/simhash)
Prioritize crawling: popular/authoritative sites first, stale content before new content for freshness
Discuss politeness: limit concurrent requests per domain, use exponential backoff for errors, and respect crawl-delay directives

2. Inverted Index Architecture

The inverted index maps each word to the list of documents containing it. This is the data structure that makes keyword search fast.

Hints to consider:

For each term, store a posting list: sorted list of (document_id, term_frequency, positions)
Shard the index by document ID so each shard holds a complete index of its document subset (document-partitioned)
Alternatively, shard by term (term-partitioned) for certain workloads — discuss trade-offs
Compress posting lists (variable-byte encoding, delta encoding) to reduce storage and improve cache hit rates

3. Query Processing and Ranking

When a user searches "best pizza NYC," the system must find documents matching all terms, score them, and return the top results quickly.

Hints to consider:

Use TF-IDF or BM25 as the baseline text relevance score
Combine text relevance with page authority signals (like PageRank) for the final ranking
For multi-word queries, intersect posting lists efficiently using skip pointers
Use a scatter-gather pattern: send the query to all index shards in parallel, merge top-K results from each shard

4. Keeping the Index Fresh

The web changes constantly. Users expect recent content to be searchable without waiting days.

Hints to consider:

Separate the index into segments: a large base index (refreshed less often) and a small incremental index (updated frequently)
Merge the incremental index into the base periodically
Use a change detection system: re-crawl pages at intervals proportional to their update frequency
Discuss near-real-time indexing: newly crawled pages are added to the incremental index within minutes

Practice/Apple/Design a Text-Based Search System

Design a Text-Based Search System

System DesignOptional

Problem Statement

Key Requirements

Functional

Keyword search -- users enter queries and receive a ranked list of relevant web pages
Result details -- each result shows title, URL, and a snippet highlighting matching terms
Content freshness -- recently updated pages appear in results within hours
Filtering -- users can filter by domain or time range

Non-Functional

Scalability -- index billions of pages and handle thousands of queries per second
Latency -- return search results in under 500ms at P95
Availability -- the search serving layer must be highly available; crawl pipeline can tolerate brief interruptions
Relevance -- results should be ranked by a combination of text match quality and page authority

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Crawl Pipeline Design

The crawler must systematically discover and fetch web pages at scale without overwhelming target sites or wasting resources on duplicate content.

Hints to consider:

Use a URL frontier (priority queue) that schedules which URLs to crawl next, respecting per-host rate limits and robots.txt
Deduplicate URLs before crawling (canonicalize URLs) and deduplicate content after fetching (fingerprint/simhash)
Prioritize crawling: popular/authoritative sites first, stale content before new content for freshness
Discuss politeness: limit concurrent requests per domain, use exponential backoff for errors, and respect crawl-delay directives

2. Inverted Index Architecture

The inverted index maps each word to the list of documents containing it. This is the data structure that makes keyword search fast.

Hints to consider:

For each term, store a posting list: sorted list of (document_id, term_frequency, positions)
Shard the index by document ID so each shard holds a complete index of its document subset (document-partitioned)
Alternatively, shard by term (term-partitioned) for certain workloads — discuss trade-offs
Compress posting lists (variable-byte encoding, delta encoding) to reduce storage and improve cache hit rates

3. Query Processing and Ranking

When a user searches "best pizza NYC," the system must find documents matching all terms, score them, and return the top results quickly.

Hints to consider:

Use TF-IDF or BM25 as the baseline text relevance score
Combine text relevance with page authority signals (like PageRank) for the final ranking
For multi-word queries, intersect posting lists efficiently using skip pointers
Use a scatter-gather pattern: send the query to all index shards in parallel, merge top-K results from each shard

4. Keeping the Index Fresh

The web changes constantly. Users expect recent content to be searchable without waiting days.

Hints to consider:

Separate the index into segments: a large base index (refreshed less often) and a small incremental index (updated frequently)
Merge the incremental index into the base periodically
Use a change detection system: re-crawl pages at intervals proportional to their update frequency
Discuss near-real-time indexing: newly crawled pages are added to the incremental index within minutes