For a full example answer with detailed architecture diagrams and deep dives, see our Design Yelp guide. While the Yelp guide focuses on business discovery, many of the same patterns around search indexing, location-scoped data, and read-heavy caching strategies apply directly to a product catalog system.
Also review the Search, Caching, and Databases building blocks for background on faceted search, multi-tier caching, and data modeling for hierarchical taxonomies.
Design a product catalog system that allows users to browse products by categories and subcategories, similar to Instacart's product discovery experience. Users pick a location or store, navigate a hierarchical category tree, filter and sort items, and view product details including price, packaging, and real-time availability.
The system is extremely read-heavy: category pages are viewed millions of times per day, but product details (prices, inventory) change frequently per store. The core challenges are designing a hierarchical taxonomy that works across diverse retailers, keeping product information synchronized in near real-time as prices and stock levels change throughout the day, and delivering sub-second browsing experiences despite the massive dataset. You must handle location-scoped data where every price and availability figure is specific to a particular store, and design caching and indexing strategies that stay fresh without full rebuilds on every update.
Based on real interview experiences, these are the areas interviewers probe most deeply:
The hierarchical category structure and how you represent products with store-specific attributes is fundamental. A poor data model leads to expensive queries and difficult maintenance.
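One way to sketch the data model this area describes is a materialized-path category table plus a separate product-store table, so ancestor queries become prefix matches and store-specific attributes stay out of the shared product record. A minimal sketch using SQLite as a stand-in for PostgreSQL; all table and column names are assumptions:

```python
import sqlite3

# Illustrative schema (names are assumptions): categories carry a
# materialized path so "all products under an ancestor" is a prefix match,
# and store-specific price/availability live in a separate table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    path TEXT NOT NULL           -- e.g. '/produce/fruit/berries/'
);
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    brand TEXT,
    category_id INTEGER REFERENCES categories(id)
);
CREATE TABLE product_store (
    product_id INTEGER REFERENCES products(id),
    store_id INTEGER NOT NULL,
    price_cents INTEGER NOT NULL,
    in_stock INTEGER NOT NULL,   -- 0 or 1
    PRIMARY KEY (product_id, store_id)
);
CREATE INDEX idx_categories_path ON categories(path);
""")

conn.execute("INSERT INTO categories VALUES (1, 'Produce', '/produce/')")
conn.execute("INSERT INTO categories VALUES (2, 'Berries', '/produce/fruit/berries/')")
conn.execute("INSERT INTO products VALUES (10, 'Strawberries 1lb', 'BrandX', 2)")
conn.execute("INSERT INTO product_store VALUES (10, 501, 499, 1)")

# All in-stock products anywhere under /produce/ for store 501:
rows = conn.execute("""
    SELECT p.name, ps.price_cents
    FROM products p
    JOIN categories c ON c.id = p.category_id
    JOIN product_store ps ON ps.product_id = p.id
    WHERE c.path LIKE '/produce/%' AND ps.store_id = 501 AND ps.in_stock = 1
""").fetchall()
print(rows)  # [('Strawberries 1lb', 499)]
```

The prefix match on `path` is what makes "everything under Produce" cheap without recursive queries; the trade-off is that moving a category requires rewriting descendant paths.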
Category pages are accessed millions of times but product details change frequently. Your caching layers determine whether the system can meet latency targets cost-effectively.
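The differing change rates suggest differing TTLs: the category tree can be cached for a long time, while store-specific prices need a short window. A minimal cache-aside sketch with an in-memory stand-in for Redis; class and function names are assumptions:

```python
import time

# In-memory stand-in for Redis with per-key TTLs (names are assumptions).
class TTLCache:
    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

cache = TTLCache()
CATEGORY_TTL = 3600   # category tree changes rarely
PRICE_TTL = 60        # prices tolerate roughly a minute of staleness

def get_price(store_id, product_id, load_from_db):
    key = f"price:{store_id}:{product_id}"
    price = cache.get(key)
    if price is None:                      # miss: hit the database, then populate
        price = load_from_db(store_id, product_id)
        cache.set(key, price, PRICE_TTL)
    return price

calls = []
def fake_db(store_id, product_id):
    calls.append((store_id, product_id))
    return 499

p1 = get_price(501, 10, fake_db)   # miss -> database
p2 = get_price(501, 10, fake_db)   # hit -> no database call
print(p1, p2, len(calls))          # 499 499 1
```

Short price TTLs bound staleness even if event-driven invalidation (covered later) misses an update, which is a common belt-and-suspenders choice.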
Retailers push updates through various mechanisms at different frequencies. Stale data causes abandoned carts and eroded customer trust.
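The usual answer to heterogeneous retailer feeds is per-retailer adapters that map every raw payload into one canonical event before it reaches Kafka. A hypothetical sketch; the retailer IDs, field names, and event shape are all assumptions:

```python
from dataclasses import dataclass

# Canonical event every adapter emits (shape is an assumption).
@dataclass(frozen=True)
class PriceEvent:
    retailer_id: str
    store_id: str
    sku: str
    price_cents: int
    in_stock: bool

def normalize(retailer_id, raw):
    # Per-retailer adapters; raw field names are hypothetical.
    if retailer_id == "retailer_a":   # sends dollars and a boolean flag
        return PriceEvent(retailer_id, raw["store"], raw["sku"],
                          round(raw["price_usd"] * 100), raw["available"])
    if retailer_id == "retailer_b":   # sends cents and an on-hand quantity
        return PriceEvent(retailer_id, raw["location_id"], raw["item_code"],
                          raw["cents"], raw["qty"] > 0)
    raise ValueError(f"no adapter for {retailer_id}")

e = normalize("retailer_a",
              {"store": "501", "sku": "STRAW-1LB",
               "price_usd": 4.99, "available": True})
print(e.price_cents, e.in_stock)  # 499 True
```

Keeping the canonical event small and immutable makes downstream consumers (search indexer, cache invalidator) independent of any single retailer's feed format.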
Users expect instant results when combining multiple filters over categories containing tens of thousands of products. Poor indexing makes this impossible.
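Combined filters map naturally onto a search engine's boolean filter clause, with facet counts computed in the same round trip. A sketch of the request body a category page might build, expressed as a plain dict in the Elasticsearch query-DSL style; every field name here is an assumption:

```python
import json

# Builds a hypothetical Elasticsearch-style request: all filters go in a
# bool/filter clause (cacheable, no scoring), and aggregations return
# facet counts alongside the hits.
def category_page_query(category_path, store_id, dietary_tags, max_price_cents):
    filters = [
        {"term": {"category_path": category_path}},
        {"term": {"store_id": store_id}},
        {"term": {"in_stock": True}},
        {"range": {"price_cents": {"lte": max_price_cents}}},
    ]
    filters += [{"term": {"dietary_tags": t}} for t in dietary_tags]
    return {
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "brands": {"terms": {"field": "brand"}},
            "price_tiers": {"terms": {"field": "price_tier"}},
        },
        "sort": [{"popularity": "desc"}, {"product_id": "asc"}],
        "size": 24,
    }

q = category_page_query("/produce/fruit/berries/", "501", ["organic"], 1000)
print(json.dumps(q, indent=2))
```

Putting every condition in `filter` rather than `must` lets the engine cache the individual clauses, which matters when the same store and category filters repeat across millions of requests.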
Ask about the number of retailers, products per store, and total store count. Confirm whether all stores from a retailer share the same catalog or each location has a unique assortment. Determine acceptable staleness for different data types -- can prices lag by minutes, or must they be real-time? Clarify whether personalization (recommendations, recently viewed) is in scope. Understand peak load patterns and whether traffic is concentrated in specific regions or timezones.
Sketch the major components: an Ingestion Layer that normalizes retailer feeds into a canonical product model via Kafka, a Product Service backed by PostgreSQL for master product data, an Inventory Service that tracks store-specific pricing and availability, a Search Service powered by Elasticsearch for category browsing and filtering, a multi-tier Cache Layer (Redis for application-level caching, a CDN for static assets), and API servers that orchestrate client requests. Show two data flows: the read path (a user browses a category, the Search Service returns product IDs, and the Inventory Service hydrates them with store-specific price and availability from cache), and the write path (a retailer sends an update, the Ingestion Layer processes it and writes to the database, and events propagate to Elasticsearch and Redis for cache invalidation).
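The read path above can be sketched as a small orchestration function: search returns ranked product IDs, then a hydration step attaches fresh store-specific data before responding. All services are faked in-memory here and the names are assumptions:

```python
# Read-path sketch: the Search Service returns product IDs in ranked order,
# and hydration attaches store-specific price/availability from the cache.
def browse_category(category_path, store_id, search, price_cache):
    product_ids = search(category_path, store_id)       # Search Service
    page = []
    for pid in product_ids:
        price, in_stock = price_cache[(store_id, pid)]  # Inventory via cache
        page.append({"product_id": pid,
                     "price_cents": price,
                     "in_stock": in_stock})
    return page

# In-memory stand-ins for the real services:
fake_search = lambda path, store: [10, 11]
fake_cache = {("501", 10): (499, True), ("501", 11): (299, False)}

page = browse_category("/produce/", "501", fake_search, fake_cache)
print(page)
```

Separating "which products, in what order" (search) from "what do they cost right now" (hydration) is what lets the ranked list be cached aggressively while prices stay fresh.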
Walk through the search index design. Each Elasticsearch document represents a product-store combination containing: product ID, name, brand, category path array (enabling ancestor queries), store-specific price and availability as nested fields, and pre-computed filter attributes (dietary tags, price tier bucket). Explain the partitioning strategy: shard by store region to keep hot data co-located. Show how a category page query translates to an Elasticsearch filter query with faceted aggregations, returning product IDs that are hydrated with fresh price data from the Redis cache. Discuss how cursor-based pagination works: each page response includes a sort key that the next request uses as a starting point, ensuring stable results even as prices and inventory change between page loads.
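The cursor mechanics can be shown in miniature over an in-memory list standing in for the index (Elasticsearch's `search_after` works analogously): each page returns the last row's sort key, and the next request resumes strictly past it, so a row shifting on an earlier page never duplicates or skips results. Field names are assumptions:

```python
# Cursor pagination sketch over rows pre-sorted by (popularity desc,
# product_id asc). The cursor is the last row's sort key; the next page
# takes only rows strictly after it in sort order.
def fetch_page(rows, page_size, after=None):
    if after is not None:
        rows = [r for r in rows
                if (-r["popularity"], r["product_id"]) > (-after[0], after[1])]
    page = rows[:page_size]
    cursor = (page[-1]["popularity"], page[-1]["product_id"]) if page else None
    return page, cursor

index = sorted(
    [{"product_id": i, "popularity": p}
     for i, p in [(10, 9), (11, 9), (12, 7), (13, 5)]],
    key=lambda r: (-r["popularity"], r["product_id"]),
)

page1, cur = fetch_page(index, 2)
page2, _ = fetch_page(index, 2, after=cur)
print([r["product_id"] for r in page1])  # [10, 11]
print([r["product_id"] for r in page2])  # [12, 13]
```

Note the tiebreaker on `product_id`: without a unique trailing sort key, rows with equal popularity could be returned in different orders across requests and the cursor would be ambiguous.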
Cover cache invalidation: when a price update arrives via Kafka, the consumer updates the Elasticsearch document and invalidates the specific product's Redis cache entry; category page caches are invalidated only if the price change affects the current sort order or filter results. Discuss monitoring: track cache hit rates per store, Elasticsearch query latencies by category depth, inventory staleness metrics, and ingestion pipeline lag. Address failure scenarios: if Elasticsearch is degraded, fall back to PostgreSQL queries with reduced filter support and show a staleness indicator to users. Mention consistency around checkout: re-validate price and availability at order submission time against the authoritative database, not the cache. Touch on how the system evolves to support features like personalized ranking and real-time deal notifications.
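The invalidation rule above can be sketched with in-memory stand-ins for the Kafka consumer, Redis, and Elasticsearch (all names and key formats are assumptions): a price event always refreshes the product document and evicts that product's cache entry, but the cached category page is evicted only when the change could alter what the page shows.

```python
# Invalidation sketch: dicts stand in for Elasticsearch documents and Redis.
def handle_price_event(event, es_docs, redis, sort_or_filter_affected):
    doc_id = (event["store_id"], event["product_id"])
    # Always refresh the search document and evict the product's price entry.
    es_docs[doc_id] = {**es_docs.get(doc_id, {}),
                       "price_cents": event["price_cents"]}
    redis.pop(f"price:{event['store_id']}:{event['product_id']}", None)
    # Evict the category-page cache only if ordering or filters could change.
    if sort_or_filter_affected(event):
        redis.pop(f"page:{event['store_id']}:{event['category_path']}", None)

es_docs = {}
redis = {"price:501:10": 450, "page:501:/produce/": ["cached page"]}
event = {"store_id": "501", "product_id": 10, "price_cents": 499,
         "category_path": "/produce/"}

handle_price_event(event, es_docs, redis, lambda e: False)
print("price:501:10" in redis)    # False (evicted)
print("page:501:/produce/" in redis)  # True (sort order unaffected)
```

The `sort_or_filter_affected` predicate is the judgment call worth discussing: a conservative version (always evict the page) is simpler but can thrash hot category caches during bulk price updates.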
"Design the Instacart product catalog page. You can get products by categories. Each category has subcategories."