Practice/Meta/Design Top K Songs Widget for Spotify

Design Top K Songs Widget for Spotify

Product DesignMust

Problem Statement

Design a widget that displays a user's most frequently played songs from streaming platforms like Spotify over configurable time windows (7 days, 30 days, or all-time). The widget should aggregate play history data, compute rankings, and provide an embeddable interface that updates in near real-time as users listen to music.

Your system needs to handle millions of active users who collectively generate billions of song play events daily. The widget should be responsive, show accurate rankings even under high load, and integrate seamlessly with third-party APIs while respecting rate limits. Consider how you'll aggregate historical data, compute rankings efficiently, cache results, and handle the eventual consistency of distributed streaming data.

Key Requirements

Functional

Play Event Ingestion -- capture song play events from streaming services with metadata like song ID, artist, timestamp, and user ID
Top K Ranking Computation -- calculate the K most played songs for each user over different time windows (7d, 30d, all-time)
Widget API -- provide RESTful endpoints that return ranked song lists with play counts and metadata
Time Window Filtering -- support dynamic time range queries without recomputing from scratch
Real-time Updates -- reflect new plays in rankings within reasonable delay bounds

Non-Functional

Scalability -- support 50M daily active users with 500M song plays per day
Reliability -- maintain 99.9% availability with graceful degradation during failures
Latency -- widget loads should complete in under 300ms for p99
Consistency -- eventual consistency acceptable with rankings converging within 5 minutes

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Data Ingestion and Stream Processing

Handling high-velocity play events requires careful design of your ingestion pipeline. Interviewers want to see how you'll process streaming data reliably while managing third-party API constraints.

Hints to consider:

Consider using a message queue (Kafka, Kinesis) to buffer play events and decouple ingestion from processing
Discuss webhook vs polling approaches for receiving data from Spotify, and how to handle rate limits
Think about batching strategies to reduce downstream processing load
Plan for duplicate detection since streaming platforms may send duplicate events

2. Aggregation and Ranking Algorithm

Computing top K songs efficiently across multiple time windows for millions of users is computationally intensive. The core challenge is balancing accuracy, freshness, and computational cost.

Hints to consider:

Explore time-bucketed aggregation (hourly/daily rollups) to avoid scanning all raw events
Consider probabilistic data structures like Count-Min Sketch for memory-efficient counting
Discuss tradeoffs between exact counts vs approximate algorithms (HyperLogLog, Top-K sketches)
Think about pre-computing rankings periodically vs computing on-demand
Consider how to efficiently update rankings incrementally as new plays arrive

3. Caching Strategy and Data Storage

The widget will be queried frequently for the same user's data. Smart caching dramatically reduces backend load and improves latency.

Hints to consider:

Use multi-layer caching (CDN for static assets, Redis for computed rankings, application cache)
Design cache keys that include user ID and time window parameters
Determine appropriate TTL values balancing freshness vs cache hit rate
Consider write-through vs write-behind caching for ranking updates
Think about cache warming strategies for popular users or anticipated traffic spikes

4. Time Window Queries and Historical Data

Supporting multiple time windows (7d, 30d, all-time) without redundant computation requires thoughtful data organization.

Hints to consider:

Use time-series databases (InfluxDB, TimescaleDB) optimized for temporal queries
Consider maintaining separate aggregation tables for each time window granularity
Think about data retention policies and when to archive or delete old play events
Discuss how to handle timezone differences for "last 7 days" calculations
Plan for efficient backfill when users first connect their streaming account

Suggested Approach

Step 1: Clarify Requirements

Confirm critical details before designing:

What is K? Is it fixed (top 10) or user-configurable?
Do we need to support multiple streaming services or just Spotify initially?
Should rankings update in real-time or is periodic refresh (every 5 min) acceptable?
Do we need to display additional metadata like album art, artist info, or play counts?
What's the expected read-to-write ratio? How many widget loads vs new play events?
Should we handle edge cases like users with very few plays or brand new accounts?

Practice/Meta/Design Top K Songs Widget for Spotify

Design Top K Songs Widget for Spotify

Product DesignMust

Problem Statement

Key Requirements

Functional

Play Event Ingestion -- capture song play events from streaming services with metadata like song ID, artist, timestamp, and user ID
Top K Ranking Computation -- calculate the K most played songs for each user over different time windows (7d, 30d, all-time)
Widget API -- provide RESTful endpoints that return ranked song lists with play counts and metadata
Time Window Filtering -- support dynamic time range queries without recomputing from scratch
Real-time Updates -- reflect new plays in rankings within reasonable delay bounds

Non-Functional

Scalability -- support 50M daily active users with 500M song plays per day
Reliability -- maintain 99.9% availability with graceful degradation during failures
Latency -- widget loads should complete in under 300ms for p99
Consistency -- eventual consistency acceptable with rankings converging within 5 minutes

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Data Ingestion and Stream Processing

Handling high-velocity play events requires careful design of your ingestion pipeline. Interviewers want to see how you'll process streaming data reliably while managing third-party API constraints.

Hints to consider:

Consider using a message queue (Kafka, Kinesis) to buffer play events and decouple ingestion from processing
Discuss webhook vs polling approaches for receiving data from Spotify, and how to handle rate limits
Think about batching strategies to reduce downstream processing load
Plan for duplicate detection since streaming platforms may send duplicate events

2. Aggregation and Ranking Algorithm

Computing top K songs efficiently across multiple time windows for millions of users is computationally intensive. The core challenge is balancing accuracy, freshness, and computational cost.

Hints to consider:

Explore time-bucketed aggregation (hourly/daily rollups) to avoid scanning all raw events
Consider probabilistic data structures like Count-Min Sketch for memory-efficient counting
Discuss tradeoffs between exact counts vs approximate algorithms (HyperLogLog, Top-K sketches)
Think about pre-computing rankings periodically vs computing on-demand
Consider how to efficiently update rankings incrementally as new plays arrive

3. Caching Strategy and Data Storage

The widget will be queried frequently for the same user's data. Smart caching dramatically reduces backend load and improves latency.

Hints to consider:

Use multi-layer caching (CDN for static assets, Redis for computed rankings, application cache)
Design cache keys that include user ID and time window parameters
Determine appropriate TTL values balancing freshness vs cache hit rate
Consider write-through vs write-behind caching for ranking updates
Think about cache warming strategies for popular users or anticipated traffic spikes

4. Time Window Queries and Historical Data

Supporting multiple time windows (7d, 30d, all-time) without redundant computation requires thoughtful data organization.

Hints to consider:

Use time-series databases (InfluxDB, TimescaleDB) optimized for temporal queries
Consider maintaining separate aggregation tables for each time window granularity
Think about data retention policies and when to archive or delete old play events
Discuss how to handle timezone differences for "last 7 days" calculations
Plan for efficient backfill when users first connect their streaming account

Suggested Approach

Step 1: Clarify Requirements

Confirm critical details before designing:

What is K? Is it fixed (top 10) or user-configurable?
Do we need to support multiple streaming services or just Spotify initially?
Should rankings update in real-time or is periodic refresh (every 5 min) acceptable?
Do we need to display additional metadata like album art, artist info, or play counts?
What's the expected read-to-write ratio? How many widget loads vs new play events?
Should we handle edge cases like users with very few plays or brand new accounts?