Practice/Meta/Design Top K Songs Widget for Spotify
Design Top K Songs Widget for Spotify
Product DesignMust
Problem Statement
Design a widget that displays a user's most frequently played songs from streaming platforms like Spotify over configurable time windows (7 days, 30 days, or all-time). The widget should aggregate play history data, compute rankings, and provide an embeddable interface that updates in near real-time as users listen to music.
Your system needs to handle millions of active users who collectively generate billions of song play events daily. The widget should be responsive, show accurate rankings even under high load, and integrate seamlessly with third-party APIs while respecting rate limits. Consider how you'll aggregate historical data, compute rankings efficiently, cache results, and handle the eventual consistency of distributed streaming data.
Key Requirements
Functional
- Play Event Ingestion -- capture song play events from streaming services with metadata like song ID, artist, timestamp, and user ID
- Top K Ranking Computation -- calculate the K most played songs for each user over different time windows (7d, 30d, all-time)
- Widget API -- provide RESTful endpoints that return ranked song lists with play counts and metadata
- Time Window Filtering -- support dynamic time range queries without recomputing from scratch
- Real-time Updates -- reflect new plays in rankings within reasonable delay bounds
Non-Functional
- Scalability -- support 50M daily active users with 500M song plays per day
- Reliability -- maintain 99.9% availability with graceful degradation during failures
- Latency -- widget loads should complete in under 300ms for p99
- Consistency -- eventual consistency acceptable with rankings converging within 5 minutes
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Data Ingestion and Stream Processing
Handling high-velocity play events requires careful design of your ingestion pipeline. Interviewers want to see how you'll process streaming data reliably while managing third-party API constraints.
Hints to consider:
- Consider using a message queue (Kafka, Kinesis) to buffer play events and decouple ingestion from processing
- Discuss webhook vs polling approaches for receiving data from Spotify, and how to handle rate limits
- Think about batching strategies to reduce downstream processing load
- Plan for duplicate detection since streaming platforms may send duplicate events
2. Aggregation and Ranking Algorithm
Computing top K songs efficiently across multiple time windows for millions of users is computationally intensive. The core challenge is balancing accuracy, freshness, and computational cost.
Hints to consider:
- Explore time-bucketed aggregation (hourly/daily rollups) to avoid scanning all raw events
- Consider probabilistic data structures like Count-Min Sketch for memory-efficient counting
- Discuss tradeoffs between exact counts vs approximate algorithms (HyperLogLog, Top-K sketches)
- Think about pre-computing rankings periodically vs computing on-demand
- Consider how to efficiently update rankings incrementally as new plays arrive
3. Caching Strategy and Data Storage
The widget will be queried frequently for the same user's data. Smart caching dramatically reduces backend load and improves latency.
Hints to consider:
- Use multi-layer caching (CDN for static assets, Redis for computed rankings, application cache)
- Design cache keys that include user ID and time window parameters
- Determine appropriate TTL values balancing freshness vs cache hit rate
- Consider write-through vs write-behind caching for ranking updates
- Think about cache warming strategies for popular users or anticipated traffic spikes
4. Time Window Queries and Historical Data
Supporting multiple time windows (7d, 30d, all-time) without redundant computation requires thoughtful data organization.
Hints to consider:
- Use time-series databases (InfluxDB, TimescaleDB) optimized for temporal queries
- Consider maintaining separate aggregation tables for each time window granularity
- Think about data retention policies and when to archive or delete old play events
- Discuss how to handle timezone differences for "last 7 days" calculations
- Plan for efficient backfill when users first connect their streaming account
Suggested Approach
Step 1: Clarify Requirements
Confirm critical details before designing:
- What is K? Is it fixed (top 10) or user-configurable?
- Do we need to support multiple streaming services or just Spotify initially?
- Should rankings update in real-time or is periodic refresh (every 5 min) acceptable?
- Do we need to display additional metadata like album art, artist info, or play counts?
- What's the expected read-to-write ratio? How many widget loads vs new play events?
- Should we handle edge cases like users with very few plays or brand new accounts?