Spotify's "Duplicate Song Detection in Music Catalog" interview question focuses on identifying near-identical tracks in a large music library using audio processing and data engineering techniques.
Candidates must design a system or write SQL queries to detect duplicate songs in Spotify's catalog. Two songs count as duplicates if their audio fingerprints are identical or highly similar (e.g., Chromaprint fingerprints or MFCC-based features), even when metadata such as title, artist name, or file version (remasters, live versions) differs. The goal is deduplication within data engineering pipelines at petabyte scale, using SQL for metadata matching and audio processing for content similarity. The question's tags emphasize scalable SQL joins, sharding, and integration with audio signal processing libraries.[1][3]
Assume a songs table (common in Spotify SQL problems) with at least the columns the query uses: song_id, audio_fingerprint, and duration_ms.
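
    -- hamming_distance is assumed to be a user-defined function;
    -- it is not part of standard SQL.
    SELECT s1.song_id, s2.song_id
    FROM songs s1
    JOIN songs s2
      ON s1.song_id < s2.song_id  -- Avoid self-matches and mirrored pairs
    WHERE hamming_distance(s1.audio_fingerprint, s2.audio_fingerprint) <= 5
      AND ABS(s1.duration_ms - s2.duration_ms) <= 2000;  -- Duration filter (within 2 seconds)
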
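Because `hamming_distance` is not a built-in SQL function, it can help to see what the query computes outside the database. The following is a minimal Python sketch, assuming each fingerprint is a fixed-width bit string stored as an integer; the `Song` type and `find_duplicate_pairs` helper are hypothetical names introduced here for illustration, and the thresholds simply mirror the query.

```python
from itertools import combinations
from typing import NamedTuple


class Song(NamedTuple):
    song_id: int
    audio_fingerprint: int  # assumption: fixed-width bit fingerprint stored as an int
    duration_ms: int


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_duplicate_pairs(songs, max_bits=5, max_duration_diff_ms=2000):
    """Brute-force equivalent of the self-join: compare every pair once."""
    ordered = sorted(songs, key=lambda s: s.song_id)
    pairs = []
    # combinations() over the sorted list yields each pair once with
    # s1.song_id < s2.song_id, like the join condition in the query.
    for s1, s2 in combinations(ordered, 2):
        close_audio = hamming_distance(s1.audio_fingerprint, s2.audio_fingerprint) <= max_bits
        close_length = abs(s1.duration_ms - s2.duration_ms) <= max_duration_diff_ms
        if close_audio and close_length:
            pairs.append((s1.song_id, s2.song_id))
    return pairs
```

Like the SQL self-join, this compares every pair of songs, so its cost grows quadratically with catalog size. It is useful for checking logic on small samples, not for production volumes.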
The pairwise join compares every pair of songs, so at catalog scale it is typically replaced by approximate nearest-neighbor search (e.g., an Annoy index) or LSH partitioning, which restrict comparisons to likely candidates.[3]

Example output:[4]

| Duplicate Group ID | Song IDs | Similarity Score | Action Recommended |
|--------------------|----------|------------------|--------------------|
| grp_001 | | 0.98 | Merge to 123 |
| grp_002 | | 1.00 | Unique |
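One way to picture LSH partitioning and the grouping step is the sketch below. It assumes 32-bit integer fingerprints (an assumption, not something the question specifies), splits each fingerprint into 8 bands of 4 bits, and only compares songs that share at least one band. Since at most 5 bits may differ, at least 3 of the 8 bands must match exactly, so no pair within the threshold is missed. Confirmed pairs are then merged into duplicate groups with a small union-find. All function names here are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

FINGERPRINT_BITS = 32  # assumption: fingerprint width
BANDS = 8              # 8 bands of 4 bits each


def band_keys(fp: int):
    """Split a fingerprint into (band_index, band_value) bucket keys."""
    width = FINGERPRINT_BITS // BANDS
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(BANDS)]


def candidate_pairs(fingerprints: dict):
    """Return pairs of song IDs that share at least one band bucket."""
    buckets = defaultdict(list)
    for song_id, fp in fingerprints.items():
        for key in band_keys(fp):
            buckets[key].append(song_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def duplicate_groups(fingerprints: dict, max_bits: int = 5):
    """Verify candidates by Hamming distance, then merge them into groups."""
    parent = {sid: sid for sid in fingerprints}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in candidate_pairs(fingerprints):
        if bin(fingerprints[a] ^ fingerprints[b]).count("1") <= max_bits:
            parent[find(b)] = find(a)

    groups = defaultdict(list)
    for sid in fingerprints:
        groups[find(sid)].append(sid)
    # Keep only groups that actually contain duplicates.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

With 4-bit bands, buckets become very large on a real catalog, so a production system would use longer fingerprints and wider bands, or an approximate nearest-neighbor index instead. The band key is also a natural partition key for sharding the comparison work.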