Google doesn't publicly release an internal interview question titled "Design Large-Scale Video Deduplication," but the prompt aligns closely with common ML system design problems covered in prep resources and research papers for system design, ML infrastructure, and content moderation roles. No full problem statement with input/output examples or explicit constraints appears verbatim in public sources such as Glassdoor, LeetCode forums, or GitHub interview repos.[1][2]
This question typically asks you to design a scalable system to detect and deduplicate near-identical videos (e.g., reuploads, edited copies) across a platform like YouTube, handling billions of videos daily while minimizing storage costs and enabling fast moderation.[5][9]
Key elements include an input/output contract for the dedup decision and a pipeline of fingerprinting, indexing, and matching stages.
The input/output formats below are reconstructed from public ML system design mocks and papers; no canonical examples exist publicly.[2][4]
Input: a newly uploaded video (or its ID plus extracted frames and metadata) to be checked against the existing corpus.
Output:

```json
{
  "query_video_id": "vid_123",
  "matches": [
    {
      "matched_video_id": "vid_456",
      "similarity_score": 0.95,
      "match_type": "near_duplicate",
      "transformation_detected": ["cropped", "resized"],
      "confidence": 0.98
    }
  ],
  "action": "deduplicate"  // or "store_new"
}
```
The pipeline below is drawn from research on large-scale video dedup systems such as Maze and ViDeDup.[5][9]
Step 1: Video Fingerprinting
Extract a compact, transformation-robust signature from each video, e.g. perceptual hashes of sampled frames or learned embeddings, so that reuploads and edited copies map to nearby points in fingerprint space.
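One common fingerprinting choice is a perceptual hash over sampled frames. A minimal pure-Python sketch, assuming frames are already decoded and downsampled to 8x8 grayscale (`average_hash` and `hamming` are illustrative names, not from any specific system):

```python
def average_hash(frame):
    """64-bit average hash of an 8x8 grayscale frame (nested lists, 0-255).

    Each bit records whether a pixel is at or above the frame's mean
    intensity, so mild brightness/compression changes barely move the hash.
    """
    flat = [p for row in frame for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Production systems typically use DCT-based pHashes or learned embeddings that survive cropping and re-encoding; the average hash only illustrates the idea of a compact, comparable signature.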
Step 2: Indexing & Search
Store fingerprints in a structure that supports sub-linear lookup at corpus scale, typically locality-sensitive hashing or an approximate nearest-neighbor index, so each new upload retrieves a small candidate set instead of scanning billions of entries.
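A toy banding-based LSH sketch over 64-bit fingerprints: if two fingerprints agree on any one band, they land in the same bucket and become candidates. The `LSHIndex` class and the 4x16-bit band split are assumptions for illustration:

```python
from collections import defaultdict

BANDS = 4  # assumed: split a 64-bit fingerprint into 4 bands of 16 bits

def bands(fp):
    """Yield (band_index, band_value) keys for a 64-bit fingerprint."""
    return [(i, (fp >> (16 * i)) & 0xFFFF) for i in range(BANDS)]

class LSHIndex:
    """Buckets video IDs by fingerprint band; similar hashes share buckets."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, video_id, fp):
        for key in bands(fp):
            self.buckets[key].add(video_id)

    def candidates(self, fp):
        """Union of all videos sharing at least one band with the query."""
        out = set()
        for key in bands(fp):
            out |= self.buckets[key]
        return out
```

At real scale this would be a sharded ANN service (e.g. a FAISS- or ScaNN-style index) rather than an in-memory dict, but the retrieval contract is the same: fingerprint in, small candidate set out.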
Step 3: Matching & Dedup
Score each candidate against the query, apply a similarity threshold, and emit a decision: deduplicate against an existing canonical copy or store the video as new.
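The final stage can produce a decision shaped like the output example above. A hedged sketch in which `match_decision`, the 64-bit Hamming similarity, and the 0.9 threshold are illustrative choices, not a documented Google design:

```python
def match_decision(query_id, candidate_dists, threshold=0.9):
    """Turn candidate Hamming distances into a dedup decision.

    candidate_dists: {video_id: Hamming distance over 64-bit fingerprints}.
    Similarity is 1 - distance/64; candidates at or above the threshold
    are reported as matches, and any match triggers deduplication.
    """
    matches = []
    for vid, dist in candidate_dists.items():
        score = 1.0 - dist / 64.0
        if score >= threshold:
            matches.append({"matched_video_id": vid,
                            "similarity_score": round(score, 2)})
    return {"query_video_id": query_id,
            "matches": matches,
            "action": "deduplicate" if matches else "store_new"}
```

A real system would add a verification pass on raw frames (to estimate `match_type` and detected transformations) before acting, since LSH candidates can include false positives.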