Google doesn't publicly release an internal interview question titled "Design Large-Scale Video Deduplication," but the prompt aligns closely with common ML system design problems covered in prep resources and research papers for system design, ML infrastructure, and content moderation roles. No full problem statement with input/output examples or explicit constraints appears verbatim in public sources such as Glassdoor, LeetCode forums, or GitHub interview repos.[1][2]
This question typically asks you to design a scalable system to detect and deduplicate near-identical videos (e.g., reuploads, edited copies) across a platform like YouTube, handling billions of videos daily while minimizing storage costs and enabling fast moderation.[5][9]
Key elements include an input/output contract for the dedup decision and a pipeline of fingerprinting, indexing, and matching stages.
The input/output formats below are reconstructed from public ML system design mocks and papers; no canonical examples exist publicly.[2][4]
Input: a newly uploaded video (or its ID plus extracted frames and metadata) to be checked against the existing corpus.
Output:

```json
{
  "query_video_id": "vid_123",
  "matches": [
    {
      "matched_video_id": "vid_456",
      "similarity_score": 0.95,
      "match_type": "near_duplicate",
      "transformation_detected": ["cropped", "resized"],
      "confidence": 0.98
    }
  ],
  "action": "deduplicate"  // or "store_new"
}
```
The pipeline below is drawn from research on large-scale video dedup systems such as Maze and ViDeDup.[5][9]
Step 1: Video Fingerprinting
Extract a compact, transformation-robust signature from each video, e.g. perceptual hashes of sampled frames or learned embeddings, so that reuploads and edited copies map to nearby points in fingerprint space.
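One common fingerprinting choice is a perceptual hash over sampled frames. A minimal pure-Python sketch, assuming frames are already decoded and downsampled to 8x8 grayscale (`average_hash` and `hamming` are illustrative names, not from any specific system):

```python
def average_hash(frame):
    """64-bit average hash of an 8x8 grayscale frame (nested lists, 0-255).

    Each bit records whether a pixel is at or above the frame's mean
    intensity, so mild brightness/compression changes barely move the hash.
    """
    flat = [p for row in frame for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Production systems typically use DCT-based pHashes or learned embeddings that survive cropping and re-encoding; the average hash only illustrates the idea of a compact, comparable signature.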
Step 2: Indexing & Search
Store fingerprints in a structure that supports sub-linear lookup at corpus scale, typically locality-sensitive hashing or an approximate nearest-neighbor index, so each new upload retrieves a small candidate set instead of scanning billions of entries.
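A toy banding-based LSH sketch over 64-bit fingerprints: if two fingerprints agree on any one band, they land in the same bucket and become candidates. The `LSHIndex` class and the 4x16-bit band split are assumptions for illustration:

```python
from collections import defaultdict

BANDS = 4  # assumed: split a 64-bit fingerprint into 4 bands of 16 bits

def bands(fp):
    """Yield (band_index, band_value) keys for a 64-bit fingerprint."""
    return [(i, (fp >> (16 * i)) & 0xFFFF) for i in range(BANDS)]

class LSHIndex:
    """Buckets video IDs by fingerprint band; similar hashes share buckets."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, video_id, fp):
        for key in bands(fp):
            self.buckets[key].add(video_id)

    def candidates(self, fp):
        """Union of all videos sharing at least one band with the query."""
        out = set()
        for key in bands(fp):
            out |= self.buckets[key]
        return out
```

At real scale this would be a sharded ANN service (e.g. a FAISS- or ScaNN-style index) rather than an in-memory dict, but the retrieval contract is the same: fingerprint in, small candidate set out.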
Step 3: Matching & Dedup
Score each candidate against the query, apply a similarity threshold, and emit a decision: deduplicate against an existing canonical copy or store the video as new.
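The final stage can produce a decision shaped like the output example above. A hedged sketch in which `match_decision`, the 64-bit Hamming similarity, and the 0.9 threshold are illustrative choices, not a documented Google design:

```python
def match_decision(query_id, candidate_dists, threshold=0.9):
    """Turn candidate Hamming distances into a dedup decision.

    candidate_dists: {video_id: Hamming distance over 64-bit fingerprints}.
    Similarity is 1 - distance/64; candidates at or above the threshold
    are reported as matches, and any match triggers deduplication.
    """
    matches = []
    for vid, dist in candidate_dists.items():
        score = 1.0 - dist / 64.0
        if score >= threshold:
            matches.append({"matched_video_id": vid,
                            "similarity_score": round(score, 2)})
    return {"query_video_id": query_id,
            "matches": matches,
            "action": "deduplicate" if matches else "store_new"}
```

A real system would add a verification pass on raw frames (to estimate `match_type` and detected transformations) before acting, since LSH candidates can include false positives.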