File Deduplication Overview
The "File Deduplication" interview question associated with Anthropic, tagged with Hash, File System, and Optimization, typically asks the candidate to design or implement a system that identifies and handles duplicate files efficiently in a large-scale file system. It emphasizes scalable hashing approaches that avoid loading entire files into memory, along with optimizations such as chunking, collision handling, and parallelization.[1][5]
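A minimal sketch of the "hash without loading the whole file" idea: stream each file through a cryptographic hash in fixed-size chunks. The function name `file_digest` and the 1 MiB chunk size are illustrative choices, not taken from the sources.

```python
import hashlib

def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read one chunk at a time; memory use stays bounded by chunk_size.
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

Two files are duplicate candidates when their digests match; a byte-by-byte comparison can confirm a match if hash collisions must be ruled out.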
Two closely related variants appear in sources linked to Anthropic interviews: a system-design discussion and a coding problem. The coding problem, which matches the tags closely, is:
Given a list of strings paths, where each string has the form "root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ...", return groups of duplicate file paths (full paths like "root/a/1.txt"). Only include groups containing 2+ files.[3]
From LeetCode 609, which aligns with the problem's core (hash-based duplicate detection):
Input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root 4.txt(efgh)"][3]
Output:
[["root/a/1.txt","root/c/3.txt"], ["root/a/2.txt","root/4.txt"]]
Explanation: 1.txt and 3.txt share content "abcd"; 2.txt and 4.txt share "efgh".
Extended Example:
Input: ["root/a 1.txt(hello) 2.txt(world)", "root/b 3.txt(hello)", "root/c/d 4.txt(hello) 5.txt(foo)"]
Output: [["root/a/1.txt", "root/b/3.txt", "root/c/d/4.txt"]] (only "hello" group has duplicates).[3]
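A straightforward Python solution to the LeetCode 609 variant above: group full paths in a hash map keyed by file content. This is a common sketch of the hash-based approach, not an official solution.

```python
from collections import defaultdict
from typing import List

def find_duplicate(paths: List[str]) -> List[List[str]]:
    """Group full file paths by content; return only groups with 2+ files."""
    groups = defaultdict(list)  # content -> list of full paths
    for entry in paths:
        parts = entry.split(" ")
        root = parts[0]
        for file_spec in parts[1:]:
            # Split "name.txt(content)" into the name and its content.
            name, _, rest = file_spec.partition("(")
            content = rest.rstrip(")")
            groups[content].append(f"{root}/{name}")
    return [g for g in groups.values() if len(g) > 1]
```

Each file is visited once and each path string is built in time proportional to its length, giving the expected linear average complexity.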
No full I/O examples were found specifically for the Anthropic system-design variants, since those focus on design discussion rather than pure coding.
Constraints: 1 <= paths.length <= 2^16; file content up to 100 characters (inferred from the standard problem).[3]
Time/space: expected O(n) average via hashing, where n is the total number of files; for memory-limited settings, discuss disk-based structures.[5][3][1]
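For the on-disk, memory-conscious variant, a common optimization is to bucket files by size before hashing, since files of different sizes cannot be duplicates and size checks read no file bytes. The sketch below assumes this size-first pipeline; the `dedupe` helper and the SHA-256 choice are illustrative assumptions, not from the sources.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List, Tuple

def dedupe(paths: List[str]) -> List[List[str]]:
    """Group duplicate files: bucket by size first, then hash only candidates."""
    by_size: Dict[int, List[str]] = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for size, candidates in by_size.items():
        if len(candidates) < 2:
            continue  # a unique size cannot have duplicates; skip hashing
        for p in candidates:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                # Stream in 1 MiB chunks to keep memory bounded.
                while chunk := f.read(1 << 20):
                    h.update(chunk)
            groups[(size, h.hexdigest())].append(p)
    return [g for g in groups.values() if len(g) > 1]
```

The same structure extends naturally to the discussion points in the question: each size bucket can be hashed in parallel, and the size map can be spilled to disk when the file count exceeds memory.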