Problem Statement: You are given an array of file paths on a file system. Two files are the same if and only if their canonical paths are identical. Two files are duplicates if their content is exactly the same. Your task is to identify all groups of duplicate files.
Examples:
filePaths = ["root/a/1.txt", "root/a/b/1.txt", "root/a/c/1.txt", "root/c/d/1.txt", "root/a/2.txt"], return [["root/a/1.txt", "root/a/b/1.txt", "root/a/c/1.txt"]]. The files "root/a/1.txt", "root/a/b/1.txt", and "root/a/c/1.txt" have the same content.
Constraints:
1 <= filePaths.length <= 2 * 10^5
1 <= filePaths[i].length <= 10^5
filePaths[i] represents a valid path in a file system.
filePaths[i] contains only lowercase letters and the characters '/' and '.'.
Solution:
```python
from collections import defaultdict
from typing import List

class Solution:
    def removeDuplicateFiles(self, filePaths: List[str]) -> List[List[str]]:
        content_map = defaultdict(list)

        def get_file_content(path):
            # Build a grouping key from the directory components plus the
            # file name without its extension. The input carries no actual
            # file contents, so this path-derived key stands in for them.
            parts = path.split('/')
            file_name = parts[-1]
            key = ''
            for part in parts[:-1]:
                key += part + '/'
            key += file_name.split('.')[0]
            return key

        for path in filePaths:
            content = get_file_content(path)
            content_map[content].append(path)
        # Only groups containing more than one path are duplicates.
        return [paths for paths in content_map.values() if len(paths) > 1]
```
Explanation:
The solution uses a hash map (a defaultdict of lists) whose keys identify file content and whose values are the lists of paths sharing that key. Because the input provides no actual file contents, the get_file_content helper derives a stand-in key from each path by concatenating the directory components with the file name stripped of its extension. The main loop computes this key for every path and appends the path to that key's list. Finally, the solution returns only the groups containing more than one path, since those are the duplicates.