We build a map: content -> [file_path]. If more than one file maps to the same content, those files are duplicates.
Say there are $n$ input path strings and the longest one has length $L$. Then the time is $O(nL)$, and the space is proportional to the total number of characters, including contents, which is also $O(nL)$.
from typing import List


class Solution:
    def map_content(self, path: str, content_map: dict) -> None:
        # Each input string looks like "dir f1(content1) f2(content2) ...".
        parts = path.split(" ")
        dir = parts[0]
        for i in range(1, len(parts)):
            f = parts[i]
            marker = f.find("(")      # split "name(content)" at the first "("
            f_name = f[:marker]
            f_content = f[marker:]    # "(content)" serves as the dict key
            content_map[f_content] = content_map.get(f_content, []) + [
                f"{dir}/{f_name}"
            ]

    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        content_map = {}              # content -> list of full file paths
        for p in paths:
            self.map_content(p, content_map)
        # Only contents shared by more than one file are duplicates.
        return [files for _, files in content_map.items() if len(files) > 1]
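As a quick sanity check, here is a small made-up example; the abcd files and the efgh files end up grouped together:

    paths = [
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root/c/d 4.txt(efgh)",
        "root 4.txt(efgh)",
    ]
    print(Solution().findDuplicate(paths))
    # [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]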
Follow-up
Imagine you are given a real file system, how will you search files? DFS or BFS?
We should use DFS. A real file system is typically much wider than it is deep, so BFS may run into memory limits, and since we need the contents of only one file at a time, DFS is sufficient. If a heuristic is available for how likely a directory is to contain duplicate files, we may use BFS for the first few levels and then run DFS in a prioritized manner on the discovered directories.
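A minimal sketch of such a DFS over a real directory tree, using only the standard library. The names dfs_collect and read_proxy are made up for illustration; read_proxy stands for any callable that turns a file's content into a hashable key (the raw bytes, or a hash as discussed below):

    import os

    def dfs_collect(root, content_map, read_proxy):
        # Recurse depth-first; only one file's content proxy is needed at a time.
        for entry in os.scandir(root):
            if entry.is_dir(follow_symlinks=False):
                dfs_collect(entry.path, content_map, read_proxy)  # go deeper first
            elif entry.is_file(follow_symlinks=False):
                content_map.setdefault(read_proxy(entry.path), []).append(entry.path)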
If the file content is very large (GB level), how will you modify your solution?
With a large file, we cannot directly put its content in the dict as the key. We need a smaller proxy for the content, such as a cryptographic hash like SHA256. If only a single file maps to a given SHA256 hash, we can be certain that its content is unique. If multiple files map to the same hash, there is a small chance that two different contents collided on that hash, so we need to validate that the contents are actually identical by comparing the candidate files with each other.
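A minimal sketch of that idea, with the hypothetical helper name content_digest; the 32-byte digest replaces the raw content as the dict key:

    import hashlib

    def content_digest(file_path: str) -> bytes:
        # Hash the file in chunks so the whole content never sits in memory.
        h = hashlib.sha256()
        with open(file_path, "rb") as f:
            while chunk := f.read(1 << 20):  # 1 MB at a time
                h.update(chunk)
        return h.digest()  # 32 bytes, used as the content_map key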
If you can only read the file by 1kb each time, how will you modify your solution?
We can create a Merkle tree of SHA256 hashes. Each leaf hash is computed over a 1 kb chunk. Then each pair of leaf hashes is hashed into a parent hash, and so on, until we have a single root hash. The root hash works as a reliable proxy for the file content. If a file has 1 GB of content, there will be about 1 million leaves in the Merkle tree. Each leaf hash is 256 bits or 32 bytes long, so the memory requirement is about $10^6 \times 32$ bytes $\approx 32$ MB for the leaves, and roughly twice that for the whole tree. Again, when multiple files map to the same root hash, we can compare their contents 1 kb at a time.
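A minimal sketch of computing such a root hash, with the hypothetical helper name merkle_root and a 1 kb chunk size; a trailing unpaired hash at the end of a level is simply re-hashed on its own:

    import hashlib

    CHUNK_SIZE = 1024  # read the file 1 kb at a time

    def merkle_root(file_path: str) -> bytes:
        # Leaf level: one SHA256 hash per 1 kb chunk.
        level = []
        with open(file_path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                level.append(hashlib.sha256(chunk).digest())
        if not level:  # empty file: hash of the empty string
            return hashlib.sha256(b"").digest()
        # Repeatedly pair up hashes until a single root remains.
        while len(level) > 1:
            nxt = []
            for i in range(0, len(level), 2):
                pair = level[i] + (level[i + 1] if i + 1 < len(level) else b"")
                nxt.append(hashlib.sha256(pair).digest())
            level = nxt
        return level[0]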
What is the time complexity of your modified solution? What is the most time-consuming part and memory-consuming part of it? How to optimize?
For each file we now need to compute about 1 million SHA256 hashes, so apart from the DFS of the file system, computing the Merkle tree incurs overhead. The memory requirement for the Merkle tree is modest, about $64$ MB per 1 GB file ($32$ MB of leaf hashes plus roughly as much again for the internal nodes). Disk I/O will dominate the time. If it takes $t$ seconds to read and hash a 1 kb chunk, computing the leaf hashes takes about $10^6 t$ seconds. Note that there are about another million internal hashes in the Merkle tree, so the total time is at most about $2 \times 10^6 t$ seconds, and in practice less, since the internal hashes involve no disk I/O. To optimize, we can parallelize: as soon as two leaf hashes are available, the parent hash can be computed concurrently with the next leaf hashes.
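One simple way to get part of that overlap with only the standard library is to read chunks in a background thread while the main thread hashes them. This is a rough sketch under that assumption, not the full pipelined tree construction described above, and the helper names are made up:

    import hashlib
    import queue
    import threading

    CHUNK = 1024

    def _reader(path, q):
        # Producer: push raw 1 kb chunks onto the queue, then a sentinel.
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                q.put(chunk)
        q.put(None)

    def leaf_hashes_pipelined(path):
        # Consumer: hash chunks as they arrive, overlapping I/O with hashing.
        q = queue.Queue(maxsize=64)  # bounded so the reader cannot race too far ahead
        t = threading.Thread(target=_reader, args=(path, q), daemon=True)
        t.start()
        leaves = []
        while (chunk := q.get()) is not None:
            leaves.append(hashlib.sha256(chunk).digest())
        t.join()
        return leaves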
Given most files have unique contents, in the end we may need to validate only a handful of duplicates. In the worst case, we may need to compare the duplicate files pairwise. The time for comparing two 1 GB files will again be dominated by disk I/O and would be similar to the Merkle tree’s time complexity.
How to make sure the duplicated files you find are not false positive?
Say a false positive means that two files with different contents have been declared duplicates. Although the Merkle tree is a very reliable proxy for a file's content, there is still a nonzero probability of hash collisions. To be absolutely sure that the files are duplicates, we need to compare their contents directly.
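A minimal sketch of that final check, with the hypothetical helper name files_identical and the same 1 kb read limit:

    import os

    CHUNK_SIZE = 1024  # we may only read 1 kb at a time

    def files_identical(path_a: str, path_b: str) -> bool:
        # Cheap pre-check: files of different sizes can never be identical.
        if os.path.getsize(path_a) != os.path.getsize(path_b):
            return False
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                ca, cb = fa.read(CHUNK_SIZE), fb.read(CHUNK_SIZE)
                if ca != cb:
                    return False
                if not ca:  # both files exhausted at the same point
                    return True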