Project - Deduplicator

[Image: the project's goofy logo.]

About

This little program is (yet another) deduplicator. It finds duplicate files by comparing sizes first and then checksums (MD5 and SHA1), and it is around 500 SLOC.
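
The approach is the usual two-pass one: bucket candidate files by size first (cheap), and only checksum files whose sizes collide. Below is a minimal sketch of that idea, not the program's actual code: std::hash over the leading bytes stands in for the real MD5/SHA1 checksums, and the 1 MB prefix mirrors the -b default shown in the help text below.

    #include <cstddef>
    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    // Stand-in checksum: hash of the file's first n bytes.
    // (The real tool uses MD5 and SHA1; std::hash keeps this sketch dependency-free.)
    static std::size_t prefix_hash(const fs::path &p, std::size_t n) {
        std::ifstream in(p, std::ios::binary);
        std::string buf(n, '\0');
        in.read(buf.data(), static_cast<std::streamsize>(n));
        buf.resize(static_cast<std::size_t>(in.gcount()));
        return std::hash<std::string>{}(buf);
    }

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        const std::size_t prefix = 1'000'000;  // cf. the -b default below
        // Pass 1: bucket paths by file size; a unique size cannot be a duplicate.
        std::map<std::uintmax_t, std::vector<fs::path>> by_size;
        for (const auto &e : fs::recursive_directory_iterator(argv[1]))
            if (e.is_regular_file())
                by_size[e.file_size()].push_back(e.path());
        // Pass 2: within each size bucket, bucket again by prefix checksum.
        for (const auto &[size, paths] : by_size) {
            if (paths.size() < 2) continue;
            std::map<std::size_t, std::vector<fs::path>> by_sum;
            for (const auto &p : paths)
                by_sum[prefix_hash(p, prefix)].push_back(p);
            for (const auto &[sum, dups] : by_sum)
                if (dups.size() > 1)
                    for (const auto &p : dups)
                        std::cout << size << '\t' << p << '\n';
        }
    }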

Rationale

Fdupes works great, but it spends a lot of time chewing through huge files, and its output is difficult to process programmatically.

What makes 'Project - Deduplicator' stand apart? It checksums only a configurable number of leading bytes of each file (see -b below), so huge files need not be read end-to-end, and which copy of a duplicate to keep is decided by scored rules instead of by hand-editing a list of matches.

Download

The source is available here, and is released under the GPLv3 license. Please send questions or comments to . Or, even better, send a pull request on GitLab ☺.

Examples

Here are some examples (note these are included in the git repository):

$ ###########################################################        
$ # See documentation.
$ ###########################################################
$ deduplicator -h 
General Options:
  -h [ --help ]                         This help message.
  -b [ --number-of-bytes-to-consider ] arg (=1000000)
                                        The maximum number of bytes of each 
                                        file to consider. Choose something 
                                        large enough so that file headers are 
                                        overcome, but not so large that the 
                                        whole file is used, which can be 
                                        wasteful if not needed.
  -S [ --smallest-file-size ] arg (=1)  Only consider files with sizes >= this 
                                        many bytes.
  -u [ --uri ] arg                      A directory or file to include.
  -n [ --null-terminated-stdin ]        (Default.) Accept null-terminated uris 
                                        from stdin. Useful for interoperability
                                        with other programs.
  -R [ --read-from-stdin ]              Read uris from stdin, one per line. 
                                        Useful for interoperability with other 
                                        programs.
  -t [ --trial-run-only ]               Perform a dry-run or trial-run. Do not 
                                        remove anything.
  -o [ --older-modification-time-score ] arg (=0)
                                        Score to apply to files which have an 
                                        older modification time. File with 
                                        highest score 'wins' and is retained. 
                                        Score is applied once for every file 
                                        the given file is older than. So every 
                                        unique comparison within the matching 
                                        group is performed. For two files, 
                                        there is one comparison. For three 
                                        files, three comparisons. For N files, 
                                        there are N*(N-1)/2 comparisons. It is 
                                        therefore best not to mix this rule 
                                        with other rules unless the number of 
                                        potential copies is bounded and known, 
                                        and the relative rule weights can be 
                                        chosen appropriately.
  -i [ --icase-ecmascript-regex-rule ] arg
                                        Criteria for a regex-based rule. 
                                        Followed by '===' and the real-valued 
                                        score to apply. File with highest score
                                        'wins' and is retained. Operates on 
                                        full file path. Case-insensitive.
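
To make the -o rule above concrete: within a group of matching files, every unique pair is compared once, and the older file of each pair collects the weight, so with three duplicates the oldest accrues 2 points, the middle 1, and the newest 0. A hypothetical sketch of that scoring (the paths and mtimes are invented; this is a reading of the help text, not the tool's actual code):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct File { std::string path; long mtime; double score = 0.0; };

    int main() {
        const double w = 1.0;  // the -o weight
        // A hypothetical group of three files that hashed identically.
        std::vector<File> group = { {"a.pdf", 100}, {"b.pdf", 200}, {"c.pdf", 300} };
        // Every unique pair is compared once: N*(N-1)/2 comparisons.
        for (std::size_t i = 0; i < group.size(); ++i)
            for (std::size_t j = i + 1; j < group.size(); ++j)
                (group[i].mtime < group[j].mtime ? group[i] : group[j]).score += w;
        // The file with the highest score 'wins' and is retained.
        const auto keep = std::max_element(group.begin(), group.end(),
            [](const File &a, const File &b) { return a.score < b.score; });
        std::cout << "retain " << keep->path << "\n";  // prints: retain a.pdf
    }

With a negative weight the newest copy would win instead; the last example below uses -o 1.0 to keep the oldest.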


$ ###########################################################    
$ # Find duplicates in pdf book collection.
$ ###########################################################    
$ find /home/hal/books/ -type f -iname '*pdf' -print0 |\
    deduplicator \
            -b 50000000 `# Only compare the first XXX bytes.` \
            -n          `# Read NULL-terminated strings from stdin.` \
            -i ' '===-1.0   `# Penalize files with spaces in their names (normally cleaned).` \
            -i 'xdgopenedfile'===-10.0 `# Heavily penalize names containing 'xdgopenedfile'.` \
            -i '-1-'===-1.0 `# Penalize copy-numbered names ('-1-', '-2-', '-3-').` \
            -i '-2-'===-1.0 \
            -i '-3-'===-1.0 \
            -i '___'===-3.0 `# Penalize runs of underscores more heavily.` \
            -i '2014'===-1.0 `# Penalize anything with '2014' in its name.` \
            -t `# Perform a trial run only. Do not remove anything.`

...
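
The -i rules above compose as follows: each argument is a case-insensitive ECMAScript regex and a real-valued score separated by '===', any rule whose pattern matches somewhere in a file's full path contributes its score, and the highest-scoring file in a group is retained. The sketch below assumes scores from multiple matching rules simply sum, which the help text does not spell out; the parsing helper and sample rules are illustrative, not the tool's code:

    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    // One '-i' argument: "<ecmascript-regex>===<score>".
    struct Rule { std::regex re; double score; };

    static Rule parse_rule(const std::string &arg) {
        const auto sep = arg.rfind("===");  // (no error handling in this sketch)
        return { std::regex(arg.substr(0, sep),
                            std::regex::ECMAScript | std::regex::icase),
                 std::stod(arg.substr(sep + 3)) };
    }

    int main() {
        const std::vector<Rule> rules = {
            parse_rule(" ===-1.0"),     // penalize spaces in the path
            parse_rule("2014===-1.0"),  // penalize '2014' anywhere in the path
        };
        // Hypothetical duplicate group; the full path is what gets matched.
        for (const std::string path : { "/books/a copy 2014.pdf", "/books/a.pdf" }) {
            double total = 0.0;
            for (const auto &r : rules)
                if (std::regex_search(path, r.re)) total += r.score;
            std::cout << total << "\t" << path << "\n";  // higher total is retained
        }
    }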

$ ###########################################################    
$ # Find duplicates in subset of photo collection.
$ ###########################################################    
$ # Only bother with photos from 2015{08,09,10}, and only with large files.
$ find /home/hal/Photos/2015{08,09,10}* -size +512k -type f -print0 |\
    deduplicator \
            -S 10000    `# Ignore small files.` \
            -b 10000000 `# Only compare the first XXX bytes.` \
            -n          `# Read NULL-terminated strings from stdin.` \
            -i '^/home/hal/ArchivedPhotos/'===10.0    `# Prefer files already archived.` \
            -i '.*XXth-Misc_Dropbox.*'===-1.0   `# Penalize Dropbox auto-uploads if another copy is around.` \
                                                `# Dropbox tampers with these, e.g., their filenames.` \
            -i ' '===-1.0   `# Penalize files with spaces in their names. You would probably detox these files.` \
            -i '20\d{6}_\d{6}\.'===10.0   `# Prefer files with your favourite style of timestamp.` \
            -t `# Perform a trial run only. Do not remove anything.`

...

$ ###########################################################    
$ # Find and remove duplicates in current directory, keeping the oldest duplicates.
$ ###########################################################    
$ deduplicator -o 1.0 ./
...
Number of files found: ...
Number of files with non-unique sizes: ...
The amount of reclaimed space will be: ... MB.
...