A Configurable File De-duplicator

Hal Clark

About

This little program is (yet another) file deduplicator. It identifies duplicates by comparing file sizes and checksums (MD5 and SHA1). It is written in C++ and weighs in at around 500 SLOC.
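
For intuition, the core approach can be mimicked with standard tools. The following is a rough sketch only, not this program's implementation: it hashes entire files and applies no size pre-filter, prefix limit, or retention rules.

$ ###########################################################
$ # Sketch of the idea with coreutils: hash every file and
$ # report names that share a digest. md5sum emits a 32-char
$ # hex digest, so uniq compares only the first 32 columns.
$ ###########################################################
$ find . -type f -print0 |\
    xargs -0 md5sum | sort |\
    uniq -w32 --all-repeated=separate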

Rationale

Fdupes works great, but it spends a lot of time chewing through huge files, and its output is awkward to act on.

What makes this de-duplicator stand apart? It hashes only a configurable prefix of each file (see -b below), so huge files are cheap to compare, and score-based rules (see -o and -i) decide which copy in each duplicate group to keep, so there is no output to post-process by hand.

Download

The source is available here and is released under the GPLv3 license.

Examples

Here are some examples (these are also included in the git repository):

$ ###########################################################        
$ # See documentation.
$ ###########################################################
$ deduplicator -h 
General Options:
  -h [ --help ]                   This help message.

  -b [ --number-of-bytes-to-consider ] arg (=1000000)
                                  The maximum number of bytes of each 
                                  file to consider. Choose something 
                                  large enough so that file headers are 
                                  overcome, but not so large that the 
                                  whole file is used, which can be 
                                  wasteful if not needed.

  -S [ --smallest-file-size ] arg (=1)
                                  Only consider files with sizes >= this 
                                  many bytes.

  -u [ --uri ] arg                A directory or file to include.

  -n [ --null-terminated-stdin ]  (Default.) Accept null-terminated uris 
                                  from stdin. Useful for interoperability
                                  with other programs.

  -R [ --read-from-stdin ]        Read uris from stdin, one per line. 
                                  Useful for interoperability with other 
                                  programs.

  -t [ --trial-run-only ]         Perform a dry-run or trial-run. Do not 
                                  remove anything.

  -o [ --older-modification-time-score ] arg (=0)
                                  Score to apply to files which have an 
                                  older modification time. File with 
                                  highest score 'wins' and is retained. 
                                  Score is applied once for every file 
                                  the given file is older than. So every 
                                  unique comparison within the matching 
                                  group is performed. For two files, 
                                  there is one comparison. For three 
                                  files, three comparisons. For N files, 
                                  there are N*(N-1)/2 comparisons. It is 
                                  therefore best not to mix this rule 
                                  with other rules unless the number of 
                                  potential copies is bounded and known, 
                                  and the relative rule weights can be 
                                  chosen appropriately.

  -i [ --icase-ecmascript-regex-rule ] arg
                                  Criteria for a regex-based rule. 
                                  Followed by '===' and the real-valued 
                                  score to apply. File with highest score
                                  'wins' and is retained. Operates on 
                                  full file path. Case-insensitive.

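To make the --older-modification-time-score comparisons concrete, consider a hypothetical matching group of three files: a.txt (modified 2019), b.txt (2020), and c.txt (2021). With -o 1.0, a.txt is older than two other files and scores 2.0, b.txt scores 1.0, and c.txt scores 0.0, so a.txt 'wins' and is retained.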

$ ###########################################################    
$ # Find duplicates in pdf book collection.
$ ###########################################################    
$ find /home/hal/books/ -type f -iname '*pdf' -print0 |\
    deduplicator \
      `# Only compare the first 50 MB of each file.` \
      -b 50000000 \
      `# Read NULL-terminated strings from stdin.` \
      -n \
      `# Penalize files with spaces in their names.` \
      -i ' '===-1.0 \
      `# Penalize names that suggest duplicated or auto-saved copies.` \
      -i 'xdgopenedfile'===-10.0 \
      -i '-1-'===-1.0 \
      -i '-2-'===-1.0 \
      -i '-3-'===-1.0 \
      -i '___'===-3.0 \
      `# Penalize files from 2020 (a terrible year in general).` \
      -i '2020'===-1.0 \
      `# Perform a trial run only. Do not remove anything.` \
      -t 

...
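
Incidentally, the backtick comments (`# ...`) in these examples are empty command substitutions: each expands to nothing, which lets every flag in a long, line-continued command carry its own annotation.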

$ ###########################################################    
$ # Find duplicates in subset of photo collection.
$ ###########################################################    
$ # Only bother with photos from 2015{08,09,10} and large files.
$ find /home/hal/Photos/2015{08,09,10}* -size +512k -type f -print0 |\
    deduplicator \
      `# Ignore small files.` \
      -S 10000 \
      `# Only compare the first 10 MB of each file.` \
      -b 10000000 \
      `# Read NULL-terminated strings from stdin.` \
      -n \
      `# Prefer these paths.` \
      -i '^/home/hal/ArchivedPhotos/'===10.0 \
      `# Penalize these paths.` \
      -i '.*XXth-Misc.*'===-1.0 \
      `# Penalize all files with spaces in their names.` \
      -i ' '===-1.0 \
      `# Prefer your favourite style of timestamp.` \
      -i '20\d{6}_\d{6}\.'===10.0  \
      `# Perform a trial run only. Do not remove anything.` \
      -t 

...
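(The timestamp rule in this example prefers camera-style names; for instance, a hypothetical 20150823_142301.jpg would out-score a renamed copy of itself.)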

$ ###########################################################    
$ # Find and remove duplicates in the current directory, keeping the oldest copy in each group.
$ ###########################################################    
$ deduplicator -o 1.0 ./
...
Number of files found: ...
Number of files with non-unique sizes: ...
The amount of reclaimed space will be: ... MB.
...
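
The line-oriented -R mode suits tools that emit one path per line (safe only when filenames contain no newlines). A sketch, with an illustrative directory:

$ ###########################################################
$ # Read newline-delimited paths instead of NULL-terminated.
$ ###########################################################
$ find /home/hal/Downloads/ -type f -iname '*.pdf' |\
    deduplicator \
      `# Read newline-terminated uris from stdin.` \
      -R \
      `# Perform a trial run only. Do not remove anything.` \
      -t

...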

Feedback

Please send questions, comments, or pull requests here.