This little program is (yet another) file deduplicator. It compares files by size and by checksum (MD5 and SHA1), weighs in at around 500 SLOC, and is written in C++.
Fdupes works great, but it spends a lot of time chewing through huge files, and its output is awkward to work with.
What makes De-duplicator stand apart?
You can tell it to consider only the first N bytes of each file, which substantially speeds up comparisons of large files (a rough sketch of this approach follows the feature list below).
Duplicates can be ‘scored’ according to a variety of information, including file modification times and case-insensitive regex matches against the full file path.
(More criteria are possible; these were all I needed at the time of writing.)
The top-scoring duplicate is retained; the others can be deleted, or the low-scoring duplicates can be emitted for further processing.
The scoring system is completely controlled by the user, which makes it easy to prioritize some collections over others. See the examples below for more about scoring.
Small files can be ignored. You specify what ‘small’ means.
Null-terminated or newline-terminated lists of files are accepted.
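To make the first-N-bytes idea concrete, here is a minimal C++ sketch of the general size-then-prefix-checksum approach. It is only a sketch under my own naming, not the program's actual code, and std::hash stands in for the MD5/SHA1 digests the tool really computes:

// Sketch only: bucket candidate files by size, then by a checksum of the
// first max_bytes bytes. Only buckets with more than one member need any
// file contents read at all.
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <functional>
#include <map>
#include <string>
#include <vector>

using path_list = std::vector<std::filesystem::path>;

// Read at most max_bytes from the front of a file.
static std::string read_prefix(const std::filesystem::path &p, std::size_t max_bytes) {
    std::ifstream in(p, std::ios::binary);
    std::string buf(max_bytes, '\0');
    in.read(buf.data(), static_cast<std::streamsize>(max_bytes));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Return groups of paths sharing both a size and a prefix checksum.
std::vector<path_list> find_duplicate_groups(const path_list &files,
                                             std::size_t max_bytes,
                                             std::uintmax_t smallest_size) {
    // Pass 1: bucket by file size. Cheap; no file contents are read.
    std::map<std::uintmax_t, path_list> by_size;
    for (const auto &p : files) {
        const auto size = std::filesystem::file_size(p);
        if (size >= smallest_size) by_size[size].push_back(p);
    }

    // Pass 2: within each multi-member size bucket, bucket again by a
    // checksum of the first max_bytes bytes (std::hash as a placeholder).
    std::vector<path_list> groups;
    for (const auto &[size, paths] : by_size) {
        if (paths.size() < 2) continue;
        std::map<std::size_t, path_list> by_digest;
        for (const auto &p : paths)
            by_digest[std::hash<std::string>{}(read_prefix(p, max_bytes))].push_back(p);
        for (auto &[digest, group] : by_digest)
            if (group.size() > 1) groups.push_back(std::move(group));
    }
    return groups;
}

The point of the two passes is that the expensive step (reading and checksumming file prefixes) only runs on files whose sizes already collide.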
The source is available here and is released under the GPLv3 license.
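The examples below lean heavily on the regex scoring rules, so here is a small C++ sketch of one plausible way weighted, case-insensitive regex rules could be evaluated. The '===' option parsing, the modification-time rule, and tie-breaking are omitted, and every name here is hypothetical rather than taken from the program:

// Sketch only: each rule pairs a case-insensitive ECMAScript regex with a
// weight; a file's score is the sum of the weights of all rules matching
// its full path, and the highest-scoring member of a group is kept.
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

struct RegexRule {
    std::regex pattern;   // compiled with std::regex::icase
    double weight;        // the real-valued score given after '==='
};

double score_path(const std::string &path, const std::vector<RegexRule> &rules) {
    double score = 0.0;
    for (const auto &r : rules)
        if (std::regex_search(path, r.pattern)) score += r.weight;
    return score;
}

// Within a (non-empty) group of matching files, keep the best-scoring path.
std::string pick_file_to_keep(const std::vector<std::string> &group,
                              const std::vector<RegexRule> &rules) {
    return *std::max_element(group.begin(), group.end(),
        [&](const std::string &a, const std::string &b) {
            return score_path(a, rules) < score_path(b, rules);
        });
}

// Hypothetical rules mirroring the photo example below:
//   { std::regex("^/home/hal/ArchivedPhotos/", std::regex::icase),  10.0 }
//   { std::regex(" ",                          std::regex::icase),  -1.0 }

In this reading, a rule contributes its weight whenever its pattern matches anywhere in the path, so several small penalties can outweigh a single bonus.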
Here are some examples (note these are included in the git repository):
$ ###########################################################
$ # See documentation.
$ ###########################################################
$ deduplicator -h
General Options:
-h [ --help ] This help message.
-b [ --number-of-bytes-to-consider ] arg (=1000000)
The maximum number of bytes of each
file to consider. Choose something
large enough so that file headers are
overcome, but not so large that the
whole file is used, which can be
wasteful if not needed.
-S [ --smallest-file-size ] arg (=1)
Only consider files with sizes >= this
many bytes.
-u [ --uri ] arg A directory or file to include.
-n [ --null-terminated-stdin ] (Default.) Accept null-terminated uris
from stdin. Useful for interoperability
with other programs.
-R [ --read-from-stdin ] Read uris from stdin, one per line.
Useful for interoperability with other
programs.
-t [ --trial-run-only ] Perform a dry-run or trial-run. Do not
remove anything.
-o [ --older-modification-time-score ] arg (=0)
Score to apply to files which have an
older modification time. File with
highest score 'wins' and is retained.
Score is applied once for every file
the given file is older than. So every
unique comparison within the matching
group is performed. For two files,
there is one comparison. For three
files, three comparisons. For N files,
there are N*(N-1)/2 comparisons. It is
therefore best not to mix this rule
with other rules unless the number of
potential copies is bounded and known,
and the relative rule weights can be
chosen appropriately.
-i [ --icase-ecmascript-regex-rule ] arg
Criteria for a regex-based rule.
Followed by '===' and the real-valued
score to apply. File with highest score
'wins' and is retained. Operates on
full file path. Case-insensitive.
$ ###########################################################
$ # Find duplicates in pdf book collection.
$ ###########################################################
$ find /home/hal/books/ -type f -iname '*pdf' -print0 |\
deduplicator \
`# Only compare the first XXX bytes.` \
-b 50000000 \
`# Read NULL-terminated strings from stdin.` \
-n \
`# Penalize files with spaces in their names.` \
-i ' '===-1.0 \
`# Penalize names that suggest copies or temporary files.` \
-i 'xdgopenedfile'===-10.0 \
-i '-1-'===-1.0 \
-i '-2-'===-1.0 \
-i '-3-'===-1.0 \
-i '___'===-3.0 \
`# Penalize files from 2020 (a terrible year in general).` \
-i '2020'===-1.0 \
`# Perform a trial run only. Do not remove anything.` \
-t
...
$ ###########################################################
$ # Find duplicates in subset of photo collection.
$ ###########################################################
$ # Only bother with photos from 2015{08,09,10} and large files.
$ find /home/hal/Photos/2015{08,09,10}* -size +512k -type f -print0 |\
deduplicator \
`# Ignore small files.` \
-S 10000 \
`# Only compare the first XXX bytes.` \
-b 10000000 \
`# Read NULL-terminated strings from stdin.` \
-n \
`# Prefer these paths.` \
-i '^/home/hal/ArchivedPhotos/'===10.0 \
`# Penalize these paths.` \
-i '.*XXth-Misc.*'===-1.0 \
`# Penalize all files with spaces in their names.` \
-i ' '===-1.0 \
`# Prefer your favourite style of timestamp.` \
-i '20\d{6}_\d{6}\.'===10.0 \
`# Perform a trial run only. Do not remove anything.` \
-t
...
$ ###########################################################
$ # Find and remove duplicates in current directory, keeping the oldest copy in each group.
$ ###########################################################
$ deduplicator -o 1.0 ./
...
Number of files found: ...
Number of files with non-unique sizes: ...
The amount of reclaimed space will be: ... MB.
...
Please send questions, comments, or pull requests here.