Problem Statement

Is your hard drive filled with backups of holiday pictures or downloaded images? How do you find and delete similar files if they have different formats, resolutions or rotations?
Is your hard drive filled with ripped or downloaded music? How do you find all the duplicates in your collection if they have different bit rates or formats?
How do you find duplicate text files or binary files on your computer?
Do you get a program to handle each case individually or would you rather have one program that does it all?
Here is my solution:

Introducing DuMP3

DuMP3 (derived from Duplicate MP3) is a Java program that finds duplicate or similar files of any type.

It finds these files by calculating a fingerprint from the image, audio or text data of each file and then comparing the fingerprints. It does not compare filenames or ID3 tags, although plugin classes could be written to do so. Calculated fingerprints can be stored in a MySQL database so that they do not have to be recalculated.

As a bonus, DuMP3 marks files that cannot be read or decoded correctly as corrupt, and files whose contents do not match their apparent format as having a signature mismatch.

DuMP3 can find exact duplicates as well as files that are similar but not identical:

  • Binary files, compared by SHA-1 hash (configurable to any MD hash)
  • Text files that were changed by addition or deletion (2 fingerprint algorithms available)
  • Pictures in different formats, sizes and/or rotations (BMP, GIF, JPEG, JPEG2000, PNG, PNM, RAW, TIFF)
  • Audio files that were recorded at different bit rates or saved in different formats (AU, AIF, WAV, MP3, OGG)
  • Any other file type where inexact matching is needed (fonts, videos, etc.), by writing a plugin fingerprint class
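
To illustrate the binary comparison, here is a minimal sketch of hash-based matching using the standard java.security.MessageDigest class. It is not DuMP3's own code: the class and method names (BinaryFingerprint, hexDigest, sameContent) are made up for this example, and it assumes that "MD hash" refers to the algorithms MessageDigest offers.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of DuMP3.
class BinaryFingerprint {

    // Returns the lowercase hex digest of the data for the given
    // algorithm name, e.g. "SHA-1", "MD5" or "SHA-256".
    static String hexDigest(byte[] data, String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
    }

    // Two byte arrays are treated as exact duplicates when their digests are equal.
    static boolean sameContent(byte[] a, byte[] b, String algorithm) {
        return hexDigest(a, algorithm).equals(hexDigest(b, algorithm));
    }

    public static void main(String[] args) {
        byte[] first = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] copy = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] other = "abd".getBytes(StandardCharsets.UTF_8);
        System.out.println(hexDigest(first, "SHA-1"));
        System.out.println(sameContent(first, copy, "SHA-1"));  // same bytes
        System.out.println(sameContent(first, other, "SHA-1")); // different bytes
    }
}
```

A hash like this only detects byte-for-byte identical files, which is why the picture, audio and text cases need their own fingerprints.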

Known Bugs/Issues

Some valid pictures are marked as corrupt because javax.imageio.ImageIO.read(File) does not understand the format. This is a limitation of Sun's implementation of the (old) JAI image decoders and not of DuMP3. Two examples are: CMYK encoded JPEG and RLE encoded BMP.
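
The behaviour behind this issue can be sketched as follows. ImageIO.read returns null when no registered reader recognises the data, and throws an exception when a reader fails part-way through decoding. The class name DecodeCheck and the labels "ok", "unsupported" and "corrupt" are assumptions for this example and may not match how DuMP3 classifies files. The sketch uses the InputStream overload of ImageIO.read instead of the File overload so that it runs without any files on disk.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of DuMP3.
class DecodeCheck {

    // Classifies image bytes by what ImageIO makes of them.
    static String classify(byte[] imageBytes) {
        ImageIO.setUseCache(false); // keep everything in memory
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
            // null means no installed reader claimed the format
            return image == null ? "unsupported" : "ok";
        } catch (IOException | RuntimeException e) {
            // a reader accepted the data but failed while decoding it
            return "corrupt";
        }
    }

    // Builds a small valid PNG in memory for demonstration purposes.
    static byte[] tinyPng() {
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(img, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(classify(tinyPng()));
        System.out.println(classify("not an image".getBytes()));
    }
}
```

A valid picture in a format the installed decoders do not handle looks the same to this check as a damaged one, which is how valid files can end up marked as corrupt.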

Some very large pictures (around 4000x4000 pixels) could cause DuMP3 to run out of memory and crash.

Sometimes a file with a GIF extension contains JPEG data, or vice versa. This usually occurs in images downloaded from websites. These pictures will be marked as having a signature mismatch, but fingerprinting will still be attempted.
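
One way such a mismatch could be detected is to compare the file extension with the well-known leading bytes of each format: FF D8 FF for JPEG, "GIF8" for GIF, and 89 "PNG" for PNG. The sketch below is an illustration only. DuMP3's actual check is not described here, and the class and method names (SignatureCheck, sniff, formatFromName, signatureMismatch) are invented for this example.

```java
import java.util.Locale;

// Hypothetical helper, not part of DuMP3.
class SignatureCheck {

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Guesses the format from the first bytes of the file.
    static String sniff(byte[] header) {
        if (startsWith(header, 0xFF, 0xD8, 0xFF)) return "jpeg";
        if (startsWith(header, 'G', 'I', 'F', '8')) return "gif";
        if (startsWith(header, 0x89, 'P', 'N', 'G')) return "png";
        return "unknown";
    }

    // Guesses the format from the file extension.
    static String formatFromName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "jpg":
            case "jpeg": return "jpeg";
            case "gif":  return "gif";
            case "png":  return "png";
            default:     return "unknown";
        }
    }

    // Reports a mismatch only when both guesses are known and disagree.
    static boolean signatureMismatch(String fileName, byte[] header) {
        String expected = formatFromName(fileName);
        String actual = sniff(header);
        return !expected.equals("unknown")
                && !actual.equals("unknown")
                && !expected.equals(actual);
    }

    public static void main(String[] args) {
        byte[] jpegHeader = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(signatureMismatch("banner.gif", jpegHeader));
        System.out.println(signatureMismatch("banner.jpg", jpegHeader));
    }
}
```

Because the content format is still identified, a mismatched file can be handed to the decoder for its real format, which is consistent with fingerprinting still being attempted.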

Limits

DuMP3 is subject to some Java limitations as well as some limitations in the libraries I have chosen. The F.A.Q. covers most of them.

