Wednesday, February 10, 2010

Find Duplicate Files in the Terminal

I posted an Automator Service last week for finding duplicate photos in an iPhoto Library.  Here is a slightly modified version of the internal script it uses.  You can save this script and run it in a terminal to find duplicate files of any kind in any directory tree of your choice.  It can also be embedded in an Automator workflow via the Run Shell Script action.
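The core idea is a short pipeline: hash every regular file, then sort by checksum so duplicates land on adjacent lines.  A minimal sketch (shown with GNU `md5sum`; the macOS equivalent used in the script below is `md5 -r`, which prints the same "checksum filename" layout):

```shell
# Hash every regular file under the current directory, then sort by
# checksum so any duplicates end up on adjacent lines.
find . -type f -print0 | xargs -0 md5sum | sort
```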

findDuplicates.pl
#!/usr/bin/perl

# ##################################### #
# Filename:      findDuplicates.pl
# Author:        Jeremy Pyne
# Licence:       CC:BY/NC/SA  http://creativecommons.org/licenses/by-nc-sa/3.0/
# Last Update:   02/10/2010
# Version:       1.5
# Requires:      perl
# Description:
#   This script will look through a directory of files and find any duplicates.  It will then
#   return a list of any such duplicates it finds.  This is done by calculating the md5 checksum
#   of each file and recording it along with the filename.  Then the list is sorted by the checksum
#   and read in line by line.  Any time multiple records in a row share a checksum the file names
#   are written out to stdout.  As a result all empty files will be flagged as duplicates as well.
# ##################################### #

# Get the path from the command line.  This could be expanded to provide more granular control.
$dir = shift;

# Set up the location of the temp files.
$file = "/tmp/pictures.txt";
$sort = "/tmp/sorted.txt";

# Find all files in the selected directory and calculate their md5sum.  This is by far the longest step.
`find "$dir" -type f -print0 | xargs -0 md5 -r > $file`;
# Sort the resulting file by the md5sum's.
`sort $file > $sort`;

open FILE, "<$sort" or die $!;

my $newmd5;
my $newfile;
my $lastmd5;
my $lastfile;
my $lastprint = 0;

# Read each line from the file.
while(<FILE>) {
        # Extract the md5sum and the filename; skip any malformed lines.
        next unless /^([^ ]+) (.+)$/;

        $newmd5 = $1;
        $newfile = $2;

        # If this is the same checksum as the last file then flag it.
        if(defined($lastmd5) && $newmd5 eq $lastmd5)
        {
                # If this is the first duplicate for this checksum then print the first file's name.
                if(!$lastprint)
                {
                        print("$lastfile\n");
                        $lastprint = 1;
                }
                # Print the conflicting file's name.
                print("$newfile\n");
        }
        }
        else
        {
                $lastprint = 0;
        }

        # Record the last filename and checksum for future testing.
        $lastmd5 = $newmd5;
        $lastfile = $newfile;
}

close(FILE);

# Remove the temp files.
unlink($file);
unlink($sort);
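The Perl loop that follows the sort can also be expressed as a single awk pass over the same sorted list, for anyone who prefers to stay entirely in the shell.  A hedged sketch (again using GNU `md5sum`; macOS users would swap in `md5 -r`, and filenames containing spaces would need extra care):

```shell
# Whenever a checksum repeats, print the first file's name once, then
# each conflicting file's name -- mirroring the Perl script's output.
find . -type f -print0 | xargs -0 md5sum | sort |
  awk 'prev == $1 { if (!dup) print prevname; print $2; dup = 1 }
       prev != $1 { dup = 0 }
       { prev = $1; prevname = $2 }'
```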
