Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 50 (Aug 2003)

[suggested title: ``Finding similar images'']

I admit it. Like anyone else with a decent-speed connection to the net, I collect a lot of images. For example, in this column a few months ago, I described a program that looks through Yahoo! news images for pictures of Oregon and some of my favorite singing stars.

Sometimes, an image travels multiple paths before it ends up on my disk, and thus gets saved under different names. But that's a waste of disk, so I want to eliminate duplicates where I can. At first I was using a simple tool that computed the MD5 hash of each image in my collection, and eliminated the duplicates easily.

But frequently, the image has been scaled or maybe even re-rendered at a different JPEG quality, or converted from JPEG to PNG. In that case, the actual bits are not the same, even though the image is nearly the same on my screen.

So, a few days ago, I set out to write a program that could find similar images, not just identical images. My selected strategy is to reduce each image to a 4-by-4 grid of RGB values, yielding a 48-number vector of values from 0 to 255. Regardless of the re-rendering or resizing of the image (or even minor touchups), this value should be identical (or close) for images with the same original source. After a few hours of experimenting and tweaking, the results of my work can be seen in [listing one, below].

Lines 1 through 3 start nearly every program I write, turning on warnings, compiler restrictions, and disabling the buffering of STDOUT.

Line 5 pulls in the move routine defined in the core File::Copy module. I'll be using this to rename any corrupt images, an interesting outcome of having to scan them anyway.

Line 7 pulls in the CPAN-located Image::Magick interface, also known as PerlMagick. I'm not sure why I have such a hard time installing and using image libraries, but Image::Magick is certainly a prime example of a finicky-to-install and vastly underdocumented image manipulation module. However, when you get it to work, it is indeed Magick.

Line 8 pulls in the Cache::FileCache module, part of the Cache::Cache distribution found in the CPAN. Because the conversion of an image to its vector can take time, I cache the results. The cache key is computed as something that will not change even if I rename the item on the disk, which makes it easy for me to move my images around or give them more meaningful names without losing the work done on the image.

Line 10 predeclares the warnif subroutine, used in the middle of the program, but not defined until the end, so that I can use it without parens.

Lines 12 to 15 collect my ``likely to change'' constants.

Line 13 defines the fuzz factor. If the average absolute difference between each of the corresponding 48 numbers in each of the vectors of the two images doesn't exceed this value, we can call it a match. I found 5 to be a nice compromise between too many false positives (two similar images being declared identical) and too many false negatives (not seeing a pair when the pair was there).
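In other words, two images match when the sum of absolute differences over all 48 elements is at most $FUZZ times 48, or 240. Here's a minimal sketch of that comparison, using made-up toy 4-element vectors in place of the real 48-element ones:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $FUZZ = 5;   # permitted average deviation per element

# toy 4-element vectors standing in for the real 48-element fingerprints
my @a = (10, 200, 30, 40);
my @b = (12, 198, 33, 41);

# sum of absolute differences over corresponding elements
my $error = 0;
$error += abs($a[$_] - $b[$_]) for 0 .. $#a;

# match if the total stays within $FUZZ units per element
my $verdict = $error <= $FUZZ * @a ? "match" : "no match";
print "$verdict\n";   # prints "match": the sum is 8, well under 20
```

The same test scales directly to the 48-element vectors: only the length of the arrays changes.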

If the value of $CORRUPT in line 14 is defined, then a directory by this name is created if necessary, and any corrupt image (according to Image::Magick) is renamed into that directory as it is seen. If set to undef instead, the image is merely ignored (with a warning).

Lines 17 to 20 define a file-based cache, located below the invoker's home directory. I use the glob operator to expand ~/.filecache to its full name. The glob is used in a literal slice to extract the first of what is hopefully only one response value. (There are about a dozen other ways of doing this, but this one worked the first time I tried it, and should hopefully be sufficiently portable.)

Line 23 defines the array holding the ``buckets''. Each bucket contains an arrayref. The first element of that referenced array is itself an arrayref of the 48-integer vector for all images in that bucket. The remaining elements are the filenames.

The bucket strategy is a simple linear search: as each new file is examined, its vector is compared to the vectors of all previously computed buckets. If there's a match, the name is added to the end of the list. If there's no match, a new bucket is added to the end, with the unmatched vector, and initially the one filename for that vector. While I initially imagined this to be hideously slow (it's an O(N**2) algorithm, I think), in practice I found that I could identify matching images of an 8000-image test directory in about two or three CPU minutes on my laptop. That's ``fast enough'', as they say.
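The bucket strategy above can be sketched in miniature. This is the same loop shape as the listing, run over made-up toy 4-element vectors and hypothetical filenames, showing both the early abort and the two outcomes (join a bucket, or start a new one):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $FUZZ = 5;
my @buckets;    # each bucket: [ \@vector, filename, filename, ... ]

# toy 4-element vectors (the real program uses 48 elements)
my %files = (
  'a.jpg' => [10, 20, 30, 40],
  'b.jpg' => [11, 19, 31, 40],   # near a.jpg: should share its bucket
  'c.jpg' => [200, 5, 90, 10],   # far from both: gets its own bucket
);

FILE: for my $file (sort keys %files) {
  my @vector = @{ $files{$file} };
  BUCKET: for my $bucket (@buckets) {
    my $error = 0;
    for my $index (0 .. $#vector) {
      $error += abs($bucket->[0][$index] - $vector[$index]);
      # early abort: no point finishing once the budget is blown
      next BUCKET if $error > $FUZZ * @vector;
    }
    push @$bucket, $file;        # close enough: join this bucket
    next FILE;
  }
  push @buckets, [[@vector], $file];   # no match: start a new bucket
}

print scalar(@buckets), " buckets\n";   # prints "2 buckets"
```

Here a.jpg and b.jpg end up sharing a bucket, while c.jpg aborts after its very first element comparison (|10 - 200| already exceeds the budget of 20) and founds a bucket of its own.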

Lines 25 to 77 form the main per-file loop. Because I had to pop out of nested loops, I gave this loop a label of ``FILE''. It's good to name loops based on the noun of the thing being processed, so the shorthand of next FILE reads nicely. The loop runs for as long as @ARGV is non-empty, peeling off one element at a time into $file in line 26.

Lines 27 to 33 permit the list of names to include directories, which will be recursively processed. If a directory name is seen, then the directory is opened, and the contents are examined. Every file beginning with a dot is discarded, and the remaining elements have the directory name prepended to them. By unshifting the value back into @ARGV, we get a depth-first scan of the directories.

While I probably could have used File::Find here, I decided to open-code the steps instead. I had already gotten the program to work for all of the files specified on the command line, but I wanted to test the program with even more files than are permitted in a command line. By adding the code to replace any directory with its contents, I could just use . as one of the entries instead, and it seemed to work as I needed it to work.
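The directory-expansion trick can be seen in isolation. This sketch restates the listing's loop with a lexical directory handle, run over a throwaway temporary tree so it has something deterministic to walk (the tree layout is made up for the demonstration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# build a tiny tree to walk: $top/sub/pic.jpg, plus a dot-file to skip
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir: $!";
open my $fh, '>', "$top/sub/pic.jpg" or die "open: $!"; close $fh;
open $fh, '>', "$top/.hidden"        or die "open: $!"; close $fh;

my @found;
@ARGV = ($top);
FILE: while (@ARGV) {
  my $file = shift;
  if (-d $file) {
    opendir my $dh, $file or next FILE;
    # replace the directory with its non-dot contents, depth first
    unshift @ARGV, map { /^\./ ? () : "$file/$_" } sort readdir $dh;
    next FILE;
  }
  next FILE unless -f _;      # reuse the stat cached by -d above
  push @found, $file;         # a plain file, ready for processing
}

print scalar(@found), " file(s) found\n";   # prints "1 file(s) found"
```

Because new entries are unshifted onto the front of @ARGV, a directory's contents are processed before the remaining command-line arguments, giving the depth-first order.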

Line 35 ensures that I process only plain files. I test against the magic underscore filehandle, which contains a cache of the information from the previous file test (line 27), to avoid making redundant system calls for the same information.

Lines 37 to 39 compute the cache key for this file that will remain constant even if the file is renamed. The dev/ino pair of numbers uniquely defines the specific Unix inode on the specific disk on which the file is located, which doesn't change as long as the file is merely renamed. And the modification timestamp will also remain constant if the file is renamed, but will change if the contents are somehow altered. Thus, the triple of dev/ino/mtime is a useful way to track a particular version of an item even if it has been renamed.
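A quick sketch demonstrates that the key survives a rename (the temporary filenames here are made up for the demonstration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $old = "$dir/original.jpg";
open my $fh, '>', $old or die "open: $!";
print $fh "image data";
close $fh;

# same key recipe as the column: dev, ino, and mtime from stat
my $key_for = sub {
  my @stat = stat(shift) or die "stat: $!";
  "@stat[0, 1, 9]";
};

my $before = $key_for->($old);
my $new    = "$dir/renamed.jpg";
rename $old, $new or die "rename: $!";
my $after = $key_for->($new);

# renaming keeps dev, inode, and mtime, so the cache key is unchanged
print( ($after eq $before) ? "same key\n" : "different key\n" );
```

A rename only rewrites directory entries; the inode, its device, and its modification time are untouched, so the cached vector is found again under the new name.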

Line 41 defines the array to hold the 48-element vector for the image, while line 43 prints a label for tracing the operation.

Lines 44 to 46 try to get the vector simply by looking at the cache. If that's possible, then we don't have any hard work to do for this particular image.

Otherwise, starting in line 48, we'll compute the vector. First, an Image::Magick object is created (line 48) and then the file is read in as an image (line 49). If there's an error, we then decide if the file needs to be renamed (line 50). If the error string contains corrupt or unexpected end-of-file (the two cases I saw most frequently), then lines 51 to 53 move the file into the designated directory, keeping its original image name, but discarding the source directory. Yes, this would make it hard to restore the file to its original location if I were scanning a large hierarchy, but then again, the image is already corrupt, so it's probably no big deal. If there's any other error, the file is merely skipped and noted (line 55).

Once the image object is created, we show some statistics in line 59 about the image size. Lines 60 through 62 normalize the image (adjusting the brightness and contrast for maximum range), change the image to a 4x4 grid, and then set the ``type'' to rgb, which permits us to extract the RGB triples in line 63. ImageToBlob returns a 48-byte string, which gets expanded by the unpack operation into a series of values from 0 to 255. Line 64 stores this vector into the cache for later extraction.

Lines 66 to 76 implement the bucket matching algorithm described earlier. First, a linear scan is made of all the existing buckets using the loop starting in line 66. Line 67 defines the accumulated error as we look at pairs of values from the two vectors being compared. Line 68 walks an index value through 0 to 47, so we can examine each pair. Line 69 increases the error sum by the newest difference. If the sum exceeds the total permitted according to the fuzz factor, then we bail and go on to the next bucket. Note that this means that for many vector comparisons, we can abort very early, as soon as the sum of the differences exceeds 240 units (5 times 48). Thus, if the upper left pixel of the new image is bright red, but the upper left pixel of the image being compared is black, the difference is already 255, and we abort after one comparison. I think this is the part that allows me to run the 8000 images in reasonable time.

If a given image matches ``close enough'' to a particular bucket, we end up at line 72. We add the filename to the end of the bucket list, and then report that this image has been identified as similar enough, giving a list of all previous matching names. Note that this is not necessarily the final list, because later images may also match this bucket. Also note that this set of comparisons isn't quite right, because we use the vector of only the first image of a bucket for all remaining comparisons, even though there's a slim possibility that a new image might have matched one of the other bucket members instead. Again, this is still close enough and fast enough for my purposes.

If the bucket list is completely scanned, but we still haven't found a match, we end up in line 76, which simply adds a new bucket with the current vector and filename.

Lines 79 to 96 handle the interesting task of displaying the similar images, letting me choose which (if any) to delete. First, the bucket is turned into an array in line 80, and the vector is discarded in line 81. If there's only one item in the bucket, we skip it in line 82.

Lines 83 to 86 create a montage of the images. First, a container image is created, then loaded with all of the images using their names in line 84. Then a montage is created that scales each image to within a 400 by 400 thumbnail, annotated below by the image number (in brackets), the image file name, the width and height, and the number of bytes of the image file size (often an indication of differing JPEG quality).

Lines 87 and 88 display this montage onto my currently selected X11 server. The popup window can scroll if needed (using a separate scroll widget). Usually, though, I just need to glance at the images, decide whether they are the same, and if so, which one (or more) should be deleted, noting the image numbers in the brackets. I then press space to dismiss the image window, and am presented with a prompt in line 89. If I want to delete any of the images, I enter the corresponding integers there, or just press return if I don't want to delete anything. I can enter more than one value, separated by spaces or commas, because line 90 rips out each consecutive run of digits as one of the values. The grep keeps the integers from being out of range (I once typed 0 and deleted the wrong image by mistake).
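That one line of input parsing packs in two ideas: a global match in list context to extract every run of digits, and a grep to discard anything out of range. A small sketch with a hypothetical reply string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $count = 4;              # pretend the montage held 4 images
my $reply = "1, 3 7\n";     # hypothetical user input at the prompt

# the global match yields every digit run (1, 3, 7); the grep
# then drops anything outside 1 .. $count
my @dead = grep { $_ >= 1 and $_ <= $count } $reply =~ /(\d+)/g;

print "@dead\n";            # prints "1 3": 7 is out of range and dropped
```

Any separator works (spaces, commas, even stray punctuation), because the regex only cares about the digit runs themselves.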

Lines 91 to 95 take this list of integers and get the corresponding filenames back from the image object. Note that I don't use the original @names array, because I'm paranoid that maybe Image::Magick might have skipped over some files that have disappeared or become unreadable between the time that I scanned them and put them into the bucket and the time that I'm now displaying their montage. Line 93 provides a trace of the action, and line 94 does the deed, blowing away that redundant waste of disk space.

Lines 98 to 102 provide a convenience subroutine used earlier, to display the warning if a value happens to be true.

And there you have it. Lots of stuff going on there, but simple in its design, and yet powerful and customizable. And now my disk is getting cleaner, because I can remove those redundant images. Until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use File::Copy qw(move);
        =6=     
        =7=     use Image::Magick;
        =8=     use Cache::FileCache;
        =9=     
        =10=    sub warnif;
        =11=    
        =12=    ## config
        =13=    my $FUZZ = 5; # permitted average deviation in the vector elements
        =14=    my $CORRUPT = "CORRUPT"; # if defined, rename corrupt images into this dir
        =15=    ## end config
        =16=    
        =17=    my $cache = Cache::FileCache->new({
        =18=                                       namespace => 'findimagedupes',
        =19=                                       cache_root => (glob("~/.filecache"))[0],
        =20=                                      });
        =21=    
        =22=    
        =23=    my @buckets;
        =24=    
        =25=    FILE: while (@ARGV) {
        =26=      my $file = shift;
        =27=      if (-d $file) {
        =28=        opendir DIR, $file or next FILE;
        =29=        unshift @ARGV, map {
        =30=          /^\./ ? () : "$file/$_";
        =31=        } sort readdir DIR;
        =32=        next FILE;
        =33=      }
        =34=    
        =35=      next FILE unless -f _;
        =36=    
        =37=      my (@stat) = stat(_) or die "should not happen: $!";
        =38=    
        =39=      my $key = "@stat[0, 1, 9]"; # dev/ino/mtime
        =40=    
        =41=      my @vector;
        =42=    
        =43=      print "$file ";
        =44=      if (my $data = $cache->get($key)) {
        =45=        print "... is cached\n";
        =46=        @vector = @$data;
        =47=      } else {
        =48=        my $image = Image::Magick->new;
        =49=        if (my $x = $image->Read($file)) {
        =50=          if (defined $CORRUPT and $x =~ /corrupt|unexpected end-of-file/i) {
        =51=            print "... renaming into $CORRUPT\n";
        =52=            -d $CORRUPT or mkdir $CORRUPT, 0755 or die "Cannot mkdir $CORRUPT: $!";
        =53=            move $file, $CORRUPT or warn "Cannot rename: $!";
        =54=          } else {
        =55=            print "... skipping ($x)\n";
        =56=          }
        =57=          next FILE;
        =58=        }
        =59=        print "is ", join("x",$image->Get('width', 'height')), "\n";
        =60=        warnif $image->Normalize();
        =61=        warnif $image->Resize(geometry => '4x4!');
        =62=        warnif $image->Set(magick => 'rgb');
        =63=        @vector = unpack "C*", $image->ImageToBlob();
        =64=        $cache->set($key, [@vector]);
        =65=      }
        =66=      BUCKET: for my $bucket (@buckets) {
        =67=        my $error = 0;
        =68=        INDEX: for my $index (0..$#vector) {
        =69=          $error += abs($bucket->[0][$index] - $vector[$index]);
        =70=          next BUCKET if $error > $FUZZ * @vector;
        =71=        }
        =72=        push @$bucket, $file;
        =73=        print "linked ", join(", ", @$bucket[1..$#$bucket]), "\n";
        =74=        next FILE;
        =75=      }
        =76=      push @buckets, [[@vector], $file];
        =77=    }
        =78=    
        =79=    for my $bucket (@buckets) {
        =80=      my @names = @$bucket;
        =81=      shift @names;                 # first element is vector
        =82=      next unless @names > 1;       # skip unique images
        =83=      my $images = Image::Magick->new;
        =84=      $images->Read(@names);
        =85=      my $montage =
        =86=        $images->Montage(geometry => '400x400', label => "[%p] %i %wx%h %b");
        =87=      print "processing...\n";
        =88=      $montage->Display();
        =89=      print "Delete? [none] ";
        =90=      my @dead = grep { $_ >= 1 and $_ <= @$images } <STDIN> =~ /(\d+)/g;
        =91=      for (@dead) {
        =92=        my $dead_name = $images->[$_ - 1]->Get('base-filename');
        =93=        warn "rm $dead_name\n";
        =94=        unlink $dead_name or warn "Cannot rm $dead_name: $!";
        =95=      }
        =96=    }
        =97=    
        =98=    use Carp qw(carp);
        =99=    sub warnif {
        =100=     my $value = shift;
        =101=     carp $value if $value;
        =102=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.