Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Linux Magazine Column 32 (Jan 2002)

[suggested title: Smoke and Mirrors]

The other day, I was looking at rsync to set up the publishing of my website from a CVS-managed archive. I thought it would be simple to use rsync in ``archive'' mode to accurately mirror a staging directory. But I just couldn't get the hooks right, and the --exclude syntax still bugged me. And I wanted to ignore specific differences, and add mail notification when certain pages got updated.

So I hunted around for an rsync-like tool in Perl that I could modify to fit my needs. I didn't think it'd be that hard, so I figured someone had already invented that wheel. I looked first in the Perl Power Tools project (http://language.perl.com/ppt/), but alas, not there. Even searching the net for ``Perl'' and ``mirror'' or ``rsync'' didn't help. So I did the next best thing: I wrote one from scratch.

I thought it would be short, but by the time I had something decent, I ended up with [listing one, below]. Now, a very important warning before I go on. This is a work-in-progress version. I've not put a lot of time into it, although I will in the future. If you use this version, and you lose data, don't sue me. Also, since it's pretty long, I'll just hit the high points in the listing, where the cool stuff actually lives.

Lines 1 through 3 start nearly every program I write, turning on warnings, enabling compiler restrictions, and disabling output buffering.

Lines 5 through 7 bring in the needed modules. File::Find and File::Copy are part of the core Perl, while File::Compare is in the CPAN at the moment (originally written by my good friend and fellow Perl instructor, Joseph Hall).

Lines 9 to 12 show that this is a work in progress, by including a to-do list right in the source.

Lines 16 to 33 take the place of command-line arguments. Since I'm using this code for a specific task, I just put all the goodies right here in the program.

The source tree is defined by $SRC, with the root of the tree as the first element of the two-element array. The second element is a coderef that will be called to see if a particular file should be excluded from consideration. Two parameters will be passed: the basename of the file, and the full path to the file. In this case, I'm ignoring any GNU Emacs editing backups, as well as any file that is inside a CVS holding area. That's probably not enough for my production use, but it was fine for testing to ensure that the concepts were down pat. Those items get a non-zero return value, and everything else gets a 0, meaning ``yes, this file is significant''. If $SRC is a simple scalar instead of an arrayref, then that's specifying only the top of the tree, with nothing excluded.

Similarly, $DST defines the top of the destination tree, and another ignore list. Having two separate lists is something I couldn't get rsync to do, and this is what started the whole rewrite in Perl.

Lines 29 to 31 set parameters that are similar in function to their rsync counterparts. (In fact, I'm the reason that --delete-excluded exists in rsync, which they added in response to a bug report from me.)

Line 33 sets up a trace on the actions performed, so I can see if it's doing the right thing. In production, I'd probably leave this turned off, so that only error messages are shown.

Lines 37 to 44 set up the three phases of mirroring. First, I go through the destination tree, and see if anything's present that's missing from the source tree. If so, it gets deleted since it's no longer needed. Then, files get copied from the source to the destination. And finally, we fix up the permissions and timestamps and such. To understand the parameters, let's skip down to the subroutine, starting in line 50.

Line 50 begins the ``tree-walker''. Given a ``from'' tree and a ``to'' tree, it does some basic existence checks, and calls a ``compare'' routine to see if an ``action'' needs to be performed.

The first parameter (scarfed in line 51) determines whether it's a finddepth (bottom up) or a find (top down) walk. For the delete phase, for example, we need to delete the files within a directory before we delete the directory itself, so I pass a 1 to indicate that contents are done before the containers. However, to add new things, the directories must exist before the contents, so I pass 0 instead.

Lines 52 and 53 fetch the ``from'' tree and ``to'' tree, which will be some arrangement of the $SRC and $DST values from above. The walk_expand helper function helps us deal with it being sometimes an arrayref and sometimes a simple scalar.

Lines 54 and 55 extract the compare and action coderefs. Line 56 notes if we should use the ignore function on the destination as well as the source, to implement $DELETE_EXCLUDED during the delete phase.

Lines 68 to 68 are where most of the business happens. We call either find or finddepth from File::Find, passing it a coderef for the wanted subroutine, and the starting point of $from. For each found item, we see if it's of interest (line 60), and if so, compute the corresponding filename in the $to tree. If that entry doesn't exist (line 62) or is not of interest (line 63) or is ``different'' (line 64), we call the action routine, passing the full source and destination paths.

So, that handles the top level algorithm, used three times. Next, we have to provide appropriate compare and action routines for each phase, starting in line 75.

The delete comparison routine is simple. An existing file is worth deleting if the ``mode'' stat value shows that they are different fundamental types (file versus directory versus symbolic link, etc), as computed in line 78. Yes, smoke and mirrors, but straightforward if you understand the return value from stat.

The delete action routine (starting in line 82) is a bit more complicated: it could be a file or a directory, so we just try each in turn. Now this will still fail if it's a non-empty directory, but hopefully we've already deleted all the contents before we get here, and if we haven't, something is amiss anyway.

Line 93 begins the copy compare routine. Again, if it's a different type, it's automatically different. However, for files, we now look at more things before saying ``yes they are different''. If they have different modification times, that's a clear yes. But we can also check the access time, and the actual contents as well. Note that I don't bother checksumming, since my ``Perl rsync'' isn't being used over a network. I just compare the two files directly.

Line 107 begins the copy action routine. For safety, I copy the file into a temporary name, and then rename it (starting in line 112). Of course, things can break there, so I check the results of each step and issue an error if needed. For directories, simply creating the directory with mode 0777 is enough (in line 123), since the next phase will patch up the permissions.

And finally, the cleanup phases. Lines 133 to 137 define the compare routines, announcing that a file needs cleaning up if it differs in timestamps, ownership, or permissions. To quickly check this, I put all those numbers into an array, and then interpolate the array into a double-quoted string, which puts spaces between the elements. If the two created strings differ, then at least one thing differed. Yeah, there are other ways. This is Perl, after all.

Lines 139 to 159 handle the actual patchups. First, I bail quickly if either item is a symlink, since I really don't have symlinks well inserted into the program yet. Otherwise, for each of permissions (starting in line 143), timestamps (line 149) and ownership (line 155), I look at the old value, and set the new value if they differ.

And there you have it, a Perl-like version of the ``archive'' mode of rsync, with plenty of places to fine-tune the behaviour for my specific application. I'll be tweaking this in the future, so contact me if you want to see the updated version when I'm done. Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use File::Find qw(find finddepth);
        =6=     use File::Copy qw(copy);
        =7=     use File::Compare qw(compare);
        =8=     
        =9=     ## todo:
        =10=    ##  symlinks (must determine sensible rewrite rules)
        =11=    ##  hardlinks (maintain dev/ino maps for source and dest)
        =12=    ##  sparse files?
        =13=    
        =14=    ## start config
        =15=    
        =16=    my $SRC = ["/home/merlyn/www-src",
        =17=               sub { # ignore source-management things:
        =18=                 return 1 if $_[0] =~ /~\z/; # files ending in tilde
        =19=                 return 1 if $_[1] =~ m{/CVS(\z|/)}; # CVS files
        =20=                 0;
        =21=               },
        =22=              ];
        =23=    my $DST = ["/home/merlyn/public_html",
        =24=               sub { # ignore web-management things:
        =25=                 return 1 if $_[0] eq ".htaccess"; # sacred
        =26=                 0;
        =27=               }];
        =28=    
        =29=    my $DELETE_EXCLUDED = 1;
        =30=    my $CHECK_CONTENT = 1;
        =31=    my $CHECK_ATIME = 0;
        =32=    
        =33=    my $TRACE = 1;
        =34=    
        =35=    ## end config
        =36=    
        =37=    ## delete phase:
        =38=    walk (1, $DST, $SRC, \&delete_compare, \&delete_action, $DELETE_EXCLUDED);
        =39=    
        =40=    ## copy directories/files phase:
        =41=    walk (0, $SRC, $DST, \&copy_compare, \&copy_action);
        =42=    
        =43=    ## clean up meta-stuff phase:
        =44=    walk (1, $SRC, $DST, \&cleanup_compare, \&cleanup_action);
        =45=    
        =46=    exit 0;
        =47=    
        =48=    ## subroutines:
        =49=    
        =50=    sub walk {
        =51=      my $find_func = shift(@_) ? \&finddepth : \&find;
        =52=      my ($from, $from_ignore) = walk_expand(shift);
        =53=      my ($to, $to_ignore) = walk_expand(shift);
        =54=      my $compare = shift;
        =55=      my $action = shift;
        =56=      my $delete_excluded = shift;
        =57=    
        =58=      $find_func->
        =59=        (sub {
        =60=           return if $from_ignore and $from_ignore->($_, $File::Find::name);
        =61=           my $to_name = $to.substr($File::Find::name, length($from));
        =62=           if (not -e $to_name
        =63=               or $delete_excluded and $to_ignore->($_, $File::Find::name)
        =64=               or $compare->($File::Find::name, $to_name)
        =65=              ) {
        =66=             $action->($File::Find::name, $to_name);
        =67=           }
        =68=         }, $from);
        =69=    }
        =70=    
        =71=    sub walk_expand {
        =72=      ref($_[0]) ? @{$_[0]} : $_[0];
        =73=    }
        =74=    
        =75=    sub delete_compare {            # compare two existing files for differences
        =76=      my ($dst, $src) = @_;
        =77=      my @s = map [stat $_], @_;
        =78=      return 1 if ($s[0][2] >> 12) <=> ($s[1][2] >> 12); # not the same type
        =79=      return 0;
        =80=    }
        =81=    
        =82=    sub delete_action {
        =83=      my ($dst, $src) = @_;
        =84=      if (unlink $dst) {
        =85=        warn "rm $dst\n" if $TRACE;
        =86=      } elsif (rmdir $dst) {
        =87=        warn "rmdir $dst\n" if $TRACE;
        =88=      } else {
        =89=        warn "#ERROR# cannot eliminate $dst\n";
        =90=      }
        =91=    }
        =92=    
        =93=    sub copy_compare {
        =94=      my ($src, $dst) = @_;
        =95=      my @s = map [stat $_], @_;
        =96=      return 1 if ($s[0][2] >> 12) <=> ($s[1][2] >> 12); # not the same type
        =97=      if (not -l $src and -f _) {   # plain files both of ya
        =98=        return 1 if $s[0][9] <=> $s[1][9]; # not same mtime
        =99=        return 1 if $CHECK_ATIME
        =100=         and $s[0][8] <=> $s[1][8]; # not same atime
        =101=       return 1 if $CHECK_CONTENT
        =102=         and compare $src, $dst;   # not same content
        =103=     }
        =104=     0;                            # not different
        =105=   }
        =106=   
        =107=   sub copy_action {
        =108=     my ($src, $dst) = @_;
        =109=     if (-l $src) {
        =110=       warn "#ERROR# cannot symlink from $src to $dst (yet)\n";
        =111=     } elsif (-f $src) {
        =112=       if (copy $src, my $new = "$dst.$$.".time) {
        =113=         warn "cp $src $new\n" if $TRACE;
        =114=         if (rename $new, $dst) {
        =115=           warn "mv $new $dst\n" if $TRACE;
        =116=         } else {
        =117=           warn "#ERROR# cannot mv $new $dst: $!\n";
        =118=         }
        =119=       } else {
        =120=         warn "#ERROR# cannot cp $src $new: $!\n";
        =121=       }
        =122=     } elsif (-d $src) {
        =123=       if (mkdir $dst, 0777) {
        =124=         warn "mkdir $dst\n" if $TRACE;
        =125=       } else {
        =126=         warn "#ERROR# cannot mkdir $dst: $!\n";
        =127=       }
        =128=     } else {
        =129=       warn "#ERROR# don't know how to copy $src to $dst\n";
        =130=     }
        =131=   }
        =132=   
        =133=   sub cleanup_compare {
        =134=     my ($src,$dst) = @_;
        =135=     my @s = map [(lstat $_)[4,5,8,9], (stat _)[2] & 07777], @_;
        =136=     return "@{$s[0]}" cmp "@{$s[1]}";
        =137=   }
        =138=   
        =139=   sub cleanup_action {
        =140=     return if grep -l, @_;
        =141=     my ($src, $dst) = @_;
        =142=     my @s = map [lstat $_], @_;
        =143=     if ((my $oldperm = $s[0][2] & 07777) != ($s[1][2] & 07777)) {
        =144=       warn "setting perms on $dst\n" if $TRACE;
        =145=       chmod $oldperm, $dst
        =146=         or warn "#ERROR# can't update perms on $dst: $!";
        =147=     }
        =148=   
        =149=     if ("$s[0][8] $s[0][9]" ne "$s[1][8] $s[1][9]") {
        =150=       warn "setting times on $dst\n" if $TRACE;
        =151=       utime $s[0][8], $s[0][9], $dst
        =152=         or warn "#ERROR# can't update times on $dst: $!";
        =153=     }
        =154=   
        =155=     if ("$s[0][4] $s[0][5]" ne "$s[1][4] $s[1][5]") {
        =156=       $< and chown $s[0][4], $s[0][5], $dst
        =157=         or warn "#ERROR# can't update ownership of $dst: $!";
        =158=     }
        =159=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.