Perl Journal Column 14 (Jul 2004)
[Suggested title: ``Cleaning up a symlink mess'']
The box that hosts www.stonehenge.com also takes care of www.geekcruises.com, the company web-site for my buddy, ``Captain'' Neil. As such, I have a dual role: I'm not only a frequent Geek Cruise attendee -- I'm also the webmaster!
Recently, I noticed that Neil had moved a few pages around on his site to reorganize some of the information on past cruises. Since quite a few links pointing at the old location of a given page had already been announced and bookmarked, he didn't want to break those. So he naively placed a symbolic link from the old location to the new location. This means that a reference to the old location such as:
http://www.geekcruises.com/cruises/2003/perlwhirl3.html
would be the same as:
http://www.geekcruises.com/past_cruises/perlwhirl3.html
because he had moved the page as follows:
  $ cd /data/web/geekcruises    # the DocumentRoot for his server
  $ cd cruises/2003
  $ mv perlwhirl3.html ../past_cruises
  $ ln -s ../past_cruises/perlwhirl3.html .
Now, at first glance, this appears to ``work''. When either page is referenced, the same material is delivered by the server.
However, there's no way for anyone outside my server to know that these two pages are absolutely identical. This means that any cache (including browser caches, outward border caches at large organizations, or even our own reverse proxy cache) would now have two copies of the same material, having fetched the material needlessly twice.
Worse, some of the relative URLs are now somewhat broken. In the original location, getting back up to the index page requires ../../index.html, but in the new location, it's merely ../index.html. It was for this reason that I actually noticed the symlinks in the first place: a badly constructed web crawler was sucking down multiple copies of the website, thinking that each index.html at the top level was different as well!
The correct way to move such a page that might have been bookmarked or indexed is to have Apache issue an HTTP redirect when the old URL is referenced. For example, in the configuration file for the Geek Cruises website, we can add:
Redirect /cruises/2003/perlwhirl3.html http://www.geekcruises.com/past_cruises/perlwhirl3.html
With this line in the configuration, a browser requesting the old URL will be asked to fetch the new URL instead. This redirect (also called an external redirect) is sufficient to ensure that caches will cache only one version (at the new URL), and indexers such as Google will invalidate the old URL over time.
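To make that concrete, the exchange looks roughly like this (a sketch of the conversation; Apache's plain Redirect answers with a temporary 302 status by default, while a Redirect permanent would send a 301 instead):

  GET /cruises/2003/perlwhirl3.html HTTP/1.1
  Host: www.geekcruises.com

  HTTP/1.1 302 Found
  Location: http://www.geekcruises.com/past_cruises/perlwhirl3.html

The browser then issues a second request for the Location URL, so that's the only copy a cache or indexer ends up holding.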
Now, Neil doesn't have direct access to the web server's master configuration file, but he can add .htaccess files in the various affected directories. That particular Redirect line can be placed directly into a .htaccess file in the cruises/2003 subdirectory, and it would have the same result, as sketched below.
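For instance, that per-directory file might contain nothing more than this (the comment line is just my annotation):

  # /data/web/geekcruises/cruises/2003/.htaccess
  Redirect /cruises/2003/perlwhirl3.html http://www.geekcruises.com/past_cruises/perlwhirl3.html

Note that the first argument is still the full URL-path from the top of the site, even in a per-directory file; the program below builds its Redirect lines in exactly that form.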
When I saw a few dozen of these symlinks all over the document tree for Neil's server, I explained this to him, and then said it'd actually be a small matter of programming to automatically replace all of those symlinks with updated .htaccess files. When all I heard was silence at the other end of the connection, I recognized that I'd need to write the program myself, since I'd now claimed it could be done. And that program is in [listing one, below].
Lines 1 through 3 start nearly every program I write, enabling warnings during development, turning on compiler restrictions (forbidding undeclared variables, symbolic references, and barewords), and turning off the pesky output buffering.
Lines 5 through 9 define my configuration parameters. The $URL is needed because an external redirect has to include the hostname, and there's no easy way to get at that from inside the .htaccess file. The $USER and $GROUP are the values for the newly created or updated .htaccess file; I'm running this as root, so I have to set them correctly for Neil to be able to edit the file later. And $MODE gives the permissions for a new .htaccess file.
Line 11 pulls in the abs_path routine from the core module Cwd. Lines 13 and 14 use my File::Finder module (found in the CPAN) to easily get a list of directories below the document root.
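For the curious, roughly the same list could be built with the core File::Find module instead (a sketch only, not what the listing actually uses):

  use File::Find;
  my @dirs;
  find(sub { push @dirs, $File::Find::name if -d }, $DIR);

File::Finder simply gives me a more compact, find(1)-like way to say the same thing.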
Lines 17 to 60 iterate over each of those directories, which can be considered completely separately. Line 18 finds all the symbolic links within that directory, using a simple grep over the result of a glob. Note that I'm presuming Unix file syntax here, but that's safe, because I know my server box is not likely to ever be anything but Unix. Had I wanted this a bit more portable, I'd use File::Spec to construct the path, along the lines of the sketch below.
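Such a portable variant might look something like this (a sketch under that assumption; the listing itself sticks with the Unix-style string):

  use File::Spec;
  my $pattern  = File::Spec->catfile($dir, '*');   # "$dir/*" on Unix, but built portably
  my @symlinks = grep { -l } glob $pattern;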
Line 20 sets up the list of @deletes. These are the candidate symbolic links that are being replaced with .htaccess redirects, and can be deleted once the updated .htaccess file is in place. Line 21 computes the name of the .htaccess file for this particular directory.
Lines 23 to 45 process each symbolic link that was found in the directory separately. First, the target of the symbolic link is read in lines 24 and 25. If the $path is not defined, it's either not a symbolic link or something went horribly wrong, and we ignore it.
Next, lines 26 and 27 ignore absolute symbolic links. I'm not sure why this code is in there, but it seemed to be the safest thing to do, since I only wanted to fix relative symbolic links. I've learned over the years that when you have root power, and you're mucking around with stuff and deleting and replacing a lot of files, it's safest to try to ignore everything that doesn't precisely fit your desired goal.
Line 28 uses abs_path to compute the resulting absolute path of the symbolic link target. Line 29 is left over from debugging, where I wanted to see if my calculations were correct for all of the existing links.
Lines 30 and 31 strip off the document root path from the source of the symbolic link. I need to do this to ensure that my Redirect command is framed in terms of URLs and not Unix pathnames. The \Q quotes any metacharacters in the pathname. Again, this is a safe thing to do, even though I know there are no metacharacters in the particular paths I've configured at the top. Always be very conservative with Root Power.
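As a tiny illustration of \Q at work (with a made-up path, not one from the real tree):

  my $root = "/data/web/site (archive)";                 # parentheses are regex metacharacters
  my $link = "/data/web/site (archive)/old/page.html";
  if ($link =~ m{^\Q$root\E/(.*)}s) {
    print "tail is $1\n";                                # prints "tail is old/page.html"
  }

Without the \Q...\E, the parentheses would be treated as a capturing group rather than literal text, and the match would fail.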
Line 32 takes the matched tail part of the symbolic link source and builds the source URL for the Redirect. I'm presuming the $1 here has been properly set from the previous match, and there's no possible way I'm using the value from a stale match.
Lines 33 through 35 repeat the stripping and building for the destination path, although I now have to create a full scheme-based URL for the path. Without the http prefix, Apache would have treated this operation as an internal redirect, with all the same problems as a simple symlink, because no indication would be sent to the client that something had moved.
Lines 36 to 42 create the NEW handle to which the new .htaccess file is written. This happens only once per directory, because after the first time, the @deletes array contains some previous entry. The existing .htaccess file (if any) is also copied to the beginning of the new .htaccess file. Note that we're using the OLD filehandle in a list context, so it gets slurped in as a list of lines, then immediately dumped to the new filehandle.
Line 43 writes the proper Redirect command to the new .htaccess file. Line 44 marks the symbolic link as one to be deleted once the .htaccess file is in place.
Lines 46 to 59 are executed once per directory, but only if some symbolic link was found that was eligible for removal.
First, line 47 closes the output file handle to ensure that the data is completely flushed.
Then, lines 48 through 51 set the permissions and ownership on the new file to be as defined by the configuration parameters at the top of the program.
Lines 52 to 53 try to rename the existing .htaccess file out of the way to end in .OLD. Again, with Great Power comes Great Responsibility, including the mandate to have a Great Undo when one makes a Great Mistake. So, after each run of this program, I can verify that I've not completely mangled the .htaccess files, and then delete the .OLD files manually. And carefully.
Lines 54 and 55 move the newly created .htaccess file into place. Note that because I create the completed file under a separate name, and then rename it atomically into place, there's no chance that the live Apache process will read a partially written .htaccess file. This is a very important principle when dealing with live production activity.
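Reduced to its essentials, the pattern is: write the complete new version under a temporary name in the same directory, then rename it over the real name in a single step. A generic sketch (with placeholder names like $file and @new_contents, not the listing's own variables):

  open my $new, '>', "$file.NEW" or die "Cannot create $file.NEW: $!";
  print $new @new_contents;                 # write everything before the swap
  close $new or die "Cannot close $file.NEW: $!";
  rename "$file.NEW", $file                 # atomic, as long as both names are on the same filesystem
    or die "Cannot rename $file.NEW to $file: $!";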
Finally, lines 56 to 58 delete the existing useless symbolic links one at a time. And that's all there is!
Oddly enough, as I was writing this article, I discovered Yet Another Symlink in the tree. So I ran the code once again, and sure enough, it got replaced with the right redirect. Good thing I've kept this code around. Looks like it's time to send Neil another refresher message, or maybe just disable symbolic links in his tree. Until next time, enjoy!
Listings
        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=
        =5=     my $URL = "http://www.geekcruises.com";
        =6=     my $DIR = "/data/web/geekcruises";
        =7=     my $USER = 2100;
        =8=     my $GROUP = 2100;
        =9=     my $MODE = 0644;
        =10=
        =11=    use Cwd qw(abs_path);
        =12=
        =13=    use File::Finder;
        =14=    my @dirs = File::Finder->type('d')->in($DIR);
        =15=
        =16=    # print "$_\n" for @dirs;
        =17=    for my $dir (@dirs) {
        =18=      my @symlinks = grep -l, glob "$dir/*";
        =19=      # print "$dir: @symlinks\n";
        =20=      my @deletes;
        =21=      my $htaccess = "$dir/.htaccess";
        =22=
        =23=      for my $symlink (@symlinks) {
        =24=        defined(my $path = readlink($symlink)) or
        =25=          warn("Cannot read $symlink: $!"), next;
        =26=        $path =~ m{^/} and
        =27=          warn("skipping absolute $path for $symlink\n"), next;
        =28=        my $abs_path = abs_path("$dir/$path");
        =29=        # print "$symlink -> $path => $abs_path\n";
        =30=        $symlink =~ m{^\Q$DIR\E/(.*)}s or
        =31=          warn("$symlink doesn't begin with $DIR"), next;
        =32=        my $original_url = "/$1";
        =33=        $abs_path =~ m{^\Q$DIR\E/(.*)}s or
        =34=          warn("$abs_path doesn't begin with $DIR"), next;
        =35=        my $redirect_url = "$URL/$1";
        =36=        unless (@deletes) {
        =37=          ## print "in $dir...\n";
        =38=          open NEW, ">$htaccess.NEW" or die;
        =39=          if (open OLD, $htaccess) {
        =40=            print NEW <OLD>;
        =41=          }
        =42=        }
        =43=        print NEW "Redirect $original_url $redirect_url\n";
        =44=        push @deletes, $symlink;
        =45=      }
        =46=      if (@deletes) {
        =47=        close NEW;
        =48=        chown $USER, $GROUP, "$htaccess.NEW"
        =49=          or die "Cannot chown $htaccess.NEW: $!";
        =50=        chmod $MODE, "$htaccess.NEW"
        =51=          or die "Cannot chmod $htaccess.NEW: $!";
        =52=        ! -e $htaccess or rename $htaccess, "$htaccess.OLD"
        =53=          or die "Cannot mv $htaccess $htaccess.OLD: $!";
        =54=        rename "$htaccess.NEW", $htaccess
        =55=          or die "Cannot mv $htaccess.NEW $htaccess: $!";
        =56=        for (@deletes) {
        =57=          unlink $_ or warn "Cannot unlink $_: $!";
        =58=        }
        =59=      }
        =60=    }