Perl Journal Column 14 (Jul 2004)
[Suggested title: ``Cleaning up a symlink mess'']
The box that hosts www.stonehenge.com also takes care of www.geekcruises.com, the company web-site for my buddy, ``Captain'' Neil. As such, I have a dual role: I'm not only a frequent Geek Cruise attendee -- I'm also the webmaster!
Recently, I noticed that Neil had moved a few pages around on his site to reorganize some of the information on past cruises. Since quite a few links pointing at the old location of a given page had already been announced and bookmarked, he didn't want to break those. So he naively placed a symbolic link from the old location to the new location. This means that a reference to the old location such as:
http://www.geekcruises.com/cruises/2003/perlwhirl3.html
would be the same as:
http://www.geekcruises.com/past_cruises/perlwhirl3.html
because he had moved the page as follows:
  $ cd /data/web/geekcruises    # the DocumentRoot for his server
  $ cd cruises/2003
  $ mv perlwhirl3.html ../past_cruises
  $ ln -s ../past_cruises/perlwhirl3.html .
Now, at first glance, this appears to ``work''. When either page is referenced, the same material is delivered by the server.
However, there's no way for anyone outside my server to know that these two pages are absolutely identical. This means that any cache (including browser caches, outward border caches at large organizations, or even our own reverse proxy cache) would now have two copies of the same material, having fetched the material needlessly twice.
Worse, some of the relative URLs are now somewhat broken. In the original location, getting back up to the index page requires ../../index.html, but in the new location, it's merely ../index.html. It was for this reason that I actually noticed the symlinks in the first place: a badly constructed web crawler was sucking down multiple copies of the website, thinking that each index.html at the top level was different as well!
The correct way to move such a page that might have been bookmarked or indexed is to have Apache issue an HTTP redirect when the old URL is referenced. For example, in the configuration file for the Geek Cruises website, we can add:
Redirect /cruises/2003/perlwhirl3.html http://www.geekcruises.com/past_cruises/perlwhirl3.html
With this line in the configuration, a browser requesting the old URL will be asked to fetch the new URL instead. This redirect (also called an external redirect) is sufficient to ensure that caches will cache only one version (at the new URL), and indexers such as Google will invalidate the old URL over time.
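To make that concrete, the exchange looks roughly like this (a sketch of the conversation; Apache's plain Redirect answers with a temporary 302 status by default, while a Redirect permanent would send a 301 instead):

  GET /cruises/2003/perlwhirl3.html HTTP/1.1
  Host: www.geekcruises.com

  HTTP/1.1 302 Found
  Location: http://www.geekcruises.com/past_cruises/perlwhirl3.html

The browser then issues a second request for the Location URL, so that's the only copy a cache or indexer ends up holding.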
Now, Neil doesn't have direct access to the web server's master configuration file, but he can add .htaccess files in the various affected directories. That particular Redirect line can be placed directly into a .htaccess file in the cruises/2003 subdirectory, and it would have the same result, as sketched below.
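For instance, that per-directory file might contain nothing more than this (the comment line is just my annotation):

  # /data/web/geekcruises/cruises/2003/.htaccess
  Redirect /cruises/2003/perlwhirl3.html http://www.geekcruises.com/past_cruises/perlwhirl3.html

Note that the first argument is still the full URL-path from the top of the site, even in a per-directory file; the program below builds its Redirect lines in exactly that form.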
When I saw a few dozen of these symlinks all over the document tree for Neil's server, I explained this to him, and then said it'd actually be a small matter of programming to automatically replace all of those symlinks with updated .htaccess files. When all I heard was silence at the other end of the connection, I recognized that I'd need to write the program myself, since I'd now claimed it could be done. And that program is in [listing one, below].
Lines 1 through 3 start nearly every program I write, enabling warnings during development, turning on compiler restrictions (forbidding undeclared variables, symbolic references, and barewords), and turning off the pesky output buffering.
Lines 5 through 9 define my configuration parameters. The $URL is needed because an external redirect has to include the hostname, and there's no easy way to get at that from inside the .htaccess file. The $USER and $GROUP are the values for the newly created or updated .htaccess file; I'm running this as root, so I have to set them correctly for Neil to be able to edit the file later. And $MODE gives the permissions for a new .htaccess file.
Line 11 pulls in the abs_path routine from the core module Cwd. Lines 13 and 14 use my File::Finder module (found in the CPAN) to easily get a list of directories below the document root.
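For the curious, roughly the same list could be built with the core File::Find module instead (a sketch only, not what the listing actually uses):

  use File::Find;
  my @dirs;
  find(sub { push @dirs, $File::Find::name if -d }, $DIR);

File::Finder simply gives me a more compact, find(1)-like way to say the same thing.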
Lines 17 to 60 iterate over each of those directories, which can be considered completely separately. Line 18 finds all the symbolic links within that directory, using a simple grep over the result of a glob. Note that I'm presuming Unix file syntax here, but that's safe, because I know my server box is not likely to ever be anything but Unix. Had I wanted this a bit more portable, I'd use File::Spec to construct the path, along the lines of the sketch below.
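Such a portable variant might look something like this (a sketch under that assumption; the listing itself sticks with the Unix-style string):

  use File::Spec;
  my $pattern  = File::Spec->catfile($dir, '*');   # "$dir/*" on Unix, but built portably
  my @symlinks = grep { -l } glob $pattern;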
Line 20 sets up the list of @deletes. These are the candidate symbolic links that are being replaced with .htaccess redirects, and can be deleted once the updated .htaccess file is in place. Line 21 computes the name of the .htaccess file for this particular directory.
Lines 23 to 45 process each symbolic link that was found in the directory separately. First, the target of the symbolic link is read in lines 24 and 25. If the $path is not defined, it's either not a symbolic link or something went horribly wrong, and we ignore it.
Next, lines 26 and 27 ignore absolute symbolic links. I'm not sure why this code is in there, but it seemed to be the safest thing to do, since I only wanted to fix relative symbolic links. I've learned over the years that when you have root power, and you're mucking around with stuff and deleting and replacing a lot of files, it's safest to try to ignore everything that doesn't precisely fit your desired goal.
Line 28 uses abs_path to compute the resulting absolute path of the symbolic link target. Line 29 is left over from debugging, where I wanted to see if my calculations were correct for all of the existing links.
Lines 30 and 31 strip off the document root path from the source of the symbolic link. I need to do this to ensure that my Redirect command is framed in terms of URLs and not Unix pathnames. The \Q quotes any metacharacters in the pathname. Again, this is a safe thing to do, even though I know there are no metacharacters in the particular paths I've configured at the top. Always be very conservative with Root Power.
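As a tiny illustration of \Q at work (with a made-up path, not one from the real tree):

  my $root = "/data/web/site (archive)";                 # parentheses are regex metacharacters
  my $link = "/data/web/site (archive)/old/page.html";
  if ($link =~ m{^\Q$root\E/(.*)}s) {
    print "tail is $1\n";                                # prints "tail is old/page.html"
  }

Without the \Q...\E, the parentheses would be treated as a capturing group rather than literal text, and the match would fail.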
Line 32 takes the matched tail part of the symbolic link source and builds the source URL for the Redirect. I'm presuming the $1 here has been properly set from the previous match, and there's no possible way I'm using the value from a stale match.
Lines 33 through 35 repeat the stripping and building for the destination path, although I now have to create a full scheme-based URL for the path. Without the http prefix, Apache would have treated this operation as an internal redirect, with all the same problems as a simple symlink, because no indication would be sent to the client that something had moved.
Lines 36 to 42 create the NEW handle to which the new .htaccess file is written. This happens only once per directory, because after the first time, the @deletes array contains some previous entry. The existing .htaccess file (if any) is also copied to the beginning of the new .htaccess file. Note that we're using the OLD filehandle in a list context, so it gets slurped in as a list of lines, then immediately dumped to the new filehandle.
Line 43 writes the proper Redirect command to the new .htaccess file. Line 44 marks the symbolic link as one to be deleted once the .htaccess file is in place.
Lines 46 to 59 are executed once per directory, but only if some symbolic link was found that was eligible for removal.
First, line 47 closes the output file handle to ensure that the data is completely flushed.
Then, lines 48 through 51 set the permissions and ownership on the new file to be as defined by the configuration parameters at the top of the program.
Lines 52 to 53 try to rename the existing .htaccess file out of the way to end in .OLD. Again, with Great Power comes Great Responsibility, including the mandate to have a Great Undo when one makes a Great Mistake. So, after each run of this program, I can verify that I've not completely mangled the .htaccess files, and then delete the .OLD files manually. And carefully.
Lines 54 and 55 move the newly created .htaccess file into place. Note that because I create the completed file under a separate name, and then rename it atomically into place, there's no chance that the live Apache process will read a partially written .htaccess file. This is a very important principle when dealing with live production activity.
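Reduced to its essentials, the pattern is: write the complete new version under a temporary name in the same directory, then rename it over the real name in a single step. A generic sketch (with placeholder names like $file and @new_contents, not the listing's own variables):

  open my $new, '>', "$file.NEW" or die "Cannot create $file.NEW: $!";
  print $new @new_contents;                 # write everything before the swap
  close $new or die "Cannot close $file.NEW: $!";
  rename "$file.NEW", $file                 # atomic, as long as both names are on the same filesystem
    or die "Cannot rename $file.NEW to $file: $!";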
Finally, lines 56 to 58 delete the existing useless symbolic links one at a time. And that's all there is!
Oddly enough, as I was writing this article, I discovered Yet Another Symlink in the tree. So I ran the code once again, and sure enough, it got replaced with the right redirect. Good thing I've kept this code around. Looks like it's time to send Neil another refresher message, or maybe just disable symbolic links in his tree. Until next time, enjoy!
Listings
        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=
        =5=     my $URL = "http://www.geekcruises.com";
        =6=     my $DIR = "/data/web/geekcruises";
        =7=     my $USER = 2100;
        =8=     my $GROUP = 2100;
        =9=     my $MODE = 0644;
        =10=
        =11=    use Cwd qw(abs_path);
        =12=
        =13=    use File::Finder;
        =14=    my @dirs = File::Finder->type('d')->in($DIR);
        =15=
        =16=    # print "$_\n" for @dirs;
        =17=    for my $dir (@dirs) {
        =18=      my @symlinks = grep -l, glob "$dir/*";
        =19=      # print "$dir: @symlinks\n";
        =20=      my @deletes;
        =21=      my $htaccess = "$dir/.htaccess";
        =22=
        =23=      for my $symlink (@symlinks) {
        =24=        defined(my $path = readlink($symlink)) or
        =25=          warn("Cannot read $symlink: $!"), next;
        =26=        $path =~ m{^/} and
        =27=          warn("skipping absolute $path for $symlink\n"), next;
        =28=        my $abs_path = abs_path("$dir/$path");
        =29=        # print "$symlink -> $path => $abs_path\n";
        =30=        $symlink =~ m{^\Q$DIR\E/(.*)}s or
        =31=          warn("$symlink doesn't begin with $DIR"), next;
        =32=        my $original_url = "/$1";
        =33=        $abs_path =~ m{^\Q$DIR\E/(.*)}s or
        =34=          warn("$abs_path doesn't begin with $DIR"), next;
        =35=        my $redirect_url = "$URL/$1";
        =36=        unless (@deletes) {
        =37=          ## print "in $dir...\n";
        =38=          open NEW, ">$htaccess.NEW" or die;
        =39=          if (open OLD, $htaccess) {
        =40=            print NEW <OLD>;
        =41=          }
        =42=        }
        =43=        print NEW "Redirect $original_url $redirect_url\n";
        =44=        push @deletes, $symlink;
        =45=      }
        =46=      if (@deletes) {
        =47=        close NEW;
        =48=        chown $USER, $GROUP, "$htaccess.NEW"
        =49=          or die "Cannot chown $htaccess.NEW: $!";
        =50=        chmod $MODE, "$htaccess.NEW"
        =51=          or die "Cannot chmod $htaccess.NEW: $!";
        =52=        ! -e $htaccess or rename $htaccess, "$htaccess.OLD"
        =53=          or die "Cannot mv $htaccess $htaccess.OLD: $!";
        =54=        rename "$htaccess.NEW", $htaccess
        =55=          or die "Cannot mv $htaccess.NEW $htaccess: $!";
        =56=        for (@deletes) {
        =57=          unlink $_ or warn "Cannot unlink $_: $!";
        =58=        }
        =59=      }
        =60=    }