Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Web Techniques Column 31 (Nov 1998)

A popular thing to do on the web is create a ``mirror'' -- a duplicate of a collection of web files into another location, either for personal use, or to provide the information on a second web server.

Many general purpose solutions for mirroring entire web hierarchies exist. Two that come to mind right away are wget (from the GNU project) and w3mir (found in the CPAN).

But those both require a lot of setup, and careful thinking. Necessary if you have a major task, but what about when you're only mirroring a few files, or just one directory?

For example, let's look at a small task. The Internet Relay Chat (IRC) EF-NET channel called #perl is frequented by Perl Hackers such as myself. Now, the channel activity is 24-hours a day, but I can't stick around all the time. Luckily, there's an IRC bot that is logging all the channel traffic into files that are accessible on the web.

The files are all in one directory, and the names are all predictable (being constructed from the date), so there's no point in parsing the HTML returned from the directory just to find out the name of the latest file. We can compute that.

But if I just fetched every file from the current log down to the oldest file every few hours, I would probably be banned from the server. Luckily, there's a routine called LWP::Simple::mirror that creates and/or updates a local file based on timestamps, minimizing the traffic to the web server.

So, I whipped out a quick program to do the mirroring, and the resulting program is in [listing one, below].

Line one invokes my installed Perl (still slightly ahead of the ISP's Perl), and adds the -T and -w switches. The -T switch forces any input data to be untrusted, although there's really no input data in this program. And -w turns on compile-time and run-time warnings. Hopefully, there won't be any of those either.

Line 2 turns on all the recommended compile-time restrictions, disabling symbolic references, requiring variables to be declared, and removing the interpretation of ``barewords'' as quoted strings. I add this to all programs over a few lines long.

Line 3 disables output buffering. This program produces very little output, and what we see, we want to see immediately.

Line 5 brings in the LWP::Simple module. This is part of the LWP library, found at http://www.perl.com/CPAN/modules/by-module/LWP. Pick the latest file that begins with libwww-perl, and fetch it, if you don't already have it. Alternatively, you can invoke

        perl -MCPAN -eshell

and tell it to

        install LWP

at the prompt. That looks simpler.

Line 6 pulls in the HTTP::Status module, defining constants for common errors. (This is another part of LWP, so if you get one, you've got them both.) I think LWP::Simple pulls this in as well, but there's no problem with being redundant here, since I use the constants later. It also made it easier for me to remember what manpage to look up when I wanted more info. Nearly always, documenting stuff for your maintenance programmers is A Good Thing.

Line 8 defines the base URL where we're going to be mirroring from. Note that com is here replaced by Xcom. If you're going to try this program once to see what it does, you'll need to remove the X. And I cannot stress this enough: please do not keep using this program to mirror this particular directory. This is just a proof of concept.

Line 9 executes a chdir to the directory containing the mirrored files. Again, if you want to really run this program, you'll need to make that a decent path. As it is, it would work only on my machine (unless your login name is merlyn and you had created the appropriate directory).

The mirrored files are stored within this directory, so a chdir was the easiest way to make the directories line up. It also happens to be the location where I store the program and call it from cron, but there's no chance that a mirrored file will accidentally overwrite the program. That would be a security hole, and it is prevented because the filenames are all determined by formulas (later), and the mirroring program has a name in this directory that can never be computed, even by accident.

Line 11 enables ``first time'' mode, if the variable named $FIRSTTIME is set to non-zero. In this mode, ``not modified'' errors are ignored. We do this to initially create the mirrored file database. No matter where we get interrupted, if at all, we can restart the program, and it will simply not fetch the ones it has already done. When we finally make a pass that gets seven ``not found'' errors, we're done, and we can turn off this mode. Note that the listing already has $FIRSTTIME set to 0, and again, that's for safety, in case you're not reading this text.

An aside: I hear many people have run programs from my past columns blindly without even reading the corresponding article. This is not likely to get you the results you want, and could quite possibly get others upset with you. Please don't do that. These are not ready-to-run programs, but merely proof of concept to illustrate a particular Web Technique. Enough said.

Lines 13 through 36 form the body of the program, as a single for loop. It's formatted a little oddly, because the ``initialization, test, increment'' part of the loop spans lines 14 through 17. I rarely use for loops, preferring instead to build loops out of naked blocks (blocks that are not part of a larger construct), but the for loop stuff actually worked nicely here.

Lines 14 and 15 are the initialization. I set up the $when variable to be the Unix epoch time of day, minus five hours. This is because I want to get the gmtime operator (later) to give me a format for Eastern Standard Time. I determined that this was the time zone of the machine I was mirroring from, at least as far as when a new logfile was being created. There's also a variable $errors, initially set to 0.
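To make the offset concrete, here is a minimal sketch (not part of the listing) of shifting an epoch time back five hours before handing it to gmtime. The helper name est_fields and the sample timestamp are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: return (day, month, year) as an EST wall clock would
# show them, by shifting the epoch time back five hours and then using
# gmtime, so that no local time zone rules get involved.
sub est_fields {
    my ($epoch) = @_;
    my @t = gmtime($epoch - 5 * 60 * 60);
    return ($t[3], $t[4] + 1, $t[5] + 1900);
}

# 03:00 UTC on 8 Oct 1998 is still 22:00 on 7 Oct in EST.
my $sample = timegm(0, 0, 3, 8, 9, 1998);
my ($day, $mon, $year) = est_fields($sample);
print "day $day, month $mon, year $year\n";
```

The point of the shift is that the date rolls over at midnight Eastern rather than midnight UTC, which is when the remote machine appears to start a new logfile.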

Note that both of these variables are declared as my variables, even though they are inside the top of the for loop. As of Perl 5.004, this makes these variables local to the loop, minimizing the number of globals! What fun. My personal rule is ``as few globals as is absolutely necessary''. I make extensive use of my variables inside subroutines, other block structures and if necessary, at file scope level, just to ensure that a variable is only alive for as long as it needs to be, or rather, that a variable is only visible to the smallest scope of the program that needs it.

Line 16 tells the for loop to continue as long as we have seen fewer than seven errors.

Line 17 is executed at the bottom of the loop, whether we got there by getting to the last statement, or got there via next. That's the beauty of using a for loop -- we can make sure the code is executed each time no matter how we get around. This code offsets the value of $when (defined above) by 24 hours. Don't do this if you use localtime instead of gmtime... you'll be really disappointed when you cross a ``Savings Time'' or ``Summer Time'' boundary, depending on where you are. (My ISP is in Arizona, which doesn't do any clock shifting, so I rarely need to think of that.)

The ``Savings Time'' problem will manifest itself by not fetching the first hour's worth of stuff during the summer, because we think it's earlier than it is. The next pass (if you're doing hourly or greater) will fetch it, however. I erred on the side of the minimum fetches, not the maximum fetches, again to be kind to the server.
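The for loop mechanics described above (two my variables in the initializer, and an increment clause that runs even after next) can be seen in a small standalone sketch. The variable names and the pretend ``success'' rule are made up for illustration and have nothing to do with the real server.

```perl
use strict;
use warnings;

my @visited;
for (my $day = 0, my $misses = 0;   # both are scoped to the loop
     $misses < 3;                   # stop after three misses in a row
     $day++) {                      # runs at the bottom, even after "next"
    push @visited, $day;
    $misses = 0, next if $day < 4;  # pretend days 0..3 succeed
    $misses++;                      # days 4, 5, and 6 "fail"
}
print "@visited\n";                 # 0 1 2 3 4 5 6
```

Neither $day nor $misses exists after the closing brace, which is the ``fewer globals'' benefit mentioned earlier.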

Line 18 takes the timestamp and breaks it into a list of fields (much like C's struct tm) via the list invocation of gmtime. We're most interested in the parts of this value that represent the day of the month, the month number, and the year.

Lines 19 through 23 construct a filename string in $name that looks like Perl.log.07Oct98, which is the filename for the October 7th file. Note that I've performed a bad Y2K failure here, by integer-mod-ing the year by 100. However, I'm not sure whether the server will go to Perl.log.01Jan100 (raw gmtime output) or Perl.log.01Jan00 (mod 100 like I've done) on the first day of the new millennium. I suspect the latter, so I'm betting on it.
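The two candidate names come straight from how gmtime reports the year. This short sketch, separate from the listing, builds both for 1 January 2000; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# gmtime returns the year as "years since 1900", so 2000 comes back as 100.
my @t = gmtime(timegm(0, 0, 12, 1, 0, 2000));

my $raw = sprintf "Perl.log.%02d%s%d",   $t[3], $MONTHS[$t[4]], $t[5];
my $mod = sprintf "Perl.log.%02d%s%02d", $t[3], $MONTHS[$t[4]], $t[5] % 100;

print "$raw\n$mod\n";   # Perl.log.01Jan100, then Perl.log.01Jan00
```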

Now, we have a nice potential logfile name. Note that this did not require going out to the server to get the directory, and parsing the result. That can be rather tricky. Since all the filenames are computable (although we don't know how far back they go), we can simply construct them in reverse chronological order.
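As a rough sketch of that idea, the following generates the most recent few names without touching the network. The helpers log_name and recent_names are hypothetical and are not part of the listing; the starting timestamp is an arbitrary example.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Hypothetical helper: build the logfile name for an epoch time that has
# already been shifted into the server's time zone.
sub log_name {
    my ($epoch) = @_;
    my @t = gmtime $epoch;
    return sprintf "Perl.log.%02d%s%02d", $t[3], $MONTHS[$t[4]], $t[5] % 100;
}

# Hypothetical helper: the $count most recent names, newest first,
# stepping back exactly 24 hours each time.
sub recent_names {
    my ($start, $count) = @_;
    return map { log_name($start - $_ * 86400) } 0 .. $count - 1;
}

print "$_\n" for recent_names(timegm(0, 0, 12, 2, 9, 1998), 3);
```

Starting from 2 October 1998, this walks back across the month boundary to Perl.log.30Sep98 with no special-case code, because all the date arithmetic happens on the epoch value.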

Line 24 keeps the human being (or the cron program) aware of what's going on. (Hmm, could have just printed this, since I unbuffered STDOUT, so either one goes out immediately. Oh well.)

Line 25 is where all the web-activity happens. The LWP::Simple::mirror function (here imported as mirror) returns an integer, an HTTP status value of why things broke if at all. The first parameter is the remote URL, and the second parameter is the local filename. If the file exists, its modtime timestamp is used to send an ``if modified since'' request to the server. Otherwise, a standard fetch is issued.
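For the curious, an ``if modified since'' request carries the local file's timestamp as an HTTP date. LWP builds that header itself, so the program never has to; the hand-rolled http_date helper below is a hypothetical illustration of the format only, and the sample timestamp is made up.

```perl
use strict;
use warnings;

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Hypothetical helper: format an epoch time as an HTTP date in GMT.
sub http_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime $epoch;
    return sprintf "%s, %02d %s %04d %02d:%02d:%02d GMT",
        $DAYS[$wday], $mday, $MONTHS[$mon], $year + 1900, $hour, $min, $sec;
}

# Stand-in for the modification time of an existing local copy.
my $mtime = 907761600;    # 7 Oct 1998, 12:00:00 UTC
print "If-Modified-Since: ", http_date($mtime), "\n";
```

If the server's copy is no newer than that date, it can answer with a short ``not modified'' response instead of sending the whole file again.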

Line 26 notes if this value is RC_OK. If that's true, we fetched a page (it existed, and we either had no copy or an out-of-date copy). When that happens, I reset the error counter, and go on to the next one.

Anything other than that is deemed to be an ``error'' of some kind. The two most common errors will be that we already have an up-to-date version of the file (the modified timestamp of the local file equals the last-modified HTTP header of the remote file), or that the server cannot find the file (it's an invalid URL). We handle those cases in lines 28 to 35.

Lines 28 through 30 note the case where we have an up-to-date copy already. This is indicated by an RC_NOT_MODIFIED code returned, and the local file is left untouched. Note that when this happens, there's very little exchanged between the web server and this program -- just some headers that say ``hey, you already have that version''.

Line 29 notes this case for the output log on standard error. Line 30 handles the ``first time'' mode. If we're doing this for the first time, we don't want seven successful ``not yet modifieds'' to stop us, because we may not have necessarily gotten all the way down to the oldest file. So, I klugely cheat, and bump $errors back down by one when $FIRSTTIME is set. If I had designed this program better, I'd have made the check further up so that $errors wasn't incremented in this case. Hey, I wrote this program in a half hour, so cut me a little slack.
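One way the ``better design'' might look, purely as a sketch: decide the new error count in one place, so that nothing is incremented and then taken back. The helper next_errors and the plain numeric status values are assumptions made here for illustration; the listing itself uses the HTTP::Status constants.

```perl
use strict;
use warnings;

# Plain HTTP status numbers, so this sketch needs no modules.
my ($OK, $NOT_MODIFIED, $NOT_FOUND) = (200, 304, 404);

# Hypothetical helper: given the status of one fetch, the current count of
# consecutive misses, and the first-time flag, return the new count.
sub next_errors {
    my ($rc, $errors, $first_time) = @_;
    return 0       if $rc == $OK;                           # fetched: reset
    return $errors if $rc == $NOT_MODIFIED && $first_time;  # don't count it
    return $errors + 1;                                     # everything else
}

print next_errors($NOT_MODIFIED, 3, 1), "\n";   # first-time mode: stays 3
print next_errors($NOT_MODIFIED, 3, 0), "\n";   # normal mode: becomes 4
print next_errors($NOT_FOUND,    3, 1), "\n";   # not found always counts: 4
```

The results match what the listing does with its increment-then-decrement approach; only the bookkeeping is arranged differently.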

Lines 31 and 32 handle the ``not found'' case. This is an expected error if we've reached the end of the database, or if we're off slightly by asking for a file before it gets created because of timeclock differences. Either way, it just bumps the error count, and we're fine.

Finally, lines 33 and 34 handle any other error. The status_message function (imported from HTTP::Status) shows the full text of an arbitrary HTTP status value.

And that's it. How I'm running this program (yeah, I know, I told you not to, but I'm active on #perl, and I like to see what's said while I'm gone) is as follows.

First, I ran the program once, with $FIRSTTIME set to 1. As it turns out, I aborted the program a few times to make sure that the initial mirroring step was working properly. It did.

Then, I set $FIRSTTIME to 0, and pointed my cron at it, so that it would run every two hours. And it does. Only the logfile that has been written gets copied -- everything else is just a ``not modified'', so it's minimum drain on the server.

And, let me repeat one last time: do not mirror this particular directory on a regular basis... I don't want them to turn the data off. In fact, I wouldn't be surprised if amagosa moves the web address right after this column hits the stands.

Until next time, keep looking in the mirror.

Listings

        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use LWP::Simple;
        =6=     use HTTP::Status;
        =7=     
        =8=     my $WHERE = "http://cronos.toldyouso.Xcom/~select/EFnet-Perl/";
        =9=     chdir "/home/merlyn/Perl/PoundPerl" or die "chdir: $!";
        =10=    
        =11=    my $FIRSTTIME = 0;              # ignore "not modified" errors
        =12=    
        =13=    for (
        =14=         my $when = time - 5*60*60,         # EST (missing first hour in EDT, oh well)
        =15=         my $errors = 0;
        =16=         $errors < 7;
        =17=         $when -= 86400) {
        =18=      my @when = gmtime $when;
        =19=      my $name = sprintf
        =20=        "Perl.log.%02d%s%02d",
        =21=        $when[3],
        =22=        ( qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/ )[$when[4]],
        =23=        $when[5] % 100;             # NOT Y2K, but they probably aren't either
        =24=      warn "fetching $name\n";
        =25=      my $rc = mirror("$WHERE/$name", $name);
        =26=      $errors = 0, next if $rc == RC_OK;
        =27=      $errors++;
        =28=      if ($rc == RC_NOT_MODIFIED) {
        =29=        warn "$name: not modified\n";
        =30=        $errors-- if $FIRSTTIME;
        =31=      } elsif ($rc == RC_NOT_FOUND) {
        =32=        warn "$name: not found\n";
        =33=      } else {
        =34=        warn "$name: Unexpected error: ", status_message($rc), " ($rc)\n";
        =35=      }
        =36=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.