Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Web Techniques Column 10 (January 1997)

I just got myself a shiny new virtual web server at www.stonehenge.com. However, the www.stonehenge.com name has been around for a long time (over a year), having been aliased to www.teleport.com. Before, to get to my stuff specifically, you had to say www.stonehenge.com/~merlyn, which was the same as www.teleport.com/~merlyn. But www.stonehenge.com/~madamex was also the same as www.teleport.com/~madamex, giving you madamex's web pages at Teleport. That's no longer so, since I now control everything that appears at www.stonehenge.com, and www.stonehenge.com/~madamex is just another (missing) directory.

I didn't think moving the www.stonehenge.com address like that would be a very big deal. What I hadn't counted on was the number of people who had been accessing www.teleport.com as www.stonehenge.com, including a number of references in hotlists and the like. Worse, the Lycos search engine had apparently decided that the web pages of all 16,000 Teleport users belonged under www.stonehenge.com. Ugh!

So, I started getting bad hits on my shiny new virtual web server from the first hour I had set it up. For example, someone would try to follow a link for www.stonehenge.com/~tangent/scb.html, and get back nothing except the standard error text. Not very informative, said I, so I decided to help them along a bit.

The Apache web server allows me to create a CGI script to handle particular kinds of errors. In this case, the error was type ``404'' (document not found). By placing this directive:

        ErrorDocument 404 /cgi/404-handler

into a configuration or ``.htaccess'' file, all 404-type errors cause the script /cgi/404-handler to be invoked using the standard CGI protocol, along with a few extra variables that make sense only in an error condition.
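
In sketch form, the smallest possible such handler just speaks the CGI protocol back at the server. (This is a stripped-down stand-in, not the real Listing 1; note the explicit ``Status'' header, which keeps the client seeing a proper 404 result code.)

        #!/usr/bin/perl -Tw
        use strict;
        # minimal 404-handler sketch: an explicit Status header keeps
        # the 404 result code, then a tiny HTML body follows
        print "Content-type: text/html\n";
        print "Status: 404 Not Found\n\n";
        print "<HEAD><TITLE>Not found</TITLE></HEAD>\n";
        print "<BODY>Sorry, no such page here.</BODY>\n";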

I wrote my own error handler to create a custom message. The standard output of the program is returned to the client (according to CGI protocol), while the standard error is attached to the error log. Thus, there are really two ``outputs''. The error log looks like:

        [Tue Sep 24 16:39:31 1996] [404 ERROR] [/~tangent/scb.html] [dial8.ecicnet.org] [http://iceonline.com/home/rodc/jacpac5.html]

while the resulting HTML as viewed by the client looks like:

        <HEAD><TITLE>File Not found</TITLE></HEAD>
        <BODY><H1>File Not found</H1>
        The requested URL /~tangent/scb.html was not found on this server.<P>
        Perhaps you were looking for something at Teleport's web-server,
        such as <A HREF="http://www.teleport.com/~tangent/scb.html">
        http://www.teleport.com/~tangent/scb.html</a>?
        </BODY>

Notice that besides indicating that the file isn't found, the script automatically guesses what the original Teleport address must have looked like, and spits it out in selectable form. (I'm trying to be nice.)

So, let's take a look at 404-handler, shown in Listing 1 [below].

Lines 1 and 2 begin nearly every program I write, turning on taint checking, warnings, and good-programming restrictions (declared variables, no poetry-mode, and no soft references).

Lines 3 through 10 handle the error log message. Because standard error is already attached to the server's error log, we can just write there. The output is a series of fields enclosed in brackets, including the time of day, ``404 ERROR'', and three environment variables. Note that if a variable is empty or missing, we substitute a ``-''.

These environment variables give information about why the 404 handler was called. REDIRECT_URL is the requested web page that wasn't found. REMOTE_HOST is where the request came from. And HTTP_REFERER is the contents of the ``Referer:'' header from the browser, which often (but not always) indicates the web page that contains the bad link. With proper parsing of the error log later, we can try to come up with the culprit.
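
If you're curious about exactly what your server hands to an error handler, a throwaway handler along these lines will dump the whole environment (purely a debugging aid; Apache prefixes the original request's variables with REDIRECT_, so look for REDIRECT_URL and friends):

        #!/usr/bin/perl -Tw
        use strict;
        # debugging aid: dump every environment variable the error
        # handler receives, one per line, as a plain-text 404 response
        print "Content-type: text/plain\nStatus: 404 Not Found\n\n";
        foreach my $var (sort keys %ENV) {
            print "$var=$ENV{$var}\n";
        }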

The remainder of the program is concerned with getting the message back to the browser. Line 11 captures the request URL into a nicer variable, possibly giving a recognizably bad address in the event of some bogus or missing value.

Lines 12 through 30 form an error-trapping eval block. If anything goes wrong in this section, we'll fall through to line 31. Nice technique, and illustrated many times in my past columns.
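
Reduced to its skeleton, the idiom looks like this (a generic sketch, not code from the listing; the file test is made up):

        #!/usr/bin/perl -w
        use strict;
        # error-trapping eval: a die() inside the block aborts only the
        # block, while a successful run exits before the fallback below
        eval {
            die "no such file\n" unless -e "/some/hypothetical/file";
            print "found it\n";
            exit 0;             # success: skip the fallback entirely
        };
        print "fallback: $@";   # $@ holds the die message, if any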

Line 13 checks whether the URL begins with /~. If it does, it most certainly was a Teleport reference, as there have never been any documents at my nice shiny new web server that looked like that. Therefore, we have a candidate for the rewrite to Teleport, and lines 14 to 28 handle that.

Line 14 copies the redirect URL into a new temporary variable, which is transformed into an HTML-safe string in line 15. The ugly regular expression jiggles every control character, every character with the high bit set, and a few dangerous characters into their equivalent entities. Note that I say the quote mark twice, but that's just to keep CPerl mode happy in GNU Emacs.
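
To see the substitution in action, here it is applied to a made-up URL containing a quote, spaces, and an ampersand:

        #!/usr/bin/perl -w
        use strict;
        # worked example of the entity-escaping substitution (line 15)
        my $html = '/~merlyn/"a & b".html';
        $html =~ s/[\x00-\x20"<&>"\x80-\xff]/&\#@{[ord$&]}\;/g;
        print "$html\n";
        # prints: /~merlyn/&#34;a&#32;&#38;&#32;b&#34;.html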

Line 16 constructs a Teleport URL from the HTML-ized URL. The $html variable always begins with a slash, so there's no need to put another one in.

Lines 17 through 26 print out the HTML. Note the use of $html and multiple uses of $tp_html to cause the message to be customized properly.

If we didn't get a /~ URL, that means the client really was looking for a URL at www.stonehenge.com, but that URL doesn't exist. In this case, I mimic the standard response in lines 31 through 41. Note, however, that I've added an HTML comment to let me know the response is coming from my program instead of the standard place. This came in handy while I was debugging, to verify that I had my ``.htaccess'' file set up correctly.

Once I started getting the errors into the log file, I noticed that they often had usable referer addresses in them. I became curious about the content of those pages, to see if there really were bad references there, so I started cutting and pasting the URLs into a browser. After about a dozen of those, I thought ``this ought to be automated''. So, I whipped up a program to do that.

The purpose of the program is to wander through the error log, look for 404 errors that have a reasonable referer URL, fetch that page, and then look both for bad links to the nonexistent pages and for someone to tell about them.

The output of this program looks like:

        [Tue Sep 24 16:39:31 1996 /~tangent/scb.html http://iceonline.com/home/rodc/jacpac5.html]
          hit: href="http://www.stonehenge.com/~tangent/scb.html">Solid
          mailto: HREF="mailto:rodc@iceonline.com">rodc@iceonline.com</A></font><P>

Notice that the program is telling me the original error info, as well as a likely HTML source reference to the bad URL, and any ``mailto:'' URLs in the page (a likely first cut at finding a human who might be able to fix it). This program is presented in [Listing 2].

Lines 1 through 3 provide a standard header. Line 5 pulls in the LWP::Simple library, part of the LWP stuff. LWP::Simple provides a nice routine called ``get'' -- very handy for fetching URLs when you don't need a lot of specialized error checking.
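
In its simplest form, a fetch looks like this (a short sketch; the URL is just an example, and get returns undef on any failure):

        #!/usr/bin/perl -w
        use strict;
        use LWP::Simple;        # exports get() among others
        # fetch a page into a string; get() returns undef on failure
        my $content = get "http://www.stonehenge.com/";
        print defined $content
          ? "fetched " . length($content) . " bytes\n"
          : "fetch failed\n";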

Lines 7 through 12 define a list of places where the error log might be found. I compress the oldest logs into error_log.DIGITS.gz, while the more recent log files are still alive as error_log.DIGITS. The current error log is just error_log. Note the use of map here: I'm transforming error_log.123.gz into ``gunzip <error_log.123.gz|'' so that when the open call gets the string later, it automatically launches a process to uncompress it. Yet another neat trick.
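
The trick hinges on Perl's open: a string ending in ``|'' is run as a command, and we read its standard output (sketch with a made-up filename):

        #!/usr/bin/perl -w
        use strict;
        # magic open: the trailing | runs the command and lets us read
        # its output as if it were an ordinary file (filename made up)
        open(LOG, "gunzip <error_log.123.gz|") or die "open: $!";
        print while <LOG>;      # the uncompressed lines
        close(LOG);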

Line 14 creates the %seen hash, which will help us keep track of the URLs processed so far.

Lines 15 to 35 process each line of the files specified in the @ARGV array, set above. Line 16 rejects any of those lines that aren't 404 errors. Lines 17 through 20 break the lines into the wanted fields.
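
Applied to the sample error-log line from earlier, those steps yield the three fields we care about:

        #!/usr/bin/perl -w
        use strict;
        # worked example: the sample log line, minus brackets, split apart
        my $line = '[Tue Sep 24 16:39:31 1996] [404 ERROR] [/~tangent/scb.html] [dial8.ecicnet.org] [http://iceonline.com/home/rodc/jacpac5.html]';
        $line =~ s/^\[//;
        $line =~ s/\]\s*$//;
        my @fields = split /\] \[/, $line;
        my ($time, $wanted, $ref) = @fields[0,2,4];
        print "$time\n$wanted\n$ref\n";
        # Tue Sep 24 16:39:31 1996
        # /~tangent/scb.html
        # http://iceonline.com/home/rodc/jacpac5.html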

Lines 21 through 23 screen down the potential URLs, eliminating referers that aren't HTTP URLs, that are the result of queries, or that have already been processed. If they make it that far, line 24 shows the result as a potential fetch.

Line 25 performs an entire web fetch. Yes. That's it. The URL is turned into the text that it represents. Well, most of the time. Sometimes the stuff can't be fetched, and that leaves $content undefined, which is exactly what gets checked and reported in lines 26 through 29.

Line 30 scans the resulting text for ``stonehenge'', and returns the entire whitespace-free chunk in which the word appears. If any are found, lines 31 through 34 display the hits and the mailto's, using a similar scanning technique.
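
Here's that scan on a made-up fragment of HTML; each hit is the entire run of non-whitespace characters around the word:

        #!/usr/bin/perl -w
        use strict;
        # the \S*word\S* scan on a made-up HTML fragment
        my $content =
          'see <a href="http://www.stonehenge.com/x.html">here</a> now';
        foreach my $hit ($content =~ /(\S*stonehenge\S*)/mig) {
            print "  hit: $hit\n";
        }
        # prints:   hit: href="http://www.stonehenge.com/x.html">here</a>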

So, there you have it. A custom 404 handler, and a program to help track down where all those errors are coming from. I'm thinking of writing a fancier error-log parser that looks for the REV=made info or candidate mailto's in parent documents. Maybe that'll end up in a future column. Until then, happy parsing!

Listings

        =0=     ###### listing 1 ######
        =1=     #!/usr/bin/perl -Tw
        =2=     use strict;
        =3=     print STDERR
        =4=       join (" ",
        =5=             map "[$_]",
        =6=             scalar localtime,
        =7=             "404 ERROR",
        =8=             map { $ENV{$_} || "-" }
        =9=             qw(REDIRECT_URL REMOTE_HOST HTTP_REFERER)),
        =10=      "\n";
        =11=    my $red_url = $ENV{REDIRECT_URL} || "?unknown?";
        =12=    eval {
        =13=      if ($red_url =~ /^\/~.*/s) {
        =14=        my $html = $red_url;
        =15=        $html =~ s/[\x00-\x20"<&>"\x80-\xff]/&\#@{[ord$&]}\;/g;
        =16=        my $tp_html = "http://www.teleport.com$html";
        =17=        print <<DQ;
        =18=    Content-type: text/html
        =19=    Status: 404 Not Found
        =20=    
        =21=    <HEAD><TITLE>File Not found</TITLE></HEAD>
        =22=    <BODY><H1>File Not found</H1>
        =23=    The requested URL $html was not found on this server.<P>
        =24=    Perhaps you were looking for something at Teleport's web-server,
        =25=    such as <A HREF="$tp_html">$tp_html</a>?
        =26=    </BODY>
        =27=    DQ
        =28=        exit 0;
        =29=      }
        =30=    };
        =31=    print <<"DQ";
        =32=    Content-type: text/html
        =33=    Status: 404 Not Found
        =34=    
        =35=    <HEAD><TITLE>File Not found</TITLE></HEAD>
        =36=    <BODY><H1>File Not found</H1>
        =37=    The requested URL $red_url was not found on this server.<P>
        =38=    Try looking at the <a href="http://www.stonehenge.com/">home page</a>.
        =39=    <!-- This is a custom message. -->
        =40=    </BODY>
        =41=    DQ
        =0=     ###### listing 2 ######
        =1=     #!/usr/bin/perl
        =2=     use strict;
        =3=     $|=1;
        =4=     
        =5=     use LWP::Simple;
        =6=     
        =7=     my $LOGDIR = "/home/merlyn/Logs";
        =8=     @ARGV = (
        =9=              (map { "gunzip <$_|" } <$LOGDIR/error_log.*[0-9].gz>),
        =10=             <$LOGDIR/error_log.*[0-9]>,
        =11=             "$LOGDIR/error_log",
        =12=            );
        =13=    
        =14=    my %seen;
        =15=    while (<>) {
        =16=      next unless /404 ERROR/;
        =17=      s/^\[//;
        =18=      s/\]\s*$//;
        =19=      my @fields = split /\] \[/;
        =20=      my ($time, $wanted, $ref) = @fields[0,2,4];
        =21=      next unless $ref =~ /^http:/; # solid HTTP fetch
        =22=      next if $ref =~ /\?/;         # no CGI searches
        =23=      next if $seen{$ref}++;        # once only
        =24=      print "[$time $wanted $ref]\n";
        =25=      my $content = get $ref;
        =26=      unless (defined $content) {
        =27=        print "... content not available\n";
        =28=        next;
        =29=      }
        =30=      my @stonehenge = $content =~ /(\S*stonehenge\S*)/mig;
        =31=      if (@stonehenge) {
        =32=        print map "  hit: $_\n", @stonehenge;
        =33=        print map "  mailto: $_\n", $content =~ /(\S*mailto:\S*)/mg;
        =34=      }
        =35=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc. of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.