Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Web Techniques Column 35 (Mar 1999)

Ahh, yes. Link verification. You've got a site with dozens, hundreds, or perhaps even thousands of links. After all, that's the purpose of HTML, to be able to link from one phrase to another page. But the resources referenced by those links (both internal links to your website, and external links to other websites) occasionally move (either accidentally or deliberately). And then you've got that ugly evil condition -- a bad link.

In fact, it's such a common problem that I've been using various tools that I had written about in this column to perform link verification. It started with a simple link checker in October 96, then advanced to a full cross referencer in June 97, and then a parallel link checker in July of 98. Well, it's getting to be about time to make it even better.

The problem with my past link verifiers is that they needed to run over the entire tree in one run. This is no big deal if you have 20 or 30 pages at most (like I did with the past versions), but once you start having a hundred pages or so, you have to be fairly lucky that your system doesn't crash some time in the middle of the run. And now that I've been adding more external links to my web page, there are now more and more things that have to be checked. So, a link verifier that has to scan the entire tree in one run just doesn't scale well.

But there was another problem with doing it in one run. Sometimes, a perfectly valid URL triggers a website that is temporarily down, or maybe just overloaded. So, since we're checking that URL only once in a run, we'll get a false alarm. And worse, when that happens, we won't scan the links that are present on that page, so we could miss the rest of a major run.

So what I needed was a way to give a link verifier a memory, so that it could restart if interrupted, and remember that a temporary failure wasn't so bad because of a prior successful fetch. And that's what I've created here.

This link verifier holds a database of each visited URL and its associated parameters as a Storable-saved disk file. The Storable module (found in the CPAN) can write an arbitrary data structure (lists, hashes, and references) to a file in a very fast compact format, and restore it later. For each URL, we'll keep track of when we've visited it last, when it was last modified, and even the links that it points at. If the program is interrupted, the current state of the scanning is saved as well.

(I'm indebted to my friend Rachel Taylor, of the US Census Bureau, for first having discussions of this approach, and getting me to think about how it could eventually be implemented. Thanks, Rachel!)

Although the prior link verifier was parallelized using the LWP::Parallel package, I abandoned using that module when I discovered that Gisle Aas (co-author of LWP) was working on a more tightly integrated package (to support HTTP/1.1), now in alpha release as LWPng. However, the new code is highly underdocumented in its alpha state, so I went back to normal non-parallel evaluation for this first radically improved link checker.

And the resulting program is in [listing one]. As it weighs in at roughly 300 lines, I'm not going to do my usual line-by-line description, but merely point out the highlights.

Please note the copyright on lines 5 through 7. Although my programs are meant to merely be examples for you to incorporate into your own programs, you can certainly run this program nearly as-is, as long as you adjust the appropriate configuration constants.

Lines 9 through 11 pull in the CPAN modules that we'll need. The module is fairly new portion of the LWP set of modules, so if this fails for you, be sure to get the latest release. If you're using to do your installation, you can simply execute the following:

  $ perl -MCPAN -eshell
  cpan> install Bundle::LWP
  [..much output..]
  cpan> quit

Line 13 makes it easier to specify a number of seconds as a multiple of a number of days.

Lines 15 through 74 specify all the configuration parameters for executions of this program. Obviously, there are quite a few configuration constants here. If you're going to use the program more or less as written, I suggest you read the description of how the program works first.

Line 17 declares where the ``memory'' for this program is stored. Don't worry about creating the file, but make sure you don't have some other file there already.

Lines 19 and 20 define a list of top-level URLs that should be scanned. You'll need to add only some URL in each disconnected part of the tree to be scanned. Any links forward from this page will be found recursively. Normally there'll be only one or two entries here. Please note that I've added an X to the end of each name here to prevent clueless people from blindly running a downloaded version of this program against my website accidentally. (If you just want to see the program run, remove the X's, and run it. But please don't do that repeatedly, because it can be somewhat resource-sucking on my website.)

Lines 22 to 45 define a PARSE subroutine. This subroutine will be repeatedly handed a URI object as each URL is revealed (including the starting URLs given in @CHECK). The PARSE routine must return a coded value to indicate how interesting the URL is. For a URL that should not only be fetched, but having its contents parsed for further URLs if it is a HTML page, return the value 2. For URLs that should merely be verified to see if they exist (sometimes called pinging in my prior link checkers), return a 1. If the URL should be noted in the cross reference, but not verified in any way, a 0 suffices. And if the URL is of no interest whatsoever (and should not even show up in the cross references), a -1 works.

For my site, I'm parsing nearly everything except for a few messy directories. Also, I have a picture hierarchy (see last month's column), and I really don't care to see the hundreds of JPEG files in the cross reference, so I blast those away with a -1. For URLs that aren't part of my site, I'll simply ping them.

This section makes use of a nice idiom... the single-element foreach loop (spelled for here). This is simply an easy way to give a short local name ($_) to a longer value (like $url-path_query) for a few lines of code. Also, since $_ is the default for regular expression matches, things get even easier when I want to match a string of some kind.

Lines 47 through 61 rewrite observed URLs to enforce uniformity. For example, my website is equally accessible via both and, and if I don't map one to the other here, I get doubled runs through nearly my entire web site. Also, I rewrite everything that ends in index.html to be merely the URL for the directory -- again to ensure that I don't get two different views of the same data. And, since I'm using my ``where are they going'' CGI hack (described in my May 1998 column) for my outgoing links, I strip the CGI invocation to peel out the actual URL. For example, on my pages:


is actually a link to

after my CGI ``go'' script redirects the user. So, I unmangle it here.

Line 63 defines the verbosity flag. If 0, we get only the summary report. If 1, each page will be noted as it is fetched (along with a one-line result status), and if 2, we'll also see all the outbound links for each HTML page as they are noted.

Line 65 is the number of seconds before we recheck any particular URL, mostly as a safety check to keep us from banging the same URL if something goes wrong. Leaving it at 10 is probably a good idea, but you could conceivably set it to one day or more if you want to be very very kind to the site you're scanning, and don't mind occasionally not noticing that something had gone away.

Line 66 is the number of seconds (here specified in days by multiplying the constant defined earlier) before we'll recheck a ``good'' URL. Again, making this higher is nicer to the checked site, but risks missing a page that has gone away.

Line 67 defines how long to remember the URLs that are on a page that appears to have disappeared, perhaps temporarily. This is handy when you have an intermittent site, because even if the page is unfetchable, you'll still follow the links you found in a prior pass. This should definitely be longer than the frequency with which you run an entire pass.

Line 68 defines how bad something has to be before I see it in the summary report in the end. If the time between a good fetch and a fetch of any kind is more than 3 days (in this case), it's considered a bad page. This should be high enough to avoid false alarms for temporarily down pages (like perhaps 3 to 4 times as long as the frequency with which you check each URL), but not so long that you'll miss a truly absent page. If you set this number to 0, you get a nice complete cross reference at the end of the pass, so that's handy to do occasionally. (Perhaps a command-line switch to do that would be nice for a modified version of this program.)

Line 70 tells LWP how many seconds to wait before aborting a particular URL fetch. Again, this should be high enough to avoid false alarms, but low enough so that you don't spend forever waiting for a dead URL. (The default value from LWP before I added this was 3 minutes, which seemed to be nearly 3 minutes too long.)

Finally, line 72 defines whether a 3xx-category HTTP error (redirections) should be considered bad, or just simply followed per instructions. Many sites have a published URL that really redirects to a per-session URL, so the URL isn't bad if it sends out a 301 or 302. (The way I read the spec, however, these guys should all be using 301, even though I've seen 302 on a number of prominent websites.) If $FOLLOW_REDIRECT is false, we report the 3xx error, but if it's true, we'll merely follow the link the same way a browser would. Note that setting this true may in fact obscure URLs that are incorrectly written on your pages, so I suggest running it with false once, fixing any sensible redirects, and then leaving it set to true for the most part.

Lines 76 to 106 define a link parser class, subclassed from HTML::Parser (part of LWP). This is used in line 197.

Lines 108 to 121 set up the LWP::UserAgent instance that will be used to fetch the pages.

Lines 124 to 152 set up the persistant data. The %NEW hash is the current data, while %OLD is the information from the previous pass (if any). @TODO is the list of URLs still to be examined for this pass.

The signals listed in line 135 include everything that is likely to stop the program (such as an interrupt from the keyboard), causing them to save the current database at a convenient time in the processing. Note that the list includes SIGARLM, sent at a requested number of seconds after invoking the alarm() function.

If there is a command-line parameter, we set up an alarm that many seconds later (in line 139), so we can force this program to run for a certain timespan before aborting. For example, you could set up a cron job to run this program for two hours every night beginning at 01:00 AM, so that it covers as much of your site as it can in those two hours early in the morning. (That'd be a command-line parameter of 7200.) If that results in a report, you'll see it in the standard cron mail -- otherwise, it runs silently (presuming $VERBOSE is zero), making slow incremental progress through your site.

Lines 154 through 257 define the main loop, executed once per URL. If next is executed within this block, lines 258 through 288 provide an ``end of loop'' sequence of steps. That's where we note if a shutdown-type signal has come along, or if we've reached the end of a pass.

Lines 292 through 300 define the add_link subroutine, used when a link is discovered (or initially selected in @CHECK) to trigger a scan of the appropriate kind.

Lines 302 to 310 define the dump_relative subroutine, called twice from the report-generation phase in lines 267 through 285 to dump a series of URL links for the cross-reference part of the report.

Line 312 works around a bug I found in version 1.00. Perhaps it won't be needed when you read this.

Note that in line 165, a random item from the todo-list is selected, which distributes the load amongst multiple web servers in the list (if any). It also means that you can't really diff the output of one run of this program against another, because the order of URL visits will be different.

Also note that line 182 uses a previously fetched Last-Modified time to form a request in which we might get a 304 response (not yet modified). This saves a lot of time, because we can use the previously found links instead of reparsing the data (or even re-receiving it!).

Whew! That's a big one, isn't it? When Gisle's LWPng becomes stable, I'll probably hack this program yet again to be parallel. (Sounds like that could be another future column.) Keep watching, and in the meanwhile, enjoy your link-verified web pages!

Listing One

        =1=     #!/home/merlyn/bin/perl -w
        =2=     use strict;
        =3=     $| = 1;
        =5=     ## Copyright (c) 1996,97,98,99 by Randal L. Schwartz
        =6=     ## This program is free software; you can redistribute it
        =7=     ## and/or modify it under the same terms as Perl itself.
        =9=     use URI;
        =10=    use LWP;
        =11=    use Storable;
        =13=    use constant DAYS => 24 * 60 * 60; # for configs below
        =15=    ## begin configure
        =17=    my $DATABASE = "/home/merlyn/.slinkydata";
        =19=    my @CHECK =                     # list of initial starting points
        =20=      qw(http://www.stonehenge.comX/ http://www.effectiveperl.comX/);
        =22=    sub PARSE {
        =23=      ## return 2 to parse if HTML
        =24=      ## return 1 to merely verify existance
        =25=      ## return 0 to not even verify existance, but still xref
        =26=      ## return -1 to ignore entirely
        =27=      my $url = shift;              # URI::URL object (absolute)
        =28=      for ($url->scheme) {
        =29=        return 0 unless /^ftp$/ or /^gopher$/ or /^http$/;
        =30=      }
        =31=      for ($url->host) {
        =32=        if (/\.stonehenge\.com$/) {
        =33=          for ($url->path_query) {
        =34=            return -1 if /\/\?[DNS]=[AD]$/; # silly mod_index
        =35=          }
        =36=          for ($url->path) {
        =37=            return 0 if /^\/(cgi|fors)\// or /col\d\d|refindex/;
        =38=            return -1 if /^\/merlyn\/Pictures\/.*\.jpg$/s;
        =39=          }
        =40=          return 2;                 # default
        =41=        }
        =42=        return 2 if /\.effectiveperl\.com$/;
        =43=        return 1;                   # ping the world
        =44=      }
        =45=    }
        =47=    sub HACK_URL {
        =48=      my $url = shift;              # URI object
        =49=      {
        =50=        $url = $url->canonical;
        =51=        if ($url->scheme eq "http") {
        =52=          $url->host("") if $url->host eq "";
        =53=          if ($url->host eq "") {
        =54=            ($url = URI->new("$1")), redo
        =55=              if $url->path_query =~ /^\/cgi\/go\/(.*)/s;
        =56=            $url->path("$1") if $url->path =~ /^(.*\/)index\.html$/s;
        =57=          }
        =58=        }
        =59=      }
        =60=      $url->canonical;
        =61=    }
        =63=    my $VERBOSE = 2;                # 0 = quiet, 1 = noise, 2 = lots of noise
        =65=    my $RECHECK = 10;               # seconds between rechecking any URL
        =66=    my $RECHECK_GOOD = 1 * DAYS;    # seconds between rechecking good URLs
        =67=    my $FOLLOW_GHOST = 14 * DAYS;   # seconds before tossing bad URLs links
        =68=    my $REPORT = 3 * DAYS;          # seconds before bad enough to report
        =70=    my $TIMEOUT = 15;               # seconds to timeout on fetch
        =72=    my $FOLLOW_REDIRECT = 1;        # follow redirects?
        =74=    ## end configure (no user-servicable parts below this line)
        =76=    BEGIN {
        =77=      package ParseLink;
        =78=      use HTML::Parser;
        =79=      use vars qw(@ISA);
        =81=      @ISA = qw(HTML::Parser);
        =83=      sub new_line {
        =84=        my $self = shift;
        =85=        $self->{Line}++;
        =86=      }
        =88=      ## $self->{Links} = {
        =89=      ##    "url" => { "line" => "count", "line" => "count" ... }, ...
        =90=      ## };
        =91=      sub start {                   # called by parse
        =92=        my $self = shift;
        =93=        my ($tag, $attr) = @_;
        =94=        my $link;
        =95=        $link = $attr->{href} if $tag eq "a";
        =96=        $link = $attr->{src} if $tag eq "img";
        =97=        if (defined $link) {
        =98=          $self->{Links}{$link}{$self->{Line} + 1}++;
        =99=        }
        =100=     }
        =102=     sub get_links {               # $instance->get_links()
        =103=       my $self = shift;
        =104=       $self->{Links};
        =105=     }
        =106=   }                               # end of ParseLink
        =108=   BEGIN {
        =109=     my $AGENT = LWP::UserAgent->new;
        =110=     $AGENT->agent("slinky/1.02 " . $AGENT->agent);
        =111=     $AGENT->env_proxy;
        =112=     $AGENT->timeout($TIMEOUT);
        =114=     sub fetch {
        =115=       if ($FOLLOW_REDIRECT) {
        =116=         $AGENT->request(shift);
        =117=       } else {
        =118=         $AGENT->simple_request(shift);
        =119=       }
        =120=     }
        =121=   }
        =123=   ## the persistant data:
        =124=   my (%OLD, %NEW, @TODO);
        =125=   my $SAVE_WANTED = 0;
        =127=   if (-r $DATABASE) {
        =128=     my $restore = retrieve $DATABASE or die "Cannot retrieve from $DATABASE\n";
        =129=     %OLD = %{$restore->[0]};
        =130=     %NEW = %{$restore->[1]};
        =131=     @TODO = @{$restore->[2]};
        =132=     warn "database restored\n" if $VERBOSE;
        =133=   }
        =135=   for (qw(HUP INT QUIT ALRM TERM)) {
        =136=     $SIG{$_} = sub { $SAVE_WANTED = 1 };
        =137=   }
        =139=   alarm(shift) if @ARGV;
        =141=   ## $NEW{"some url"} = {
        =142=   ##   From => {
        =143=   ##     { "url" => { "line" => 1, "line" => 1, ... } },
        =144=   ##     { "url" => { "line" => 1, "line" => 1, ... } },
        =145=   ##   },
        =146=   ##   To => [like From]
        =147=   ##   Base => "base", ## if base != url
        =148=   ##   Status => "Whatever",
        =149=   ##   Checked => time, ## when did we last look?
        =150=   ##   Good => time, ## when was it good (if ever)
        =151=   ##   LastModified => time, ## when it was good, when was it last modified?
        =152=   ## }
        =154=   URL: {
        =155=     unless (@TODO) {
        =156=       ## prime the pump
        =157=       %OLD = %NEW;
        =158=       %NEW = ();
        =159=       for (0..$#CHECK) {
        =160=         my $url = HACK_URL(URI->new($CHECK[$_]));
        =161=         add_link("REQUESTED:", $_, $url);
        =162=       }
        =163=     }
        =164=     warn @TODO." to go...\n" if $VERBOSE;
        =165=     my $url = splice(@TODO, rand @TODO, 1); # the lucky winner is...
        =166=     $url = URI->new($url);
        =167=     for (qw(Checked Good LastModified)) {
        =168=       $NEW{$url}{$_} = $OLD{$url}{$_} || 0;
        =169=     }
        =170=     my $parse = PARSE($url);
        =171=     if ($parse >= 2) {
        =172=       warn "Parsing $url\n" if $VERBOSE;
        =173=       my $links = $OLD{$url}{To} || {};
        =174=       my $base;
        =175=       if (time < $NEW{$url}{Checked} + $RECHECK or
        =176=           time < $NEW{$url}{Good} + $RECHECK_GOOD and
        =177=           $NEW{$url}{LastModified} > 0) {
        =178=         warn ".. too early to recheck\n" if $VERBOSE;
        =179=         $NEW{$url}{Status} = $OLD{$url}{Status};
        =180=       } else {
        =181=         my $req = HTTP::Request->new(GET => $url);
        =182=         $req->if_modified_since($NEW{$url}{LastModified});
        =183=         my $res = fetch($req);
        =184=         push(@TODO, $url->as_string), next if $SAVE_WANTED;
        =185=         if ($res->is_success) {
        =186=           warn ".. successful fetch\n" if $VERBOSE;
        =187=           $base = $res->base;
        =188=           $NEW{$url}{Checked} = $NEW{$url}{Good} = time;
        =189=           $NEW{$url}{Base} = $base if $base ne $url;
        =190=           $NEW{$url}{LastModified} = $res->last_modified || $res->date;
        =191=           unless ($res->content_type =~ /text\/html/i) {
        =192=             warn ".. not HTML\n" if $VERBOSE;
        =193=             $NEW{$url}{Status} = "Verified (content = ".($res->content_type).")";
        =194=             next;
        =195=           }
        =196=           $NEW{$url}{Status} = "Verified (and parsed)";
        =197=           my $p = ParseLink->new;
        =198=           for ($res->content =~ /(.+|\n)/g) {
        =199=             $p->parse($_);
        =200=             $p->new_line() if $_ eq "\n";
        =201=           }
        =202=           $p->parse(undef);               # signal the end of parse
        =203=           $links = $p->get_links; # key is relative url, value is href
        =204=         } elsif ($res->code == 304) {
        =205=           warn ".. not modified\n" if $VERBOSE;
        =206=           $NEW{$url}{Status} = $OLD{$url}{Status};
        =207=           $NEW{$url}{Checked} = $NEW{$url}{Good} = time;
        =208=         } else {
        =209=           warn ".. not verified\n" if $VERBOSE;
        =210=           $NEW{$url}{Status} = "NOT Verified (status = ".($res->code).")";
        =211=           $NEW{$url}{Checked} = time;
        =212=           next if time > $NEW{$url}{Good} + $FOLLOW_GHOST;
        =213=           warn ".. but following ghost links\n" if $VERBOSE;
        =214=         }
        =215=       }
        =216=       for my $link (sort keys %$links) {
        =217=         my $abs = $link;
        =218=         if ($base) {              # we fetched a page
        =219=           $abs = URI->new_abs($link,$base);
        =220=           $abs->fragment(undef);          # blow away any fragment
        =221=           $abs = HACK_URL($abs)->as_string;
        =222=         }
        =223=         warn "... $abs ($link)\n" if $VERBOSE > 1;
        =224=         for my $line (sort keys %{$links->{$link}}) {
        =225=           add_link($url, $line, $abs) if PARSE(URI->new($abs)) >= 0;
        =226=         }
        =227=       }
        =228=       next;
        =229=     } elsif ($parse >= 1) {
        =230=       warn "Pinging $url\n" if $VERBOSE;
        =231=       if (time < $NEW{$url}{Checked} + $RECHECK or
        =232=           time < $NEW{$url}{Good} + $RECHECK_GOOD) {
        =233=         warn ".. too early to recheck\n" if $VERBOSE;
        =234=         $NEW{$url}{Status} = $OLD{$url}{Status};
        =235=       } else {
        =236=         my $res;
        =237=         for my $method (qw(HEAD GET)) {
        =238=           my $req = new HTTP::Request($method,$url);
        =239=           $res = fetch($req);
        =240=           push(@TODO, $url->as_string), next URL if $SAVE_WANTED;
        =241=           if ($res->is_success) {
        =242=             $NEW{$url}{Status} = "Verified (contents not examined)";
        =243=             $NEW{$url}{Checked} = $NEW{$url}{Good} = time;
        =244=             $NEW{$url}{LastModified} = 0;
        =245=             next URL;
        =246=           }
        =247=         }
        =248=         $NEW{$url}{Status} = "NOT Verified (status = ".($res->code).")";
        =249=         $NEW{$url}{Checked} = time;
        =250=         next;
        =251=       }
        =252=     } else {                      # $parse < 0
        =253=       warn "Skipping $url\n" if $VERBOSE;
        =254=       $NEW{$url}{Status} = "Skipped";
        =255=       $NEW{$url}{Checked} = 0;    # we no longer check this
        =256=       next;
        =257=     }
        =258=   } continue {
        =259=     if ($SAVE_WANTED) {
        =260=       warn "dumping data to $DATABASE...\n" if $VERBOSE;
        =261=       store [\%OLD, \%NEW, \@TODO], $DATABASE;
        =262=       exit 0;
        =263=     }
        =264=     redo if @TODO;
        =265=     warn "dumping data to $DATABASE...\n" if $VERBOSE;
        =266=     store [\%OLD, \%NEW, \@TODO], $DATABASE;
        =267=     print "\nBEGIN REPORT at ".localtime()."\n\n";
        =268=     for my $url (sort keys %NEW) {
        =269=       next if $url =~ /^requested:/i;
        =270=       my $entry = $NEW{$url};     # href
        =271=       next unless $entry->{Checked} > $entry->{Good} + $REPORT;
        =272=       my $status = $entry->{Status};
        =273=       my $base = $entry->{Base};
        =274=       print "$url";
        =275=       print " (base $base)" if defined $base;
        =276=       print ":\n  status: $status\n";
        =277=       for (qw(Checked Good LastModified)) {
        =278=         if (my $stamp = $entry->{$_}) {
        =279=           print "  $_ => ".localtime($stamp)."\n";
        =280=         }
        =281=       }
        =282=       dump_relative($url, "from", $entry->{From});
        =283=       dump_relative($url, "to", $entry->{To});
        =284=     }
        =285=     print "\nEND REPORT\n\n";
        =286=     ## redo;
        =287=     exit 0;
        =288=   }       
        =290=   ## subroutines
        =292=   sub add_link {
        =293=     my ($from,$line,$to) = @_;
        =294=     $NEW{$to}{From}{$from}{$line}++;
        =295=     $NEW{$from}{To}{$to}{$line}++;
        =296=     unless (defined $NEW{$to}{Status}) {
        =297=       push @TODO, $to;
        =298=       $NEW{$to}{Status} = "[to be probed]";
        =299=     }
        =300=   }
        =302=   sub dump_relative {
        =303=     my ($url,$label,$urls) = @_;
        =304=     for my $other_url (sort keys %$urls) {
        =305=       my $relative = URI->new($other_url)->rel($url) || ".";
        =306=       print "  $label $relative at ";
        =307=       print join " ", sort { $a <=> $b } keys %{$urls->{$other_url}};
        =308=       print "\n";
        =309=     }
        =310=   }
        =312=   sub URI::mailto::authority { ""; } # workaround bug

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at or +1 503 777-0095, and welcomes questions on Perl and other related topics.