Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Web Techniques Column 35 (Mar 1999)
Ahh, yes. Link verification. You've got a site with dozens, hundreds, or perhaps even thousands of links. After all, that's the purpose of HTML, to be able to link from one phrase to another page. But the resources referenced by those links (both internal links to your website, and external links to other websites) occasionally move (either accidentally or deliberately). And then you've got that ugly evil condition -- a bad link.
In fact, it's such a common problem that I've been using various tools that I had written about in this column to perform link verification. It started with a simple link checker in October 96, then advanced to a full cross referencer in June 97, and then a parallel link checker in July of 98. Well, it's getting to be about time to make it even better.
The problem with my past link verifiers is that they needed to run over the entire tree in one run. This is no big deal if you have 20 or 30 pages at most (like I did with the past versions), but once you start having a hundred pages or so, you have to be fairly lucky that your system doesn't crash some time in the middle of the run. And now that I've been adding more external links to my web pages, there are more and more things that have to be checked. So, a link verifier that has to scan the entire tree in one run just doesn't scale well.
But there was another problem with doing it in one run. Sometimes, a perfectly valid URL points at a website that is temporarily down, or maybe just overloaded. Since we're checking that URL only once in a run, we get a false alarm. And worse, when that happens, we won't scan the links that are present on that page, so we could miss a major portion of the rest of the run.
So what I needed was a way to give a link verifier a memory, so that it could restart if interrupted, and remember that a temporary failure wasn't so bad because of a prior successful fetch. And that's what I've created here.
This link verifier holds a database of each visited URL and its associated parameters as a Storable-saved disk file. The Storable module (found in the CPAN) can write an arbitrary data structure (lists, hashes, and references) to a file in a very fast compact format, and restore it later. For each URL, we'll keep track of when we've visited it last, when it was last modified, and even the links that it points at. If the program is interrupted, the current state of the scanning is saved as well.
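If you haven't used Storable before, here's a minimal standalone sketch of the two calls involved; the sample data and the /tmp/slinky-demo path are made up for illustration, not taken from the listing below.
    use Storable;               # exports store() and retrieve()

    my %seen = ("http://www.stonehenge.com/" => { Checked => time });
    my @todo = ("http://www.stonehenge.com/merlyn/");

    # write the whole structure to disk in Storable's fast binary format
    store [\%seen, \@todo], "/tmp/slinky-demo";

    # ...and later (even in another process) get an identical copy back
    my $restore = retrieve "/tmp/slinky-demo";
    my ($seen, $todo) = @$restore;
    print "remembered ", scalar(@$todo), " URL(s) still to do\n";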
(I'm indebted to my friend Rachel Taylor, of the US Census Bureau, for first having discussions of this approach, and getting me to think about how it could eventually be implemented. Thanks, Rachel!)
Although the prior link verifier was parallelized using the LWP::Parallel package, I abandoned using that module when I discovered that Gisle Aas (co-author of LWP) was working on a more tightly integrated package (to support HTTP/1.1), now in alpha release as LWPng. However, the new code is highly underdocumented in its alpha state, so I went back to normal non-parallel evaluation for this first radically improved link checker.
And the resulting program is in [listing one]. As it weighs in at roughly 300 lines, I'm not going to do my usual line-by-line description, but merely point out the highlights.
Please note the copyright on lines 5 through 7. Although my programs are meant to merely be examples for you to incorporate into your own programs, you can certainly run this program nearly as-is, as long as you adjust the appropriate configuration constants.
Lines 9 through 11 pull in the CPAN modules that we'll need. The URI.pm module is a fairly new portion of the LWP set of modules, so if this fails for you, be sure to get the latest release. If you're using CPAN.pm to do your installation, you can simply execute the following:
    $ perl -MCPAN -eshell
    cpan> install Bundle::LWP
    [..much output..]
    cpan> quit
    $
Line 13 makes it easier to specify a number of seconds as a multiple of a number of days.
Lines 15 through 74 specify all the configuration parameters for executions of this program. Obviously, there are quite a few configuration constants here. If you're going to use the program more or less as written, I suggest you read the description of how the program works first.
Line 17 declares where the ``memory'' for this program is stored. Don't worry about creating the file, but make sure you don't have some other file there already.
Lines 19 and 20 define a list of top-level URLs that should be scanned. You need to list only one URL in each disconnected part of the tree to be scanned; any links forward from that page will be found recursively. Normally there'll be only one or two entries here. Please note that I've added an X to the end of each name here to prevent clueless people from blindly running a downloaded version of this program against my website accidentally. (If you just want to see the program run, remove the X's, and run it. But please don't do that repeatedly, because it can be somewhat resource-sucking on my website.)
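For your own site, you'd simply swap in your own starting points, something like this (www.example.com is a stand-in, of course):
    my @CHECK =   # one starting URL per disconnected area of the site
      qw(http://www.example.com/ http://www.example.com/intranet/);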
Lines 22 to 45 define a PARSE subroutine. This subroutine will be repeatedly handed a URI object as each URL is revealed (including the starting URLs given in @CHECK). The PARSE routine must return a coded value to indicate how interesting the URL is. For a URL that should not only be fetched, but also have its contents parsed for further URLs if it is an HTML page, return the value 2. For URLs that should merely be verified to see if they exist (sometimes called pinging in my prior link checkers), return a 1. If the URL should be noted in the cross reference, but not verified in any way, a 0 suffices. And if the URL is of no interest whatsoever (and should not even show up in the cross references), a -1 works.
For my site, I'm parsing nearly everything except for a few messy directories. Also, I have a picture hierarchy (see last month's column), and I really don't care to see the hundreds of JPEG files in the cross reference, so I blast those away with a -1. For URLs that aren't part of my site, I'll simply ping them.
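If you're adapting PARSE to your own site, the shape stays the same: look at pieces of the URI object and hand back one of the four codes. Here's a minimal sketch for a hypothetical www.example.com site whose /images/ tree shouldn't even appear in the cross reference (the hostname and path are assumptions, not anything from my site):
    sub PARSE {
      my $url = shift;            # an absolute URI object
      return 0 unless $url->scheme =~ /^(?:http|ftp|gopher)$/;
      if ($url->host =~ /\.example\.com$/) {      # hypothetical local site
        return -1 if $url->path =~ /^\/images\//; # don't even xref these
        return 2;                 # fetch, and parse HTML for more links
      }
      return 1;                   # everything else: just ping it
    }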
This section makes use of a nice idiom... the single-element foreach loop (spelled for here). This is simply an easy way to give a short local name ($_) to a longer value (like $url->path_query) for a few lines of code. Also, since $_ is the default for regular expression matches, things get even easier when I want to match a string of some kind.
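In case the idiom is new to you, here's a tiny self-contained illustration (the %config hash is just a stand-in for a longer expression you don't want to keep retyping):
    my %config = (path_query => "/merlyn/Pictures/vacation.jpg");
    for ($config{path_query}) {           # $_ now aliases that value
      print "picture\n" if /\.jpg$/;      # regexes match $_ by default
      print "query\n"   if /\?/;
    }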
Lines 47 through 61 rewrite observed URLs to enforce uniformity. For example, my website is equally accessible via both www.stonehenge.com and w3.stonehenge.com, and if I don't map one to the other here, I get doubled runs through nearly my entire web site. Also, I rewrite everything that ends in index.html to be merely the URL for the directory -- again to ensure that I don't get two different views of the same data. And, since I'm using my ``where are they going'' CGI hack (described in my May 1998 column) for my outgoing links, I strip the CGI invocation to peel out the actual URL. For example, on my pages, /cgi/go/http://www.perl.org/ is actually a link to http://www.perl.org/ after my CGI ``go'' script redirects the user. So, I unmangle it here.
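Your own HACK_URL will look different, but the pattern is the same: canonicalize, fold aliases together, and strip anything that makes one resource look like two. A sketch for a hypothetical site where web.example.com is an alias for www.example.com:
    sub HACK_URL {
      my $url = shift->canonical;          # a URI object
      if ($url->scheme eq "http") {
        # hypothetical alias hostname folded into the canonical one
        $url->host("www.example.com") if $url->host eq "web.example.com";
        # .../index.html and the bare directory are the same page
        $url->path("$1") if $url->path =~ /^(.*\/)index\.html$/s;
      }
      $url->canonical;
    }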
Line 63 defines the verbosity flag. If 0, we get only the summary report. If 1, each page will be noted as it is fetched (along with a one-line result status), and if 2, we'll also see all the outbound links for each HTML page as they are noted.
Line 65 is the number of seconds before we recheck any particular URL, mostly as a safety check to keep us from banging the same URL if something goes wrong. Leaving it at 10 is probably a good idea, but you could conceivably set it to one day or more if you want to be very very kind to the site you're scanning, and don't mind occasionally not noticing that something had gone away.
Line 66 is the number of seconds (here specified in days by multiplying the constant defined earlier) before we'll recheck a ``good'' URL. Again, making this higher is nicer to the checked site, but risks missing a page that has gone away.
Line 67 defines how long to remember the URLs that are on a page that appears to have disappeared, perhaps temporarily. This is handy when you have an intermittent site, because even if the page is unfetchable, you'll still follow the links you found in a prior pass. This should definitely be longer than the frequency with which you run an entire pass.
Line 68 defines how bad something has to be before I see it in the summary report in the end. If the time between a good fetch and a fetch of any kind is more than 3 days (in this case), it's considered a bad page. This should be high enough to avoid false alarms for temporarily down pages (like perhaps 3 to 4 times as long as the frequency with which you check each URL), but not so long that you'll miss a truly absent page. If you set this number to 0, you get a nice complete cross reference at the end of the pass, so that's handy to do occasionally. (Perhaps a command-line switch to do that would be nice for a modified version of this program.)
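The test that implements this (line 271 of the listing) is just a comparison of two saved timestamps:
    # a URL makes the report only when its most recent check is more
    # than $REPORT seconds past its most recent successful fetch
    next unless $entry->{Checked} > $entry->{Good} + $REPORT;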
Line 70 tells LWP how many seconds to wait before aborting a particular URL fetch. Again, this should be high enough to avoid false alarms, but low enough so that you don't spend forever waiting for a dead URL. (The default value from LWP before I added this was 3 minutes, which seemed to be nearly 3 minutes too long.)
Finally, line 72 defines whether a 3xx-series HTTP response (a redirection) should be considered bad, or just simply followed per instructions. Many sites have a published URL that really redirects to a per-session URL, so the URL isn't bad if it sends out a 301 or 302. (The way I read the spec, however, these guys should all be using 301, even though I've seen 302 on a number of prominent websites.) If $FOLLOW_REDIRECT is false, we report the 3xx status as a failure, but if it's true, we'll merely follow the link the same way a browser would. Note that setting this true may in fact obscure URLs that are incorrectly written on your pages, so I suggest running it with false once, fixing any sensible redirects, and then leaving it set to true for the most part.
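The distinction in the listing's fetch() wrapper boils down to which LWP::UserAgent method gets called; here's a standalone sketch (the URL is made up):
    use LWP::UserAgent;
    use HTTP::Request;

    my $FOLLOW_REDIRECT = 0;       # report 3xx rather than chase it
    my $agent = LWP::UserAgent->new;

    my $req = HTTP::Request->new(GET => "http://www.example.com/old-page");
    my $res = $FOLLOW_REDIRECT
      ? $agent->request($req)          # follows redirects like a browser
      : $agent->simple_request($req);  # hands back the 301/302 itself
    print $res->code, " ", $res->message, "\n";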
Lines 76 to 106 define a link parser class, subclassed from HTML::Parser (part of LWP). This is used in line 197.
Lines 108 to 121 set up the LWP::UserAgent instance that will be used to fetch the pages.
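To see how the ParseLink class gets driven, here's a small usage sketch (modeled on lines 197 through 203 of the listing, but with a couple of literal HTML chunks instead of a fetched page):
    my $p = ParseLink->new;
    for ('<a href="one.html">one</a>', "\n", '<img src="two.gif">', "\n") {
      $p->parse($_);                 # feed the HTML a chunk at a time
      $p->new_line() if $_ eq "\n";  # count lines for the line-number xref
    }
    $p->parse(undef);                # signal the end of the document
    my $links = $p->get_links;
    # $links is now { "one.html" => { 1 => 1 }, "two.gif" => { 2 => 1 } }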
Lines 124 to 152 set up the persistent data. The %NEW hash is the current data, while %OLD is the information from the previous pass (if any). @TODO is the list of URLs still to be examined for this pass.
The signals listed in line 135 include everything that is likely to stop the program (such as an interrupt from the keyboard), causing the program to save the current database at a convenient time in the processing. Note that the list includes SIGALRM, sent a requested number of seconds after invoking the alarm() function.
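The handlers themselves don't do the saving; they just set a flag that the main loop checks at a safe point, as in this condensed form of lines 125, 135 through 137, and 259 through 263:
    my $SAVE_WANTED = 0;
    for (qw(HUP INT QUIT ALRM TERM)) {
      $SIG{$_} = sub { $SAVE_WANTED = 1 };      # defer the real work
    }
    # ... later, between URL fetches ...
    if ($SAVE_WANTED) {
      store [\%OLD, \%NEW, \@TODO], $DATABASE;  # checkpoint and quit
      exit 0;
    }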
If there is a command-line parameter, we set up an alarm that many seconds later (in line 139), so we can force this program to run for a certain timespan before aborting. For example, you could set up a cron job to run this program for two hours every night beginning at 01:00 AM, so that it covers as much of your site as it can in those two hours early in the morning. (That'd be a command-line parameter of 7200.) If that results in a report, you'll see it in the standard cron mail -- otherwise, it runs silently (presuming $VERBOSE is zero), making slow incremental progress through your site.
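Such a crontab entry might look like this (the path and name are whatever you chose when you saved the program):
    0 1 * * * /home/merlyn/bin/slinky 7200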
Lines 154 through 257 define the main loop, executed once per URL. If next is executed within this block, lines 258 through 288 provide an ``end of loop'' sequence of steps. That's where we note if a shutdown-type signal has come along, or if we've reached the end of a pass.
Lines 292 through 300 define the add_link subroutine, used when a link is discovered (or initially selected in @CHECK) to trigger a scan of the appropriate kind.
Lines 302 to 310 define the dump_relative subroutine, called twice from the report-generation phase in lines 267 through 285 to dump a series of URL links for the cross-reference part of the report.
Line 312 works around a bug I found in URI.pm version 1.00. Perhaps it won't be needed when you read this.
Note that in line 165, a random item from the todo-list is selected, which distributes the load amongst multiple web servers in the list (if any). It also means that you can't really diff the output of one run of this program against another, because the order of URL visits will be different.
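That one line is worth a second look if splice is unfamiliar:
    # rand @TODO yields a fractional index; splice truncates it, removes
    # that element from @TODO, and hands it back
    my $url = splice(@TODO, rand @TODO, 1);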
Also note that line 182 uses a previously fetched Last-Modified time to form a request in which we might get a 304 response (not yet modified). This saves a lot of time, because we can use the previously found links instead of reparsing the data (or even re-receiving it!).
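Stripped of the surrounding bookkeeping, the conditional GET looks something like this (the URL and timestamp are placeholders):
    use LWP::UserAgent;
    use HTTP::Request;

    my $agent = LWP::UserAgent->new;
    my $last_modified = 915148800;    # saved from an earlier good fetch

    my $req = HTTP::Request->new(GET => "http://www.example.com/stuff.html");
    $req->if_modified_since($last_modified);

    my $res = $agent->request($req);
    if ($res->code == 304) {
      print "unchanged -- reuse the links we already know about\n";
    } elsif ($res->is_success) {
      print "new or changed -- reparse the content for links\n";
    }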
Whew! That's a big one, isn't it? When Gisle's LWPng becomes stable, I'll probably hack this program yet again to be parallel. (Sounds like that could be another future column.) Keep watching, and in the meanwhile, enjoy your link-verified web pages!
Listing One
=1= #!/home/merlyn/bin/perl -w
=2= use strict;
=3= $| = 1;
=4= 
=5= ## Copyright (c) 1996,97,98,99 by Randal L. Schwartz
=6= ## This program is free software; you can redistribute it
=7= ## and/or modify it under the same terms as Perl itself.
=8= 
=9= use URI;
=10= use LWP;
=11= use Storable;
=12= 
=13= use constant DAYS => 24 * 60 * 60; # for configs below
=14= 
=15= ## begin configure
=16= 
=17= my $DATABASE = "/home/merlyn/.slinkydata";
=18= 
=19= my @CHECK = # list of initial starting points
=20=   qw(http://www.stonehenge.comX/ http://www.effectiveperl.comX/);
=21= 
=22= sub PARSE {
=23=   ## return 2 to parse if HTML
=24=   ## return 1 to merely verify existance
=25=   ## return 0 to not even verify existance, but still xref
=26=   ## return -1 to ignore entirely
=27=   my $url = shift; # URI::URL object (absolute)
=28=   for ($url->scheme) {
=29=     return 0 unless /^ftp$/ or /^gopher$/ or /^http$/;
=30=   }
=31=   for ($url->host) {
=32=     if (/\.stonehenge\.com$/) {
=33=       for ($url->path_query) {
=34=         return -1 if /\/\?[DNS]=[AD]$/; # silly mod_index
=35=       }
=36=       for ($url->path) {
=37=         return 0 if /^\/(cgi|fors)\// or /col\d\d|refindex/;
=38=         return -1 if /^\/merlyn\/Pictures\/.*\.jpg$/s;
=39=       }
=40=       return 2; # default stonehenge.com
=41=     }
=42=     return 2 if /\.effectiveperl\.com$/;
=43=     return 1; # ping the world
=44=   }
=45= }
=46= 
=47= sub HACK_URL {
=48=   my $url = shift; # URI object
=49=   {
=50=     $url = $url->canonical;
=51=     if ($url->scheme eq "http") {
=52=       $url->host("www.stonehenge.com") if $url->host eq "w3.stonehenge.com";
=53=       if ($url->host eq "www.stonehenge.com") {
=54=         ($url = URI->new("$1")), redo
=55=           if $url->path_query =~ /^\/cgi\/go\/(.*)/s;
=56=         $url->path("$1") if $url->path =~ /^(.*\/)index\.html$/s;
=57=       }
=58=     }
=59=   }
=60=   $url->canonical;
=61= }
=62= 
=63= my $VERBOSE = 2; # 0 = quiet, 1 = noise, 2 = lots of noise
=64= 
=65= my $RECHECK = 10; # seconds between rechecking any URL
=66= my $RECHECK_GOOD = 1 * DAYS; # seconds between rechecking good URLs
=67= my $FOLLOW_GHOST = 14 * DAYS; # seconds before tossing bad URLs links
=68= my $REPORT = 3 * DAYS; # seconds before bad enough to report
=69= 
=70= my $TIMEOUT = 15; # seconds to timeout on fetch
=71= 
=72= my $FOLLOW_REDIRECT = 1; # follow redirects?
=73= 
=74= ## end configure (no user-servicable parts below this line)
=75= 
=76= BEGIN {
=77=   package ParseLink;
=78=   use HTML::Parser;
=79=   use vars qw(@ISA);
=80= 
=81=   @ISA = qw(HTML::Parser);
=82= 
=83=   sub new_line {
=84=     my $self = shift;
=85=     $self->{Line}++;
=86=   }
=87= 
=88=   ## $self->{Links} = {
=89=   ##   "url" => { "line" => "count", "line" => "count" ... }, ...
=90=   ## };
=91=   sub start { # called by parse
=92=     my $self = shift;
=93=     my ($tag, $attr) = @_;
=94=     my $link;
=95=     $link = $attr->{href} if $tag eq "a";
=96=     $link = $attr->{src} if $tag eq "img";
=97=     if (defined $link) {
=98=       $self->{Links}{$link}{$self->{Line} + 1}++;
=99=     }
=100=   }
=101= 
=102=   sub get_links { # $instance->get_links()
=103=     my $self = shift;
=104=     $self->{Links};
=105=   }
=106= } # end of ParseLink
=107= 
=108= BEGIN {
=109=   my $AGENT = LWP::UserAgent->new;
=110=   $AGENT->agent("slinky/1.02 " . $AGENT->agent);
=111=   $AGENT->env_proxy;
=112=   $AGENT->timeout($TIMEOUT);
=113= 
=114=   sub fetch {
=115=     if ($FOLLOW_REDIRECT) {
=116=       $AGENT->request(shift);
=117=     } else {
=118=       $AGENT->simple_request(shift);
=119=     }
=120=   }
=121= }
=122= 
=123= ## the persistant data:
=124= my (%OLD, %NEW, @TODO);
=125= my $SAVE_WANTED = 0;
=126= 
=127= if (-r $DATABASE) {
=128=   my $restore = retrieve $DATABASE or die "Cannot retrieve from $DATABASE\n";
=129=   %OLD = %{$restore->[0]};
=130=   %NEW = %{$restore->[1]};
=131=   @TODO = @{$restore->[2]};
=132=   warn "database restored\n" if $VERBOSE;
=133= }
=134= 
=135= for (qw(HUP INT QUIT ALRM TERM)) {
=136=   $SIG{$_} = sub { $SAVE_WANTED = 1 };
=137= }
=138= 
=139= alarm(shift) if @ARGV;
=140= 
=141= ## $NEW{"some url"} = {
=142= ##   From => {
=143= ##     { "url" => { "line" => 1, "line" => 1, ... } },
=144= ##     { "url" => { "line" => 1, "line" => 1, ... } },
=145= ##   },
=146= ##   To => [like From]
=147= ##   Base => "base", ## if base != url
=148= ##   Status => "Whatever",
=149= ##   Checked => time, ## when did we last look?
=150= ##   Good => time, ## when was it good (if ever)
=151= ##   LastModified => time, ## when it was good, when was it last modified?
=152= ## }
=153= 
=154= URL: {
=155=   unless (@TODO) {
=156=     ## prime the pump
=157=     %OLD = %NEW;
=158=     %NEW = ();
=159=     for (0..$#CHECK) {
=160=       my $url = HACK_URL(URI->new($CHECK[$_]));
=161=       add_link("REQUESTED:", $_, $url);
=162=     }
=163=   }
=164=   warn @TODO." to go...\n" if $VERBOSE;
=165=   my $url = splice(@TODO, rand @TODO, 1); # the lucky winner is...
=166=   $url = URI->new($url);
=167=   for (qw(Checked Good LastModified)) {
=168=     $NEW{$url}{$_} = $OLD{$url}{$_} || 0;
=169=   }
=170=   my $parse = PARSE($url);
=171=   if ($parse >= 2) {
=172=     warn "Parsing $url\n" if $VERBOSE;
=173=     my $links = $OLD{$url}{To} || {};
=174=     my $base;
=175=     if (time < $NEW{$url}{Checked} + $RECHECK or
=176=         time < $NEW{$url}{Good} + $RECHECK_GOOD and
=177=         $NEW{$url}{LastModified} > 0) {
=178=       warn ".. too early to recheck\n" if $VERBOSE;
=179=       $NEW{$url}{Status} = $OLD{$url}{Status};
=180=     } else {
=181=       my $req = HTTP::Request->new(GET => $url);
=182=       $req->if_modified_since($NEW{$url}{LastModified});
=183=       my $res = fetch($req);
=184=       push(@TODO, $url->as_string), next if $SAVE_WANTED;
=185=       if ($res->is_success) {
=186=         warn ".. successful fetch\n" if $VERBOSE;
=187=         $base = $res->base;
=188=         $NEW{$url}{Checked} = $NEW{$url}{Good} = time;
=189=         $NEW{$url}{Base} = $base if $base ne $url;
=190=         $NEW{$url}{LastModified} = $res->last_modified || $res->date;
=191=         unless ($res->content_type =~ /text\/html/i) {
=192=           warn ".. not HTML\n" if $VERBOSE;
=193=           $NEW{$url}{Status} = "Verified (content = ".($res->content_type).")";
=194=           next;
=195=         }
=196=         $NEW{$url}{Status} = "Verified (and parsed)";
=197=         my $p = ParseLink->new;
=198=         for ($res->content =~ /(.+|\n)/g) {
=199=           $p->parse($_);
=200=           $p->new_line() if $_ eq "\n";
=201=         }
=202=         $p->parse(undef); # signal the end of parse
=203=         $links = $p->get_links; # key is relative url, value is href
=204=       } elsif ($res->code == 304) {
=205=         warn ".. not modified\n" if $VERBOSE;
=206=         $NEW{$url}{Status} = $OLD{$url}{Status};
=207=         $NEW{$url}{Checked} = $NEW{$url}{Good} = time;
=208=       } else {
=209=         warn ".. not verified\n" if $VERBOSE;
=210=         $NEW{$url}{Status} = "NOT Verified (status = ".($res->code).")";
=211=         $NEW{$url}{Checked} = time;
=212=         next if time > $NEW{$url}{Good} + $FOLLOW_GHOST;
=213=         warn ".. but following ghost links\n" if $VERBOSE;
=214=       }
=215=     }
=216=     for my $link (sort keys %$links) {
=217=       my $abs = $link;
=218=       if ($base) { # we fetched a page
=219=         $abs = URI->new_abs($link,$base);
=220=         $abs->fragment(undef); # blow away any fragment
=221=         $abs = HACK_URL($abs)->as_string;
=222=       }
=223=       warn "... $abs ($link)\n" if $VERBOSE > 1;
=224=       for my $line (sort keys %{$links->{$link}}) {
=225=         add_link($url, $line, $abs) if PARSE(URI->new($abs)) >= 0;
=226=       }
=227=     }
=228=     next;
=229=   } elsif ($parse >= 1) {
=230=     warn "Pinging $url\n" if $VERBOSE;
=231=     if (time < $NEW{$url}{Checked} + $RECHECK or
=232=         time < $NEW{$url}{Good} + $RECHECK_GOOD) {
=233=       warn ".. too early to recheck\n" if $VERBOSE;
=234=       $NEW{$url}{Status} = $OLD{$url}{Status};
=235=     } else {
=236=       my $res;
=237=       for my $method (qw(HEAD GET)) {
=238=         my $req = new HTTP::Request($method,$url);
=239=         $res = fetch($req);
=240=         push(@TODO, $url->as_string), next URL if $SAVE_WANTED;
=241=         if ($res->is_success) {
=242=           $NEW{$url}{Status} = "Verified (contents not examined)";
=243=           $NEW{$url}{Checked} = $NEW{$url}{Good} = time;
=244=           $NEW{$url}{LastModified} = 0;
=245=           next URL;
=246=         }
=247=       }
=248=       $NEW{$url}{Status} = "NOT Verified (status = ".($res->code).")";
=249=       $NEW{$url}{Checked} = time;
=250=       next;
=251=     }
=252=   } else { # $parse < 0
=253=     warn "Skipping $url\n" if $VERBOSE;
=254=     $NEW{$url}{Status} = "Skipped";
=255=     $NEW{$url}{Checked} = 0; # we no longer check this
=256=     next;
=257=   }
=258= } continue {
=259=   if ($SAVE_WANTED) {
=260=     warn "dumping data to $DATABASE...\n" if $VERBOSE;
=261=     store [\%OLD, \%NEW, \@TODO], $DATABASE;
=262=     exit 0;
=263=   }
=264=   redo if @TODO;
=265=   warn "dumping data to $DATABASE...\n" if $VERBOSE;
=266=   store [\%OLD, \%NEW, \@TODO], $DATABASE;
=267=   print "\nBEGIN REPORT at ".localtime()."\n\n";
=268=   for my $url (sort keys %NEW) {
=269=     next if $url =~ /^requested:/i;
=270=     my $entry = $NEW{$url}; # href
=271=     next unless $entry->{Checked} > $entry->{Good} + $REPORT;
=272=     my $status = $entry->{Status};
=273=     my $base = $entry->{Base};
=274=     print "$url";
=275=     print " (base $base)" if defined $base;
=276=     print ":\n status: $status\n";
=277=     for (qw(Checked Good LastModified)) {
=278=       if (my $stamp = $entry->{$_}) {
=279=         print " $_ => ".localtime($stamp)."\n";
=280=       }
=281=     }
=282=     dump_relative($url, "from", $entry->{From});
=283=     dump_relative($url, "to", $entry->{To});
=284=   }
=285=   print "\nEND REPORT\n\n";
=286=   ## redo;
=287=   exit 0;
=288= }
=289= 
=290= ## subroutines
=291= 
=292= sub add_link {
=293=   my ($from,$line,$to) = @_;
=294=   $NEW{$to}{From}{$from}{$line}++;
=295=   $NEW{$from}{To}{$to}{$line}++;
=296=   unless (defined $NEW{$to}{Status}) {
=297=     push @TODO, $to;
=298=     $NEW{$to}{Status} = "[to be probed]";
=299=   }
=300= }
=301= 
=302= sub dump_relative {
=303=   my ($url,$label,$urls) = @_;
=304=   for my $other_url (sort keys %$urls) {
=305=     my $relative = URI->new($other_url)->rel($url) || ".";
=306=     print " $label $relative at ";
=307=     print join " ", sort { $a <=> $b } keys %{$urls->{$other_url}};
=308=     print "\n";
=309=   }
=310= }
=311= 
=312= sub URI::mailto::authority { ""; } # workaround bug