Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Web Techniques Column 14 (June 1997)

One of the headaches of maintaining a cool web site is verifying that the off-site links are still valid. (Hey, even making sure the on-site links are valid is tough enough.) Back in this column last September, I showed the ``hverify'' program, a tool that I was using to ensure that outbound links weren't dead.

Well, I've been tweaking the program recently, and am happy to announce the new and improved ``hverify version 2''. I've cleaned up how the links are parsed and followed, but more importantly, version 2 now gives a full cross reference, showing where links are coming from, where they are going to, and even the line on which the anchor appears. I added this because I was sometimes having trouble finding the bad link, and the line number really helped.

So, on with the new improved hverify version 2, presented in [listing one].

The first two lines begin nearly every program I write, turning on Taint-checking, warnings, and compiler restrictions.

Lines 4 through 6 pull in the LWP::UserAgent library (to allow me to fetch web pages), the HTML::Parser library (to locate references) and the URI::URL library (to convert relative links into absolute links and vice-versa).

Lines 10 through 21 define some configuration parameters. Lines 10 and 11 give the list of top-level URLs that will be examined. Here, I've pointed it at the top of my virtual web-server.

Lines 12 through 16 define a subroutine named PARSE, which will be repeatedly passed a URL, and returns 1 if the URL should be fetched and examined for further web links, or 0 otherwise. For me, I'm willing to recursively parse anything at my website, as long as it's not my references page (very hard and long to parse, and crashes Lynx), my on-line columns (because they have dead links due to the translation tool), and the ``fors'' archives.

Now, a URL that fails the PARSE routine won't be parsed for additional links, but I might still want to verify that it exists; for that, PING must return true. Here, I'm a little more liberal, allowing any web, gopher, or FTP URL to be checked, except for things at Tom Christiansen's perl.com site, if the purpose is just to go to the CPAN. (This means I would need to check the CPAN references by hand.)

So, these three configuration parameters define the scope of the verification. You'll certainly want to change them for your site, but if you run the program as-is, you'll be checking my site from your machine. Please don't do that. (And letting this thing loose on all of www.yahoo.com is likely to get you a seriously nasty phone call or worse.)
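
Just to make the customization concrete: if your pages lived on a hypothetical www.example.com server and you wanted to skip a /cgi-bin/ area, the configuration might look something like the following sketch (the hostname and path are made up for illustration):

    my @CHECK = qw(http://www.example.com/index.html);  # hypothetical starting page
    sub PARSE {                     # parse anything on our own server...
      $_[0] =~ m!^http://www\.example\.com/! and not
        $_[0] =~ m!/cgi-bin/!;      # ...except the CGI area
    }
    sub PING {                      # everything else web/FTP/gopher just gets verified
      $_[0] =~ m!^(http|ftp|gopher)://!;
    }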

Lines 25 through 53 define a sub-class of the HTML::Parser class, called ParseLink. This subclass is declared by shifting to the ParseLink package (in line 26), and then setting the appropriate @ISA variable (in line 28).

Lines 30 through 33 create an instance method (used on objects of type ParseLink) to set a Line instance variable to a particular value. This is needed so that the cross references know what line they are coming from. When we get down to using this class, I'll explain further.

Lines 38 through 47 define an instance method whose name is dictated by the HTML::Parser interface. Each time the parser stumbles across a start tag (one without a leading slash), the start method gets called. Within this method, $tag is the tag type (lowercased), and $attr is a hashref pointing to the attributes.

In particular, we're looking for ``href'' attributes in ``a'' tags, and ``src'' attributes in ``img'' tags, which lines 42 and 43 grab. If a link is present, line 45 saves it away into an instance variable called Links. The structure of this instance variable is given in the comments in lines 35 through 37: a hash with URLs as keys, each pointing at a second-level hash of line numbers on which the reference occurs (along with the count of how many times it occurs on that line, typically once).
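
So, for a hypothetical page with a link to refs.html on line 4 and two references to pics/me.gif on line 9, the instance variable would end up looking roughly like this:

    ## hypothetical contents of $self->{Links} after parsing such a page:
    ## {
    ##   "refs.html"   => { 4 => 1 },
    ##   "pics/me.gif" => { 9 => 2 },
    ## }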

Lines 49 through 52 define an instance method to pull out the Links instance variable. The caller needs to be aware of the returned structure. In a more robust interface, I'd have methods to insulate the caller from the structure of Links in case I wanted to change the structure. Not needed for this simple application, though.

Lines 55 through 57 create the ``user agent'' allowing this program to connect to various web, gopher, and FTP servers. Notice that the user agent is set to hverify/2.0, and that the environment proxy settings (if any) are honored.

Lines 61 through 69 declare the global database %URL_LIST, including (as comments) the structure of that database. For each URL being examined, we have a list of sources (what points at it), a list of destinations (where it points, if it's HTML and was parsed), the ``base'' URL (if the server redirects us or the document contains a BASE tag), and the ``status'' of the URL. Ultimately, the job of the link crawler is to set Status to something for every URL in the database. As each HTML page is examined, additional URLs may end up in the database, thus causing further checks on the next pass.
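
As a concrete (but entirely made-up) illustration, a fully processed entry might end up looking something like this:

    ## hypothetical entry after the crawl has finished:
    ## $URL_LIST{"http://www.example.com/about.html"} = {
    ##   Source => { "index.html at line 12" => 1 },
    ##   Dest   => { "pics/me.gif at line 4" => 1 },
    ##   Base   => "http://www.example.com/about/",  # present only if it differs
    ##   Status => "Verified (and parsed)",
    ## };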

Lines 72 through 74 ``prime the pump''. Starting at the URLs given in @CHECK, we set the Source attribute to [requested], meaning that the reason we looked here is because it was in the initial list. This shows up in the cross reference listing as such.

Once the initial database is set up, we can walk it. It's a big ``forever'' loop in lines 78 through 150. Line 79 determines whether we have something to do, or whether we're done, by getting all the incomplete URLs from the database. Line 80 bails if that list is empty. Otherwise, lines 82 through 149 process each URL in turn.
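
If the shape of that loop looks odd, remember that in Perl a bare block is a loop that runs exactly once, so last and redo work inside it. Here's a tiny self-contained sketch (with arbitrary numbers) of the same ``keep going until nothing is pending'' pattern:

    my %status = (3 => undef);      # undef means "not yet processed"
    {
      my @todo = grep { !defined $status{$_} } keys %status;
      last unless @todo;            # nothing pending, so fall out of the block
      for my $n (@todo) {
        $status{$n} = "done";       # mark it handled...
        $status{$n - 1} = undef if $n > 0;  # ...and maybe create new work
      }
      redo;                         # go around again
    }
    print join(",", sort { $a <=> $b } keys %status), "\n";  # prints 0,1,2,3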

If the URL qualifies for a full parsing (via the PARSE routine returning true in line 83), then we need to fetch it, see if it's HTML, and if so, look for additional links. Lines 84 through 91 handle the fetching part.

Line 85 creates a request, based on a GET of the target URL. Line 86 actually performs the web access. Lines 87 to 91 notice if the request failed for some reason, and if so, note that as the Status of the particular URL. This marks the URL so that it won't be attempted any more, and if this was a link, it's now a bad link.

Lines 92 through 95 discriminate the response between HTML and everything else. It wouldn't make sense to try to parse a GIF file for HTML links. Again, the status is noted in the database if necessary.

If we make it to line 96, we've got a good HTML page, ready to examine for additional links. Line 96 computes the ``base'' URL, which may be different from the requested URL. The web rules say that relative URLs must be computed against the base URL, so we need to know that. Line 97 records this base URL for later display if it differs from the requested URL.
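
To see why the base matters, suppose (hypothetically) that a page is requested as http://www.example.com/dir but the server redirects to http://www.example.com/dir/; a relative link of pics/me.gif inside it has to be resolved against the redirected base. URI::URL does exactly that:

    use URI::URL;
    my $base = "http://www.example.com/dir/";   # hypothetical base after a redirect
    my $abs  = url("pics/me.gif", $base)->abs;  # relative link + base => absolute URL
    print "$abs\n";                 # http://www.example.com/dir/pics/me.gif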

Lines 98 through 110 create and use a ParseLink object to parse the HTML. This object is created in line 98. Line 100 splits up the HTML into lines, so that we can note the line on which each reference is located. Line 101 sets up a line counter.
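
The trick on line 100 is a list-context global match: each match grabs one line with its newline still attached (a plain split on newline would throw the newlines away). A tiny demonstration, with invented HTML:

    my $html = "<title>Hi</title>\n<a href=\"refs.html\">refs</a>\n";
    my @lines = $html =~ /(.*\n?)/g;
    ## @lines is ("<title>Hi</title>\n", "<a href=\"refs.html\">refs</a>\n", "")
    ## (the trailing empty match is harmless; parsing "" is a no-op)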

Lines 103 through 108 loop through each line. For each line, we need to tell the ParseLink object which line we are looking at. This is handled in line 104, which passes the local $line variable into the set_line method. All calls to the start method can then know exactly what line we were on. Line 105 calls the parse method, which in turn will call start for all start tags on that line, which in turn will record all src and href attributes as appropriate. Simple, but elegant.

Line 110 notifies the parser that the content is complete, and is a necessary part of the parsing protocol.

Line 111 reaches into the ParseLink object and pulls out the Links instance variable, containing the result of all those calls to start. We then walk through the result in lines 112 through 127.
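
Pulled out of context, the whole ParseLink dance on content already in memory would look roughly like this (the two HTML lines are invented for illustration):

    my $p = ParseLink->new;
    my $line = 1;
    for ("<a href=\"refs.html\">my refs</a>\n", "<img src=\"pics/me.gif\">\n") {
      $p->set_line($line++);        # so start() knows which line it's on
      $p->parse($_);                # start() records any href or src it sees
    }
    $p->parse(undef);               # end-of-document signal
    my $links = $p->get_links;
    ## $links is { "refs.html" => { 1 => 1 }, "pics/me.gif" => { 2 => 1 } }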

Line 112 gets each URL piece into $link. This piece needs to be made absolute, using the base of the page as the starting point. That's handled in line 113. Lines 114 and 115 are comments to myself to remind me that I chose to use the requested URL for the forward relative links, and the base URL for the reverse links.

Lines 116 through 119 compute relative links for the cross-reference output, which, I discovered rather rapidly, looks much better than absolute links. The horsing around with $^W is to work around a bug in LWP 5.07, which will hopefully be fixed by the time this column is seen in hardcopy. $forward_rel is then the relative URL leading from the page to some other page, and $backward_rel is the link that might have gotten us here.
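
The rel method is simply the inverse of abs: given an absolute URL and a base, it produces a relative path from the base to the target, which is what keeps the cross-reference readable. For instance (again with made-up URLs):

    use URI::URL;
    my $abs = "http://www.example.com/dir/pics/me.gif";
    my $rel = url($abs, "http://www.example.com/dir/index.html")->rel;
    print "$rel\n";                 # pics/me.gif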

Line 120 grabs the line number hash corresponding to this link, which we then have to walk to add the appropriate cross reference entries. The reverse references are added in line 122, which also causes the target URL to be noticed for following later (unless it has already been seen). The forward references are added in line 124.

Line 128 notes the updated status, causing this URL to be omitted from further processing.

Lines 131 through 145 handle the processing similarly for ``ping-only'' URLs. If we can fetch it, line 138 notes that. If not, line 143 notes the failure.
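
The ping check tries a cheap HEAD request first and falls back to a full GET, presumably because not every server (or CGI program) answers HEAD properly. Stripped down to its core, the pattern looks like this sketch (the URL is hypothetical, and $ua is a fresh user agent here so the snippet stands alone):

    use LWP::UserAgent;
    use HTTP::Request;
    my $ua = LWP::UserAgent->new;
    my $alive = 0;
    for my $method (qw(HEAD GET)) { # cheap check first, full fetch as a fallback
      my $response = $ua->request(HTTP::Request->new($method, "http://www.example.com/"));
      if ($response->is_success) { $alive = 1; last }
    }
    print $alive ? "alive\n" : "dead\n";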

If a URL is neither ping-only nor requested for parsing, line 146 notes that.

Whew! A lot of work just to build up the database. But in the process, we've discovered all the links, recursed through all the sublinks, and also noticed the dead links. Now, it's time to print out the result, in lines 152 through 167.

Line 152 walks $url through the list of the keys of %URL_LIST. For each element of that list, we further extract the hash reference into $entry, and from there, get both the parse/ping status and base (if applicable).

Lines 156 to 158 print a banner for each URL, giving the URL itself, the base (if not the same as the URL) and the parse/ping status.

Lines 159 to 162 walk through the ``from'' links -- URLs that point at this URL. The links are given relative to the page, which is pretty easy reading. Similarly, lines 163 to 166 give the ``to'' links -- URLs that this page contains. Only parsed HTML pages will have ``to'' links, but all kinds of URLs can have ``from'' links.
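
Put together, each entry of the report comes out looking something like this (the URL and line numbers are invented for illustration):

    http://www.example.com/about.html:
      status: Verified (and parsed)
      from index.html at line 12
      to pics/me.gif at line 4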

And there you have it. Verifying the links in a web tree, along with a good cross reference in both directions. A pretty handy tool, even if you can't specify the starting point from the command line. (Maybe in a future version, eh?) Until next time, keep your links valid...

Listings

        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     
        =4=     use LWP::UserAgent;
        =5=     use HTML::Parser;
        =6=     use URI::URL;
        =7=     
        =8=     ## begin configure
        =9=     
        =10=    my @CHECK =                     # list of initial starting points
        =11=      qw(http://www.stonehenge.com/index.html);
        =12=    sub PARSE {                     # verify existence, parse for further URLs
        =13=      ## $_[0] is the absolute URL
        =14=      $_[0] =~ m!^http://www\.stonehenge\.com/! and not
        =15=        $_[0] =~ /refindex|col\d\d\.html|fors/;
        =16=    }
        =17=    sub PING {                      # verify existence, but don't parse
        =18=      ## $_[0] is the absolute URL
        =19=      $_[0] =~ m!^(http|ftp|gopher)://! and not
        =20=        $_[0] =~ m!perl\.com/CPAN/!; # presume all CPAN refs are good
        =21=    }
        =22=    
        =23=    ## end configure (no user-serviceable parts below this line)
        =24=    
        =25=    BEGIN {
        =26=      package ParseLink;
        =27=    
        =28=      @ParseLink::ISA = qw(HTML::Parser);
        =29=    
        =30=      sub set_line {                # $instance->set_line(nnn)
        =31=        my $self = shift;
        =32=        $self->{Line} = shift;
        =33=      }
        =34=    
        =35=      ## $self->{Links} = {
        =36=      ##    "url" => { "line" => "count", "line" => "count" ... }, ...
        =37=      ## };
        =38=      sub start {                   # called by parse
        =39=        my $self = shift;
        =40=        my ($tag, $attr) = @_;
        =41=        my $link;
        =42=        $link = $attr->{href} if $tag eq "a";
        =43=        $link = $attr->{src} if $tag eq "img";
        =44=        if (defined $link) {
        =45=          $self->{Links}{$link}{$self->{Line}}++;
        =46=        }
        =47=      }
        =48=    
        =49=      sub get_links {               # $instance->get_links()
        =50=        my $self = shift;
        =51=        $self->{Links};
        =52=      }
        =53=    }                               # end of ParseLink
        =54=    
        =55=    my $ua = new LWP::UserAgent;
        =56=    $ua->agent("hverify/2.0");
        =57=    $ua->env_proxy;
        =58=    
        =59=    $| = 1;
        =60=    
        =61=    ## global database
        =62=    my %URL_LIST = ();
        =63=    ## format:
        =64=    ## $URL_LIST{"some url"} = {
        =65=    ##   Source => { "where" => "count", "where" => "count", ... },
        =66=    ##   Dest => { "where" => "count", "where" => "count", ... },
        =67=    ##   Base => "base", ## if base != url
        =68=    ##   Status => "Whatever",  ## undef if not checked yet
        =69=    ## }
        =70=    
        =71=    ## prime the pump
        =72=    for (@CHECK) {
        =73=      $URL_LIST{$_}{Source}{"[requested]"}++;
        =74=    }
        =75=    
        =76=    ## now walk it
        =77=    
        =78=    {
        =79=      my @this_time = grep !defined $URL_LIST{$_}{Status}, keys %URL_LIST;
        =80=      last unless @this_time;
        =81=     URL:
        =82=      for my $url (@this_time) {
        =83=        if (PARSE $url) {
        =84=          ## print "Fetching $url\n";
        =85=          my $request = new HTTP::Request('GET', $url);
        =86=          my $response = $ua->request($request); # fetch!
        =87=          unless ($response->is_success) {
        =88=            $URL_LIST{$url}{Status} =
        =89=              "NOT Verified (status = ".($response->code).")";
        =90=            next URL;
        =91=          }
        =92=          unless ($response->content_type =~ /text\/html/i) {
        =93=            $URL_LIST{$url}{Status} = "Verified (content not HTML)";
        =94=            next URL;
        =95=          }
        =96=          my $base = $response->base;
        =97=          $URL_LIST{$url}{Base} = $base if $base ne $url;
        =98=          my $p = ParseLink->new;
        =99=          {
        =100=           my @content = $response->content =~ /(.*\n?)/g;
        =101=           my $line = 1;
        =102=           {
        =103=             last unless @content;
        =104=             $p->set_line($line);  # tell it the line number
        =105=             $p->parse(shift @content); # and parse it
        =106=             $line++;
        =107=             redo;
        =108=           }
        =109=         }
        =110=         $p->parse(undef);         # signal the end of parse
        =111=         my $links = $p->get_links; # key is relative url, value is href
        =112=         for my $link (sort keys %$links) {
        =113=           my $abs = url($link, $base)->abs;
        =114=           ## requested url is used for forward relative xref links,
        =115=           ## but actual url after redirection is used for backwards links.
        =116=           my ($forward_rel, $backward_rel) = do {
        =117=             local ($^W) = 0;      # workaround for buglet
        =118=             map { $_ || "." } url($abs, $url)->rel, url($base, $abs)->rel;
        =119=           };
        =120=           my $where = $links->{$link}; # key is line number, val is count
        =121=           for my $line (sort keys %$where) {
        =122=             $URL_LIST{$abs}{Source}{"$backward_rel at line $line"} +=
        =123=               $where->{$line};
        =124=             $URL_LIST{$url}{Dest}{"$forward_rel at line $line"} +=
        =125=               $where->{$line};
        =126=           }
        =127=         }
        =128=         $URL_LIST{$url}{Status} = "Verified (and parsed)";
        =129=         next URL;
        =130=       }
        =131=       if (PING $url) {
        =132=         ## print "Verifying $url\n";
        =133=         my $response;
        =134=         for my $method (qw(HEAD GET)) {
        =135=           my $request = new HTTP::Request($method,$url);
        =136=           $response = $ua->request($request); # fetch!
        =137=           if ($response->is_success) {
        =138=             $URL_LIST{$url}{Status} = "Verified (contents not examined)";
        =139=             next URL;
        =140=           }
        =141=         }
        =142=         $URL_LIST{$url}{Status} =
        =143=           "NOT Verified (status = ".($response->code).")";
        =144=         next URL;
        =145=       }
        =146=       $URL_LIST{$url}{Status} = "Skipped";
        =147=       next URL;
        =148=     }
        =149=     redo;
        =150=   }
        =151=   
        =152=   for my $url (sort keys %URL_LIST) {
        =153=     my $entry = $URL_LIST{$url};  # href
        =154=     my $status = $entry->{Status};
        =155=     my $base = $entry->{Base};
        =156=     print "$url";
        =157=     print " (base $base)" if defined $base;
        =158=     print ":\n  status: $status\n";
        =159=     my $sources = $entry->{Source};
        =160=     for my $source (sort keys %$sources) {
        =161=       print "  from $source\n";
        =162=     }
        =163=     my $dests = $entry->{Dest};
        =164=     for my $dest (sort keys %$dests) {
        =165=       print "  to $dest\n";
        =166=     }
        =167=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.