Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Web Techniques Column 27 (Jul 1998)

Roughly two years ago in this column, I took a look at a basic ``link verifier'' script, using off-the-shelf LWP technology to parse HTML, locate the outward links, and recursively descend the web tree looking for bad links. Little did I know the interest I would stir -- it's become one of the most frequently referenced and downloaded scripts of all my columns! Last year, I updated the script, adding a forward-backward line-number cross-reference to the listing. But this year, I've got something even cooler!

Just recently, the new ``parallel user agent'' has come into a fairly stable implementation. This user agent works like the normal LWP user agent, but allows me to register a number of requests to be performed in parallel within a single process. This means I can scan a website in a small fraction of the time! So I decided that this year's annual update to the link checker would be to make it parallel.
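
Before wading into the listing, here's the core pattern in miniature (not part of the listing; the URLs are made up and error handling is omitted): register a handful of requests, let the agent run them all in parallel, then walk the collected responses.

    use HTTP::Request;
    use LWP::Parallel::UserAgent;

    my $pua = LWP::Parallel::UserAgent->new;
    for my $url (qw(http://www.example.com/ http://www.example.com/about.html)) {
      $pua->register(HTTP::Request->new(GET => $url));   # queue up each request
    }
    my $entries = $pua->wait(15);   # fetch everything; 15-second per-connection timeout
    for my $key (keys %$entries) {
      my $response = $entries->{$key}->response;
      print $response->request->url, ": ", $response->code, "\n";
    }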

As always, the program you're seeing here is not intended as a ``ready to run'' script, but it is in fact useful enough as-is that I'm using it to verify my website. The program is given in [Listing one, below]. This program's a long one compared to previous columns, so I'm going to be rather brief (contrary to my normal style).

Lines 1 and 2 set up the compilation environment: the path to Perl, enabling the taint checks, warnings, and all the compiler restrictions.

Lines 4 and 5 pull in two modules. The URI::URL module is found in the LWP library, and the LWP::Parallel::UserAgent module is found in the LWP::Parallel library. Both of these libraries can be installed directly from the CPAN (see http://cpan.perl.org for information on the CPAN). The qw(:CALLBACK) import list pulls in the callback constants; I'll point those out later.
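
If you don't already have these installed, the CPAN module that comes with recent Perl releases can fetch them by module name; something along these lines should do it (your mileage may vary depending on your CPAN configuration):

    perl -MCPAN -e 'install "URI::URL"'
    perl -MCPAN -e 'install "LWP::Parallel::UserAgent"'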

Lines 7 through 36 should be the only parts of the program that you'll need to configure if you're trying to use this script out of the box. As always, the program is meant to be used as inspiration for your own programs, but if you're lazy, this area is the only part you'll need to change.

Lines 9 and 10 define the @CHECK variable, which should give a list of top-level pages to start the scan. For my scan, everything I want to look at can be reached (eventually) from my company's home page at www.stonehenge.com. If you want to check areas that don't somehow link together, then list the multiple starting points here.

Lines 11 through 19 define the &PARSE routine. This routine is called by the code in the main part of the program whenever a link URL is discovered within a page being scanned. The incoming parameter is the absolute URL from the link, and it's up to this routine to determine if the URL is an ``interesting'' URL, worthy of further scanning for links. If the subroutine returns a true value, we should consider it a candidate for parsing.

Lines 13 through 17 are a single regular expression match against the head end of the URL. If the URL begins with a web address of either my company site, or the three other sites that I have some direct interest in, then the regular expression matches. Note that this is an /x match, so the embedded whitespace within the regular expression is ignored.

If the URL matches the proper prefix, there may still be portions of the address space that aren't appropriate. In fact, the regular expression in line 18 rejects those URLs within the stonehenge tree that look like the online archives of past columns I've written, along with a few other areas that are best not scanned.
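
If you're adapting this script to your own site, &PARSE can usually be much shorter. Here's a minimal sketch, with www.your-site.com and the /archives/ exclusion standing in as placeholders for whatever makes sense on your server:

    sub PARSE {                     # parse only pages on my own host
      ## $_[0] is the absolute URL
      $_[0] =~ m!^http://www\.your-site\.com(/|$)!
        and not $_[0] =~ m!/archives/!;   # skip trees not worth crawling
    }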

If a URL is not interesting enough to be parsed, we may still want to at least see if it's a valid URL. The &PING subroutine defined in lines 20 through 23 handles this, in a manner similar to the &PARSE routine. Here, I'm saying that any HTTP or FTP URL is worth validating. Note that the URL is first checked against &PARSE, so a URL that matches both will be parsed, not just ``pinged''.

Lines 24 through 30 take into consideration the possibility that there are multiple names for the same contents, such as when there are multiple names for the same host, or multiple aliased paths to the same data. Obviously, this routine is very specific to the website we're scanning, and relies on information we cannot get automatically. The URL to be examined is passed in as the only parameter, and the routine is expected to return a possibly modified URL that will actually be used. For my site, w3.stonehenge.com is the same machine as www.stonehenge.com, so I have a substitution for that. Likewise, for all normally scanned directories, a reference to index.html within a directory should be replaced by a reference to the directory itself. Finally, I have a number of invocations of my ``where are they going'' script (described in this column a few months back), and the last substitution undoes the effect of that script for these scanned URLs.
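
For your own site, the same idea applies with your own aliases plugged in. A minimal sketch (the hostnames here are placeholders) might collapse one hostname alias and the index.html references:

    sub HACK_URL {
      local $_ = shift;
      ## web.your-site.com serves the same content as www.your-site.com
      s!^http://web\.your-site\.com/!http://www.your-site.com/!;
      ## ".../index.html" and ".../" are the same page
      s!^(http://www\.your-site\.com/(.*/)?)index\.html$!$1!;
      $_;
    }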

Line 31 defines $VERBOSE, which is flipped on here, but most likely should be off when you're running this script for real. Many print statements are controlled by this variable, to let you see what's happening as it's happening.

Line 32 defines the maximum amount of text we're pulling from a text/html page while scanning it for links. Setting this number too low will make us miss some potential outbound links, while setting it too high will cause a lot of unnecessary parsing for really huge bookmark lists. (Pages that exceed this size probably should be excluded from &PARSE above anyway.) Here, I've set it to 100K, which already takes a pretty good chunk of time to parse.

Lines 33 and 34 control the ParallelUA's ``parallel'' nature. With these settings, we'll be looking at five hosts in parallel, with up to three different URLs within each host. If you make these numbers bigger, remember that you'll be taxing each server a little harder. Three per host in parallel seems fair to me.

Lines 38 to 68 define a subclass of HTML::Parser, called ParseLink. This specialized subclass will scan an HTML text string, looking for links and keeping track of line numbers. Lines 45 through 48 define an additional instance method (not found in the parent class) to increment the line number each time we see a newline. Lines 53 through 62 override a null base-class method, invoked whenever we see a start tag. Lines 57 and 58 look for the interesting tags, along with their interesting attributes. If we have something interesting, it ends up stored in the structure (documented in lines 50 through 52) via line 60. Lines 64 through 67 define a method to access the database of links for a page, once the page has been completely parsed.
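
To see ParseLink by itself, here's a tiny sketch (separate from the listing, and assuming the ParseLink package above has been compiled) that feeds it a couple of HTML fragments as if they were successive lines, then dumps the link table:

    my $p = ParseLink->new;
    for my $chunk ('<p>see <a href="faq.html">the FAQ</a></p>',
                   '<img src="logo.gif" alt="logo">') {
      $p->parse($chunk);
      $p->new_line;                 # pretend each chunk ended a line
    }
    $p->parse(undef);               # signal the end of the document
    my $links = $p->get_links;
    ## $links is now { 'faq.html' => { 1 => 1 }, 'logo.gif' => { 2 => 1 } }
    for my $link (sort keys %$links) {
      print "$link at line(s) ", join(", ", sort keys %{$links->{$link}}), "\n";
    }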

Lines 70 through 75 configure the ``user agent'' for the queries. Think of this as a ``virtual browser''. Line 71 defines the ``User Agent'' string sent to the webserver (visible in the web logs if they record that). Lines 72 through 75 set some other access parameters.

Line 80 sets up the master data structure (%URL_LIST) which holds all of our accumulated knowledge. The definition of this structure is documented in lines 82 through 87.

Lines 90 through 93 use the initial URLs defined in @CHECK to begin the scan. First, they're processed through the &add_url routine to cause them to be added to the request list for the parallel user agent. Then, they're recorded in the database as a ``requested'' URL.

Line 95 is the big (implicit) loop. The program spends nearly all of its run time executing this one statement, which tells the parallel user agent to process all known requests. When this routine returns, we've scanned all scannable items. The ``15'' is the number of seconds we're willing to wait before a particular connection is deemed ``dead''.

After processing all processable requests, we have some cleanup to do, including dumping out the data. Lines 97 through 106 cause any unfinished URL to be marked as bad, using any HTTP response code (if known) as the reason.

Lines 108 to 123 dump out the ultimate report, by walking the %URL_LIST database, printing it out in a nice pretty human-readable format. And that's the end of the main execution.
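
To give you an idea of what to expect, a single URL's entry in that report comes out looking roughly like this (the URLs and line numbers are made up, but the shape follows the print statements in lines 112 through 122):

    http://www.example.com/perl/:
      status: Verified (and parsed)
      from ../index.html at line 27
      to faq.html at line 12
      to tips.html at line 31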

The first subroutine &add_url comes next. This subroutine is given a URL, and considers it for parsing, pinging, or just ignoring. Line 129 massages the URL through the &HACK_URL routine defined in the configuration section. Line 130 rejects already seen URLs.

Lines 131 through 135 locate and act on &PARSE-able URLs, causing them to be added as a full parsed request. The callback function is defined as &callback_for_parse, defined later. Similarly, lines 136 through 140 locate and act on &PING-able URLs, setting up a &callback_for_ping routine for the callback.

If the URL is neither parsable nor pingable, it's ignored, which line 142 records.

Lines 147 through 162 define the routine that gets called for each request as it generates data that should be parsed for included URLs. The calling parameters are defined by the callback protocol. Lines 149 through 151 print a debugging trace message. Lines 152 and 153 take the newly arrived content and add it to the content accumulated in the response so far. If the total content is still within limits, and it looks like valid text/html stuff, we keep going (by returning a positive length).

However, if the content is too long, or not the right type, we fall into lines 159 through 161. Line 159 parses the content (using the subroutine defined later). Line 160 clears out the content (a minor space optimization enabling us to scan huge trees without chewing up a lot of memory). Line 161 returns a special code (one of the callback constants imported from LWP::Parallel::UserAgent at the top), letting the calling routine know that we're done with this connection.

Lines 164 through 177 define the callback for a URL for which we're just pinging the result. It's similar to the previous routine, but much simpler. If we got here, it's a valid ping, so we record that. The content in either case is unneeded, so every return from this routine tells the parallel user agent to end the connection.

Lines 179 through 220 define the fairly intricate parsing of a particular response's contents looking for URLs contained within. Lines 182 through 186 handle bad responses (like redirects or authorization-needed responses). Lines 187 through 190 handle content that isn't text/html. Lines 191 and 192 handle responses that have a weird base URL.

Most HTML pages are then parsed using the stuff in lines 194 through 199. We start by creating a new ParseLink object, and then pass it a line at a time, or a newline at a time, calling the new_line method as needed to increment the line count. When this is done, line 200 fetches the useful URL links that exit this page.

Lines 201 through 218 process each link, recording the link information on both the outbound and destination page info, in accordance with the structure for %URL_LIST defined above. Also, the link is checked as a potential additional location to visit via the invocation of &add_url in line 204.
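
Those relative labels come from URI::URL's rel method, which is worth a quick illustration on its own (made-up URLs, but the same computation line 209 performs):

    use URI::URL;
    ## express one absolute URL relative to another
    my $rel = url("http://www.example.com/perl/faq.html",
                  "http://www.example.com/perl/index.html")->rel;
    print "$rel\n";                 # prints "faq.html"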

Finally, in line 219, we show this particular web page as a good page for the report.

Wow! That was a long one. If you're still reading, you can take this program and drop it into an executable directory of your choosing, making sure you have installed both the LWP and LWP::Parallel libraries. Next, adjust the configuration parts of the script. If you don't, you'll get a nice report of how well I'm maintaining my web site -- not terribly useful to you, I'm sure. Finally, just let 'er rip, and go grab a nice soft drink. When you get back, look at the results marked ``NOT Verified'' -- those are your trouble spots that should be looked into.

You might try playing a bit with the tunable parameters, to see how high you can crank them before it stresses the current Perl signal implementation. I was seeing ``Connection reset by peer'' errors when I cranked too high, so be careful.

While I was playing with this program, I got the idea to use the Parallel UserAgent to stress-test a web server -- especially multiple accesses to the same CGI script. I've just about got something hacked out (imagine two of the Eliza doctors of last month's column talking to each other), so stay tuned for an upcoming column where I'll cover that! Enjoy!

Listing

        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     
        =4=     use URI::URL;
        =5=     use LWP::Parallel::UserAgent qw(:CALLBACK);
        =6=     
        =7=     ## begin configure
        =8=     
        =9=     my @CHECK =                     # list of initial starting points
        =10=      qw(http://XXwww.stonehenge.com/index.html);
        =11=    sub PARSE {                     # verify existence, parse for further URLs
        =12=      ## $_[0] is the absolute URL
        =13=      $_[0] =~ m!^http://XXwww\.(
        =14=            5sigma|
        =15=            perltraining|
        =16=            effectiveperl|
        =17=            stonehenge)\.com(/|$)!x
        =18=              and not $_[0] =~ /stonehenge.*(col\d\d\.html|fors|refindex)/;
        =19=    }
        =20=    sub PING {                      # verify existence, but don't parse
        =21=      ## $_[0] is the absolute URL
        =22=      $_[0] =~ m!^(http|ftp)://!;
        =23=    }
        =24=    sub HACK_URL {
        =25=      local $_ = shift;
        =26=      s!^http://w3\.stonehenge\.com/!http://www.stonehenge.com/!;
        =27=      s!^(http://www\.stonehenge\.com/(.*/)?)index\.html$!$1!;
        =28=      s!^http://www\.stonehenge\.com/cgi/go/!!;
        =29=      $_;
        =30=    }
        =31=    my $VERBOSE = 1;                # be (very) noisy
        =32=    my $MAX_CONTENT = 100_000;      # maximum content parsed in a URL
        =33=    my $MAX_HOSTS = 5;              # simultaneous host count
        =34=    my $MAX_REQ = 3;                # simultaneous request count
        =35=    
        =36=    ## end configure (no user-servicable parts below this line)
        =37=    
        =38=    BEGIN {
        =39=      package ParseLink;
        =40=      use HTML::Parser;
        =41=      use vars qw(@ISA);
        =42=    
        =43=      @ISA = qw(HTML::Parser);
        =44=    
        =45=      sub new_line {
        =46=        my $self = shift;
        =47=        $self->{Line}++;
        =48=      }
        =49=    
        =50=      ## $self->{Links} = {
        =51=      ##    "url" => { "line" => "count", "line" => "count" ... }, ...
        =52=      ## };
        =53=      sub start {                   # called by parse
        =54=        my $self = shift;
        =55=        my ($tag, $attr) = @_;
        =56=        my $link;
        =57=        $link = $attr->{href} if $tag eq "a";
        =58=        $link = $attr->{src} if $tag eq "img";
        =59=        if (defined $link) {
        =60=          $self->{Links}{$link}{$self->{Line} + 1}++;
        =61=        }
        =62=      }
        =63=    
        =64=      sub get_links {               # $instance->get_links()
        =65=        my $self = shift;
        =66=        $self->{Links};
        =67=      }
        =68=    }                               # end of ParseLink
        =69=    
        =70=    my $AGENT = new LWP::Parallel::UserAgent;
        =71=    $AGENT->agent("pverify/1.2");
        =72=    $AGENT->env_proxy;
        =73=    $AGENT->redirect(0);
        =74=    $AGENT->max_hosts($MAX_HOSTS);
        =75=    $AGENT->max_req($MAX_REQ);
        =76=    
        =77=    $| = 1;
        =78=    
        =79=    ## global database
        =80=    my %URL_LIST = ();
        =81=    ## format:
        =82=    ## $URL_LIST{"some url"} = {
        =83=    ##   Source => { "where" => "count", "where" => "count", ... },
        =84=    ##   Dest => { "where" => "count", "where" => "count", ... },
        =85=    ##   Base => "base", ## if base != url
        =86=    ##   Status => "Whatever",  ## undef if not checked yet
        =87=    ## }
        =88=    
        =89=    ## prime the pump
        =90=    for (@CHECK) {
        =91=      my $url = add_url($_);
        =92=      $URL_LIST{$url}{Source}{"[requested]"}++;
        =93=    }
        =94=    
        =95=    my $ENTRIES = $AGENT->wait(15);
        =96=    
        =97=    print "-----\n" if $VERBOSE;
        =98=    for my $response (map {$ENTRIES->{$_}->response} keys %$ENTRIES) {
        =99=      my $url = $response->request->url;
        =100=     next if ($URL_LIST{$url}{Status} || "") =~ /^[^\[]/; # we got a good one
        =101=     print
        =102=       "patching up bad status for $url: ", $response->code, "\n" if $VERBOSE;
        =103=     $URL_LIST{$url}{Status} =
        =104=       "NOT Verified (status = ".($response->code).")";
        =105=   }
        =106=   print "-----\n" if $VERBOSE;
        =107=   
        =108=   for my $url (sort keys %URL_LIST) {
        =109=     my $entry = $URL_LIST{$url};  # href
        =110=     my $status = $entry->{Status};
        =111=     my $base = $entry->{Base};
        =112=     print "$url";
        =113=     print " (base $base)" if defined $base;
        =114=     print ":\n  status: $status\n";
        =115=     my $sources = $entry->{Source};
        =116=     for my $source (sort keys %$sources) {
        =117=       print "  from $source\n";
        =118=     }
        =119=     my $dests = $entry->{Dest};
        =120=     for my $dest (sort keys %$dests) {
        =121=       print "  to $dest\n";
        =122=     }
        =123=   }
        =124=   
        =125=   ## subroutines
        =126=   
        =127=   sub add_url {
        =128=     my $url = shift;
        =129=     $url = url(HACK_URL $url)->abs->as_string;
        =130=     return $url if defined $URL_LIST{$url}{Status};
        =131=     if (PARSE $url) {
        =132=       print "Fetching $url\n" if $VERBOSE;
        =133=       $URL_LIST{$url}{Status} = "[PARSE]";
        =134=       my $request = new HTTP::Request('GET', $url);
        =135=       $AGENT->register($request,\&callback_for_parse);
        =136=     } elsif (PING $url) {
        =137=       print "Pinging $url\n" if $VERBOSE;
        =138=       $URL_LIST{$url}{Status} = "[PING]";
        =139=       my $request = new HTTP::Request('GET', $url);
        =140=       $AGENT->register($request,\&callback_for_ping);
        =141=     } else {
        =142=       $URL_LIST{$url}{Status} = "Skipped";
        =143=     }
        =144=     $url;
        =145=   }
        =146=   
        =147=   sub callback_for_parse {
        =148=     my ($content, $response, $protocol, $entry) = @_;
        =149=     print "PARSE: Handling answer from '",$response->request->url,": ",
        =150=     length($content), " bytes, Code ",
        =151=     $response->code, ", ", $response->message,"\n" if $VERBOSE;
        =152=     if (length $content) {
        =153=       $response->add_content($content);
        =154=       if (length($response->content) < $MAX_CONTENT
        =155=           and $response->content_type =~ /text\/html/i) {
        =156=         return length $content;   # go get some more
        =157=       }
        =158=     }
        =159=     parse_content_for_response($response);
        =160=     $response->content("");       # discard it (free up memory)
        =161=     return C_ENDCON;              # no more data from here
        =162=   }
        =163=   
        =164=   sub callback_for_ping {
        =165=     my ($content, $response, $protocol, $entry) = @_;
        =166=     print "PING: Handling answer from '",$response->request->url,": ",
        =167=     length($content), " bytes, Code ",
        =168=     $response->code, ", ", $response->message,"\n" if $VERBOSE;
        =169=     my $url = $response->request->url;
        =170=     if ($response->is_success) {
        =171=       $URL_LIST{$url}{Status} = "Verified (contents not examined)";
        =172=     } else {
        =173=       $URL_LIST{$url}{Status} =
        =174=         "NOT Verified (status = ".($response->code).")";
        =175=     }
        =176=     return C_ENDCON;              # ping ok, end connection
        =177=   }
        =178=   
        =179=   sub parse_content_for_response {
        =180=     my $response = shift;
        =181=     my $url = $response->request->url;
        =182=     unless ($response->is_success) {
        =183=       $URL_LIST{$url}{Status} =
        =184=         "NOT Verified (status = ".($response->code).")";
        =185=       return;
        =186=     }
        =187=     unless ($response->content_type =~ /text\/html/i) {
        =188=       $URL_LIST{$url}{Status} = "Verified (content not HTML)";
        =189=       return;
        =190=     }
        =191=     my $base = $response->base;
        =192=     $URL_LIST{$url}{Base} = $base if $base ne $url;
        =193=     
        =194=     my $p = ParseLink->new;
        =195=     for ($response->content =~ /(.+|\n)/g) {
        =196=       $p->parse($_);
        =197=       $p->new_line() if $_ eq "\n";
        =198=     }
        =199=     $p->parse(undef);             # signal the end of parse
        =200=     my $links = $p->get_links; # key is relative url, value is href
        =201=     for my $link (sort keys %$links) {
        =202=       my $abs = url($link,$base);
        =203=       $abs->frag(undef);          # blow away any frag
        =204=       $abs = add_url($abs->abs->as_string);
        =205=       print "... $abs\n" if $VERBOSE;
        =206=       ## requested url is used for forward relative xref links,
        =207=       ## but actual url after redirection is used for backwards links.
        =208=       my ($forward_rel, $backward_rel) = do {
        =209=         map { $_ || "." } url($abs, $url)->rel, url($base, $abs)->rel;
        =210=       };
        =211=       my $where = $links->{$link}; # key is line number, val is count
        =212=       for my $line (sort keys %$where) {
        =213=         $URL_LIST{$abs}{Source}{"$backward_rel at line $line"} +=
        =214=           $where->{$line};
        =215=         $URL_LIST{$url}{Dest}{"$forward_rel at line $line"} +=
        =216=           $where->{$line};
        =217=       }
        =218=     }
        =219=     $URL_LIST{$url}{Status} = "Verified (and parsed)";
        =220=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc. of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.