Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Download this listing!

Linux Magazine Column 52 (Oct 2003)

[suggested title: ``Using Class::DBI for a Link Checker'']

Web-site maintenance often includes the rather mundane but important task of making sure that the carefully-placed page links actually go somewhere useful. Over the many years as a Perl columnist, I've written many different versions, evolving as my Perl mastery gets better, and as the tools get more powerful.

I was recently inspired by a lot of buzz I've been hearing about the rapidly developing Class::DBI module: a simple-but-powerful mechanism by which objects can be mapped to SQL-based databases, and the DBD::SQLite module, which puts a full SQL-based transaction-supporting database into a single file of my chosing, without the complexity of running a separate server. These two modules looked like the perfect components for my next major rewrite of the ever-improving link checker.

After a dozen hours or so trying to get my head around these new modules (including debugging things at about seven layers of indirection), I have a workable alpha-version of a new link checker. The technology is similar to the version I wrote for this very column some three years ago, so you might want to refer back to the Sep 2000 issue for comparison. The key features include:

Managed parallel processes to overlap the I/O and DNS lookups
On-disk storage of partial state, to resume after an interruption
Proper management of if-modified-since so that unchanged pages are not refetched and reparsed for links
Flexible reporting control
Flexible mapping of known aliased URLs, such as index.html being replaced by merely a trailing slash
Flexible decisions for URLs, deciding whether to descend, check for existence, merely note, or just ignore

And all this in the 422-line program provided in [listing one, below]. Because the program is fairly long, I'll be hitting only the highlights instead of my usual rambling line-by-line description.

Lines 1 through 7 provide my standard program preamble, including pulling in the expected modules and setting up constants needed for the configuration section. The sigtrap module is used here to provide a death handler for any signals, which is needed so that DBI shuts down properly.

Lines 9 through 61 provide all the configuration options. From here, you decide the behavior and limits of the link checking. Most of the comments indicate the specific values, but let me point out some of the more interesting parts.

If $REPORT is -1, we get a complete dump of all pages seen and their cross-referencing information (inbound and outbound links) and status. When it's 0 as it is here, we get a dump only of all pages that didn't come through as a successful HTTP status. We can also increase this to a number of seconds (say 1 * DAYS), in which case we'll only start hearing about a bad page when it's been bad for at least that long. This helps with intermittent server flakiness.

The @CHECK array provides the starting point (or points) for this web walk. If all pages we want to check are reachable from a given URL, we need only list it here, and all the others will be found. (I've obscured the real URL here: I think it'll be obvious what I was testing against.)

The HACK_URL subroutine takes a URI object and returns a possibly-modified URI object to represent any known aliasing. Here, I'm removing index.html if present, so that those links will appear only as a trailing slash. Obviously, this is very server dependent, and should be constructed with care. When in doubt, start by merely returning the input parameter, and if you notice the same page being fetched twice under different names, use the subroutine to rewrite one into the other.

The PARSE subroutine takes a hacked URL in a URI object, and decides how far to go with it. If PARSE returns 2, the page is fetched, and if the response is HTML, the HTML is parsed for further links. (If you did this to every URL, you'd eventually see the entire web, but it will take quite a while!) If PARSE returns 1, then the URL is ``pinged'' by fetching just enough to figure out if it's good or not. If PARSE returns 0, then the URL is noted for the cross-reference table, but is otherwise ignored. If PARSE returns -1, then the URL is completely ignored. Be very careful writing this routine: scanning through a lot of dynamically generated content is guaranteed to upset the webmaster.

Lines 63 to 166 implement the table classes and base classes as suggested in the Class::DBI documentation. My::DBI (in lines 65 to 99) defines the database linkage itself, and defines methods inherited by all table classes.

Lines 69 and 70 set the database. Here, I'm using DBD::SQLite to create a single file at $DATABASE, which is set earlier to the file .linkchecker in my home directory. I also override AutoCommit so that most transactions are executed on a statement-by-statement basis.

Lines 72 to 78 create the tables of the database if they don't already exist, by calling methods within each table class. If the tables already exist, the error is trapped and ignored. Because the table-creation methods are kept within the program, I can start this script with no advance preparation (other than installing all the necessary modules).

Lines 80 to 98 are derived from the Class::DBI documentation as a mechanism to wrap a code reference within a transaction, automatically aborting the transaction if something goes wrong. I added the ability for the subroutine to return back a response value, which simplified some of the code, as shown later.

Lines 101 to 116 define My::Link, an object class that is mapped to the link table within the database, as declared in line 105. The link table records the links between pages from the src URL to the dst URL, creating a many-to-many relationship amongst the pages.

The SQL to create the link table is defined in lines 106 through 112. __TABLE__ is replaced with link, but by doing it this way, we can derive from this class and inherit this constructor.

Lines 113 through 115 define the columns for Class::DBI's benefit, including the proper object type for retrieval, as well as creating accessor methods for the columns.

Lines 118 to 165 define My::Page, mapped to the page table (as defined in line 123). Most of the heavy lifting happens here.

Using the enum pragma (found in the CPAN), we get four integer constants within this package for the various states in which a page record might exist. State_todo means we need to process this page. State_working means that one of the child processes has started working on the page, but has not yet completed it. State_done means we're done. State_unseen is for a page that has been left over from the previous run, retaining its modification stamp and outbound links to minimize the work on each new run. At the end of a new run, all unseen pages are finally deleted.

Lines 124 to 133 define the SQL to create the table. Note the interpolation of the State_unseen into the double-quoted string.

Lines 134 and 135 declare the table columns to Class::DBI. The first such column is automatically recognized as the primary key.

Lines 137 and 138 define an inverse relationship, namely that link records are referencing page records. By declaring this explicitly, many joins are automatically generated for us. Additionally, deleting a particular My::Page record also has the ``cascading delete'' feature of deleting any inbound or outbound links automatically, minimizing the need to write explicit code for that.

Lines 140 to 149 change a page from State_todo to State_working within a transaction. This is needed because all of the children will be looking at the same ``todo'' list, and picking an item to do. Only one child can be successful at this, and because we've wrapped it in a transaction, it works nicely. The subroutine returns a true/false value to indicate whether it successfully snagged the particular My::Page object for this process to work on.

Lines 151 to 164 similarly add a page to the todo list, again within a transaction to prevent other children from interfering. Only new records or previously unseen records (left over from a previous pass) can be set to State_todo. The new (or reused) My::Page object is returned.

Lines 167 to 175 create an LWP::UserAgent virtual browser to fetch the pages, including giving it a distinguished agent string so we can filter it out of our web hit logs.

The main code begins in line 177. Line 180 creates the tables if needed.

Lines 183 to 186 reset any working pages to todo pages. This happens when the program is interrupted before completing a pass, and will re-trigger any requests that might have been aborted before completion.

Lines 188 to 194 ``primes the pump''. If there aren't any todo or done items, then we need to start somewhere, namely at the @CHECK URLs.

Lines 196 to 213 use the kids_do subroutine (defined later) to perform the task given by the first coderef, in parallel by forking up to a maximum of the lesser of $KIDMAX or whatever number the second coderef returns. Each kid repeatedly grabs all the to-do records, picks one at random, and tries to ``own'' it. If the ``owning'' is successful, do_one_page is called (defined later). The cycle is repeated until nothing shows up in the to-do list, at which point the child exits.

Once the kids have done their work, lines 215 and 216 remove any pages that weren't seen on this pass.

Lines 218 to 248 display the report. Pages that aren't ``bad enough'' are rejected from being displayed in line 222. Information about the page itself gets printed in lines 223 through 230, including the status and timestamps.

Lines 232 through 246 print the cross-reference table, showing all the inbound links (useful if the page is not successful to find out where it was referenced) and outbound links (useful when we're editing a particular page to find all bad links). The links are generally relative, unless they go up two or more levels, in which case they're made absolute from / instead, which seemed to be a good compromise for display.

Lines 250 to 254 set all the State_done pages to State_unseen, for the next pass. And that's all there is for the main program.

Lines 260 to 291 set up the processing of a particular page. Most of the logic decides whether this is a ``parse'' or a ``ping'', and how much information to use (if any) from a previous pass. Note how the mapping of the $page object to the table eliminates any need for me to write SQL code: I'm simply using $page as if it was an in-memory object.

Lines 293 to 378 form the core of the actions. As this code is very similar to this column's September 2000 code, I'll skip the detailed description. The page is fetched (using an LWP callback), content-type and status is noted, links are added, and the page record is updated. The two big differences between this code and previous code is that I'm using HTML::LinkExtor to derive the links, rather than my handwritten code from before, and I'm using objects instead of hashes to store the data.

When a link is seen, add_link (lines 380 to 397) is called to record the link in the database. If a ``base'' is provided, it's used to make a relative link into an absolute link. The links are ``hacked'' to make them canonical, and any ignorable link is ignored. As the links are getting created, there might be multiple links between the same pair of pages, causing the create to throw an exception, as triggered by the SQLite's constraint on the table.

Lines 399 to 440 manage the kids, using code lifted mostly from the August 2000 column, and is better described there. The job here is greatly simplified, however, because we don't need to do anything to the kids other than start them: they figure out their own tasks directly from the database.

Although I've run up against a deadline for this issue (in both space and time), I can see that I really need to create a ``count todo'' method for My::Page, as I'm doing it the hard way in more than one place. Also, I'm sure there's some fine-tuning of some of the code steps, but at least this code runs as-is and runs well. Have fun keeping your website clean of all those nasty broken links. Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $| = 1;
        =4=     use sigtrap qw(die untrapped);
        =5=     
        =6=     use URI;
        =7=     use constant DAYS => 86400;     # for specifications
        =8=     
        =9=     ## configuration constants
        =10=    
        =11=    my $DATABASE = (glob "~/.linkchecker")[0];
        =12=    
        =13=    my $VERBOSE = 1;                # 0 = quiet, 1 = noise, 2 = lots of noise
        =14=    
        =15=    my $RECHECK = 0.1 * DAYS;       # seconds between rechecking any URL
        =16=    my $RECHECK_GOOD = 1 * DAYS;    # seconds between rechecking good URLs
        =17=    my $REPORT = 0 * DAYS;          # seconds before bad enough to report
        =18=    
        =19=    my $FOLLOW_REDIRECT = 1;        # follow a redirect as if it were a link
        =20=    my $TIMEOUT = 30;               # timeout on fetch (hard timeout is twice this)
        =21=    my $MAXSIZE = 1048576;          # max size for fetch (undef if fetch all)
        =22=    
        =23=    my $KIDMAX = 5;                 # how many kids to feed
        =24=    
        =25=    my @CHECK =                     # list of initial starting points
        =26=      qw(http://www.perl.borg/);
        =27=    
        =28=    sub PARSE {
        =29=      ## return 2 to parse if HTML
        =30=      ## return 1 to merely verify existence
        =31=      ## return 0 to not even verify existence, but still xref
        =32=      ## return -1 to ignore entirely
        =33=      my $url = shift;              # URI object (absolute)
        =34=      for ($url->scheme) {
        =35=        return 0 unless /^http$/;
        =36=      }
        =37=      for ($url->query) {
        =38=        return -1 if /^C=[DMNS];O=[AD]/; # silly mod_index
        =39=      }
        =40=      for ($url->host) {
        =41=        if (/www\.perl\.borg$/) {
        =42=          for ($url->path) {
        =43=            return 0 if /images|photos/; # boring
        =44=            return 0 if /^\/(tpc|yapc)\/.*(199[89]|200[012])/; # old
        =45=          }
        =46=          return 2;                 # default www.perl.borg
        =47=        }
        =48=    
        =49=        return 0 if /use\.perl\.borg$/;
        =50=        
        =51=      }
        =52=      return 1;                   # ping the world
        =53=    }
        =54=    
        =55=    sub HACK_URL {
        =56=      my $url = shift;              # URI object
        =57=      $url->path("$1") if $url->path =~ /^(.*\/)index\.html$/s;
        =58=      $url->canonical;
        =59=    }
        =60=    
        =61=    ## end configuration constants
        =62=    
        =63=    ### internally-defined classes
        =64=    
        =65=    {
        =66=      package My::DBI;
        =67=      use base 'Class::DBI';
        =68=    
        =69=      __PACKAGE__->set_db('Main', "dbi:SQLite:dbname=$DATABASE", undef, undef,
        =70=                            {AutoCommit => 1});
        =71=    
        =72=      sub CONSTRUCT {
        =73=        my $class = shift;
        =74=        for (qw(My::Page My::Link)) {
        =75=          eval { $_->sql_CONSTRUCT->execute };
        =76=          die $@ if $@ and $@ !~ /already exists/;
        =77=        }
        =78=      }
        =79=    
        =80=      sub atomically {
        =81=        my $class = shift;
        =82=        my $action = shift;         # coderef
        =83=        local $class->db_Main->{AutoCommit}; # turn off AutoCommit for this block
        =84=    
        =85=        my @result;
        =86=        eval {
        =87=          @result = wantarray ? $action->() : scalar($action->());
        =88=          $class->dbi_commit;
        =89=        };
        =90=        if ($@) {
        =91=          warn "atomically got error: $@";
        =92=          my $commit_error = $@;
        =93=          eval { $class->dbi_rollback };
        =94=          die $commit_error;
        =95=        }
        =96=        die $@ if $@;
        =97=        wantarray ? @result : $result[0];
        =98=      }
        =99=    }
        =100=   
        =101=   {
        =102=     package My::Link;
        =103=     our @ISA = qw(My::DBI);
        =104=   
        =105=     __PACKAGE__->table('link');
        =106=     __PACKAGE__->set_sql(CONSTRUCT => <<'SQL');
        =107=   CREATE TABLE __TABLE__ (
        =108=     src TEXT,
        =109=     dst TEXT,
        =110=     PRIMARY KEY (src, dst)
        =111=   )  
        =112=   SQL
        =113=     __PACKAGE__->columns(Primary => qw(src dst));
        =114=     __PACKAGE__->has_a(src => 'My::Page');
        =115=     __PACKAGE__->has_a(dst => 'My::Page');
        =116=   }
        =117=   
        =118=   {
        =119=     package My::Page;
        =120=     our @ISA = qw(My::DBI);
        =121=     use enum qw(:State_ unseen todo working done);
        =122=   
        =123=     __PACKAGE__->table('page');
        =124=     __PACKAGE__->set_sql(CONSTRUCT => <<"SQL");
        =125=   CREATE TABLE __TABLE__ (
        =126=     location TEXT PRIMARY KEY,
        =127=     state INT DEFAULT @{[State_unseen]},
        =128=     last_status TEXT,
        =129=     last_checked INT,
        =130=     last_good INT,
        =131=     last_modified INT
        =132=   )  
        =133=   SQL
        =134=     __PACKAGE__->columns(All => qw(location state last_status
        =135=                                    last_checked last_good last_modified));
        =136=   
        =137=     __PACKAGE__->has_many(inbound => 'My::Link', 'dst', { sort => 'src' });
        =138=     __PACKAGE__->has_many(outbound => 'My::Link', 'src', { sort => 'dst' });
        =139=   
        =140=     sub make_working_atomically {
        =141=       my $self = shift;
        =142=   
        =143=       $self->atomically(sub {
        =144=                           $self->state == State_todo or return undef;
        =145=                           $self->state(State_working);
        =146=                           $self->update;
        =147=                           return 1;
        =148=                         });
        =149=     }
        =150=   
        =151=     sub create_or_make_todo {
        =152=       my $class = shift;
        =153=       my $location = shift;
        =154=   
        =155=       $class->atomically(sub {
        =156=                            my $item = $class->find_or_create({location => $location});
        =157=                            if ((not defined($item->state)
        =158=                                 or $item->state == State_unseen)) {
        =159=                              $item->state(State_todo);
        =160=                              $item->update;
        =161=                            }
        =162=                            $item;
        =163=                          });
        =164=     }
        =165=   }
        =166=   
        =167=   {
        =168=     use LWP::UserAgent;
        =169=     my $AGENT = LWP::UserAgent->new;
        =170=     $AGENT->agent("linkchecker/0.42 " . $AGENT->agent);
        =171=     $AGENT->env_proxy;
        =172=     $AGENT->timeout($TIMEOUT);
        =173=   
        =174=     sub fetch { $AGENT->simple_request(@_) }
        =175=   }
        =176=   
        =177=   ### main code begins here
        =178=   
        =179=   ## initialize database if needed
        =180=   My::DBI->CONSTRUCT;
        =181=   
        =182=   ## reset all working to todo
        =183=   for my $page (My::Page->search(state => My::Page::State_working)) {
        =184=     $page->state(My::Page::State_todo);
        =185=     $page->update;
        =186=   }
        =187=   
        =188=   ## unless any are todo or finished, prime the pump
        =189=   unless (() = My::Page->search(state => My::Page::State_todo)
        =190=           or () = My::Page->search(state => My::Page::State_done)) {
        =191=     print "Starting a new run...\n";
        =192=     My::Page->create_or_make_todo(HACK_URL(URI->new($_))->as_string)
        =193=         for @CHECK;
        =194=   }
        =195=   
        =196=   ## main loop, done by kids:
        =197=   kids_do(sub {                   # the task
        =198=             srand;                # spin random number generator uniquely
        =199=             while (my @todo = My::Page->search(state => My::Page::State_todo)) {
        =200=               my $page = $todo[rand @todo]; # pick one at random
        =201=               unless($page->make_working_atomically) {
        =202=                 # someone else got it
        =203=                 print "$$ wanted ", $page->location, "\n" if $VERBOSE;
        =204=                 next;
        =205=               }
        =206=               ;
        =207=               print "$$ doing ", $page->location, "\n" if $VERBOSE > 1;
        =208=               do_one_page($page);
        =209=             }
        =210=           },
        =211=           sub {                   # max kids needed
        =212=             scalar(() = My::Page->search(state => My::Page::State_todo));
        =213=           });
        =214=   
        =215=   ## clean out any unseen at this point (no longer needed)
        =216=   $_->delete for My::Page->search(state => My::Page::State_unseen);
        =217=   
        =218=   ## display report
        =219=   print "*** BEGIN REPORT ***\n";
        =220=   for my $page (My::Page->search(state => My::Page::State_done,
        =221=                                  {order_by => 'location'})) {
        =222=     next if $page->last_checked <= $page->last_good + $REPORT;
        =223=     my $url = URI->new($page->location);
        =224=     print "$url:\n";
        =225=     print "  Status: ", $page->last_status, "\n";
        =226=     for (qw(checked good modified)) {
        =227=       my $method = "last_$_";
        =228=       my $value = $page->$method() or next;
        =229=       print "  \u\L$_\E: ".localtime($value)."\n";
        =230=     }
        =231=   
        =232=     for my $inbound ($page->inbound) {
        =233=       my $inbound_page = $inbound->src;
        =234=       my $inbound_url = URI->new($inbound_page->location);
        =235=       my $rel = $inbound_url->rel($url);
        =236=       $rel = $inbound_url->path_query if $rel =~ /^\.\.\/\.\./;
        =237=       print "  from $rel\n";
        =238=     }
        =239=     for my $outbound ($page->outbound) {
        =240=       my $outbound_page = $outbound->dst;
        =241=       my $outbound_url = URI->new($outbound_page->location);
        =242=       my $rel = $outbound_url->rel($url);
        =243=       $rel = $outbound_url->path_query if $rel =~ /^\.\.\/\.\./;
        =244=       my $outbound_status = $outbound_page->last_status;
        =245=       print "  to $rel: $outbound_status\n";
        =246=     }
        =247=   }
        =248=   print "*** END REPORT ***\n";
        =249=   
        =250=   ## reset for next pass
        =251=   for my $page (My::Page->search(state => My::Page::State_done)) {
        =252=     $page->state(My::Page::State_unseen);
        =253=     $page->update;
        =254=   }
        =255=   
        =256=   exit 0;
        =257=   
        =258=   ### subroutines
        =259=   
        =260=   sub do_one_page {
        =261=     my $page = shift;             # My::Page
        =262=   
        =263=     my $url = URI->new($page->location);
        =264=     my $parse = PARSE($url);
        =265=     if ($parse >= 2) {
        =266=       print "Parsing $url\n" if $VERBOSE;
        =267=       if (time < ($page->last_checked || 0) + $RECHECK or
        =268=           time < ($page->last_good || 0) + $RECHECK_GOOD) {
        =269=         print "$url: too early to reparse\n" if $VERBOSE;
        =270=         ## reuse existing links
        =271=         My::Page->create_or_make_todo($_->dst->location) for $page->outbound;
        =272=       } else {
        =273=         parse_or_ping($page, $url, "PARSE");
        =274=       }
        =275=     } elsif ($parse >= 1) {
        =276=       print "Pinging $url\n" if $VERBOSE;
        =277=       if (time < ($page->last_checked || 0) + $RECHECK or
        =278=           time < ($page->last_good || 0) + $RECHECK_GOOD) {
        =279=         print "$url: too early to reping\n" if $VERBOSE;
        =280=         $_->delete for $page->outbound; # delete any existing stale links
        =281=       } else {
        =282=         parse_or_ping($page, $url, "PING");
        =283=       }
        =284=     } else {
        =285=       print "Skipping $url\n" if $VERBOSE;
        =286=       $page->last_status("Skipped");
        =287=       $page->last_checked(0);
        =288=     }
        =289=     $page->state(My::Page::State_done);
        =290=     $page->update;
        =291=   }
        =292=   
        =293=   sub parse_or_ping {
        =294=     my $page = shift;             # My::Page
        =295=     my $url = shift;              # URI
        =296=     my $kind = shift;             # "PARSE" or "PING"
        =297=   
        =298=     use HTML::LinkExtor;
        =299=   
        =300=     ## create the request
        =301=     my $request = HTTP::Request->new(GET => "$url");
        =302=     $request->if_modified_since($page->last_modified) if $page->last_modified;
        =303=   
        =304=     ## fetch the response
        =305=     my $content;
        =306=     my $content_type;
        =307=     my $res = fetch
        =308=       ($request,
        =309=        sub {
        =310=          my ($data, $response, $protocol) = @_;
        =311=          unless ($content_type) {
        =312=            if ($content_type = $response->content_type) {
        =313=              if ($kind eq "PING") {
        =314=                print "aborting $url for ping\n" if $VERBOSE > 1;
        =315=                die "ping only";
        =316=              }
        =317=              if ($content_type ne "text/html") {
        =318=                print "aborting $url for $content_type\n" if $VERBOSE > 1;
        =319=                die "content type is $content_type";
        =320=              }
        =321=            }
        =322=          }
        =323=          $content .= $data;
        =324=          if ($MAXSIZE and length $content > $MAXSIZE) {
        =325=            print "aborting $url for content length\n" if $VERBOSE > 1;
        =326=            die "content length is ", length $content;
        =327=          }
        =328=        }, 8192);
        =329=     $res->content($content);      # stuff what we got
        =330=   
        =331=     ## analyze the results
        =332=     if ($res->is_success) {
        =333=       my $now = time;
        =334=       $page->last_checked($now);
        =335=       $page->last_good($now);
        =336=       $page->last_modified($res->last_modified || $res->date);
        =337=       $_->delete for $page->outbound; # delete any existing stale links
        =338=   
        =339=       if ($content_type eq "text/html") {
        =340=         if ($kind eq "PARSE") {
        =341=           print "$url: parsed\n" if $VERBOSE;
        =342=           $page->last_status("Verified and parsed");
        =343=           my %seen;
        =344=           HTML::LinkExtor->new
        =345=               (sub {
        =346=                  my ($tag, %attr) = @_;
        =347=                  $seen{$_}++ or add_link($page, $_) for values %attr;
        =348=                }, $res->base)->parse($res->content);
        =349=         } else {                  # presume $kind = PING
        =350=           print "$url: good ping\n" if $VERBOSE;
        =351=           $page->last_status("Verified (contents not examined)");
        =352=         }
        =353=       } else {
        =354=         print "$url: content = $content_type\n" if $VERBOSE;
        =355=         $page->last_status("Verified (content = $content_type)");
        =356=       }
        =357=     } elsif ($res->code == 304) { # not modified
        =358=       print "$url: not modified\n" if $VERBOSE;
        =359=       my $now = time;
        =360=       $page->last_checked($now);
        =361=       $page->last_good($now);
        =362=       ## reuse existing links
        =363=       My::Page->create_or_make_todo($_->dst->location) for $page->outbound;
        =364=     } elsif ($res->is_redirect) {
        =365=       my $location = $res->header("Location");
        =366=       print "$url: redirect to $location\n" if $VERBOSE;
        =367=       $_->delete for $page->outbound; # delete any existing stale links
        =368=       add_link($page, $location, $res->base) if $FOLLOW_REDIRECT;
        =369=       $page->last_status("Redirect (status = ".$res->code.") to $location");
        =370=       $page->last_checked(time);
        =371=     } else {
        =372=       print "$url: not verified: ", $res->code, "\n" if $VERBOSE;
        =373=       $_->delete for $page->outbound; # delete any existing stale links
        =374=       $page->last_status("NOT Verified (status = ".($res->code).")");
        =375=       $page->last_checked(time);
        =376=     }
        =377=     $page->update;
        =378=   }
        =379=   
        =380=   sub add_link {
        =381=     my $page = shift;             # My::Page
        =382=     my $url_string = shift;       # string
        =383=     my $base = shift;             # maybe undef
        =384=   
        =385=     my $url = $base
        =386=       ? URI->new_abs($url_string, URI->new($base))
        =387=         : URI->new($url_string);
        =388=     $url->fragment(undef);        # blow away any fragment
        =389=     $url = HACK_URL($url);
        =390=     return if PARSE($url) < 0;    # skip any links to non-xref pages
        =391=     print "saw $url\n" if $VERBOSE > 1;
        =392=   
        =393=     my $newpage = My::Page->create_or_make_todo("$url");
        =394=     ## the following might die if there's already one link there
        =395=     eval { My::Link->create({src => $page, dst => $newpage}) };
        =396=     die $@ if $@ and not $@ =~ /uniqueness constraint/;
        =397=   }
        =398=   
        =399=   sub kids_do {
        =400=     my $code_task = shift;
        =401=     my $code_count = shift;
        =402=   
        =403=     use POSIX qw(WNOHANG);
        =404=   
        =405=     my %kids;
        =406=   
        =407=     while (keys %kids or $code_count->()) {
        =408=       ## reap kids
        =409=       while ((my $kid = waitpid(-1, WNOHANG)) > 0) {
        =410=         ## warn "$kid reaped";    # trace
        =411=         delete $kids{$kid};
        =412=       }
        =413=       ## verify live kids
        =414=       for my $kid (keys %kids) {
        =415=         next if kill 0, $kid;
        =416=         warn "*** $kid found missing ***"; # shouldn't happen
        =417=         delete $kids{$kid};
        =418=       }
        =419=       ## launch kids
        =420=       if (keys %kids < $KIDMAX
        =421=           and keys %kids < $code_count->()) {
        =422=         ## warn "forking a kid";  # trace
        =423=         my $kid = fork;
        =424=         if (defined $kid) {       # good parent or child
        =425=           if ($kid) {             # parent
        =426=             $kids{$kid} = 1;
        =427=           } else {
        =428=             $code_task->();       # the real task
        =429=             exit 0;
        =430=           }
        =431=         } else {
        =432=           warn "cannot fork: $!"; # hopefully temporary
        =433=           sleep 1;
        =434=           next;                   # outer loop
        =435=         }
        =436=       }
        =437=       print "[", scalar keys %kids, " kids]\n" if $VERBOSE;
        =438=       sleep 1;
        =439=     }
        =440=   }
        =441=   
        =442=   sub URI::mailto::host { ""; }   # workaround bug in LWP
        =443=   sub URI::mailto::authority { ""; } # workaround bug in LWP

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.

Worldwide training and consulting by Perl experts

Copyright Notice

Linux Magazine Column 52 (Oct 2003)

Listing