Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Unix Review Column 56 (Jan 2005)

[suggested title: ``Checking your bookmarks'']

Like most people, I've bookmarked about a third of the known internet by now. Of course, sites go away, and URLs become invalid, so some of my lesser-used bookmarks are pointing off into 404-land.

Some browsers have an option to periodically revalidate bookmarks. My favorite browser lacks such a feature, but does include the ability to export an HTML file of all the bookmarks, and reimport a similar file in a way that can be easily merged back into my existing bookmark setup. So, I thought I'd take a whack at a Perl-based bookmark validator, especially one that worked in parallel so that I could get through my bookmark list fairly quickly. And the result is in [listing one, below].

Lines 1 through 3 declare the program as a Perl program, and turn on the compiler restrictions and warnings as good programming practice.

Lines 5 through 7 pull in three modules that are found in the CPAN. The HTML::Parser module enables my program to cleanly parse HTML with all its intricacies. The LWP::Parallel::UserAgent module provides a means to fetch many web pages at once. And finally, HTTP::Request::Common sets up an HTTP::Request object so that I can fetch it with the user agent.

Lines 9 and 10 set up the user interface for this program. I can use the program as a filter:

  ./this_program <Bookmarks.html >NewBookmarks.html

or as an in-place editor:

  ./this_program Bookmarks.html

As an in-place editor, the Bookmarks.html file will be renamed to Bookmarks.html~ (with an appended tilde), and the new version will appear at the original name.
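
The in-place magic comes from Perl's $^I variable (set in line 9), which is the programmatic equivalent of the -i command-line switch. As a minimal sketch of that mechanism on its own, using a made-up notes.txt:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Minimal sketch of $^I-driven in-place editing. notes.txt is a
  # made-up name; the edited result replaces it, and the untouched
  # original is kept as notes.txt~.
  $^I = "~";
  @ARGV = ("notes.txt");
  while (<>) {
    s/teh/the/g;   # some trivial per-line edit
    print;         # goes to the replacement file, not STDOUT
  }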

Lines 11 to 19 edit each file (usually just one) in turn, or the standard input as one file. Line 12 slurps the entire file into $_. Two passes are performed over the HTML text: the first pass in line 14 finds the existing links, and the second pass in line 18 edits the HTML with additional DEAD - text for links that were found broken. In between, we'll check the validity of the discovered URLs, in line 16. This is our entire top-level code, using named subroutines to clearly delineate the various phases and couplings of this program. I find it helpful to break a program down this way.
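
That slurp works by locally undefining $/, the input record separator, so that a single read returns everything at once. The same idiom works on any ordinary filehandle; a quick sketch with a placeholder filename:

  # Slurp an entire file into a scalar by temporarily undefining $/.
  # (file.html is just a placeholder name.)
  open my $fh, "<", "file.html" or die "cannot open file.html: $!";
  my $html = do { local $/; <$fh> };
  close $fh;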

Let's look at how the links are found, in the subroutine beginning in line 21. First, we'll accept the input parameter in line 22. Then, we'll create a staging variable for the return value in line 24.

Lines 26 to 34 create an HTML::Parser object. Creating a parser object is an art form, because there are so many buttons and dials and levers on the instantiation and later reconfiguration of the parser. My usual trick is to find a similar example and then modify it until it does what I want.

In this case, we want to be notified of all start tags, so we'll define a start handler (line 28) consisting of an anonymous subroutine (lines 29 to 32) and a description of the parameters that will be sent to the subroutine (line 33). We're asking for the tagname (like ``a''), and the attribute hash as the only two parameters. We extract these parameters in line 30.

Line 31 ignores any start tag that isn't an a tag with an href attribute, which skips over local anchors and anything else more bizarre. Line 32 creates an element in the hash with the key being the same as the URL. The value is unimportant at this point, although we check to see if the value is DEAD later, so that would be a bad value for an initialization.

Once the parser is created, we'll tell it to parse a string, and then finish up, in lines 36 and 37. When start tags are seen, the requested callback is invoked, populating the %urls hash at the appropriate time. At the end of the input string, we'll return a reference to that populated hash so that the caller has some data to manipulate.
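
If you want to play with this phase by itself, a stripped-down standalone version of the same technique might look like the following (the HTML string is obviously made up):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use HTML::Parser;

  # Collect the href of every <a> start tag into %urls.
  my %urls;
  my $p = HTML::Parser->new(
    start_h => [sub {
      my ($tagname, $attr) = @_;
      return unless $tagname eq "a" and defined $attr->{href};
      $urls{$attr->{href}} = "";
    }, "tagname, attr"],
  );
  $p->parse('<p>See <a href="http://www.example.com/">this page</a>.</p>');
  $p->eof;
  print "$_\n" for sort keys %urls;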

The validate_links routine (beginning in line 42) is really the heart of this program, because we'll now take the list of URLs (the keys of the hash in line 43) and verify that they are still dot-com, not dot-bomb.

Line 45 creates the parallel user agent object. This object is a virtual browser with the ability to fetch multiple URLs at once (default 5). The max_size value says that we don't need to see anything past the first byte of the response, so we can stop when the first ``chunk'' of text has been read from the remote server. (This is actually a feature of LWP::UserAgent, from which LWP::Parallel::UserAgent inherits.)
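
To see that inherited max_size behavior on its own, here's a rough sketch with a plain LWP::UserAgent (the URL is just a placeholder); if I'm reading the LWP documentation correctly, a truncated response also gets a Client-Aborted header:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use LWP::UserAgent;

  # Rough sketch: give up on the body after the first chunk, since a
  # partial fetch is enough to tell whether the URL still answers.
  my $ua  = LWP::UserAgent->new(max_size => 1, timeout => 30);
  my $res = $ua->get("http://www.example.com/");
  print $res->is_success ? "LIVE" : "DEAD", ": http://www.example.com/\n";
  print "(response was truncated by max_size)\n" if $res->header("Client-Aborted");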

Lines 47 to 49 set up the list of URLs that the user agent will fetch once activated. We'll just grab the keys (efficiently) from the hash referenced by $urls, and call the register method of the user agent with an HTTP::Request object that GETs the corresponding URL.

Line 51 is where our program will spend most of the ``real'' time. The wait method call tells the user agent to do its job, waiting at most 30 seconds for each connection and response. The result of the wait method is a hashref whose values are LWP::Parallel::UserAgent::Entry objects representing the result of attempting to fetch each page. Calling request on these objects (as in line 52) gives us the original request, while the response method (as in line 53) gives us the corresponding response. We extract the original URL and its success status into a couple of variables, and then update the hash referenced by $urls with a LIVE/DEAD code in line 54, also logging each result to STDERR for information purposes.
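
If you'd like more detail than a bare LIVE/DEAD verdict, the HTTP::Response inside each entry also knows its status line (like 404 Not Found). A hedged variation on lines 51 through 55, assuming the same $pua and $urls variables from the listing:

  # Chattier version of the result loop: log the HTTP status line
  # alongside each URL while recording LIVE/DEAD as before.
  for my $entry (values %{$pua->wait(30)}) {
    my $url      = $entry->request->url;
    my $response = $entry->response;
    $urls->{$url} = $response->is_success ? "LIVE" : "DEAD";
    warn sprintf "%s: %s (%s)\n", $urls->{$url}, $url, $response->status_line;
  }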

Once we have a hash mapping each URL to a LIVE/DEAD code, it's time to patch up the original file, marking all dead links with a prefix of DEAD -, using the rewrite_html routine beginning in line 60.

Lines 61 and 62 capture the incoming parameters: the original HTML text, and the reference to the hash of the URLs and their status.

Line 64 sets up a $dead flag. If we see a start tag that begins a link to a dead page, we'll set that flag true, and then update the first following text to include our DEAD - prefix, resetting the variable as needed.

Lines 66 to 87 set up a new HTML::Parser object. This one is a bit more complex than the previous one, because we have to watch for link start tags, the text of links, and copy everything else through.

As before, a start handler is enabled, starting in line 68. Because we're now echoing the input text, we'll ask for the original tag text as one of the parameters, which gets printed in line 74.

Lines 71 to 73 determine if the current tag is indeed a dead link. If so, line 72 sets $dead to 1.

Line 76 defines a text handler, called as the parser recognizes the text of the HTML document. If we see some text, and our $dead flag is set, we'll prefix the existing text with DEAD - and reset the $dead flag. If the text already contains the DEAD - marker, we'll leave it alone, so that we don't keep adding prefixes each time the program is run over the same bookmarks file. The original or altered text is then printed in line 83.

Lines 85 and 86 define a ``default'' handler, called for everything else that isn't a start tag or ordinary text, such as end tags, comments, processing instructions, and so on. Here, we're just passing through everything we don't otherwise care about.
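
Taken together, a start handler, a text handler, and a default handler that each echo their original text form the usual recipe for an HTML ``identity filter'': the document passes through unchanged except where a handler deliberately intervenes. A bare-bones sketch of just that skeleton, reading from standard input:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use HTML::Parser;

  # Identity filter skeleton: every event simply reprints its original
  # text, so the output matches the input until a handler is taught
  # to change something.
  my $p = HTML::Parser->new(
    start_h   => [sub { print shift }, "text"],
    text_h    => [sub { print shift }, "text"],
    default_h => [sub { print shift }, "text"],
  );
  my $html = do { local $/; <STDIN> };
  $p->parse($html) if defined $html;
  $p->eof;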

Lines 89 and 90 cause the incoming HTML to be parsed, resulting in the majority of the text being passed unmodified to the default output handle, except for the dead links which will have been appropriately altered.

And that's all there is! I save the current bookmarks into a file, run the program, wait until it completes, and then I reimport the modified HTML file as my new bookmarks. And now my bookmarks are all fresh and shiny new. Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl
        =2=     use strict;
        =3=     use warnings;
        =4=     
        =5=     use HTML::Parser;
        =6=     use LWP::Parallel::UserAgent;
        =7=     use HTTP::Request::Common;
        =8=     
        =9=     $^I = "~";
        =10=    @ARGV = "-" unless @ARGV;       # act as filter if no names specified
        =11=    while (@ARGV) {
        =12=      $_ = do { local $/; <> };
        =13=    
        =14=      my $urls = extract_links($_);
        =15=    
        =16=      validate_links($urls);
        =17=    
        =18=      rewrite_html($_, $urls);
        =19=    }
        =20=    
        =21=    sub extract_links {
        =22=      my $html = shift;
        =23=    
        =24=      my %urls;
        =25=    
        =26=      my $p = HTML::Parser->new
        =27=        (
        =28=         start_h =>
        =29=         [sub {
        =30=            my ($tagname, $attr) = @_;
        =31=            return unless $tagname eq "a" and my $href = $attr->{href};
        =32=            $urls{$href} = "";
        =33=          }, "tagname, attr"],
        =34=        ) or die;
        =35=    
        =36=      $p->parse($html);
        =37=      $p->eof;
        =38=    
        =39=      return \%urls;
        =40=    }
        =41=    
        =42=    sub validate_links {
        =43=      my $urls = shift;             # hashref
        =44=    
        =45=      my $pua = LWP::Parallel::UserAgent->new(max_size => 1);
        =46=    
        =47=      while (my ($url) = each %$urls) {
        =48=        $pua->register(GET $url);
        =49=      }
        =50=    
        =51=      for my $entry (values %{$pua->wait(30)}) {
        =52=        my $url = $entry->request->url;
        =53=        my $success = $entry->response->is_success;
        =54=        warn +($urls->{$url} = $success ? "LIVE" : "DEAD"), ": $url\n";
        =55=      }
        =56=    
        =57=      # return void
        =58=    }
        =59=    
        =60=    sub rewrite_html {
        =61=      my $html = shift;
        =62=      my $urls = shift;             # hashref
        =63=    
        =64=      my $dead = 0;                 # mark the next text as "DEAD -"
        =65=    
        =66=      my $p = HTML::Parser->new
        =67=        (
        =68=         start_h =>
        =69=         [sub {
        =70=            my ($text, $tagname, $attr) = @_;
        =71=            if ($tagname eq "a" and my $href = $attr->{href}) {
        =72=              $dead = 1 if $urls->{$href} eq "DEAD";
        =73=            }
        =74=            print $text;
        =75=          }, "text, tagname, attr"],
        =76=         text_h =>
        =77=         [sub {
        =78=            my ($text) = @_;
        =79=            if ($dead) {
        =80=              $text = "DEAD - $text" unless $text =~ /DEAD -/;
        =81=              $dead = 0;
        =82=            }
        =83=            print $text;
        =84=          }, "text"],
        =85=         default_h =>
        =86=         [sub { print shift }, 'text'],
        =87=        ) or die;
        =88=    
        =89=      $p->parse($html);
        =90=      $p->eof;
        =91=      # return void
        =92=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.