Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Unix Review Column 56 (Jan 2005)
[suggested title: ``Checking your bookmarks'']
Like most people, I've bookmarked about a third of the known internet by now. Of course, sites go away, and URLs become invalid, so some of my lesser-used bookmarks are pointing off into 404-land.
Some browsers have an option to periodically revalidate bookmarks. My favorite browser lacks such a feature, but does include the ability to export an HTML file of all the bookmarks, and reimport a similar file in a way that can be easily merged back into my existing bookmark setup. So, I thought I'd take a whack at a Perl-based bookmark validator, especially one that worked in parallel so that I could get through my bookmark list fairly quickly. And the result is in [listing one, below].
Lines 1 through 3 declare the program as a Perl program, and turn on the compiler restrictions and warnings as good programming practice.
Lines 5 through 7 pull in three modules that are found in the CPAN. The HTML::Parser module enables my program to cleanly parse HTML with all its intricacies. The LWP::Parallel::UserAgent module provides a means to fetch many web pages at once. And finally, HTTP::Request::Common sets up an HTTP::Request object so that I can fetch it with the user agent.
Lines 9 and 10 set up the user interface for this program. I can use the program as a filter:
./this_program <Bookmarks.html >NewBookmarks.html
or as an in-place editor:
./this_program Bookmarks.html
As an in-place editor, the Bookmarks.html file will be renamed to Bookmarks.html~ (with an appended tilde), and the new version will appear at the original name.
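Since the listing leans on Perl's in-place editing magic, here's a minimal, self-contained sketch of that mechanism alone (the scratch file name and its contents are made up for the demonstration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create a hypothetical scratch file to edit.
my $file = "/tmp/bookmarks-demo.html";
open my $out, ">", $file or die "create: $!";
print $out qq{<a href="http://example.com/">example</a>\n};
close $out;

# Setting $^I makes the <> loop rename each file in @ARGV with the
# given suffix, then reopen the original name to receive our prints.
$^I = "~";
@ARGV = ($file);
while (<>) {
  s/example</EXAMPLE</;  # whatever we print replaces the file's contents
  print;
}
# The untouched original now survives as "$file~".
```

The same dance happens in the listing's main loop, except that the whole file is slurped and rewritten in one pass rather than line by line.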
Lines 11 to 19 edit each file (usually just one) in turn, or the standard input as one file. Line 12 slurps the entire file into $_. Two passes are performed over the HTML text: the first pass in line 14 finds the existing links, and the second pass in line 18 edits the HTML, adding DEAD - text for links that were found broken. In between, we check the validity of the discovered URLs, in line 16. This is our entire top-level code, using named subroutines to clearly delineate the various phases and couplings of this program. I find it helpful to break a program down this way.
Let's look at how the links are found, in the subroutine beginning in line 21. First, we'll accept the input parameter in line 22. Then, we'll create a staging variable for the return value in line 24.
Lines 26 to 34 create an HTML::Parser object. Creating a parser object is an art form, because there are so many buttons and dials and levers on the instantiation and later reconfiguration of the parser. My usual trick is to find a similar example and then modify it until it does what I want.
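For readers new to HTML::Parser, here's a stripped-down sketch of the same handler style, pared to the bone (the HTML string is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;  # from the CPAN

my @tags;
# The arrayref pairs a callback with an "argspec" string that names
# which values the parser should hand to that callback.
my $p = HTML::Parser->new
  (start_h => [sub { push @tags, shift }, "tagname"]) or die;
$p->parse('<html><body><a href="http://example.com/">hi</a></body></html>');
$p->eof;
print join(" ", @tags), "\n";  # html body a
```

The program in the listing uses exactly this shape, just with a richer argspec and a more interesting callback body.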
In this case, we want to be notified of all start tags, so we'll define a start handler (line 28) consisting of an anonymous subroutine (lines 29 to 32) and a description of the parameters that will be sent to the subroutine (line 33). We're asking for the tagname (like ``a'') and the attribute hash as the only two parameters. We extract these parameters in line 30.
Line 31 ignores any tag that isn't an a tag with an href attribute, which skips over local anchors and anything else more bizarre. Line 32 creates an element in the hash with the key being the same as the URL. The value is unimportant at this point, although we later check to see if the value is DEAD, so that would be a bad choice for an initial value.
Once the parser is created, we'll tell it to parse a string, and then finish up, in lines 36 and 37. When start tags are seen, the requested callback is invoked, populating the %urls hash at the appropriate time. At the end of the input string, we'll return a reference to that populated hash so that the caller has some data to manipulate.
The validate_links routine (beginning in line 42) is really the heart of this program, because we'll now take the list of URLs (the keys of the hash in line 43) and verify that they are still dot-com, not dot-bomb.
Line 45 creates the parallel user agent object. This object is a virtual browser with the ability to fetch multiple URLs at once (default 5). The max_size value says that we don't need to see anything past the first byte of the response, so we can stop when the first ``chunk'' of text has been read from the remote server. (This is actually a feature of LWP::UserAgent, from which LWP::Parallel::UserAgent inherits.)
Lines 47 to 49 set up the list of URLs that the user agent will fetch once activated. We'll just grab the keys (efficiently) from the hash referenced by $urls, and call the register method of the user agent with an HTTP::Request object that GETs the corresponding URL.
Line 51 is where our program will spend most of the ``real'' time. The wait method call tells the user agent to do its job, waiting at most 30 seconds for each connection and response. The result of the wait method is a hashref whose values are LWP::Parallel::UserAgent::Entry objects representing the result of attempting to fetch each page. Calling request on these objects (as in line 52) gives us the original request, while the response method (as in line 53) gives us the corresponding response. We fetch the original URL and its success status into a couple of variables, and then update the hash referenced by $urls with a LIVE/DEAD code in line 54, also logging each result to STDERR for information purposes.
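If LWP::Parallel::UserAgent isn't installed, the same LIVE/DEAD bookkeeping can be sketched serially with plain LWP::UserAgent; slower, since each URL waits its turn, but otherwise equivalent (the URLs here are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;  # from the CPAN

my %urls = map { $_ => "" }
  qw(http://example.com/ http://example.com/no-such-page);

# timeout and max_size mirror the parallel version: give up after
# 30 seconds, and stop reading after the first byte of each response.
my $ua = LWP::UserAgent->new(timeout => 30, max_size => 1);

for my $url (sort keys %urls) {
  my $response = $ua->get($url);
  # Record and log LIVE or DEAD, as in line 54 of the listing.
  warn +($urls{$url} = $response->is_success ? "LIVE" : "DEAD"), ": $url\n";
}
```

With a few hundred bookmarks, the parallel version wins handily, which is why the listing goes to the extra trouble.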
Once we have a hash mapping each URL to a LIVE/DEAD code, it's time to patch up the original file, marking all dead links with a prefix of DEAD - , using the rewrite_html routine beginning in line 60.
Lines 61 and 62 capture the incoming parameters: the original HTML text, and the reference to the hash of the URLs and their status.
Line 64 sets up a $dead flag. If we see a start tag that begins a link to a dead page, we'll set that flag true, and then update the first following text to include our DEAD - prefix, resetting the variable as needed.
Lines 66 to 87 set up a new HTML::Parser object. This one is a bit more complex than the previous one, because we have to watch for link start tags and the text of links, and copy everything else through.
As before, a start handler is enabled, starting in line 68. Because we're now echoing the input text, we'll ask for the original text as one of the parameters, which we print in line 74.
Lines 71 to 73 determine whether the current tag is indeed a dead link. If so, line 72 sets $dead to 1.
Line 76 defines a text handler, called as the parser recognizes the text of the HTML document. If we see some text and our $dead flag is set, we'll prefix the existing text with DEAD - and reset the $dead flag. If the text already has the dead prefix, we'll leave it alone, so that we don't keep prefixing additional text on every run. The original or altered text is then printed in line 83.
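To see that flag dance in isolation, here's a self-contained sketch that collects its output into a string instead of printing (the HTML snippet and URL are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;  # from the CPAN

my %status = ("http://example.com/gone" => "DEAD");
my $dead = 0;
my $out = "";
my $p = HTML::Parser->new
  (start_h => [sub {
     my ($text, $tagname, $attr) = @_;
     # Raise the flag on a link whose target was marked DEAD.
     $dead = 1 if $tagname eq "a" and $attr->{href}
       and ($status{$attr->{href}} || "") eq "DEAD";
     $out .= $text;
   }, "text, tagname, attr"],
   text_h => [sub {
     my ($text) = @_;
     if ($dead) {
       $text = "DEAD - $text" unless $text =~ /DEAD -/;
       $dead = 0;
     }
     $out .= $text;
   }, "text"],
   default_h => [sub { $out .= shift }, "text"],
  ) or die;
$p->parse('<p><a href="http://example.com/gone">old link</a></p>');
$p->eof;
print $out, "\n";
# <p><a href="http://example.com/gone">DEAD - old link</a></p>
```

Only the link text changes; every tag, attribute, and stretch of other text passes through untouched.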
Lines 85 and 86 define a ``default'' handler, called for everything that isn't a start tag or text, such as end tags, comments, processing instructions, and so on. Here, we're just passing through everything we don't otherwise care about.
Lines 89 and 90 cause the incoming HTML to be parsed, resulting in the majority of the text being passed unmodified to the default output handle, except for the dead links, which will have been appropriately altered.
And that's all there is! I save the current bookmarks into a file, run the program, wait until it completes, and then I reimport the modified HTML file as my new bookmarks. And now my bookmarks are all fresh and shiny new. Until next time, enjoy!
Listing
=1=     #!/usr/bin/perl
=2=     use strict;
=3=     use warnings;
=4=
=5=     use HTML::Parser;
=6=     use LWP::Parallel::UserAgent;
=7=     use HTTP::Request::Common;
=8=
=9=     $^I = "~";
=10=    @ARGV = "-" unless @ARGV;   # act as filter if no names specified
=11=    while (@ARGV) {
=12=      $_ = do { local $/; <> };
=13=
=14=      my $urls = extract_links($_);
=15=
=16=      validate_links($urls);
=17=
=18=      rewrite_html($_, $urls);
=19=    }
=20=
=21=    sub extract_links {
=22=      my $html = shift;
=23=
=24=      my %urls;
=25=
=26=      my $p = HTML::Parser->new
=27=        (
=28=         start_h =>
=29=         [sub {
=30=            my ($tagname, $attr) = @_;
=31=            return unless $tagname eq "a" and my $href = $attr->{href};
=32=            $urls{$href} = "";
=33=          }, "tagname, attr"],
=34=        ) or die;
=35=
=36=      $p->parse($html);
=37=      $p->eof;
=38=
=39=      return \%urls;
=40=    }
=41=
=42=    sub validate_links {
=43=      my $urls = shift;             # hashref
=44=
=45=      my $pua = LWP::Parallel::UserAgent->new(max_size => 1);
=46=
=47=      while (my ($url) = each %$urls) {
=48=        $pua->register(GET $url);
=49=      }
=50=
=51=      for my $entry (values %{$pua->wait(30)}) {
=52=        my $url = $entry->request->url;
=53=        my $success = $entry->response->is_success;
=54=        warn +($urls->{$url} = $success ? "LIVE" : "DEAD"), ": $url\n";
=55=      }
=56=
=57=      # return void
=58=    }
=59=
=60=    sub rewrite_html {
=61=      my $html = shift;
=62=      my $urls = shift;             # hashref
=63=
=64=      my $dead = 0;                 # mark the next text as "DEAD -"
=65=
=66=      my $p = HTML::Parser->new
=67=        (
=68=         start_h =>
=69=         [sub {
=70=            my ($text, $tagname, $attr) = @_;
=71=            if ($tagname eq "a" and my $href = $attr->{href}) {
=72=              $dead = 1 if $urls->{$href} eq "DEAD";
=73=            }
=74=            print $text;
=75=          }, "text, tagname, attr"],
=76=         text_h =>
=77=         [sub {
=78=            my ($text) = @_;
=79=            if ($dead) {
=80=              $text = "DEAD - $text" unless $text =~ /DEAD -/;
=81=              $dead = 0;
=82=            }
=83=            print $text;
=84=          }, "text"],
=85=         default_h =>
=86=         [sub { print shift }, 'text'],
=87=        ) or die;
=88=
=89=      $p->parse($html);
=90=      $p->eof;
=91=      # return void
=92=    }