Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 92 (Apr 2007)

[suggested title: ``HTML Scraping with XPath'']

Many useful web services have been popping up lately. You can get the weather in Tokyo, the GPS coordinates of 1600 Pennsylvania Avenue, the top 100 pages that contain images of puppies, the stock price of Apple Inc, sales information for my Learning Perl, and a bizarre variety of other bits of useful free information from the major providers, such as Yahoo, Google, and Amazon. In the early days of the web, you had to dig into the HTML response after simulating a user typing into a search box, but the modern web is simply a SOAP, XML, or JSON request away from your answer.

Most of the web content being delivered is still intended to be rendered in a browser, and the providers of that content have not included a nice web service API for automated extraction. Thus, the ability to create a web scraper to extract useful bits from browser-bound content is in strong demand. I've addressed this issue a couple of times in the past in this column: see [LM Jun 03 - ``Wrong parser for the right reasons''] and [LM Jan 04 - ``Using xsh to scrape web pages'']. However, some recent developments are worth noting.

One of the cool things to come out of the XML camp is the idea of an XPath specification to identify a set of nodes that meet a particular condition. Because the hardest part of web scraping is identifying the location of the interesting data in a (possibly changing) web page, an XPath expression becomes a very convenient way to say ``the first table that has a heading of 'price'''. Let's look at a couple of tools that use XPath to extract some useful information from HTML.
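
For example, that English description might translate into an XPath expression like this (a sketch, assuming the heading is a th cell):

  (//table[.//th[contains(., 'price')]])[1]

The outer parentheses force ``the first such table in the whole document'' rather than ``the first such table among its siblings''.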

The task at hand is a practical and familiar one, if you've ever had to install a module from a CPAN mirror before. When you use the CPAN or CPANPLUS shell, you're asked to choose a nearby mirror from a list of mirrors in your area. But which one to choose? You want one that has been kept up to date, hopefully hourly, so you're sure to be getting the latest releases.

Thankfully, a live ``mirror monitoring'' page has been put up at http://www.cs.uu.nl/people/henkp/mirmon/cpan.html. At a glance, we can see all of the mirrors broken down by country, and how recently they've been refreshed. Ideally, we want something aged less than a day, perhaps even within the past two hours.

But for long lists of mirrors (such as for the United States), it's tedious to look up and down amongst the refresh times to find the shortest value. It would be nice if we could have some tool that would fetch the page, pull out the refresh values for our selected country, normalize them all to hours, and then sort from shortest to longest. So let's write something that does that.

Looking at the HTML source for this page reveals a nice regularity (an accidental bonus!). A given country in the table (say, ``de'' for Germany) starts with:

  <TR ...><TH ...><A NAME="de" ...>

and ends with the next row that looks like:

  <TR ...><TH ...><A ...>

or perhaps the end of the table (for the United States, which is last). Here, I'm using ``...'' to indicate other uninteresting attributes. Between these two lines, I want to get the URL of the CPAN mirror, and the refresh age, which can be extracted from:

  <TR ...>
    <TD ...>...<A HREF="URL HERE">...</A>...</TD>
    <TD ...>...</TD>
    <TD ...>REFRESH AGE HERE</TD>
    ...

As long as I keep the specification to a minimum, this template should survive minor variations as the page is updated.

The refresh age can be a floating point value in minutes, hours, or days, so I have to do some normalization there, but that's not hard: for example, ``1.5 days'' becomes 36 hours, and ``30 minutes'' becomes 0.5 hours. Once I have normalized values, a simple sort-by-value display will give me the information I want.

So, now that we have a task, let's look at two different approaches, one using xsh, and the other using HTML::TreeBuilder::XPath. (You'll see that the code is nearly identical, although the underlying engines are completely different, and this has an important impact, which I will reveal at the end.)

My favorite tool for wrangling XML is the wonderful XSH language. Recently, XSH has rolled (incompatibly, sorry) from version 1 to version 2, and boasts a far better integration of Perl and XSH, as well as inline and Module::Compile support. (That last item is interesting, because you can write a library using XSH, compile it once, and reuse the compiled version without having to reparse the XSH source on each hit. But, that's a topic for another column.)

Using the interactive mode of the xsh command, I was able to find the XPath expressions for the data in which I was interested. The specification for the tr node that begins Germany is:

  //tr[th/a/@name = 'de']

This says ``find a tr, that has immediately beneath it a th, and immediately beneath that an a, that has an attribute of name with a value of de''. Notice that I'm not concerned when writing this XPath about any other attributes or even whether de is in single or double quotes (if quoted at all) -- far better than doing this with a regex.
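
If you want to sanity-check an expression like this outside of xsh, the same libxml2 engine is only a few lines of Perl away. Here's a minimal sketch, assuming a local copy of the page saved as cpan.html:

  use XML::LibXML;

  # parse the saved copy with libxml2's forgiving HTML parser
  my $dom = XML::LibXML->new->parse_html_file('cpan.html');

  # count the rows that mark the start of Germany's section
  my @hits = $dom->findnodes(q{//tr[th/a/@name = 'de']});
  print scalar(@hits), " matching row(s)\n";

For this page, that should report exactly one matching row.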

I then struggled to figure out how to say ``all the following tr nodes until either the next thing like this, or the end of the table''. After about 20 minutes of scratching my head, I gave up, and decided to handle it iteratively with a condition in my loop. So, I create a nodeset with:

  //tr[th/a/@name = 'de']/following-sibling::*

which says give me all the following nodes at the same ``level'' (hopefully tr, although maybe I should have specified that).
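
Had I wanted to be strict about it, the axis step can name the element type, too; this variant (a sketch, untested against the live page) keeps only the table rows:

  //tr[th/a/@name = 'de']/following-sibling::tr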

Inside my loop, I check:

  th/a/@name

and if it exists, I stop, because I've hit the next country. Otherwise, I extract the text of:

  td[1]/a[1]/@href

for the URL link (there are multiple a nodes there), and the text of:

  td[3]

for the refresh age.

Once I had the XPath to get to the pieces, I was able to cobble up my command-line xsh program, and the result is in [listing 1, below].

Line 1 begins with an invocation of env, which finds the xsh tool in my path, and invokes the script using xsh. (This is a feature that didn't work with XSH version 1, and I'm happy the developer accepted my suggestion to make it work.)

Line 2 turns off the diagnostic messages normally associated with operations like reading a document.

Line 4 is commented out, but would have opened a local file named cpan.html in HTML mode. I used this for testing: rather than hitting the remote server repeatedly (which wouldn't have worked at 30,000 feet where I wrote most of this code), I had a local cache.

Line 5 fetches the given URL, and parses it as the default document for the remaining operations. If I had wanted to switch back and forth between multiple documents, I would have saved this as:

  $cpan := open --format html 'cpan.html';

which would then not only open the document, but also store a reference to it in the $cpan variable.
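
With the handle stored in $cpan, later commands can then anchor their XPath in that document explicitly, even when some other document is current. A sketch of listing Germany's marker row that way:

  ls $cpan//tr[th/a/@name = 'de'];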

Lines 7 through 12 gather the data from the document. The for loop beginning on line 7 creates the nodeset, and each element of the nodeset in turn becomes the context node for the body of the loop. This is a very natural thing to do, because all of the XPath inside the loop is then relative to that current context.

Line 8 aborts the loop if we're looking at the next country already. Note that for the United States, this condition never triggers, because it's the last country in the list. That's why I tested with Germany instead.

Lines 9 and 10 extract the stringy value of the two XPath expressions into the variables $url and $time. These are also scalar variables in the XML::XSH2::Map package namespace, which is handy for the embedded Perl code coming up.

Line 11 uses the values of the two variables to add elements to an ordinary Perl hash called %XML::XSH2::Map::time. Notice how the embedded Perl code flowed very naturally. (Had I been able to specify all of the related nodes in one XPath expression, I could have used the XSH hash command, but that would have also made this program even more different from the Perl version coming up soon.)
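
If that sharing seems magical, here's a tiny standalone sketch: a scalar assigned in XSH (the right-hand side is just an XPath string literal) is immediately visible to an embedded Perl block, because both live in the XML::XSH2::Map package:

  $greeting = 'hello';
  perl { print "$greeting world\n" };   # prints "hello world"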

Once I have all of the data nodes processed, I just have to normalize the data, then sort and print the results. The large Perl block between lines 14 and 25 takes care of that for me.

Lines 15 to 20 normalize the refresh age by trying to interpret it as days, hours, or minutes, applying the appropriate calculation to express the age in hours.

Lines 22 to 24 are a normal ``sort by value'' loop to dump the URLs by their increasing ages. And that's all there is to our program!

Running this, we get nice output that right at the moment looks like:

  ftp://ftp.uni-erlangen.de/pub/source/CPAN/ => 0.4
  http://netmirror.org/mirror/CPAN/ => 1
  ftp://cpan.mirror.iphh.net/pub/CPAN/ => 1
  http://www.chemmedia.de/mirrors/CPAN/ => 2
  ftp://ftp.fu-berlin.de/unix/languages/perl/ => 2
  [... middle items omitted for space ...]
  http://www.cpan.mirroarrr.de/ => 672
  http://cpan.linux-mirror.org/ => 2224.8

Yes, the newest item was refreshed within the hour, but the oldest hasn't been updated in nearly a quarter of a year. Woe be to those who have selected that mirror accidentally!

Once we have this information, we can hop into our CPAN/CPANPLUS shell and reconfigure it to use the most recent mirrors, and all is good.
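
In the CPAN shell, for example, that reconfiguration is just a couple of commands, here using the freshest mirror from the run above:

  cpan> o conf urllist unshift ftp://ftp.uni-erlangen.de/pub/source/CPAN/
  cpan> o conf commit

The unshift puts the new mirror at the front of the search list, and the commit makes the change stick across sessions.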

Now, XSH is based on XML::LibXML, which is amazingly fast at parsing XML and relatively sane HTML. On my machine, this program runs from the command line in less than a CPU second, which is tolerably fast. Let's look at a similar application written using HTML::TreeBuilder::XPath, in [listing two, below], and compare speeds when we're done.

Line 1 here selects my local perl command. While I could have just put in the proper path, I wanted to keep the programs as parallel as possible.

Lines 3 and 4 bring in the needed CPAN modules. Unlike XML::LibXML, the HTML::Parser module (on which TreeBuilder is layered) doesn't have built-in HTTP fetching, so I have to use the separate LWP library for that.

Line 6 creates the parsing object. Line 7 or 8 grabs the document source, either locally or remotely, and parses it, resulting in a document tree.

The rest of the code is also very similar to the XSH version. Lines 11 to 16 identify and process the nodeset. Notice that the XPath expressions are identical, although we now have to make explicit method calls against the original document and against the resulting current node.

Lines 18 to 27 dump the result as before, using Perl code identical to the XSH version.

And if I run this program, I get identical output, yay! But if I compare times, I see something quite interesting:

  % time ./xsh-version >/dev/null
  ./xsh-version > /dev/null  0.60s user 0.06s system 21% cpu 3.013 total
  % time ./htx-version >/dev/null
  ./htx-version > /dev/null  4.28s user 0.05s system 64% cpu 6.746 total

Wow, look at that. The XSH version (even counting the compilation of the XSH language into Perl before running) takes about one seventh the time of the HTX (HTML::TreeBuilder::XPath) version. I was surprised at the difference, so I did a bit of exploration and found that the HTX version was spending almost all of its time building the huge Perl data structure to represent the DOM. Because XSH doesn't need to do that (the DOM lives in C-side data structures), we get a tremendous savings in time, not to mention quicker queries later.

The wall-clock difference is somewhat masked by the amount of time it takes to fetch the remote page, so the results are even more dramatic when fetching local cached copies.
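
If you want to see that breakdown for yourself, a rough sketch is to time just the parse step against the cached copy:

  use Time::HiRes qw(time);
  use HTML::TreeBuilder::XPath;

  my $start = time;
  my $tree = HTML::TreeBuilder::XPath->new;
  $tree->parse_file('cpan.html');   # this is where the Perl-side DOM gets built
  printf "parse took %.2f seconds\n", time - $start;

Nearly all of the HTX version's CPU time shows up in that one parse_file call.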

I hope you're inspired to consider XSH for your web-scraping needs, and to learn a bit about XPath to identify nodes of interest. Until next time, enjoy!

Listings

        =0=     #################### LISTING ONE ####################
        =1=     #!/usr/bin/env xsh
        =2=     quiet;
        =3=     
        =4=     # open --format html 'cpan.html';
        =5=     open --format html 'http://www.cs.uu.nl/people/henkp/mirmon/cpan.html';
        =6=     
        =7=     for //tr[th/a/@name = 'de']/following-sibling::* {
        =8=       if th/a/@name last;
        =9=       $url = (td[1]/a[1]/@href);
        =10=      $time = (td[3]);
        =11=      perl { $time{$url} = $time };
        =12=    }
        =13=    
        =14=    perl {
        =15=      for (values %time) {
        =16=        s{([\d.]+) days?}{$1 * 24}e
        =17=          or s{([\d.]+) hours?}{$1}
        =18=            or s{([\d.]+) minutes?}{sprintf "%.1f", $1 / 60}e
        =19=              or die "Don't understand $_";
        =20=      }
        =21=    
        =22=      for (sort { $time{$a} <=> $time{$b} } keys %time) {
        =23=        print "$_ => $time{$_}\n";
        =24=      }
        =25=    }
        =0=     #################### LISTING TWO ####################
        =1=     #!/usr/bin/env perl
        =2=     
        =3=     use HTML::TreeBuilder::XPath;
        =4=     use LWP::Simple;
        =5=     
        =6=     my $tree = HTML::TreeBuilder::XPath->new;
        =7=     # $tree->parse_file('cpan.html');
        =8=     $tree->parse_content(get 'http://www.cs.uu.nl/people/henkp/mirmon/cpan.html');
        =9=     
        =10=    my %time;
        =11=    for ($tree->findnodes(q{//tr[th/a/@name = 'de']/following-sibling::*})) {
        =12=      last if $_->exists(q{th/a/@name});
        =13=      my $url = $_->findvalue(q{td[1]/a[1]/@href});
        =14=      my $time = $_->findvalue(q{td[3]});
        =15=      $time{"$url"} = "$time";      # ensure stringy
        =16=    }
        =17=    
        =18=    for (values %time) {
        =19=      s{([\d.]+) days?}{$1 * 24}e
        =20=        or s{([\d.]+) hours?}{$1}
        =21=          or s{([\d.]+) minutes?}{sprintf "%.1f", $1 / 60}e
        =22=            or die "Don't understand $_";
        =23=    }
        =24=    
        =25=    for (sort { $time{$a} <=> $time{$b} } keys %time) {
        =26=      print "$_ => $time{$_}\n";
        =27=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.