Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Linux Magazine Column 92 (Apr 2007)
[suggested title: ``HTML Scraping with XPath'']
Many useful web services have been popping up lately. You can get the weather in Tokyo, the GPS coordinates of 1600 Pennsylvania Avenue, the top 100 pages that contain images of puppies, the stock price of Apple Inc., sales information for my book Learning Perl, and a bizarre variety of other bits of useful free information from the major providers, such as Yahoo, Google, and Amazon. In the early days of the web, you had to dig into the HTML response after simulating a user typing into a search box, but the modern web is simply a SOAP or XML or JSON value away from your answer.
Most of the web content being delivered is still intended to be rendered in a browser, and the providers of that content have not included a nice web service API for automated extraction. Thus, the ability to create a web scraper to extract useful bits from browser-bound content is in strong demand. I've addressed this issue a couple of times in the past in this column: see [LM Jun 03 - ``Wrong parser for the right reasons''] and [LM Jan 04 - ``Using xsh to scrape web pages'']. However, some recent developments are worth noting.
One of the cool things to come out of the XML camp is the idea of an XPath specification to identify a set of nodes that meet a particular condition. Because the hardest part of web scraping is to identify the location in the (possibly changing) web page, an XPath expression becomes a very convenient way to say ``the first table that has a heading of 'price'''. Let's look at a couple of tools that use XPath to extract some useful information from HTML.
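For instance, that ``first table with a heading of 'price''' condition translates directly into a single XPath expression. Here's a minimal sketch using XML::LibXML; the HTML snippet and its id attributes are made up purely for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical page: two tables, and only the second has a 'price' heading.
my $html = <<'END_HTML';
<html><body>
  <table id="people"><tr><th>name</th><th>age</th></tr></table>
  <table id="prices"><tr><th>item</th><th>price</th></tr></table>
</body></html>
END_HTML

my $doc = XML::LibXML->new->parse_html_string($html);

# "The first table that has a th of 'price'":
my ($table) = $doc->findnodes(q{(//table[.//th = 'price'])[1]});
print $table->getAttribute('id'), "\n";   # prints "prices"
```

The predicate in square brackets filters the node set, and the outer `[1]` keeps only the first match, regardless of how many tables the page grows over time.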
The task at hand is a practical and familiar one, if you've ever had to install a module from a CPAN mirror before. When you use the CPAN or CPANPLUS shell, you're asked to choose a nearby mirror from a list of mirrors in your area. But which one to choose? You want one that has been kept up to date, hopefully hourly, to make sure you're getting the latest things.
Thankfully, a live ``mirror monitoring'' page has been put up at http://www.cs.uu.nl/people/henkp/mirmon/cpan.html. At a glance, we can see all of the mirrors broken down by country, and how recently they've been refreshed. Ideally, we want something aged less than a day, perhaps even within the past two hours.
But for long lists of mirrors (such as for the United States), it's tedious to look up and down amongst the refresh times to find the shortest value. It would be nice if we could have some tool that would fetch the page, pull out the refresh values for our selected country, normalize them all to hours, and then sort from shortest to longest. So let's write something that does that.
Looking at the HTML source for this page reveals a nice regularity (an accidental bonus!). A given country in the table (say, ``de'' for Germany) starts with:
<TR ...><TH ...><A NAME="de" ...>
and ends with the next row that looks like:
<TR ...><TH ...><A ...>
or perhaps the end of the table (for the United States, which is last). Here, I'm using ``...'' to indicate other uninteresting attributes. Between these two lines, I want to get the URL of the CPAN mirror, and the refresh age, which can be extracted from:
<TR ...> <TD ...>...<A HREF="URL HERE">...</A>...</TD> <TD ...>...</TD> <TD ...>REFRESH AGE HERE</TD> ...
As long as I keep the specification to a minimum, this template should survive minor updates to the page layout.
The refresh age can be a floating point value in minutes, hours, or days, so I have to do some normalization with that, but that's not hard. Once I have normalized values, a simple sort by values display will give me the information I want.
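As a sketch of that normalization (the same regex approach used in the listings below, just wrapped in a hypothetical helper sub for illustration), ages like ``1.5 days'' or ``30 minutes'' can all be mapped to hours:

```perl
use strict;
use warnings;

# Convert a refresh-age string ("N days", "N hours", "N minutes")
# into a number of hours. Dies on anything it doesn't recognize.
sub age_in_hours {
    local $_ = shift;
    return $1 * 24                   if /([\d.]+) days?/;
    return $1                        if /([\d.]+) hours?/;
    return sprintf "%.1f", $1 / 60   if /([\d.]+) minutes?/;
    die "Don't understand $_";
}

print age_in_hours('1.5 days'), "\n";     # 36
print age_in_hours('2 hours'), "\n";      # 2
print age_in_hours('30 minutes'), "\n";   # 0.5
```

With every age in the same unit, a simple numeric sort gives the freshest mirror first.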
So, now that we have a task, let's look at two different approaches, one using xsh, and the other using HTML::TreeBuilder::XPath. (You'll see that the code is nearly identical, although the underlying engines are completely different, and this has an important impact, which I will reveal at the end.)
My favorite tool for wrangling XML is the wonderful XSH language. Recently, XSH has rolled (incompatibly, sorry) from version 1 to version 2, and boasts a far better integration of Perl and XSH, as well as inline and Module::Compile support. (That last item is interesting, because you can write a library using XSH, compile it once, and reuse the compiled version without having to reparse the XSH source on each hit. But, that's a topic for another column.)
Using the interactive mode of the xsh command, I was able to find the XPath expressions for the data in which I was interested. The specification for the tr node that begins Germany is:
//tr[th/a/@name = 'de']
This says ``find a tr that has immediately beneath it a th, and immediately beneath that an a, that has an attribute of name with a value of de''. Notice that when writing this XPath, I'm not concerned about any other attributes, or even whether de is in single or double quotes (if quoted at all) -- far better than doing this with a regex.
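To see that robustness in action, the same expression matches the anchor no matter how the attribute is quoted or what other attributes appear. The markup variants below are made up for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

my $xpath = q{//tr[th/a/@name = 'de']};

# Three made-up markup variants: double quotes, single quotes, and
# uppercase tags with extra attributes. One XPath matches them all,
# because the parser normalizes the markup before we ever query it.
for my $variant (
    '<table><tr><th><a name="de">Germany</a></th></tr></table>',
    "<table><tr><th><a name='de'>Germany</a></th></tr></table>",
    '<table><TR CLASS="x"><TH><A NAME="de" ID="g">Germany</A></TH></TR></table>',
) {
    my $doc   = XML::LibXML->new->parse_html_string("<html><body>$variant</body></html>");
    my @nodes = $doc->findnodes($xpath);
    print scalar(@nodes), "\n";   # prints 1 for each variant
}
```

A regex aimed at one of these spellings would silently miss the others; the XPath queries the parsed tree, not the raw bytes.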
I then struggled to figure out how to say ``all the following tr nodes until either the next thing like this, or the end of the table''. After about 20 minutes of scratching my head, I gave up, and decided to handle it iteratively with a condition in my loop. So, I create a nodeset with:
//tr[th/a/@name = 'de']/following-sibling::*
which says give me all the following nodes at the same ``level'' (hopefully tr nodes, although maybe I should have specified that).
Inside my loop, I check:
th/a/@name
and if it exists, I stop, because I've hit the next country. Otherwise, I extract the text of:
td[1]/a[1]/@href
for the URL link (there are multiple a nodes there), and the text of:
td[3]
for the refresh age.
Once I had the XPath to get to the pieces, I was able to cobble up my command-line xsh program, and the result is in [listing 1, below].
Line 1 begins with an invocation of env, which finds the xsh tool in my path, and invokes the script using xsh. (This is a feature that didn't work with XSH version 1, and I'm happy the developer accepted my suggestion to make it work.)
Line 2 turns off the diagnostic messages normally associated with operations like reading a document.
Line 4 is commented out, but would have opened a local file named cpan.html in HTML mode. I used this for testing: rather than hitting the remote server repeatedly (which wouldn't have worked at 30,000 feet where I wrote most of this code), I had a local cache.
Line 5 fetches the given URL, and parses it as the default document for the remaining operations. If I had wanted to switch back and forth between multiple documents, I would have saved this as:
$cpan := open --format html 'cpan.html';
which would then not only open the document, but also store a reference to it in the $cpan variable.
Lines 7 through 12 gather the data from the document. The for loop starts with line 7 to create the nodeset. Each element of the nodeset is made into the context node for the body of the loop. This is a very natural thing to do, because all the XPath inside the loop is then relative to this current context.
Line 8 aborts the loop if we're looking at the next country already. Note that for the United States, this test never triggers, because it's the last country in the list. That's why I tested with Germany instead.
Lines 9 and 10 extract the stringy value of the two XPath expressions into the variables $url and $time. These are also scalar variables in the XML::XSH2::Map package namespace, which is handy for the embedded Perl code coming up.
Line 11 uses the values of the two variables to add elements to an ordinary Perl hash called %XML::XSH2::Map::time. Notice how the embedded Perl code flowed very naturally. (Had I been able to specify all of the related nodes in one XPath expression, I could have used the XSH hash command, but that would have also made this program even more different from the Perl version coming up soon.)
Once I have all of the data nodes processed, I just have to normalize the data and sort and print the results. The large Perl block between lines 14 and 25 takes care of that for me.
Lines 15 to 20 normalize the refresh age by trying to interpret it as days, hours, or minutes, applying the appropriate calculation to convert the age to hours.
Lines 22 to 24 are a normal ``sort by value'' loop to dump the URLs by their increasing ages. And that's all there is to our program!
Running this, we get nice output that right at the moment looks like:
ftp://ftp.uni-erlangen.de/pub/source/CPAN/ => 0.4
http://netmirror.org/mirror/CPAN/ => 1
ftp://cpan.mirror.iphh.net/pub/CPAN/ => 1
http://www.chemmedia.de/mirrors/CPAN/ => 2
ftp://ftp.fu-berlin.de/unix/languages/perl/ => 2
[... middle items omitted for space ...]
http://www.cpan.mirroarrr.de/ => 672
http://cpan.linux-mirror.org/ => 2224.8
Yes, the newest item was refreshed within the hour, but the oldest hasn't been updated in nearly a quarter of a year (2224.8 hours is about 93 days). Woe be to those who have selected that mirror accidentally!
Once we have this information, we can hop into our CPAN/CPANPLUS shell and reconfigure it to use the most recent mirrors, and all is good.
Now, XSH is based on XML::LibXML, which is amazingly fast at parsing XML and relatively sane HTML. On my machine, this program runs from the command line in less than a CPU second, which is tolerably fast. Let's look at a similar application written using HTML::TreeBuilder::XPath, in [listing two, below], and compare speeds when we're done.
Line 1 here selects my local perl command. While I could have just put in the proper path, I wanted to keep the programs as parallel as possible.
Lines 3 and 4 bring in the needed CPAN modules. Unlike XML::LibXML, the HTML::Parser module (on which TreeBuilder is layered) doesn't have built-in HTTP fetching, so I have to use the separate LWP library for that.
Line 6 creates the parsing object. Line 7 or 8 grabs the document source, either locally or remotely, and parses it, resulting in a document tree.
The rest of the code is also very similar to the XSH version. Lines 11 to 16 identify and parse the nodeset. Notice that the XPath expressions are identical, although we now have to specify method calls against the original document and against the resulting current node.
Lines 18 to 27 dump the result as before, using Perl code identical to the XSH version.
And if I run this program, I get the identical output, yay! But if I compare times, I see something quite interesting:
% time ./xsh-version >/dev/null
./xsh-version > /dev/null  0.60s user 0.06s system 21% cpu 3.013 total
% time ./htx-version >/dev/null
./htx-version > /dev/null  4.28s user 0.05s system 64% cpu 6.746 total
Wow, look at that. The XSH version (even considering the compilation of the XSH language into Perl before running) takes about one seventh the time of the HTX (HTML::TreeBuilder::XPath) version. I was surprised at the difference, so I did a bit of exploration and found that the HTX version was spending almost all of its time building the huge Perl data structure to represent the DOM. Because XSH doesn't need to do that (the DOM is in C-side data structures), we get a tremendous savings in time, not to mention quicker queries later.
The wall-clock difference is somewhat masked by the amount of time it takes to fetch the remote page, so the results are even more dramatic when processing local cached copies.
I hope you're inspired to consider XSH for your web-scraping needs, and to learn a bit about XPath to identify nodes of interest. Until next time, enjoy!
Listings
=0= #################### LISTING ONE ####################
=1= #!/usr/bin/env xsh
=2= quiet;
=3=
=4= # open --format html 'cpan.html';
=5= open --format html 'http://www.cs.uu.nl/people/henkp/mirmon/cpan.html';
=6=
=7= for //tr[th/a/@name = 'de']/following-sibling::* {
=8=   if th/a/@name last;
=9=   $url = (td[1]/a[1]/@href);
=10=  $time = (td[3]);
=11=  perl { $time{$url} = $time };
=12= }
=13=
=14= perl {
=15=   for (values %time) {
=16=     s{([\d.]+) days?}{$1 * 24}e
=17=       or s{([\d.]+) hours?}{$1}
=18=       or s{([\d.]+) minutes?}{sprintf "%.1f", $1 / 60}e
=19=       or die "Don't understand $_";
=20=   }
=21=
=22=   for (sort { $time{$a} <=> $time{$b} } keys %time) {
=23=     print "$_ => $time{$_}\n";
=24=   }
=25= }
=0= #################### LISTING TWO ####################
=1= #!/usr/bin/env perl
=2=
=3= use HTML::TreeBuilder::XPath;
=4= use LWP::Simple;
=5=
=6= my $tree = HTML::TreeBuilder::XPath->new;
=7= # $tree->parse_file('cpan.html');
=8= $tree->parse_content(get 'http://www.cs.uu.nl/people/henkp/mirmon/cpan.html');
=9=
=10= my %time;
=11= for ($tree->findnodes(q{//tr[th/a/@name = 'de']/following-sibling::*})) {
=12=   last if $_->exists(q{th/a/@name});
=13=   my $url = $_->findvalue(q{td[1]/a[1]/@href});
=14=   my $time = $_->findvalue(q{td[3]});
=15=   $time{"$url"} = "$time"; # ensure stringy
=16= }
=17=
=18= for (values %time) {
=19=   s{([\d.]+) days?}{$1 * 24}e
=20=     or s{([\d.]+) hours?}{$1}
=21=     or s{([\d.]+) minutes?}{sprintf "%.1f", $1 / 60}e
=22=     or die "Don't understand $_";
=23= }
=24=
=25= for (sort { $time{$a} <=> $time{$b} } keys %time) {
=26=   print "$_ => $time{$_}\n";
=27= }