Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Linux Magazine Column 49 (Jun 2003)
More and more these days, you get faced with a problem with angle brackets somewhere in the data. How do you find what you're looking for in HTML or XML data?
At first glance, the question has an obvious answer. If you have an HTML task, you use HTML::Parser or some derived or wrapper class. If you have an XML task, you use XML::Parser or XML::LibXML.
But maybe the obvious answer isn't always the best. Let's look at a
couple of cases.
Parsing XML with HTML::Parser
My friend Doug LaFarge was recently working on an e-commerce website. Part of the task involved computing the shipping charges by connecting up with a remote web service via HTTP, passing the size and weight of the packages and destination address, and getting back a response.
Now I won't embarrass the service provider by giving their name, but they really did a pretty poor job of designing and documenting their service. First, their "sample Perl code" could never have run, as they were using + to do string concatenation. (It was apparently copied from their JavaScript example, except they weren't paying attention.) Second, they return something that is nearly XML, but has extra leading and trailing whitespace, so a true XML parser aborts. (You have to trim the whitespace before feeding the parser.) And finally, they return XML, but they're not using SOAP, which seems odd because it looks like a natural SOAP application. So, if you can get around the fact that their example programs don't run, the response requires massaging before parsing, and it's not SOAP, it works fine.
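The whitespace massaging mentioned above takes only a couple of substitutions. Here is a minimal sketch, assuming the reply has already been fetched into a string; the sample text in $raw and the helper name trim_response are made up for illustration and are not part of the service or its sample code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the service's reply, padded the way the
# real one reportedly was
my $raw = "\n\n   <response><amount>3.95</amount></response>  \n";

# Return a copy of the text without leading or trailing whitespace
sub trim_response {
    my ($text) = @_;
    $text =~ s/\A\s+//;    # strip everything before the first non-space
    $text =~ s/\s+\z//;    # strip everything after the last non-space
    return $text;
}

my $trimmed = trim_response($raw);
print "[$trimmed]\n";
```

The \A and \z anchors tie the patterns to the very start and end of the string, so whitespace inside the document is left alone.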
After we had informed the company that their sample program didn't work, they asked us if we could suggest some improvements to it. At first, I reached for XML::Parser, and then realized that this would be bad as model code, because in my experience, XML::Parser is a bit finicky to install, requiring expat to be installed as well. And there was still that nasty bit of needing to trim the whitespace.
But I had noticed some time ago that the friendly HTML::Parser module has an "XML mode", which modifies the parser so that it can deal mostly with XHTML, but works neatly on generic well-formed XML. And, since the sample code we were developing was presuming that LWP was installed, we could also presume that in nearly all cases, we had HTML::Parser as well.
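To get a feel for what XML mode changes, here is a small sketch that records the start and end events for a tiny document. The document itself is invented for illustration; it has the stray whitespace, a mixed-case tag name, and an empty-element tag.

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;    # filled in by the handlers below

my $p = HTML::Parser->new(
    xml_mode => 1,
    start_h  => [ sub { push @events, "start $_[0]" }, "tagname" ],
    end_h    => [ sub { push @events, "end $_[0]" },   "tagname" ],
);

# Made-up input: padded with whitespace, mixed-case names, and an
# empty-element tag
$p->parse(qq{\n  <Response><Empty/></Response>  \n});
$p->eof;

print "$_\n" for @events;
```

With xml_mode enabled, the tag names should come back with their original case, and the empty-element tag should produce both a start and an end event. The surrounding whitespace is simply text that no handler here asks for, so it causes no trouble.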
I quickly started hacking up some code, and within a half-hour, was happily fetching the data and at least recognizing the start/end tags and content. Let's take a look at some of the code snippets. First, we need to construct the URL containing the shipping parameters, including credentials for authorization:
    my $API_URL = "http://name.of.shipping.company/calculate.cgi";
    my $USERNAME = "doug";
    my $PASSWORD = "doug's password";

    use URI;
    my $uri = URI->new($API_URL);
    $uri->query_form(
      Username => $USERNAME,
      Password => $PASSWORD,
      FromAddress => ...,
      FromCity => ...,
      FromState => ...,
      FromZip => ...,
      ToAddress => ...,
      ...
      Package1Name => 'big box',
      Package1Weight => 10,
      Package1Width => 20,
      ...
      Package2Name => 'small tube',
      Package2Weight => 5,
      ...
      Carrier1Name => 'MonkeyFlingers',
      Carrier2Name => 'StarvingSoftwareEngineers',
      ...
      Method1Name => 'Overnight',
      Method2Name => 'SlowBoatToChina',
      ...
    );
I'm leaving a lot out here. Let's just say we end up with a URL that's about 300 to 1000 characters long. Ugh. Dumb interface. Now, we make the request:
    use LWP::Simple;
    my $response = get $uri;
At this point, $response is either undef (the fetch failed), or some XML-like string (with the ugly extra whitespace). Again, simplifying it a bit, it looks like this:
    <?xml version="1.0">
    <response>
      <package id="big box">
        <quote id=1>
          <carrier>MonkeyFlingers</carrier>
          <method>Overnight</method>
          <amount>123.95</amount>
        </quote>
        <quote id=2>
          <carrier>MonkeyFlingers</carrier>
          <method>SlowBoatToChina</method>
          <amount>3.95</amount>
        </quote>
        <quote id=3>
          <carrier>StarvingSoftwareEngineers</carrier>
          <method>Overnight</method>
          <amount>99.50</amount>
        </quote>
        <quote id=4>
          <carrier>StarvingSoftwareEngineers</carrier>
          <method>SlowBoatToChina</method>
          <amount>3.50</amount>
        </quote>
      </package>
      <package id="small tube">
        <quote id=1>
          <carrier>MonkeyFlingers</carrier>
          <method>Overnight</method>
          <amount>85.50</amount>
        </quote>
        <quote id=2>
          <carrier>MonkeyFlingers</carrier>
          <method>SlowBoatToChina</method>
          <amount>3.95</amount>
        </quote>
        <quote id=3>
          <carrier>StarvingSoftwareEngineers</carrier>
          <method>Overnight</method>
          <amount>72.50</amount>
        </quote>
        <quote id=4>
          <carrier>StarvingSoftwareEngineers</carrier>
          <method>SlowBoatToChina</method>
          <amount>3.00</amount>
        </quote>
      </package>
    </response>
Because it's promised to be well-formed, we know that we'll get nicely matching pairs of start and end tags from the parser.
We can parse this result with HTML::Parser using a nice program structure like:
    my @state;
    ## other results and accumulator variables go here

    use HTML::Parser;
    my $p = HTML::Parser->new
      (
       xml_mode => 1,
       start_h =>
       [sub {
          my ($tagname, $attr) = @_;
          push @state, $tagname;
          ## We are beginning state "@state"
        }, "tagname, attr"],
       text_h =>
       [sub {
          my ($text) = @_;
          ## We see content within state "@state"
        }, "dtext"],
       end_h =>
       [sub {
          my ($tagname) = @_;
          ## We are ending state "@state"
          pop @state;
        }, "tagname"],
      );

    $p->parse($response);    # the string fetched earlier
    $p->eof;
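To see what the state stack looks like in practice, the skeleton above can be run against a cut-down document. This is only a sketch: the one-quote sample string and the @seen array are assumptions added for illustration, and only the start and end handlers are wired up.

```perl
use strict;
use warnings;
use HTML::Parser;

my @state;    # stack of currently open element names
my @seen;     # the interpolated stack, recorded at each start tag

my $p = HTML::Parser->new(
    xml_mode => 1,
    start_h  => [ sub {
        my ($tagname) = @_;
        push @state, $tagname;
        push @seen, "@state";    # e.g. "response package"
    }, "tagname" ],
    end_h    => [ sub { pop @state }, "tagname" ],
);

# A trimmed-down, made-up version of the service's reply
$p->parse('<response><package id="big box"><quote id="1">'
        . '<amount>123.95</amount></quote></package></response>');
$p->eof;

print "$_\n" for @seen;
```

Each recorded line is one level deeper than the last, and because every start tag has a matching end tag, the stack is empty again once parsing finishes.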
The @state array, when interpolated within double quotes, will be a space-separated list of states showing where we are in the XML hierarchy. For example, at the beginning of a particular package, "@state" will be "response package" in the first handler. This is the basic pattern. For our specific application, we'll need to aggregate the resulting data into our final data structure:
    my @state;
    my %quotes;   # all quotes, keyed by package name
    my $package;  # the current package name
    my %quote;    # the current quote being accumulated for $package

    use HTML::Parser;
    my $p = HTML::Parser->new
      (
       xml_mode => 1,
       start_h =>
       [sub {
          my ($tagname, $attr) = @_;
          push @state, $tagname;
          ## We are beginning state "@state"
          if ("@state" eq "response package") { # beginning of package
            $package = $attr->{id}; # pick out the package id
          } elsif ("@state" eq "response package quote") { # beginning of quote
            %quote = (); # empty out the quote info
          }
        }, "tagname, attr"],
       text_h =>
       [sub {
          my ($text) = @_;
          ## We see content within state "@state"
          if ("@state" eq "response package quote carrier") {
            $quote{"carrier"} = $text; # carrier for this quote
          } elsif ("@state" eq "response package quote method") {
            $quote{"method"} = $text; # method for this quote
          } elsif ("@state" eq "response package quote amount") {
            $quote{"amount"} = $text; # amount for this quote
          }
        }, "dtext"],
       end_h =>
       [sub {
          my ($tagname) = @_;
          ## We are ending state "@state"
          if ("@state" eq "response package quote") { # end of a quote
            push @{$quotes{$package}}, { %quote }; # save hash copy
          }
          pop @state;
        }, "tagname"],
      );

    $p->parse($response);    # the string fetched earlier
    $p->eof;
Wow. Lots of stuff there. Basically, I looked at each beginning, middle, and end of each state, and attached actions to perform at that step. Beginning states are used to reset accumulator variables, or save the attributes of the start tag. Middles are used to extract the text content between elements. Ends merge the accumulators into larger structures. If you keep that pattern in mind, it's pretty easy to come up with the locations for things. The resulting data structure when dumped with Data::Dumper looks like this:
    $VAR1 = {
      'big box' => [
        {
          'carrier' => 'MonkeyFlingers',
          'amount' => '123.95',
          'method' => 'Overnight'
        },
        {
          'carrier' => 'MonkeyFlingers',
          'amount' => '3.95',
          'method' => 'SlowBoatToChina'
        },
        {
          'carrier' => 'StarvingSoftwareEngineers',
          'amount' => '99.50',
          'method' => 'Overnight'
        },
        {
          'carrier' => 'StarvingSoftwareEngineers',
          'amount' => '3.50',
          'method' => 'SlowBoatToChina'
        }
      ],
      'small tube' => [
        {
          'carrier' => 'MonkeyFlingers',
          'amount' => '85.50',
          'method' => 'Overnight'
        },
        {
          'carrier' => 'MonkeyFlingers',
          'amount' => '3.95',
          'method' => 'SlowBoatToChina'
        },
        {
          'carrier' => 'StarvingSoftwareEngineers',
          'amount' => '72.50',
          'method' => 'Overnight'
        },
        {
          'carrier' => 'StarvingSoftwareEngineers',
          'amount' => '3.00',
          'method' => 'SlowBoatToChina'
        }
      ]
    };
And then we'd wander through that structure in the rest of the application. The problem is solved, by using HTML::Parser to parse XML.
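As one illustration of wandering through that structure, here is a sketch that picks the cheapest quote for each package. The %quotes hash below is a hand-copied subset of the dump shown earlier, and picking the cheapest quote is just an example task, not something the original application is known to have done.

```perl
use strict;
use warnings;

# A subset of the parsed data, in the same shape the handlers build
my %quotes = (
    'big box' => [
        { carrier => 'MonkeyFlingers', method => 'Overnight',
          amount  => '123.95' },
        { carrier => 'StarvingSoftwareEngineers', method => 'SlowBoatToChina',
          amount  => '3.50' },
    ],
    'small tube' => [
        { carrier => 'MonkeyFlingers', method => 'SlowBoatToChina',
          amount  => '3.95' },
        { carrier => 'StarvingSoftwareEngineers', method => 'SlowBoatToChina',
          amount  => '3.00' },
    ],
);

my %cheapest;    # package name => the lowest-priced quote hashref
for my $package (sort keys %quotes) {
    # Sort this package's quotes numerically by amount, keep the first
    my ($best) = sort { $a->{amount} <=> $b->{amount} } @{ $quotes{$package} };
    $cheapest{$package} = $best;
    printf "%-10s %s via %s for \$%.2f\n",
        $package, $best->{carrier}, $best->{method}, $best->{amount};
}
```

Because each package maps to an array of small hashes, ordinary sort and hash lookups are all that's needed; no XML knowledge leaks into the rest of the program.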
Parsing HTML with XML::LibXML
The XML::LibXML module is a wrapper around the GNOME libxml2 parser, which is perhaps even more finicky to install than expat, but I seem to have managed. But it's worth it, because of the additional functionality (and I'm told, speed) over the older expat.
First, the XML::LibXML module can parse HTML, including dealing with the optional close tags for the elements, and return a nice node tree, suitable for spitting out as XHTML. For example, parsing and cleaning up the http://www.perl.org web page looks like this:
    use LWP::Simple;
    my $html = get "http://www.perl.org";
    use XML::LibXML;
    my $doc = XML::LibXML->new->parse_html_string($html);
    print $doc->toStringHTML;
The result is cleaned-up HTML, with all the tags nicely balanced.
But another nice feature of XML::LibXML is the built-in XPath processor. For web-scraping, this is a very powerful tool. For example, let's say I want to find the current rank of Learning Perl in O'Reilly's top-25 book sales page (updated weekly).
    use LWP::Simple;
    my $html = get "http://www.oreilly.com/catalog/top25.html";
    use XML::LibXML;
    my $doc = XML::LibXML->new->parse_html_string($html);
I now have a DOM object of the page. I'm interested in the table in the middle of the page that has the book rankings. In the table, the td cell containing Learning Perl is in the same row as the cell containing the ranking. With a simple bit of XPath magic, I can first locate the cell containing the title:
//text()[contains(., "Learning Perl")]
and then from there go to the closest enclosing row and pick out the first table cell's content:
//text()[contains(., "Learning Perl")]/ancestor::tr[1]/td[1]/text()
and then get the string value of that node. The nice thing about this XPath is that it's relatively immune to layout changes or added information or reformatting. We're specifying a location by logical steps and not directly by syntax. Back to our DOM, this would be simply:
    use LWP::Simple;
    my $html = get "http://www.oreilly.com/catalog/top25.html";
    use XML::LibXML;
    my $doc = XML::LibXML->new->parse_html_string($html);
    my $location =
      '//text()[contains(., "Learning Perl")]' .
      '/ancestor::tr[1]/td[1]/text()';
    print $doc->findvalue($location);
I got to the data I needed, relatively easily. It didn't even matter that the book title was actually within an offpage link. It just did the right thing. And that's why you should consider parsing HTML using an XML parser, especially if you're webscraping.
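Since the live page changes weekly, the same XPath can be tried offline against a small inline document. This is a sketch under stated assumptions: the table markup, link targets, and rankings below are invented to mimic the general shape described above, not copied from the real page.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up stand-in for the rankings table; the titles sit inside links,
# just as on the page described above
my $html = <<'HTML';
<html><body>
<table>
<tr><td>1</td><td><a href="/catalog/a">Some Other Book</a></td></tr>
<tr><td>2</td><td><a href="/catalog/b">Learning Perl, 3rd Edition</a></td></tr>
</table>
</body></html>
HTML

my $doc = XML::LibXML->new->parse_html_string($html);

# Find the title text, climb to its enclosing row, take the first cell
my $location =
    '//text()[contains(., "Learning Perl")]' .
    '/ancestor::tr[1]/td[1]/text()';

my $rank = $doc->findvalue($location);
print "rank: $rank\n";
```

The ancestor::tr[1] step climbs from the text node through the link and the cell to the nearest row, which is why the extra markup around the title doesn't get in the way.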
Summary
I hope you've seen now that sometimes using the wrong tool for the right reasons can be fun and useful. Until next time, enjoy!