Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 49 (Jun 2003)

More and more these days, you get faced with a problem with angle brackets somewhere in the data. How do you find what you're looking for in HTML or XML data?

At first glance, the question has an obvious answer. If you have an HTML task, you use HTML::Parser or some derived or wrapper class. If you have an XML task, you use XML::Parser or XML::LibXML. But maybe the obvious answer isn't always the best. Let's look at a couple of cases.

Parsing XML with HTML::Parser

My friend Doug LaFarge was recently working on an e-commerce website. Part of the task involved computing the shipping charges by connecting up with a remote web service via HTTP, passing the size and weight of the packages and destination address, and getting back a response.

Now I won't embarrass the service provider by giving their name, but they really did a pretty poor job of designing and documenting their service. First, their "sample Perl code" could never have run, as they were using + to do string concatenation. (It was apparently copied from their JavaScript example, except they weren't paying attention.) Second, they return something that is nearly XML, but has extra leading and trailing whitespace, so a true XML parser aborts. (You have to trim the whitespace before feeding the parser.) And finally, they return XML, but they're not using SOAP, which seems odd because it looks like a natural SOAP application. So, if you can get around the fact that their example programs don't run, the response requires massaging before parsing, and it's not SOAP, it works fine.

After we had informed the company that their sample program didn't work, they asked us if we could suggest some improvements to it. At first, I reached for XML::Parser, and then realized that this would be bad as model code, because in my experience, XML::Parser is a bit finicky to install, requiring expat to be installed as well. And there was still that nasty bit of needing to trim the whitespace.

But I had noticed some time ago that the friendly HTML::Parser module has an "XML mode", which is meant mostly for dealing with XHTML, but works neatly on generic well-formed XML. And, since the sample code we were developing presumed that LWP was installed, we could also presume that in nearly all cases HTML::Parser was available as well.

I quickly started hacking up some code, and within a half-hour, was happily fetching the data and at least recognizing the start/end tags and content. Let's take a look at some of the code snippets. First, we need to construct the URL containing the shipping parameters, including credentials for authorization:

  my $API_URL = "http://name.of.shipping.company/calculate.cgi";
  my $USERNAME = "doug";
  my $PASSWORD = "doug's password";
  use URI;
  my $uri = URI->new($API_URL);
  $uri->query_form(
    Username => $USERNAME,
    Password => $PASSWORD,
    FromAddress => ...,
    FromCity => ...,
    FromState => ...,
    FromZip => ...,
    ToAddress => ...,
    ...
    Package1Name => 'big box',
    Package1Weight => 10,
    Package1Width => 20,
    ...
    Package2Name => 'small tube',
    Package2Weight => 5,
    ...
    Carrier1Name => 'MonkeyFlingers',
    Carrier2Name => 'StarvingSoftwareEngineers',
    ...
    Method1Name => 'Overnight',
    Method2Name => 'SlowBoatToChina',
    ...
  );

I'm leaving a lot out here. Let's just say we end up with a URL that's about 300 to 1000 characters long. Ugh. Dumb interface. Now, we make the request:

  use LWP::Simple;
  my $response = get $uri;

At this point, $response is either undef (the fetch failed), or some XML-like string (with the ugly extra whitespace). Again, simplifying it a bit, it looks like this:

  <?xml version="1.0"?>
  <response>
    <package id="big box">
      <quote id=1>
        <carrier>MonkeyFlingers</carrier>
        <method>Overnight</method>
        <amount>123.95</amount>
      </quote>
      <quote id=2>
        <carrier>MonkeyFlingers</carrier>
        <method>SlowBoatToChina</method>
        <amount>3.95</amount>
      </quote>
      <quote id=3>
        <carrier>StarvingSoftwareEngineers</carrier>
        <method>Overnight</method>
        <amount>99.50</amount>
      </quote>
      <quote id=4>
        <carrier>StarvingSoftwareEngineers</carrier>
        <method>SlowBoatToChina</method>
        <amount>3.50</amount>
      </quote>
    </package>
    <package id="small tube">
      <quote id=1>
        <carrier>MonkeyFlingers</carrier>
        <method>Overnight</method>
        <amount>85.50</amount>
      </quote>
      <quote id=2>
        <carrier>MonkeyFlingers</carrier>
        <method>SlowBoatToChina</method>
        <amount>3.95</amount>
      </quote>
      <quote id=3>
        <carrier>StarvingSoftwareEngineers</carrier>
        <method>Overnight</method>
        <amount>72.50</amount>
      </quote>
      <quote id=4>
        <carrier>StarvingSoftwareEngineers</carrier>
        <method>SlowBoatToChina</method>
        <amount>3.00</amount>
      </quote>
    </package>
  </response>

Because it's promised to be well-formed, we know that we'll get nicely matching pairs of start and end tags from the parser.
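
Before any parsing, the response needs the two bits of housekeeping mentioned earlier: a check for a failed fetch, and a trim of the stray whitespace (required for a strict XML parser, harmless for HTML::Parser). Here is a minimal sketch, in which clean_response is a hypothetical helper name and the sample string merely stands in for the real reply:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the trimmed response, or dies if the fetch failed.
sub clean_response {
  my ($response) = @_;
  die "fetch failed\n" unless defined $response;
  $response =~ s/\A\s+//;   # drop leading whitespace
  $response =~ s/\s+\z//;   # drop trailing whitespace
  return $response;
}

# Stand-in for what the remote service might hand back.
my $raw = "\n\n  <response></response>  \n";
my $clean = clean_response($raw);
print "[$clean]\n";
```

Dying on undef keeps a network failure from being mistaken for an empty set of quotes later on.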

We can parse this result with HTML::Parser, using a nice program structure like:

  my @state;
  ## other results and accumulator variables go here
  use HTML::Parser;
  my $p = HTML::Parser->new
    (
     xml_mode => 1,
     start_h =>
     [sub {
        my ($tagname, $attr) = @_;
        push @state, $tagname;
        ## We are beginning state "@state"
      }, "tagname, attr"],
     text_h =>
     [sub {
        my ($text) = @_;
        ## We see content within state "@state"
      }, "dtext"],
     end_h =>
     [sub {
        my ($tagname) = @_;
        ## We are ending state "@state"
        pop @state;
      }, "tagname"],
    );
  $p->parse($response);
  $p->eof;
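
To make the bookkeeping concrete, here is the @state stack on its own, driven by a hand-written list of events rather than a real parser. The event list and the @seen array are stand-ins invented for this sketch:

```perl
use strict;
use warnings;

my @state;
my @seen;   # every value of "@state" observed at a start tag

# Simulate the start (+) and end (-) events for
# <response><package><quote></quote></package></response>
for my $event (qw(+response +package +quote -quote -package -response)) {
  my ($kind, $tagname) = $event =~ /\A([+-])(.*)\z/;
  if ($kind eq '+') {
    push @state, $tagname;
    push @seen, "@state";   # interpolation joins the elements with a space
  } else {
    pop @state;
  }
}

print "$_\n" for @seen;
```

Each start tag pushes and each end tag pops, so the stack is empty again once the document closes.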

The @state array, when interpolated within double quotes, becomes a space-separated list of states showing where we are in the XML hierarchy. For example, at the beginning of a particular package, "@state" will be "response package" in the first handler. This is the basic pattern. For our specific application, we'll need to aggregate the resulting data into our final data structure:

  my @state;
  my %quotes; # all quotes, keyed by package name
  my $package; # the current package name
  my %quote; # the current quote being accumulated for $package
  use HTML::Parser;
  my $p = HTML::Parser->new
    (
     xml_mode => 1,
     start_h =>
     [sub {
        my ($tagname, $attr) = @_;
        push @state, $tagname;
        ## We are beginning state "@state" 
        if ("@state" eq "response package") { # beginning of package
          $package = $attr->{id}; # pick out the package id
        } elsif ("@state" eq "response package quote") { # beginning of quote
          %quote = (); # empty out the quote info
        }
      }, "tagname, attr"],
     text_h =>
     [sub {
        my ($text) = @_;
        ## We see content within state "@state"
        if ("@state" eq "response package quote carrier") {
          $quote{"carrier"} = $text; # carrier for this quote
        } elsif ("@state" eq "response package quote method") {
          $quote{"method"} = $text; # method for this quote
        } elsif ("@state" eq "response package quote amount") {
          $quote{"amount"} = $text; # amount for this quote
        }
      }, "dtext"],
     end_h =>
     [sub {
        my ($tagname) = @_;
        ## We are ending state "@state"
        if ("@state" eq "response package quote") { # end of a quote
          push @{$quotes{$package}}, { %quote }; # save hash copy
        }
        pop @state;
      }, "tagname"],
    );
  $p->parse($response);
  $p->eof;

Wow. Lots of stuff there. Basically, I looked at each beginning, middle, and end of each state, and attached actions to perform at that step. Beginning states are used to reset accumulator variables, or save the attributes of the start tag. Middles are used to extract the text content between elements. Ends merge the accumulators into larger structures. If you keep that pattern in mind, it's pretty easy to come up with the locations for things. The resulting data structure when dumped with Data::Dumper looks like this:

  $VAR1 = {
    'big box' => [
      {
        'carrier' => 'MonkeyFlingers',
        'amount' => '123.95',
        'method' => 'Overnight'
      },
      {
        'carrier' => 'MonkeyFlingers',
        'amount' => '3.95',
        'method' => 'SlowBoatToChina'
      },
      {
        'carrier' => 'StarvingSoftwareEngineers',
        'amount' => '99.50',
        'method' => 'Overnight'
      },
      {
        'carrier' => 'StarvingSoftwareEngineers',
        'amount' => '3.50',
        'method' => 'SlowBoatToChina'
      }
    ],
    'small tube' => [
      {
        'carrier' => 'MonkeyFlingers',
        'amount' => '85.50',
        'method' => 'Overnight'
      },
      {
        'carrier' => 'MonkeyFlingers',
        'amount' => '3.95',
        'method' => 'SlowBoatToChina'
      },
      {
        'carrier' => 'StarvingSoftwareEngineers',
        'amount' => '72.50',
        'method' => 'Overnight'
      },
      {
        'carrier' => 'StarvingSoftwareEngineers',
        'amount' => '3.00',
        'method' => 'SlowBoatToChina'
      }
    ]
  };

And then we'd wander through that structure in the rest of the application. The problem is solved, by using HTML::Parser to parse XML.
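
As one illustration of that wandering, here is a sketch that picks the lowest-priced quote for each package. The cheapest_quote helper is a hypothetical name, and the hash is a cut-down copy of the structure above rather than live data:

```perl
use strict;
use warnings;

# A cut-down copy of the structure produced by the parser.
my %quotes = (
  'big box' => [
    { carrier => 'MonkeyFlingers', method => 'Overnight', amount => '123.95' },
    { carrier => 'StarvingSoftwareEngineers', method => 'SlowBoatToChina', amount => '3.50' },
  ],
  'small tube' => [
    { carrier => 'MonkeyFlingers', method => 'SlowBoatToChina', amount => '3.95' },
    { carrier => 'StarvingSoftwareEngineers', method => 'SlowBoatToChina', amount => '3.00' },
  ],
);

# Hypothetical helper: the lowest-priced quote for one package.
sub cheapest_quote {
  my ($quotes_for_package) = @_;
  my ($best) = sort { $a->{amount} <=> $b->{amount} } @$quotes_for_package;
  return $best;
}

for my $package (sort keys %quotes) {
  my $best = cheapest_quote($quotes{$package});
  printf "%s: %s via %s for %s\n",
    $package, $best->{carrier}, $best->{method}, $best->{amount};
}
```

The amounts are kept as the strings the service returned, so they print exactly as quoted; the numeric comparison happens only inside the sort.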

Parsing HTML with XML::LibXML

The XML::LibXML module is a wrapper around the GNOME libxml2 parser, which is perhaps even more finicky to install than expat, but I seem to have managed. It's worth it, though, because of the additional functionality (and, I'm told, speed) over the older expat.

First, the XML::LibXML module can parse HTML, including dealing with the optional close tags for the elements, and return a nice node tree, suitable for spitting out as XHTML. For example, parsing and cleaning up the http://www.perl.org web page looks like this:

  use LWP::Simple;
  my $html = get "http://www.perl.org";
  use XML::LibXML;
  my $doc = XML::LibXML->new->parse_html_string($html);
  print $doc->toStringHTML;

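The result has all the tags nicely balanced. Note that toStringHTML serializes using HTML rules; to get XML-style output closer to XHTML, call toString on the same document instead.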

But another nice feature of XML::LibXML is the built-in XPath processor. For web-scraping, this is a very powerful tool. For example, let's say I want to find the current rank of Learning Perl in O'Reilly's top-25 book sales page (updated weekly).

  use LWP::Simple;
  my $html = get "http://www.oreilly.com/catalog/top25.html";
  use XML::LibXML;
  my $doc = XML::LibXML->new->parse_html_string($html);

I now have a DOM object of the page. I'm interested in the table in the middle of the page that has the book rankings. In the table, the td cell containing Learning Perl is in the same row as the cell containing the ranking. With a simple bit of XPath magic, I can first locate the cell containing the title:

  //text()[contains(., "Learning Perl")]

and then from there go to the closest enclosing row and pick out the first table cell's content:

  //text()[contains(., "Learning Perl")]/ancestor::tr[1]/td[1]/text()

and then get the string value of that node. The nice thing about this XPath is that it's relatively immune to layout changes or added information or reformatting. We're specifying a location by logical steps and not directly by syntax. Back to our DOM, this would be simply:

  use LWP::Simple;
  my $html = get "http://www.oreilly.com/catalog/top25.html";
  use XML::LibXML;
  my $doc = XML::LibXML->new->parse_html_string($html);
  my $location =
    '//text()[contains(., "Learning Perl")]' .
    '/ancestor::tr[1]/td[1]/text()';
  print $doc->findvalue($location);

I got to the data I needed, relatively easily. It didn't even matter that the book title was actually within an offpage link. It just did the right thing. And that's why you should consider parsing HTML using an XML parser, especially if you're web-scraping.
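
Since live pages change, the same XPath can be tried against a canned snippet. The following is a sketch only: the table, the link targets, and the other book title are made up for illustration and are not taken from the real rankings page.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up stand-in for the rankings table; not the real page.
my $html = <<'END_HTML';
<html><body>
<table>
<tr><td>1</td><td><a href="/catalog/a/">Some Other Book</a></td></tr>
<tr><td>2</td><td><a href="/catalog/b/">Learning Perl, 3rd Edition</a></td></tr>
</table>
</body></html>
END_HTML

my $doc = XML::LibXML->new->parse_html_string($html);

# Find the text mentioning the title, climb to its row, take the first cell.
my $location =
  '//text()[contains(., "Learning Perl")]' .
  '/ancestor::tr[1]/td[1]/text()';
my $rank = $doc->findvalue($location);
print "rank: $rank\n";
```

Note that the title sits inside an a element in this snippet too, and the ancestor::tr[1] step still finds the enclosing row.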

Summary

I hope you've seen now that sometimes using the wrong tool for the right reasons can be fun and useful. Until next time, enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.