Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 55 (Jan 2004)

[suggested title: ``Using xsh to scrape web pages'']

One activity I find myself frequently attempting is extracting bits of useful information from existing web pages that change over some time period. In an ideal world, everything I would want would be provided via some RSS feed or ``wholesale'' SOAP web service, but in the world I still live in, I usually end up parsing the ``retail'' HTML intended for browser views.

Although HTML isn't XML, they share common roots, and I've been experimenting lately with using XML::LibXML to parse HTML. (See this column, June 2003, for example.) The advantage of using an XML parser to handle the angley-bracketed text is that once the text is parsed, we can use DOM and XPath operations on the result, sometimes resulting in greater speed or flexibility over the traditional data structures built using HTML::Parser.

Recently, I've started playing with the evolving xsh language, which can be briefly described as an XML manipulation shell. xsh uses XML::LibXML to parse an XML file into an internal document structure. Once the document is built, we can manipulate it using a mix of Perl syntax and other control structures specifically designed for navigating a document tree. Following the Unix filesystem metaphor, we can use the xsh commands ``cd'' and ``pwd'' much like their Unix counterparts, where the ``current directory'' is a document node of interest.

Many operations in xsh are specified using XPath, so a working knowledge of XPath is very helpful. For example,

        cd /;

focuses the ``current node'' at the top of the document tree, while:

        pwd;

shows that we are currently located there. To get ``inside'' the root node of the document, we can use:

        cd /*;

which finds the one matching node and sets that as the current focus, which is undone by:

        cd ..;

The metaphor works nicely because XPath's notation mimics the Unix filesystem for many common operations.

Once we have a node of interest, we can display its location using locate, as in:

        locate .; # same as "pwd;"

but I use the locate command more often to print all nodes that match a given XPath expression, as in:

        locate //a[@href];

which finds and shows the path to all a nodes that have an href attribute. To display the href directly, I can use a similar expression:

        locate //a/@href;

which sets the context to the attribute itself. If I merely wanted a count, I could replace this with:

        count //a/@href;
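For readers who know XML::LibXML better than xsh, the three queries above correspond roughly to the following plain Perl. This is only a sketch of the analogy, not how xsh implements these commands, and the sample document is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, made up for illustration.
my $doc = XML::LibXML->load_xml(string => <<'END_XML');
<html>
  <body>
    <p><a href="/one">one</a> and <a name="anchor">an anchor</a></p>
    <p><a href="/two">two</a></p>
  </body>
</html>
END_XML

# Roughly "locate //a[@href]": the path to each matching element.
my @link_paths = map { $_->nodePath } $doc->findnodes('//a[@href]');
print "$_\n" for @link_paths;

# Roughly "locate //a/@href", but printing the attribute values.
my @hrefs = map { $_->value } $doc->findnodes('//a/@href');
print "$_\n" for @hrefs;

# Roughly "count //a/@href".
my $href_count = $doc->findvalue('count(//a/@href)');
print "$href_count\n";
```

The a element that has only a name attribute is skipped by all three queries, because each one requires an href.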

Like other traditional shells, xsh can be used both as a scripting language and interactively. I've found that the best way to write an xsh program is to have an editor open on my emerging program, and another window running an interactive session on a sample document similar to the one I'll eventually be parsing. I can start this with:

        xsh -I path/to/some/file.xml

and I get the xsh prompt. In interactive mode, an implied semicolon at the end of each line makes entering statements easier.

The real power of xsh is that it can be intertwined with Perl code in your script. At any point, you can invoke Perl with:

        eval { ... block of perl code here ... };

The scalar variables are shared between the Perl code and the xsh code, simplifying the integration. And, you can call back to xsh from inside the Perl code using the xsh() function.

While I won't have room to teach all of xsh in this article, I suggest you surf on over to xsh.sourceforge.net for further information. In the meanwhile, let me introduce xsh a little more by way of an example.

I noticed the other day that http://www.oreilly.com/animals.html has a list of the O'Reilly ``animal'' covers, organized by cover title, but as I was scanning through the list, I noticed that a few of the animals were used for more than one title. I was curious about how many animals were reused, so I decided to write a program to extract the information. I started by invoking an xsh shell, entering:

        open HTML a = http://www.oreilly.com/animals.html

The HTML flag here tells xsh to use the HTML-string-parsing interfaces, rather than the XML-string-parsing interfaces. Additionally, because XML::LibXML uses Gnome's libxml2, we don't need to use LWP or some external program to fetch the URL.

Once I had the document in memory, I used some simple XPath queries to determine the structure of the web page. For example, I found all tables that weren't nested (didn't contain another table) with:

        locate //table[not(.//table)]
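To see what that predicate selects, here is a small sketch using XML::LibXML directly on a made-up fragment (the markup and the id values are hypothetical, not taken from the O'Reilly page). The outer layout table contains another table, so only the inner one survives the filter.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup: an outer layout table wrapping one data table.
my $html = <<'END_HTML';
<html><body>
<table id="layout"><tr><td>
  <table id="data"><tr><td>Book Title</td><td>Animal</td></tr></table>
</td></tr></table>
</body></html>
END_HTML

my $doc = XML::LibXML->load_html(string => $html, recover => 1);

# Keep only tables that have no table among their descendants.
my @leaf_ids = map { $_->getAttribute('id') }
               $doc->findnodes('//table[not(.//table)]');
print "@leaf_ids\n";
```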

Certainly I could have stared at the raw HTML (or even a prettied version) for a long time to find the same information. With xsh, I was simply ``exploring a document'' using XPath.

After a bit of experimentation, I ended up with the program shown in [listing one, below]. Note that the entire program consists of a use statement in line 3, followed by a call to xsh() in line 4 that passes the remaining lines as a here-document. I really wanted to do something like:

        #!/usr/bin/env xsh
        ... xsh script here ...

but unfortunately, the version of xsh as I write this requires a -l flag to load a script. I'm told that a future version of xsh will work as needed.

The xsh script starts in line 6, with a command to enable recovering mode. Even though XML::LibXML deals relatively well with HTML, many web pages (including the one we're parsing) contain broken entity references. A hint to web page programmers: the text

        <a href="/some/place?fred=flintstone&barney=rubble">
        click here!</a>

is broken. You need to escape that ampersand as &amp;. Just because nearly every browser error-corrects for this is no excuse to write bad HTML!
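If the links are generated from Perl, one way to avoid the problem is to escape bare ampersands before writing the attribute. The helper below is a hypothetical illustration (escape_bare_ampersands is not part of any module), and its regular expression is only a heuristic: it leaves alone anything that already looks like an entity or character reference.

```perl
use strict;
use warnings;

# Escape any "&" that does not already begin an entity or character reference.
# Heuristic only: a query parameter that happens to look like "&name;" is left as-is.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

my $fixed = escape_bare_ampersands('/some/place?fred=flintstone&barney=rubble');
print qq{<a href="$fixed">click here!</a>\n};
```

Escaping every value at the point where the HTML is generated is more robust than patching strings afterward; the sketch only shows what the corrected attribute should look like.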

Line 7 turns on ``quiet'' mode, which prevents the open in line 8 from announcing its success.

An xsh script can have many documents open at once. XPath expressions can refer to nodes in other documents by prefixing the document name and a colon in front of the traditional XPath expression.

Lines 9 through 18 form a two-level nested foreach loop structure. The foreach beginning in line 9 puts a traditional Perl expression inside curly braces. Each value of the resulting list is placed in turn into $__ (yes, with two underscores for reasons I don't completely understand).

The inner foreach loop uses an XPath expression to define a list of nodes. The ``current node'' is set to each matching node, and the block of code is then executed. Note that we're looking for all tables that don't contain a nested table, and which have a first row that has a first or second table cell that contains Book Title. The value of $__ is interpolated directly from the variable set in the outer loop. If I were a bit more clever, I might have been able to do without the nested loops, but I didn't care at this point, since the program worked. The final part of the XPath expression finds all table rows after the first row, which is where the real data is found.

Line 13 contains a debugging step... I wanted to see where these rows were actually found as I was developing the program. The xsh script can include Perl-style pound-sign comments, so this is commented out.

Line 14 assigns the string value of the last table cell in the row currently being examined to a scalar $cover. This variable is visible both to further xsh steps as well as included Perl code. I observed that the last cell always contained the animal (or other) cover, hence the capture. Similarly, $subject is set in line 15 to be the string value of the penultimate table cell. The values are automatically de-entitized, so I end up with a plain string here.

Line 16 breaks out into Perl to access a traditional Perl hash named %cover. The keys are the cover animals, while the corresponding values are array references listing all books with that particular animal.

Note the ease with which Perl and xsh code co-exist to produce the result. And, while this could have been written using a more traditional straight invocation of XML::LibXML, I think we're ahead by about five lines of code already in the first 15 lines here.
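For comparison, a straight XML::LibXML version of the same extraction step might look something like the sketch below. It is one possible rendering, not a line-for-line translation of the listing, and it runs against a small made-up table rather than the live page, so the column layout is an assumption.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the real page: one data table whose last two
# cells in each row are the book title and the cover animal.
my $html = <<'END_HTML';
<html><body><table>
<tr><td>Book Title</td><td>Animal</td></tr>
<tr><td>Learning Perl</td><td>Llama</td></tr>
<tr><td>Google Hacks</td><td>Locking pliers</td></tr>
<tr><td>Google Pocket Guide</td><td>Locking pliers</td></tr>
</table></body></html>
END_HTML

my $doc = XML::LibXML->load_html(string => $html, recover => 1);

my %cover;    # animal => [ titles ]
for my $col (1 .. 2) {
    # Same XPath as the listing, with $col interpolated in place of $__.
    my $rows = '//table[not(.//table)'
             . qq{ and contains(tr[1]/td[$col], "Book Title")]}
             . '/tr[position() > 1]';
    for my $row ($doc->findnodes($rows)) {
        my $cover   = $row->findvalue('string(td[last()])');
        my $subject = $row->findvalue('string(td[last() - 1])');
        push @{ $cover{$cover} }, $subject;
    }
}

for my $animal (sort keys %cover) {
    print "$animal: ", join(', ', sort @{ $cover{$animal} }), "\n";
}
```

In this sample only the first column's header contains ``Book Title'', so the second pass of the outer loop matches nothing and no row is counted twice.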

Now for the fun part. I want to create a new XML output that looks something like this:

    <?xml version="1.0" encoding="utf-8"?>
    <root>
      ...
      <cover>
        <animal>Lions</animal>
        <book>Java &amp; XML</book>
      </cover>
      <cover>
        <animal>Llama</animal>
        <book>Learning Perl</book>
      </cover>
      <cover>
        <animal>Llama &amp; camel</animal>
        <book>Perl Pocket Reference</book>
      </cover>
      <cover>
        <animal>Locking pliers</animal>
        <book>Google Hacks</book>
        <book>Google Pocket Guide</book>
      </cover>
      ...
    </root>

I can do this by walking through the newly created hash and using traditional print operations, but it's more fun to just use xsh. Line 19 creates a new document t1 and gives it a root node of root.

Line 20 uses a Perl-style foreach expression to get the sorted keys of %cover. Note that these animals will be in $__, not $_, and I traced this in line 21 while I was debugging the program.

Line 22 adds a new cover element at the end of the root element. These new nodes are always added last, and line 23 moves our current focus inside this new element.

Lines 24 and 25 create the animal node within the most recent cover node. The value of $__ is automatically re-entitized to be valid XML.

Lines 26 through 30 walk through the titles for the given animal cover, again using a Perl-style foreach loop. The book titles appear in $__, traced in line 27 during debugging.

Each new book element is created at the end of the current node in line 28, and the title text is inserted into this node in line 29. Note that by proper use of the current context node, the various pieces of animal and covers using that animal are brought together cleanly and simply.

We now have a new document which looks just like what we want to display, and we'll do that in lines 32 and 33. The quiet mode is again enforced, although it hasn't changed since line 7, but I consider this just some defensive programming on my part. Line 33 dumps the XML text to standard output in a nice indented fashion, by default.
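The same output document can also be built with XML::LibXML calls instead of xsh commands. The sketch below assumes a %cover hash shaped like the one built earlier, filled here with a few hypothetical sample entries; it is meant only to show what the xsh insert and ls steps correspond to.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample of the hash built during scraping.
my %cover = (
    'Llama'          => ['Learning Perl'],
    'Llama & camel'  => ['Perl Pocket Reference'],
    'Locking pliers' => ['Google Pocket Guide', 'Google Hacks'],
);

my $out  = XML::LibXML::Document->new('1.0', 'utf-8');
my $root = $out->createElement('root');
$out->setDocumentElement($root);

for my $animal (sort keys %cover) {
    my $cover_node = $root->appendChild($out->createElement('cover'));
    # appendTextChild creates the child element and its text in one step;
    # special characters such as "&" are escaped when the tree is serialized.
    $cover_node->appendTextChild(animal => $animal);
    $cover_node->appendTextChild(book => $_) for sort @{ $cover{$animal} };
}

my $xml = $out->toString(1);    # 1 asks for indented output
print $xml;
```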

As more data shows up on the web both in HTML and XML forms, I can see how this kind of scripting will be helpful to me. Of course, for specialized XML such as RSS or SOAP, other modules will do the job with fewer steps, but nothing stops me from using those modules in xsh programs as well. And xsh also connects with XML::LibXSLT for XSLT processing. Could xsh be the next ASP-like language? Perhaps, with a little more work on caching the parsed tree. Until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl
        =2=     
        =3=     use XML::XSH;
        =4=     xsh <<'END_XSH';
        =5=     
        =6=     recovering 1; # for broken entity recovery (a frequent HTML problem)
        =7=     quiet; # avoid tracing of open
        =8=     open HTML animals = "http://www.oreilly.com/animals.html";
        =9=     foreach {1..2} {
        =10=      foreach //table[not(.//table)
        =11=                      and contains(tr[1]/td[$__], "Book Title")
        =12=                     ]/tr[position() > 1] {
        =13=      # pwd;
        =14=      $cover = string(td[last()]);
        =15=      $subject = string(td[last() - 1]);
        =16=      eval { push @{$cover{$cover}}, $subject; }
        =17=      }
        =18=    }
        =19=    create t1 root;
        =20=    foreach {sort keys %cover} {
        =21=      ## print "animal $__";
        =22=      insert element cover into /root;
        =23=      cd /root/cover[last()];
        =24=      insert element animal into .;
        =25=      insert text $__ into animal;
        =26=      foreach {sort @{$cover{$__}}} {
        =27=        ## print "book $__";
        =28=        insert element book into .;
        =29=        insert text $__ into book[last()];
        =30=      }
        =31=    }
        =32=    quiet; # avoid final message from ls
        =33=    ls /;
        =34=    END_XSH

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.