Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Linux Magazine Column 55 (Jan 2004)
[suggested title: ``Using xsh to scrape web pages'']
One activity I find myself frequently attempting is extracting bits of useful information from existing web pages that change over some time period. In an ideal world, everything I would want would be provided via some RSS feed or ``wholesale'' SOAP web service, but in the world I still live in, I usually end up parsing the ``retail'' HTML intended for browser views.
Although HTML isn't XML, they both have common roots, and I've been
experimenting lately with using XML::LibXML to parse HTML. (See
this column, June 2003, for example.) The advantage of using an XML
parser to handle the angle-bracketed text is that once the text is
parsed, we can use DOM and XPath operations on the result, sometimes
resulting in greater speed or flexibility than the traditional data
structures built using HTML::Parser.
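To make that concrete, here's a minimal Perl sketch of that approach (the file name is hypothetical): parse the HTML with XML::LibXML, then query the result with XPath instead of handler callbacks.

use XML::LibXML;
my $parser = XML::LibXML->new;
$parser->recover(1);    # keep going past real-world HTML breakage
my $doc = $parser->parse_html_file("page.html");    # hypothetical local file
for my $link ($doc->findnodes('//a[@href]')) {
  print $link->getAttribute('href'), "\n";    # XPath query, not a parse handler
}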
Recently, I've started playing with the evolving xsh language,
which can be briefly described as an XML manipulation shell. xsh
uses XML::LibXML to parse an XML file into an internal document
structure. Once the document is built, we can manipulate it using a
mix of Perl syntax and other control structures specifically designed
for navigating a document tree. Following the Unix filesystem
metaphor, we can use the xsh commands ``cd'' and ``pwd'' much like their
Unix counterparts, where the ``current directory'' is a document node of
interest.
Many operations in xsh
are specified using XPath, so a working
knowledge of XPath is very helpful. For example,
cd /;
focuses the ``current node'' at the top of the document tree, while:
pwd;
shows that we are currently located there. To get ``inside'' the root node of the document, we can use:
cd /*;
which finds the one matching node and sets that as the current focus, which is undone by:
cd ..;
The metaphor works nicely because XPath's notation mimics the Unix filesystem for many common operations.
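For instance, the parallels line up almost one-for-one (a short sketch; the element names assume an HTML document):

cd /html/body;   # descend into the body element, like cd into a subdirectory
cd ..;           # back up to the parent node
ls .;            # dump the current node's content, much like ls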
Once we have a node of interest, we can display its location using locate, as in:
locate .; # same as "pwd;"
but I use the locate
command more often to print all nodes that
match a given XPath expression, as in:
locate //a[@href];
which finds and shows the path to all a
nodes that have an href
attribute. To display the href
directly, I can use a similar expression:
locate //a/@href;
which sets the context to the attribute itself. If I merely wanted a count, I could replace this with:
count //a/@href;
Like other traditional shells, xsh can be used both as a
scripting language and interactively. I've
found that the best way to write an xsh
program is to have an
editor open on my emerging program, and another window running an
interactive session on a sample document similar to the one I'll
eventually be parsing. I can start this with:
xsh -I path/to/some/file.xml
and I get the xsh
prompt. In interactive mode, an implied
semicolon at the end of each line makes entering statements easier.
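For example, at the prompt I can type commands like these, leaving off the trailing semicolons I'd need in a script (a hypothetical session against an HTML document):

cd /html/body
count //table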
The real power of xsh
is that it can be intertwined with Perl code
in your script. At any point, you can invoke Perl with:
eval { ... block of perl code here ... };
The scalar variables are shared between the Perl code and the xsh
code, simplifying the integration. And, you can call back to xsh
from inside the Perl code using the xsh()
function.
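Here's a minimal sketch of that round trip: an xsh assignment, a Perl block that sees the shared scalar, and a call back into xsh (the variable name is my own):

$n = count(//a[@href]);
eval {
  print "found $n links\n";
  xsh('locate //a[@href];') if $n > 0;
};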
While I won't have room to teach all of xsh
in this article, I
suggest you surf on over to xsh.sourceforge.net
for further
information. In the meanwhile, let me introduce xsh
a little more
by way of an example.
I noticed the other day that http://www.oreilly.com/animals.html
has a list of the O'Reilly ``animal'' covers, organized by cover title,
but as I was scanning through the list, I noticed that a few of the
animals were used for more than one title. I was curious about how
many animals were reused, so I decided to write a program to extract
the information. I started by invoking an xsh
shell, entering:
open HTML a = http://www.oreilly.com/animals.html
The HTML flag here tells xsh to use the HTML-string-parsing
interfaces, rather than the XML-string-parsing interfaces.
Additionally, because XML::LibXML uses Gnome's libxml2, we don't
need to use LWP or some external program to fetch the URL.
Once I had the document in memory, I used some simple XPath queries to determine the structure of the web page. For example, I found all tables that weren't nested (didn't contain another table) with:
locate //table[not(.//table)]
Certainly I could have stared at the raw HTML (or even a prettied
version) for a long time to find the same information. With xsh, I
was simply ``exploring a document'' using XPath.
After a bit of experimentation, I ended up with the program shown in
[listing one, below]. Note that the entire program consists of a
use statement in line 3, followed by a call to xsh() in line 4,
passing a here-doc string of the remaining text lines. I really wanted to do
something like:
#!/usr/bin/env xsh
... xsh script here ...
but unfortunately, the version of xsh
as I write this requires a
-l
flag to load a script. I'm told that a future version of xsh
will work as needed.
The xsh
script starts in line 6, with a command to enable
recovering
mode. Even though XML::LibXML
deals relatively well
with HTML, many web pages (including the one we're parsing) contain
broken entity references. A hint to web page programmers: the text
<a href="/some/place?fred=flintstone&barney=rubble"> click here!</a>
is broken. You need to escape that ampersand as &amp;. Just
because nearly every browser error-corrects for this is no excuse to
write bad HTML!
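The corrected version escapes the ampersand:

<a href="/some/place?fred=flintstone&amp;barney=rubble"> click here!</a>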
Line 7 turns on ``quiet'' mode, which prevents the open
in line 8
from announcing its success.
An xsh
script can have many documents open at once. XPath
expressions can refer to nodes in other documents by prefixing the
document name and a colon in front of the traditional XPath
expression.
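For example, with two documents open, nodes in either one can be addressed from the same session (the second file name and its element names are hypothetical):

open HTML a = "http://www.oreilly.com/animals.html";
open b = "books.xml";
locate b:/catalog/book;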
Lines 9 through 18 form a two-level nested foreach loop structure.
The foreach
beginning in line 9 puts a traditional Perl expression
inside curly-braces. Each iteration of this resulting value will be
placed into $__
(yes, with two underscores for reasons I don't
completely understand).
The inner foreach loop uses an XPath expression to define a list of
nodes. The ``current node'' is set to each matching node, and the block
of code is then executed. Note that we're looking for all tables that
don't contain a nested table, and which have a first row whose
first or second table cell contains Book Title. The value of
$__ is interpolated directly from the variable set in the outer
loop. If I were a bit more clever, I might have been able to do
without the nested loops, but I didn't care at this point, since the
program worked. The final part of the XPath expression finds all
table rows after the first row, which is where the real data is found.
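Stripped of the details, the two foreach flavors in those lines look like this (a schematic sketch, not a runnable program):

foreach {1..2} {          # Perl expression: each value lands in $__
  foreach //table/tr {    # XPath expression: each node becomes the current node
    # ... work with the current row here ...
  }
}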
Line 13 contains a debugging step... I wanted to see where these rows
were actually found as I was developing the program. The xsh
script can include Perl-style pound-sign comments, so this is
commented out.
Line 14 assigns the string value of the last table cell in the row
currently being examined to a scalar $cover. This variable is
visible both to further xsh steps and to included Perl code. I
observed that the last cell always contained the animal (or other)
cover, hence the capture. Similarly, $subject is set in line 15 to
be the string value of the penultimate table cell. The values are
automatically de-entitized, so I end up with a plain string here.
Line 16 breaks out into Perl to access a traditional Perl hash named
%cover. The keys are the cover animals, while the corresponding
values are array references listing all books with that particular
animal.
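For example, based on the sample output shown later, one entry ends up shaped like this (illustrative values only, written as Perl):

$cover{"Locking pliers"} = ["Google Hacks", "Google Pocket Guide"];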
Note the ease with which Perl and xsh code co-exist to produce the
result. And, while this could have been written using a more
traditional straight invocation of XML::LibXML, I think we're ahead
by about five lines of code already in the first 15 lines here.
Now for the fun part. I want to create a new XML output that looks something like this:
<?xml version="1.0" encoding="utf-8"?>
<root>
  ...
  <cover>
    <animal>Lions</animal>
    <book>Java &amp; XML</book>
  </cover>
  <cover>
    <animal>Llama</animal>
    <book>Learning Perl</book>
  </cover>
  <cover>
    <animal>Llama &amp; camel</animal>
    <book>Perl Pocket Reference</book>
  </cover>
  <cover>
    <animal>Locking pliers</animal>
    <book>Google Hacks</book>
    <book>Google Pocket Guide</book>
  </cover>
  ...
</root>
I can do this by walking through the newly created hash and using
traditional print operations, but it's more fun to just use xsh.
Line 19 creates a new document t1 and gives it a root node of root.
Line 20 uses a Perl-style foreach expression to get the sorted keys
of %cover. Note that these animals will be in $__, not $_,
and I traced this in line 21 while I was debugging the program.
Line 22 adds a new cover
element at the end of the root
element.
These new nodes are always added last, and line 23 moves our current
focus inside this new element.
Lines 24 and 25 create the animal node within the most recent
cover node. The value of $__ is automatically re-entitized to
be valid XML.
Lines 26 through 30 walk through the titles for the given animal
cover, again using a Perl-style foreach loop. The book titles
appear in $__, traced in line 27 during debugging.
Each new book
element is created at the end of the current node in
line 28, and the title text is inserted into this node in line 29.
Note that by proper use of the current context node, the various
pieces of animal and covers using that animal are brought together
cleanly and simply.
We now have a new document which looks just like what we want to display, and we'll do that in lines 32 and 33. The quiet mode is set again, even though it hasn't changed since line 7; I consider this just some defensive programming on my part. Line 33 dumps the XML text to standard output in a nice indented fashion, by default.
As more data shows up on the web both in HTML and XML forms, I can see
how this kind of scripting will be helpful to me. Of course, for
specialized XML such as RSS or SOAP, other modules will do the job
with fewer steps, but nothing stops me from using those modules in
xsh
programs as well. And xsh
also connects with XML::LibXSLT
for XSLT processing. Could xsh
be the next ASP-like language?
Perhaps, with a little more work on caching the parsed tree. Until
next time, enjoy!
Listings
=1=     #!/usr/bin/perl
=2=
=3=     use XML::XSH;
=4=     xsh <<'END_XSH';
=5=
=6=     recovering 1;           # for broken entity recovery (a frequent HTML problem)
=7=     quiet;                  # avoid tracing of open
=8=     open HTML animals = "http://www.oreilly.com/animals.html";
=9=     foreach {1..2} {
=10=      foreach //table[not(.//table)
=11=          and contains(tr[1]/td[$__], "Book Title")
=12=          ]/tr[position() > 1] {
=13=        # pwd;
=14=        $cover = string(td[last()]);
=15=        $subject = string(td[last() - 1]);
=16=        eval { push @{$cover{$cover}}, $subject; }
=17=      }
=18=    }
=19=    create t1 root;
=20=    foreach {sort keys %cover} {
=21=      ## print "animal $__";
=22=      insert element cover into /root;
=23=      cd /root/cover[last()];
=24=      insert element animal into .;
=25=      insert text $__ into animal;
=26=      foreach {sort @{$cover{$__}}} {
=27=        ## print "book $__";
=28=        insert element book into .;
=29=        insert text $__ into book[last()];
=30=      }
=31=    }
=32=    quiet;                  # avoid final message from ls
=33=    ls /;
=34=    END_XSH