Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Web Techniques Column 51 (Jul 2000)

[suggested title: Simple XML processing and queries]

The buzz is still abuzz about XML. You've probably seen XML about 47 times in this issue before you got to my column, unless my column was the first one to which you turn! XML processing in Perl is a breeze, but it's also a rapidly evolving technology, so I decided to tackle a simple problem in an unexpected way to show off some of the strategies.

The known Perl Mongers Perl user groups are all registered with ``the mother ship'' at the www.pm.org website, somewhere near http://www.pm.org/groups.shtml if they haven't moved it around by the time you read this. The pages there give a nice textual description, but they don't let you search for groups very easily.

Using HTML::Parser, you could scan through the output of the pages, looking for the words you want, but there's a better solution. The creators of this list (probably brian d foy, if I recall correctly) provide not only an HTML version meant for rendering in a browser, but also an XML version coded for machine use. And that's exactly what we need!

Here's a slightly-mangled sample of some of the XML data from that link:

  <?xml version="1.0" standalone="no"?>
  <!DOCTYPE perl-mongers PUBLIC "-//Perl Mongers//DTD for Perl user groups//EN" 
        "http://www.pm.org/XML/perl_mongers.dtd">
  <!-- This file is maintained by brian d foy, brian@smithrenaud.com -->
  <perl-mongers>
        <group id="0">
                <!-- new york was the first group -->
                <name>NY.pm</name>
                <location>
                        <city>New York City</city>
                        <state>New York</state>
                        <region>Center of the Universe</region>
                        <country>United States of America</country>
                        <continent>North America</continent>
                        <longitude>-73.94384</longitude>
                        <latitude>40.66980</latitude>
                </location>
                <email type="group">nypm@ny.pm.Borg</email>
                <tsar>
                        <name>David H. Adler</name>
                        <email type="personal">dha@panix.comm</email>
                </tsar>
                <web>http://ny.pm.Borg</web>
                <mailing_list>
                        <name>General NY.pm discussion</name>
                        <email type="list">ny@lists.panix.comm</email>
                        <email type="list_admin">majordomo@lists.pm.Borg</email>
                        <subscribe>subscribe ny email_address</subscribe>
                        <unsubscribe>unsubscribe ny email_address</unsubscribe>
                </mailing_list>
                <mailing_list>
                        <name>NYC Perl Jobs</name>
                        <email type="list">nyc-perl-jobs@perl.Borg</email>
                        <email type="list_admin">majordomo@perl.Borg</email>
                        <subscribe>subscribe nyc-perl-jobs email_address</subscribe>
                        <unsubscribe>unsubscribe nyc-perl-jobs email_address</unsubscribe>
                </mailing_list>
                <date type="inception">19970827</date>
        </group>
  <!-- OTHER GROUPS WOULD BE HERE -->
  </perl-mongers>

Note that we can find a lot of information, all tagged appropriately. For this application, let's fetch the data, and extract the name of the group, the city/state/country location, and the contact info for the ``tsar''. For this, we'll use XML::Parser.

But then how should we search the data rows? If this were in a traditional relational database, we'd use a DBI interface to spew SQL queries at the database, with various wildcards and AND and OR options to make things interesting. But we don't have a database.

Well, we can have a ``lightweight'' database using the new DBD::RAM module. This module makes an ``in-core'' database that interfaces directly with DBI, including supporting full SQL queries!

So, the basic strategy is to put up a form asking for SQL search patterns for each of the columns, then when the form is submitted, use LWP to fetch the data, XML::Parser to parse the data and insert it into a DBD::RAM database via DBI, then use SQL queries constructed from the form fields to generate a response. And that's pretty much what the program does in [listing one, below].

Line 1 turns on taint checking, and warnings. Taint checking is my mandatory switch for all CGI-deployed programs, to keep me from shooting myself in the foot accidentally. (I've gotta want to shoot myself in the foot deliberately, apparently.) And warnings are handy to catch other stupid stuff, although once I deploy this script into production, I'll turn the warnings off.

Line 2 enables the compiler restrictions, preventing accidental use of mistyped variable names, soft references, and barewords. I turn this on when a program meets my ``ten-ten'' rule: if the program is longer than 10 lines, or I'm going to use it for more than 10 minutes.

Line 3 enables immediate output flushing on each print, handy for CGI scripts that intermingle print-style output with calls to system. I'm not doing that here, but I habitually put this in every CGI program anyway, just in case I add that stuff later.

Line 5 brings in Lincoln Stein's wonderful CGI module, along with all the form-accessing and HTML-generation shortcuts. Again, don't reinvent the wheel, when Lincoln has done such a good job (err, but not at reinventing the wheel, I mean).

Lines 7 through 24 set up some configuration variables. I needed some common values for both the form generation and handling the form response, so I put those here. This program is not one of my better designed programs (I threw it together in about an hour), so perhaps some of this could have been factored out a bit, but that's the way it is.

Lines 10 through 18 create a mapping from an SQL column name to a hierarchical place in the XML data tree. The first word on each line is the SQL column name (also used as the descriptive name in the form and the response). The remaining words represent the ``path'' to the interesting data item. The top level of each path must be perl-mongers, because that's the name of the outermost element. Again, I could probably have factored that out, but it would have made the code harder to follow. Note that Name and Tsar_name both come from a name element in the XML data, so only their relative tree positions let us distinguish the two. More on this table later. The code parses the here-string into an array of arrayrefs, such that $column_def[3] will be an array reference to an array containing Country and "perl-mongers group location country".

Lines 21 and 22 create derived data from @column_def: the column names, and a hash mapping the XML location to a column name.
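
If that map-over-a-here-string idiom looks dense, here's a minimal stand-alone sketch of the same parsing and reductions. In list context the /(.+)/g match against the here-string returns one string per non-empty line; in this sketch I've written a couple of those lines out as an explicit @lines list just to keep things short:

  use strict;

  # two of the definition lines, as plain strings (just for this sketch)
  my @lines = (
    'Name       perl-mongers group name',
    'City       perl-mongers group location city',
  );

  # same idiom as line 10: for each line, capture the first word and the rest
  my @column_def = map [ m{^(\S+)\s+(.*)} ], @lines;
  # @column_def is ( ['Name', 'perl-mongers group name'],
  #                  ['City', 'perl-mongers group location city'] )

  # the reductions from lines 21 and 22
  my @columns        = map $_->[0], @column_def;                # column names
  my %column_mapping = map { $_->[1] => $_->[0] } @column_def;  # XML path => column

  print $column_mapping{'perl-mongers group location city'}, "\n";   # City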

Lines 26 to 42 dump out the CGI script response, both on the initial invocation, and the subsequent handling of the filled-out form. Because I'm illustrating back-end technology here, I've made the form very simple and structural. In a real live program, I'd have gone to some work to design a nice user interface.

The form consists primarily of the table generated in lines 32 through 38. For each SQL column name, we'll get a table row with the name on the left and a text input field of the same name on the right. The default value for each input field is %, which will mean ``anything goes'' to SQL in the response unless altered. This is needed because we are AND-ing together the various conditions, and we want unneeded conditions to be effectively ignored.
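
To see why the % default works, here's roughly what the handler will build later (shown with an abbreviated column list, just for illustration). A column LIKE '%' test matches any value (though, strictly speaking, a NULL in that column could still drop a row, depending on the SQL engine), so fields left at the default don't really constrain the search:

  use strict;

  my @columns = qw(Name City State Country);   # abbreviated for this sketch

  my $where = join " AND ", map "$_ LIKE ?", @columns;
  print "$where\n";
  # Name LIKE ? AND City LIKE ? AND State LIKE ? AND Country LIKE ?

  # if only City was changed (say, to 'New%'), execute() gets the values
  # ('%', 'New%', '%', '%'), and only the City test narrows the result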

Line 44 calls the (big!) handle_form subroutine if we have been given any form-response parameters. And line 46 wraps up the main output.

The rest of the program, beginning in line 48, takes care of the response. Up to this point, it's just been a boring CGI script, so let's get some real action.

Lines 49 through 51 pull in the modules that we need. Note the use of a require rather than a use, because we don't want to load these modules if we're just printing out the form. These all come from the CPAN: XML::Parser to parse the XML (duh!), DBI to handle the SQL queries, and LWP::Simple (part of the massive LWP distribution) to fetch the XML page from the Perl Mongers site.

Line 53 creates a DBD::RAM object: essentially, a nearly full-featured database living entirely in memory, compliant with the DBI specification. The RaiseError option is set, letting me get away with not putting any error checking on the rest of the calls.

Lines 54 through 56 create the data table. The columns are all of SQL type TEXT, general enough for what we're doing. If there had been any errors here, DBI would have die'd, so no need to check anything. The name table1 is arbitrary, but needs to be consistent throughout this program.

Lines 57 through 60 set up a ``DBI statement handle'', precompiling an SQL statement that inserts a row into the table. The SQL contains placeholders (question marks) that will be ``bound'' to the actual data when used.
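
If you haven't met DBD::RAM before, here's a minimal stand-alone sketch of the same connect/create/prepare/execute steps, assuming DBD::RAM is installed; the table name demo and its two columns are made up for the sketch:

  use strict;
  use DBI;       # DBD::RAM must also be installed

  # an in-memory database; RaiseError makes DBI die on any failure,
  # so no per-call error checking is needed
  my $dbh = DBI->connect('dbi:RAM:', undef, undef, { RaiseError => 1 });

  $dbh->do("CREATE TABLE demo (Name TEXT, City TEXT)");

  # placeholders: the ?s get bound to real values at execute time,
  # and DBI takes care of any quoting for us
  my $insert = $dbh->prepare("INSERT INTO demo (Name, City) VALUES (?, ?)");
  $insert->execute('NY.pm', 'New York City');

  my $query = $dbh->prepare("SELECT Name FROM demo WHERE City LIKE ?");
  $query->execute('New York%');
  while (my ($name) = $query->fetchrow_array) {
    print "$name\n";            # NY.pm
  }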

Line 62 fetches the current XML data from the www.pm.org website. Note that if this were a real application instead of a toy, I'd provide a first level of caching right here. There's no point in re-fetching the data on every hit when it changes probably only once every few weeks. Look into the LWP::Simple::mirror subroutine, for example. Note that we have to use the full path to the get subroutine because we pulled LWP::Simple in with require and not use, so no importing was done.
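
Here's a minimal sketch of that caching idea using LWP::Simple::mirror, which sends an If-Modified-Since header and downloads only when the server's copy is newer than ours; the cache filename is just an example:

  use strict;
  use LWP::Simple qw(mirror);

  my $url   = "http://www.pm.org/XML/perl_mongers.xml";
  my $cache = "/tmp/perl_mongers.xml";          # example cache location

  # mirror() returns the HTTP status: 200 if fetched, 304 if our copy is current
  my $status = mirror($url, $cache);
  die "mirror failed with status $status\n"
    unless $status == 200 or $status == 304;

  # read the (possibly freshly updated) local copy
  open my $fh, '<', $cache or die "open $cache: $!";
  my $data = do { local $/; <$fh> };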

Line 63 parses the data using XML::Parser. The doParse subroutine it calls is defined beginning in line 86, which we'll see in a moment. The result is that the database for which $insert was prepared now has many rows populated with the appropriate data. So it's time to query that data.

Lines 65 to 70 prepare an SQL query on the database, looking for all the columns that meet the conditions. Again, placeholders are used to stand in for the actual data. In this way, we don't have to worry about SQL-quoting the data, because DBI takes care of that for us.

Line 72 executes the query, binding the query parameters to the actual form values. The unchanging order of @columns ensures that everything lines up. If a parameter is not provided, we'll treat it like the ``anything goes'' wildcard of %, as before. This lets someone create a link like:

        http://www.stonehenge.comm/cgi/perlmonger_search?City=New+York

without having to specify all the other entries as %.

And finally, the output, generated as a table in lines 74 to 76. The column names are used as table headers, with the fetchall_arrayref method returning an arrayref to an array of arrayrefs of the data (essentially a two-dimensional table, in the typical Perl fashion). Note here that we're using the cool feature (thanks, Lincoln!) of CGI.pm to generate multiple td cells when given an arrayref instead of a simple scalar. That saves me one level of map-ing that would otherwise have been required.
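
In isolation, that distributive shortcut looks roughly like this (assuming CGI.pm's HTML-generation functions are imported, as they are at the top of the listing; the sample values are lifted from the XML above):

  use strict;
  use CGI qw(:all);

  # given an array reference, td() emits one cell per element
  print td(['NY.pm', 'New York City', 'United States of America']), "\n";
  # roughly: <td>NY.pm</td> <td>New York City</td> <td>United States of America</td>

  # fetchall_arrayref returns a reference to an array of row arrayrefs,
  # so each row can become a table row with: map Tr(td($_)), @$rows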

Beginning in line 79, we have a set of related subroutines to enable the proper interface to XML::Parser. Line 80 puts us into a separate package (MyParser) so that the subroutine callback names are distinct from the rest of the program.

Lines 82 through 84 define data that is needed by all the subroutines in this section. This data will be private to the subroutines. The @state variable is a list of the XML elements we are ``in'' during the parse, all the way back to the root. %one_group_data is the information for a particular PerlMonger group. And $insert_handle is the DBI insert statement handle, passed in during the initial invocation.

The initial access into this collection of subroutines is via the doParse subroutine beginning in line 86. The XML data ends up in $data in line 87, and the DBI insert statement handle is extracted in line 88.

Line 90 sets up and calls an XML::Parser object, configured to execute subroutines in the same package as this call with particular names (the ``callbacks''). And this object is told immediately to get to work, parsing the XML provided in $data. This results in a number of calls to StartTag, EndTag, and Text, defined in the following lines.
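
Here's a tiny self-contained sketch of that Stream-style setup, with the handlers living in the package that created the parser and the same ``@state'' trick for remembering where we are; the document being parsed is made up:

  use strict;
  use XML::Parser;

  my @state;    # stack of the elements we're currently inside

  sub StartTag { my ($expat, $type) = @_; push @state, $type }
  sub EndTag   { my ($expat, $type) = @_; pop @state }
  sub Text     {
    # Stream style puts the text content in $_
    print qq{saw "$_" at "@state"\n} if /\S/;
  }

  # Stream style looks for StartTag/EndTag/Text in the package that called new
  my $xml = '<group><name>NY.pm</name><web>http://ny.pm.Borg</web></group>';
  XML::Parser->new(Style => 'Stream')->parse($xml);
  # saw "NY.pm" at "group name"
  # saw "http://ny.pm.Borg" at "group web"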

Lines 93 to 99 define StartTag, which will be called with (at least) two parameters each time the XML parser object detects the beginning of an element. The $type parameter is the most important to us: it's the name of the element. Line 95 pushes this element name onto the stack.

Lines 96 to 98 check the current state by concatenating all the array elements into a double-quoted string, which automatically adds spaces between the elements. Because a space cannot appear within an element name, this is a quick way to get our unique place in the parse. If this happens to be at the beginning of a particular PerlMonger group, we re-initialize the data for this new group to empty.

Similarly, lines 101 to 107 are called for the end of each element. Again, if we're at the end of a particular PerlMonger group, then we execute the statement handle, binding the parameters to the hash values we've seen. If the value has not been defined, it's passed as undef which looks like a NULL in the SQL table. Although the $type name is passed to this subroutine, we can be assured that we're always popping the element that we pushed before, because the XML parser would abort rather than continue to call the callbacks if the XML was invalidly nested.
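
The hash slice on line 104 deserves a second look: it hands back the values in exactly the order of @columns, the same order the INSERT placeholders were prepared in, and any column the parser never saw simply comes out as undef. A minimal illustration with made-up data and an abbreviated column list:

  use strict;

  my @columns        = qw(Name City Tsar_email);     # abbreviated
  my %one_group_data = (Name => 'NY.pm', City => 'New York City');

  # slice values come back in the order of the keys requested;
  # Tsar_email was never set, so its slot is undef
  my @row = @one_group_data{@columns};
  # @row is ('NY.pm', 'New York City', undef)

  # handing @row to $insert->execute() makes DBI bind the undef as a SQL NULL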

Lines 109 to 114 define Text, called whenever content is detected. Line 110 computes the SQL column for this particular XML element nesting state, and if it's something we've declared interest in, we append the data to what we've accumulated for that column so far.

And that's it! You can drop this program into your CGI bin as-is and play with it, but beware, it's a CPU hog because we're not caching the fetch of the XML page from the website, nor are we saving the results of having parsed that data anywhere.

In a real application, the two phases of fetching and parsing into the database, and executing a query against that database, would probably be separated into two programs. Or, this program can be placed into a ``mini-web server'' using the strategies I detailed in this column in March and April of 1998, allowing the results of creating the database to be used across many hits until the hits are no longer frequent.

But as you can see, with minimal fuss, I've got a way to scan through the data, perhaps in ways the original authors did not intend. So, until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl -Tw
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use CGI qw(:all);
        =6=     
        =7=     ### globals
        =8=     
        =9=     ## mapping from SQL column name to XML state
        =10=    my @column_def = map [ m{^(\S+)\s+(.*)} ], <<'END_DEFINITION' =~ /(.+)/g;
        =11=    Name       perl-mongers group name
        =12=    City       perl-mongers group location city
        =13=    State      perl-mongers group location state
        =14=    Country    perl-mongers group location country
        =15=    Tsar_name  perl-mongers group tsar name
        =16=    Tsar_email perl-mongers group tsar email
        =17=    Web        perl-mongers group web
        =18=    END_DEFINITION
        =19=    
        =20=    ## reductions of data above
        =21=    my @columns = map $_->[0], @column_def;
        =22=    my %column_mapping = map { $_->[1] => $_->[0] } @column_def;
        =23=    
        =24=    ## end globals
        =25=    
        =26=    print
        =27=      header,
        =28=      start_html("Search the mongers info"),
        =29=      h1("Search the mongers info"),
        =30=      hr,
        =31=      start_form,
        =32=      table({ Border => 1, Cellspacing => 0, Cellpadding => 2 },
        =33=            (map {
        =34=              Tr(th($_), td(textfield(-Name => $_, -Default => '%')))
        =35=            } @columns),
        =36=            Tr(td({ Colspan => 2},
        =37=                  p("Use % for any chars, and ? for a single char"))),
        =38=           ),
        =39=      reset,
        =40=      submit,
        =41=      end_form,
        =42=      hr;
        =43=    
        =44=    &handle_form() if param();
        =45=    
        =46=    print end_html;
        =47=    
        =48=    sub handle_form {
        =49=      require XML::Parser;
        =50=      require DBI;
        =51=      require LWP::Simple;
        =52=    
        =53=      my $dbh = DBI->connect('dbi:RAM:', undef, undef, {RaiseError => 1});
        =54=      $dbh->do("CREATE TABLE table1 (".
        =55=               (join ", ", map "$_ TEXT", @columns).
        =56=               ")");
        =57=      my $insert = $dbh->prepare("INSERT INTO table1 (".
        =58=                                 (join ", ", @columns).
        =59=                                 ") VALUES (".
        =60=                                 (join ",", ("?") x @columns).")");
        =61=    
        =62=      my $data = LWP::Simple::get("http://www.pm.org/XML/perl_mongers.xml");
        =63=      MyParser::doParse($data, $insert);
        =64=      
        =65=      my $extract = $dbh->prepare("SELECT ".
        =66=                                  (join ", ", @columns).
        =67=                                  " FROM table1".
        =68=                                  " WHERE ".
        =69=                                  (join " AND ", map "$_ LIKE ?", @columns).
        =70=                                  " ORDER BY Name");
        =71=    
        =72=      $extract->execute(map {defined param($_) ? param($_) : "%"} @columns);
        =73=    
        =74=      print table({Border => 1, Cellspacing => 0, Cellpadding => 2},
        =75=                  Tr(th(\@columns)),
        =76=                  map Tr(td($_)), @{$extract->fetchall_arrayref});
        =77=    }
        =78=    
        =79=    BEGIN {
        =80=      package MyParser;
        =81=    
        =82=      my @state;
        =83=      my %one_group_data;
        =84=      my $insert_handle;
        =85=    
        =86=      sub doParse {
        =87=        my $data = shift;
        =88=        $insert_handle = shift;     ## outer scope
        =89=    
        =90=        XML::Parser->new(Style => 'Stream')->parse($data);
        =91=      }
        =92=    
        =93=      sub StartTag {
        =94=        my ($parser, $type) = @_;
        =95=        push @state, $type;
        =96=        if ("@state" eq "perl-mongers group") {
        =97=          %one_group_data = ();
        =98=        }
        =99=      }
        =100=   
        =101=     sub EndTag {
        =102=       my ($parser, $type) = @_;
        =103=       if ("@state" eq "perl-mongers group") {
        =104=         $insert_handle->execute(@one_group_data{@columns});
        =105=       }
        =106=       pop @state;
        =107=     }
        =108=     
        =109=     sub Text {
        =110=       my $place = $column_mapping{"@state"};
        =111=       if (defined $place) {
        =112=         $one_group_data{$place} .= $_;
        =113=       }
        =114=     }
        =115=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.