Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Download this listing!

Web Techniques Column 30 (Oct 1998)

The CGI.pm module, created by fellow columnist Lincoln Stein, is the most commonly used method to deal with all the ins, outs, and roundabouts regarding the Common Gateway Interface. One of the nice features of this CGI module is that there are a number of HTML-creation subroutines. I personally find it easier to type

  h1("Welcome to my Fred & Barney page!")

than

  <h1>Welcome to my Fred &amp; Barney page!</h1>

Especially when the content or the parameters come from various variables. Now, if I'm constructing a new form generator, and form response program from scratch, I just start typing in the right HTML-creation routines, and I'm done.

But, let's say I had started with a static HTML form page, which had originally referenced a CGI program (perhaps something with the CGI module, but not necessarily), and now I wanted to roll it up into the ``same program generates the form and handles the response'' method. I've got all this nice HTML which should be the output of the HTML-creation routines. What I want are the routines that could have made that output.

Perl to the rescue! All we have to do is parse the HTML (easy enough with some of the existing LWP library components), and dump the parsed structure in a way that's compatible with the CGI module's routines. Once I thought of it, it took me just an hour or so to get the basic structure right (and then I tweaked it for two more hours, and that's what you're seeing here).

As an aside, there's a similar program in the CPAN HTML::Stream module, but the output requires the use of that module's HTML generators. They're fundamentally incompatible with the way that the CGI module HTML-generators want to be called, so that's not a viable solution in this case.

In order to write this code quickly, I violated a few supported boundaries. In other words, I cheated by looking at the source code for CGI.pm and the HTML::TreeWalker module, rather than deriving everything directly from the published interfaces in the documentation. I thought long and hard before I violated this interface boundary, but there seemed no other way at the time. This makes me slightly uncomfortable, because later versions of these modules may very well break this code. But that's true anyway.

The result of my little hacking experiment is provided in [listing one, below]. Please remember that this code is proof of concept only, and is not intended for robust production work. If you want that, you'll have to shake out all the bugs yourself.

Lines 1 through 3 start nearly every program I write, turning on taint checks (not much point here, but whatever), warnings, compile-time restricts, and disabling the buffering of standard output.

Line 5 pulls in the HTML::TreeBuilder module, found in the LWP library. Similarly, line 6 pulls in the CGI module, found in the standard Perl distribution. Here, -no_debug disables the CGI module's normal ``offline debugging mode'', not useful because we're not really using the module to handle CGI operations. (We just want a definition for CGI::expand_tags.)

Line 8 sets up the @ARGV array to include a single dash (meaning standard input) if @ARGV had been empty. That way, we can always get something sensible from using shift on @ARGV.

Line 10 creates the %SEEN hash. The keys of this hash will end up as parameters to the use CGI ... pragma generated for the translated Perl program. At a minimum, we'll need the three parameters of header, comment, and :html. If any tags are seen in the source HTML that are not covered in the standard list of HTML tags for the CGI module, they'll be added to this list.

Line 11 extracts the known list of HTML tags from the CGI module's data, using an internal routine called CGI::expand_tags. I discovered this subroutine by looking at the source for the CGI module, and not by reading the documentation. Hence, this is something that can quite possibly break in a future release. If a tag is present in this list, then we don't need to treat it as a special tag.

Lines 13 to 21 form the main part of the code. Line 14 creates a new HTML::TreeBuilder object, and line 15 causes it to be loaded from the HTML code being read from the first filename on the command line (or standard input if no names were found). Line 16 invokes the traverse method on the resulting tree, causing all the HTML::Element-like nodes to be visited in order. The parameter is a reference to our walker routine, defined later. Line 17 deletes the tree once the data has been collected.

Lines 18 to 20 generate the resulting Perl program on standard output. Line 18 prints the initial use CGI line, including all the parameters formed by looking at the keys of %SEEN. Following that, we're also printing the initial part of the single print statement that forms the remainder of the program.

Line 19 fetches the result of the put() calls (defined later in the program) to dump a single string holding all the nicely indented program text. More on this later. Line 20 prints the final semicolon terminating the big print statement.

That's it for the mainline, and now on to the subroutines below line 23.

Lines 25 through 32 define a small Perl-string-list massager. The arguments to the subroutine are simple scalars with arbitary data. The resulting value is a single string that is legal Perl syntax that would create the same value (without the outer parentheses, which are the responsibility of the caller). For each element, we hexify all the characters that don't trivially stand for themselves in a double-quoted string. (Watch out for $ and @, for example.) The individual elements are then wrapped inside an outer pair of double quotes, and multiple elements are glued into one string with join. This routine has a nice short unconfusing name like S because we're going to call it frequently later.

Lines 34 through 46 form a little utility package that accumulates substrings into a buffer, and later provides that buffer as the resulting value. The buffer named $put_buffer is a static local, available only to these two subroutines (which are in the outer namespace), and having a value that persists throughout the life of the program. Each invocation of put() adds stuff to the buffer, and the final call to getput() both dumps the current value and empties the buffer (permitting multiple sequential uses in the same program).

Lines 48 through 57 define a utility subroutine that knows how to dump a hashref's keys and values in a nice list. The three parameters are the open punctuation, the hashref, and the close punctuation. Line 50 extracts the keys of interest (omitting those that begin with underscore). If this list is non-empty, we'll put() the open punctuation, followed by the key-value pairs (properly S()-escaped) followed by the close punctuation. Note that if the list is empty, we dump nothing (this is handy).

Lines 59 through 98 define the walker() routine, and its associated static data. This is where the bulk of the work gets done, and therefore is the largest part of the program. Line 60 creates a static local variable $head_attrs to hold the results of parsing the HEAD portion of the HTML, so that by the time we get to the BODY, we can dump out the proper start_html directive. (More on this later.)

Line 63 declares the three parameters that the traverse method gives to us on each invocation. The first is a node: an HTML::Element object (or in this case, a subclass). The first parameter might also be a simple scalar, meaning that it's part of the content for the node we're already within. The second is a flag to indicate whether it's a start or end tag. And the final parameter is the depth, increasing by one for each level deeper in the parsed hierarchy.

Line 64 distinguishes the straight text content from the sub-element content. If it's a reference, it's something with structure that we presume is like an HTML::Element. Line 65 extracts the tag name. If it's the top-level html node, we simply return quickly in line 66, with a true value to indicate that this node should be descended recursively.

If it's a head tag, it's time to parse the header portion of the HTML file, using the get_attrs_from subroutine (defined below). Because we parse the head node in a special way, we don't want the traverse method to also descend into the node, so we'll return a false value to select this.

If we make it to line 71, we're looking at an ordinary HTML construct. However, four of the most common HTML constructs conflict with the same-named Perl keywords. We'll fix that by providing an initial uppercase letter, as suggested by the CGI.pm documentation.

Line 72 causes the output from this particular node to be indented appropriate to the depth within the HTML tree. This is really nice as it quickly shows the nesting of the constructs visually.

If the tag is the body tag, we need to generate the correct start_html or end_html construct. If it's the beginning, we'll dump the attributes of the $head_attrs variable, enclosed in parentheses if necessary (in lines 75 through 77). Otherwise, we dump the end_html construct, which never has parameters. The return keeps us from processing the rest of the subroutine.

At this point, we might have special handling for other tags (like folding the start and end of a form into the proper CGI module start_form call), but this barebones program works fine for getting the basic text out. So, for all remaining tags (the majority), we'll first note the tagname in the %SEEN hash, unless it's already a tag known to the CGI module, as noted in the %KNOWN hash (in line 83).

Lines 84 through 87 handle the beginning of a construct, indicated by a true value in $start. In this case, we dump the tag name, an open parenthesis, and the attributes (if any) as a hashref of the first element of the list. Note that in this case, we are relying on the fact that an HTML::Element is a hashref whose keys are the attribute names, again peering uncomfortably behind the scenes.

Lines 88 through 91 handle the matching close parenthesis. We'll need to generate this in either of two circumstances. Either the we're looking at the end tag from traverse (and therefore $start is false), or the tag doesn't have an end tag. To know the second case, I've relied on a datastructure within the HTML::Element package, and this is not a documented interface of this module.

Line 89 fixes a bug that I noticed during development. If the original data had an empty construct, but one that required both a start and end tag, like:

  <H1></H1>

the resulting HTML would look like:

  h1()

which the CGI module prints as merely:

  <H1>

which is very wrong. So, to fix this, for an empty content on a construct that needs a close tag, we punt and insert a single empty space, enough to trick the CGI module into generating the closing tag. Line 90 generates the closing parenthesis, along with a newline character to make the HTML output a little easier to read.

Lines 92 through 94 handle the plain-text content of each construct. Not much to to do there except dump it out in a Perl-program-safe way, indented the right amount.

Lines 95 and 96 handle the wrap-up, sending out a newline character, and returning a true value so that the traverse routine will continue to descend into the HTML parse tree.

Lines 100 through 113 handle the parsing of the head portion of the HTML data. The single parameter is expected to be the top node of the head. Lines 103 through 111 walk through each subcomponent of this structure, looking for items of interest. The only element I've programmed in is the title, which I'm grabbing and storing. Other elements (like base and meta) can be added here to make this routine smarter. For now, these all end up in line 110, so that you can know what you need to add.

Lines 115 through 122 add a method to the HTML::TreeBuilder module. Now again, I'm violating the interface because I'm adding something to someone else's package. Without this method, HTML comments in the source file are completely ignored. Instead, I'm turning them into comment tags, which the CGI module processes nicely back into actual comments. Most of the code here was lifted from the code to generate a simple tag in the HTML::TreeBuilder sourcefile. The two parameters to this routine are the HTML::Element and the text string of the comment.

Well, that's an interestingly long program, but it's sufficient to demonstrate how to parse some arbitrary HTML and generate a completely different representation. If you stick this program into say, html2cgi, you can invoke the program with:

  html2cgi <html >cgi-script

and in fact, for testing, you can even feed the output Perl program directly to Perl, as in:

  html2cgi <html | perl

The output of this command should be fairly sensible HTML again. But remember, by hanging on to the Perl program, you can then add dynamic information, some interactivity (a form that invokes the same script later, for example), and what have you. Or, just use it as a way to ensure that you're making well-formed HTML. Enjoy!

Listings

        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use HTML::TreeBuilder;
        =6=     use CGI "-no_debug";
        =7=     
        =8=     @ARGV = "-" unless @ARGV;
        =9=     
        =10=    my %SEEN = map { $_, 1 } qw(header comment :html);
        =11=    my %KNOWN = map { $_, 1 } CGI::expand_tags(":html"); # CHEAT
        =12=    
        =13=    {
        =14=      my $h = HTML::TreeBuilder->new;
        =15=      $h->parse_file(shift);
        =16=      $h->traverse(\&walker);
        =17=      $h->delete;
        =18=      print "use CGI ", S("-no_debug", sort keys %SEEN), ";\nprint header,\n";
        =19=      print getput();
        =20=      print ";\n";
        =21=    }
        =22=    
        =23=    ## subroutines
        =24=    
        =25=    sub S {
        =26=      join ", ",
        =27=      map {
        =28=        local $_ = $_;
        =29=        s/([^ !#%-?A-~])/sprintf "\\x%02x", ord $1/ge;
        =30=        qq{"$_"};
        =31=      } @_;
        =32=    }
        =33=    
        =34=    BEGIN {                         # scope for static local
        =35=      my $put_buffer = "";
        =36=    
        =37=      sub put {
        =38=        for (@_) {
        =39=          $put_buffer .= $_;
        =40=        }
        =41=      }
        =42=    
        =43=      sub getput {
        =44=        ($put_buffer."", $put_buffer = "")[0];
        =45=      }
        =46=    }
        =47=    
        =48=    sub dumpattrs {
        =49=      my ($open, $hr, $close) = @_;
        =50=      my @attrs = sort grep !/^_/, keys %$hr;
        =51=      if (@attrs) {
        =52=        put
        =53=          $open,
        =54=          join(", ", map { S($_)." => ".S($hr->{$_}) } @attrs),
        =55=          $close;
        =56=      }
        =57=    }
        =58=    
        =59=    BEGIN {                         # scope for static local
        =60=      my $head_attrs = {};
        =61=    
        =62=      sub walker {
        =63=        my ($node, $start, $depth) = @_;
        =64=        if (ref $node) {
        =65=          my $tag = $node->tag;
        =66=          return 1 if $tag eq "html";
        =67=          if ($tag eq "head") {
        =68=            $head_attrs = get_attrs_from($node);
        =69=            return 0;
        =70=          }
        =71=          $tag = ucfirst $tag if index(" select tr link delete ", " $tag ") >= 0;
        =72=          put " " x $depth;
        =73=          if ($tag eq "body") {
        =74=            if ($start) {
        =75=              put "start_html";
        =76=              dumpattrs "(", $head_attrs, ")";
        =77=              put ",\n";
        =78=            } else {
        =79=              put "end_html,\n";
        =80=            }
        =81=            return 1;
        =82=          }
        =83=          $SEEN{$tag}++ unless $KNOWN{$tag};
        =84=          if ($start) {             # start
        =85=            put "$tag (";
        =86=            dumpattrs "{", $node, "}, "; # CHEAT
        =87=          }
        =88=          if (not $start or $HTML::Element::emptyElement{lc $tag}) { # CHEAT
        =89=            put S(" ") if not $start and $node->is_empty;
        =90=            put "), \"\\n\",";
        =91=          }
        =92=        } else {                    # text
        =93=          put " " x $depth, S($node), ", ";
        =94=        }
        =95=        put "\n";
        =96=        return 1;                           # yes, recurse
        =97=      }
        =98=    }
        =99=    
        =100=   sub get_attrs_from {
        =101=     my $node = shift;
        =102=     my %return;
        =103=     for my $first (@{$node->content}) {
        =104=       next unless ref $first;     # invalid content
        =105=       my $tag = $first->tag;
        =106=       if ($tag eq "title") {
        =107=         $return{"-title"} = join " ", @{$first->content};
        =108=         next;
        =109=       }
        =110=       warn "## unknown head tag: ".($first->as_HTML);
        =111=     }
        =112=     return \%return;
        =113=   }
        =114=   
        =115=   sub HTML::TreeBuilder::comment { # CHEAT
        =116=     my $self = shift;
        =117=     my $pos = $self->{'_pos'};
        =118=     $pos = $self unless defined($pos);
        =119=     my $ele = HTML::Element->new('comment');
        =120=     $ele->push_content(shift);
        =121=     $pos->push_content($ele);
        =122=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.

Worldwide training and consulting by Perl experts

Copyright Notice

Web Techniques Column 30 (Oct 1998)

Listings