Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 24 (May 2001)

[suggested title: Processing Footnotes]

I sat down to write a web page the other day, and realized that I wanted footnotes, because I wanted to keep the main message in the main text, but have some annotations for some of the side points. That's easy enough: just put some text in a table at the end, use those cute little sup tags around the footnote numbers, and hack away.

Oops. Those little numbers. I started to dread getting six footnotes inserted, then going back to insert one between number 2 and number 3. Maintenance nightmare. So, can Perl help? Of course!

In about 10 times the amount of labor I would have spent just doing this manually, I hacked out the program in [listing one, below]. Well, that's not very efficient, because automating it took longer than I wanted. So in hopes of recouping this investment of time for all mankind, I'll pass the program along to you. Besides, it shows how to create an angle-bracket metalanguage for your HTML and XML processing, so there's some reuse of concept as well as code. Yeah, that justifies it.

The idea is that I insert a footnote into the main flow using a made-up tag of foot, and then this processor pass takes those out, replacing them with an anchor link and a unique number. Then, at the end of the file, all the footnotes are dumped out. For an example, look at the end of the program. And, I couldn't stop there, so I decided to allow nested footnotes (like those frequently found on the alt.sysadmin.recovery newsgroup). About half my coding time was spent getting those to work right. Someday, I must learn priorities.

So, let's see what I wasted an hour on, starting with the first few lines, which begin nearly every program I write. These lines enable warnings, turn on the normal compiler restrictions for non-trivial programs, and disable buffering on STDOUT.

Line 5 pulls in HTML::Parser, a wonderful piece of work maintained by Gisle Aas. As of fairly recently, this is a C-based module for lightning-fast parsing of angle-bracket data input, normally HTML text. Much faster than hand-rolled regular expressions, I must say. You'll find this as part of the LWP module family in the CPAN.

Lines 7 and 8 contain the footnote list and footnote stack index, respectively. The first item of @feet is the text of the first footnote, numbered 1. The last item of @feet_index (if any) is the subscript of @feet of the current footnote we are creating. As each new foot tag is seen, we create a new empty footnote in @feet, and put its index at the end of @feet_index. When the note is ended, we pop off @feet_index, thus resuming the previous note. If there are no items in @feet_index, it's the main body and we can just copy the data through.

Yes, that's the logic it took me the better part of an hour to get cleanly. I wanted the footnotes to be numbered in the order of the start tags, and kept coming up with algorithms to number that on the basis of the end tag instead, until I got the indirection table idea.
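To make the indirection table concrete, here is a minimal sketch that replays a hand-written list of events instead of parsing HTML. The number_footnotes helper and its event format are assumptions made for illustration and do not appear in the listing; only the roles of @feet and @feet_index follow the description above. The reference markers are left out to keep the sketch short.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the listing): replay a list of events
# and return the main body text plus the footnotes, numbered in the order
# their start tags were seen.
# Each event is [type, text], where type is "open", "close", or "text".
sub number_footnotes {
    my @events = @_;
    my @feet;          # footnote texts, first footnote at index 0
    my @feet_index;    # stack of indexes into @feet for the open notes
    my $body = "";     # stands in for what the listing prints to STDOUT

    for my $event (@events) {
        my ($type, $text) = @$event;
        if ($type eq "open") {
            push @feet, "";                    # new, empty footnote
            push @feet_index, $#feet;          # it becomes the current one
        } elsif ($type eq "close") {
            pop @feet_index;                   # resume the enclosing note
        } elsif (@feet_index) {
            $feet[$feet_index[-1]] .= $text;   # text inside a footnote
        } else {
            $body .= $text;                    # text in the main body
        }
    }
    return ($body, @feet);
}

# A footnote nested inside another footnote.
my ($body, @feet) = number_footnotes(
    [text => "main "],
    [open => ""], [text => "outer "],
    [open => ""], [text => "inner"], [close => ""],
    [text => "tail"], [close => ""],
    [text => "more"],
);
print "body: $body\n";
print "note ", $_ + 1, ": $feet[$_]\n" for 0 .. $#feet;
```

The outer note gets number 1 even though its end tag arrives last, and the text after the inner note lands back in the outer one, which is the behavior the stack of indexes is there to provide.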

Line 10 keeps track of our nesting of elements. In the way that I'm using HTML::Parser, it wouldn't matter to the parser that I have mismatched tags. However, since my footnote processing is fragile under those circumstances, I decided to enforce the XML notion of ``well-formedness'' and require properly balanced tags.

At this point, you may ask why I didn't just use XML::Parser instead of HTML::Parser. Well, I like the callback flexibility of HTML::Parser better for small projects.

Speaking of which, lines 12 through 17 define my parser object. I define three callbacks. The first one is for text items, which will get the text as its first parameter. The second one is for start tags, which will get both the original text and the tagname extracted as well. And finally, the end tags are also triggered, again getting both the original text and the tagname.

Line 19 turns on xml_mode in the HTML::Parser which keeps the tags in their original case and slightly alters the handling of a few other constructs. Again, more evidence that I really wanted an XML Parser, but I'll stop saying that now.

Line 20 pulls in the contents of the DATA filehandle, which is everything in this file below the __END__ token: my sample data (the story of how I write my Perl columns). The result of this parsing pass will be a number of calls to the three callback subroutines, which after completion will have printed the main part of the text to STDOUT already. We'll see how that works shortly.

But the other effect of the parse is the extraction of the footnotes into @feet. So lines 22 to 27 dump this out in a nice way. I'm using an HTML table for layout, with a column of the footnote numbers and a column of the footnote text. Each footnote also has an anchor assigned to it, which we can use as the target of an internal anchor reference using a fragment identifier. Being uncreative, I numbered these note1, note2, and so on.
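As a sketch of that dump step, the following builds the table as a string rather than printing it, which makes the markup easier to inspect. The footnote_table helper is hypothetical and not part of the listing; the markup and the note1, note2 anchor names are copied from lines 23 to 26.

```perl
use strict;
use warnings;

# Hypothetical helper: return the footnote table as a string, using the
# same markup as lines 23 to 26 of the listing. Returns an empty string
# when there are no footnotes.
sub footnote_table {
    my @notes = @_;
    return "" unless @notes;
    my $html = "<hr><table border='0' cellspacing='0' cellpadding='2'>\n";
    for my $n (1 .. @notes) {
        # Footnote $n lives at index $n - 1, and its anchor is "note$n".
        $html .= "<tr><td valign='top'><sup><a name='note$n'>$n</a></sup></td>"
               . "<td>" . $notes[$n - 1] . "</td></tr>\n";
    }
    return $html . "</table>";
}

my $table = footnote_table("for some!", "By email.");
print $table, "\n";
```

A reference in the body such as href='#note2' then jumps to the row carrying name='note2'.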

Line 29 is an exit, redundant because there are only subroutines from here down, but it's in there just to keep it clear for me where the program ends.

Lines 31 to 38 define the text handler, called whenever HTML::Parser finds some text outside a tag. We've chosen to pass the text itself as the first parameter, which I copy into $text in line 32. If we're currently in a footnote, this text is part of the footnote, so we append it to the right footnote. But the right footnote will be the one whose index is in the rightmost element of @feet_index, hence the indirection. Remember, this is not necessarily the highest indexed footnote, if we've had nested footnotes, but that should be the exception rather than the rule. If we're not in a footnote, the text is just dumped to STDOUT instead.

Lines 40 to 54 define the handler for the beginning of every element. The incoming parameters are the original text, and the tagname for that start tag (but not the attributes, because we didn't need them). Those get assigned in line 41 to $text and $tagname.

Line 44 notes the current element name by pushing the tag onto the stack. We'll check this on the close tag to make sure the tags are nested properly to make valid elements.

Line 46 does the work for a footnote start tag. First, we create the new footnote as empty in line 47. Then, we insert the reference to the footnote in line 48, by faking a text event containing the reference. We can't just print this because we might still be in another footnote, and faking the text event ``does the right thing''. (For better maintenance, I'd probably pull the ``add text'' operation to a separate subroutine that both the text handler and this handler call, but this worked for this quick-n-dirty program.) Line 49 adds the footnote index onto the footnote stack. Note that we cannot do this before the previous line, or else the footnote reference would end up inside itself.
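The ordering constraint is easy to see in a small experiment. This sketch uses a separate add_text subroutine, along the lines of the refactoring suggested above, and collects the main body in a string instead of printing it. The names add_text and open_foot, and the simplified [n] reference marker, are assumptions for illustration and not the listing's code.

```perl
use strict;
use warnings;

my (@feet, @feet_index);
my $body = "";    # stands in for STDOUT in the listing

# Same routing rule as text_h in the listing, collecting instead of printing.
sub add_text {
    my ($text) = @_;
    if (@feet_index) {
        $feet[$feet_index[-1]] .= $text;
    } else {
        $body .= $text;
    }
}

# Hypothetical helper: open a footnote, emitting the reference either
# before pushing the new index (as the listing does) or after it.
sub open_foot {
    my ($reference_first) = @_;
    push @feet, "";
    my $ref = "[" . @feet . "]";      # simplified reference marker
    if ($reference_first) {
        add_text($ref);               # goes to the enclosing context
        push @feet_index, $#feet;
    } else {
        push @feet_index, $#feet;
        add_text($ref);               # goes into the new footnote itself
    }
}

open_foot(1); add_text("right order"); pop @feet_index;
open_foot(0); add_text("wrong order"); pop @feet_index;

print "body: $body\n";        # body: [1]
print "note 1: $feet[0]\n";   # note 1: right order
print "note 2: $feet[1]\n";   # note 2: [2]wrong order
```

With the push done first, the second reference never reaches the main body and ends up at the start of its own footnote.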

Line 53 handles the start tags that are not of interest (everything except foot tags) by simply copying them as-is to the current output (either a footnote or STDOUT).

Lines 56 to 72 handle the end tags. Again, the text and tagname end up in variables, defined in line 57.

Lines 60 through 64 handle the verification of properly nested tags. If there's no start tag, or the tags don't match, a swift and painless death is our result.
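The same stack discipline can be tried in isolation. In this sketch, check_nesting is a hypothetical helper that takes a list of start and end events rather than parser callbacks. The two die messages mirror lines 60 to 63, except that the end tag text is rebuilt from the tag name.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a list of [kind, tagname] events, where kind
# is "start" or "end", and die on the first nesting error.
sub check_nesting {
    my @elements;    # stack of currently open element names
    for my $event (@_) {
        my ($kind, $tagname) = @$event;
        if ($kind eq "start") {
            push @elements, $tagname;
            next;
        }
        die "saw </$tagname> outside of element\n" unless @elements;
        die "saw </$tagname> nested inside <$elements[-1]>\n"
            unless $elements[-1] eq $tagname;
        pop @elements;
    }
    return 1;
}

# Balanced input passes; closing li while foot is still open does not.
my $ok  = eval { check_nesting([start => "li"], [start => "foot"],
                               [end => "foot"], [end => "li"]) };
my $bad = eval { check_nesting([start => "li"], [start => "foot"],
                               [end => "li"]) };
my $error = $@;
print "balanced input passed\n" if $ok;
print "unbalanced input failed: $error" unless $bad;
```

Like the listing, this sketch does not complain about elements that are still open when the input ends.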

Lines 66 to 69 handle the foot end tag, which is the only one of interest. For that tag, we simply pop an entry off the @feet_index array, which takes us back to the previous footnote on the next text item seen, or back to dumping to STDOUT if there is none.

Line 71 dumps the other uninteresting end tags as needed.

And that's all there is. Not rocket science, but it gets the job done. For some sample text, I included an outline of what it takes to write a Perl column. If you run the program, you get the following HTML output:

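    <h2>Writing a Perl column</h2>
    Writing a magazine column about Perl is a simple<sup><a href='#note1'>1</a></sup> task.
    Just perform the following steps:
    <ol>
    <li>Think of a problem to
      solve<sup><a href='#note2'>2</a></sup>.</li>
    <li>Write the code to solve it.<sup><a href='#note3'>3</a></sup></li>
    <li>Fret over the code for a few hours<sup><a href='#note5'>5</a></sup>.</li>
    <li>Write the column<sup><a href='#note6'>6</a></sup>.</li>
    <li>Show the column to a group of friends
      on IRC<sup><a href='#note8'>8</a></sup>
      for a quick peer review.</li>
    <li>Turn it in<sup><a href='#note9'>9</a></sup> to the editor.</li>
    <li>Wait a few days for the galleys<sup><a href='#note10'>10</a></sup>
      to come back.</li>
    <li>Grimace over the hacks to your lovely
    prose<sup><a href='#note13'>13</a></sup> and provide corrections to
    the corrections.
    </li>
    <li>Wait a few months<sup><a href='#note14'>14</a></sup> for it
    to "hit the stands".</li>
    <li>Wave the magazine in front of your friends<sup><a href='#note15'>15</a></sup>!</li>
    </ol>
    <hr><table border='0' cellspacing='0' cellpadding='2'>
    <tr><td valign='top'><sup><a name='note1'>1</a></sup></td><td>for some!</td></tr>
    <tr><td valign='top'><sup><a name='note2'>2</a></sup></td><td>You can ask around for help here.
      I keep an archive of "todo" ideas, and it really helps.</td></tr>
    <tr><td valign='top'><sup><a name='note3'>3</a></sup></td><td>The code should be between 50 and 200
      lines for optimum column
      length<sup><a href='#note4'>4</a></sup>.</td></tr>
    <tr><td valign='top'><sup><a name='note4'>4</a></sup></td><td>About 10,000 characters.</td></tr>
    <tr><td valign='top'><sup><a name='note5'>5</a></sup></td><td>Or a few minutes.</td></tr>
    <tr><td valign='top'><sup><a name='note6'>6</a></sup></td><td>I use POD<sup><a href='#note7'>7</a></sup>
      format.</td></tr>
    <tr><td valign='top'><sup><a name='note7'>7</a></sup></td><td>See <tt>perldoc perlpod</tt>.</td></tr>
    <tr><td valign='top'><sup><a name='note8'>8</a></sup></td><td>Usually the <tt>#perl</tt> channel.</td></tr>
    <tr><td valign='top'><sup><a name='note9'>9</a></sup></td><td>By email.</td></tr>
    <tr><td valign='top'><sup><a name='note10'>10</a></sup></td><td>Usually a
      PDF<sup><a href='#note11'>11</a></sup>.</td></tr>
    <tr><td valign='top'><sup><a name='note11'>11</a></sup></td><td><i>Portable Document Format</i> from
      Adobe<sup><a href='#note12'>12</a></sup>.</td></tr>
    <tr><td valign='top'><sup><a name='note12'>12</a></sup></td><td>See <tt>www.adobe.com</tt> for downloads.</td></tr>
    <tr><td valign='top'><sup><a name='note13'>13</a></sup></td><td>just kidding, guys!</td></tr>
    <tr><td valign='top'><sup><a name='note14'>14</a></sup></td><td>Or so it seems, since the deadline
    for a April cover is usually the first week of January.</td></tr>
    <tr><td valign='top'><sup><a name='note15'>15</a></sup></td><td>Or the cute girl
    at the bookstore checkout counter.</td></tr>
    </table>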

Note how the footnotes have been replaced with internal fragment references, and the contents of the footnotes have become a table at the end. Yes, I could have done all this by hand, but in retrospect, it was more fun to write the program and get it done right once and for all.

So, don't fear footnotes, and don't fear writing tiny metalanguages for those odd tasks. Until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use HTML::Parser;
        =6=     
        =7=     my @feet;                       # final footnote list
        =8=     my @feet_index;                 # indexes into @feet
        =9=     
        =10=    my @elements;                   # ensure nested tags match
        =11=    
        =12=    my $parser = HTML::Parser->new
        =13=      (
        =14=       text_h => [\&text_h, "text"],
        =15=       start_h => [\&start_h, "text, tagname"],
        =16=       end_h => [\&end_h, "text, tagname"],
        =17=      );
        =18=    
        =19=    $parser->xml_mode(1);           # keep tags case-sensitive
        =20=    $parser->parse_file(\*DATA);    # prints main part to STDOUT
        =21=    
        =22=    if (@feet) {                    # we had footnotes?
        =23=      print "<hr><table border='0' cellspacing='0' cellpadding='2'>\n";
        =24=      print "<tr><td valign='top'><sup><a name='note$_'>$_</a></sup></td>",
        =25=        "<td>$feet[$_-1]</td></tr>\n" for 1..@feet;
        =26=      print "</table>";
        =27=    }
        =28=    
        =29=    exit 0;                         # end of code
        =30=    
        =31=    sub text_h {
        =32=      my ($text) = @_;
        =33=      if (@feet_index) {            # are we inside a footnote?
        =34=        $feet[$feet_index[-1]] .= $text;    # append to that
        =35=      } else {
        =36=        print $text;                # just show it
        =37=      }
        =38=    }
        =39=    
        =40=    sub start_h {
        =41=      my ($text, $tagname) = @_;
        =42=    
        =43=      ## ensure proper nesting
        =44=      push @elements, $tagname;
        =45=    
        =46=      if ($tagname eq "foot") {
        =47=        push @feet, "";             # the note itself
        =48=        text_h("<sup><a href='#note".@feet."'>".@feet."</a></sup>");
        =49=        push @feet_index, $#feet;   # pointer to note
        =50=        return;
        =51=      }
        =52=      
        =53=      text_h($text);                # uninteresting start tag
        =54=    }
        =55=    
        =56=    sub end_h {
        =57=      my ($text, $tagname) = @_;
        =58=    
        =59=      ## ensure proper nesting
        =60=      die "saw $text outside of element"
        =61=        unless @elements;
        =62=      die "saw $text nested inside <$elements[-1]>"
        =63=        unless $elements[-1] eq $tagname;
        =64=      pop @elements;
        =65=    
        =66=      if ($tagname eq "foot") {
        =67=        pop @feet_index;                    # no longer accumulating here
        =68=        return;
        =69=      }
        =70=      
        =71=      text_h($text);                # uninteresting end tag
        =72=    }
        =73=    
        =74=    __END__
        =75=    <h2>Writing a Perl column</h2>
        =76=    Writing a magazine column about Perl is a simple<foot>for some!</foot> task.
        =77=    Just perform the following steps:
        =78=    <ol>
        =79=    <li>Think of a problem to
        =80=      solve<foot>You can ask around for help here.
        =81=      I keep an archive of "todo" ideas, and it really helps.</foot>.</li>
        =82=    <li>Write the code to solve it.<foot>The code should be between 50 and 200
        =83=      lines for optimum column
        =84=      length<foot>About 10,000 characters.</foot>.</foot></li>
        =85=    <li>Fret over the code for a few hours<foot>Or a few minutes.</foot>.</li>
        =86=    <li>Write the column<foot>I use POD<foot>See <tt>perldoc perlpod</tt>.</foot>
        =87=      format.</foot>.</li>
        =88=    <li>Show the column to a group of friends
        =89=      on IRC<foot>Usually the <tt>#perl</tt> channel.</foot>
        =90=      for a quick peer review.</li>
        =91=    <li>Turn it in<foot>By email.</foot> to the editor.</li>
        =92=    <li>Wait a few days for the galleys<foot>Usually a
        =93=      PDF<foot><i>Portable Document Format</i> from
        =94=      Adobe<foot>See <tt>www.adobe.com</tt> for downloads.</foot>.</foot>.</foot>
        =95=      to come back.</li>
        =96=    <li>Grimace over the hacks to your lovely
        =97=    prose<foot>just kidding, guys!</foot> and provide corrections to
        =98=    the corrections.
        =99=    </li>
        =100=   <li>Wait a few months<foot>Or so it seems, since the deadline
        =101=   for a April cover is usually the first week of January.</foot> for it
        =102=   to "hit the stands".</li>
        =103=   <li>Wave the magazine in front of your friends<foot>Or the cute girl
        =104=   at the bookstore checkout counter.</foot>!</li>
        =105=   </ol>

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.