Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Perl Journal magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Perl Journal Column 03 (Jul 2003)

[Suggested title: ``Cleaning up your HTML (part 2)'']

In [last month's article], I talked about using an HTML stripper based on XML::LibXML. Let's take a look at how that might be implemented, in [listing one, below].

Line 3 pulls in the XML::LibXML module, and line 4 creates a $PARSER object, using the default settings. This parser can be shared amongst many individual filters, so we've made it a class variable.

Lines 6 through 10 provide the constructor, which simply captures the $permitted parameter as an instance (member) variable, and returns the blessed object.

The real meat begins in line 12: the strip instance method. Lines 13 and 14 grab the instance variable and the input HTML, respectively.

Line 16 parses the HTML into a DOM. That's it. The result is an XML::LibXML::Document object from which we can get nice clean HTML, or even XHTML if we choose. But we'll want to strip out the ugly stuff first. Line 17 caches the $permitted hashref into a simple scalar for quick access.

Line 19 establishes the ``cursor'' or ``current node'', conveniently and ambiguously called $cur. We'll do a walk of the DOM tree by moving this pointer around. We'll start by dropping down to the first child of the document node, and we'll end when the value hits undef, dropping out of the loop beginning in line 20.

Line 21 establishes a $delete flag. If this flag is true at the bottom of this loop, the current node must be deleted after we've computed the next node.

The comments in line 23 and 24 reflect my feelings about how this is a bad design. Any design that requires you to ask an object for its type is generally a maintenance nightmare. Instead, a common protocol should have been established, where I'd merely have to query the properties and abilities of each thing within the tree using a boolean-returning query method. But not here. So I grit my teeth and use ref a lot, hoping that some future version fixes all of this before it breaks all of this.

If it's an element, we'll note that in line 25, and then proceed to see whether the element type (the nodeName in line 27) is one of our permitted elements. If so, $ok_attr will then be a hashref of the permitted attributes, and we'll continue into line 28.

Lines 29 to 32 remove any attribute that is not permitted. Each attribute is queried for its nodeName, which is then checked against a list of permitted attributes, and removed if not permitted.

Line 34 moves our cursor down into the first child node if it exists. For example, if we're looking at a td element, it'll almost certainly have some content which we then have to scan. Some permitted elements (like br) won't have any content, so $next will be undef, and we'll use the normal ``move forward'' logic at the bottom of the loop.

That handles the permitted elements, but when we have a forbidden element, we need to remove it and reparent the orphaned children, using the code beginning in line 39.

Line 42 caches the parent node of the node to be deleted. Lines 43 to 45 move all of the node's children up to follow the node. This must be done in reverse order so that the order is retained following the current node, since we're always inserting immediately following the current node.

Finally, line 47 notes that the $cur node must be deleted after we've computed the following node at the bottom of the loop.

Lines 50 through 52 retain any existing text or cdata sections. The cdata section results when a script tag is used. Although our permitted list will probably cause the script element to be removed, the hoisted data is in a cdata element, and should be treated like text.

Lines 53 through 57 flag comments and DTDs as needing to be deleted. The former is dangerous (possibly hiding JavaScript). The latter is unnecessary, since we'll likely be including this text as part of a larger HTML page anyway.

Lines 58 to 60 attempt to flag anything else that I didn't see in my testing. I have no idea if I covered all of the nodes permitted in an HTML document, but I'm hoping I did.

Lines 62 to 73 compute the ``next'' node to be visited. I want the equivalent of the XPath expression following::node()[1], without paying the price of parsing an XPath each time through the loop. This expression looks for the next node of any type, either at the same level, or at any level higher. Child nodes are not considered.

Line 63 initially sets this ``next'' node to be the current node. Lines 65 to 68 determine if the next sibling node is available. If so, that's our selection, and we drop out of the ``naked block'' defined in lines 64 through 72.

If the node has no next sibling, then we need to pop up a level in the tree and look for that node's following node. Line 70 tries this, restarting the naked block if successful. If we're already at the top-level node, then the undef value is left in $next, which will end the outer loop started in line 20.

Lines 76 and 77 delete the current node if needed, by requesting that the parent node forget about the current node. Line 79 advances the current node to the next node as the last step of this outer loop.

All that's left now is to spit out the modified DOM as HTML, in line 82, and then toss away the initial DTD in line 83. (In my tests, this was always the first line up to a newline, but this may change in future releases of the library, so this is a bit risky.)

Line 88 provides the mandatory true value for all files brought in with require or use.

And there you have it! A configurable HTML stripper that is fast and thorough. Now there's no excuse for letting someone start a comment tag or bold tag in your guestbook, messing up the rest of your display. Until next time, enjoy!

Listings

        =1=     package My_HTML_Filter;
        =2=     use strict;
        =3=     require XML::LibXML;
        =4=     my $PARSER = XML::LibXML->new;
        =5=     
        =6=     sub new {
        =7=       my $class = shift;
        =8=       my $permitted = shift;
        =9=       return bless { permitted => $permitted }, $class;
        =10=    }
        =11=    
        =12=    sub strip {
        =13=      my $self = shift;
        =14=      my $html = shift;
        =15=    
        =16=      my $dom = $PARSER->parse_html_string($html) or die "Cannot parse";
        =17=      my $permitted = $self->{permitted};
        =18=    
        =19=      my $cur = $dom->firstChild;
        =20=      while ($cur) {
        =21=        my $delete = 0;             # default to safe
        =22=    
        =23=        ## I really really hate switching on class names
        =24=        ## but this is a bad interface design {sigh}
        =25=        if (ref $cur eq "XML::LibXML::Element") {
        =26=          ## "that which is not explicitly permitted is forbidden!"
        =27=          if (my $ok_attr = $permitted->{$cur->nodeName}) {
        =28=            ## so this element is permitted, but what about its attributes?
        =29=            for my $att ($cur->attributes) {
        =30=              my $name = $att->nodeName;
        =31=              $cur->removeAttribute($name) unless $ok_attr->{$name};
        =32=            }
        =33=            ## now descend if any kids
        =34=            if (my $next = $cur->firstChild) {
        =35=              $cur = $next;
        =36=              next;                 # don't execute code at bottom
        =37=            }
        =38=          } else {
        =39=            ## bogon - delete!
        =40=            ## we must hoist any kids to be after our current position in
        =41=            ## reverse order, since we always inserting right after old node
        =42=            my $parent = $cur->parentNode or die "Expecting parent of $cur";
        =43=            for (reverse $cur->childNodes) {
        =44=              $parent->insertAfter($_, $cur);
        =45=            }
        =46=            ## and flag this one for deletion
        =47=            $delete = 1;
        =48=            ## fall out
        =49=          }
        =50=        } elsif (ref $cur eq "XML::LibXML::Text"
        =51=                 or ref $cur eq "XML::LibXML::CDATASection") {
        =52=          ## fall out
        =53=        } elsif (ref $cur eq "XML::LibXML::Dtd"
        =54=                 or ref $cur eq "XML::LibXML::Comment") {
        =55=          ## delete these
        =56=          $delete = 1;
        =57=          ## fall out
        =58=        } else {
        =59=          warn "[what to do with a $cur?]"; # I hope we don't hit this
        =60=        }
        =61=    
        =62=        ## determine next node ala XPath "following::node()[1]"
        =63=        my $next = $cur;
        =64=        {
        =65=          if (my $sib = $next->nextSibling) {
        =66=            $next = $sib;
        =67=            last;
        =68=          }
        =69=          ## no sibling... must try parent node's sibling
        =70=          $next = $next->parentNode;
        =71=          redo if $next;
        =72=        }
        =73=        ## $next might be undef at this point, and we'll be done
        =74=    
        =75=        ## delete the current node if needed
        =76=        $cur->parentNode->removeChild($cur)
        =77=          if $delete;
        =78=    
        =79=        $cur = $next;
        =80=      }
        =81=    
        =82=      my $output_html = $dom->toStringHTML;
        =83=      $output_html =~ s/.*\n//;     # strip the doctype
        =84=    
        =85=      return $output_html;
        =86=    }
        =87=    
        =88=    1;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.