Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.This text has appeared in an edited form in Perl Journal magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
![]() |
Download this listing! | ![]() |
![]() |
![]() |
Perl Journal Column 03 (Jul 2003)
[Suggested title: ``Cleaning up your HTML (part 2)'']
In [last month's article], I talked about using an HTML stripper based
on XML::LibXML
. Let's take a look at how that might be
implemented, in [listing one, below].
Line 3 pulls in the XML::LibXML
module, and line 4 creates a
$PARSER
object, using the default settings. This parser can be
shared amongst many individual filters, so we've made it a class
variable.
Lines 6 through 10 provide the constructor, which simply captures the
$permitted
parameter as an instance (member) variable, and returns
the blessed object.
The real meat begins in line 12: the strip
instance method. Lines
13 and 14 grab the instance variable and the input HTML, respectively.
Line 16 parses the HTML into a DOM. That's it. The result is an
XML::LibXML::Document
object from which we can get nice clean HTML,
or even XHTML if we choose. But we'll want to strip out the ugly
stuff first. Line 17 caches the $permitted
hashref into a simple
scalar for quick access.
Line 19 establishes the ``cursor'' or ``current node'', conveniently and
ambiguously called $cur
. We'll do a walk of the DOM tree by moving
this pointer around. We'll start by dropping down to the first child
of the document node, and we'll end when the value hits undef
,
dropping out of the loop beginning in line 20.
Line 21 establishes a $delete
flag. If this flag is true at the
bottom of this loop, the current node must be deleted after we've
computed the next node.
The comments in line 23 and 24 reflect my feelings about how this is a
bad design. Any design that requires you to ask an object for its
type is generally a maintenance nightmare. Instead, a common protocol
should have been established, where I'd merely have to query the
properties and abilities of each thing within the tree using a
boolean-returning query method. But not here. So I grit my teeth and
use ref
a lot, hoping that some future version fixes all of this
before it breaks all of this.
If it's an element, we'll note that in line 25, and then proceed to
see whether the element type (the nodeName
in line 27) is one of
our permitted elements. If so, $ok_attr
will then be a hashref of
the permitted attributes, and we'll continue into line 28.
Lines 29 to 32 remove any attribute that is not permitted. Each
attribute is queried for its nodeName
, which is then checked
against a list of permitted attributes, and removed if not permitted.
Line 34 moves our cursor down into the first child node if it exists.
For example, if we're looking at a td
element, it'll almost
certainly have some content which we then have to scan. Some
permitted elements (like br
) won't have any content, so $next
will be undef, and we'll use the normal ``move forward'' logic at the
bottom of the loop.
That handles the permitted elements, but when we have a forbidden element, we need to remove it and reparent the orphaned children, using the code beginning in line 39.
Line 42 caches the parent node of the node to be deleted. Lines 43 to 45 move all of the node's children up to follow the node. This must be done in reverse order so that the order is retained following the current node, since we're always inserting immediately following the current node.
Finally, line 47 notes that the $cur
node must be deleted after
we've computed the following node at the bottom of the loop.
Lines 50 through 52 retain any existing text or cdata sections.
The cdata section results when a script
tag is used. Although our
permitted list will probably cause the script
element to be
removed, the hoisted data is in a cdata element, and should be treated
like text.
Lines 53 through 57 flag comments and DTDs as needing to be deleted. The former is dangerous (possibly hiding JavaScript). The latter is unnecessary, since we'll likely be including this text as part of a larger HTML page anyway.
Lines 58 to 60 attempt to flag anything else that I didn't see in my testing. I have no idea if I covered all of the nodes permitted in an HTML document, but I'm hoping I did.
Lines 62 to 73 compute the ``next'' node to be visited. I want the
equivalent of the XPath expression following::node()[1]
, without
paying the price of parsing an XPath each time through the loop. This
expression looks for the next node of any type, either at the same
level, or at any level higher. Child nodes are not considered.
Line 63 initially sets this ``next'' node to be the current node. Lines 65 to 68 determine if the next sibling node is available. If so, that's our selection, and we drop out of the ``naked block'' defined in lines 64 through 72.
If the node has no next sibling, then we need to pop up a level in the
tree and look for that node's following node. Line 70 tries this,
restarting the naked block if successful. If we're already at the
top-level node, then the undef
value is left in $next
, which will
end the outer loop started in line 20.
Lines 76 and 77 delete the current node if needed, by requesting that the parent node forget about the current node. Line 79 advances the current node to the next node as the last step of this outer loop.
All that's left now is to spit out the modified DOM as HTML, in line 82, and then toss away the initial DTD in line 83. (In my tests, this was always the first line up to a newline, but this may change in future releases of the library, so this is a bit risky.)
Line 88 provides the mandatory true value for all files brought in
with require
or use
.
And there you have it! A configurable HTML stripper that is fast and thorough. Now there's no excuse for letting someone start a comment tag or bold tag in your guestbook, messing up the rest of your display. Until next time, enjoy!
Listings
=1= package My_HTML_Filter; =2= use strict; =3= require XML::LibXML; =4= my $PARSER = XML::LibXML->new; =5= =6= sub new { =7= my $class = shift; =8= my $permitted = shift; =9= return bless { permitted => $permitted }, $class; =10= } =11= =12= sub strip { =13= my $self = shift; =14= my $html = shift; =15= =16= my $dom = $PARSER->parse_html_string($html) or die "Cannot parse"; =17= my $permitted = $self->{permitted}; =18= =19= my $cur = $dom->firstChild; =20= while ($cur) { =21= my $delete = 0; # default to safe =22= =23= ## I really really hate switching on class names =24= ## but this is a bad interface design {sigh} =25= if (ref $cur eq "XML::LibXML::Element") { =26= ## "that which is not explicitly permitted is forbidden!" =27= if (my $ok_attr = $permitted->{$cur->nodeName}) { =28= ## so this element is permitted, but what about its attributes? =29= for my $att ($cur->attributes) { =30= my $name = $att->nodeName; =31= $cur->removeAttribute($name) unless $ok_attr->{$name}; =32= } =33= ## now descend if any kids =34= if (my $next = $cur->firstChild) { =35= $cur = $next; =36= next; # don't execute code at bottom =37= } =38= } else { =39= ## bogon - delete! =40= ## we must hoist any kids to be after our current position in =41= ## reverse order, since we always inserting right after old node =42= my $parent = $cur->parentNode or die "Expecting parent of $cur"; =43= for (reverse $cur->childNodes) { =44= $parent->insertAfter($_, $cur); =45= } =46= ## and flag this one for deletion =47= $delete = 1; =48= ## fall out =49= } =50= } elsif (ref $cur eq "XML::LibXML::Text" =51= or ref $cur eq "XML::LibXML::CDATASection") { =52= ## fall out =53= } elsif (ref $cur eq "XML::LibXML::Dtd" =54= or ref $cur eq "XML::LibXML::Comment") { =55= ## delete these =56= $delete = 1; =57= ## fall out =58= } else { =59= warn "[what to do with a $cur?]"; # I hope we don't hit this =60= } =61= =62= ## determine next node ala XPath "following::node()[1]" =63= my $next = $cur; =64= { =65= if (my $sib = $next->nextSibling) { =66= $next = $sib; =67= last; =68= } =69= ## no sibling... must try parent node's sibling =70= $next = $next->parentNode; =71= redo if $next; =72= } =73= ## $next might be undef at this point, and we'll be done =74= =75= ## delete the current node if needed =76= $cur->parentNode->removeChild($cur) =77= if $delete; =78= =79= $cur = $next; =80= } =81= =82= my $output_html = $dom->toStringHTML; =83= $output_html =~ s/.*\n//; # strip the doctype =84= =85= return $output_html; =86= } =87= =88= 1;