Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Perl Journal magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Perl Journal Column 02 (Jun 2003)
[Suggested title: ``Cleaning up your HTML (part 1)'']
The simplicity of HTML is sometimes deceiving. Sure, it's pretty easy
for your average Perl hacker to set up a web-based bulletin board
system, allowing people to come along and write comments. It's even
tempting to allow those comments to contain HTML, rather than being
escaped into monospaced <pre> purgatory. But ``there be dragons
there'', as the old maps used to say.
The problem is that arbitrary HTML permits arbitrary activities to be triggered merely by visiting the site, thanks to these fancy scriptable browsers. As reported in the security journals, these attacks are generally known as ``cross-site scripting''. They usually come in the form of a JavaScript chunk embedded in a webpage where at least part of the content can be controlled by arbitrary visitors, such as a guestbook or a web-based message system. Left unchecked, such attacks can leak a person's credentials (such as cookies) to the bad guys without that person's knowledge, and that can lead to some pretty bad stuff.
Even without the issue of cross-site scripting, we still have to watch out for arbitrary HTML and JavaScript that can trigger browser bugs, which can again lead to denial-of-service attacks or usurped credentials. While keeping up with the latest browser release usually prevents this, most people I know don't upgrade at the first notice, leading to a vulnerability window.
And then there are the just plain annoyances: people who put HTML ``start bold'' tags in without the end bold or, worse yet, include a start comment marker without the matching end comment. This isn't always a malicious act: it could happen just as easily by accident.
Because there are so many ways to go wrong, people tend to just forbid
all HTML, escape everything through an entity escaper, and leave it at
that. But how do you permit some ``safe'' HTML while being very careful
not to let ``dangerous'' HTML or comments into your output? For example,
what if inline images were deemed to be annoying? How do you ensure
that you are stripping all img elements?
I've seen a few solutions to tidy up the HTML, usually based on a
series of regular-expression replacements (such as HTML::Sanitizer in
the CPAN). But these often fail to consider the matching-tag or
the implicit close-tag problems of HTML. For example, consider the
valid HTML of:
<table><tr><td><b>foo<td>bar</table>
In this case, the bolding really does end at the end of foo, so bar
should be rendered as unbolded. But to know that, you have to know
that the second td element closes off the previous td element, and
therefore the b element as well. That's a bit hard to get into
regular expressions.
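To make the difficulty concrete, here is a minimal sketch of a regex-only "balancer". The routine naive_balance_bold is hypothetical and written only for illustration (it is not taken from any CPAN module): it counts open and close b tags and appends whatever seems to be missing.

```perl
use strict;
use warnings;

# Hypothetical regex-only "balancer": count <b> and </b> tags, then
# append any apparently missing close tags at the very end.
sub naive_balance_bold {
    my ($html) = @_;
    my $opens   = () = $html =~ /<b\b[^>]*>/gi;   # number of <b ...> tags
    my $closes  = () = $html =~ m{</b\s*>}gi;     # number of </b> tags
    my $missing = $opens - $closes;
    $html .= '</b>' x $missing if $missing > 0;
    return $html;
}

my $input  = '<table><tr><td><b>foo<td>bar</table>';
my $output = naive_balance_bold($input);
print "$output\n";
# prints: <table><tr><td><b>foo<td>bar</table></b>
```

The counts now agree, but the close tag lands after the table rather than at the end of foo, which is exactly the kind of mistake that knowledge of HTML's implicit close-tag rules is needed to avoid.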
One all-encompassing solution is HTML::TreeBuilder from the CPAN.
This code understands the nesting and optional closing tags of HTML,
and wraps itself around HTML::Parser to find the tags and other
syntax of an HTML document. Once we have a nice clean tree of
properly parsed and nested HTML elements, we merely need to walk
through the tree, throwing away the dangerous elements. As long as we
don't mangle the tree, we should get properly nested tags out of the
mix as well.
The problem with a solution based around HTML::TreeBuilder is that
it is too expensive to use repeatedly (such as every time a page is
reloaded). While HTML::Parser is pretty fast, HTML::TreeBuilder
has to build a lot of heavyweight, heavily interconnected Perl
objects, at least one for every element of the tree. This kind of
tree is slow to create, and slow to discard, so a heavily hit website
would be bogged down in short order.
But, from the XML realm (of all places) comes another interesting
solution, in the form of XML::LibXML, which is a wrapper around the
GNOME libxml2 parser. Although it can be a bit finicky to install,
many interesting things become possible once you've got it there.
The XML::LibXML library can parse things in HTML mode, not just XML
mode. In HTML mode, missing close tags are automatically deduced,
HTML entities are optional and error-corrected, and quotes around
attribute values are optional. All of these would be fatal to a
normal XML parser. The result of an HTML parsing is an in-memory DOM
that can then be accessed with XPath or DOM APIs. The advantage is
that the DOM stays in the library (C code) side of the picture until
requested, rather than becoming a bag of Perl objects.
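As a rough illustration of HTML mode, the sketch below parses the table fragment from earlier and queries it with XPath. It assumes a reasonably recent XML::LibXML, where load_html is the documented entry point (the interface available when this column was written may have differed). The recover => 2 setting asks the parser to repair what it can without printing warnings.

```perl
use strict;
use warnings;
use XML::LibXML;

my $html = '<table><tr><td><b>foo<td>bar</table>';

# Parse in HTML mode; recover => 2 means "recover silently".
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# The DOM can now be queried with XPath...
for my $td ($doc->findnodes('//td')) {
    printf "td contains: %s\n", $td->textContent;
}

# ...and serialized back out as HTML, with close tags supplied.
print $doc->toStringHTML;
```

The exact serialization depends on the installed libxml2, so no expected output is shown here.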
In my time trials, regardless of whether the HTML file was small or
huge, an HTML parse with XML::LibXML was 10 to 20 times faster
than the equivalent parse with HTML::TreeBuilder. This is good
news, because most of the time is spent recognizing the data and
building the tree, so reducing that gives us a big win.
So, once we build the DOM, it's a matter of walking the DOM, removing the forbidden elements and attributes, and then spitting the result out as HTML. And I've constructed a proof-of-concept module for that, which I'll describe shortly.
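The walk itself might look something like the following. This is only a sketch under stated assumptions, not the My_HTML_Filter module: the strip_html routine and the tiny %permitted table are hypothetical, and the result is serialized with toString, which uses XML conventions (an empty br would come out as <br/>).

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical allow-list: element name => { attribute name => 1 }
my %permitted = (
    p => {},
    b => {},
    a => { href => 1 },
);

sub strip_html {
    my ($html) = @_;
    my $doc = XML::LibXML->load_html(string => $html, recover => 2);
    my ($body) = $doc->findnodes('//body');
    return '' unless $body;

    # Comments are dropped outright.
    $_->unbindNode for $body->findnodes('.//comment()');

    # Visit every element below body, in document order.
    for my $el ($body->findnodes('.//*')) {
        my $allowed = $permitted{ lc $el->nodeName };
        if ($allowed) {
            # Permitted element: remove any attribute not on its list.
            for my $attr ($el->attributes) {
                my $name = $attr->nodeName;
                $el->removeAttribute($name) unless $allowed->{ lc $name };
            }
        }
        else {
            # Forbidden element: pull its children up, then remove it.
            my $parent = $el->parentNode;
            $parent->insertBefore($_, $el) for $el->childNodes;
            $parent->removeChild($el);
        }
    }
    return join '', map { $_->toString } $body->childNodes;
}

my $clean = strip_html(
    '<p>Hello <blink>world</blink><!-- secret --> '
  . '<a href=foo onclick="evil()">link</a>'
);
print "$clean\n";
```

Because the element list is collected before any changes are made, children pulled up from a forbidden element are still visited later in the same loop.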
To test my code, I needed a list representing a typical web-based
community system's permitted HTML elements and attributes. Since I
frequent the Perl Monastery at http://www.perlmonks.org, I decided
to grab their list of approved HTML for typical questions or answers.
I extracted the list, and put it into the center of [Listing one,
below].
Lines 1 through 3 of this program begin nearly every program I write,
turning on compiler restrictions, and disabling the normal STDOUT
buffering.
Line 5 pulls in the My_HTML_Filter module, containing my HTML
filtering code. This module is expected to be somewhere within my
@INC path. Because I was always invoking this program from the
current directory, I put the .pm file in the same directory for
testing. In a production system, I might have had to alter @INC to
access the locally installed module.
Line 7 gives the URL from which these elements and attributes are
extracted. Lines 9 through 49 create the hash of permitted elements
and attributes, as a nested hash. The first level of the hash has a
key for every valid element. The corresponding value is a hashref,
pointing to a second hash whose keys represent every valid
attribute for that element. The corresponding values for those keys
are simply the number 1, permitting a truth test rather than an
existence test for when we finally want to check for validity.
The code to create this hash from the ``here document'' is in lines 10
and 11. First, the data is split on newline, and then for each line,
a further split on whitespace puts the first word of the line into
$k, and the remaining words into @v. Then, two elements are
generated for each input line: the $k value, and a hashref of a
hash where the keys are all the @v elements and the values are all 1.
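The same construction can be tried on a shortened list. In this sketch the three-line table and the is_permitted helper are assumptions made for illustration; only the map and split expression follows the listing.

```perl
use strict;
use warnings;

# Same shape as lines 9 through 49 of the listing, with a shortened list.
my %permitted =
  map { my ($k, @v) = split; ($k, { map { $_, 1 } @v }) }
  split /\n/, <<'END';
a href name
b
p align class
END

# Hypothetical helper: check the element first, so that asking about an
# unknown element does not autovivify an entry for it.
sub is_permitted {
    my ($element, $attribute) = @_;
    my $attrs = $permitted{$element} or return 0;
    return 1 unless defined $attribute;
    return $attrs->{$attribute} ? 1 : 0;
}

print join(",", sort keys %permitted), "\n";   # a,b,p
print is_permitted("a", "href"), "\n";         # 1
print is_permitted("b", "href"), "\n";         # 0
print is_permitted("img"), "\n";               # 0
```

An element with no attributes, such as b, still gets a (true) reference to an empty hash, so the element-level test and the attribute-level test can both be plain truth tests.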
The list of elements and attributes given here is by no means promised to be safe. It just happens to be what is in use at the moment at the Perl Monastery, and has evolved over time.
Line 51 and beyond create a Test::More document, usually used in
testing a module within a distribution, but handy here while I was
developing and understanding the module. The no_plan in line 51
indicates that Test::More will count the number of tests and put
the ``plan'' for the tests at the end of the output rather than the
beginning.
Line 53 creates a filter object $f, passing it the permitted
elements and attributes hash. Line 54 tests $f to ensure that it's
actually an object of the intended type.
Lines 56 to 88 illustrate some of the transformations of this HTML stripper. Each is in the form of:
is($trial_text, $reference_text, $explanation)
The $trial_text comes from running the filter on the given string,
resulting in some HTML output. This is compared to the
$reference_text, which is what we are hoping the output resembles.
The $explanation describes the particular test. A sample run
of this part of the code looks like:
    ok 1 - The object isa My_HTML_Filter
    ok 2 - basic text gets paragraphed
    ok 3 - bogons gets stripped
    ok 4 - links are permitted
    ok 5 - attributes get quoted
    ok 6 - bad attributes get stripped
    ok 7 - comments get stripped
    ok 8 - tags get balanced
    ok 9 - b/i tags get balanced
    ok 10 - b/i tags get nested properly
    ok 11 - tags get lowercased
    ok 12 - br comes out as HTML not XHTML
This test list is by no means a full suite of tests that I would use for a production module, but shows the basics. Bad attributes and comments are removed, bad elements are stripped (and their contents pulled up in-line), close tags are automatically added according to HTML rules, and generally, life is good. The resulting HTML could be inserted into an output page safely.
And then the fun part, lines 90 to 97, which show me just how fast or
slow this code actually can be. I placed the home page for
http://www.stonehenge.com into a local file, and then brought the
contents into $homepage in line 91 (using the autovivified
filehandle mechanism introduced in Perl 5.6). I then run the stripper
on the text (about 8K as I'm testing this) until at least one CPU
second has passed, and report the number of passes per second that can
be achieved. On an 8K chunk of HTML (much larger than a typical
question or answer at the Monastery), I see about 40 to 50 passes per
second on my 1 GHz laptop. This is well within reasonable bounds,
presuming we cache the result in some nice place on a high-performance
website. Thus, the code is useful.
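For readers who want to try the timing technique without the filter module, here is a minimal sketch using the core Benchmark module. The strip_stub routine and the generated $html string are stand-ins invented for this example, so the resulting rate says nothing about the real filter.

```perl
use strict;
use warnings;
use Benchmark qw(countit timestr);

# Stand-in workload: a crude tag remover, NOT the My_HTML_Filter code.
sub strip_stub {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

# Synthetic input, a few kilobytes long.
my $html = "<p>" . ("some <b>bold</b> text " x 300) . "</p>";

# Run the code repeatedly until at least one CPU second has been used.
my $t = countit(1, sub { strip_stub($html) });

# Passes per CPU second; guard against a zero CPU reading.
my $rate = $t->iters / ($t->cpu_a || 1);
printf "%d passes in %s (about %.0f per CPU second)\n",
    $t->iters, timestr($t), $rate;
```

The timethese call in the listing does much the same thing, and a negative count there likewise means "run for at least this many CPU seconds".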
But how does it work? Tune in next time for the details!
Listings
    =1=     #!/usr/bin/perl
    =2=     use strict;
    =3=     $|++;
    =4=
    =5=     use My_HTML_Filter;
    =6=
    =7=     ## from http://www.perlmonks.org/index.pl?node_id=29281
    =8=
    =9=     my %PERMITTED =
    =10=      map { my($k, @v) = split; ($k, {map {$_, 1} @v}) }
    =11=      split /\n/, <<'END';
    =12=    a href name target class title
    =13=    b
    =14=    big
    =15=    blockquote class
    =16=    br
    =17=    center
    =18=    dd
    =19=    div class
    =20=    dl
    =21=    dt
    =22=    em
    =23=    font size color class
    =24=    h1
    =25=    h2
    =26=    h3
    =27=    h4
    =28=    h5
    =29=    h6
    =30=    hr
    =31=    i
    =32=    li
    =33=    ol type start
    =34=    p align class
    =35=    pre class
    =36=    small
    =37=    span class title
    =38=    strike
    =39=    strong
    =40=    sub
    =41=    sup
    =42=    table width cellpadding cellspacing border bgcolor class
    =43=    td width align valign colspan rowspan bgcolor height class
    =44=    th colspan width align bgcolor height class
    =45=    tr width align valign class
    =46=    tt class
    =47=    u
    =48=    ul
    =49=    END
    =50=
    =51=    use Test::More qw(no_plan);
    =52=
    =53=    my $f = My_HTML_Filter->new(\%PERMITTED) or die;
    =54=    isa_ok($f, "My_HTML_Filter");
    =55=
    =56=    is($f->strip(qq{Hello}),
    =57=       qq{<p>Hello</p>\n},
    =58=       "basic text gets paragraphed");
    =59=    is($f->strip(qq{<p><bogus>Thing}),
    =60=       qq{<p>Thing</p>\n},
    =61=       "bogons gets stripped");
    =62=    is($f->strip(qq{<a href="foo">bar</a>}),
    =63=       qq{<a href="foo">bar</a>\n},
    =64=       "links are permitted");
    =65=    is($f->strip(qq{<a href=foo>bar</a>}),
    =66=       qq{<a href="foo">bar</a>\n},
    =67=       "attributes get quoted");
    =68=    is($f->strip(qq{<a href=foo bogus=place>bar</a>}),
    =69=       qq{<a href="foo">bar</a>\n},
    =70=       "bad attributes get stripped");
    =71=    is($f->strip(qq{<p>What do <!-- comment -->you say?}),
    =72=       qq{<p>What do you say?</p>\n},
    =73=       "comments get stripped");
    =74=    is($f->strip(qq{<table><tr><td>Hi!}),
    =75=       qq{<table><tr><td>Hi!</td></tr></table>\n},
    =76=       "tags get balanced");
    =77=    is($f->strip(qq{<b><i>bold italic!}),
    =78=       qq{<b><i>bold italic!</i></b>\n},
    =79=       "b/i tags get balanced");
    =80=    is($f->strip(qq{<b><i>bold italic!</b></i>}),
    =81=       qq{<b><i>bold italic!</i></b>\n},
    =82=       "b/i tags get nested properly");
    =83=    is($f->strip(qq{<B><I>bold italic!</I></B>}),
    =84=       qq{<b><i>bold italic!</i></b>\n},
    =85=       "tags get lowercased");
    =86=    is($f->strip(qq{<h1>hey</h1>one<br>two}),
    =87=       qq{<h1>hey</h1>\n<p>one<br>two</p>\n},
    =88=       "br comes out as HTML not XHTML");
    =89=
    =90=    use Benchmark;
    =91=    my $homepage = do { open my $f, "homepage.html"; join "", <$f> };
    =92=
    =93=    timethese
    =94=      (-1,
    =95=       {
    =96=        strip_homepage => sub { $f->strip($homepage) }
    =97=       });