Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Perl Journal magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Perl Journal Column 02 (Jun 2003)

[Suggested title: ``Cleaning up your HTML (part 1)'']

The simplicity of HTML is sometimes deceiving. Sure, it's pretty easy for your average Perl hacker to set up a web-based bulletin board system, allowing people to come along and write comments. It's even tempting to allow those comments to contain HTML, rather than being escaped into monospaced <pre> purgatory. But ``there be dragons there'', as the old maps used to say.

The problem is that arbitrary HTML permits arbitrary activities being triggered by merely visiting the site, thanks to these fancy scriptable browsers. As reported in the security journals, these attacks are generally known as ``cross-site scripting''. They usually come in the form of a JavaScript chunk embedded in a webpage where at least part of the content can be controlled by arbitrary visitors, such as a guestbook or a web-based message system. Left unchecked, such attacks can unknowingly leak a person's credentials (such as cookies) to the bad guys, and that can lead to some pretty bad stuff.

Even without the issue of cross-site scripting, we still have to watch out for arbitary HTML and JavaScript that can trigger browser bugs, which can again lead to denial-of-service attacks or usurped credentials. While keeping up with the latest browser release usually prevents this, most people I know don't upgrade at the first notice, leading to a vulnerability window.

And then there's the just plain annoyances. People who put HTML ``start bold'' tags in without the end bold. Or worse yet, including a start comment marker without the matching end comment. This isn't always a malicious act: it could happen just as easily by accident.

Because there's so many ways to go wrong, people tend to just forbid all HTML, escape everything through an entity escaper, and leave it at that. But how do you permit some ``safe'' HTML while being very careful not to let ``dangerous'' HTML or comments into your code? For example, what if inline images were deemed to be annoying? How do you ensure that you are stripping all img elements?

I've seen a few solutions to tidy up the HTML, usually based on a series of regular-expression replacements (such as HTML::Sanitizer in the CPAN). But these often fail to consider the matching-tag or the implicit close-tag problems of HTML. For example, consider the valid HTML of:

  <table><tr><td><b>foo<td>bar</table>

In this case, the bolding really does end at the end of foo, so bar should be rendered as unbolded. But to know that, you have to know that the td element closes off the previous td element, and therefore also the b element as well. That's a bit hard to get into the regular expressions.

One all-encompassing solution is HTML::TreeBuilder from the CPAN. This code understands the nesting and optional closing tags of HTML, and wraps itself around HTML::Parser to find the tags and other syntax of an HTML document. Once we have a nice clean tree of properly parsed and nested HTML elements, we merely need to walk through the tree, throwing away the dangerous elements. As long as we don't mangle the tree, we should get properly nested tags out of the mix as well.

The problem with a solution based around HTML::TreeBuilder is that it is too expensive to use repeatedly (such as every time a page is reloaded). While HTML::Parser is pretty fast, HTML::TreeBuilder has to build a lot of heavily connected heavy Perl objects, at least one for every element of the tree. This kind of tree is slow to create, and slow to discard, so a heavily hit website would be bogged down in short order.

But, from the XML realm (of all places) comes another interesting solution, in the form of XML::LibXML, which is a wrapper around the GNOME libxml2 parser. Although it can be a bit finicky to install, many interesting things become possible once you've got it there.

The XML::LibXML library can parse things in HTML mode, not just XML mode. In HTML mode, missing close tags are automatically deduced, HTML entities are optional and error-corrected, and quotes around attribute values are optional. All of these would be fatal to a normal XML parser. The result of an HTML parsing is an in-memory DOM that can then be accessed with XPath or DOM APIs. The advantage is that the DOM stays in the library (C code) side of the picture until requested, rather than a bag of Perl objects.

In my time trials, regardless of whether the HTML file was small or huge, an HTML parse with XML::LibXML was 10 to 20 times faster than the equivalent parse with HTML::TreeBuilder. This is good news, because most of the time is spent recognizing the data and building the tree, so reducing that gives us a big win.

So, once we build the DOM, it's a matter of walking the DOM, removing the forbidden elements and attributes, and then spitting the result out as HTML. And I've constructed a proof-of-concept module for that, which I'll describe shortly.

To test my code, I needed a list representing a typical web-based community system's permitted HTML elements and attributes. Since I frequent the Perl Monastery at http://www.perlmonks.org, I decided to grab their list of approved HTML for typical questions or answers. I extracted the list, and put it into the center of [Listing one, below].

Lines 1 through 3 of this program begin nearly every program I write, turning on compiler restrictions, and disabling the normal STDOUT buffering.

Line 5 pulls in the My_HTML_Filter module, containing my HTML filtering code. This module is expected to be somewhere within my @INC path. Because I was always invoking this program from the current directory, I put the .pm file in the same directory for testing. In a production system, I might have had to alter @INC to access the locally installed module.

Line 7 gives the URL from which these elements and attributes are extracted. Lines 9 through 49 create the hash of permitted elements and attributes, as a nested hash. The first level of the hash has a key for every valid element. The corresponding value is a hashref, pointing to a second hash of where the keys represent every valid attribute for that element. The corresponding values for those keys are simply the number 1, permitting a truth test rather than an existence test for when we finally want to check for validity.

The code to create this hash from the ``here document'' is in lines 10 and 11. First, the data is split on newline, and then for each line, a further split on whitespace puts the first word of the line into $k, and the remaining words into @v. Then, two elements are generated for each input element: the $k value, and a hashref of a hash where the keys are all the @v elements and the values are all 1.

The list of elements and attributes given here is by no means promised to be safe. It just happens to be what is in use at the moment at the Perl Monastery, and has evolved over time.

Line 51 and beyond create a Test::More document, usually used in testing a module within a distribution, but handy here while I was developing and understanding the module. The no_plan in line 51 indicates that Test::More will count the number of tests and put the ``plan'' for the tests at the end of the output rather than the beginning.

Line 53 creates a filter object $f, passing it the permitted elements and attributes hash. Line 54 tests $f to ensure that it's actually an object of the intended type.

Lines 56 to 88 illustrate some of the transformations of this HTML stripper. Each is in the form of:

  is($trial_text, $reference_text, $explanation)

The $trial_text comes from running the filter on the given string, resulting in some HTML output. This is compared to the $reference_text, which is what we are hoping the output resembles. The $explanation describes the particular test. A sample run of this part of the code looks like:

    ok 1 - The object isa My_HTML_Filter
    ok 2 - basic text gets paragraphed
    ok 3 - bogons gets stripped
    ok 4 - links are permitted
    ok 5 - attributes get quoted
    ok 6 - bad attributes get stripped
    ok 7 - comments get stripped
    ok 8 - tags get balanced
    ok 9 - b/i tags get balanced
    ok 10 - b/i tags get nested properly
    ok 11 - tags get lowercased
    ok 12 - br comes out as HTML not XHTML

This test list is by no means a full suite of tests that I would use for a production module, but shows the basics. Bad attributes and comments are removed, bad elements are stripped (and their contents pulled up in-line), close tags are automatically added according to HTML rules, and generally, life is good. The resulting HTML could be inserted into an output page safely.

And then the fun part, lines 90 to 97, showing me just how fast or slow this code actually can be. I placed the home page for http://www.stonehenge.com into a local file, and then bring the contents into $homepage in line 91 (using the autovivified filehandle mechanism new to Perl 5.8). I then run the stripper on the text (about 8K as I'm testing this) until a CPU second has passed, and report the number of passes per second that can be achieved. On an 8K chunk of HTML (much larger than a typical question or answer at the Monastery), I see about 40 to 50 results per second on my 1Ghz laptop. This is well within reasonable bounds, presuming we cache the result in some nice place on a high-performance website. Thus, the code is useful.

But how does it work? Tune in next time for the details!

Listings

        =1=     #!/usr/bin/perl
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use My_HTML_Filter;
        =6=     
        =7=     ## from http://www.perlmonks.org/index.pl?node_id=29281
        =8=     
        =9=     my %PERMITTED =
        =10=      map { my($k, @v) = split; ($k, {map {$_, 1} @v}) }
        =11=      split /\n/, <<'END';
        =12=    a href name target class title
        =13=    b 
        =14=    big 
        =15=    blockquote class
        =16=    br
        =17=    center 
        =18=    dd 
        =19=    div class
        =20=    dl 
        =21=    dt 
        =22=    em 
        =23=    font size color class
        =24=    h1
        =25=    h2
        =26=    h3 
        =27=    h4 
        =28=    h5 
        =29=    h6 
        =30=    hr
        =31=    i 
        =32=    li 
        =33=    ol type start
        =34=    p align class
        =35=    pre class
        =36=    small 
        =37=    span class title
        =38=    strike 
        =39=    strong 
        =40=    sub 
        =41=    sup 
        =42=    table width cellpadding cellspacing border bgcolor class
        =43=    td width align valign colspan rowspan bgcolor height class
        =44=    th colspan width align bgcolor height class
        =45=    tr width align valign class
        =46=    tt class
        =47=    u 
        =48=    ul 
        =49=    END
        =50=    
        =51=    use Test::More qw(no_plan);
        =52=    
        =53=    my $f = My_HTML_Filter->new(\%PERMITTED) or die;
        =54=    isa_ok($f, "My_HTML_Filter");
        =55=    
        =56=    is($f->strip(qq{Hello}),
        =57=       qq{<p>Hello</p>\n},
        =58=       "basic text gets paragraphed");
        =59=    is($f->strip(qq{<p><bogus>Thing}),
        =60=       qq{<p>Thing</p>\n},
        =61=       "bogons gets stripped");
        =62=    is($f->strip(qq{<a href="foo">bar</a>}),
        =63=       qq{<a href="foo">bar</a>\n},
        =64=       "links are permitted");
        =65=    is($f->strip(qq{<a href=foo>bar</a>}),
        =66=       qq{<a href="foo">bar</a>\n},
        =67=       "attributes get quoted");
        =68=    is($f->strip(qq{<a href=foo bogus=place>bar</a>}),
        =69=       qq{<a href="foo">bar</a>\n},
        =70=       "bad attributes get stripped");
        =71=    is($f->strip(qq{<p>What do <!-- comment -->you say?}),
        =72=       qq{<p>What do you say?</p>\n},
        =73=       "comments get stripped");
        =74=    is($f->strip(qq{<table><tr><td>Hi!}),
        =75=       qq{<table><tr><td>Hi!</td></tr></table>\n},
        =76=       "tags get balanced");
        =77=    is($f->strip(qq{<b><i>bold italic!}),
        =78=       qq{<b><i>bold italic!</i></b>\n},
        =79=       "b/i tags get balanced");
        =80=    is($f->strip(qq{<b><i>bold italic!</b></i>}),
        =81=       qq{<b><i>bold italic!</i></b>\n},
        =82=       "b/i tags get nested properly");
        =83=    is($f->strip(qq{<B><I>bold italic!</I></B>}),
        =84=       qq{<b><i>bold italic!</i></b>\n},
        =85=       "tags get lowercased");
        =86=    is($f->strip(qq{<h1>hey</h1>one<br>two}),
        =87=       qq{<h1>hey</h1>\n<p>one<br>two</p>\n},
        =88=       "br comes out as HTML not XHTML");
        =89=    
        =90=    use Benchmark;
        =91=    my $homepage = do { open my $f, "homepage.html"; join "", <$f> };
        =92=    
        =93=    timethese
        =94=      (-1,
        =95=       {
        =96=        strip_homepage => sub { $f->strip($homepage) }
        =97=       });

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.