Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Web Techniques Column 22 (February 1998)

I like to share things. Someone once asked me if they could have a tar of a portion of my website, so that they could read in on their local disk. At first, I said ``no problem, I'll just launch tar right here''. But then I realized that while the files would be intact, the links would be a mess. Like most people, I had some absolute links, some relative links that pointed within the tree to be moved, and some relative links that pointed outside the part to be moved.

What I wanted was a file tree that had all internal links (that is, links to other parts of this tree) to be relative, so that it didn't matter where it went on my friend's server (or even as a tree of local files accessed with file: URLs). And all links outside that tree to be absolute, so that my friend's browser would just transparently pick up the web path and run with it.

I started hacking out something with a few regular expressions, then quickly realized I was rebuilding the HTML::Parse module from the very wonderful LWP package (found in the CPAN). So, after scuttling the early effort, I decided to make a very powerful and robust program using that module as a base.

The trick with HTML::Parse was to construct a series of ``callbacks''. As that module is parsing an HTML file, it recognizes start tags, end tags, comments, and so on. Well, all I was interested in was the start tags (for the URLs in A and IMG attributes), but I still needed to construct a ``passthrough'' callback for all the rest. I started by just adding enough lines so that the output was the same as the input for the files I was testing. Then I added some recognition for the interesting tags, and then some way to walk through a file hierarchy.

And the result is in [listing one, below].

Lines 1 through 3 begin nearly every program I write, turning on warnings, disabling output buffering, and turning on useful restrictions. Lines 4 though 8 pull in the various modules I'll be using. (I'll describe them as I need them later.)

Lines 10 through 16 define the configuration constants. $TOP is the Unix path to the top level of the tree. $TOP_URL is the corresponding URL that would have fetched the same location -- this is important for resolving already relative links. (Notice here I'm using my on-line archive of my prior Web Techniques columns.) And $OUT is where we want the translated tree to be placed. This directory (and all subdirectories) will be created as needed.

The WITHIN subroutine will be called to ask if a particular URL is ``within'' this tree or not. Any URL for which WITHIN returns true will be made relative to the base of the output tree. All other URLs will be made absolute. This is useful if some of the URLs are not really files within the $TOP hierarchy, but rather active documents that must be fetched from the original server to be valid (like CGI scripts).

Lines 18 through 80 subclass the HTML::Parser class (defined in LWP), creating a specialed parser called MyParser. Since an HTML::Parser is basically just a skeleton anyway, this is generally necessary. You could stick all this stuff into a separate .pm file, but this is simpler.

Line 20 defines the package for all variables and subroutines until the end of the BEGIN block. Lines 21 through 23 pull in the various modules used only by MyParser. Lines 25 and 26 declare that MyParser inherits from (and therefore extends) HTML::Parser.

Lines 28 through 33 override the standard HTML::Parser->new method to allow MyParser to have additional parameters on the new method. We're capturing three additional values: a ``base url'' string, a filehandle for output, and a coderef for the ``within'' function. These will be instance variables in $self in addition to whatever HTML::Parser also includes for instance variables. Note the double underscore in an attempt to ensure no collision with the parent-class's instance variables.

The HTML::Parser class interface specifies five callbacks: declaration, start, end, text, and comment. These will be called as each HTML construct is recognized. It's up to MyParser to provide definitions for each of them. In fact, four of these five callbacks do nothing but echo the original data, because we don't need to alter anything but the URLs in A and IMG tags.

Lines 35 through 39 handle the declarations (like <!SOMETHING<gt>) that are not otherwise comments. Here, we take the string in $decl and pass it back out by printing to the filehandle held in the __out instance variable.

Lines 41 through 59 handle the trickiest part: rewriting some of the start tags. Lines 42 through 44 grab the input data, in preparation for the real work.

Line 45 prints out the beginning of the tag. Lines 46 through 57 cycle through for all the attributes of the tag. Each attribute name goes into $_, which we print in line 47. Line 48 captures the attribute value into a variable.

Lines 49 through 54 handle the rewriting of the URLs. If we're looking at a HREF attribute in an A tag, or a SRC attribute in a IMG tag, then the expression in line 49 is true. (If you want to include other tag/attribute pairs, you can extend this list easily.)

If we need to rewrite it, line 50 creates an absolute URL object (using the url function found in URI::URL) based on the specified URL of this file. This URL is then tested in line 51 by calling the ``within'' function to see if it needs to be absolute or relative. If it's absolute, we leave it alone. If it must be made relative, this is handled in line 52.

In line 55, we now have either the original $val attribute value, or some URL made completely absolute or relative from the steps above. In either case, we'll now need to encode any entities in the value, since they were decoded before they ended up in the attribute hash to us. And line 56 closes the attribute value.

Finally, line 58 closes the entire tag. Yes, finally.

Lines 61 to 65 handle the end tags. There's nothing to do for these, so it's a null translation.

Lines 67 to 71 handle the ``text''. Again, this is already entity encoded (ampersands are already &amp;, and so on), so there's no need to reencode it.

Lines 73 through 76 handle the ``comments''. Note that server-side include (SSI) comments are simply passed through. Perhaps you might want to examine $comment to see if it starts with #, and if so, strip it or warn the invoker.

Line 82 invokes find (defined in File::Find), passing it the name of the callback subroutine along with a top level directory. Note that the directory has a /./ suffix, which doesn't change its meaning, but allows the callback to distinguish between the original path and the subpath below it.

Lines 84 to 98 define the callback subroutine, called once for each pathname below the designated original path.

Line 85 aborts on non-files. Note that symlinks to files will be extracted as if they are original files, but symlinks to directories will not (because find will not walk them).

Line 86 rips apart the $File::Find::name, locating the relative pathname below the top directory into $rel_name. We need this to construct the following two variables. Lines 87 and 88 construct $src_url (the source URL for this particular file), and $out_file (the output filename for this file).

Line 89 creates a directory for $out_file. Note that we're using dirname, defined in File::Basename, and mkpath, defined in File::Path. Both of these libraries come with the standard Perl library, and can save you time once you are aware of them. I'm passing mkpath the ``verbose'' flag, so it'll print (to STDOUT) a listing of directories it is making as it makes them.

Line 90 is a trace for me to see what files are being translated. The ... helps distinguish them from the directory names printed in line 89.

Line 91 detects whether we're dealing with an HTML file or not. The test is rather simple, and might need to be trained better for odd layouts. If the filename ends in ``.html'', it's an HTML file, and needs to be parsed.

HTML files get handled in lines 92 through 94. Lines 92 and 93 create an output file handle by calling IO::File's new method, passing it an open-style argument. This should create a filehandle object that we can pass into the parser. (What you really don't wanna do here is confuse $_ with $out_file, which I did during testing. I blew away all the HTML files for www.stonehenge.com, replacing them with empty files. Ouch. Never test on live data!)

Line 94 is interestingly complex. First, we call the new method in the class MyParser (defined above). Three parameters are passed in: the URL of the HTML file we're looking at, the output filehandle just created, and a coderef to the WITHIN subroutine (defined at the top of this file). This method should return a MyParser object, which is then immediately sent the parse_file method, passing it the current-directory-relative filename.

The result will be a parsing of the file, along with the appropriate callbacks into the various methods. These will cause the chunks of the original HTML file to be sent to the handle in $out.

Comparatively speaking, the handling of a non-HTML file in line 96 is trivial. The file is copied using the copy function defined in File::Copy from the source file to the output file.

Lots of dense code. Hope you followed all that. But more importantly, notice all the existing code I'm leveraging (check out all the use directives). This let me write this program in an hour or so, instead of spending all day figuring out each part. Such is the Power of Perl, to help make seemingly complex tasks completely doable.

Listing One

        =1=     #!/home/merlyn/bin/perl -w
        =2=     $|++;
        =3=     use strict;
        =4=     use File::Basename;
        =5=     use File::Copy;
        =6=     use File::Find;
        =7=     use File::Path;
        =8=     use IO::File;
        =9=     
        =10=    my $TOP = "/home/merlyn/Html/merlyn/WebTechniques";
        =11=    my $TOP_URL = "http://www.stonehenge.com/merlyn/WebTechniques";;
        =12=    my $OUT = "/home/merlyn/Web/Reltree/Out";
        =13=    
        =14=    sub WITHIN {
        =15=      $_[0] =~ m{ ^\Qhttp://www.stonehenge.com/merlyn/WebTechniques/\E }xs;
        =16=    }
        =17=    
        =18=    ## "use MyParser;" ##
        =19=    BEGIN {
        =20=      package MyParser;
        =21=      use HTML::Parser;
        =22=      use HTML::Entities ();
        =23=      use URI::URL;
        =24=    
        =25=      use vars qw(@ISA);
        =26=      @ISA = qw(HTML::Parser);
        =27=    
        =28=      sub new {
        =29=        my $pack = shift;
        =30=        my $self = $pack->SUPER::new;
        =31=        @{$self}{qw(__base __out __within)} = @_;
        =32=        $self;
        =33=      }
        =34=    
        =35=      sub declaration {
        =36=        my $self = shift;
        =37=        my ($decl) = @_;
        =38=        $self->{__out}->print("<!$decl>");
        =39=      }
        =40=    
        =41=      sub start {
        =42=        my $self = shift;
        =43=        my ($tag, $attr, $attrseq, $origtext) = @_;
        =44=        my $out = $self->{__out};
        =45=        $out->print("<$tag");
        =46=        for (keys %$attr) {
        =47=          $out->print(" $_=\"");
        =48=          my $val = $attr->{$_};
        =49=          if ("$tag $_" =~ /^(a href|img src)$/) {
        =50=            $val = url($val)->abs($self->{__base},1);
        =51=            if ($self->{__within}->($val)) {
        =52=              $val = $val->rel($self->{__base});
        =53=            }
        =54=          }
        =55=          $out->print(HTML::Entities::encode($val, '<>&"'));
        =56=          $out->print('"');
        =57=        }
        =58=        $out->print(">");
        =59=      }
        =60=    
        =61=      sub end {
        =62=        my $self = shift;
        =63=        my ($tag) = @_;
        =64=        $self->{__out}->print("</$tag>");
        =65=      }
        =66=    
        =67=      sub text {
        =68=        my $self = shift;
        =69=        my ($text) = @_;
        =70=        $self->{__out}->print("$text");
        =71=      }
        =72=    
        =73=      sub comment {
        =74=        my $self = shift;
        =75=        my ($comment) = @_;
        =76=        $self->{__out}->print("<!-- $comment -->");
        =77=      }
        =78=    
        =79=    }
        =80=    ## end "use MyParser;" ##
        =81=      
        =82=    find(\&translate, "$TOP/./");
        =83=    
        =84=    sub translate {
        =85=      return unless -f;
        =86=      (my $rel_name = $File::Find::name) =~ s{ .*/\./ }{}xs;
        =87=      my $src_url = "$TOP_URL/$rel_name";
        =88=      my $out_file = "$OUT/$rel_name";
        =89=      mkpath dirname($out_file), 1;
        =90=      print "... $out_file\n";
        =91=      if (/\.html$/) {
        =92=        my $out = IO::File->new(">$out_file")
        =93=          or die "Cannot create $out_file: $!";
        =94=        MyParser->new($src_url,$out,\&WITHIN)->parse_file($_);
        =95=      } else {
        =96=        copy $_, $out_file;
        =97=      }
        =98=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.