Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 35 (Apr 2002)

[suggested title: Serving up a piece of tar]

The great thing about web servers is that they can serve more than web pages. They can serve stuff. Sometimes, that stuff is inside tarballs, those little bundles of joy that efficiently hold many files (sometimes numbering in the thousands) for convenient transferring or archival.

A recent message on the Perl Monastery (www.perlmonks.org) inspired me. The person, known as Screamer, gave a little note titled Serving tarball contents as part of your webspace. It was very short:

    #!/usr/bin/perl -w
    use strict;
    use CGI;
    use CGI::Carp qw(fatalsToBrowser);
    use Archive::Tar;
    my $cgi = CGI->new();
    my $tar = Archive::Tar->new('foobar.tar.gz');
    my $file = $cgi->path_info();
    $file =~ s|^/||;
    print $cgi->header(), $tar->get_content($file);

The purpose is to permit /cgi/tarserver/fred/barney/betty.txt to return the file fred/barney/betty.txt from inside a tar archive. I liked the code, but Screamer went on to give a few limitations, and I thought of a few of my own.

So, I sat down and wrote a program that I thought would be a bit more flexible and useful, and I present it in [listing one, below]. In particular, I wanted to serve many files from many different tarballs, and to browse the files within a tarball instead of having to get the filename correct on the first try. As always, this is a work-in-progress, progressed to the point where I want to show it off, but it could still use some features and a few more once-overs before being put into production.

Lines 1 through 3 start nearly every program I write, enabling warnings for debugging, turning on the common restrictions (variables need to be declared, references cannot be soft references, and barewords are not allowed), and disabling the buffering on standard output.

Line 5 pulls in the CGI::Carp module for debugging, turning all those die messages into browser responses, making it a bit easier for me to debug through my browser. As this program is still a demonstration, and not a finished product, my next step would be to clean up those dies, replacing them with reasonable responses rather than just aborting. But for now, this will do.

Lines 7 through 13 define two configuration items: the location of the tar files, and a subroutine to select which files from that directory are permitted. The subroutine returns true for tar files, tar-gz files, and tgz files, and rejects anything that begins with a dot for good measure.
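
For a quick sanity check, you could exercise that filter against a few hypothetical names (the subroutine here is copied from the listing):

    sub VALID {
      local $_ = shift;
      /(\.tgz|\.tar(\.gz)?)\z/ && !/\A\./;
    }
    for my $name (qw(archive.tar archive.tar.gz archive.tgz .hidden.tgz notes.txt)) {
      printf "%-16s %s\n", $name, VALID($name) ? "served" : "rejected";
    }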

Line 15 pulls in the CGI module, with all the shortcuts enabled as subroutines rather than method calls. This is slightly more expensive than the object-oriented interface, but I'm willing to trade a little CPU time to save programmer typing time.

Lines 17 and 18 extract the ``path info'' data. The CGI script will be repeatedly invoked, with the ``parameter'' coming from information trailing the name, as if the script name were a directory. path_info() provides the data after the name, from which I strip an initial slash if present. Then a simple split pulls it apart.
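
To make that concrete, suppose the script is invoked with a hypothetical URL ending in /tarserver/foo.tgz/fred/barney.txt. The extraction then proceeds like this:

    # path_info() would return "/foo.tgz/fred/barney.txt" here
    my $path = "/foo.tgz/fred/barney.txt";
    $path =~ s/\A\///;               # strip the leading slash
    my @path = split '/', $path;     # ("foo.tgz", "fred", "barney.txt")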

Line 20 declares @choices, which will contain a list of subentries below the current path (if any) for the HTML code beginning in line 82 to display.

If some path info has been included, line 22 notes that, and so we'll need to look within an archive. The first step of that path must name a tar file, with a valid name (checked in line 23) that exists and is readable (checked in line 25).

Parsing through the contents of an archive for the list of names is relatively expensive (since it also typically involves a decompression), so we want to cache that. The @names variable declared in line 28 will most likely come from that cache, except for the first hit.

Line 29 brings in the Cache::FileCache module (found in the CPAN) to handle the caching interface. Lines 31 to 36 create the cache object. We cache for only 10 minutes, because the purpose of this cache is to avoid an expensive re-parse of a file that is likely being browsed through multiple hits over a short period of time. There's no point in caching it longer, and that might interfere if the file is updated from time to time. Once an hour, one lucky user also gets to run the cleaning cycle to erase old cached items.

If the item is found in the cache (in line 37), then we return that list for @names. If not, we'll have to parse the tar archive, and that's handled in lines 40 to 45. Line 40 pulls in Archive::Tar (found in the CPAN), which automatically uses Compress::Zlib for tar-gz files. Line 43 lists the archive. Line 44 stores the listing into the cache so we don't have to do this again for another 10 minutes, and line 45 returns the list to become @names.
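
Reduced to its essentials, this is a classic cache-or-compute pattern. Here's a minimal standalone sketch of that pattern, with a placeholder archive path:

    use Cache::FileCache;
    use Archive::Tar;

    my $cache = Cache::FileCache->new
      ({namespace => 'tarserver', default_expires_in => '10 minutes'})
      or die "Cannot create cache";
    my $tarchive = "/tmp/example.tar.gz";        # placeholder path
    my @names = do {
      if (my $names = $cache->get($tarchive)) {
        @$names;                                 # cache hit: reuse the listing
      } else {
        my @n = Archive::Tar->list_archive($tarchive)
          or die "Cannot list archive $tarchive";
        $cache->set($tarchive, \@n);             # cache miss: parse and remember
        @n;
      }
    };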

Once we have the contents of the archive, we have to take the remaining path steps (if any) and discard any names that don't match them. Lines 49 to 66 walk through the names, simultaneously selecting those that match and discarding the common prefix, with most of the magic happening in line 50. The map attempts to match that ugly regular expression against each item. If the match is successful, the value of $1 is passed to the result. If the match fails, nothing results, and the item disappears. Since $1 is the part of the string after the slash, the prefix is automatically stripped. If there's no trailing item, then $1 is undef, which we'll use in a moment.
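
A quick trace shows how that map behaves. With a hypothetical name list, and fred standing in for the current path step:

    my @names = ('fred', 'fred/barney.txt', 'fred/dino/', 'wilma.txt');
    my $wanted = 'fred';                         # stands in for $path[$step]
    @names = map /\A\/?\Q$wanted\E(?:\/(.*))?\z/s, @names;
    # 'fred'            matches with no trailing part, yielding undef
    # 'fred/barney.txt' matches, yielding $1 = 'barney.txt'
    # 'fred/dino/'      matches, yielding $1 = 'dino/'
    # 'wilma.txt'       fails to match, and disappears
    # @names is now (undef, 'barney.txt', 'dino/')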

Line 51 takes care of a path given to us that doesn't map to any entries: if the set of names in @names ever reduces to empty, there can't be any matches, so the path must have been contrived.

Line 52 detects any undef value in the list. If one is present, then the optional trailing part of the regular expression in line 50 must have failed to match, and thus we've matched an entire name exactly. Line 53 ensures that we haven't matched a name while still having trailing path steps left over. If we're at the final step, it's time to grab the contents, using Archive::Tar again. Line 56 opens the archive, line 58 constructs the path, and line 59 grabs the contents into a scalar.

The HTTP protocol wants a MIME type delivered in the header, so we'll bring in the File::MMagic module to help identify the type (also from the CPAN). This type is used in line 64, along with sending out the contents, and we're done delivering the specific file.
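
Used standalone, the type detection looks like this (the sample content is just an illustration):

    use File::MMagic;

    my $contents = "<html><body>hello</body></html>\n";
    my $mimetype = File::MMagic->new->checktype_contents($contents);
    print "$mimetype\n";    # something like "text/html"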

But, if we don't have a specific file, we fall out of the loop and set up @choices. The hash %choices created in line 70 will have all the files left over from walking the name list, as well as the top level of any subdirectory, indicated with a trailing slash. The trailing slash doesn't mess up any of the logic (somewhat by accident rather than by design; I was pleasantly surprised).
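
To see why the trailing slash works out, trace the hash construction against a small hypothetical leftover name list:

    my @names = ('README', 'src/main.c', 'src/util.c', 'docs/');
    my %choices = ();
    $choices{$_}++ for map /\A([^\/]+\/?)/, @names;
    my @choices = sort keys %choices;
    # @choices is now ('README', 'docs/', 'src/')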

If no path was included, lines 77 to 79 set up @choices as being all the tar files in the directory.

Line 82 prints out a normal HTML header, and a simple ``h1'' for a label.

Lines 84 to 93 show the ``path'' we're currently browsing. The first link, labeled Top, will bring us back to the list of tar files, and is also the initial entry point. Otherwise, each step in the path list becomes an individual link to bring us to that place in the hierarchy: a fairly standard and intuitive interface. I'm very careful to escape the HTML entities (using the CGI.pm subroutine) in case a filename contains HTML-sensitive characters (particularly an ampersand).
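
The escaping matters for names like this one:

    use CGI qw(escapeHTML);

    print escapeHTML("fish & chips.txt"), "\n";
    # prints: fish &amp; chips.txt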

Lines 95 through 99 show the sublinks. At the top level, this is a list of tar files in the designated directory. If one of those is clicked, the contents are determined, and the top-level directories or files of that archive are displayed. As each directory is selected, we reuse the cached listing to generate the subdirectories. If a file is instead selected, then its content is displayed, much like the normal index-generation facilities of a typical web server. An unordered list (typically rendered with bullets) organizes the display.

Line 101 ends the HTML output, and we're done!

The tar archive contains information about size and timestamps which we're ignoring. The Archive::Tar interface can give those to us nicely, so one improvement could be to include that in the output. Also, for simple security, I've required all the tar files to be in the same directory, but there's nothing stopping the path from being partially a list of qualified Unix directories, then the archive, then some more directories within the archive, then the final file. Beware that you don't permit dot-dot when you do this, though, or you could end up serving your entire disk.
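
For the size-and-timestamp improvement, a sketch might start like this, assuming a version of Archive::Tar whose get_files method returns Archive::Tar::File objects (and using a placeholder archive path):

    use Archive::Tar;

    my $at = Archive::Tar->new("/tmp/example.tar.gz")
      or die "Cannot open archive";
    for my $file ($at->get_files) {
      # name, size, and mtime are methods on Archive::Tar::File objects
      printf "%-40s %8d %s\n",
        $file->name, $file->size, scalar localtime $file->mtime;
    }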

I hope this program illustrates a couple of nice ideas for your own use, or perhaps simply provides you a quick and dirty tool to browse a few tar-gz's you have lying around. Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use CGI::Carp qw(fatalsToBrowser); # DEBUG only
        =6=     
        =7=     ## begin config
        =8=     my $DIR = "/home/merlyn/Web/Tarserver";
        =9=     sub VALID {
        =10=      local $_ = shift;
        =11=      /(\.tgz|\.tar(\.gz)?)\z/ && !/\A\./;
        =12=    }
        =13=    ## end config
        =14=    
        =15=    use CGI qw(:all);
        =16=    
        =17=    (my $path = path_info()) =~ s/\A\///;
        =18=    my @path = split '/', $path;
        =19=    
        =20=    my @choices;
        =21=    
        =22=    if (@path) {                    # first element must be tar.gz
        =23=      die "bad tar name: $path[0]" unless VALID($path[0]);
        =24=      my $tarchive = "$DIR/$path[0]";
        =25=      die "missing tarchive: $tarchive" unless -f $tarchive and -r $tarchive;
        =26=    
        =27=      ## must look in contents now
        =28=      my @names = do {
        =29=        require Cache::FileCache;
        =30=    
        =31=        my $cache = Cache::FileCache->new
        =32=          ({namespace => 'tarserver',
        =33=            username => 'nobody',
        =34=            default_expires_in => '10 minutes',
        =35=            auto_purge_interval => '1 hour',
        =36=           }) or die "Cannot connect to cache";
        =37=        if (my $names = $cache->get($tarchive)) {
        =38=          @$names;
        =39=        } else {
        =40=          require Archive::Tar;
        =41=    
        =42=          die "Cannot list archive $tarchive"
        =43=            unless my @n = Archive::Tar->list_archive($tarchive);
        =44=          $cache->set($tarchive, \@n);
        =45=          @n;
        =46=        }
        =47=      };
        =48=    
        =49=      for my $step (1..$#path) {
        =50=        @names = map /\A\/?\Q$path[$step]\E(?:\/(.*))?\z/s, @names;
        =51=        die "no such name" unless @names;
        =52=        if (grep !defined $_, @names) {
        =53=          die "trailing stuff after name" if $step != $#path;
        =54=          require Archive::Tar;
        =55=    
        =56=          my $at = Archive::Tar->new($tarchive)
        =57=            or die "Cannot open archive $tarchive";
        =58=          my $file = join "/", @path[1..$#path];
        =59=          defined(my $contents = $at->get_content($file))
        =60=            or die "Cannot get $file from $tarchive";
        =61=    
        =62=          require File::MMagic;
        =63=          my $mimetype = File::MMagic->new->checktype_contents($contents);
        =64=          print header($mimetype), $contents;
        =65=          exit 0;
        =66=        }
        =67=      }
        =68=    
        =69=      {
        =70=        my %choices = ();
        =71=        $choices{$_}++ for map /\A([^\/]+\/?)/, @names;
        =72=    
        =73=        @choices = sort keys %choices;
        =74=      }
        =75=    
        =76=    } else {                        # choose a top-level item
        =77=    opendir D, $DIR or die "Cannot opendir $DIR: $!";
        =78=      @choices = sort grep VALID($_), readdir D;
        =79=      closedir D;
        =80=    }
        =81=      
        =82=    print header('text/html'), start_html('tar server'), h1('tar server');
        =83=    
        =84=    ## show path
        =85=    print "from ", a({href => url()}, "Top");
        =86=    {
        =87=      my $link = "";
        =88=      for (@path) {
        =89=        $link .= "/$_";
        =90=        print " / ", a({href => url().$link}, escapeHTML("$_"));
        =91=      }
        =92=    }
        =93=    print br;
        =94=    
        =95=    ## show sublinks
        =96=    my $prefix = @path ? join("/", @path, "") : "";
        =97=    print ul(map {
        =98=      li(a({href => url()."/$prefix$_"}, escapeHTML($_)));
        =99=    } @choices);
        =100=   
        =101=   print end_html;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc. of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.