Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Web Techniques Column 52 (Aug 2000)

[suggested title: Selectable downloads via generated tar-gz files]

There are probably tens of thousands of compressed tar archives out there on the wild wild web to download. I can't imagine how much time, even on a high speed connection, it would take to download them all. But even if the publisher of the information carefully segregated the information into a reasonable bundle, sometimes I really want only part of the data. However, I'm forced to download the entire thing (perhaps over a slow connection in a hotel room, as I often am), just to discard the parts I don't want.

That is, unless the publisher of the data provided a way to build a custom compressed archive, with only the files or directories that I choose. And that's what we have for this month: a CGI program that lets a user choose first a distribution, and then individual files from that distribution, on an item by item basis. Once the choice is made, a specific .tar.gz file is made just for that user. And the result is in [listing one, below].

This month's column idea was suggested by fellow Stonehenge Perl trainer and Usenet Poster Extraordinare, the one and only Tom Phoenix (rootbeer@redcat.com), based on similar code he wrote for me to handle the downloads of exercise data and answers for our on-site training classes. I wrote this code from scratch though, so if it isn't exactly what he was suggesting, that's my fault.

Line 1 would have had the preferred -Tw flags after the path to Perl, but I ran into unavoidable problems with both taint mode and warnings. First, the standard File::Find module in the distribution is not ``taint safe'', so that's a loser. (I think this was corrected in Perl 5.6, but I haven't started using that version on my production site yet.) Second, I'm using two variables from File::Find, and with warnings enabled, I get the ugly ``used only once'' warning, annoying at best.

Line 2 turns on compiler restrictions, forcing me to declare my variables (no use of a default package), disabling barewords (no ``Perl poetry mode''), and symbolic references (no variables that contain other variable names).

Line 3 unbuffers STDOUT, ensuring that any CGI header I've generated appears before any program I fork, needed here for the tar launch.

Line 5 pulls in the standard CGI::Pretty module, which has the same parameters as CGI.pm, but generates nicely indented HTML. It's a little slower to run, but the HTML was fairly small from this program, and I wanted to be able to read it easier. The :standard parameter generates the function shortcuts, rather than requiring us to use the object-oriented interface, which seems to involve a lot more typing for not a lot of real gain.

Line 6 pulls in one of my favorite modules, CGI::Carp (also found in any recent standard Perl distribution). Here, I'm redirecting any fatal runtime errors to the browser, rather than having to hunt around for them in the web error log. Please note that this is a potential security hole, as it reveals sensitive information to any random user out there on the net. So, don't use this in production code (but you aren't supposed to be using my programs as-is for production anyway).

Line 7 sets the PATH environment variable to something that doesn't trip up tainting or permit additional security holes. Note that tar needs to be found in one of these directories.

Line 11 forms the only configurable part of this code. Here, I'm specifying the directory in which I'm storing the distributions. Subdirectories below this directory define particular distributions, and must not begin with a dot or a dash. (So a directory named with morse code would definitely be forbidden.) As a security precaution, symbolic links will not be listed, either for directories or files, so the data must really live below this directory.

Line 15 sets our current directory to the top-level directory. Although we die if something breaks, this will merely trigger CGI::Carp to spit the error message out to the browser. In production code, this death should send a simple innocuous ``something broke'' message to the browser, and a detailed message either logged or perhaps emailed to the webmaster.

Now, the rest of the program is ``upside down'', so I'll describe it from back to front.

Lines 58 to 70 get executed on the first call to this script. Here, we generate the list of distributions.

Lines 60 generates the HTTP header, and the HTML header (including the message for the title bar, used when a bookmark is made). Line 61 gives the first text inside the browser window.

Lines 63 through 65 locate the distributions. We read the ``dot'' directory (our current directory), looking for names that don't begin with a dot or dash, and are a directory but not by being a symlink to the directory. For consistency, they'll be sorted, regardless of the unpredictable order that they're returned from readdir.

Lines 66 to 69 generate a selection form, in the layout of a radio_group. Line 66 generates the FORM tag, with an action equal to the URL of this script (defaulting the method to POST). Line 67 generates a single-column table with one or more radio buttons in a group. The group uses the same name for each form element name, but a different value. The user will select one of these items, sent to us in the dist parameter upon pressing the submit button created in line 68.

One fun feature of CGI.pm is that the list of values used in line 67 for the @names will automatically be HTML-entity-escaped, meaning that < will already be escaped to &lt;, and so on. And of course, the browser will re-escape the information as we come back the other way, which will be re-decoded by CGI.pm on the reply. So it doesn't matter what the directory names are! For testing, I used a name that contained both less-thans and spaces, and it worked just fine. Thanks, Lincoln!

So, once a distribution has been selected, we return back to the script, and the stuff in the second pass (lines 26 to 56) gets invoked. This is detected by a non-false value in $dist created in line 28.

Line 29 examines the value of $dist to ensure that it is a sane distribution. Even though we give the user a choice of valid directory names, we must distrust the value returned because it is trivial to fake any return value, possibly giving the user access to formerly restricted files. So, the first check is to ensure that it's a name that doesn't start with dash or dot, and does not consist of any slashes. Next, it has to be a directory, and finally, it has to not be a symbolic link to a directory, but an actual directory.

If that's all OK, we copy $1 into $dist, to untaint it. I did this before I had to turn off taint checking because of File::Find's displeasing behavior.

Speaking of which, we pull in File::Find in line 32. I do this as a require so that I don't load all that unneeded code on the first and third passes. The downside of this is that File::Find::find is not imported, so I have to call it explicitly in line 34. Line 33 sets up the @names array, stuffed with appropriate names in line 37. (Sorry for the forward/backward references there, but that's how they match up.)

So, find is called in lines 34 though 38, looking at all the pathnames below $dist, which is in turn below the current directory. Line 36 forces any names that begin with a dash or a dot to be ignored, and further to ensure that any subdirectories that begin with a dash or a dot are not examined. This is achieved by setting the $prune variable to true, notifying File::Find::find that we don't want to descend into here. Line 37 puts the full pathname (relative to the top-level directory) into @names if it is a file and not a symlink.

Once we have the names, we start dumping the CGI response back in line 39. Line 42 begins the form, generating a self-referencing URL with a slight twist. If the CGI script is invoked as /cgi/getdist, we set the action URL to /cgi/getdist/nnnnnnnnn.tar.gz, where nnnnnnnnn is a numeric value based on the epoch time (increasing by one each second, 9 digits as I write this, rolling over to 10 digits in early September 2001).

The trailing name will be ignored by the script, but when the invocation generates the compressed tar archive, the browser will likely download the file to this name (or at least default to it). This will uniquely identify this download, making it unlikely to conflict with any other file in the user's download directory.

Next, lines 44 to 52 display a table (used for layout), headed up by (often bolded) column titles for the name, size, and last modification date and time. Each name from @names is passed through the map to generate one row. Similar checkbox items with identical parameter name, but differing return values, are generated. The checkbox defaults to ``on'', defined by the 1 in line 49. Line 50 computes the filesize using -s, and line 51 gets the modification time converted as a human-readable string.

And when the submit button generated in line 53 is pressed, we come back to this same script in final pass, handled in lines 19 through 24. This is by far the easiest, as we merely need to extract the @names from the response checkboxes that had been selected, then dump out an appropriate MIME header (line 20), verify the names aren't trying to select /etc/passwd or anything else scary (line 21), and then let tar do all the hard work. I'm presuming a GNU tar here that can take a z flag to handle the compression. If you have to use a gzip in front of a non-GNU tar, this step gets slightly more complicated, or you can just forgo the compression.

Or if you're feeling quite adventurous, you can use the Archive::Tar and Compress::Zlib (both found in the CPAN) to generate the compressed archive without using an external program. (Perhaps I'll do that in a future column.)

And that's all there is to it! To start making your custom-selected compressed tar archives, stick your distributions below the configured top level directory, and link to the CGI URL from some convenient page.

The technique of generating a compressed tar archive on the fly can also be applied to a ``shopping cart'' strategy. You can have many pages with different filenames on them, selected or omitted by the user, and maintain the list either as hidden fields in the forms on the client side, or via some session ID technique on the server side. Then, when you're ready to generate the archive, be sure to invoke the downloading CGI URL with an appropriate extra path information so that the download name is set appropriately. Be sure to revalidate all the requested names; don't let bad guys grab arbitrary files this way.

Tom Phoenix got famous (again) by suggesting this month's column idea. If you have some snippet of an idea that can be handled by 30 to 300 lines of Perl, drop me a note. If I use your idea, you'll be famous! Until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use CGI::Pretty ":standard";
        =6=     use CGI::Carp qw(fatalsToBrowser); # only for debugging
        =7=     $ENV{PATH} = "/bin:/usr/bin:/usr/local/bin";
        =8=     
        =9=     ## CONFIG
        =10=    
        =11=    my $topdir = "/home/merlyn/Web/Bundle";
        =12=    
        =13=    ## END CONFIG
        =14=    
        =15=    chdir $topdir or die "Cannot chdir: $!";
        =16=    
        =17=    ## final pass
        =18=    
        =19=    if (my @names = param('names')) {
        =20=      print header("application/x-tar-gzip");
        =21=      @names = grep !(/^\// or grep /^[-.]/, split /\//), @names;
        =22=      exec "tar", "cfz", "-", @names if @names;
        =23=      die "cannot exec tar: $!";
        =24=    }
        =25=    
        =26=    ## second pass
        =27=    
        =28=    if (my $dist = param('dist')) {
        =29=      $dist =~ / ^ (?! [-.]) ([^\/]+) $ /x and -d $dist and not -l $dist or die "bad dist";
        =30=      $dist = $1;                   # now untainted
        =31=    
        =32=      require File::Find;
        =33=      my @names;
        =34=      File::Find::find
        =35=          (sub {
        =36=             return $File::Find::prune = 1 if /^(-|\..)/s;
        =37=             push @names, $File::Find::name if -f and not -l;
        =38=           }, $dist);
        =39=      print header, start_html("Download a distribution");
        =40=      print h1("Select your items within this distribution");
        =41=    
        =42=      print start_form(-action => url()."/".time().".tar.gz");
        =43=      print p("Select your items:");
        =44=      print
        =45=        table({cellspacing => 0, cellpadding => 2, border => 0},
        =46=              Tr(th(["Filename",
        =47=                     "Size",
        =48=                     "Last modified"])),
        =49=              map Tr(td([checkbox("names", 1, $_, $_),
        =50=                         sprintf("%dK", (1023 + -s)/1024),
        =51=                         scalar(localtime +(stat)[9])])),
        =52=              @names);
        =53=      print submit;
        =54=      print end_form, end_html;
        =55=      exit 0;
        =56=    }
        =57=    
        =58=    ## first pass
        =59=    
        =60=    print header, start_html("Download a distribution");
        =61=    print h1("Select your distribution");
        =62=    
        =63=    opendir DOT, "." or die "Cannot opendir: $!";
        =64=    my @names = sort grep !/^[-.]/ && -d && ! -l, readdir DOT;
        =65=    closedir DOT;
        =66=    print start_form, p("Select your distribution:");
        =67=    print radio_group(-name => "dist", -values => \@names, -columns => 1);
        =68=    print submit;
        =69=    print end_form, end_html;
        =70=    exit 0;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.