Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Web Techniques Column 52 (Aug 2000)
[suggested title: Selectable downloads via generated tar-gz files]
There are probably tens of thousands of compressed tar archives out there on the wild wild web to download. I can't imagine how much time, even on a high speed connection, it would take to download them all. But even if the publisher of the information carefully segregated the information into a reasonable bundle, sometimes I really want only part of the data. However, I'm forced to download the entire thing (perhaps over a slow connection in a hotel room, as I often am), just to discard the parts I don't want.
That is, unless the publisher of the data provided a way to build a
custom compressed archive, with only the files or directories that I
choose. And that's what we have for this month: a CGI program that
lets a user choose first a distribution, and then individual files
from that distribution, on an item by item basis. Once the choice is
made, a specific .tar.gz
file is made just for that user. And the
result is in [listing one, below].
This month's column idea was suggested by fellow Stonehenge Perl
trainer and Usenet Poster Extraordinaire, the one and only Tom Phoenix
(rootbeer@redcat.com
), based on similar code he wrote for me to
handle the downloads of exercise data and answers for our on-site
training classes. I wrote this code from scratch though, so if it
isn't exactly what he was suggesting, that's my fault.
Line 1 would have had the preferred -Tw
flags after the path to
Perl, but I ran into unavoidable problems with both taint mode and
warnings. First, the standard File::Find
module in the
distribution is not ``taint safe'', so that's a loser. (I think this
was corrected in Perl 5.6, but I haven't started using that version on
my production site yet.) Second, I'm using two variables from
File::Find
, and with warnings enabled, I get the ugly ``used only
once'' warning, annoying at best.
Line 2 turns on compiler restrictions, forcing me to declare my variables (no use of a default package), disabling barewords (no ``Perl poetry mode''), and symbolic references (no variables that contain other variable names).
Line 3 unbuffers STDOUT
, ensuring that any CGI header I've
generated appears before any program I fork, needed here for the
tar launch.
Line 5 pulls in the standard CGI::Pretty
module, which has the same
parameters as CGI.pm
, but generates nicely indented HTML. It's a
little slower to run, but the HTML was fairly small from this program,
and I wanted to be able to read it easier. The :standard
parameter
generates the function shortcuts, rather than requiring us to use the
object-oriented interface, which seems to involve a lot more typing
for not a lot of real gain.
Line 6 pulls in one of my favorite modules, CGI::Carp
(also found
in any recent standard Perl distribution). Here, I'm redirecting any
fatal runtime errors to the browser, rather than having to hunt around
for them in the web error log. Please note that this is a
potential security hole, as it reveals sensitive information to any
random user out there on the net. So, don't use this in production
code (but you aren't supposed to be using my programs as-is for
production anyway).
Line 7 sets the PATH
environment variable to something that doesn't
trip up tainting or permit additional security holes. Note that
tar needs to be found in one of these directories.
Line 11 forms the only configurable part of this code. Here, I'm specifying the directory in which I'm storing the distributions. Subdirectories below this directory define particular distributions, and must not begin with a dot or a dash. (So a directory named with morse code would definitely be forbidden.) As a security precaution, symbolic links will not be listed, either for directories or files, so the data must really live below this directory.
Line 15 sets our current directory to the top-level directory.
Although we die
if something breaks, this will merely trigger
CGI::Carp
to spit the error message out to the browser. In
production code, this death should send a simple innocuous ``something
broke'' message to the browser, and a detailed message either logged or
perhaps emailed to the webmaster.
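As a sketch of that production-style error handling (the subroutine name is illustrative, not something from the listing below), it might look like this:

```perl
use strict;

# Sketch of production-style error handling: send the browser a bland
# message and keep the details in the server log. The subroutine name
# is illustrative, not part of the column's listing.
sub friendly_die {
    my ($detail) = @_;
    warn "getdist error: $detail\n";   # lands in the web error log
    print "Content-type: text/html\n\n",
          "<h1>Sorry, something broke.</h1>\n",
          "<p>The webmaster has been notified.</p>\n";
    exit 0;
}

# with a directory that exists, this passes quietly:
chdir "/tmp" or friendly_die("Cannot chdir: $!");
```

A real deployment might also email the webmaster from inside that subroutine, but logging via `warn` is the minimum.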
Now, the rest of the program is ``upside down'', so I'll describe it from back to front.
Lines 58 to 70 get executed on the first call to this script. Here, we generate the list of distributions.
Line 60 generates the HTTP header, and the HTML header (including the message for the title bar, used when a bookmark is made). Line 61 gives the first text inside the browser window.
Lines 63 through 65 locate the distributions. We read the ``dot''
directory (our current directory), looking for names that don't begin
with a dot or dash, and are a directory but not by being a symlink to
the directory. For consistency, they'll be sorted, regardless of the
unpredictable order that they're returned from readdir
.
Lines 66 to 69 generate a selection form, in the layout of a
radio_group
. Line 66 generates the FORM
tag, with an action
equal to the URL of this script (defaulting the method to POST
).
Line 67 generates a single-column table with one or more radio buttons
in a group. The group uses the same name for each form element name,
but a different value. The user will select one of these items, sent
to us in the dist
parameter upon pressing the submit button created
in line 68.
One fun feature of CGI.pm
is that the list of values used in line
67 for the @names
will automatically be HTML-entity-escaped,
meaning that < will already be escaped to &lt;, and so on.
And of course, the browser will re-escape the information as we come
back the other way, which will be re-decoded by CGI.pm
on the
reply. So it doesn't matter what the directory names are! For
testing, I used a name that contained both less-thans and spaces, and
it worked just fine. Thanks, Lincoln!
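You can watch the same round trip outside a form context with CGI.pm's escapeHTML and unescapeHTML functions (a sketch; it assumes CGI.pm is installed, which it no longer is by default in very recent Perls):

```perl
use strict;
use CGI qw(escapeHTML unescapeHTML);

# Round-trip sketch: CGI.pm entity-escapes values on the way out,
# and param() decodes them on the way back in (simulated here with
# unescapeHTML), so odd directory names survive unharmed.
my $dir  = q{<fun stuff> & more};
my $sent = escapeHTML($dir);      # what lands in the generated HTML
my $back = unescapeHTML($sent);   # what the next request hands back

print "$sent\n";    # &lt;fun stuff&gt; &amp; more
print "$back\n";    # <fun stuff> & more
```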
So, once a distribution has been selected, we return back to the
script, and the stuff in the second pass (lines 26 to 56) gets
invoked. This is detected by a non-false value in $dist
created
in line 28.
Line 29 examines the value of $dist
to ensure that it is a sane
distribution. Even though we give the user a choice of valid
directory names, we must distrust the value returned because it is
trivial to fake any return value, possibly giving the user access
to formerly restricted files. So, the first check is to ensure that
it's a name that doesn't start with a dash or dot, and contains no
slashes. Next, it has to be a directory, and finally, it has
to not be a symbolic link to a directory, but an actual directory.
If that's all OK, we copy $1
into $dist
, to untaint it. I did
this before I had to turn off taint checking because of
File::Find
's displeasing behavior.
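In isolation, the untainting idiom looks like this (a sketch mirroring the checks on lines 29 and 30; the subroutine name is illustrative):

```perl
use strict;

# Untainting sketch: a value is trusted only if it survives a regex
# whose capture defines the allowed shape. This mirrors the checks
# on lines 29 and 30 of the listing; the subroutine name is mine.
sub sane_dist {
    my ($dist) = @_;
    # no leading dash or dot, and no slashes anywhere
    return undef unless $dist =~ / ^ (?! [-.] ) ( [^\/]+ ) $ /x;
    return $1;    # the captured copy is the untainted one
}

for my $candidate ("perl-course", "../etc/passwd", "-rf") {
    my $clean = sane_dist($candidate);
    print defined $clean ? "$candidate: ok\n" : "$candidate: rejected\n";
}
```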
Speaking of which, we pull in File::Find
in line 32. I do this as
a require
so that I don't load all that unneeded code on the first
and third passes. The downside of this is that File::Find::find
is
not imported, so I have to call it explicitly in line 34. Line 33
sets up the @names
array, stuffed with appropriate names in line
37. (Sorry for the forward/backward references there, but that's how
they match up.)
So, find
is called in lines 34 through 38, looking at all the
pathnames below $dist
, which is in turn below the current
directory. Line 36 forces any names that begin with a dash or a dot
to be ignored, and further ensures that any subdirectories that
begin with a dash or a dot are not examined. This is achieved by
setting the $prune
variable to true, notifying File::Find::find
that we don't want to descend into here. Line 37 puts the full
pathname (relative to the top-level directory) into @names
if it is
a file and not a symlink.
Once we have the names, we start dumping the CGI response back in line
39. Line 42 begins the form, generating a self-referencing URL with a
slight twist. If the CGI script is invoked as /cgi/getdist
, we set
the action URL to /cgi/getdist/
nnnnnnnnn.tar.gz
, where
nnnnnnnnn is a numeric value based on the epoch time (increasing
by one each second, 9 digits as I write this, rolling over to 10
digits in early September 2001).
The trailing name will be ignored by the script, but when the invocation generates the compressed tar archive, the browser will likely download the file to this name (or at least default to it). This will uniquely identify this download, making it unlikely to conflict with any other file in the user's download directory.
Next, lines 44 to 52 display a table (used for layout), headed up by
(often bolded) column titles for the name, size, and last modification
date and time. Each name from @names
is passed through the map
to generate one row. Similar checkbox
items with identical
parameter name, but differing return values, are generated. The
checkbox defaults to ``on'', defined by the 1
in line 49. Line 50
computes the filesize using -s
, and line 51 gets the modification
time converted as a human-readable string.
And when the submit button generated in line 53 is pressed, we come
back to this same script in the final pass, handled in lines 19 through
24. This is by far the easiest, as we merely need to extract the
@names
from the response checkboxes that had been selected, then
dump out an appropriate MIME header (line 20), verify the names aren't
trying to select /etc/passwd
or anything else scary (line 21), and
then let tar do all the hard work. I'm presuming a GNU tar here
that can take a z
flag to handle the compression. If you have to
use a gzip in front of a non-GNU tar, this step gets slightly more
complicated, or you can just forgo the compression.
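That slightly more complicated step might be sketched like this (directory and file names are illustrative; note that the real CGI passes user-supplied names to exec without a shell, while this demonstration uses fixed names, so a shell pipeline is safe):

```perl
use strict;
use File::Temp qw(tempdir);

# Sketch: pipe "tar cf -" through "gzip -c" by hand when tar lacks
# the "z" flag. Names here are fixed and illustrative; with
# user-supplied names you'd want the shell-free exec of the listing.
my $dir = tempdir(CLEANUP => 1);
open my $fh, ">", "$dir/hello.txt" or die "Cannot create: $!";
print $fh "hello\n";
close $fh;

chdir $dir or die "Cannot chdir: $!";
system("tar cf - hello.txt | gzip -c > bundle.tar.gz") == 0
  or die "pipeline failed: $?";
print -s "bundle.tar.gz" ? "archive written\n" : "no archive\n";
```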
Or if you're feeling quite adventurous, you can use the
Archive::Tar
and Compress::Zlib
modules (both found in the CPAN) to
generate the compressed archive without using an external program.
(Perhaps I'll do that in a future column.)
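A minimal sketch of that pure-Perl route, assuming Archive::Tar and a zlib-capable IO layer are installed (the file names and contents are illustrative):

```perl
use strict;
use Archive::Tar;

# Sketch of the external-program-free route: Archive::Tar (with
# IO::Zlib handling the compression) builds and gzips the archive
# entirely in Perl. File names and contents are illustrative.
my $tar = Archive::Tar->new;
$tar->add_data("dist/README",    "read me first\n");
$tar->add_data("dist/notes.txt", "some notes\n");
$tar->write("bundle.tar.gz", 1)    # true second argument => gzip
  or die "write failed: " . $tar->error;
print -s "bundle.tar.gz" ? "wrote bundle.tar.gz\n" : "failed\n";
```

In a CGI context you'd hand the compressed bytes to STDOUT instead of a file, but the file form is easier to demonstrate.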
And that's all there is to it! To start making your custom-selected compressed tar archives, stick your distributions below the configured top level directory, and link to the CGI URL from some convenient page.
The technique of generating a compressed tar archive on the fly can also be applied to a ``shopping cart'' strategy. You can have many pages with different filenames on them, selected or omitted by the user, and maintain the list either as hidden fields in the forms on the client side, or via some session ID technique on the server side. Then, when you're ready to generate the archive, be sure to invoke the downloading CGI URL with an appropriate extra path information so that the download name is set appropriately. Be sure to revalidate all the requested names; don't let bad guys grab arbitrary files this way.
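The client-side variant of that cart might be sketched like this (field and file names are illustrative, reusing the epoch-time download-name trick from the listing; assumes CGI.pm is installed):

```perl
use strict;
use CGI qw(:standard);

# Sketch of the client-side "shopping cart": carry the accumulated
# selection between pages as hidden form fields. The field and file
# names are illustrative, and the action URL reuses the epoch-time
# download-name trick from the listing.
my @cart = ("dist/README", "dist/lib/Foo.pm");
my $html = start_form(-action => "/cgi/getdist/" . time() . ".tar.gz")
         . hidden(-name => "names", -default => \@cart, -override => 1)
         . submit(-value => "Download selection")
         . end_form;
print $html;
```

The downloading script then sees the accumulated names in its `names` parameter, exactly as if they'd come from the checkbox table, so the same revalidation applies.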
Tom Phoenix got famous (again) by suggesting this month's column idea. If you have some snippet of an idea that can be handled by 30 to 300 lines of Perl, drop me a note. If I use your idea, you'll be famous! Until next time, enjoy!
Listings
=1=     #!/usr/bin/perl
=2=     use strict;
=3=     $|++;
=4=
=5=     use CGI::Pretty ":standard";
=6=     use CGI::Carp qw(fatalsToBrowser); # only for debugging
=7=     $ENV{PATH} = "/bin:/usr/bin:/usr/local/bin";
=8=
=9=     ## CONFIG
=10=
=11=    my $topdir = "/home/merlyn/Web/Bundle";
=12=
=13=    ## END CONFIG
=14=
=15=    chdir $topdir or die "Cannot chdir: $!";
=16=
=17=    ## final pass
=18=
=19=    if (my @names = param('names')) {
=20=      print header("application/x-tar-gzip");
=21=      @names = grep !(/^\// or grep /^[-.]/, split /\//), @names;
=22=      exec "tar", "cfz", "-", @names if @names;
=23=      die "cannot exec tar: $!";
=24=    }
=25=
=26=    ## second pass
=27=
=28=    if (my $dist = param('dist')) {
=29=      $dist =~ / ^ (?! [-.]) ([^\/]+) $ /x and -d $dist and not -l $dist or die "bad dist";
=30=      $dist = $1; # now untainted
=31=
=32=      require File::Find;
=33=      my @names;
=34=      File::Find::find
=35=        (sub {
=36=           return $File::Find::prune = 1 if /^(-|\..)/s;
=37=           push @names, $File::Find::name if -f and not -l;
=38=         }, $dist);
=39=      print header, start_html("Download a distribution");
=40=      print h1("Select your items within this distribution");
=41=
=42=      print start_form(-action => url()."/".time().".tar.gz");
=43=      print p("Select your items:");
=44=      print
=45=        table({cellspacing => 0, cellpadding => 2, border => 0},
=46=              Tr(th(["Filename",
=47=                     "Size",
=48=                     "Last modified"])),
=49=              map Tr(td([checkbox("names", 1, $_, $_),
=50=                         sprintf("%dK", (1023 + -s)/1024),
=51=                         scalar(localtime +(stat)[9])])),
=52=              @names);
=53=      print submit;
=54=      print end_form, end_html;
=55=      exit 0;
=56=    }
=57=
=58=    ## first pass
=59=
=60=    print header, start_html("Download a distribution");
=61=    print h1("Select your distribution");
=62=
=63=    opendir DOT, "." or die "Cannot opendir: $!";
=64=    my @names = sort grep !/^[-.]/ && -d && ! -l, readdir DOT;
=65=    closedir DOT;
=66=    print start_form, p("Select your distribution:");
=67=    print radio_group(-name => "dist", -values => \@names, -columns => 1);
=68=    print submit;
=69=    print end_form, end_html;
=70=    exit 0;