Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Linux Magazine Column 42 (Nov 2002)
[suggested title: Mirroring your own mini-CPAN]
The Comprehensive Perl Archive Network, known as ``the CPAN'', is the ``one stop shopping center'' for all things Perl. This 1.2 GB archive contains over 13000 modules for inclusion in your programs, as well as scripts, documentation, many non-Unix Perl binaries, and other interesting things.
Although there's nearly always a good fast CPAN archive nearby when you are connected to the net, sometimes you're connected to the net at different speeds (like quickly at work, but slowly at home or vice versa), or not at all. And what do you do then when you're like me, at 30,000 feet jetting off to yet another conference or customer site, and you realize you need a module that you haven't yet installed on your laptop? (This is especially an issue when a deadline for a magazine column looms close.)
Well, for the past year or so, I've been mirroring the entire CPAN to my laptop, thanks to the permission and cooperation of the owner of one of the major archive sites (and a few carefully constructed rsync commands). But at a recent conference, someone said ``hey, can you just burn that onto a CD for me?'', and I was stuck. The current CPAN exceeds the size of a CD-ROM, even though only a small portion of the files are needed for module installation!
So that got me thinking. If I brought down only the files that were needed by CPAN.pm to perform the installation of the latest release of a module, how big would that be? And the answer was wonderfully surprising: a bit more than 200 meg, which easily fits on a CD-ROM.
Unfortunately, I didn't see any clean, easy-to-use, efficient ``mirror only the latest modules of the CPAN'' program out there, so I wrote my own, which I present in [listing one, below].
Lines 1 through 3 start nearly every long program that I write, enabling warnings and compiler restrictions, and disabling buffering on STDOUT.
Lines 5 through 17 form the configuration section of this program. There are really only three things to set here.
$REMOTE is the URL prefix leading to the nearest CPAN archive. The uncommented value is the main United States CPAN archive. The next value is the Finland archive, which also happens to be the master archive. If you want the most up-to-date sources, they're here. And because I was initially developing this program at the annual SAGE-AU conference in Australia, the value following that is the Australian CPAN archive. Finally, I have a complete CPAN archive on my laptop's disk already, so I can point to that with a file: URL as well, as shown by the fourth value.
That's the source, and we need to define a destination, and that's in $LOCAL. This is a simple Unix path. If you're on a non-Unix system, you can specify this in the local directory syntax, since we'll be using the cross-platform File::Spec library to manipulate this path. And, as the comment warns, this program owns the contents of that directory, and is free to delete anything it sees fit, so keep that in mind as you are specifying the path.
Finally, a simple true/false $TRACE flag decides whether this program is noisy by default or quiet by default. The noise is limited to actual activity, and reassures me during execution that something is happening.
Next, from lines 20 to 30, we'll pull in the necessary modules. The standard Perl bundle gives us the mkpath, dirname, catfile, and find routines. The optional CPAN-installable LWP library gives us the URI object module and the mirror routine (and some associated status values). And Compress::Zlib lets us expand the gzip-compressed index file so we know what distributions are needed for the mirror.
Once we've got everything set up, it's time to transfer everything needed for a typical operation of the core CPAN module (described by perldoc CPAN in a typical Perl installation). First, we need the index files, defined in lines 34 to 36. We'll call my_mirror (defined later) on each of those. For now, we'll presume that this creates or refreshes each of those files below the $LOCAL-identified directory.
The 02packages.details.txt.gz file is a flat text file that, after a short header, lists the path to the distribution for each module in the CPAN. However, this file is gzip-compressed, so we need to expand the file to process the contents. Stealing the example out of the Compress::Zlib manpage nearly directly, lines 40 to 52 expand this file and extract the necessary information.
Line 40 constructs the filename in a platform-independent way by using the catfile routine. Note that we're actually passing three parameters. The first parameter is the value of $LOCAL, which serves as the starting point, from which we descend further to the subdirectory called modules, and thence finally to a file within that directory called 02packages.details.txt.gz. I've tested this only on Unix, but I'll presume that the program is portable, because I've used the portable functions.
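As a rough illustration of what that catfile call produces, here is a minimal sketch. It calls the Unix flavor of File::Spec explicitly (an assumption made only so the result is the same on any platform), whereas the listing uses the platform-native catfile. The $LOCAL value is the one from the listing.

```perl
use strict;
use warnings;
use File::Spec::Unix;

# Same value as $LOCAL in the listing (note the trailing slash).
my $LOCAL = "/Users/merlyn/MIRROR/MINICPAN/";

# Calling the Unix flavor explicitly keeps this demonstration
# deterministic; the listing uses the platform-native catfile instead.
my $details = File::Spec::Unix->catfile(
    $LOCAL, qw(modules 02packages.details.txt.gz)
);

print "$details\n";
# prints /Users/merlyn/MIRROR/MINICPAN/modules/02packages.details.txt.gz
```

The doubled slash that plain string concatenation would have produced is cleaned up along the way, which is one reason to prefer catfile over joining the pieces by hand.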
Line 41 takes this constructed path, and creates a Compress::Zlib object, which can be asked to deliver the uncompressed file line-by-line. If that fails, we're in an unrecoverable state, and we'll abort.
The data contains a header, delimited by a blank line, so we need to skip over all the data up to and including that blank line. We'll do this by setting a flag to an initial 1 value in line 42. Line 43 reads a line at a time into $_, stopping when there is no more data (or there's an I/O error). Lines 44 to 47 look for the end of the header as long as we're still in the header. A header ends on a line that doesn't contain a non-blank character, hence the unless.
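The header-skipping loop can be tried without touching the network. The sketch below writes a tiny gzip file to a temporary directory and then reads it back with the same loop shape as the listing. The header fields and the second data line are invented for the demonstration; only the Parse::RecDescent line comes from the article.

```perl
use strict;
use warnings;
use Compress::Zlib qw(gzopen $gzerrno);
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);

# Build a tiny stand-in for the index: header, blank line, then data.
# The header fields and the second data line are made up for this demo.
my $dir  = tempdir(CLEANUP => 1);
my $file = catfile($dir, "sample.txt.gz");
my $text = join "",
    "File: sample index\n",
    "Description: made-up header\n",
    "\n",
    "Parse::RecDescent 1.80 D/DC/DCONWAY/Parse-RecDescent-1.80.tar.gz\n",
    "Example::Module 0.01 X/XY/XYZZY/Example-Module-0.01.tar.gz\n";
my $out = gzopen($file, "wb") or die "Cannot write $file: $gzerrno";
$out->gzwrite($text);
$out->gzclose;

# Same loop shape as lines 41 to 47 of the listing.
my @body;
my $gz = gzopen($file, "rb") or die "Cannot open $file: $gzerrno";
my $inheader = 1;
while ($gz->gzreadline($_) > 0) {
    if ($inheader) {
        $inheader = 0 unless /\S/;   # a blank line ends the header
        next;                        # skip header lines and the blank line
    }
    push @body, $_;
}
$gz->gzclose;

print @body;   # only the two data lines remain
```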
If we make it to line 49, we're staring at a standard line from the index, which looks something like:
    Parse::RecDescent 1.80 D/DC/DCONWAY/Parse-RecDescent-1.80.tar.gz
The first column is the module name (here Parse::RecDescent), and is not very interesting to us. Neither is the second column, which is the current version number. But the third column contains (the unique part of) the path to the distribution for this module, and that's what the CPAN module will be looking for, and what we need to mirror.
Note that many module names will share the same common distribution file, so we'll need logic to avoid downloading duplicates. We'll defer that problem to the my_mirror subroutine.
A few of the modules are listed as belonging to a core Perl distribution. To avoid mirroring the various Perl distributions (and wasting space in our mirror), we'll skip over them in line 50. The regular expression is somewhat ad-hoc, but seems to do the right thing.
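To see the column split, the core-Perl filter, and the duplicate problem together, here is a small sketch over a few index lines. Only the first line is from the article; the others are invented to exercise each case. The %seen hash is a simplification for this demonstration: the listing handles duplicates inside my_mirror instead.

```perl
use strict;
use warnings;

# The first line is the example from the text; the rest are invented
# to exercise the shared-distribution and core-Perl cases.
my @index = (
    "Parse::RecDescent 1.80 D/DC/DCONWAY/Parse-RecDescent-1.80.tar.gz",
    "Example::One 0.01 X/XY/XYZZY/Example-0.01.tar.gz",
    "Example::Two 0.01 X/XY/XYZZY/Example-0.01.tar.gz",
    "Example::Core 1.00 X/XY/XYZZY/perl-5.8.0.tar.gz",
);

my (%seen, @wanted);
for (@index) {
    my ($module, $version, $path) = split;   # split $_ on whitespace
    next if $path =~ m{/perl-5};             # same test as line 50
    next if $seen{$path}++;                  # one fetch per distribution
    push @wanted, "authors/id/$path";
}

print "$_\n" for @wanted;
```

Four index lines collapse to two distributions to fetch: the two Example modules share one tarball, and the Perl distribution is skipped.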
Line 51 mirrors the requested distribution into our local mirror. The 1 parameter says ``if it already exists, it's up to date'', and is an optimization based on external knowledge that a given distribution will never be updated in place. Rather, a new file will be created with a new version number. Of course, like any optimization, we do this with some hesitation and a bit of caution.
Once we've passed through the entire module list, we need to delete any outdated modules. A CPAN contributor has the option of leaving older versions of modules in the CPAN, or deleting them. We need to keep track of everything that is current, and delete anything not mentioned, in order to keep in sync with the master archive.
And that's it, as line 57 confirms.
But of course, that's not the whole story. We need to manage the mirroring. There are two steps to mirroring: fetching the files, and throwing away anything left over. These need to share a common hash, which we'll define as a closure variable inside a BEGIN block starting in line 59. The %mirrored hash in line 62 is keyed by the filename, and has a value of 1 to indicate that the file has been at least checked for existence, and 2 to indicate that it has been mirrored from the remote site and brought up to date. At the end of the run, any file in the local tree without a value of 1 or 2 is either a file that has been deleted from the CPAN or a leftover temp file, and should be removed from our mirror.
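The shared-hash idea can be shown in isolation. In this sketch the three helper names are hypothetical (the listing folds the same logic into my_mirror and clean_unmirrored), but the BEGIN block and the 1 and 2 values follow the description above.

```perl
use strict;
use warnings;

BEGIN {
    # Visible only to the subs defined inside this block.
    my %mirrored;

    # Hypothetical helper: record "seen locally" (1) without
    # downgrading an entry already marked as mirrored (2).
    sub note_checked {
        my $file = shift;
        $mirrored{$file} = 1 unless $mirrored{$file};
    }

    # Hypothetical helper: record "fetched from the remote side" (2).
    sub note_mirrored {
        my $file = shift;
        $mirrored{$file} = 2;
    }

    # Hypothetical helper: 0 means "never mentioned", a cleanup candidate.
    sub state_of {
        my $file = shift;
        return $mirrored{$file} || 0;
    }
}

note_mirrored("a.tar.gz");
note_checked("a.tar.gz");    # stays at 2
note_checked("b.tar.gz");    # becomes 1

print join(" ", map { state_of($_) } qw(a.tar.gz b.tar.gz c.tar.gz)), "\n";
# prints 2 1 0
```

Because %mirrored is a lexical inside the block, nothing outside these subroutines can touch it, and the BEGIN ensures it exists before the main program starts calling them.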
The my_mirror routine starting in line 64 does the hard work. The two parameters are the partial URL path and the ``skip if present'' flag.
In line 68, we use the URI module to construct the full URL, based on the $REMOTE value and the partial path. Line 69 constructs the local file path, based on $LOCAL and the partial path as well. The task for the remainder of the subroutine is to make the local file be up to date with respect to the remote URL.
Line 70 manages the checksum file. Each distribution is checksummed to ensure proper complete transfer. We'll first pretend that the checksum file doesn't need updating, but later remove that assumption if we end up transferring the distribution file.
Starting in line 72, we look at what to do to bring this file up to date. If $skip_if_present is true, then we'll never worry about the remote timestamp being out of sync. If the file is present, it's good enough, noted by the -f flag in line 72. Line 74 records that the file was at least checked for existence, so we don't delete it during the cleanup phase.
If $skip_if_present is not true, or the file doesn't exist, then it's time to do a full mirror on this distribution. We'll note that in line 77. Line 79 creates the directory to receive the file. (I would argue that LWP should do this for me, but that's not the way it works.) The $TRACE value causes a series of mkdir command-lines to be traced to the output; otherwise, this operation is silent. Line 80 also puts out some noise if $TRACE is set: note the absence of a newline, because we're going to follow on with a result status.
Line 81 is where the real work happens. We'll call mirror to bring the remote URL to the local file. This is done in such a way that the existing modification timestamp (if any) is noted and respected, minimizing the load on the remote server. And the file is actually written into a temp file, and then renamed only when the transfer is complete, thus ensuring that other users of this directory will not see partially transferred files at normal locations. (If one of these transfers aborts mid-way, the cleanup phase at the end of this program will delete the partial transfer.) The modification time is also updated to that of the remote data, so that a later mirror will again note that the file is up to date.
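The write-then-rename pattern described here is easy to try with core Perl alone. The following is only a sketch of the general technique, not LWP's actual code: write_atomically is a hypothetical helper, and the file name, contents, and timestamp are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write $data beside $target, stamp it with
# $mtime, and only then rename it into its final place.
sub write_atomically {
    my ($target, $data, $mtime) = @_;
    my $tmp = "$target-$$";    # temp name next to the target
    open my $fh, ">", $tmp or die "Cannot write $tmp: $!";
    binmode $fh;
    print {$fh} $data;
    close $fh or die "Cannot close $tmp: $!";
    utime $mtime, $mtime, $tmp or warn "Cannot set mtime on $tmp: $!";
    rename $tmp, $target or die "Cannot rename $tmp to $target: $!";
}

my $dir    = tempdir(CLEANUP => 1);
my $target = "$dir/Example-0.01.tar.gz";    # made-up file name
my $mtime  = 1_000_000_000;                 # arbitrary timestamp for the demo
write_atomically($target, "hello\n", $mtime);

open my $in, "<", $target or die "Cannot read $target: $!";
my $content = do { local $/; <$in> };       # slurp the whole file
close $in;
my $got_mtime = (stat $target)[9];

print "content ok\n" if $content eq "hello\n";
print "mtime ok\n"   if $got_mtime == $mtime;
```

A reader of the directory sees either the old file or the complete new one, never a half-written file under the final name.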
The result of mirror is an HTTP status value. If it's RC_OK, then we've got a new version of the remote file. In this case, the checksum file may now be out of date: we can't merely check for its existence, so we'll flag that by setting the variable to 0 in line 84.
If the response is RC_NOT_MODIFIED, then we already had an up-to-date version of the file, and the remote server has informed us of such without even sending us a new version. In that case, we end up in line 90, finishing out the tracing message if needed.
However, if the status is neither of these, then something has gone wrong, and we'll issue a warn noting the status, and abort any further operation on this path by returning from the subroutine.
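The three-way branch on the status can be sketched without a network connection. The constants below are local stand-ins for the ones LWP::Simple exports, set to the standard HTTP codes for ``OK'' and ``Not Modified''; classify is a hypothetical helper that follows the same branch order as lines 83 to 91 of the listing.

```perl
use strict;
use warnings;

# Stand-ins for the constants LWP::Simple exports; these are the
# standard HTTP codes for "OK" and "Not Modified".
use constant RC_OK           => 200;
use constant RC_NOT_MODIFIED => 304;

# Hypothetical helper following the branch order of lines 83 to 91.
sub classify {
    my $status = shift;
    return "updated" if $status == RC_OK;
    return "failed"  if $status != RC_NOT_MODIFIED;
    return "up to date";
}

print "$_: ", classify($_), "\n" for 200, 304, 404;
```

Anything other than the two expected codes, such as a 404 for a distribution that has vanished from the archive, lands in the ``failed'' branch.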
Once the distribution has been transferred, it's time to grab the checksum file. If the path is a distribution (checked in line 94), we'll compute the path to the CHECKSUMS file in lines 95 and 96. We must be careful to perform URL calculations here, not native path calculations. And, to keep the algorithm easy, we need to compute the path relative to the original CPAN mirror base, not a full path. Thankfully, this is also trivial with the URI module.
In line 97, if we're not already looking at a CHECKSUMS file, we need to call back to ourselves to transfer the file. This is a clean tail-recursion, so I could have simply used a goto or a loop, but the subroutine call seemed easier and clearer at the time. If the checksum might already be up to date, it will merely be checked for its presence. If a transfer has taken place, a full mirror call will be issued instead.
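To make the path arithmetic concrete, here is a sketch that derives the CHECKSUMS path from a distribution path. It deliberately swaps in a plain string substitution for the URI calls of lines 95 and 96, so that it runs with core Perl only; checksum_path_for is a hypothetical helper, and the approach assumes simple relative paths with forward slashes.

```perl
use strict;
use warnings;

# Hypothetical helper: swap the last segment of a relative URL path
# for CHECKSUMS. Plain string substitution stands in for the URI calls.
sub checksum_path_for {
    my $path = shift;
    (my $checksum = $path) =~ s{[^/]*\z}{CHECKSUMS};
    return $checksum;
}

my $dist = "authors/id/D/DC/DCONWAY/Parse-RecDescent-1.80.tar.gz";
my $sums = checksum_path_for($dist);

print "$sums\n";                        # prints authors/id/D/DC/DCONWAY/CHECKSUMS
print checksum_path_for($sums), "\n";   # unchanged, so the recursion stops
```

The second call shows why the comparison in line 97 ends the recursion: the CHECKSUMS path maps to itself.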
Finally, we have the cleanup phase routine. We'll start at $LOCAL using the File::Find recursion. If a file exists, and it's not noted as such in the %mirrored hash (line 105), then we remove it (line 107).
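The cleanup pass can be rehearsed safely in a scratch directory. In this sketch the directory stands in for $LOCAL, both file names are made up, and the hash is filled by hand rather than by a mirroring run. The keys are built the way File::Find reports names (top directory, a slash, then the filename).

```perl
use strict;
use warnings;
use File::Find qw(find);
use File::Temp qw(tempdir);

# A scratch directory standing in for $LOCAL, with two made-up files.
my $local = tempdir(CLEANUP => 1);
my $keep  = "$local/keep.tar.gz";
my $stale = "$local/stale.tar.gz";
for my $file ($keep, $stale) {
    open my $fh, ">", $file or die "Cannot create $file: $!";
    close $fh;
}

# Only one of the two was "mirrored" during this pretend run.
my %mirrored = ($keep => 2);

find sub {
    # $_ is the bare filename (find has changed into its directory);
    # $File::Find::name is the full path, which is how the hash is keyed.
    return unless -f and not $mirrored{$File::Find::name};
    unlink $_ or warn "Cannot remove $File::Find::name: $!";
}, $local;

print( (-e $keep)  ? "keep.tar.gz kept\n"     : "keep.tar.gz lost\n" );
print( (-e $stale) ? "stale.tar.gz remains\n" : "stale.tar.gz removed\n" );
```

This is also a reminder of why the configuration comment matters: anything under the top directory that the hash doesn't know about is deleted.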
And there you have it. Set up the configuration, and let it rip. On the first execution, you will want to be on a fast link (or a relatively unloaded time of day), because it downloads about 200 megabytes of data. After that, it's about 2-5 minutes per (average) day on a 28.8 link, which is completely tolerable for me from my hotel room when I'm on the road. And don't forget: you're downloading only installable modules, not the rest of the CPAN.
To use this mini-CPAN mirror with CPAN.pm, you'll need to enter at the CPAN prompt:
    o conf urllist unshift file://$LOCAL
    o conf commit
    reload index
Here, $LOCAL is replaced by the value you've set in $LOCAL but specified as a URL path (forward slashes for directory delimiters, and percent-escaped unusual characters). That's because CPAN.pm is expecting a URL, not a file path.
At the risk of repeating myself: this won't make CPAN installations any faster, unless you happen to be a road-warrior like me, needing to do CPAN installations when you are on a very slow net link (or no link at all). Of course, you could burn a daily CD for your friends, and ``hand them a CPAN archive on a disk'', providing a gateway between your bandwidth and the sneakernet. At least you won't have to worry about how to fit the full 1.2+ GB CPAN on a CD-ROM! Until next time, enjoy!
Listing
    =1=     #!/usr/bin/perl -w
    =2=     use strict;
    =3=     $|++;
    =4=
    =5=     ### CONFIG
    =6=
    =7=     my $REMOTE = "http://www.cpan.org/";
    =8=     # my $REMOTE = "http://fi.cpan.org/";
    =9=     # my $REMOTE = "http://au.cpan.org/";
    =10=    # my $REMOTE = "file://Users/merlyn/MIRROR/CPAN/";
    =11=
    =12=    ## warning: unknown files below this dir are deleted!
    =13=    my $LOCAL = "/Users/merlyn/MIRROR/MINICPAN/";
    =14=
    =15=    my $TRACE = 1;
    =16=
    =17=    ### END CONFIG
    =18=
    =19=    ## core -
    =20=    use File::Path qw(mkpath);
    =21=    use File::Basename qw(dirname);
    =22=    use File::Spec::Functions qw(catfile);
    =23=    use File::Find qw(find);
    =24=
    =25=    ## LWP -
    =26=    use URI ();
    =27=    use LWP::Simple qw(mirror RC_OK RC_NOT_MODIFIED);
    =28=
    =29=    ## Compress::Zlib -
    =30=    use Compress::Zlib qw(gzopen $gzerrno);
    =31=
    =32=    ## first, get index files
    =33=    my_mirror($_) for qw(
    =34=      authors/01mailrc.txt.gz
    =35=      modules/02packages.details.txt.gz
    =36=      modules/03modlist.data.gz
    =37=    );
    =38=
    =39=    ## now walk the packages list
    =40=    my $details = catfile($LOCAL, qw(modules 02packages.details.txt.gz));
    =41=    my $gz = gzopen($details, "rb") or die "Cannot open details: $gzerrno";
    =42=    my $inheader = 1;
    =43=    while ($gz->gzreadline($_) > 0) {
    =44=      if ($inheader) {
    =45=        $inheader = 0 unless /\S/;
    =46=        next;
    =47=      }
    =48=
    =49=      my ($module, $version, $path) = split;
    =50=      next if $path =~ m{/perl-5}; # skip Perl distributions
    =51=      my_mirror("authors/id/$path", 1);
    =52=    }
    =53=
    =54=    ## finally, clean the files we didn't stick there
    =55=    clean_unmirrored();
    =56=
    =57=    exit 0;
    =58=
    =59=    BEGIN {
    =60=      ## %mirrored tracks the already done, keyed by filename
    =61=      ## 1 = local-checked, 2 = remote-mirrored
    =62=      my %mirrored;
    =63=
    =64=      sub my_mirror {
    =65=        my $path = shift; # partial URL
    =66=        my $skip_if_present = shift; # true/false
    =67=
    =68=        my $remote_uri = URI->new_abs($path, $REMOTE)->as_string; # full URL
    =69=        my $local_file = catfile($LOCAL, split "/", $path); # native absolute file
    =70=        my $checksum_might_be_up_to_date = 1;
    =71=
    =72=        if ($skip_if_present and -f $local_file) {
    =73=          ## upgrade to checked if not already
    =74=          $mirrored{$local_file} = 1 unless $mirrored{$local_file};
    =75=        } elsif (($mirrored{$local_file} || 0) < 2) {
    =76=          ## upgrade to full mirror
    =77=          $mirrored{$local_file} = 2;
    =78=
    =79=          mkpath(dirname($local_file), $TRACE, 0711);
    =80=          print $path if $TRACE;
    =81=          my $status = mirror($remote_uri, $local_file);
    =82=
    =83=          if ($status == RC_OK) {
    =84=            $checksum_might_be_up_to_date = 0;
    =85=            print " ... updated\n" if $TRACE;
    =86=          } elsif ($status != RC_NOT_MODIFIED) {
    =87=            warn "\n$remote_uri: $status\n";
    =88=            return;
    =89=          } else {
    =90=            print " ... up to date\n" if $TRACE;
    =91=          }
    =92=        }
    =93=
    =94=        if ($path =~ m{^authors/id}) { # maybe fetch CHECKSUMS
    =95=          my $checksum_path =
    =96=            URI->new_abs("CHECKSUMS", $remote_uri)->rel($REMOTE);
    =97=          if ($path ne $checksum_path) {
    =98=            my_mirror($checksum_path, $checksum_might_be_up_to_date);
    =99=          }
    =100=       }
    =101=     }
    =102=
    =103=     sub clean_unmirrored {
    =104=       find sub {
    =105=         return unless -f and not $mirrored{$File::Find::name};
    =106=         print "$File::Find::name ... removed\n" if $TRACE;
    =107=         unlink $_ or warn "Cannot remove $File::Find::name: $!";
    =108=       }, $LOCAL;
    =109=     }
    =110=   }