Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.
This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
[suggested title: ``The Big Modules in the Mini-CPAN'']
I recently attended a presentation by Adam Kennedy on his massive CPAN testing effort, at the third Open Source Developers' Conference (osdc.com.au) in Melbourne (Victoria, Australia). In the presentation, Adam referred to a number of large CPAN modules, and this got me thinking about what other large distributions might be in there.
As I retreated to my hotel that night, I refreshed my ``mini CPAN'', using the
script I wrote for [this column, nov 2002], also while in Australia. (What is
it about Australia that gets me thinking about the CPAN?) Although the code
in that article became the basis for
CPAN::Mini (in the CPAN), I still use
my original code. Over the past year or so, the Mini-CPAN size has approached
being no longer able to fit on a CD-ROM (even though fitting on a CD was the
original goal). Now inspired by Adam's talk, I wanted to see what the largest
items were in this Mini-CPAN mirror.
I initially thought of writing a quick Perl script to go through the
files of the Mini-CPAN (perhaps using my own
but I decided to go remarkably low tech, using a simple command line:
    cd ~/MIRROR/MINICPAN/authors/id
    find . -type f -ls | sort +1nr -2 | head -10
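A note for anyone trying this on a newer system: the `+1nr -2` key syntax is the old-style form, and current versions of sort may reject it. A rough equivalent might look like the sketch below. It assumes a find that supports -ls (GNU and BSD find do, though it is not in POSIX) and a sort that understands -k. The function name largest_files is made up for illustration. Field 2 of the -ls output is the size in blocks, the same field the original command sorts on.

```shell
# Sketch only: list the N largest files under a directory, biggest first.
# "largest_files" is a hypothetical helper name, not part of any tool.
# Assumes find supports -ls (GNU and BSD find do; it is not in POSIX).
largest_files() {
    dir=${1:-.}     # directory to scan (default: current directory)
    count=${2:-10}  # how many entries to show (default: 10)
    # Field 2 of "find -ls" output is the size in blocks, so
    # "-k2,2nr" is the modern spelling of the old "+1nr -2".
    find "$dir" -type f -ls | sort -k2,2nr | head -n "$count"
}
```

Something like `largest_files ~/MIRROR/MINICPAN/authors/id 10` should then give the same kind of listing as the command line above.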
Sometimes, a shell command is enough. No need to make everything be Perl. And since the result was rather interesting, I thought I'd write about it for this month's column. Starting with the biggest item, here they are.
The massive distribution for the ``dot reader'' weighs in at nearly 25 megabytes! But this may just be an anomaly: the next release is far smaller, around 3 megabytes.
Peering in under the hood, we get a
du -sk * that reveals the reason
for the giant-sized module. The two large items there are:
    12220 dotreader-linux-v0.0.7
    10852 dotreader-linux-v0.0.7.tar.gz
And that's the problem with this module. It includes not one, but two binary distributions of the dot reader, configured for Linux on a 386 machine. Not very useful to most people, and especially not twice.
So, let that be a lesson to you module authors: be very careful when building a distro that you don't accidentally include intermediate or build files.
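One way to catch this before uploading is to unpack the finished tarball into a scratch directory and list anything suspiciously large. The sketch below is only an illustration: the function name fat_files and the one-megabyte default threshold are my own assumptions, and it assumes a gzip-compressed tarball plus a tar that accepts -C (GNU and BSD tar do).

```shell
# Sketch only: list files inside a .tar.gz that exceed a size threshold.
# "fat_files" and the 1 MB default are hypothetical choices for illustration.
fat_files() {
    tarball=$1
    min_bytes=${2:-1048576}   # report files larger than this many bytes
    workdir=$(mktemp -d) || return 1
    # Unpack into a scratch directory so nothing lands in the current one.
    tar -xzf "$tarball" -C "$workdir" || { rm -rf "$workdir"; return 1; }
    # "-size +Nc" matches files larger than N bytes.
    (cd "$workdir" && find . -type f -size +"${min_bytes}"c -print)
    rm -rf "$workdir"
}
```

Running it against a freshly built distro tarball before upload should print nothing, or at least nothing surprising.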
At 13 megabytes (nearly half the size of the previous item), we have
a module that describes itself in the
README using the paragraph:
These set of modules allow the parsing of files produced by the Affymetrix microarray system. If you don't know what one of those is, these are not the modules you're looking for.
And indeed, I don't. However, peering down into the distro to find
out where the 13 megabytes went, my trusty
du -sk * reveals that
the big item here is
testdata, dominated by the 39-megabyte (when uncompressed)
MASS-ATH1.CDF. In fact, that file is so large that
GNU Emacs questioned my sanity before I opened it, and after I opened
it, I questioned my own sanity. It seems to be lots and lots and lots
of data. I'm presuming this data wasn't hand-generated, or else David
J Craigon has far more time than I ever will. It'd be interesting if
this was the DNA of some cool worm or something, but I can't tell.
The moral here again is: don't put massive test data into your distro. But if you do, at least make sure it compresses nicely.
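The du -sk * drill-down used on both distros so far can be wrapped up so that the biggest entries float to the top. This is just a sketch: the function name biggest_entries is invented for illustration, and because it relies on the shell's * glob, it skips dotfiles and complains about an empty directory.

```shell
# Sketch only: show the largest top-level entries of a directory, in KB.
# "biggest_entries" is a hypothetical helper name.
biggest_entries() {
    dir=${1:-.}     # directory to inspect (default: current directory)
    count=${2:-5}   # how many entries to show (default: 5)
    # Run in a subshell so the caller's working directory is untouched.
    (cd "$dir" && du -sk -- * | sort -nr | head -n "$count")
}
```

Calling it on an unpacked distro, then again on whichever subdirectory comes out on top, is usually enough to find the fat.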
Oh, those whacky bioinformatics people, and their very large file sizes. Coming in at number 3 on our countdown, a mere 10 megabytes, is the module whose synopsis includes:
Bio::PrimerDesigner - Design PCR Primers using primer3 and epcr
Again, not my area, so I can't comment on the utility of this module.
But pulling out the disk usage report, I see the big culprit is a
bin/website_example.tar.gz. First, why is that in
But then opening it up in its own space, I see that it's an entire website with 22 megabytes of its own data, but luckily highly compressible stuff like:
>NC_000963 Rickettsia prowazekii strain Madrid E, complete genome.
ATGACAAAGCTAATTATTCACTTGGTTTCAGACTCTTCTGTGCAAACTGCAAAACATGCA
GCAAATTCTGCTCTTGCTCAATTTACTTCTATAAAACAAAAATTGTATCATTGGCCAATG
ATTAGAAATTGTGAATTACTAAATGAAGTATTAAGTAAAATAGAATCTAAACATGGAATA
GTATTATACACAATTGCTGATCAAGAACTCCGAAAAACTTTAACAAAATTTTGCTATGAA
TTAAAAATTCCATGTATTTCTGTAATAGGTAAAATTATTAAAGAAATGTCTGTTTTTTCA
GGTATTGAAATAGAAAAAGAACAAAATTATAATTATAAATTCGATAAAACTTATTTTGAT
ACACTCAATGCTATAGATTATGCTATAAGACATGATGATGGACAAATGATTAATGAATTA
TCAGAATCTGATATAATATTAATAGGTCCTTCTAGAACTTCTAAAACACCGACTTCCGTA
TTTTTAGCGTATAATGGTTTAAAAGCTGCAAATATTCCTTATGTTTATAATTGTCCATTT
CCTGATTTTATAGAAAAGGATATAGATCAATTAGTAGTAGGACTTGTTATTAATCCAAAT
AGGTTAATTGAGATAAGAGAAGCTAGATTAAATTTATTGCAAATTAATGAAAATAAAAGC
TATACAGATTTTAATATAGTACAAAGAGAGTGCATAGAAGTCAGAAAAATTTGTAATCAA
AGAAATTGGCCAGTGATTGATGTATCAACCAGATCAATAGAGGAAACAGCAGCTTTAATA
ATGCGAATATATTATAATAGAAAAAATAAATATCATAAATAAAAAGATTTTTCATTATTT
ACAAGTAGAAGTGACTAATTTATAATTTTATTTATTGCTTTTCGTTGTTATGAGTTAAAA
That's just the start... it goes for another 18000 lines like that. According to Wikipedia, ``Rickettsia is a genus of non-motile, Gram-negative, non-sporeforming, highly pleomorphic bacteria''. Cool. With this module, you can apparently build your own bacteria at home!
Again, if you're going to provide a sample, please make sure it's a small sample. (This also goes for visits to the doctor's office.)
At number four, we find (at just over 9 megabytes), the ``undocumented
utility garbage for our crossfire client''. This distro is pretty
cool, because the bulk of the distro is under
resources. In here,
I find a bunch of images, sounds, and game music that had me
distracted for about ten minutes, and if I had had an office mate,
would have annoyed him thoroughly.
One interesting subdirectory is
resources/fonts, which provides the TrueType
files for the
DejaVuSans font. A bit of netsearching shows that these
probably originally came from
dejavu.sf.net, ``a font family based on the
Bitstream Vera Fonts''. As a heavy user of the Vera fonts from the moment I
discovered them, I find this interesting. And those sounds will
likely find their way into some sort of beep or boop that my MacBook Pro will
be making soon.
Ahh, finally one I've heard of! At just under 9 megabytes, we find one of the most recent web application frameworks, Gantry (www.usegantry.org). Gantry seems to be in a similar space with Catalyst, Jifty, and Ruby on Rails, as a ``we do most of the work for you'' web framework.
The ``fat'' in the distro comes from the 9-megabyte
docs/contact2.mov, which is one of those ``here's how fast it is to
set up a website'' movies that Ruby on Rails started making so popular.
This one is three minutes long. Again, since the movie isn't
necessary to get Gantry up and running, it'd be nice if this was
simply linked, rather than pushed out to every CPAN mirror and Mini-CPAN.
Number 6 takes me across the pond to the UK, finding this 8-megabyte gem, which disclaims itself with the following text:
This module does nothing on its own, but is an add-on for Number::Phone::UK, to make the location() method more accurate. I decided to distribute it seperately because to include all this data as well as that for the national numbering scheme would make the module ridiculously big. Better to have this, which will not be needed so often and is updated far less often, as a seperate package.
Well, hooray for that, although we've still got a lot of data here. In fact,
drilling down into the
lib subdirectory, we find a 35-megabyte module
when uncompressed. Try loading that during a CGI hit. Wow.
Number 7 is none other than our next generation Perl6 core, from Audrey Tang and the crew. This thing is huge because there's a lot of meat to it, and very little in the way of binary data, with literally hundreds of contributors all typing away madly, making around 30 to 50 commits a day to bring you Perl6 in the very near future. Most of the text is in the way of examples and documents though, not actual installed code.
You can install this module by asking the CPAN shell for
Perl6::Pugs, but you'll need a lot of patience, and the GHC Haskell
compiler, and a lot of other CPAN prerequisites. However, if you're
patient enough, you can end up with the bleeding-edge Perl6 execution
Coming in at a bit under seven megabytes is a module I've used in a production
environment for a client,
PDF::API2. Not sure where the 2 comes from, but
this is a complete PDF creation and manipulation package. Unfortunately, it
seems to be a bit underdocumented for my small brane, so I had to do a lot of
fussing and fidgeting and net-searching for examples to get it to work.
When I first opened up the distro to see where the fat was, I expected the
bulk to be in
examples, but I was surprised to find 26 megabytes in
lib. Drilling down, I see 15 meg of that is in
lib/PDF/API2/Resource/CIDFont/CMap/*.pm, which apparently provide a huge
amount of mapping information for CJK fonts to and from Unicode. I won't even
pretend to understand or explain that.
Surprisingly enough, another 6 megabytes belongs to Yet Another Copy of the DejaVu TrueType fonts! (I'm getting a bad case of Deja Vu here.) If these guys would get together with the CFPlus people, we could have one copy in the CPAN, with multiple dependencies, instead of multiple copies.
The brilliant Russian math professor Ilya Zakharevich, who brought us the core of the Perl5 regular expression engine and greatly expanded the Perl5 debugger implementation, also brings us this 6-megabyte binary. I can only guess that it is a version of Perl 5.8.2 precompiled for OS/2, updated at the end of 1993 (which doesn't make sense because Perl 5.8.2 wasn't around in 1993), but I didn't bother unzipping it for fear that it would want to take over my system somehow.
I'm not sure my Mini-CPAN tool should mirror this, since the only
unique Perl module to install from this distro is
OS2::Process::Const, according to the CPAN index. And I certainly
don't need that. But there it is, coming in at number 9 in my list of
largest Mini-CPAN modules.
And finally, rounding out our top 10 items, we find the 6-megabyte Tk
bindings for Perl, commonly known as ``Perl-Tk''. By ``bindings'', I
should actually explain that this module source contains the entire Tk
distro, so you don't need to hunt down a separate download. Just
Tk at your CPAN shell, and after a few dozen minutes of
chugging away, you have the first (and most completely documented)
screen widgets for Perl to project on an X11 display server.
The Perl-Tk project was driven by Nick Ing-Simmons for many years. I regrettably say ``was'', as Nick passed away just a few months ago. He was a huge contributor to the Perl community, and is sadly missed. But thanks to his efforts, Perl moved from being a simple text-based language into a full graphics-driving widget engine. Thank you, Nick.
Well, that's about all the space I have for this month's column. I hope you've learned a bit about what not to put into your next CPAN distribution. Until next time, enjoy!