Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 90 (Feb 2007)

[suggested title: ``The Big Modules in the Mini-CPAN'']

I recently attended a presentation by Adam Kennedy on his massive CPAN testing effort, at the third Open Source Developers' Conference (osdc.com.au) in Melbourne (Victoria, Australia). In the presentation, Adam referred to a number of large CPAN modules, and this got me thinking about what other large distributions might be in there.

As I retreated to my hotel that night, I refreshed my ``mini CPAN'', using the script I wrote for [this column, Nov 2002], also while in Australia. (What is it about Australia that gets me thinking about the CPAN?) Although the code in that article became the basis for CPAN::Mini (in the CPAN), I still use my original code. Over the past year or so, the Mini-CPAN has grown to the point where it will soon no longer fit on a CD-ROM (even though fitting on a CD was the original goal). Now inspired by Adam's talk, I wanted to see what the largest items were in this Mini-CPAN mirror.

I initially thought of writing a quick Perl script to go through the files of the Mini-CPAN (perhaps using my own File::Finder module), but I decided to go remarkably low tech, using a simple command line:

  cd ~/MIRROR/MINICPAN/authors/id
  find . -type f -ls | sort -k2,2nr | head -10

Sometimes, a shell command is enough. No need to make everything be Perl. And since the result was rather interesting, I thought I'd write about it for this month's column. Starting with the biggest item, here they are.
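
The command line above ranks files by the second column of find -ls, which is a block count rather than a byte count. As a minimal sketch (not the command used for this column), the same idea can be wrapped in a shell function that ranks by byte size instead. The function name largest_files is hypothetical, and the sketch assumes the find -ls layout of GNU and BSD find, where the size in bytes is the seventh column and owner and group names contain no spaces:

```shell
# largest_files DIR [COUNT]
# Hypothetical helper: list the COUNT largest regular files under DIR.
# Assumes the `find -ls` layout of GNU and BSD find, where column 7
# is the file size in bytes (owner/group names must not contain spaces).
largest_files() {
  dir=${1:-.}
  count=${2:-10}
  # Sort numerically, largest first, on the size column only.
  find "$dir" -type f -ls | sort -k7,7nr | head -n "$count"
}

# Example (path taken from the column; adjust to your own mirror):
#   largest_files ~/MIRROR/MINICPAN/authors/id 10
```

Sorting on bytes rather than blocks should give the same ranking for files this large, but it avoids depending on the block size that a particular find reports.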

EWILHELM/dotReader-v0.0.7.tar.gz

The massive distribution for the ``dot reader'' weighs in at nearly 25 megabytes! But this may just be an anomaly: the next release is far smaller, around 3 megabytes.

Peering in under the hood, we get a du -sk * that reveals the reason for the giant-sized module. The two large items there are:

  12220 dotreader-linux-v0.0.7
  10852 dotreader-linux-v0.0.7.tar.gz

And that's the problem with this module. It includes not one, but two binary distributions of the dot reader, configured for Linux on a 386 machine. Not very useful to most people, and especially not twice.

So, let that be a lesson to you module authors: be very careful when building a distro that you don't accidentally include intermediate or build files.
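
For module authors who want to run the same check on their own tarball before uploading, here is a rough sketch of the unpack-and-du step described above. The function name biggest_entries is hypothetical, and the sketch assumes a gzipped tarball that unpacks into a single top-level directory, as CPAN distributions conventionally do:

```shell
# biggest_entries TARBALL [COUNT]
# Hypothetical helper: unpack a .tar.gz into a scratch directory and
# report its largest top-level entries in kilobytes, largest first.
# Assumes the tarball unpacks into exactly one top-level directory.
biggest_entries() {
  tarball=$1
  count=${2:-5}
  scratch=$(mktemp -d) || return 1
  tar -xzf "$tarball" -C "$scratch" || { rm -rf "$scratch"; return 1; }
  # Run du in a subshell so the caller's working directory is untouched.
  ( cd "$scratch"/*/ && du -sk * | sort -nr | head -n "$count" )
  rm -rf "$scratch"
}

# Example (file name taken from the column):
#   biggest_entries dotReader-v0.0.7.tar.gz
```

Anything near the top of that report that is not source, documentation, or a small test fixture deserves a second look before release.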

SKIPPY/Bio-Affymetrix-0.1.tar.gz

At 13 megabytes (nearly half the size of the previous item), we have a module that describes itself in the README using the paragraph:

And indeed, I don't. However, peering down into the distro to find out where the 13 megabytes went, my trusty du -sk * reveals that the big item here is testdata, dominated by the 39-megabyte (when uncompressed) MASS-ATH1.CDF. In fact, that file is so large that GNU Emacs questioned my sanity before I opened it, and after I opened it, I questioned my own sanity. It seems to be lots and lots and lots of data. I'm presuming this data wasn't hand-generated, or else David J Craigon has far more time than I ever will. It'd be interesting if this was the DNA of some cool worm or something, but I can't tell.

The moral here again is: don't put massive test data into your distro. But if you do, at least make sure it compresses nicely.
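
One way to see how nicely a data file compresses before shipping it is to compare its raw size with its gzipped size. The following is only a sketch, and the function name gzip_sizes is hypothetical. It prints both numbers in bytes:

```shell
# gzip_sizes FILE
# Hypothetical helper: print the raw size and the gzip-compressed size
# of FILE in bytes, to estimate how much the file will add to a
# .tar.gz distribution.
gzip_sizes() {
  file=$1
  # Arithmetic expansion strips the padding some wc versions emit.
  raw=$(( $(wc -c < "$file") ))
  packed=$(( $(gzip -c "$file" | wc -c) ))
  echo "$raw $packed"
}

# Example (file name taken from the column):
#   gzip_sizes testdata/MASS-ATH1.CDF
```

The second number approximates what every mirror will carry. The first is what every user who unpacks the distro will pay in disk space.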

SMCKAY/Bio-PrimerDesigner-0.01.tar.gz

Oh, those whacky bioinformatics people, and their very large file sizes. Coming in at number 3 on our countdown, a mere 10 megabytes, is the module whose synopsis includes:

  Bio::PrimerDesigner - Design PCR Primers using primer3 and epcr

Again, not my area, so I can't comment on the utility of this module. But pulling out the disk usage report, I see the big culprit is a 9-megabyte bin/website_example.tar.gz. First, why is that in bin?

But then opening it up in its own space, I see that it's an entire website with 22 megabytes of its own data, but luckily highly compressible stuff like:

    >NC_000963 Rickettsia prowazekii strain Madrid E, complete genome.
    ATGACAAAGCTAATTATTCACTTGGTTTCAGACTCTTCTGTGCAAACTGCAAAACATGCA
    GCAAATTCTGCTCTTGCTCAATTTACTTCTATAAAACAAAAATTGTATCATTGGCCAATG
    ATTAGAAATTGTGAATTACTAAATGAAGTATTAAGTAAAATAGAATCTAAACATGGAATA
    GTATTATACACAATTGCTGATCAAGAACTCCGAAAAACTTTAACAAAATTTTGCTATGAA
    TTAAAAATTCCATGTATTTCTGTAATAGGTAAAATTATTAAAGAAATGTCTGTTTTTTCA
    GGTATTGAAATAGAAAAAGAACAAAATTATAATTATAAATTCGATAAAACTTATTTTGAT
    ACACTCAATGCTATAGATTATGCTATAAGACATGATGATGGACAAATGATTAATGAATTA
    TCAGAATCTGATATAATATTAATAGGTCCTTCTAGAACTTCTAAAACACCGACTTCCGTA
    TTTTTAGCGTATAATGGTTTAAAAGCTGCAAATATTCCTTATGTTTATAATTGTCCATTT
    CCTGATTTTATAGAAAAGGATATAGATCAATTAGTAGTAGGACTTGTTATTAATCCAAAT
    AGGTTAATTGAGATAAGAGAAGCTAGATTAAATTTATTGCAAATTAATGAAAATAAAAGC
    TATACAGATTTTAATATAGTACAAAGAGAGTGCATAGAAGTCAGAAAAATTTGTAATCAA
    AGAAATTGGCCAGTGATTGATGTATCAACCAGATCAATAGAGGAAACAGCAGCTTTAATA
    ATGCGAATATATTATAATAGAAAAAATAAATATCATAAATAAAAAGATTTTTCATTATTT
    ACAAGTAGAAGTGACTAATTTATAATTTTATTTATTGCTTTTCGTTGTTATGAGTTAAAA

That's just the start... it goes for another 18000 lines like that. According to Wikipedia, ``Rickettsia is a genus of non-motile, Gram-negative, non-sporeforming, highly pleomorphic bacteria''. Cool. With this module, you can apparently build your own bacteria at home!

Again, if you're going to provide a sample, please make sure it's a small sample. (This also goes for visits to the doctor's office.)

MLEHMANN/CFPlus-0.95.tar.gz

At number four, we find (at just over 9 megabytes) the ``undocumented utility garbage for our crossfire client''. This distro is pretty cool, because the bulk of the distro is under resources. In here, I find a bunch of images, sounds, and game music that had me distracted for about ten minutes, and if I had had an office mate, would have annoyed him thoroughly.

One interesting subdirectory is resources/fonts, which provides the TrueType files for the DejaVu Sans font. A bit of netsearching shows that these probably originally came from dejavu.sf.net, ``a font family based on the Bitstream Vera Fonts''. As a heavy user of the Vera fonts from the moment I discovered them, I found this an interesting discovery. And those sounds will likely find their way into some sort of beep or boop that my MacBook Pro will be making soon.

PHILCROW/Gantry-3.42.tar.gz

Ahh, finally one I've heard of! At just under 9 megabytes, we find one of the most recent web application frameworks, Gantry (www.usegantry.org). Gantry seems to be in a similar space with Catalyst, Jifty, and Ruby on Rails, as a ``we do most of the work for you'' web framework.

The ``fat'' in the distro comes from the 9-megabyte docs/contact2.mov, which is one of those ``here's how fast it is to set up a website'' movies that Ruby on Rails started making so popular. This one is three minutes long. Again, since the movie isn't necessary to get Gantry up and running, it'd be nice if this was simply linked, rather than pushed out to every CPAN mirror and Mini-CPAN.

DCANTRELL/Number-Phone-UK-DetailedLocations-1.2.tar.gz

Number 6 takes me across the pond to the UK, finding this 8-megabyte gem, which disclaims itself with the following text:

Well, hooray for that, although we've still got a lot of data here. In fact, drilling down into the lib subdirectory, we find a 35-megabyte module when uncompressed. Try loading that during a CGI hit. Wow.

AUDREYT/Perl6-Pugs-6.2.13.tar.gz

Number 7 is none other than our next generation Perl6 core, from Audrey Tang and the crew. This thing is huge because there's a lot of meat to it, and very little in the way of binary data, with literally hundreds of contributors all typing away madly, making around 30 to 50 commits a day to bring you Perl6 in the very near future. Most of the text is in the way of examples and documents though, not actual installed code.

You can install this module by asking the CPAN shell for Perl6::Pugs, but you'll need a lot of patience, and the GHC Haskell compiler, and a lot of other CPAN prerequisites. However, if you're patient enough, you can end up with the bleeding-edge Perl6 execution engine.

AREIBENS/PDF-API2-0.55.tar.gz

Coming in at a bit under seven megabytes is a module I've used in a production environment for a client, PDF::API2. Not sure where the 2 comes from, but this is a complete PDF creation and manipulation package. Unfortunately, it seems to be a bit underdocumented for my small brane, so I had to do a lot of fussing and fidgeting and net-searching for examples to get it to work.

When I first opened up the distro to see where the fat was, I expected the bulk to be in examples, but I was surprised to find 26 megabytes (uncompressed) in lib. Drilling down, I see 15 meg of that is in lib/PDF/API2/Resource/CIDFont/CMap/*.pm, which apparently provide a huge amount of mapping information for CJK fonts to and from Unicode. I won't even pretend to understand or explain that.

Surprisingly enough, another 6 megabytes belongs to Yet Another Copy of the DejaVu TrueType fonts! (I'm getting a bad case of Deja Vu here.) If these guys would get together with the CFPlus people, we could have one copy in the CPAN, with multiple dependencies, instead of multiple copies.

ILYAZ/os2/582+/perl_mlb.zip

The brilliant Russian math professor Ilya Zakharevich, who brought us the core of the Perl5 regular expression engine and greatly expanded the Perl5 debugger implementation, also brings us this 6-megabyte binary. I can only guess it is a version of Perl 5.8.2 precompiled for OS/2, updated at the end of 1993 (which doesn't make sense because Perl 5.8.2 wasn't around in 1993), but I didn't bother unzipping it for fear that it would want to take over my system somehow.

I'm not sure my Mini-CPAN tool should mirror this, since the only unique Perl module to install from this distro is OS2::Process::Const, according to the CPAN index. And I certainly don't need that. But there it is, coming in at number 9 in my list of largest Mini-CPAN modules.

NI-S/Tk-804.027.tar.gz

And finally, rounding out our top 10 items, we find the 6-megabyte Tk bindings for Perl, commonly known as ``Perl-Tk''. By ``bindings'', I should actually explain that this module source contains the entire Tk distro, so you don't need to hunt down a separate download. Just install Tk at your CPAN shell, and after a few dozen minutes of chugging away, you have the first (and most completely documented) screen widgets for Perl to project on an X11 display server.

The Perl-Tk project was driven by Nick Ing-Simmons for many years. I regrettably say ``was'', as Nick passed away just a few months ago. He was a huge contributor to the Perl community, and is sadly missed. But thanks to his efforts, Perl moved from being a simple text-based language into a full graphics-driving widget engine. Thank you, Nick.

Well, that's about all the space I have for this month's column. I hope you've learned a bit about what not to put into your next CPAN distribution. Until next time, enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.