Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 22 (Oct 1998)

You can save a lot of time by using prewritten modules effectively. Many modules are included with the Perl distribution, but an enormous number are available (for free!) in the Comprehensive Perl Archive Network (called the CPAN). If you're new to the idea of downloadable modules for Perl, you should browse http://www.perl.com/CPAN/CPAN.html to get a feel for what's available.

Installing CPAN modules has even been made pretty easy with the CPAN.pm module (included with Perl). For example, if I need the Foo::Bar module from the CPAN, it's as simple as typing:

    $ perl -MCPAN -eshell
    cpan shell -- CPAN exploration and modules installation (vX.XX)
    ReadLine support enabled
    cpan> install Foo::Bar
    [messages about fetching, unpacking,
    compiling, testing, and installing appear here]
    cpan> quit

The first time you do this, you might have to answer some questions about the way to fetch things from the net, or where the nearest CPAN archive is located. Use http://www.perl.com/CPAN/ if you're not sure. Also, if you're not the system administrator, you'll need to add PREFIX=/some/path/you/can/write to the makepl_arg configuration parameter to install the binaries, modules, and documentation below that PREFIX, rather than the system directories. See perldoc CPAN for more information.
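One thing to watch with a private PREFIX: Perl won't search that directory on its own, so a program has to be told where the modules ended up. Here's a minimal sketch using the standard lib pragma. The directory name is only an assumption for illustration; the exact subdirectory below PREFIX depends on your Perl version and configuration, so check the messages from the install step for the real location.

```perl
use strict;
use warnings;

# Hypothetical location: wherever the install step put the module files
# below your PREFIX. Adjust this to match what the installer reported.
use lib "/some/path/you/can/write/lib/perl5";

# "use lib" puts the directory at the front of @INC, so modules there
# are found before the system-wide ones.
print "First search directory: $INC[0]\n";
```

Setting the PERL5LIB environment variable to the same directory does the same job without editing each program.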

So, let's take a look at a task that was made tremendously easier using the CPAN. The other day, I was thinking about the rec.humor.funny newsgroup, which gets a mere two postings a day of some relatively funny jokes. However, I don't always read that newsgroup every day, and I miss some of the jokes, because they expire off my news-server before I get to them.

So I decided to write a program that I could run on a regular basis (like nightly from a cron job) to connect to the NNTP server, fetch the jokes, and send them to standard output (which will get mailed to me from a cron job). At first, that sounds like it might be a lot of work, because talking to an NNTP server would seem to require knowing about sockets and the NNTP protocol. Not so.

Graham Barr has written a nice module called Net::NNTP that handles all the greasy stuff behind the scenes to talk to an NNTP server. If you don't have it installed yet, it's simple to get, because it's in the CPAN!

Once installed, using the module is pretty easy. First, I'll add the appropriate use directive to my program:

    use Net::NNTP;

Next, I'll define the news server location and the group I want to read as file-level scalars with uppercase names, to remind me that they're configuration settings:

    my $SERVER = "nntp.your-isp-goes-here.com";
    my $GROUP = "rec.humor.funny";

The name of the news server needs to be appropriate to wherever you read news from.

Now, we need to connect to the server using the code provided by Net::NNTP. This lets us "talk" to the server via a connection object, here stored in $c:

    my $c = Net::NNTP->new($SERVER)
      or die "Cannot open NNTP: $!";

Now we need to hop over to the rec.humor.funny group:

    my ($arts,$low,$high) =
      $c->group($GROUP)
      or die "Cannot go to $GROUP: $!";

The return value tells how many articles are in the group, along with a minimum and maximum article number. We can use that to scan through all possible article numbers and dump them out. Let's do that with a foreach loop:

    foreach my $artnum ($low..$high) {
      my $art = $c->article($artnum) or next;

If an article doesn't exist (perhaps a cancellation or a different expiration date), we skip over to the next article number. The value of $art here is either undef, or a listref pointing to the full text of the article. If it's the listref, we'll just dump it out.

      print "=== article $artnum ===\n";
      print @$art;
    }
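If you'd like to see the skip-on-missing behavior without a news server handy, here's a small sketch. It swaps in a hypothetical FakeNNTP class (my own stand-in, not part of Net::NNTP) whose article method returns a listref of lines, or undef for a missing number, which is the only behavior of the real method that the loop relies on. The article numbers and text are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Net::NNTP connection: it only mimics
# article(), returning a listref of lines or undef if missing.
package FakeNNTP;

sub new {
  my ($class, %articles) = @_;
  return bless { articles => {%articles} }, $class;
}

sub article {
  my ($self, $artnum) = @_;
  return $self->{articles}{$artnum};
}

package main;

# Made-up articles; number 102 is deliberately absent.
my $c = FakeNNTP->new(
  101 => ["Subject: first\n",  "\n", "Joke one.\n"],
  103 => ["Subject: second\n", "\n", "Joke two.\n"],
);

my ($low, $high) = (101, 103);
my $out = "";
foreach my $artnum ($low..$high) {
  my $art = $c->article($artnum) or next;   # skips 102
  $out .= "=== article $artnum ===\n";
  $out .= join("", @$art);
}
print $out;
```

Only articles 101 and 103 get a banner; the loop passes over 102 quietly, just as it would for a cancelled or expired article on a real server.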

And that would be a good program that successfully dumps out all the articles in rec.humor.funny. The format's not very pretty though... it has all the headers and the silly common .signature on each posting. It also includes all the administrivia messages. Let's fix that.

We could probably write some quick regular expressions to modify the article text, but let's steal some additional resources from the CPAN again. In this case, it's Mail::Internet, also by Graham Barr, which understands RFC822 mail, which happens to be the same format as a news message. Now our sample program starts like this:

    use Net::NNTP;
    use Mail::Internet;
    my $SERVER = "nntp.your-isp-goes-here.com";
    my $GROUP = "rec.humor.funny";
    my $c = Net::NNTP->new($SERVER)
      or die "Cannot open NNTP: $!";
    my ($arts,$low,$high) =
      $c->group($GROUP)
      or die "Cannot go to $GROUP: $!";

And except for the additional use line, that's the same so far. We'll also need the same article loop:

    for my $artnum ($low..$high) {
      my $art = $c->article($artnum) or next;

But here's where we diverge now. We take the listref returned in $art, and build a mail message object from it:

      my $mail = Mail::Internet->new($art);

Now we can look at this message as a mail message using the methods defined for it. First, let's skip over the administrivia messages:

      next if $mail->head->get("From") =~ /netfunny\.com/;

The expression $mail->head returns a Mail::Header object for the message, which in turn has a get method to extract a particular field. If that field matches the administrivia address domain, then we skip over the article. Next, we'll dump the same banner as before:

      print "=== article $artnum ===\n";

But now that we have a mail message object, we can massage it a bit. Let's remove the signature, and clean up any extra whitespace:

      $mail->remove_sig;
      $mail->tidy_body;

And print just those headers that we're interested in (the subject, date, and original submitter):

      for my $tag (qw(Subject Date From)) {
        print "$tag: ", $mail->head->get($tag);
      }

And finally, dump the cleaned up body:

      print "\n", @{$mail->body};
    }
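The cleanup steps can also be tried offline. This sketch assumes Mail::Internet is installed and feeds it a hand-built listref in place of a fetched article; the headers, joke, and signature are placeholders I made up for illustration. The output is collected in $out only so that it's easy to inspect.

```perl
use strict;
use warnings;
use Mail::Internet;

# A made-up article in the listref-of-lines form that the article
# method returns. Addresses and text are placeholders.
my $art = [
  "From: someone\@example.com\n",
  "Subject: A sample joke\n",
  "Date: Thu, 1 Oct 1998 12:00:00 GMT\n",
  "\n",
  "\n",
  "Why did the chicken cross the road?\n",
  "To get to the other side.\n",
  "\n",
  "-- \n",
  "A sample signature\n",
];

my $mail = Mail::Internet->new($art);
$mail->remove_sig;   # drops the "-- " line and what follows it
$mail->tidy_body;    # trims blank lines at the start and end

my $out = "";
for my $tag (qw(Subject Date From)) {
  # get returns the field with its trailing newline still attached
  $out .= "$tag: " . $mail->head->get($tag);
}
$out .= "\n" . join("", @{$mail->body});
print $out;
```

What comes out should be the three chosen headers, a blank line, and the two lines of the joke, with the signature and stray blank lines gone.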

There. Nicer, without a lot of hassle. So, now that we have a complete dump of all rec.humor.funny articles currently on the news server, cleaned up in a nice way, what next?

Well, we still have a problem here. It's dumping all articles every time. And when's the last time you wanted to hear the same joke twice in two days (or as many days as it stays on your system)?

We need some memory to let us know what we've dumped. Fortunately, there's a standard memory called a newsrc file, in a common format that most newsreaders understand. And, since it's a common format, there's (once again) a module in the CPAN that can deal with it. In this case, it's the News::Newsrc module, by Steven McDougall.

So, let's add some memory to the program.

    use Net::NNTP;
    use Mail::Internet;
    use News::Newsrc;
    my $SERVER = "nntp.your-isp-goes-here.com";
    my $GROUP = "rec.humor.funny";

Again, nearly the same header, but we've added the use line for the newest module in the list. Next, we'll fetch the current newsrc file by creating a newsrc object:

    my $newsrc = News::Newsrc->new;
    $newsrc->load;

Which brings up a point... we're using the same newsrc file here that my newsreader also uses, which means that this program will know which rec.humor.funny articles I've already read, either via this program, or via my normal newsreader! Nice.

Some more unchanged stuff from the previous version:

    my $c = Net::NNTP->new($SERVER)
      or die "Cannot open NNTP: $!";
    my ($arts,$low,$high) =
      $c->group($GROUP)
      or die "Cannot go to $GROUP: $!";

But now we no longer want to cycle through all the articles from $low to $high. We want to hit only the articles that we've not seen. This is called unmarked in newsrc jargon, so we'll use the appropriately named method:

    my @unmarked =
      $newsrc->unmarked_articles
      ($GROUP, $low, $high);

Now @unmarked is a list of article numbers that are potentially on the news server (unless they've been cancelled) and that I haven't already seen. Let's cycle through them:

    for my $artnum (@unmarked) {
      my $art = $c->article($artnum) or next;

If the article has been fetched, we'll mark it. That way, I'll see it only once:

      $newsrc->mark($GROUP, $artnum);

And now the rest of the loop body looks the same as before:

      my $mail = Mail::Internet->new($art);
      next if $mail->head->get("From") =~ /netfunny\.com/;
      print "=== article $artnum ===\n";
      $mail->remove_sig;
      $mail->tidy_body;
      for my $tag (qw(Subject Date From)) {
        print "$tag: ", $mail->head->get($tag);
      }
      print "\n", @{$mail->body};
    }

Finally, we need to update the newsrc file to reflect the additionally read articles. Again, all the hard work is done... we just need to invoke a method to do the right thing behind the scenes.

    $newsrc->save;
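To get a feel for how marking and unmarked_articles interact without touching a real newsrc file, here's a small in-memory sketch. It assumes News::Newsrc is installed; the article numbers are made up, and since nothing is loaded or saved, it has no effect on your newsreader.

```perl
use strict;
use warnings;
use News::Newsrc;

my $GROUP = "rec.humor.funny";

# Start with an empty in-memory newsrc; nothing is loaded or saved,
# so your real newsrc file is never touched.
my $newsrc = News::Newsrc->new;
$newsrc->add_group($GROUP);

# Pretend articles 100 through 104 and 106 were already read.
$newsrc->mark_range($GROUP, 100, 104);
$newsrc->mark($GROUP, 106);

# Ask which numbers in the server's (made-up) range are still unread.
my ($low, $high) = (100, 108);
my @unmarked = $newsrc->unmarked_articles($GROUP, $low, $high);
print "unread: @unmarked\n";
```

Only 105, 107, and 108 remain unread here, which is the list the article loop would walk through.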

So a fairly short program is now a tiny newsreader, updating the newsrc file, and even rejecting unwanted administrivia articles. And it took me under a half hour to write and debug. This is, indeed, the power of the CPAN. Enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.