Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, before their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 82 (Jun 2006)

[Suggested title: ``Web 2.0 meets Usenet 1.0'']

The ``new'' web is all shiny, with user-collaborative reviews and comments, AJAX interactions, and RSS feeds to track all those blogs and podcasts. But before we had all that nice IP traffic, we were still communicating ``over the net'', via email, mailing lists, and Usenet. Yes... Usenet, the original ``distributed bulletin board'' system, gave us our soapboxes to scream and rant and ask and answer, and have our message be distributed to ``thousands of machines around the net'', as Larry Wall's rn program used to warn us when we made a post.

However, while the new generation of ``net'' users focuses on direct-IP communication (through the web, in blogs, and with instant messaging and IRC), legacy systems like Usenet are still chugging along behind the scenes, operating more or less as they have since the beginning in 1979.

For example, a portion of the Usenet newsgroups are moderated, meaning that articles posted to them aren't immediately distributed to the world, but instead are mailed to a moderator for approval before posting. comp.lang.perl.moderated operates in this fashion (as the name implies), which theoretically means the group has more light of knowledge and less heat from flames. A dedicated group of moderators manages the group, including Stonehenge trainer Tad McClellan.

Similarly, I'm the primary moderator for comp.lang.perl.announce (CLPA), an announcements newsgroup for new or updated Perl software. I was selected for this position when the newsgroup was being created, and spend a few minutes a day making sure announcements get out in a timely fashion. At one point, CLPA was also gatewayed into a mailing list, allowing people to get frequent new-perl-code announcements directly into their email without having to find the Usenet group.

Over the years, CLPA has become a bit quieter, getting a posting only every few days from a handful of dedicated CPAN contributors. On my long list of items waiting for ``round tuits'', I had observed that the list of new and updated modules in the CPAN would be well within the charter of CLPA, but didn't want to write the necessary tools to scrape the frequently updated module list to find the differences, and certainly wasn't interested in doing such work by hand.

However, I recently noted that search.cpan.org, my favorite webview into the CPAN, has a public RSS feed of new modules going back a few days, along with a direct link to get more information. A-ha! Finally, with a bit of automation, I could start pumping timely data into CLPA. By bolting together a few CPAN modules, I produced a nightly-run ``CPAN 2 CLPA'' program, presented in [listing one, below].

Lines 1 and 2 define the path to Perl, enable warnings, and turn on strict mode, as always.

Lines 4 to 14 provide the ``user serviceable parts'' for things below. I get my home directory using a glob trick, although this is probably on par with the mystery of:

  my $HOME = (getpwuid $<)[7];

I'm not sure whether counting on a glob of tilde is more or less portable than getting the eighth value of the password file entry for the current user, but in any case there's more than one way to do it.
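In fact, the two approaches can be cascaded so that a failure of one doesn't leave us without a home directory. A minimal sketch (the fallback order and the $ENV{HOME} last resort are my own additions, not part of the column's program):

```perl
use strict;
use warnings;

# Three ways to locate the home directory; each can fail in odd
# environments, so cascading them is one defensive option.
my ($HOME) = glob "~";            # shell-style tilde expansion
$HOME = (getpwuid $<)[7] unless defined $HOME and length $HOME;
$HOME = $ENV{HOME}       unless defined $HOME and length $HOME;

print "home directory: $HOME\n";
```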

From the home directory, I derive paths for the two data directories used by this program. The XML::RSS::Feed module needs a place to keep information about RSS headlines that have already been seen, so we'll throw those into $RSS_TEMP_DIR. And HTTP::Cache::Transparent needs a local cache area, which we'll put into $HTTP_CACHE_TEMP_DIR.

Finally, once I have a posting, I have to push it into the news network, so I list my (not-real) NNTP host from my ISP, along with my personal authentication credentials.

Lines 16 to 20 pull in the needed modules. The Encode module is included with the core Perl distribution, and is used here to force text into plain ASCII, replacing any characters that won't fit with a substitution character. The remaining modules are found in the CPAN, and will be described as they are used below.

Line 22 works around a bug in the XML::RSS::Feed module: if you give a path that doesn't exist (which I've done, more than once), XML::RSS::Feed does not create the directory for you, and doesn't tell you that it's not there, so you simply get confusing behavior (all headlines are always marked new).

Line 23 enables the transparent web cache. Most modern RSS generators can take advantage of client-side caching to reduce the traffic and CPU load. If a web client already has a prior fetch of an RSS feed, the client can include the modification time of that fetch along with the next request, and the server can say ``nope, you've already got the latest version''. Normally, LWP::UserAgent does no caching of prior fetches, but dropping in HTTP::Cache::Transparent modifies the behavior of LWP so that caching is performed automatically, much as if a proxy cache server were inserted upstream. It's quite a nice module, and can be used to improve many web-fetching scenarios for cooperating servers.
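A minimal sketch of the technique in isolation (the cache path and the Verbose flag here are my own choices for illustration): once init() has been called, a repeated fetch of the same URL can be answered from the on-disk cache when the server cooperates with If-Modified-Since.

```perl
use strict;
use warnings;
use HTTP::Cache::Transparent ();   # from the CPAN
use LWP::Simple qw(get);

# init() must happen before any fetches; from then on, every
# request made through LWP in this process consults the cache.
HTTP::Cache::Transparent::init({
  BasePath => "/tmp/httpcache-demo",  # hypothetical cache directory
  Verbose  => 1,                      # report hits and misses on STDERR
});

my $first  = get "http://search.cpan.org/recent";  # goes to the network
my $second = get "http://search.cpan.org/recent";  # cached if unchanged
```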

Lines 25 to 29 set up the XML::RSS::Feed object, representing our source data stream. The URL was obtained from the ``RSS 1.0'' button on the http://search.cpan.org/recent page, although like many modern websites, the RSS information is also in a metatag link, available in modern browsers through a separate user interface for easy grabbing.

Line 31 provides an accumulator for the output of this program. I had originally just printed the information to STDOUT, but then I realized I didn't want to post an article if there were no new items; so I replaced all the print operations with push @OUTPUT, to save the data.

Line 33 uses LWP::Simple's get function to grab the RSS data. Because LWP::Simple uses LWP::UserAgent underneath, and we've modified LWP::UserAgent to cache the fetches, we're actually performing a cached fetch.

Line 34 parses the RSS feed data, as copied from the XML::RSS::Feed manpage example. Lines 35 to 41 process each ``new'' headline, as determined by XML::RSS::Feed to be something that we haven't seen before.

For each new headline, lines 36 to 40 grab the text of the headline, the URL for further information (here, the detailed page on the updated module), and the one-line text description as provided by search.cpan.org, and push them onto the end of @OUTPUT, followed by a separator.

Now, if we make it all the way to line 43, and we still don't have any output, there's no point in posting a news message, because it'll be empty. We might have no output if people stopped submitting things to the CPAN (unlikely) or something has broken in the CPAN indexer or CPAN mothership (rare, but it can and has happened), or something is broken in search.cpan.org's update of the RSS feed (also rare, but it also has happened). Hopefully, on the next day's run, we'll get everything that was missed from the time before.

Line 45 cleans up the output just a bit, removing the final dash-line so the item list doesn't end with a dangling separator. At this point, @OUTPUT is the guts of a news posting that I want to make to CLPA, but I'll still need some wrapper headers and footers to make it nice.

Lines 47 to 51 fire up a connection to my ISP's news host, authenticating if needed and verifying that I can post: an essential requirement here.

Line 53 shoves the name of my dot-signature file into @ARGV, so that I can easily open and read it with a diamond read below.
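The trick is worth isolating: assigning a list of filenames to @ARGV makes the diamond operator read those files as if they had been named on the command line, with no explicit open. A tiny self-contained sketch (the file name is a stand-in for illustration, not the real .signature):

```perl
use strict;
use warnings;

my $file = "/tmp/demo-signature.txt";   # hypothetical stand-in file
open my $fh, ">", $file or die "create $file: $!";
print $fh "-- \nJust another Perl hacker,\n";
close $fh;

# The diamond operator now reads from $file automatically.
@ARGV = ($file);
my @lines = <>;
unlink $file;

print @lines;
```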

Lines 55 through 75 wrap the @OUTPUT variable with the boilerplate headers and footers for a full news posting. Splitting the here-document's single string on newlines replaces the contents of @OUTPUT. Note that the here-document is double-quote interpolated, because the keyword END is enclosed in double quotes. This gives me a simple templating strategy, because any scalar or array variables within the here-document will be expanded.

Lines 56 to 59 provide the news-posting header text. Note that the @ in line 58 had to be escaped, or else Perl would have looked for a variable named @stonehenge, which would have failed to compile because this program runs under use strict.

Line 59 bears further examination. The outer @{ .. } is an array interpolation, but the value that it interpolates results from the square-bracketed expression [ .. ]. Thus, we have an expression computed within a double-quoted string, providing some data for the interpolation. The unpack extracts the day of week and the date from a scalar-value gmtime expansion, using unpack operations that I described in [last month's column].
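A reduced version shows the idiom in isolation; pinning the clock to the epoch makes the result predictable, since scalar gmtime always uses the fixed ctime-style English format:

```perl
use strict;
use warnings;

# @{[ ... ]} interpolates an arbitrary expression into a string:
# the inner [ ... ] builds an anonymous array from the expression,
# and the outer @{ ... } dereferences and interpolates it.
my $stamp = gmtime 0;    # "Thu Jan  1 00:00:00 1970"
my $subject = "new CPAN modules on @{[unpack 'A10 x10 A*', $stamp]}";

print "$subject\n";      # prints "new CPAN modules on Thu Jan  1 1970"
```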

Lines 61 to 64 include some boilerplate text above the list of headlines. Line 66 interpolates the original @OUTPUT variable into this string. I can't let the original elements remain separate, or I'll get the single-space-between-elements mess that seems to trouble beginners. (Honestly, I originally had a bare @OUTPUT there, and couldn't figure out where the space was coming from myself!)
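Here's that spacing trap in miniature: when an array interpolates into a double-quoted string, Perl joins its elements with $" (the list-separator variable, a single space by default), and since each element already ends in a newline, the extra space lands at the start of every following line.

```perl
use strict;
use warnings;

my @OUTPUT = ("first line\n", "second line\n");

# Naive interpolation joins the elements with $" (a space)...
my $spaced = "@OUTPUT";                # "first line\n second line\n"

# ...while join '' inside @{[ ... ]} glues them with nothing.
my $joined = "@{[join '', @OUTPUT]}";  # "first line\nsecond line\n"
```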

Lines 68 through 74 finish up the posting text, using the diamond read to grab my signature below the signature marker automatically.

All that's left to do is post the message! For debugging, I dump the contents to STDERR (line 77), which my cron job happily emails me each night. And then, I push the button in line 79, which posts my automatically generated CLPA message to ``thousands of machines'' within the space of mere minutes. Mission accomplished. All that's left to do is point a nightly cron task at this program, and put everything on autopilot.
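For reference, a crontab entry along these lines would do it (the path and the 4:17 a.m. time slot are made up for illustration); cron mails whatever lands on STDERR to the job's owner:

```
# min hour day month weekday  command
17 4 * * * /home/merlyn/bin/cpan2clpa
```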

Obviously, the program as-is has limited use. But consider taking a blogsearch.google.com RSS feed and posting the search results to your group's internal news server every few hours. By distributing the results as news postings, you can minimize the hit on Google's resources, as well as have a historical record of searches to see when things first appeared. I hope you have fun adapting these techniques. Until next time, enjoy!

LISTING

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     
        =4=     ### config
        =5=     
        =6=     my ($HOME) = glob "~";
        =7=     my $RSS_TEMP_DIR = "$HOME/lib/xml-rss-feed";
        =8=     my $HTTP_CACHE_TEMP_DIR = "$HOME/lib/httpcache";
        =9=     my $SIGNATURE = "$HOME/.signature";
        =10=    
        =11=    ## for news posting:
        =12=    my ($HOST,$USER,$PASS) = qw(nntp.example.com merlyn guesswhat);
        =13=    
        =14=    ### end config
        =15=    
        =16=    use Encode qw(encode);
        =17=    use XML::RSS::Feed ();
        =18=    use HTTP::Cache::Transparent ();
        =19=    use LWP::Simple qw(get);
        =20=    use News::NNTPClient ();
        =21=    
        =22=    mkdir $RSS_TEMP_DIR, 0755 unless -e $RSS_TEMP_DIR; # one time init
        =23=    HTTP::Cache::Transparent::init({BasePath => $HTTP_CACHE_TEMP_DIR});
        =24=    
        =25=    my $feed = XML::RSS::Feed->new
        =26=      (url => "http://search.cpan.org/uploads.rdf",
        =27=       name => "search.cpan.org",
        =28=       tmpdir => $RSS_TEMP_DIR,
        =29=      );
        =30=    
        =31=    my @OUTPUT;
        =32=    
        =33=    my $xml = get($feed->url);
        =34=    $feed->parse($xml);
        =35=    for my $headline ($feed->late_breaking_news) {
        =36=      push @OUTPUT, $headline->headline . "\n";
        =37=      push @OUTPUT, $headline->url . "\n";
        =38=      my $desc = encode('ascii' => $headline->description);
        =39=      push @OUTPUT, "$desc\n" if defined $desc;
        =40=      push @OUTPUT, "----\n";
        =41=    }
        =42=    
        =43=    exit 0 unless @OUTPUT;          # we have something to say
        =44=    
        =45=    pop @OUTPUT;                    # remove final --- line
        =46=    
        =47=    my $c = News::NNTPClient->new(split /:/, $HOST);
        =48=    if ($USER) {
        =49=      $c->authinfo($USER, $PASS);
        =50=    }
        =51=    $c->postok or die "Cannot post to $HOST: $!";
        =52=    
        =53=    @ARGV = $SIGNATURE;
        =54=    
        =55=    @OUTPUT = split /\n/, <<"END";
        =56=    Newsgroups: comp.lang.perl.announce
        =57=    Followup-to: poster
        =58=    From: merlyn\@stonehenge.com (Randal Schwartz)
        =59=    Subject: new CPAN modules on @{[unpack 'A10 x10 A*', gmtime]}
        =60=    
        =61=    The following modules have recently been added to or updated in the
        =62=    Comprehensive Perl Archive Network (CPAN).  You can install them using the
        =63=    instructions in the 'perlmodinstall' page included with your Perl
        =64=    distribution.
        =65=    
        =66=    @{[join '', @OUTPUT]}
        =67=    
        =68=    If you're an author of one of these modules, please submit a detailed
        =69=    announcement to comp.lang.perl.announce, and we'll pass it along.
        =70=    
        =71=    print "Just another Perl hacker," # the original
        =72=    
        =73=    --
        =74=    @{[join '', <>]}
        =75=    END
        =76=    
        =77=    warn map "$_\n", @OUTPUT;
        =78=    
        =79=    $c->post(@OUTPUT) or warn "failed post!";

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.