Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 06 (Nov 1999)

[suggested title: No news is not good news]

Usenet news has been around since 1979. I've been reading news nearly daily since 1980, except for a brief hiatus in 1984 where I missed the ``great renaming''. Because news is important (and familiar) to me, it's important for me to read news from a news server that has fairly decent article coverage.

I'm a charter subscriber to the largest ISP in town. Recently, there were some complaints on the ISP-only internal newsgroups that the newsfeed seemed a little less than normal. I wondered if it was a summertime slowdown or an actual problem, and since I like to help out the admins of this ISP when I can, I took it upon myself to hack out a Perl tool to verify whether the problem was real or merely perceived.

Because I wanted some quantitative data, I decided to ask Deja (formerly Dejanews) and AltaVista about all the articles they have seen in a given time frame. I figured that if my ISP also had all of those articles, there wasn't a problem. If only some of those articles had shown up, then it's time to figure out how to have the ISP solve the feed issues. And while I was at it, I could compare three ISPs to which I have news access all at the same time.

Now, doing this all from scratch would have been quite difficult. I'd have to parse the output of the Deja and AltaVista search engines, looking for links, then extracting each of their Message IDs carefully from the results. Thankfully, I worked smartly on this one, and noted that there's a nice CPAN module called WWW::Search that does exactly this. So, in less than 150 lines of code, I could do all the research I needed and still have some time left over to actually read the news that was there.

Also, this program leverages off of the very nice LWP package from Gisle Aas and friends, allowing me to trivially fetch data from a given URL, and break apart the returned URLs.

Now, even if you don't have a potentially flakey newsspool nearby, you can still use the techniques presented here to discover other interesting news-related events. So, let's take a look at the code in [listing one, below].

Lines 1 through 3 start nearly every program I write, enabling warnings, turning on the most common compiler restrictions, and disabling buffering on STDOUT.

Lines 5 through 8 pull in the modules that I'll be using, all found in the CPAN. Net::NNTP comes from Graham Barr's libnet. LWP::Simple and URI are both in the Bundle::LWP group. And WWW::Search is on its own. If you don't have these modules, use the CPAN installation tool to fetch and install them for you.

Lines 10 through 30 are the configuration area. I tend to lump things I might want to change between runs in a special area at the top of the program, and mark it as such. I also generally use uppercase variables for these constants.

Line 12 defines the verbosity level of this program. Here it's set to 1, meaning that we'll know as each article is being fetched from one of the net sources. While reassuring, that can be a bit noisy, so setting this to 0 means we'll see only the final report.

Line 14 defines the newsgroups that we'll be checking, as a list. I wanted a representative sampling, so I picked a few of the newsgroups I read frequently. A newsgroup has to be carried by Deja and AltaVista or it can't be checked, so using internal or very-local newsgroups probably won't work well.

Lines 16 and 17 define the window of articles to be considered. Because the primary source feed is Deja, which doesn't honor cancels, I pick dates that are old enough to also be in AltaVista, which seems to take two weeks to get new articles into the searching database. That way, if it's seen in Deja but nowhere else, I can presume it's a cancelled article instead of being worried that it never got to my server. The downside of this is that my newsserver might have expired the article by now, so I'll get a false ``missing''. It's too bad AltaVista doesn't have current databases like they had originally.

Lines 19 through 28 define the newsservers that I'm scanning. I defined the three local newsservers from the three ISPs to which I'm subscribed. For each ISP, I must define a host, giving the hostname and optional port number on which the NNTP server is located. (I've obscured the actual hostnames here so as not to make my ISPs mad.)

For two of these news servers, I access them via an SSH tunnel. The tunnel command will be executed prior to attempting to connect to the news server. This particular SSH tunnel command establishes (for 180 seconds) a local port (like 1190 or 1191) that is connected to a newsserver via a remote command-line host. So, for example, connecting to localhost at port 1191 will really be connecting to news.teleXXXX.com. The nice thing about an SSH tunnel is that it's compressed and I don't have to have a real password flying over the wire.

If you have an ISP that requires authinfo style authentication, you may also include user and pass as parameters here to give that. Be aware that those passwords are transmitted in the clear, so wire-snoopers will see them.
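As a rough sketch of what such an entry might look like, here is a cut-down configuration hash with one authenticated server added. The key 'xx', the hostnames, and the credentials are placeholders invented for illustration; only the field names (host, user, pass) come from the program itself.

```perl
use strict;
use warnings;

# Hypothetical configuration: hostnames and credentials are placeholders.
my %NNTP = (
  # plain connection on the default NNTP port
  'in' => { host => 'news.example.com' },
  # server wanting authinfo; these values cross the wire in the clear
  'xx' => { host => 'news.example.net:119',
            user => 'someuser',
            pass => 'somepass',
          },
);

# Show what the main loop would see for each server.
for my $short_host (sort keys %NNTP) {
  my %INFO = %{ $NNTP{$short_host} };
  my $auth = $INFO{user} ? "authinfo as $INFO{user}" : "no authentication";
  print "$short_host: $INFO{host} ($auth)\n";
}
```

Because the program only checks whether user is present, servers without credentials need no extra fields.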

Lines 32 through 34 set up some global variables. %id holds the information about each message-id. $FROM and $TO are the human-readable start and end dates for this particular report.

Lines 36 through 65 handle the initial Deja lookup. Line 39 holds a hash used to ensure that we look at a particular Deja articlenumber only once. Deja breaks up long articles into multiple hits and we need only the first hit to find the message-id.

Lines 41 through 49 set up a WWW::Search object, looking for the right articles in the designated groups in the indicated date range. We'll set the maximum to a number that is nice and high (10000), although I probably wouldn't have the patience to fetch more than a thousand or so hits from Deja. Nor would it be all that useful for information.

Lines 50 to 64 discover the matching hits, and look for message IDs in those hits. Each hit will come out in $result in line 51, with its URL extracted in line 52.

Because multiple hits can refer to the same article, I have to process the query-form parameters of the URL to determine which Deja article number is being fetched. If article 9876 is too long, Deja will return successive chunks in hits 9876.1, 9876.2, 9876.3 and so on, but we want only the whatever-dot-1 part. So, lines 57 to 59 determine the article number, and skip any later hits on the same article number.
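The skip-the-repeats idiom can be tried in isolation. This is a minimal sketch that feeds the same logic a made-up list of AN parameter values instead of real search hits; the sample values are assumptions, not actual Deja output.

```perl
use strict;
use warnings;

# Sample values of the AN query parameter (hypothetical test data).
my @an_params = qw(9876.1 9876.2 9876.3 1234 5555.1 1234.2);

my %seen;
my @first_hits;
for my $param (@an_params) {
  my ($an) = $param =~ /(\d+)/;   # leading digits are the article number
  next unless defined $an;
  next if $seen{$an}++;           # true on the second and later sightings
  push @first_hits, "$an.1";      # always ask for the first chunk
}

print "@first_hits\n";   # prints: 9876.1 1234.1 5555.1
```

The post-increment in `$seen{$an}++` returns the old count, so the first sighting passes through and every later one is skipped.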

Lines 60 to 62 fetch the text of the article (or just the first part of a long article), and extract the message ID, noting it into the %id two-level hash with a first-level key of the message-ID and a second level key of DJ (for DeJa).
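To see the extraction and the two-level hash working without fetching anything, here is a small sketch run against a made-up article. The article text and the helper name message_id_of are assumptions for illustration; the regular expression is the one from the listing.

```perl
use strict;
use warnings;

# Hypothetical raw article text; these headers are made up.
my $article = <<'END';
From: someone@example.com
Newsgroups: comp.lang.perl.misc
Message-ID: <abc123@news.example.com>
Subject: a test

Body text here.
END

# Return the Message-ID header value, or undef if there is none.
sub message_id_of {
  my ($text) = @_;
  return $text =~ /^Message-ID:\s+(.*\S)\s*$/m ? $1 : undef;
}

my %id;
my $mid = message_id_of($article);
if (defined $mid) {
  $id{$mid}{DJ}++;   # first level: message ID; second level: who saw it
  $id{$mid}{AV}++;   # a second source seeing the same article
}

for my $msg_id (keys %id) {
  print "$msg_id seen by: @{[ sort keys %{ $id{$msg_id} } ]}\n";
}
```

The /m modifier lets ^ and $ match at line boundaries inside the article, so the header is found wherever it sits in the text.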

Lines 67 to 88 do basically the same thing for AltaVista. The biggest change is that AV wants the newsgroups in the query string (not as a separate field), and the date format is wonderfully incompatibly different (month/day/year for Deja, day-month-year for AV). Additionally, the returned URL needs to be hacked in a slightly different way to get a good text file to search (see line 82). And we'll record message IDs found from AV in the %id hash again, this time with a sub-key of AV.
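The two AltaVista-specific steps are easy to check by hand. In this sketch the group list is shortened and the URL is a made-up placeholder (I'm not claiming it matches what AltaVista actually returned); the join and the substitution are the ones from the listing.

```perl
use strict;
use warnings;

my @GROUPS = qw(comp.lang.perl.misc comp.risks);

# Newsgroups go into the query string itself, joined with OR.
my $query = join(" OR ", map "newsgroups:$_", @GROUPS);
print "$query\n";

# Hypothetical result URL, rewritten to request the plain-text form.
my $url = 'http://www.example.com/cgi-bin/news?msg@1234';
(my $plain = $url) =~ s/news\?msg/news?plain\@msg/;
print "$plain\n";
```

Copying into $plain before substituting leaves the original URL untouched, which is handy when experimenting.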

Lines 90 to 94 extract the known message IDs from the %id hash. Initially, I just sorted the IDs to make the report consistent between runs, but I thought it might be nice to see the messages grouped by originating host to see if there was a pattern. So, we have a classic Schwartzian Transform here (named for me, but not by me) to sort the message IDs by their hostname first, then localpart second. The result is a list of messages that the big two archivers have seen, for which we now need to scan our local newsservers for confirmation.
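Here is the same transform run on a handful of made-up message IDs, as a sketch of what the sort order looks like. The IDs are invented; the map-sort-map pipeline is the one from the listing.

```perl
use strict;
use warnings;

# Hypothetical message IDs, for illustration only.
my @ids = qw(
  <zzz@alpha.example.com>
  <111@beta.example.org>
  <aaa@alpha.example.com>
  <no-at-sign>
);

my @sorted =
  map  { $_->[0] }                                   # 3: unwrap the original ID
  sort { $a->[2] cmp $b->[2] or $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
                                                     # 2: host, then localpart
  map  { /(.*)\@(.*)/ ? [$_, $1, $2] : [$_, "", ""] } # 1: [id, localpart, host]
  @ids;

print "$_\n" for @sorted;
```

IDs without an at-sign get empty sort keys and so land first; the rest come out grouped by host, with alpha.example.com ahead of beta.example.org.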

Lines 96 through 123 look at each of the newsservers being tested. $short_host gets a two-character unique identification in line 98. We'll extract the access information in line 99 from the %NNTP hash into %INFO. If it's tunneled, that tunnel program is launched in lines 100 to 103.

Line 105 attempts a connection to the NNTP host and port. If that doesn't work, lines 106 to 109 discover that, and move on to the next one. Lines 110 through 112 provide authinfo connection information if that's designated in the %NNTP hash above for this host. Note that this may fail, but that'll just make the stuff later exit early.

Lines 113 to 123 ask each particular NNTP server if it has seen each message ID. The return value from the nntpstat method will be true if the server has the article, so we'll note in line 116 that this was so. The code controlled by $NOISY notes our progress and results.
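The checking loop can be exercised offline with a stand-in for the connection. In this sketch, FakeNNTP is a hypothetical stub class (not part of libnet) that mimics only the nntpstat method, returning the message ID when it "has" the article and undef otherwise; the message IDs are made up.

```perl
use strict;
use warnings;

# Stand-in for a Net::NNTP connection so the sketch runs without a server.
# FakeNNTP is hypothetical; it mimics only the nntpstat method.
package FakeNNTP;
sub new {
  my ($class, @have) = @_;
  my %have = map { $_ => 1 } @have;
  return bless \%have, $class;
}
sub nntpstat {
  my ($self, $msg_id) = @_;
  return $self->{$msg_id} ? $msg_id : undef;
}

package main;

my %id = (
  '<a@example.com>' => { DJ => 1 },
  '<b@example.com>' => { DJ => 1, AV => 1 },
);

my $short_host = 'in';
my $c = FakeNNTP->new('<b@example.com>');   # this "server" has only one article
for my $msg_id (sort keys %id) {
  my $found = $c->nntpstat($msg_id);
  $id{$msg_id}{$short_host}++ if $found;
  print "$msg_id at $short_host: ", ($found ? "yes" : "no"), "\n";
}
```

Swapping the stub for a real Net::NNTP object should leave the loop unchanged, since it relies only on the truth of the return value.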

Lines 125 to 134 dump out the final report. First, we'll get the host list from the %NNTP hash in line 125. Line 127 dumps out a nice banner. Lines 129 to 132 dump out the results for each message ID for each host, including Deja and AltaVista, as a nice two-character code if found or spaces if absent, making a nice set of columns in front of each message ID.
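To get a feel for the column layout, here is a sketch that builds the report lines from a small hand-made %id hash. The message IDs and sightings are invented, and the lines are collected into an array (rather than printed piecemeal as in the listing) purely so they are easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical results: which sources saw which message ID.
my %id = (
  '<a@example.com>' => { DJ => 1, in => 1 },
  '<b@example.com>' => { AV => 1, te => 1 },
);
my @hosts = qw(in te);

my @lines;
for my $msg_id (sort keys %id) {
  # Two-character code if seen, two spaces if not, one space between columns.
  my $line = join " ", map { $id{$msg_id}{$_} ? $_ : "  " } "DJ", "AV", @hosts;
  push @lines, "$line $msg_id";
}
print "$_\n" for @lines;
```

Because every tag is exactly two characters wide, a blank in any column is immediately visible as a gap.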

Finally, lines 136 to 148 turn a Unix timestamp into an appropriate date in the incompatible Deja and AV formats.
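As an alternative sketch (not what the listing does), the same two formats can be produced with strftime from the core POSIX module. The optional second argument is my own addition so the result can be checked against a fixed timestamp instead of the current time.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Deja wants month/day/year.
sub days_ago_to_deja_date {
  my ($days, $now) = @_;
  $now = time unless defined $now;   # $now is an added, optional parameter
  return strftime "%m/%d/%Y", gmtime($now - 86400 * $days);
}

# AltaVista wants day-month-year.
sub days_ago_to_alta_date {
  my ($days, $now) = @_;
  $now = time unless defined $now;
  return strftime "%d-%m-%Y", gmtime($now - 86400 * $days);
}

my $now = 941414400;   # 1999-11-01 00:00:00 UTC
print days_ago_to_deja_date(21, $now), "\n";   # prints: 10/11/1999
print days_ago_to_alta_date(21, $now), "\n";   # prints: 11-10-1999
```

The sample output shows why the two formats are easy to confuse: the same day comes out as 10/11 for one service and 11-10 for the other.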

So, if you suspect propagation problems with your newsserver, now you can see just how much of the news you're really getting. Until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use Net::NNTP;
        =6=     use WWW::Search;
        =7=     use LWP::Simple;
        =8=     use URI;
        =9=     
        =10=    ## CONFIG ##
        =11=    
        =12=    my $NOISY = 1;
        =13=    
        =14=    my @GROUPS = qw(comp.lang.perl.misc rec.humor.funny pdx.general comp.risks);
        =15=    
        =16=    my $DAYS_AGO_FROM = 21;
        =17=    my $DAYS_AGO_TO = 19;
        =18=    
        =19=    my %NNTP =
        =20=      (
        =21=       'in' => {host => 'news.inetXXXXX.com'},
        =22=       'te' => {host => 'localhost:1191',
        =23=                tunnel => 'ssh -f -q -L 1191:news.teleXXXX.com:119 teleXXXX.com sleep 180',
        =24=               },
        =25=       'ag' => {host => 'localhost:1190',
        =26=                tunnel => 'ssh -f -q -L 1190:herXXX.rXXXX.com:119 agXXX.rXXXX.com sleep 180',
        =27=               },
        =28=      );
        =29=    
        =30=    ## END CONFIG ##
        =31=    
        =32=    my %id;
        =33=    my $FROM = days_ago_to_deja_date($DAYS_AGO_FROM);
        =34=    my $TO = days_ago_to_deja_date($DAYS_AGO_TO);
        =35=    
        =36=    ## deja phase
        =37=    
        =38=    {
        =39=      my %seen;
        =40=    
        =41=      my $search = WWW::Search->new('Dejanews');
        =42=      $search->native_query
        =43=        ("",
        =44=         {
        =45=          groups => join(',', @GROUPS),
        =46=          fromdate => $FROM,
        =47=          todate => $TO,
        =48=         });
        =49=      $search->maximum_to_retrieve(10000);
        =50=      print "Deja: " if $NOISY;
        =51=      while (my $result = $search->next_result) {
        =52=        my $url = $result->url;
        =53=        my $uri = URI->new($url);
        =54=        my %query = $uri->query_form;
        =55=        next unless exists $query{AN};
        =56=        print "." if $NOISY;
        =57=        my($an) = $query{AN} =~ /(\d+)/;
        =58=        next if $seen{$an}++;
        =59=        $uri->query_form(AN => "$an.1", fmt => 'raw');
        =60=        next unless $_ = get "$uri";
        =61=        next unless /^Message-ID:\s+(.*\S)\s*$/m;
        =62=        $id{$1}{DJ}++;
        =63=      }
        =64=      print "\n" if $NOISY;
        =65=    }
        =66=    
        =67=    ## alta phase
        =68=    
        =69=    {
        =70=      my $search = WWW::Search->new('AltaVista::AdvancedNews');
        =71=      $search->native_query
        =72=        (join(" OR ", map "newsgroups:$_", @GROUPS),
        =73=         {
        =74=          d0 => days_ago_to_alta_date($DAYS_AGO_FROM),
        =75=          d1 => days_ago_to_alta_date($DAYS_AGO_TO),
        =76=         });
        =77=      $search->maximum_to_retrieve(10000);
        =78=      print "Alta: " if $NOISY;
        =79=      while (my $result = $search->next_result) {
        =80=        my $url = $result->url;
        =81=        print "." if $NOISY;
        =82=        $url =~ s/news\?msg/news?plain\@msg/;
        =83=        next unless $_ = get $url;
        =84=        next unless /^Message-ID:\s+(.*\S)\s*$/m;
        =85=        $id{$1}{AV}++;
        =86=      }
        =87=      print "\n" if $NOISY;
        =88=    }
        =89=    
        =90=    my @msg_id =
        =91=      map { $_->[0] }
        =92=      sort { $a->[2] cmp $b->[2] or $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
        =93=      map { /(.*)\@(.*)/ ? [$_, $1, $2] : [$_, "", ""] }
        =94=      keys %id;
        =95=    
        =96=    ## nntp phase
        =97=    
        =98=    for my $short_host (sort keys %NNTP) {
        =99=      my %INFO = %{$NNTP{$short_host}};
        =100=     if (my $tun = $INFO{tunnel}) {
        =101=       print "launching $tun\n" if $NOISY;
        =102=       system $tun;
        =103=     }
        =104=   
        =105=     my $c = Net::NNTP->new($INFO{host});
        =106=     unless (defined $c) {
        =107=       warn "cannot connect to $short_host, skipping\n";
        =108=       next;
        =109=     }
        =110=     if ($INFO{user}) {
        =111=       $c->authinfo($INFO{user},$INFO{pass});
        =112=     }
        =113=     for my $msg_id (@msg_id) {
        =114=       print "$msg_id at $short_host: " if $NOISY;
        =115=       if ($c->nntpstat($msg_id)) {
        =116=         $id{$msg_id}{$short_host}++;
        =117=         print "yes" if $NOISY;
        =118=       } else {
        =119=         print "no" if $NOISY;
        =120=       }
        =121=       print "\n" if $NOISY;
        =122=     }
        =123=   }
        =124=   
        =125=   my @hosts = sort keys %NNTP;
        =126=   
        =127=   print "report from $FROM to $TO for @GROUPS\n";
        =128=   for my $msg_id (@msg_id) {
        =129=     for my $host ("DJ","AV",@hosts) {
        =130=       print $id{$msg_id}{$host} ? $host : "  ";
        =131=       print " ";
        =132=     }
        =133=     print "$msg_id\n";
        =134=   }
        =135=   
        =136=   ## subroutines
        =137=   
        =138=   sub days_ago_to_deja_date {
        =139=     my $days = shift;
        =140=     my @gm = gmtime(time - 86400 * $days);
        =141=     return sprintf "%02d/%02d/%04d", 1 + $gm[4], $gm[3], 1900 + $gm[5];
        =142=   }
        =143=   
        =144=   sub days_ago_to_alta_date {
        =145=     my $days = shift;
        =146=     my @gm = gmtime(time - 86400 * $days);
        =147=     return sprintf "%02d-%02d-%04d", $gm[3], 1 + $gm[4], 1900 + $gm[5];
        =148=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.