Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 13 (Jun 2000)

[Suggested title: Moving your news service]

About half a year ago in this column, I talked about how my ISP was looking at the performance of their news server, and I wrote a program to see just how bad the news service was compared to the other local ISPs, using Deja as a baseline. Well, the ISP just got bought out by a big national chain, and they decided not to fight the spotty news service any more, and instead to move everyone over to the conglomerate's big service.

But the problem with moving from one news server to another is that the article numbers are not in sync, so a ``.newsrc'' file will have the right newsgroups, but the wrong ``already read'' marks. And since I read a lot of newsgroups, I don't have time to re-read existing articles, and I don't want to just throw away any new articles because I might miss something.

The solution is a bit complicated and has extensive bookkeeping requirements, but that's what computers are for, and Perl in particular. What you need to do is mark read any article you've already seen. Messages are uniquely identified by a ``Message ID'', and you can get that mapped into article numbers via the appropriate ``XHDR'' request to the NNTP server.

So, basically, for every subscribed newsgroup, we fetch the message IDs of the last 500 articles from the new server. (500 being the maximum number of unread articles per group I'd care to face in any event.) Then, we fetch the last 1500 or so message IDs from the old server. Then, for every message ID I know about on the new server, I see if I've already read it on the old server, and if so, mark it read in the new newsrc.

The newsrc file is the classic rn format. Most modern newsreaders can import and export this format, so it's a nice least-common-denominator of exchange. And there's a good module or two in the CPAN to deal with this format as well.
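To make the format concrete, here is a minimal sketch of reading one newsrc line by hand. It assumes the usual layout of a group name, then ``:'' (subscribed) or ``!'' (unsubscribed), then a comma-separated list of read article numbers and ranges. The names parse_newsrc_line and is_marked are made up for this illustration; in the real program a CPAN module does this work.

```perl
use strict;
use warnings;

# Parse one newsrc line into (group, subscribed flag, list of [low, high]).
# Hypothetical helper for illustration only.
sub parse_newsrc_line {
    my ($line) = @_;
    my ($group, $flag, $list) = $line =~ /^(\S+)([:!])\s*(.*)$/
        or return;
    my @ranges = map {
        my ($low, $high) = split /-/, $_;
        [ $low, defined $high ? $high : $low ];   # "1002" means 1002-1002
    } grep { length } split /\s*,\s*/, $list;
    return ($group, $flag eq ':', \@ranges);
}

# True if $article falls inside any of the already-read ranges.
sub is_marked {
    my ($ranges, $article) = @_;
    for my $r (@$ranges) {
        return 1 if $article >= $r->[0] and $article <= $r->[1];
    }
    return 0;
}

my ($group, $subscribed, $ranges) =
    parse_newsrc_line("comp.lang.perl.misc: 1-1000,1002,1005-1010");
print "$group subscribed=", ($subscribed ? 1 : 0), "\n";
print "1001 read? ", is_marked($ranges, 1001), "\n";
print "1002 read? ", is_marked($ranges, 1002), "\n";
```

Anything not covered by a range counts as unread, which is why moving to a server with different article numbers makes the old marks useless.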

There was one additional requirement, just to make this even more interesting. I do my newsreading and general information processing at yet another ISP, separate from where either the old or new news servers are located. The new server at the takeover ISP permits access only to its own customers, so I can't reach it directly from my computation server ISP. Instead, I use ssh tunneling through the shell account machine at the old ISP to get to both the old and new news servers. It's almost as bad as trying to figure out those spy novels with all the odd names, but most of the time this is transparent to me. The program is also set up so that it can run locally on the old ISP's shell machine, with no tunnel at all.

It's a mess, but it works. After my ISP converted, I had a fairly nice looking new newsrc with all my previously read articles punched out already. And the program to do this all is in [the listing below], which goes as follows.

Line 2 turns on strict mode - needed for every program that is longer than 10 lines or used more than 10 minutes. In this case, the first applies but not the second, since I hope I'm not changing servers frequently.

Line 3 unbuffers standard output. There's not a lot of output from this program, and I want to see it as it comes along.

Lines 5, 6, and 7 pull in the modules we'll need. Net::NNTP comes from the CPAN, and lets us talk to NNTP servers. News::Newsrc also comes from the CPAN, and provides parsing and updating of ``newsrc''-format files. IO::File is a core module installed with Perl, and lets us have generic filehandles as objects.

Lines 9 through 23 provide the most-likely-to-be-tweaked settable variables. As always, I'm providing my programs not as ``ready-to-run'' robust programs, but as snippets for your own inspiration (steal the ideas, not the code). However, since I'll probably brush the dust off this program in another year or so when this ISP merges with another one, I'll make it easy to remember my thinking by providing a distinct configuration area.

Here, $DST_MAX is the maximum number of unread articles we're willing to tolerate on the new server. You could probably crank this up to 20000 or so if you wanted to be sure to read everything the new server has to offer, but if you have a lot of groups, the bigger numbers will mean slower operations. (I had about 120 subscribed newsgroups, and it took about 10 minutes to process at my value of 500 here, if that means anything.) $SRC_MAX is how many articles to map in the old news server. Because articles arrive in a scrambled order, this should be a number bigger than $DST_MAX to ensure that we don't miss an article number mapping on the old server that we'll need.

$OLD and $NEW are the old and new news server hosts, respectively. I presumed that I'd always be using port 119 (the NNTP port) on the addresses, although I see that it wouldn't be hard to parameterize that. No sense in making everything too flexible for such an infrequently used program!

$VIA is used when I need to ssh-tunnel the connections. It's the hostname of the shell machine at the old ISP. (Please note that these are not the real hostnames... the .comm suffix should be enough of a clue not to try them.) If $VIA is false (such as 0, undef, or the empty string), tunneling won't be used, so this is an optional step. However, if it's used, we need to select two hopefully unused port numbers for the local tunnel ports, and those are given in $VIA_OLD_PORT and $VIA_NEW_PORT.

Finally, $VERBOSE says how noisy to be. If we turn on all the noise, we get a pretty good complete description of where we are in the process, and what we've accomplished. However, $VERBOSE of 0 is just fine if you don't like peering under the hood.

Lines 25 to 32 set up the tunnel if needed. For this to work, I have to have ssh trained to accept connections from my workhorse ISP to my newsreading ISP, which I needed to do for my newsreader anyway. The crucial parts are the selection of the tunnels (the -L parameters), the command to run (sleep 60), and the additional sleep for 5 seconds after firing off the ssh to let everything warm up. The sleep 60 is executed on the remote host, and needs to be longer than it takes for my program to connect to the local tunnel ports. Once the connections are established, the remote command can terminate without any problem.
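As a rough illustration of how that command line is put together, the sketch below builds the same ssh arguments as a list and only prints them. The helper name tunnel_command is invented for this example. The sketch leaves out the trailing ``&'' and ``sleep 5'' of the listing, and it assumes port 119 on the far end, just as the listing does.

```perl
use strict;
use warnings;

# Build the ssh port-forwarding command as a list; nothing is executed.
# tunnel_command is a hypothetical helper, not part of the listing.
sub tunnel_command {
    my ($via, %forward) = @_;      # %forward maps local port => news host
    my @cmd = ("ssh", "-f", "-q");
    for my $port (sort { $a <=> $b } keys %forward) {
        push @cmd, "-L", "$port:$forward{$port}:119";
    }
    push @cmd, $via, "sleep", "60";   # remote command keeps the tunnel open
    return @cmd;
}

my @cmd = tunnel_command(
    "shell.old-isp.comm",
    42001 => "news.old-isp.comm",
    42002 => "news.big-mega-isp.comm",
);
print join(" ", @cmd), "\n";
```

Because -f asks ssh to put itself in the background just before running the remote command, a list like this could probably be handed to system directly, but a short local pause before connecting is still a sensible precaution.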

$SRC_NNTP and $DST_NNTP, defined in lines 34 and 35, set up the hostname and portnumber (if needed) for the old and new news servers. Lines 37 and 38 attempt the connection to those servers, die-ing if things are bad.

Lines 40 and 41 create News::Newsrc objects to hold the newsrc for the old server and the newsrc for the new server. Line 42 sets aside a place for lines from the old newsrc that aren't really about subscribed or unsubscribed newsgroups - apparently, News::Newsrc blows up on these.

Lines 44 to 50 grab the old newsrc information into the newsrc object. As you can tell, this is pretty inflexible, grabbing the file directly from my home directory. Maybe this should have been a parameter, but I don't care, because the job got done. @extra_lines gets all the stuff that's not about a newsgroup, while the remaining lines are sucked into the newsrc object.
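The two-way split in lines 48 and 49 can be tried in isolation. This sketch uses the same regular expression as the listing on a few sample lines. The helper name split_newsrc_lines and the ``options'' line are made up for the example.

```perl
use strict;
use warnings;

# Split newsrc lines into group lines and everything else.
# split_newsrc_lines is a hypothetical helper name for this sketch.
sub split_newsrc_lines {
    my @all = @_;
    my $is_group = qr/^\S+[:!]\s/;     # same test as the listing
    my @groups = grep { $_ =~ $is_group } @all;
    my @extras = grep { $_ !~ $is_group } @all;
    return (\@groups, \@extras);
}

my ($groups, $extras) = split_newsrc_lines(
    "comp.lang.perl.misc: 1-1000\n",
    "options -x\n",                    # made-up non-group line
    "alt.test! 1-5\n",
);
print "group lines: ", scalar(@$groups), "\n";
print "extra lines: ", scalar(@$extras), "\n";
```

Note that the pattern wants whitespace after the ``:'' or ``!'', so a group line with nothing after the marker still matches as long as it ends in a newline.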

Lines 52 to 90 do the bulk of the job. For every newsgroup mentioned in the old newsrc, we loop once with $group set to that group. A large eval block protects us from premature death on any particular newsgroup, giving us instead a group that won't be transferred to the new newsrc.
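The eval-per-group idea is easy to see in a stripped-down form. In this sketch, transfer_groups and the failing worker are stand-ins invented for the example. The point is only that a die inside one iteration is caught and reported, and the loop carries on with the next group.

```perl
use strict;
use warnings;

# Run $worker on each group; a die in one group does not stop the rest.
# transfer_groups is a hypothetical helper for this sketch.
sub transfer_groups {
    my ($worker, @groups) = @_;
    my (@done, @failed);
    for my $group (@groups) {
        if (eval { $worker->($group); 1 }) {
            push @done, $group;
        } else {
            warn $@;                   # report, then keep going
            push @failed, $group;
        }
    }
    return (\@done, \@failed);
}

# Made-up worker that fails for one group.
my $worker = sub {
    my ($group) = @_;
    die "Cannot get info for $group\n" if $group eq "alt.broken";
};

my ($done, $failed) =
    transfer_groups($worker, qw(comp.lang.perl.misc alt.broken alt.test));
print "done: @$done\n";
print "failed: @$failed\n";
```

Ending the die message with a newline keeps Perl from appending the file and line number, which makes for cleaner warnings.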

Line 54 determines if it's a subscribed newsgroup, and if so, sends us through the bulk of lines 55 to 81 (described in a moment). If not, we skip down to lines 83 to 87 and mark the group as unsubscribed in line 84. Line 85 grabs the lowest article number still active on the news server, and line 87 ensures that we don't try reading any article number below that. (Most newsreaders do the equivalent already, but I'm trying to make an accurate newsrc here.)

Now, back to the harder part. Line 58 gets the info from the old server about article number range present in the group. Line 60 computes a range not to exceed $SRC_MAX items for which we must get a ``message-id-to-article-number'' map constructed. Line 61 creates a hash from the hashref returned by calling the NNTP XHDR operation for all the message IDs in the given article number range. Sure, you can get this info one article at a time, but the XHDR command is very fast since it reads directly from the .overview file that most news servers now maintain. The result is that we have a hash called %src_msgid_to_art that we can feed a message ID and get back the article number. Since we can then see if this article number has already been read, we'll be able to tell if we should mark it as having been read in the new newsrc. Lines 62 through 66 do the same thing for the other direction, figuring out what message IDs correspond to which article numbers in the new news server.
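Here is a small sketch of the two tricks in that paragraph, using made-up data in place of a live server. Net::NNTP's xhdr returns a hashref keyed by article number, so a plain hashref stands in for it here. The helper names clamp_low and invert, and the example.invalid message IDs, are assumptions for illustration.

```perl
use strict;
use warnings;

# Raise $low so that it is never more than $max below $high.
sub clamp_low {
    my ($low, $high, $max) = @_;
    return $low < $high - $max ? $high - $max : $low;
}

# Turn article => Message-ID into Message-ID => article.
sub invert {
    my ($art_to_msgid) = @_;
    my %msgid_to_art = reverse %$art_to_msgid;   # keys and values swap
    return \%msgid_to_art;
}

# Stand-in for an xhdr result: article number => Message-ID.
my $xhdr = {
    101 => '<a@example.invalid>',
    102 => '<b@example.invalid>',
};

my $low = clamp_low(1, 2000, 1500);
print "fetch range: $low-2000\n";

my $msgid_to_art = invert($xhdr);
print "article for <b\@example.invalid>: ",
    $msgid_to_art->{'<b@example.invalid>'}, "\n";
```

The reverse trick relies on the values being unique, which message IDs should be. If a server ever returned the same ID for two article numbers, only one of them would survive the inversion.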

And then it's time for the heavy bookkeeping. Lines 68 to 80 check each article in the new server for its message ID (line 70). If that same article (line 73) has been read on the old server (line 76), we mark it as read on the new server (line 78). Not rocket science, but a lot of details to get right. At this point, we're not talking to either of the servers - all of the information is in hashes in memory.
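Stripped of the server and newsrc objects, the bookkeeping comes down to two hash lookups per article. The sketch below works on plain hashes with invented data. The helper name already_read and the %src_read hash (standing in for the old newsrc's read marks) are assumptions for this example.

```perl
use strict;
use warnings;

# Return the new-server article numbers already read on the old server.
# already_read is a hypothetical helper for this sketch.
sub already_read {
    my ($dst_art_to_msgid, $src_msgid_to_art, $src_read) = @_;
    my @mark;
    for my $dst_art (sort { $a <=> $b } keys %$dst_art_to_msgid) {
        my $msgid   = $dst_art_to_msgid->{$dst_art};
        my $src_art = $src_msgid_to_art->{$msgid};
        next unless defined $src_art;         # old server never had it
        next unless $src_read->{$src_art};    # had it, but it was unread
        push @mark, $dst_art;
    }
    return @mark;
}

my %dst_art_to_msgid = (
    7001 => '<a@example.invalid>',
    7002 => '<b@example.invalid>',
    7003 => '<c@example.invalid>',
);
my %src_msgid_to_art = (
    '<a@example.invalid>' => 101,
    '<b@example.invalid>' => 102,
);
my %src_read = (101 => 1);    # only article 101 was read on the old server

print join(",", already_read(\%dst_art_to_msgid, \%src_msgid_to_art, \%src_read)), "\n";
# prints 7001
```

Article 7002 stays unread because its old counterpart was never read, and 7003 stays unread because the old server has no record of it at all.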

Line 81 then marks as read anything below the articles we've considered. This means we can never have more than $DST_MAX articles unread.

And now that we're all done, line 93 dumps the result! I could have made it save the new newsrc directly, but I'm running this program inside a window that I can cut-n-paste, so it didn't matter.

So there you have it. I wish you the luxury of never having to move from one news server to another, but at least if you have this program and a short period of overlap, it'll ease the pain a bit when you must move.

This is my last column for Linux Magazine that will be strictly about Perl. Next month, I'll begin writing about general webmaster topics. Of course, most of those topics will probably involve some Perl solution as some code snippet, but I'll be able to look at some non-Perl things as well. I hope you enjoy the new format as much as you've said you've liked this column in the past. Until next time, go forth and be Perl-y!

Listings

        =1=     #!/usr/bin/perl
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use Net::NNTP;
        =6=     use News::Newsrc;
        =7=     use IO::File;
        =8=     
        =9=     ## config
        =10=    
        =11=    my $DST_MAX = 500;
        =12=    my $SRC_MAX = $DST_MAX * 3;
        =13=    
        =14=    my $OLD = "news.old-isp.comm";
        =15=    my $NEW = "news.big-mega-isp.comm";
        =16=    
        =17=    my $VIA = "shell.old-isp.comm";
        =18=    my $VIA_OLD_PORT = 42001;
        =19=    my $VIA_NEW_PORT = 42002;
        =20=    
        =21=    my $VERBOSE = 2;                # 0 quiet, 1 expected errors, 2 noisy
        =22=    
        =23=    ## end config
        =24=    
        =25=    system join " ",
        =26=      "ssh -f -q",
        =27=      "-L $VIA_OLD_PORT:$OLD:119",
        =28=      "-L $VIA_NEW_PORT:$NEW:119",
        =29=      "$VIA",
        =30=      "exec sleep 60",
        =31=      "&",
        =32=      "sleep 5" if $VIA;
        =33=    
        =34=    my $SRC_NNTP = $VIA ? "localhost:$VIA_OLD_PORT" : $OLD;
        =35=    my $DST_NNTP = $VIA ? "localhost:$VIA_NEW_PORT" : $NEW;
        =36=    
        =37=    my $src = Net::NNTP->new($SRC_NNTP) or die "src: $!";
        =38=    my $dst = Net::NNTP->new($DST_NNTP) or die "dst: $!";
        =39=    
        =40=    my $src_rc = News::Newsrc->new or die "Cannot new newsrc for src";
        =41=    my $dst_rc = News::Newsrc->new or die "Cannot new newsrc for dst";
        =42=    my @extra_lines = ();
        =43=    
        =44=    {
        =45=      my $newsrc = IO::File->new("$ENV{HOME}/.newsrc", "r")
        =46=        or die "Cannot open .newsrc: $!";
        =47=      my @all = <$newsrc>;
        =48=      @extra_lines = grep !/^\S+[:!]\s/, @all;
        =49=      $src_rc->_scan(join "", grep /^\S+[:!]\s/, @all); # dies if fail
        =50=    }
        =51=    
        =52=    for my $group ($src_rc->groups) {
        =53=      eval {
        =54=        if ($src_rc->subscribed($group)) {
        =55=          print "subscribed to $group\n" if $VERBOSE > 1;
        =56=          $dst_rc->subscribe($group);
        =57=    
        =58=          (undef, my $src_low, my $src_high) = $src->group($group)
        =59=            or die "Cannot get info for src $group\n";
        =60=          $src_low = $src_high - $SRC_MAX if $src_low < $src_high - $SRC_MAX;
        =61=          my %src_msgid_to_art = reverse %{$src->xhdr("Message-Id", "$src_low-$src_high")};
        =62=          (undef, my $dst_low, my $dst_high) = $dst->group($group)
        =63=            or die "Cannot get info for dst $group\n";
        =64=          
        =65=          $dst_low = $dst_high - $DST_MAX if $dst_low < $dst_high - $DST_MAX;
        =66=          my %dst_art_to_msgid = %{$dst->xhdr("Message-Id", "$dst_low-$dst_high")};
        =67=    
        =68=          for my $dst_art ($dst_low..$dst_high) {
        =69=            eval {
        =70=              my $msgid = $dst_art_to_msgid{$dst_art} or
        =71=                die "no msgid for $dst_art in $group at dst\n";
        =72=                ## next;
        =73=              my $src_art = $src_msgid_to_art{$msgid} or
        =74=                die "no art for $msgid in $group at src\n";
        =75=                ## next;
        =76=              next unless $src_rc->marked($group,$src_art);
        =77=              print "mapping $msgid from $src_art to $dst_art\n" if $VERBOSE > 1;
        =78=              $dst_rc->mark($group, $dst_art);
        =79=            }; warn $@ if $@ and $VERBOSE;
        =80=          }
        =81=          $dst_rc->mark_range($group, 1, $dst_low - 1);
        =82=        } else {
        =83=          print "unsubscribed to $group\n" if $VERBOSE > 1;
        =84=          $dst_rc->unsubscribe($group);
        =85=          (undef, my $dst_low, my $dst_high) = $dst->group($group)
        =86=            or die "Cannot get info for dst $group\n";
        =87=          $dst_rc->mark_range($group, 1, $dst_low - 1) if $dst_low;
        =88=        }
        =89=      }; warn $@ if $@ and $VERBOSE;
        =90=    }
        =91=    
        =92=    print "==== RESULT ====\n";
        =93=    print @extra_lines, $dst_rc->_dump;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.