Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Web Techniques Column 62 (Jun 2001)

[suggested title: Hooking up with the news]

In a recent newsgroup thread I was following (I think it was the comp.infosystems.www.authoring.cgi newsgroup), the question came up of a CGI-to-NNTP gateway. This would give the poster's in-house staff limited access to their NNTP server, without having to learn a news client.

I initially balked at the idea: every CGI hit is a separate process, requiring it to reestablish itself with the NNTP server in a new connection. And from what I've seen about NNTP servers, they tend to be mighty slow on that initial handshake. But I thought about it for a while, and realized that if the first hit could start a web ``miniserver'', then that single miniserver could keep the connection alive at very little overhead.

I've written about the very slick HTTP::Daemon module in this column a few times before (as recently as December 2000), but it keeps coming up as a great way to solve little niggling problems related to the stateless nature of HTTP. So here we go again, this time with maintaining an NNTP connection. This very same technique could also be used to keep a connection alive to a database, or perhaps even a shell session. (Hmm, there's an idea for another program.)

To demonstrate how this works, I've made a rec.humor.funny browser, which shows the joke-of-the-day posted by the RHF people to the moderated newsgroup. Now, since anyone can invoke this CGI program from anywhere in the world, but only people at my ISP can use my news server, I must ensure that I am complying with any ``reuse'' agreements for the news server. Caveat Executor.

So let's take a look at the program, presented here [in listing one, below].

Line 1 starts by turning on taint mode (always a good idea for CGI scripts) and warnings. Sadly, while testing, I found a part of Net::Cmd (a low-level module used by Net::NNTP) triggering warnings while still getting the proper job done. So running this program spat a lot of warnings into my web server's log. Beware.

Line 2 disables the buffering on standard output, also generally a good thing on CGI programs.

Line 3 enables the compiler restrictions: no soft (symbolic) references, variable declarations mandatory, and no barewords. Good thing for programs over 10 lines.

Lines 4 through 8 pull in the modules we'll be using. CGI comes standard with Perl. HTTP::Daemon is part of LWP. Net::NNTP is part of ``libnet''. URI::Find is a separate package, and Mail::Internet is part of MailTools. What would you have done had I not told you this? Well, the CPAN shell interface (see perldoc CPAN) permits you to merely say install Net::NNTP, and it figures out which package to fetch and install. Nice.

Lines 11 through 13 define the easy-to-change configuration constants. The port number in line 11 must be between 1024 and 65535, and not conflict with anything else already running. We can use ``0'' here to pick a different port for each invocation and guarantee non-collision, but then the program would fork separately for each invocation. I did that while I was testing. But in that case, you'll want to reduce the timeout value in the next line. At 600 seconds of no activity from anyone, a single daemon will get lonely and go away. If each daemon is talking only to a single user, you'll probably want this to be more like 180 or so: whatever time you think it'll take for a user to click between articles.

And line 13 is the NNTP server to which I'll be connected. An optional port number can be designated by adding a colon and then the port number following the hostname. My ISP's news server does not require additional authentication (they restrict by source IP address), so if you need that, you'll have to add it further down in the code, where the NNTP connection is opened.

Line 16 untaints the server name from the environment so that I can use it in a system call.
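
The pattern in line 16 accepts any string at all, which is reasonable when the web server is trusted to set SERVER_NAME sensibly. As a hedged alternative, here is a minimal sketch that untaints only values that look like a hostname or dotted address. The helper name untaint_host and the exact character class are assumptions made for illustration; they are not part of the listing.

```perl
use strict;
use warnings;

# Hypothetical helper: return an untainted copy of $raw if it looks like
# a hostname (letters, digits, dots, hyphens), or undef otherwise.
sub untaint_host {
  my $raw = shift;
  return undef unless defined $raw;
  my ($clean) = $raw =~ /\A([A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?)\z/;
  return $clean;                  # undef when the pattern did not match
}

my $HOST = untaint_host($ENV{SERVER_NAME});
# a real script would stop with an error page here if $HOST is undef
```

The capture is what does the untainting: Perl treats text extracted through a regex capture as clean, so the stricter the pattern, the more the untainting actually means.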

Lines 18 to 21 attach this process to the server port. If the port number is 0, then this will always succeed, picking a distinct port number. If the port number is non-zero, then this will fail if some other process already has that port number, which we're using as a clue that another of us is already running there. So we won't become the miniserver in that case: just redirect to it.
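
The "a failed bind means somebody else is already the server" idea can be tried out without HTTP::Daemon at all. The following is a minimal sketch using the core IO::Socket::INET module; the helper name try_listen and the use of 127.0.0.1 are assumptions for the demonstration, not what the listing does.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper: try to listen on $host:$port.  Returns the
# listening socket on success, or undef if the port could not be claimed.
sub try_listen {
  my ($host, $port) = @_;
  return IO::Socket::INET->new(
    LocalAddr => $host,
    LocalPort => $port,           # 0 asks the kernel for any free port
    Proto     => 'tcp',
    Listen    => 5,
    ReuseAddr => 1,
  );
}

# The first caller gets the port; a second caller asking for the same
# port number is refused while the first is still listening.
my $first  = try_listen('127.0.0.1', 0) or die "cannot listen: $@";
my $second = try_listen('127.0.0.1', $first->sockport);
print $second ? "became a second server\n" : "port busy: redirect instead\n";
```

Note that a failed bind does not prove the other process is a miniserver; it only shows that something holds the port, which is why picking an otherwise unused port number matters.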

Line 22 figures out the name of the miniserver (whether it's us or them) for the redirection.

Lines 24 and 25 grab some information from this CGI invocation to be used by the miniserver. The URL to restart this CGI is saved in $SELF_URL: this is used when an error occurs to redirect the user back to the startup script. And $ICONS is the location on my server of the ``icon'' directory, a standard directory provided with Apache for the ``normal'' icons used in directory listings. I'm using the ``left'', ``up'', and ``right'' icons from there for a simple navigation tool.

Line 27 sends the browser connected to this script over to the miniserver. Either that's us if the HTTP::Daemon call succeeded, or hopefully an already running miniserver. Line 29 ends our run if it's somebody else.

If it's us, we have to cleanly detach from the CGI connection, so line 31 forks, line 32 causes the parent process to exit (letting Apache know that we're done sending stuff to the browser) and line 33 closes STDOUT (so that Apache really knows nothing else remains to be said to the browser). This lets the redirect in line 27 take effect immediately, while we continue to run.

Lines 35 and 36 hook up to the NNTP server. If you needed NNTP authentication, it'd go here. Now this may fail, but the only way we can tell the user is when the browser finally connects up to the miniserver. That's coming up.

Lines 39 through 112 form the main loop, which will be executed as long as activity comes in often enough. Line 40 handles that ``often enough'' timing. Each time we get to the top of the loop, we'll set an alarm at $TIMEOUT seconds away. If we don't make it back to the top of the loop in time (like when we're sitting idle waiting for a connection), the SIGALRM signal comes along, looking for a handler. As there's no handler, the default is to kill the process. Boom, we're dead.
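
The listing relies on the default SIGALRM action to kill the idle process. If some cleanup or logging were wanted before exiting, one variation is to catch the alarm instead. This is a minimal sketch of that variation, not the approach used in the listing; the helper name with_deadman and the "timed out" return value are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: run $work, but give up after $seconds.
# Returns the work's return value, or the string "timed out".
sub with_deadman {
  my ($seconds, $work) = @_;
  my $result = eval {
    local $SIG{ALRM} = sub { die "alarm\n" };
    alarm($seconds);              # (re-)set the deadman timer
    my $value = $work->();
    alarm(0);                     # made it back in time: cancel the timer
    $value;
  };
  alarm(0);                       # make sure no timer is left pending
  return "timed out" if !defined $result && $@ eq "alarm\n";
  die $@ if !defined $result && $@;   # some other failure: pass it along
  return $result;
}

# Sit idle for up to 5 seconds under a 1-second timer: the timer wins.
print with_deadman(1, sub { select(undef, undef, undef, 5); "finished" }), "\n";
```

In the miniserver, the work being timed would be the wait for the next connection, and "timed out" would be the cue to log out of the NNTP server and exit.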

Line 41 is where the process spends its idle time. The call to accept waits for a new connection, blocking nicely until that happens. The connection comes back in the $c object, which acts like a filehandle connected to the browser, except that we can also call methods on it to perform additional operations.

Line 42 selects the filehandle, so that it becomes the default for the print operations used later, saving me from typing a lot of print $c ....

Lines 43 through 47 get the HTTP request from the connection. If this fails (malformed, or aborted), then we loop back up to the top.

Line 49 pulls out the url requested from this miniserver, removing the leading slash in the process. For this miniserver, the URL never maps to a file, but rather is used to select different operations for the server. Line 50 sends back the initial HTTP response (always a 200 here), and line 51 takes the content of any form data and bundles it up as a CGI object. I didn't use any CGI forms in this example, but I threw this in here in case I wanted to call param on some form after I maintained it later.

Line 53 finishes off the HTTP header, declares the content to be HTML, and starts the HTML payload of the message.

Lines 55 to 60 provide the ``oops, no NNTP connection'' error message. The link will send the user back to the top of this program in a separate invocation, in hopes that simply retrying it will be enough.

If we make it this far in the program, we're expecting four different types of URLs, organized hierarchically as:

  /
  /GROUP.NAME/
  /GROUP.NAME/NN
  /GROUP.NAME/NN-DIRECTION

If a group is not specified, a list of groups is presented for further selection. If only a group name is specified, we start at the last article of that group. If an article number is present, we'll start there (or as close to there as we can get). If an article number is followed by a direction, we move in that direction by one article. Keep it simple, I say.
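
To see how the four URL shapes come apart, here is a minimal sketch that applies the same pattern used in line 64 of the listing to a few sample paths. The helper name parse_code and the sample article number are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a request path (leading slash already removed)
# into (group, article, direction).  In list context it returns an empty
# list when no valid group name is present, which means "show the menu".
sub parse_code {
  my $code = shift;
  return $code =~ /\A([a-z0-9.]+)\/(?:(\d+)(-\w+)?)?\z/;
}

for my $code ("", "rec.humor.funny/", "rec.humor.funny/1234",
              "rec.humor.funny/1234-next") {
  my ($group, $article, $direction) = parse_code($code);
  printf "%-28s group=%s article=%s direction=%s\n", "/$code",
    map { defined($_) ? $_ : "(none)" } $group, $article, $direction;
}
```

Because the pattern is anchored at both ends and the group name is limited to lowercase letters, digits, and dots, anything unexpected simply falls through to the group menu.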

Lines 63 through 65 pull out the information verifying that a valid group name is present. If not, the list of groups matching a news glob pattern is presented as a series of links, in lines 66 to 69. In this case, we'll show the groups on the server beginning with rec.humor., by asking the NNTP connection for the list. The return value is a hashref, with keys being the matching group names. This is transformed with the map operation into a bulleted list (for easy layout, because I'm lazy).
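
The hashref-to-bulleted-list step can be tried without a news server. This minimal sketch builds the same kind of menu from a hand-made hashref; the helper name group_menu and the sample group names are assumptions, and the plain string HTML stands in for the CGI.pm shortcuts used in the listing.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a hashref keyed by group name into an HTML
# bulleted list of links.  The hash values are ignored here.  Group names
# are assumed to contain no HTML-special characters.
sub group_menu {
  my $active = shift;
  my @items = map(qq{<li><a href="/$_/">[$_]</a></li>}, sort keys %$active);
  return "<ul>" . join("", @items) . "</ul>";
}

# Made-up sample data standing in for the server's answer.
my $active = { "rec.humor.funny" => [], "rec.humor.d" => [] };
print group_menu($active), "\n";
```

Each link ends in a slash, so selecting a group produces the /GROUP.NAME/ form of URL that the main loop expects.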

If we make it to line 73, we've got a valid group name, and the NNTP server has put us ``into'' that group for further article-number operations. Lines 75 through 79 normalize the incoming article number to ensure that it's within range of possible article numbers for the group, and that the article is present. It still might be possible to jimmy up a URL that points at a non-existing article number, but the next/previous buttons will still find the next article up or down from that (I think).
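
The range normalization in lines 75 to 77 amounts to a clamp with a default. Here it is as a minimal standalone sketch; the helper name clamp_article is made up, and the numbers in the usage lines are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: force $article into the range $min..$max,
# defaulting to the newest article when none was requested.
sub clamp_article {
  my ($article, $min, $max) = @_;
  $article = $max unless defined $article;   # entering the group
  $article = $min if $article < $min;
  $article = $max if $article > $max;
  return $article;
}

print clamp_article(undef, 100, 200), "\n";  # no number given: newest
print clamp_article(5,     100, 200), "\n";  # too low: oldest
print clamp_article(150,   100, 200), "\n";  # in range: unchanged
```

Being in range does not guarantee the article still exists, which is why the listing follows the clamp with a status query to the server.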

Speaking of those next/previous buttons, lines 80 to 85 use the NNTP server to find the next higher or lower valid article number, if the number was followed by -next or -prev, respectively. Note that the Net::NNTP interface does not normally let me find out what the server has said about article numbers. I was puzzled about that, but rather than fight it, I just stared at the source code, and found that I could get at the most recent response message using the message method, and that it did indeed let me parse out the message number! So I'm using a published interface, but only by having looked at the source code for how a different public interface was actually implemented. Praise to open source!
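
The parsing itself is just a leading-digits match on the response text. A minimal sketch, assuming the text handed back by the message method begins with the article number; the helper name and the sample response string below are invented for illustration and were not captured from a real server.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the leading article number out of a response
# message, or return undef if the message does not start with digits.
sub article_number_from {
  my $message = shift;
  my ($number) = $message =~ /^(\d+)/;
  return $number;
}

# Invented sample text in the general shape of a status response.
my $sample = '1235 <made-up-id@example.invalid> article selected';
print article_number_from($sample), "\n";
```

The list assignment matters: on a failed match it leaves the variable undefined, rather than storing the match's false value.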

Lines 87 to 100 create the navigation widget to go to the next or previous article, or go back to the newsgroup selection menu. First off, nobody will ever confuse me for an ``HTML designer''. But I tried to take a whack at those little ``left-up-right'' thingies, by using a 3x2 table. The three icons are always displayed, but are inactive if that direction makes no sense. Yeah, not very good GUI, and it looked like a mess when I tried it in Mac IE (not my preferred browser), but it does work.

Finally, it's time for the article itself. Line 103 pulls up the ``current'' article, returning an arrayref if successful, or undef otherwise.

Line 104 puts the article into a Mail::Internet object so that we can easily get at the headers and get rid of the signature area if we choose. News articles are a subset of mail messages, so this works nicely. Line 105 would have removed the signature, but I left that commented out because I wanted to see the URLs in most signatures during testing.

Lines 107 through 109 handle the bulk of the transformation. First, we take a few select header lines, and then the message body, and feed them through fix, which finds all the URL links and links them, and makes the rest HTML-safe. We wrap that in a pre element, and we're done!

The fix subroutine starting in line 114 uses the nearly magical URI::Find module to locate the links. It does a great job of finding most of the common things that people put into a message, erring occasionally on recognizing too much or too little. Unfortunately, the messy interface to find_uris requires marking the return string with odd marker characters to be able to perform an escapeHTML sweep over the result. But it works fine.
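
The marker trick is easier to follow in isolation. The sketch below is a simplified stand-in, not the listing's code: a crude regex replaces URI::Find, a hand-rolled escape_html replaces CGI's escapeHTML, and a split on the marker replaces the substitution in line 119. All three substitutions, and both helper names, are assumptions made so that the example runs with no extra modules.

```perl
use strict;
use warnings;

# Stand-in for CGI's escapeHTML: escape the four usual characters.
sub escape_html {
  my $s = shift;
  $s =~ s/&/&amp;/g;
  $s =~ s/</&lt;/g;
  $s =~ s/>/&gt;/g;
  $s =~ s/"/&quot;/g;
  return $s;
}

# Hypothetical helper: link anything that looks like an http(s) URL and
# HTML-escape everything else.  Presumes the input has no \001 bytes.
sub fix_sketch {
  my $text = shift;
  # Pass 1: wrap each piece of literal markup in \001 ... \001 so that
  # pass 2 can tell it apart from text that still needs escaping.
  $text =~ s{(https?://[^\s<>"]+)}{\001<a href="\001$1\001">\001$1\001</a>\001}g;
  # Pass 2: pieces alternate between "escape me" and "leave me alone".
  my @pieces = split /\001/, $text, -1;
  my $out = "";
  for my $i (0 .. $#pieces) {
    $out .= $i % 2 ? $pieces[$i] : escape_html($pieces[$i]);
  }
  return $out;
}

print fix_sketch('see http://www.example.com/?a=1&b=2 <now>'), "\n";
```

The point of the two passes is that the URL and its link text are escaped like any other text, while the generated anchor markup passes through untouched.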

And there you have it. Reading the news via the web, and rather efficiently. If you're feeling adventurous, you can add a post or reply capability to this program, or perhaps look for MIME encoded images and extract them automatically, or maybe even a subject display to pick one of many messages, or threading. Well, you've probably seen a newsreader: you could make it into one of those with enough hacking. Just keep in mind that there's only one process shared amongst all the people currently reading news, so you don't want any one activity to take too long. Until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl -Tw
        =2=     $|++;
        =3=     use strict;
        =4=     use CGI qw(:standard escapeHTML);
        =5=     use HTTP::Daemon;
        =6=     use Net::NNTP;
        =7=     use URI::Find;
        =8=     use Mail::Internet;
        =9=     
        =10=    ## config
        =11=    my $PORT = 42084;               # at what port
        =12=    my $TIMEOUT = 600;              # number of quiet seconds before abort
        =13=    my $NNTP = "news.my-isp.comm";  # news-server
        =14=    ## end config
        =15=    
        =16=    my ($HOST) = $ENV{SERVER_NAME} =~ /(.*)/s; # untaint
        =17=    
        =18=    my $d = do {
        =19=      local($^W) = 0;
        =20=      new HTTP::Daemon (LocalAddr => $HOST, LocalPort => $PORT, Reuse => 1)
        =21=    };
        =22=    my $url = $d ? $d->url : "http://$HOST:$PORT";
        =23=    
        =24=    my $SELF_URL = self_url;        # for restarting when server breaks
        =25=    my $ICONS = self_url(-base => 1)."/icons";
        =26=    
        =27=    print redirect($url);
        =28=      
        =29=    exit 0 unless defined $d;       # do we need to become the server?
        =30=    
        =31=    defined(my $pid = fork) or die "Cannot fork: $!";
        =32=    exit 0 if $pid;                 # I am the parent
        =33=    close(STDOUT);
        =34=    
        =35=    my $nntp = Net::NNTP->new($NNTP);
        =36=    $nntp->reader if $nntp;
        =37=    
        =38=    ## the main loop
        =39=    {
        =40=      alarm($TIMEOUT);              # (re-)set the deadman timer
        =41=      my $c = $d->accept or redo;   # $c is a connection
        =42=      select $c;                    # default for print
        =43=      my $r = $c->get_request;      # $r is a request
        =44=      unless ($r) {
        =45=        warn "cannot get request", $c->reason;
        =46=        redo;
        =47=      }
        =48=    
        =49=      (my $code = $r->url->epath) =~ s{^/}{};
        =50=      $c->send_basic_header;
        =51=      $CGI::Q = new CGI $r->content;
        =52=    
        =53=      print header, start_html("read news");
        =54=    
        =55=      unless ($nntp) {
        =56=        print "Sorry, the NNTP server is unavailable!", br,
        =57=          a({-href => $SELF_URL}, "[Start over]");
        =58=        close $c;
        =59=        redo;
        =60=      }
        =61=    
        =62=      my ($group, $article, $direction, $number, $min, $max);
        =63=      unless (($group, $article, $direction) =
        =64=              $code =~ /\A([a-z0-9.]+)\/(?:(\d+)(-\w+)?)?\z/ and
        =65=              ($number, $min, $max) = $nntp->group($group)) {
        =66=        print h2("Select a group");
        =67=        my $active = $nntp->active("rec.humor.*");
        =68=        print ul(map li(a({-href => "/$_/"}, escapeHTML("[$_]"))),
        =69=                 sort keys %$active);
        =70=        close $c;
        =71=        redo;
        =72=      }
        =73=      ## we have a valid group:
        =74=      print h2("Group ", escapeHTML($group));
        =75=      $article = $max unless defined $article; # if entering group
        =76=      $article = $min if $article < $min;
        =77=      $article = $max if $article > $max;
        =78=      ($article) = $nntp->message =~ /^(\d+)/
        =79=        if $nntp->nntpstat($article); # prepare for next/prev
        =80=      if ($direction) {
        =81=        if ($direction eq "-prev") {
        =82=          ($article) = $nntp->message =~ /^(\d+)/ if $nntp->last;
        =83=        } elsif ($direction eq "-next") {
        =84=          ($article) = $nntp->message =~ /^(\d+)/ if $nntp->next;
        =85=        }                           # might add other cases here or error checking
        =86=      }
        =87=      ## navigation box:
        =88=      print table({-border => 0, -cellspacing => 0, -cellpadding => 2},
        =89=                  Tr(td("&nbsp;"),
        =90=                     td(a({-href => "/"}, img({-src => "$ICONS/up.gif"}))),
        =91=                     td("&nbsp;")),
        =92=                  Tr(td($article > $min
        =93=                        ? a({-href => "/$group/$article-prev"},
        =94=                            img({-src => "$ICONS/left.gif"}))
        =95=                        : img({-src => "$ICONS/left.gif"})),
        =96=                     td("&nbsp;"),
        =97=                     td($article < $max
        =98=                        ? a({-href => "/$group/$article-next"},
        =99=                            img({-src => "$ICONS/right.gif"}))
        =100=                       : img({-src => "$ICONS/right.gif"}))));
        =101=     ## article:
        =102=     print h2("Article ", escapeHTML($article));
        =103=     my $headbody = $nntp->article or do { close $c; redo };
        =104=     my $mail = Mail::Internet->new($headbody);
        =105=     ## $mail->remove_sig;
        =106=     $mail->tidy_body;
        =107=     print pre(fix(join("", map("$_: ".($mail->head->get($_)),
        =108=                                qw(Subject Date From)),
        =109=                        "\n", @{$mail->body}),1));
        =110=     close $c;
        =111=     redo;
        =112=   }
        =113=   
        =114=   sub fix {                       # HTML escape, plus find URIs if $_[1]
        =115=     local $_ = shift; return escapeHTML($_) unless shift;
        =116=     # use \001 as "shift out", "shift in", presume data doesn't have \001
        =117=     find_uris($_, sub {my ($uri, $text) = @_;
        =118=                        qq{\1<a href="\1$uri\1" target=_blank>\1$text\1</a>\1} });
        =119=     s/\G(.*?)(?:\001(.*?)\001)?/escapeHTML($1).(defined $2 ? $2 : "")/sgie;
        =120=     $_;
        =121=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.