CPAN, NNTP, Clarinet Comics (Apr 96)

Web Techniques Column 1 (April 1996)

The Comprehensive Perl Archive Network (CPAN) is a wonderful net resource for finding nearly everything you could want to know about Perl. The CPAN archives around the world are all identical, each being mirrored from the master CPAN archive. You should use the archive nearest you.

To find the nearest archive, take a look at the current list of CPAN sites, which can be found in the master CPAN archive at ftp://ftp.funet.fi/pub/languages/perl/CPAN/CPAN. (Please use the master archive sparingly, unless you happen to be in Finland.)

Once you've bookmarked your nearest CPAN site, commit a few minutes to getting to know the content and organization. Those few minutes could save you the hours you would otherwise waste reinventing a wheel.

(Because all the CPAN archives are identical, I'll be referencing files within the archive as CPAN/something, meaning that you should look for "something" beneath the top-level directory of the CPAN.)

For example, a good place to start is to look at CPAN/ROADMAP (or CPAN/ROADMAP.html if you are surfing). Another pretty hot spot is CPAN/modules/README, which describes the available modules.

And that brings me back to the point about knowing the available modules. Like y'all, I'm often faced with a lot of stuff to do and never quite enough time to do it. So, any time I can leverage off of existing code, I do. Let me give you a recent example.

My Internet service provider (ISP), Teleport, subscribes to the ClariNet e.news service (described at http://www.clarinet.com). Each day, I am inundated with more wire stories than I could possibly read, carefully organized into topic groups to allow me to be rather selective. (The only downside is that I can't line the birdcage with it when I'm done.)

These stories are shipped using the standard Usenet news transport mechanisms, and generally kept on the same NNTP servers along with the rest of the fire-hose drinking fountain called "Usenet", making it trivial to read ClariNet wire articles in the same session as, say, comp.lang.perl.misc or comp.infosystems.www.authoring.cgi.

Part of this service includes the digitized pictures of a handful of on-line comics (both entertainment and editorial). Each day, the latest Doonesbury, Bizarro, and Toles come whistling down the wires, just waiting to be downloaded for viewing pleasure.

However, to view these pictures, I had to go into my newsreader, save the appropriate article, find my base64 decoder (not uudecode), generate the resulting GIF or JPEG, and then download it to my laptop. Needless to say, I didn't stay current on any of the comics this way.

Apparently, Jennifer Myers (jmyers@eecs.nwu.edu) saw the same problem and decided to solve it with a rather nice little Perl script, which can be surfed at http://www.eecs.nwu.edu/~jmyers/cgi-src/read-comics. According to the initial documentation in the script:

'read-comics' is a Common Gateway Interface (CGI) script for CGI-compliant HTTP servers (e.g., NCSA httpd). 'read-comics' generates a hypertext interface to the ClariNet electronic newspaper's cartoon newsgroups. It does this by querying the NNTP server for a list of active articles in the newsgroup and then generates from this, a list of hyperlinks. When a link is selected, the article is retrieved from the NNTP server and then (base64) decoded on-the-fly.

Well, this sounded like the solution to my worries, so I grabbed a copy for myself, modified the necessary pointers, and started getting my daily dose of Dilbert.

Unfortunately, a few months ago, ClariNet stopped Dilbert, and started giving us Doonesbury and the others. Jennifer's script became much less useful. I was stuck between (1) waiting for Jennifer to update the script to discard Dilbert and replace it with the other groups, (2) updating Jennifer's script myself, or (3) rewriting it from scratch, but using the new CPAN modules. I chose #3.

I decided to rewrite it from scratch so that I could learn about some very useful modules. I knew that I needed CGI access, NNTP access, and base64 decoding, so I read up on the modules and got cracking.

CGI access is easily accomplished with the CGI.pm module, by Lincoln Stein. This one-piece module handles most basic CGI interfacing, including form generation and argument parsing. (Lincoln also has a more comprehensive CGI::* package, which I'll be talking about in future columns.) The version I'm using right now is located at CPAN/authors/id/LDS/CGI.pm-2.13.tar.gz.

Similarly, NNTP access is handled with the nicely written NNTPClient module from Rodger Anderson. This package connects my Perl program to the NNTP server, allowing me to locate newsgroups and articles within those groups. The version I'm using right now is at CPAN/authors/id/RVA/NNTPClient-0.22.pm.gz.

Finally, I needed a base64 decoder. I found one in the LWP library as MIME::Base64, so I pulled in that particular module. LWP (libwww-perl) is maintained by Gisle Aas, so look for it under CPAN/authors/id/GAAS/.

The importance of pulling in these libraries is that I could rely on the work of other people, rather than fussing and mussing with all this stuff myself. This really saved me time, because I could concentrate on all the high-level stuff, and leave the low-level stuff to the libraries.

The resulting script is in Listing 1 [below], which I've annotated with line numbers for the following discussion.

Lines 3 through 5 pull in the various libraries: CGI, News::NNTPClient, and MIME::Base64. This has to be done early in the script so that we can access the methods.

Line 7 defines the news server name. Now, obviously, if you are going to use this script, you'll have to change this. And don't even bother trying to access news.teleport.com... it's restricted to Teleport customers.

Lines 9 and 10 remind me that this script needs to be placed into a directory that has restricted access. I have a "teleport-only" CGI bin area. I've protected this directory with an NCSA "htaccess" file that looks like this:

      
	AuthUserFile /dev/null
	AuthGroupFile /dev/null
	AuthType Basic
	<Limit GET POST>
	order deny,allow
	deny from all
	allow from .teleport.com
	</Limit>

which permits only teleporters to access my script. If I hadn't done that, anyone around the world could have accessed this script, gaining indirect access to the ClariNet groups which Teleport pays for. This would almost certainly be a violation of the ClariNet licensing.

Lines 12 through 22 define the available newsgroups and a short human name for each. This is organized as a list of references to lists. For example, $groups[2][0] is "clari.living.comics.doonesbury", the newsgroup name, while $groups[2][1] is the corresponding human name, "Doonesbury". If ClariNet adds or deletes any comics group, I'm on top of it simply by editing this table.
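If the "list of references to lists" idea is new to you, here is a minimal standalone sketch of how such a table is indexed and walked. It uses a trimmed copy of the table from Listing 1; the variable names $entry, $newsgroup, and $label are mine, not part of the listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A trimmed copy of the table: each element is a reference to a
# two-element list of [newsgroup name, human-readable name].
my @groups = (
    ["clari.living.comics.bizarro",    "Bizarro"],
    ["clari.living.comics.cafe_angst", "Cafe Angst"],
    ["clari.living.comics.doonesbury", "Doonesbury"],
);

# $groups[2][0] is shorthand for $groups[2]->[0]: element 0 of the
# list that element 2 of @groups refers to.
print "$groups[2][0] => $groups[2][1]\n";

# Walk the whole table, unpacking each inner list as we go.
for my $entry (@groups) {
    my ($newsgroup, $label) = @$entry;
    print "$label lives in $newsgroup\n";
}
```

Adding or dropping a comic is then a one-line change to the table, with no other code to touch.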

Note that line 21 defines "clari.news.photos". After I had developed this script, I noticed that this newsgroup contains pointers for all the news pictures in all of the other groups. So, a simple edit, and I could access all of them.

Line 24 comes straight out of the documentation for CGI.pm. It defines a CGI object $Q, and forces all of the correct input for my CGI script (via command-line, environment, and standard-input) to be gathered and collated.

Line 25 creates a text string of a URL that points back to this script. This is handy, because some of the links that this script generates reinvoke the script with additional information (what newsgroup, which article). By letting CGI.pm generate this name, I can move the script around without having to perform any surgery on the script itself.

Line 27 validates an incoming CGI parameter "group". Initially, this parameter is not present. However, as described later, this script reinvokes itself defining "group", and later "article". If "group" isn't defined, the script needs to present a top-level index, allowing the user to choose one of the newsgroups.

Lines 28 to 44 present this top-level index. Since I was in a hurry, I didn't put very many comments on this code, so let me at least hit the high points here.

Lines 28 and 29 create $links, containing anchors for each newsgroup. These anchors are generated from the @groups list. The resulting list looks like:

      
	<p><a href="SOMEWHERE?group=clari.living.comics.doonesbury">
	Doonesbury</a>

(without the newline there). The "SOMEWHERE" is actually the name of this script. So, selecting this link will reinvoke the same script setting the "group" parameter to "clari.living.comics.doonesbury".
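To try the link-building step without a web server, here is a rough sketch that swaps the CGI.pm self-URL for a plain placeholder string. The variable $self and the literal "SOMEWHERE" are assumptions for illustration; in the real script the value comes from CGI.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder for the script's own URL (the listing gets this from CGI.pm).
my $self = "SOMEWHERE";

my @groups = (
    ["clari.living.comics.bizarro",    "Bizarro"],
    ["clari.living.comics.doonesbury", "Doonesbury"],
);

# map turns each [newsgroup, label] pair into one anchor string,
# and join glues the anchors together with newlines.
my $links = join "\n", map {
    my ($newsgroup, $label) = @$_;
    qq{<p><a href="$self?group=$newsgroup">$label</a>};
} @groups;

print "$links\n";
```

Each anchor carries the newsgroup name in the "group" parameter, which is what the script looks for when it is reinvoked.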

Lines 32 through 40 represent the resulting output sent back to the browser, consisting of a header, an HTML start, a list of links, and an HTML end. The fancy construct @{[thing]} is a visually attractive way of evaluating "thing" in an array context and interpolating its space-separated value into a double-quoted string. Sure, I could have done this as a series of prints, but I was in a hurry.
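Here is a small sketch of that @{[thing]} construct on its own, away from the CGI calls. The title_words subroutine is a made-up stand-in for a method call such as $Q->header.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper standing in for a method call such as $Q->header.
sub title_words { return ("Read", "the", "Comics") }

my $count = 3;

# A plain function call is not interpolated inside a string...
my $plain = "Title: title_words()";

# ...but wrapping it in @{[ ]} builds an anonymous list from the call's
# return values and immediately dereferences it, so the values are
# interpolated, joined with a single space (the default value of $").
my $title = "Title: @{[ title_words() ]}";
my $total = "Total: @{[ $count * 2 ]} strips";

print "$plain\n$title\n$total\n";
```

The same trick works inside a here-document, which is how the listing uses it.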

For the top-level index, that's the end of the execution. But if a "group" parameter was specified, we go on to the next check. Line 46 notices whether or not an "article" parameter was given. If we are coming from the top-level index, we don't have an article parameter, so we end up inside the block beginning at line 47.

Lines 47 through 65 create a second-level index. This is an index of all current articles within a particular newsgroup, each representing a separate picture. The group name comes from the "group" parameter.

Line 47 establishes a connection to the NNTP server. The number 119 here is the standard NNTP port number. (I didn't bother to figure out if I could leave it out, but this works. :-)

Lines 48 to 51 grab the article number and subject line of every existing article in the newsgroup. The ugly expression in line 48 returns a list of tab-separated "xover" lines, one for each article. Inside the loop, line 49 breaks apart the number and subject, and line 50 assembles a list of links as HTML. The result of line 50 will look like:

      
	<p><a href="SOMEWHERE?group=clari.living.comics.doonesbury&article=123">
	Doonesbury 950101</a>

where 123 is the article number and "Doonesbury 950101" is the ClariNet subject line. Notice that this will once again reinvoke the same script, but passing it both a group and a particular article number as two separate parameters.
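The split-and-link loop can be exercised without a news server. In this sketch the overview lines are invented sample data (the addresses, dates, and sizes are made up), laid out as tab-separated fields with the article number first and the subject second, which is all the loop relies on. The $self placeholder again stands in for the CGI.pm self-URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder for the self-URL, which already carries the group parameter.
my $self = "SOMEWHERE?group=clari.living.comics.doonesbury";

# Invented sample overview lines: number, subject, then other fields.
my @overview = (
    join("\t", 123, 'Doonesbury 950101', 'poster@example.com',
         '1 Jan 1995 00:00:00 GMT', '<1@example.com>', '', 4096, 60),
    join("\t", 124, 'Doonesbury 950102', 'poster@example.com',
         '2 Jan 1995 00:00:00 GMT', '<2@example.com>', '', 4100, 61),
);

my $links = "";
for (@overview) {
    # split works on $_ by default; keep only the first two fields.
    my ($numb, $subj) = split /\t/;
    $links .= qq{<p><a href="$self&article=$numb">$subj</a>\n};
}

print $links;
```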

Lines 54 through 62 create the output HTML, similar to the text above.

Line 69 begins the portion of the script that actually generates a picture for the browser. Because we have both a valid "group" and "article" parameter, we can fetch a specific article. Once again, using NNTPClient, I connect to the server and select the right group, but this time I get the article into the @art array (lines 69 to 71).

The useful part of the article starts at the Content-Type header line... so I discard everything up to that using the mini-loop in line 72. Also, I discovered that the comics were in GIF form, but the news photos were in JPEG, so I have to save the type in line 73, and pass it along to the browser when I'm done.

The Base64 encoding of the image data apparently starts right after the next blank line. (I didn't bother looking up any standards here... so this is all by eyeball.) Line 74 gets us down there. Line 75 gets rid of the trailing line, which seems to have nothing to do with the encoded data.

Line 76 turns the base64 info into the binary data. (I call it $gif even though sometimes it's a JPEG. Oh well.)
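To watch the header-skipping and decoding steps work without a news server, here is a sketch that runs the same statements against a fabricated article. The article lines, the "--trailer--" line, and the payload are all invented, and I'm assuming each line arrives with its newline attached. MIME::Base64 supplies both encode_base64 (used only to build the fake article) and decode_base64.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Stand-in for real image data.
my $payload = "GIF89a fake image bytes";

# A fabricated article: some headers, a blank line, the encoded body,
# and one trailing line that is not part of the encoded data.
my @art = (
    "Subject: Doonesbury 950101\n",
    "MIME-Version: 1.0\n",
    "Content-Type: image/gif\n",
    "Content-Transfer-Encoding: base64\n",
    "\n",
    encode_base64($payload),
    "--trailer--\n",
);

# Discard lines until the Content-Type header, capturing the image type.
shift @art while @art and $art[0] !~ /^Content-Type: (image\/[-a-z]+)/;
my $type = $1;

# Discard the remaining header lines, stopping at the blank line.
shift @art while @art and $art[0] !~ /^\s*$/;

# Drop the trailing line, then decode what is left.
pop @art;
my $image = decode_base64(join "", @art);

print "type: $type, bytes: ", length($image), "\n";
```

The blank line left at the front of @art is harmless here, because decode_base64 skips characters that are not part of the base64 alphabet.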

Lines 77 and 78 dump the binary data to the browser, tagging it so that the browser knows how to interpret it. Obviously, if the browser can't handle a JPEG, we've just tossed garbage at it, but that's the way of the web.

I hope you've enjoyed this little program and walkthrough. And remember, don't reinvent the wheel! Use existing code where you can!

Listing 1

      
	=1=	#!/usr/bin/perl
	=2=	
	=3=	use CGI;			# must be version 2 or higher
	=4=	use News::NNTPClient;
	=5=	use MIME::Base64;
	=6=	
	=7=	$nntpserver = "news.teleport.com"; # location of news server
	=8=	
	=9=	## because of the copyright nature of this material, you should
	=10=	## put this script in a directory that has an appropriate htaccess file.
	=11=	
	=12=	@groups = (
	=13=		   ["clari.living.comics.bizarro", "Bizarro"],
	=14=		   ["clari.living.comics.cafe_angst","Cafe Angst"],
	=15=		   ["clari.living.comics.doonesbury","Doonesbury"],
	=16=		   ["clari.living.comics.forbetter","For Better or For Worse"],
	=17=		   ["clari.living.comics.foxtrot","Foxtrot"],
	=18=		   ["clari.living.comics.ozone_patrol","Ozone Patrol"],
	=19=		   ["clari.editorial.cartoons.toles","Toles"],
	=20=		   ["clari.editorial.cartoons.worldviews","Worldviews"],
	=21=		   ["clari.news.photos","News photos (not a comic, but handy)"],
	=22=		   );
	=23=	
	=24=	$Q = new CGI;
	=25=	$Qself = $Q->self_url;
	=26=	
	=27=	unless ($group = $Q->param('group')) { # nothing at all, give index
	=28=	    $links = join "\n",
	=29=	    map { "<p><a href=\"$Qself?group=$_->[0]\">$_->[1]</a>" } @groups;
	=30=	
	=31=	    print <<"GROK"; q/"/;
	=32=	@{[$Q->header]}
	=33=	@{[$Q->start_html('Comics','merlyn@stonehenge.com')]}
	=34=	<h1>Read the Comics</h1>
	=35=	<p>Select the group you want to read:
	=36=	<HR>
	=37=	$links
	=38=	<HR>
	=39=	<p>Please respect the copyrights and license agreements of this service.
	=40=	@{[$Q->end_html]}
	=41=	GROK
	=42=	q/"/;
	=43=	    exit 0;
	=44=	}
	=45=	
	=46=	unless ($article = $Q->param('article')) { # group but no art, give group
	=47=	    $N = new News::NNTPClient($nntpserver,119,0);
	=48=	    for ($N->xover($N->group($group))) {
	=49=		($numb,$subj) = split /\t/;
	=50=		$links .= "<p><a href=\"$Qself&article=$numb\">$subj</a>\n";
	=51=	    }
	=52=	    
	=53=	    print <<"GROK"; q/"/;
	=54=	@{[$Q->header]}
	=55=	@{[$Q->start_html('Comics','merlyn@stonehenge.com')]}
	=56=	<h1>Read the Comics</h1>
	=57=	<p>Select the article you wish to view:
	=58=	<HR>
	=59=	$links
	=60=	<HR>
	=61=	<p>Please respect the copyrights and license agreements of this service.
	=62=	@{[$Q->end_html]}
	=63=	GROK
	=64=	q/"/;
	=65=	    exit 0;
	=66=	}
	=67=	
	=68=	## $group and $article both valid:
	=69=	$N = new News::NNTPClient($nntpserver,119,0);
	=70=	$N->group($group);
	=71=	@art = $N->article($article);
	=72=	shift @art while @art and $art[0] !~ /^Content-Type: (image\/[-a-z]+)/;
	=73=	$type = $1;
	=74=	shift @art while @art and $art[0] !~ /^\s*$/;
	=75=	pop @art;			# heh
	=76=	$gif = decode_base64(join "", @art);
	=77=	print "Content-type: $type\n\n";
	=78=	print $gif;
	=79=	exit 0;

About Randal L. Schwartz