back issues of Dilbert (Jul 97)

Web Techniques Column 15 (July 1997)

In many of the previous columns, I've talked about the use of the wonderful CGI.pm module for easy interaction with the client (browser). I've also demonstrated many uses for the very powerful LWP library, which lets a program connect to servers as if it were a web client. But there are also applications that can take advantage of both libraries at once, to provide a different view of live data presented by another server.

For example, I could create a web page that, based on a user's input, performs that particular search on a half-dozen of the best web search engines and combines the results into a summary page. Or I could write an application that queries all the online catalog stores to look for the best prices on a particular item. These programs are often called "agents". (I suppose one that performs its job without any kind of detection would be a "secret agent".)

Around the same time that I was pondering a useful-but-small program to illustrate the concept, I noticed that I had been on the road quite a bit (teaching my Perl courses all over the US), and had fallen behind on my favorite comic strip, Dilbert. Luckily, UnitedMedia (the syndicator for Dilbert) has a web site with the back strips for about a month or so.

Unfortunately, they have it set up so that I have to start from a top-level page, and then keep selecting each individual back issue from a calendar, getting them one at a time. This is slightly annoying, so it dawned on me that I could simply build a CGI program that would do all the hard work for me, creating a single web page with all the GIFs at once. And what luck, because this would be an interesting use of both the CGI and LWP libraries.

And the result is in [Listing one, below].

Lines 1 and 2 begin nearly every program I write, triggering taintchecks (-T), warnings (-w) and forcing me to do reasonable things with variables, barewords, and references (use strict). Line 3 disables buffering on STDOUT, not a big deal here, but handy if this script were to have system() invocations (which are typically buffered differently).

Lines 4 through 6 pull in the three library modules that I'll need for this program. The get() routine, found in LWP::Simple, fetches web things based on a URL. The URI::URL module brings in the url() routine, needed to make some of the parsed URLs absolute instead of relative. And the all-singing, all-dancing CGI module provides nearly everything I need to talk to the web browser through the CGI interface.

Lines 9 through 13 provide a configuration section. In here, I define the URL of the page that will be scanned for the sub-links ($TOP), and two regular expressions that define the HTML pages with the sub-references ($HTML_RE) as well as the GIF URLs on those pages ($GIF_RE). Obviously, if UnitedMedia wants to stop me from using this script, they can certainly change this stuff, but by having it up front here, I can just change the configuration to reflect the new locations, as long as there are no major structural changes. Line 12 also defines the maximum number of days ($KEEP) that the script will fetch, no matter how many are requested.

Lines 16 through 20 define a small routine that turns an arbitrary text string into HTML "entities", to make it safe to include in an HTML web page. I copied this code from one of the programs I wrote for a prior column. There's also a version of this code in the LWP library, but I like mine better. Oh well. Line 17 copies the first argument into a localized $_. Line 18 mangles this value in $_, replacing all dangerous characters (double-quote, less-than, greater-than, and ampersand) with their numeric entity equivalents. Line 19 returns this mangled value back to the caller.
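As a quick illustration of what that escaping does, here is a minimal standalone sketch of the same idea. It uses a lexical variable and an explicit capture rather than $_ and $&, and the sample string is made up for the example.

```perl
use strict;
use warnings;

# Same idea as the column's ent(): replace ", <, & and > with
# numeric character entities such as &#60;.
sub ent {
    my $text = shift;
    $text =~ s/(["<&>])/"&#" . ord($1) . ";"/ge;
    return $text;
}

my $sample = 'cannot get <a href="x">A&B</a>';   # made-up sample string
print ent($sample), "\n";
```

Every other character passes through untouched, so the result can be dropped into a page without the browser mistaking any of it for markup.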

Line 21 defines a utility routine that takes its arguments and passes them to a td() function, along with an anonymous hash reference with a key of align and value of center. The td() routine transforms the data into an HTML TD directive (table data), but with the text alignment set to centering. I use this quite a bit below, so it was easier to write a subroutine to set it all up at once.

Lines 23 through 28 define a little set of utility routines along with some common static data that they share. Line 24 creates a variable $notes, visible only to these three routines, but having a persistence of the life of the program. Its initial value is the empty string, initialized at compile-time because this is in a BEGIN block. Line 25 creates an add_note() subroutine that adds information to the $notes value, while lines 26 and 27 define two subroutines that access the resulting string as both the original data and automatically HTML-entitized (via ent(), defined earlier). These subroutines are used to hang on to the error messages resulting from the inability to access the pages.
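The private-variable trick can be tried on its own. The sketch below uses made-up names (add_line, get_lines) rather than the ones in the listing, and simply shows that the string survives between calls while staying invisible to the rest of the program.

```perl
use strict;
use warnings;

BEGIN {
    my $log = "";                           # private to the two subs below
    sub add_line  { $log .= join "", @_ }   # append text to the shared string
    sub get_lines { $log }                  # read it back
}

add_line("first problem\n");
add_line("second ", "problem\n");
print get_lines();
```

Any attempt to mention $log outside the block fails under use strict, which is the point: the only way in or out is through the two subroutines.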

Line 30 prints the top part of the common web page response, consisting of the HTTP header, and some material at the beginning of the displayed web page. Whether the page is being used to generate the initial query form, or the resulting web page with the GIFs doesn't matter yet.

Line 31 grabs the form-field called "max". If there's no such field defined yet (such as when we first enter this CGI URL into a browser or follow it from an A link), then the "max" field, as well as the $max variable, will be undefined. Line 32 detects this, heading either into a "display the form" mode or a "display the results" mode.

Lines 33 through 56 are used in the "display the results" mode, entered only when $max was a good integer. Line 33 ensures that $max is never greater than $KEEP. If you want to be sure to grab only the last 14 days, for example, you could set $KEEP to 14.
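The check-then-cap logic of lines 32 and 33 can be sketched in isolation. The helper name checked_max is hypothetical, and $KEEP is set to 14 here only to make the capping visible.

```perl
use strict;
use warnings;

my $KEEP = 14;   # assumed cap for this sketch; the listing uses 99

# Hypothetical helper: returns a usable day count, or undef when the
# form should be shown instead.
sub checked_max {
    my $max = shift;
    return undef unless defined $max and $max =~ /^\d+$/;
    return $max > $KEEP ? $KEEP : $max;
}

for my $input (undef, "abc", "7", "45") {
    my $result = checked_max($input);
    printf "%-7s => %s\n",
        defined $input  ? "'$input'" : "undef",
        defined $result ? $result    : "show the form";
}
```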

Line 34 fetches the top-level page, the one that hopefully has references to all the sub-pages. If the page is not successfully fetched, lines 36 and 37 record that fact via add_note(). However, if the page is successfully fetched, we've got work to do.

Line 39 parses the results of the fetch, and creates the absolute URLs, all in one fell swoop. It's best to read this one right-to-left. Starting from a global-match in $top of $HTML_RE, we extract all the raw (relative) references. One at a time, these are dropped into the $_ of the map expression, yielding an absolute reference.
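To see the list-context global match at work without touching the network, here is a small sketch. The page fragment and its file names are invented, and the simple string concatenation is a naive stand-in for url($_, $TOP)->abs that is only right for paths beginning with a slash.

```perl
use strict;
use warnings;

# Made-up page fragment; the real archive page will look different.
my $page = <<'HTML';
<a href="/comics/dilbert/archive/dilbert970601.html">June 1</a>
<a href="/comics/dilbert/archive/dilbert970602.html">June 2</a>
<a href="/comics/other/archive/other970602.html">not Dilbert</a>
HTML

my $HTML_RE = '/comics/dilbert/archive/dilbert\d+\.html';
my $HOST    = 'http://www.unitedmedia.com';

# In list context, m//g returns every captured match at once.
my @relative = $page =~ m!($HTML_RE)!g;

# Naive stand-in for url($_, $TOP)->abs: only correct for root-relative paths.
my @absolute = map { $HOST . $_ } @relative;

print "$_\n" for @absolute;
```

The third link is ignored because it does not match the pattern, which is how the script skips everything on the page that is not a back issue.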

So, at this point @old_urls points at all possible prior days. Line 41 reduces the list to just $max number of entries if it's longer than that (which usually it should be). Now for the fun part, because we have to fetch each of those pages to get the GIF references on them. Good thing UnitedMedia has a very fast web server.
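The trimming step relies on a negative array slice. A tiny sketch with placeholder data shows the effect:

```perl
use strict;
use warnings;

my @days = map { "day$_" } 1 .. 10;   # placeholder data, oldest first
my $max  = 3;

# A negative range counts from the end, so this keeps the last $max items.
@days = @days[-$max .. -1] if @days > $max;

print "@days\n";   # prints "day8 day9 day10"
```

This keeps the most recent strips only if the page lists them oldest first, which is what the slice in the listing assumes.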

Lines 43 through 47 fetch each of the pages containing the GIF references. For each of these URLs, line 44 tries to get the content. If it fails, then we simply add a note about a failure, and move on. However, if we're successful, we then have to parse the page looking for the appropriate GIF reference. That's handled by lines 45 and 46.
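The "try it, or note the failure and move on" idiom of line 44 depends on the comma binding more tightly than the low-precedence or. Here is a sketch of the same shape, with a hypothetical fake_get() and a lookup table standing in for LWP::Simple's get() and the network:

```perl
use strict;
use warnings;

# Hypothetical stand-in for get(): a lookup table instead of the network.
my %fake_web = (
    "http://example.com/a" => "page A",
    "http://example.com/c" => "page C",
);
sub fake_get { $fake_web{ $_[0] } }

my (@pages, @failures);
for my $url (map { "http://example.com/$_" } qw(a b c)) {
    # On a false result, record the failure and skip to the next URL.
    my $content = fake_get($url)
        or (push @failures, "cannot get $url"), next;
    push @pages, $content;
}

print "fetched: @pages\n";
print "failed:  @failures\n";
```

Note that the test is for truth, not definedness, so an empty page (or one containing just "0") would also be treated as a failure.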

Once the GIF URLs have all been collected (or at least as many of them as we can find this time), it's time to return the results to the user. Lines 50 to 56 print the results, as a table of nested tables. This one has to be read from the back to front as well, and involves two nested maps as well as a call to fetch the "footnotes" for the table.

Starting at line 54, we can see that @gif_urls is going to be passed through a map expression in lines 52 and 53. Within this expression, each URL will end up in $_, so we're creating a table with two rows, each of which will have a centered TD item. The first row is the name of the GIF (for later reference), and the second row is an HTML IMG directive pointing at the URL itself. The browser takes care of the actual fetching of the GIF, meaning that this script never passes along the copyrighted information in the GIF (an important consideration for the creator of this script).

Line 51 takes each of those inner tables in turn, and makes each one into a row in the outer table. In earlier versions of this script, I used the "table within a table" concept to turn on borders and colors around each day's output, but eventually settled on just the unbordered result. Even so, a browser is free to provide a tighter association between the label and its corresponding GIF this way than if I had just made the result one large flat table.
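The shape of the two nested maps may be easier to see with the CGI.pm shortcuts replaced by a few throwaway helpers. The tag(), table(), row() and cell() routines below are hypothetical stand-ins written for this sketch, not CGI.pm functions, and the URLs are placeholders.

```perl
use strict;
use warnings;

# Hypothetical mini helpers standing in for CGI.pm's table(), TR() and td().
sub tag   { my $name = shift; "<$name>" . join("", @_) . "</$name>" }
sub table { tag("table", @_) }
sub row   { tag("tr", @_) }
sub cell  { tag("td", @_) }

my @gif_urls = ("http://example.com/1.gif", "http://example.com/2.gif");

# Inner map: one two-row table per URL (label, then image).
# Outer map: wrap each inner table in a row of the outer table.
my $html = table(
    map { row(cell($_)) }
    map { table(row(cell($_)), row(cell(qq{<img src="$_">}))) }
    @gif_urls
);
print "$html\n";
```

Two URLs produce three tables in all: one outer table and one inner table per strip. Unlike the listing, this sketch does not entity-escape the label.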

That's all there is to do when the data is actually being fetched, but if the original $max parameter is missing or is not a plain integer, we must instead pop up a form to give the user a chance to specify the date range. This is handled in lines 58 through 61. Line 58 displays a horizontal rule and a form-start tag. Lines 59 and 60 put in a submit button and a pop-up menu allowing the user to select a day count. Note that I give a range of 1 through 45 days of back issues, and set the default to 14. The submit button gets a label of "get this many days of back-images" as well, leaving me no need to write additional text for this application. (On browsers that don't support labeled submit buttons, it'll probably be pretty obvious.)

Line 63 adds the closing material common to both the form output and the result-image output.

All that's left to do is drop this script into a CGI area somewhere, such as:

      
	http://www.stonehenge.com/cgi/dilbert

and then invoke it. If invoked without any parameters, the form pops up, allowing me to specify a number of days, which when submitted brings up the proper, easy-to-read, scrollable page of back issues. If I want to skip the intermediate step, I can say:

      
	http://www.stonehenge.com/cgi/dilbert?max=14

because this "GET" URL will pass the proper "max" value into the script. Way cool. Now I can just bookmark that.

And there you have it. A CGI script that also fetches information live off the web, processes it to just the stuff we wanted, and then feeds this back. The possibilities of such scripts are endless. See ya next time.

Listing One

      
col15.pl

	=1=	#!/home/merlyn/bin/perl -Tw
	=2=	use strict;
	=3=	$|++;
	=4=	use LWP::Simple qw/get/;
	=5=	use URI::URL;
	=6=	use CGI qw/:form :html param header/;
	=7=	
	=8=	## configure
	=9=	my $TOP = "http://www.unitedmedia.com/comics/dilbert/archive/";
	=10=	my $HTML_RE = '/comics/dilbert/archive/dilbert\d+\.html';
	=11=	my $GIF_RE = '/comics/dilbert/archive/images/dt\d+_\d+\.gif';
	=12=	my $KEEP = 99;
	=13=	## end configure
	=14=	
	=15=	## return $_[0] encoded for HTML entities
	=16=	sub ent {
	=17=	  local $_ = shift;
	=18=	  $_ =~ s/["<&>"]/"&#".ord($&).";"/ge;  # entity escape
	=19=	  $_;
	=20=	}
	=21=	sub td_center { td({ align => "center" }, @_); }
	=22=	
	=23=	BEGIN {
	=24=	  my $notes = "";
	=25=	  sub add_note { $notes .= join "", @_; }
	=26=	  sub get_notes { $notes; }
	=27=	  sub get_ent_notes { ent $notes; }
	=28=	}
	=29=	
	=30=	print header, start_html("Dilbert"), h1("Recent Dilberts"), "\n";
	=31=	my $max = param("max");
	=32=	if (defined $max and $max =~ /^\d+$/) {
	=33=	  $max = $KEEP if $max > $KEEP;
	=34=	  my $top = get $TOP;
	=35=	  my @gif_urls = ();
	=36=	  if (not defined $top) {
	=37=	    add_note "cannot get $TOP";
	=38=	  } else {
	=39=	    my @old_urls = map url($_,$TOP)->abs, $top =~ m!($HTML_RE)!og;
	=40=	
	=41=	    @old_urls = @old_urls[-$max..-1] if @old_urls > $max;
	=42=	
	=43=	    for my $url (@old_urls) {
	=44=	      my $content = get $url or (add_note "cannot get $url\n"), next;
	=45=	      my ($gif) = $content =~ m!($GIF_RE)!o;
	=46=	      push @gif_urls, url($gif,$url)->abs;
	=47=	    }
	=48=	  }
	=49=	
	=50=	  print table(
	=51=	              (map { TR(td_center($_)) }
	=52=	               map { table(TR(td_center(ent $_)),
	=53=	                           TR(td_center(img{-src => $_})))
	=54=	                   } @gif_urls),
	=55=	              p(get_ent_notes())
	=56=	             );
	=57=	} else {
	=58=	  print hr, start_form;
	=59=	  print p(submit("get this many days of back-images:"),
	=60=	          popup_menu("max", [1..45], "14"));
	=61=	  print end_form, hr;
	=62=	}
	=63=	print "\n", end_html;
