Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.



Linux Magazine Column 22 (Mar 2001)

[suggested title: 'Headlines in the news']

One of the things I find myself spending a lot of time doing is participating in online discussion areas. Originally, all we had was Usenet, but lately, the concept of a ``web-based community'' has really taken hold. Usually, these communities provide some sort of message-based system (often with threading and separate discussion areas for topics), and perhaps an ``interactive chat'' area, either HTML or Java based.

I frequent one such web community called ``the Perl Monastery'' (online at perlmonks.org), which has quite an active membership with dozens of messages posted every day, and some pretty sharp people to answer questions. A recent posting piqued my interest. A user who goes by ``jcwren'' online suggested a series of ``contests'' to get people thinking about new solutions, or just to show off. He decided to kick off the first contest himself, giving away a ``Perlmonks T-shirt'' to the winner, funded out of his pocket.

The contest was to last only a week, and we're midway through the week as I write this, so I can't tell you the winner. It won't be me, because jcwren deliberately disallowed entries from the senior participants of the Monastery (called ``saints''), of which I seem to be one. Oh well. Even though I wasn't eligible, I gave it a whack anyway, since it was a nice challenge in an area of web-based solutions that's becoming more and more interesting: the repackaging of information. I think you'll continue to see more and more ``middleware'' on the net (sites that act as brokers or meta-searchers), so I'm constantly researching this area to see what can be done to help.

The basic problem was to create a headline list for state-based headlines. CNN's interactive news ticker has this information as a far-too-flashy pop-up window, but the data file that it refreshes was easily reverse-engineered, and the URL and file format of that internal data file has apparently been stable for some months now, making the challenge possible.

So jcwren asked for a command-line program (not CGI) that fetched CNN's internal data file each time it was invoked. He expected to be doing this from cron, probably every 10 minutes or so. Any new headlines found there were to be remembered in a ``database'', with an unspecified structure, but the simpler the better, because he wanted to be able to run this easily on both Unix and Windows. As new headlines were found, they were to be timestamped on their first observation (there are no timestamps in the source data, so this is as close as we get to a freshness factor).

Now, to keep from being a history book, each headline was to be aged out of the database when it had not been seen in a specified amount of time (default one day). As long as CNN was still showing the headline, it'd stay alive for at least this much longer.

And, to make it even more fun, the headlines must be organized by state, with a clickable set of links at the top of the output. All 50 states (plus DC) needed to always be present, but only states for which news was present were to be active links, which would scroll down within the document to that state. Everything alphabetized, of course.

Further, the output is an HTML file (selectable, default index.html in the current directory) with a meta-refresh tag so that he could keep a browser window open on it.

I was curious about how long it would take me to do the program. I guessed around 90 minutes, and the first draft of the program was in fact completed in a bit under that, but I've done about a half hour of tweaking after that. And this program is present in [listing one, below], which I will now describe.

Lines 1 through 3 start nearly every program I write, enabling warnings, turning on the normal compiler restrictions for large programs, and disabling the normal buffering of standard output.

Lines 7 through 11 define the configurable constants used by this program that don't make sense to be overridden from the command line. The $CNN_URL is the source of our information. This program depends on the URL providing consistent data, so if it moves or changes format, too bad. The $CNN_CACHE file is a local mirror of that remote URL. And $DB_MEMORY holds our ``database'' in whatever format dbmopen selects (most often Berkeley DB).

Line 13 pulls in the CGI module. No, this isn't a CGI program, but we are generating HTML so I'm using the HTML generation shortcuts, and it just so happens that the CNN input format is nearly identical to the format of uploaded form data, which I recognized quickly and could leverage off existing code. The CGI module as of this writing doesn't include the HTML 4.01 standard col, thead, and tbody generation methods, so I simply added them, since I needed them.

Line 14 pulls in the mirror routine from LWP::Simple (part of the CPAN-installable LWP suite). And line 15 brings in GetOptions from the standard Getopt::Long module.

Next comes the parsing of the command line arguments, in lines 17 to 23. Four variables are declared with initial values, and GetOptions alters those values if the right command-line args are present. See the documentation for GetOptions for details, but hopefully this is readable as-is.
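To see the declare-with-a-default idiom in isolation, here is a minimal sketch. It uses GetOptionsFromArray (also from Getopt::Long) so that a made-up argument list can stand in for @ARGV; the argument values are purely illustrative.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical argument list standing in for @ARGV.
my @args = ('--refresh=5', '--output', 'news.html', '--clear');

GetOptionsFromArray(
  \@args,
  "refresh=i" => \ (my $refresh = 10),            # integer, default 10
  "output=s"  => \ (my $output  = "index.html"),  # string, default index.html
  "expire=i"  => \ (my $expire  = 1440),          # not given above, keeps default
  "clear!"    => \ (my $clear   = 0),             # boolean, --noclear also works
) or die "bad options\n";

print "refresh=$refresh output=$output expire=$expire clear=$clear\n";
```

Each variable is declared and given its default right where the reference to it is taken, so the option name, the variable, and the default all sit on one line.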

And then we need a list of states beginning in line 25, which I got quickly from visiting the Yahoo State local information page, cutting and pasting it from there into my program, fixing the whitespace, then adding commas between some of the states to make a nice chunk of names. Note the split which breaks the items on either the embedded comma or the ending newline of each line.
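The split is easier to follow on a shortened list. This sketch uses the same pattern on a three-line here-document (the selection of states is arbitrary):

```perl
use strict;
use warnings;

# Split on either ", " within a line or the newline ending each line.
my @states = split /, |\n/, <<'end';
ALABAMA, ALASKA, ARIZONA
NEW MEXICO, NEW YORK
WYOMING
end

print scalar(@states), " states\n";
print "$_\n" for @states;
```

Because split drops trailing empty fields, the final newline of the here-document doesn't add an empty element, and multi-word names such as NEW MEXICO stay intact since a bare space is not a separator.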

Lines 35 to 40 get the ``current'' information. Because we are maintaining a cache, we can use mirror, which minimizes the transfer cost. The request made to the server includes an ``if modified since'' header: if the information has not changed since this time, the server can return a quick ``304 Not Modified'' status to say ``hey, you've got it already''. But if new information arrives, the timestamp on the file is set to the ``last modified'' header (if present), so that the next request has the right ``if modified since'' header to repeat the process. Slick. Normal expected returns are status 200 (we've got a new file) and 304 (we already have the data). Anything else is broken, so we abort quickly.
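The status check relies on the fact that a bare block acts as a loop that runs once, so last leaves it early. Here is a sketch of just that control flow. To keep it runnable without a network, fake_mirror is a hypothetical stand-in for LWP::Simple's mirror that simply hands back the status it is given, and fetch_ok is likewise a name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for mirror(): returns the status code it is handed.
sub fake_mirror { my ($status) = @_; return $status }

sub fetch_ok {
  my ($status) = @_;
  {
    my $s = fake_mirror($status);
    last if $s == 200;              # a new copy was written to the cache
    last if $s == 304;              # the cache is already current
    die "status is $s, aborting\n"; # anything else is treated as fatal
  }
  return 1;
}

print "200 ok\n" if fetch_ok(200);
print "304 ok\n" if fetch_ok(304);
print "500: $@" unless eval { fetch_ok(500) };
```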

Line 42 opens the database. This is a simple ``hash on disk'' database, so we use dbmopen to let it pick the type and naming for us. (This database is cleared in line 48 if the right command-line parameter is present.)
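As a rough sketch of how a dbmopen-tied hash behaves, the following stores one entry, closes the database, and reopens it to show that the value persisted. The file name and the stored values are made up, and a temporary directory is used so nothing is left behind; which DBM flavor (and which file suffixes) you get depends on your Perl installation.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/demo.memory";      # hypothetical database name

{
  dbmopen(my %db, $file, 0644) or die "Cannot dbmopen $file: $!";
  $db{"OREGON\nSample headline"} = "100 200";
  dbmclose(%db);
}

# Reopen to show that the entry lives on disk, not just in memory.
dbmopen(my %again, $file, 0644) or die "Cannot dbmopen $file: $!";
my $value = $again{"OREGON\nSample headline"};
dbmclose(%again);

print "stored value: $value\n";
```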

Next, we'll set up the input and output streams (in lines 43 and 44). The input is the file fetched from CNN. The output is the HTML file, except that we don't want to overwrite the real file just yet, so we'll append a tilde to the filename (my editor's backup file convention, so I have scripts to clean those up). If all goes well, after finishing writing the file, I'll rename this temporary file over the top of the real data in one fell swoop, so that the browser will never see partial content. This is an important strategy when the reader and the writer are processes that don't coordinate with each other.
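The write-then-rename strategy can be sketched on its own as follows. This version uses a lexical filehandle rather than reopening STDOUT as the listing does, and the file name and page content are placeholders inside a temporary directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir    = tempdir(CLEANUP => 1);
my $output = "$dir/index.html";     # hypothetical final name
my $temp   = "$output~";            # scratch name next to the real file

# Write the complete page to the scratch file first.
open my $out, '>', $temp or die "Cannot create $temp: $!";
print $out "<html><body>complete page</body></html>\n";
close $out or die "Cannot close $temp: $!";

# Only now replace the real file, in a single step.
rename $temp, $output or die "Cannot rename $temp to $output: $!";
print "wrote $output\n";
```

Keeping the scratch file in the same directory as the final file matters, because rename is only a single-step replacement when both names are on the same filesystem.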

Line 46 processes STDIN using the CGI module's ability to parse a form. Assigning the output to $CGI::Q means that we get to use param and friends without having to use the ugly and nearly always unnecessary object-oriented form of invocation.
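If you don't have the CGI module handy, the idea can be approximated by hand. The sketch below assumes the data is ordinary URL-encoded ``name=value'' pairs joined by ampersands. The sample string is invented for illustration and is not CNN's actual data, and this hand-rolled decoder is a simplification of what CGI does for you.

```perl
use strict;
use warnings;

# Invented sample in URL-encoded "name=value&name=value" form.
my $raw = 'headline1=Storm+closes+roads&state1=OREGON'
        . '&headline2=Budget+vote+set&state2=NEW%20YORK';

my %param;
for my $pair (split /&/, $raw) {
  my ($name, $value) = split /=/, $pair, 2;
  next unless defined $value;
  for ($name, $value) {                       # decode both in place
    tr/+/ /;                                  # plus means space
    s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;      # %XX means that byte
  }
  $param{$name} = $value;
}

# Same loop shape as the listing: stop at the first missing headlineN.
my @found;
for (my $i = 1; my $headline = $param{"headline$i"}; $i++) {
  my $state = $param{"state$i"};
  push @found, "$state: $headline";
}
print "$_\n" for @found;
```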

We then pass over the data three times. The first pass (beginning in line 50) looks for all ``parameters'' from the input data with the form headlineN, where N begins at 1 and goes up (to about 100 at most from the data I saw while I was testing). The headline is stuffed into $headline, then the corresponding state jumps into $state.

From that, we construct a hopefully unique key of the state, a newline, and the headline. The corresponding value in the database is two integers separated by a space, both timestamps in Unix internal time format. The first number is when we first saw the headline (for display purposes), and the second number is the most recent time we've seen it (for aging purposes). So, if the key already exists, we update the second number to now, but if it doesn't, we create a new entry with both numbers set to now.
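Here's a small sketch of that update rule, using a plain hash in place of the on-disk database. The helper name record is made up for the example, and the current time is passed in as an argument so that the result is repeatable.

```perl
use strict;
use warnings;

my %db;   # plain hash standing in for the dbmopen-tied hash

# record() is a hypothetical helper; $now replaces the call to time.
sub record {
  my ($state, $headline, $now) = @_;
  my $key = "$state\n$headline";
  if (defined $db{$key}) {
    $db{$key} =~ s/\s\d+/ $now/;    # replace only the last-seen stamp
  } else {
    $db{$key} = "$now $now";        # first-seen and last-seen start equal
  }
}

record("OREGON", "Sample headline", 1000);
record("OREGON", "Sample headline", 1600);
print $db{"OREGON\nSample headline"}, "\n";   # prints "1000 1600"
```

The substitution matches a space followed by digits, which can only be the second number, so the first-seen stamp is never disturbed.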

On the second pass (beginning in line 62), we age out old data by looking at the second numbers of all the entries, deleting those who no longer qualify as fresh-enough news.
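The aging rule can be tried out with fixed numbers. In this sketch the ``current time'' and the two entries are invented so that exactly one entry falls behind the cutoff:

```perl
use strict;
use warnings;

my %db = (
  "OREGON\nOld story"   => "1000 2000",   # last seen at 2000
  "OREGON\nFresh story" => "1000 9000",   # last seen at 9000
);
my $now    = 10_000;   # pretend current time, in seconds
my $expire = 100;      # minutes, so the cutoff is 10_000 - 6000 = 4000

for my $key (keys %db) {
  delete $db{$key}
    if $db{$key} =~ /\s(\d+)/ and $1 < $now - $expire * 60;
}

print join(", ", map { (split /\n/)[1] } sort keys %db), "\n";
```

Deleting inside the loop is safe here because keys builds its whole list before the loop body runs.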

And finally, it's time to dump the data, starting in line 68. For each of the keys (line 75), we pull out the state, headline, and first-seen timestamps as a three-element arrayref, which is then sorted by state, timestamp, and headline order.
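The map-then-sort step looks like this on a tiny, made-up database of three entries:

```perl
use strict;
use warnings;

my %db = (
  "OREGON\nLater story"   => "300 900",
  "ALASKA\nOnly story"    => "200 900",
  "OREGON\nEarlier story" => "100 900",
);

my @data =
  sort {
    # state name, then first-seen time, then headline text
    $a->[0] cmp $b->[0] or $a->[2] <=> $b->[2] or $a->[1] cmp $b->[1]
  } map {
    # key splits into state and headline; value's first number is first-seen
    [ (split /\n/), (split /\s+/, $db{$_})[0] ]
  } keys %db;

print join(" | ", @$_), "\n" for @data;
```

Within each state the entries come out oldest first, with the headline text breaking any ties between identical timestamps.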

Line 77 introduces %states_seen, which will be used to track the first appearance of each state in the sorted list, and to figure out for which states to generate links at the top of the table.

And then comes the fun part: transforming the data into a table. For each element of the @data array (line 89), we break it apart into the three fields (line 81), then create a table row (line 82) consisting of three cells (lines 83, 86, and 87). The first cell is either the state name (fixed so that it can't wrap), or on first appearance, the state name with an internal anchor. The second cell is an abbreviated portion of localtime of the timestamp when the headline was first seen, and the final cell is the headline itself. Note the careful attention to properly encode this data as HTML entities if needed.
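The ``abbreviated portion of localtime'' comes from a greedy match that keeps everything up to the last whitespace, which drops the trailing year. Here's a sketch of that trick, swapping in gmtime with a fixed timestamp of zero so that the result doesn't depend on your timezone:

```perl
use strict;
use warnings;

my $stamp = 0;                      # the Unix epoch, chosen for a fixed result
my $full  = gmtime $stamp;          # scalar context yields a date string
my ($short) = $full =~ /(.*)\s/;    # greedy: stops at the last whitespace

print "$full\n";
print "$short\n";
```

The second line printed is the first with the year removed, which is about as compact as the string can get while still showing the day and time.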

The next step is to generate the top of the HTML file (on STDOUT), with the right header, title, and meta-refresh information, handled in lines 91 to 93.

Then it's time to generate the table. The cellspacing and cellpadding are personal choices (in line 95). The next three lines give hints to standards-compliant browsers (unlike Netscape or IE) about the width and alignment of the three columns. Then comes the ``table header'', consisting of the one row, one cell (spanning three columns) of all the states. If a state was seen, a link to the proper internal anchor is generated; otherwise, a simple name is used. Again, the state names are guaranteed not to wrap. And finally, the table guts are dumped inside the ``table body'' tag.

Lines 108 to 110 finish out the HTML page. Once this is complete, we can rename the temporary output name to its final destination, in line 112.

The two subroutines starting in line 115 handle some of the needed transformations. escapeHTMLnobreak calls the CGI-module-provided escapeHTML routine to fix all the HTML entities, but also changes all remaining spaces to non-breaking spaces. And fixname crunches a string so that it's a legal unique anchor name (for the expected dataset).
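To experiment with these two transformations without the CGI module, here's a sketch that substitutes a minimal hand-written escape_html for CGI's escapeHTML. That stand-in handles only the ampersand, angle brackets, and double quote, and the snake_case names are mine, not the listing's. fixname keeps the same tr as the listing.

```perl
use strict;
use warnings;

# Minimal stand-in for CGI's escapeHTML (an assumption for this sketch).
sub escape_html {
  my ($text) = @_;
  $text =~ s/&/&amp;/g;             # must come first
  $text =~ s/</&lt;/g;
  $text =~ s/>/&gt;/g;
  $text =~ s/"/&quot;/g;
  return $text;
}

# Escape, then turn every space into a non-breaking space.
sub escape_html_nobreak {
  my $text = escape_html("@_");
  $text =~ s/ /&nbsp;/g;
  return $text;
}

# Squash each run of characters other than letters and "#" to one underscore.
sub fixname {
  my ($name) = @_;
  $name =~ tr/a-zA-Z\#/_/cs;
  return $name;
}

print escape_html_nobreak("NEW YORK"), "\n";   # NEW&nbsp;YORK
print fixname("NEW YORK"), "\n";               # NEW_YORK
print fixname("#D.C."), "\n";                  # #D_C_
```

The ``#'' is preserved by fixname so that the same routine can build both the anchor name and the ``#name'' link that points to it.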

And that's it. Stick it into some filepath (not in a location for your webserver's CGI, and not necessarily in your PATH), and then run it frequently, and you too will have the latest headlines from CNN. Hopefully, you can see a few new gizmos and gadgets to steal for your own code. Enjoy!

Listings

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     ## begin config
        =6=     
        =7=     my $CNN_URL = "http://headlinenews.cnn.com/QUICKNEWS/virtual/swf.headline.txt";
        =8=     my $CNN_CACHE = "contest.cnn-cache"; # flat file
        =9=     my $DB_MEMORY = "contest.memory"; # dbmopen
        =10=    
        =11=    ## end config
        =12=    
        =13=    use CGI qw(:all -no_debug col thead tbody);
        =14=    use LWP::Simple qw(mirror);
        =15=    use Getopt::Long;
        =16=    
        =17=    GetOptions(
        =18=               "refresh=i" => \ (my $REFRESH = 10), # meta refresh time in minutes
        =19=               "output=s" => \ (my $OUTPUT = "index.html"), # output file
        =20=               "expire=i" => \ (my $EXPIRE = 1440), # expire time in minutes
        =21=               "clear!" => \ (my $CLEAR = 0), # clear the cache
        =22=               "<>" => sub { $Getopt::Long::error++; warn "Unknown arg: $_[0]\n" },
        =23=    ) or die "see code for usage\n";
        =24=    
        =25=    my @STATES = split /, |\n/, <<'end';
        =26=    ALABAMA, ALASKA, ARIZONA, ARKANSAS, CALIFORNIA, COLORADO, CONNECTICUT, D.C.
        =27=    DELAWARE, FLORIDA, GEORGIA, HAWAII, IDAHO, ILLINOIS, INDIANA, IOWA, KANSAS
        =28=    KENTUCKY, LOUISIANA, MAINE, MARYLAND, MASSACHUSETTS, MICHIGAN, MINNESOTA
        =29=    MISSISSIPPI, MISSOURI, MONTANA, NEBRASKA, NEVADA, NEW HAMPSHIRE, NEW JERSEY
        =30=    NEW MEXICO, NEW YORK, NORTH CAROLINA, NORTH DAKOTA, OHIO, OKLAHOMA, OREGON
        =31=    PENNSYLVANIA, RHODE ISLAND, SOUTH CAROLINA, SOUTH DAKOTA, TENNESSEE, TEXAS
        =32=    UTAH, VERMONT, VIRGINIA, WASHINGTON, WEST VIRGINIA, WISCONSIN, WYOMING
        =33=    end
        =34=    
        =35=    {
        =36=      my $s = mirror($CNN_URL, $CNN_CACHE);
        =37=      last if $s == 200;            # we got new data
        =38=      last if $s == 304;            # no new data, but we have to expire things
        =39=      die "status is $s, aborting\n";
        =40=    }
        =41=    
        =42=    dbmopen(my %DB, $DB_MEMORY, 0644) or die "Cannot dbmopen $DB_MEMORY: $!";
        =43=    open STDIN, $CNN_CACHE or die "Cannot open $CNN_CACHE: $!";
        =44=    open STDOUT, ">$OUTPUT~" or die "Cannot create $OUTPUT~: $!";
        =45=    
        =46=    $CGI::Q = CGI->new(\*STDIN) or die "Cannot parse $CNN_CACHE\n";
        =47=    
        =48=    %DB = () if $CLEAR;             # bye bye all that we know
        =49=    
        =50=    ## first pass: add the new headlines
        =51=    
        =52=    for (my $i = 1; my $headline = param("headline$i"); $i++) {
        =53=      my $state = param("state$i");
        =54=      my $key = "$state\n$headline";
        =55=      if (defined $DB{$key}) {      # just update modtime
        =56=        $DB{$key} =~ s/\s\d+/" " . time/e;
        =57=      } else {                      # add the entry
        =58=        $DB{$key} = time . " " . time;
        =59=      }
        =60=    }
        =61=    
        =62=    ## second pass: expire the old headlines
        =63=    
        =64=    for my $key (keys %DB) {
        =65=      delete $DB{$key} if $DB{$key} =~ /\s(\d+)/ and $1 < time - $EXPIRE * 60;
        =66=    }
        =67=    
        =68=    ## final pass: generate the report
        =69=    
        =70=    my @data =
        =71=      sort {
        =72=        $a->[0] cmp $b->[0] or $a->[2] <=> $b->[2] or $a->[1] cmp $b->[1]
        =73=      } map {
        =74=        [ (split /\n/), (split /\s+/, $DB{$_})[0] ]
        =75=      } keys %DB;
        =76=    
        =77=    my %states_seen;
        =78=    
        =79=    my @table_guts =
        =80=      map {
        =81=        my ($state, $headline, $stamp) = @$_;
        =82=        Tr(
        =83=           td($states_seen{$state}++ ?
        =84=              escapeHTMLnobreak($state) :
        =85=              a({-name => fixname($state)}, escapeHTMLnobreak($state))),
        =86=           td(escapeHTMLnobreak(((localtime $stamp) =~ /(.*)\s/)[0])),
        =87=           td(escapeHTML($headline)),
        =88=          )."\n";
        =89=      } @data;
        =90=    
        =91=    print start_html(-title => "CNN Headline News",
        =92=                     -head => meta({-http_equiv => 'refresh',
        =93=                                    -content => $REFRESH * 60}));
        =94=    
        =95=    print table({-border => 1, -cellspacing => 0, -cellpadding => 4},
        =96=                col({-width => "0*", -align => 'right'}), # state
        =97=                col({-width => "0*"}), # date
        =98=                col({-width => "*"}), # item
        =99=                thead(Tr(th({-colspan => 3, -align => 'center'},
        =100=                           join " | ",
        =101=                           map { $states_seen{$_} ?
        =102=                                   a({-href => fixname("#$_")},
        =103=                                     escapeHTMLnobreak($_)) :
        =104=                                       escapeHTMLnobreak($_);
        =105=                               } @STATES)."\n")),
        =106=               tbody(@table_guts));
        =107=   
        =108=   print end_html;
        =109=   
        =110=   close STDOUT;
        =111=   
        =112=   rename "$OUTPUT~", $OUTPUT or die "Cannot rename $OUTPUT~ to $OUTPUT: $!";
        =113=   exit 0;
        =114=   
        =115=   sub escapeHTMLnobreak {
        =116=     local $_ = escapeHTML("@_");
        =117=     s/ /&nbsp;/g;
        =118=     $_;
        =119=   }
        =120=   
        =121=   sub fixname {
        =122=     local $_ = shift;
        =123=     tr/a-zA-Z\#/_/cs;
        =124=     $_;
        =125=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
