Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Web Techniques Column 45 (Jan 2000)

[suggested title: So, really, what's new?]

Well, the big news is that we survived that big new year, so let's take a look at what other news we can turn up. Many sites on the net now publish frequently (even hourly) updated news, and make summaries of that news machine-parsable using an XML Document Type Definition called RSS. This format was originally developed by Netscape to add news channels to ``My Netscape'' sites, but has since been adopted as the de facto standard for distributing headlines and brief descriptions of things that change around the net.

Now, XML isn't hard to parse with Perl's massive text wrangling abilities, but it's even easier to parse if we don't have to write much code. As if just in time, my buddy Jonathan Eisenzopf hacked out a nice module called XML::RSS that could parse the various RSS files, as well as create ones of your own. So, I decided to whip up a little demonstration of this module.

Imagine a tool that goes out to various news sites once an hour or so (driven from cron), downloads new RSS data if something has changed, then pulls out only those headlines that you haven't seen already, sending you mail with just those changes. This tool brings together web mirroring technology (via the LWP library), parsing of RSS files via XML::RSS, and then sending mail via Net::SMTP, all found in the CPAN (at www.cpan.org, and hundreds of other places). That's what we've got in [listing one, below].

Line 1 turns on warnings, letting me know where I've made obvious detectable mistakes. Line 2 turns on the standard compiler restrictions, disabling the use of soft references, forcing the declaration of variables, and disabling those troublesome barewords from Perl Poetry Mode. Line 3 unbuffers standard out, handy in testing, and not so awful for programs that produce minimal output (as this does).

Line 5 brings in the XML::RSS module, found in the CPAN. If you don't have it, and you are on a reasonably well connected machine, you can simply use:

        $ perl -MCPAN -eshell
        [wait for some output, answer any questions]
        cpan> install XML::RSS
        [wait for some more output, answer other questions]
        cpan> quit
        $

to install it. This module parses RSS files nicely, and also generates them (we won't be generating any RSS here, but that doesn't matter).

Line 6 similarly brings in the LWP::Simple module from the LWP library, also found in the CPAN. If you have to install this module, this must be your first web program, because nearly everything cool uses LWP at some point. If you're installing using the CPAN.pm module, as the above example shows, be sure to get the Bundle::LWP, as in:

        cpan> install Bundle::LWP

This ensures you get all the pieces of the LWP, even though it is now broken out into separate distributions.

Lines 8 through 38 delineate what is likely to be changed for this program from time to time. Now, as always, my programs are not meant as ``ready to run'' scripts, but rather ``proof of concept'' demonstrations, to inspire you to write your own code based on my examples. However, having said that, these are likely the only parts you'll need to change to get the program working on your system.

Line 10 defines the recipient of the email of the new headline news. I've carefully added extra XX characters around parts of this email address because I'm tired of lame people just downloading my past scripts and running them without applying configuration changes. So, you'll get a lot of bounced email if you run this script as-is. But this is where your email address should go.

Similarly, line 11 defines the sender address. Mostly, this should also be you, but if you want to have a special sender for incoming mail filtering, you can put whatever you want here. Maybe even president@whitehouse.gov or something, just to impress your friends.

Line 12 defines the subject line of the email message. Again, you can put something distinct here if you are using incoming mail filtering, or just leave it as is.

Line 13 must be the name of a reachable host running SMTP mail, since I decided to use Net::SMTP to deliver the mail. This is not an arbitrary choice, since the mail host must permit you to connect and specify your required sender and recipient, and most anti-spam mailers are pretty picky these days.

Line 14 is the storage directory. This directory will contain two or three files for each news source: the RSS file as gathered from the remote site, and one or two files for a DBM database. (Berkeley DB uses a single .db file, while the other DBM modules use a pair of files that end in .dir and .pag.)

Lines 15 through 36 define the news sources, as a simple array of pairs of items. The first item is a URL to fetch the RSS file from, and the second item is a simple filename that will define the names within the $DIR directory.

Now, where do we get these names? The most comprehensive source of RSS URLs I could find came from the www.xmltree.com website, by typing news into the search box. I've been told that my.netscape.com and my.userland.com also have lists, but they were harder to find. Naturally, new RSS files are being brought online all the time, so any index will be out of date; check around for further locations.

After scanning through the list from www.xmltree.com, I came up with a list of things that looked interesting to me at first glance. I copied each of the RDF (Resource Description Framework, the flavor of XML in which these RSS files are written) URLs as the first item, then made up a short distinct filename for each second item. Often, it was just the final path component of the URL, but for some of those that wouldn't make sense.

Now that we've told the program everything we need to know, it's time to do the job. Line 40 ensures that we're in the right directory. Line 41 declares the output cache: we'll append everything that we would like to send to the end of the @output array. At the end of the program, if there's anything in the array, we'll open a mail connection and send it. If it's empty, the world hasn't changed much since we last looked, and there's no point in firing up a mailer to tell us that, is there?

Lines 43 through 72 loop through all the news sources, as long as we have items in the @NEWS array. Line 44 extracts the first two items from the array into $url and $localname, destroying the array as we go along. That's OK, because we walk the list only once, then exit.
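
The pair-at-a-time loop is easy to try in isolation. Here's a minimal sketch, assuming a made-up two-entry list (the example.com URLs and the @pairs array are mine, not from the listing):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical feed list: URL => local filename, flattened into one array.
my @NEWS = (
  "http://www.example.com/one.rdf" => "one.rdf",
  "http://www.example.com/two.rdf" => "two.rdf",
);

my @pairs;
while (@NEWS >= 2) {
  # splice removes and returns the first two elements each time around
  my ($url, $localname) = splice @NEWS, 0, 2;
  push @pairs, "$localname <- $url";
}

print "$_\n" for @pairs;
print "left over: ", scalar(@NEWS), "\n";   # nothing remains in the array
```

The fat comma (=>) is just a comma here, so the list really is a flat array, which is why splice can chew it up two at a time.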

Line 45 is where we start to get a memory for previous runs. Using the dbmopen call, we'll connect the %SAW hash together with an external DBM database. The keys of this hash tell us the headlines and URLs that we've seen already on the previous run, while the values are simply a timestamp that we can use for documentation. This prevents us from seeing the same headline and URL twice. The DBM name is based on the requested local file name, with an added .db or .dir and .pag.
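
To see that memory at work without any news feeds involved, here's a small sketch, assuming a scratch directory from File::Temp in place of $DIR and an invented database name of demo-feed:

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempdir);

# Scratch directory (a hypothetical stand-in for $DIR); removed at exit.
my $dir    = tempdir(CLEANUP => 1);
my $dbname = "$dir/demo-feed";   # dbmopen adds its own file extension(s)

my $old_tag = "Some headline\0http://www.example.com/story";
my $new_tag = "Fresh headline\0http://www.example.com/other";

# First "run": record one tag with a timestamp, then close the database.
{
  my %SAW;
  dbmopen %SAW, $dbname, 0644 or die "Cannot open DBM $dbname: $!";
  $SAW{$old_tag} = time;
  dbmclose %SAW;
}

# Later "run": reopen and ask which tags were seen before.
my ($seen_old, $seen_new);
{
  my %SAW;
  dbmopen %SAW, $dbname, 0644 or die "Cannot reopen DBM $dbname: $!";
  $seen_old = $SAW{$old_tag} ? 1 : 0;
  $seen_new = $SAW{$new_tag} ? 1 : 0;
  dbmclose %SAW;
}

print "old tag remembered: $seen_old\n";
print "new tag remembered: $seen_new\n";
```

Because the stored value is a timestamp, it's always true, so a plain truth test on the hash element is enough to ask ``have I seen this before?''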

Line 47 is the magic that keeps our queries as inexpensive as possible for the news providers. The mirror function takes a URL and a local filename. If the file doesn't exist, LWP fetches the URL and stores it into the file. Additionally, LWP sets the modification time on the file to be the same as the ``last modified'' time as described by the remote web server. On subsequent runs, LWP can note the modification time in an ``if-modified-since'' header to the web server, and the web server can return a quick ``not modified'' code if nothing has changed since the last check. If the URL cannot be fetched, or the content hasn't been modified since the last time we looked, mirror returns a status code that isn't a success, and the outer is_success routine returns false. That means we can skip any more processing on this news source.
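
You can watch this behavior without bothering anyone's web server. The sketch below is only an illustration under my own assumptions: it mirrors a local scratch file through a file: URL (which LWP also understands), and the filenames are invented. The first call should report success and the second should not, because the local copy is already current.

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempdir);
use LWP::Simple qw(mirror is_success);

# To stay off the network, mirror a local file through a file: URL.
# Both filenames are hypothetical stand-ins for a real feed and its copy.
my $dir    = tempdir(CLEANUP => 1);
my $source = "$dir/source.rdf";
my $local  = "$dir/local.rdf";

open my $fh, ">", $source or die "Cannot create $source: $!";
print $fh "<rss/>\n";
close $fh or die "Cannot close $source: $!";

my $url = "file://$source";

# First call: no local copy yet, so the content is fetched and stored.
my $first = mirror($url, $local);

# Second call: the local copy is current, so nothing is transferred.
my $second = mirror($url, $local);

for my $code ($first, $second) {
  print "status $code: ", (is_success($code) ? "process it" : "skip it"), "\n";
}
```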

Line 49 creates a new XML::RSS object to parse this fetched file. Lines 50 and 51 use this object to parse the file. Because XML::RSS invokes XML::Parser, which in turn calls XML::Parser::Expat, which can die on a bad parse, we wrap the call inside an eval block, and note the death by checking $@. If we make it past here, we've seen a good new file, and it's time to see if we also have new news instead of old news.
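
Here's the same guard as a small standalone sketch. It assumes XML::RSS is installed, uses the module's parse method on strings instead of parsefile, and feeds it two invented documents, one good and one deliberately broken. The try_parse helper is my own name, not part of the module.

```perl
#!/usr/bin/perl -w
use strict;
use XML::RSS;

# Parse a string of RSS inside eval; return the object, or nothing on failure.
sub try_parse {
  my ($xml) = @_;
  my $rss = XML::RSS->new;
  eval { $rss->parse($xml); 1 }
    or do { warn "cannot parse: $@"; return };
  return $rss;
}

# Hypothetical feed content, invented for this sketch.
my $good = <<'END';
<?xml version="1.0"?>
<rss version="0.91">
<channel>
<title>Example News</title>
<link>http://www.example.com/</link>
<description>Made-up headlines</description>
<item><title>First story</title><link>http://www.example.com/1</link></item>
<item><title>Second story</title><link>http://www.example.com/2</link></item>
</channel>
</rss>
END

my $bad = "<rss><channel><title>unterminated";

for my $xml ($good, $bad) {
  if (my $rss = try_parse($xml)) {
    print $rss->channel("title"), ": ",
      scalar(@{ $rss->{items} }), " items\n";
  } else {
    print "skipping this source\n";
  }
}
```

Ending the eval block with a true value means a successful parse is detected without depending on what the parse call itself returns.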

Line 53 sets up a hash to catch new headlines. Line 54 holds the output for this particular news source. If it's empty after the scan, there's no need for the item to appear in the mail, and also no need for the title identifying the news source.

Lines 55 through 62 pick out the news items. Each item comes back as a hashref as an element of the dereferenced arrayref. Line 56 picks out the title, url link, and the description (if any). Line 57 ensures that even if there's no description, we still have a non-undef value.
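
If the slice syntax in line 56 looks odd, this sketch may help. It assumes hand-built items shaped like the hashrefs described above; the data and the @rows array are invented.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical items, shaped like the hashrefs found in $rss->{items}.
my @items = (
  { title => "First story",  link => "http://www.example.com/1",
    description => "With a summary" },
  { title => "Second story", link => "http://www.example.com/2" },
);

my @rows;
for my $item (@items) {
  # @$item{...} is a hash slice through the reference: three values at once
  my ($title, $link, $description) = @$item{qw(title link description)};
  $description = "" unless defined $description;   # avoid undef warnings later
  push @rows, [$title, $link, $description];
}

print join(" | ", @$_), "\n" for @rows;
```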

Line 58 constructs a unique ``tag'' for each item as a single string which can be the key in a hash. I arbitrarily presumed that no headline or link would have a NUL byte in it, so I could use that as a clean delimiter.

The tag is the headline joined with a URL, and I presume that I will never want to see such a pair more than once. And lines 59 and 60 handle that, by storing this tag into a new hash item, and seeing if such a tag appeared on the previous run. Note that I'm using the DBM hash for that history. If we make it to line 61, we're looking at a never-seen-before headline and URL, and it's time to note that for sending in the mail.
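
The bookkeeping is easier to follow with the DBM taken out of the picture. A minimal sketch, assuming an ordinary in-memory hash for the history and a hypothetical new_tags helper of my own:

```perl
#!/usr/bin/perl -w
use strict;

# Return the tags not present in %$saw, plus a hash of every current tag.
sub new_tags {
  my ($saw, @tags) = @_;
  my (%seen, @fresh);
  for my $tag (@tags) {
    $seen{$tag} = time;          # remember every tag in the current feed
    next if $saw->{$tag};        # already reported on an earlier run
    push @fresh, $tag;
  }
  return (\@fresh, \%seen);
}

# Hypothetical history from a previous run (a plain hash, not a DBM).
my %SAW = ("Old story\0http://www.example.com/old" => time);

my @current = (
  "Old story\0http://www.example.com/old",
  "New story\0http://www.example.com/new",
);

my ($fresh, $seen) = new_tags(\%SAW, @current);
%SAW = %$seen;                   # the history now mirrors the current feed

print scalar(@$fresh), " new item(s)\n";
```

One consequence of replacing the history wholesale: a headline that drops off the feed and later reappears will be reported again, since only the most recent run is remembered.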

Line 63 leaves the bread crumbs behind so we know where we've been. We'll save all the headlines we've seen down into the DBM.

Lines 64 through 71 dump out the headlines and descriptions if we've seen headlines for this news source. The call to get the channel title begins the output, and the individual items of @item_output are pulled apart and formatted via the map in line 65.
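
Here's that formatting step on its own, as a sketch with invented records and a hard-coded channel title. One deliberate difference, which is my choice for this sketch: split gets a limit of 3, so an empty trailing description comes back as an empty string.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical packed records: title, link, description joined by NUL bytes.
my @item_output = (
  "First story\0http://www.example.com/1\0With a summary",
  "Second story\0http://www.example.com/2\0",
);

my @output = ("== Example News ==\n", map {
  # the limit of 3 keeps an empty trailing description as "" rather than undef
  my ($title, $link, $description) = split /\0/, $_, 3;
  ("$title\n",
   "  <URL:$link>\n",
   (length $description ? "  $description\n" : ()));
} @item_output);

print @output;
```

Each record turns into two or three lines, with the empty list dropping the description line entirely when there's nothing to say.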

After we've walked through all the news sources, it's time to possibly send some email, starting in line 74. If @output has items in it, we'll pull in the Net::SMTP module in line 75. (Using a require here means that we don't load the code unless it is needed.)

Lines 76 and 77 establish a connection to the designated mailserver, aborting on failure. (Hopefully, cron will let us know that something broke.) Lines 78 through 81 establish the mail from and to addresses. Lines 82 through 88 send the body of the message, including the headers, which duplicate the sender, receiver, and subject from above. Lines 89 and 90 terminate the mailer connection, and we're done!
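
The mail step can be sketched on its own as well. In this version the helper names (build_message, send_news) and the example.com addresses are my inventions, and error text comes from $@ and the object's message method; treat that as one reasonable choice for the sketch, not a statement about how the listing ought to report errors. Nothing is sent unless you call send_news yourself.

```perl
#!/usr/bin/perl -w
use strict;
use Net::SMTP;

# Build the header and body lines; kept separate so it can be checked offline.
sub build_message {
  my ($from, $to, $subject, @body) = @_;
  return ("From: $from\n", "To: $to\n", "Subject: $subject\n", "\n", @body);
}

# Deliver via SMTP. Host and addresses are whatever the caller supplies.
sub send_news {
  my ($host, $from, $to, $subject, @body) = @_;
  my $m = Net::SMTP->new($host, Timeout => 30)
    or die "Cannot connect to mail host $host: $@";
  $m->mail($from) or die "Sender rejected: ", $m->message;
  $m->to($to)     or die "Recipient rejected: ", $m->message;
  $m->data(build_message($from, $to, $subject, @body))
    or die "Message rejected: ", $m->message;
  $m->quit;
}

# Hypothetical addresses; this only prints what would be sent.
my @message = build_message('me@example.com', 'you@example.com',
                            "Example subject", "== Example News ==\n");
print @message;
```

The blank line between the headers and the body is required; without it the mail reader treats your headlines as more headers.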

So, to get your own individualized news, set the configuration parameters, make an empty directory, point to it with $DIR, select your news sources, and then ask cron to invoke this program every hour or two. The first invocation will give you a full dump of all the headlines for each source in email, but all subsequent invocations will have minimal output (if any), giving you just the newest news.

If you want to reset a particular news source, delete the DBM file (or files) and the RSS file, which will cause the next invocation of the program to dump all available headlines.

And now that you have this technology, there's no reason to ever be uninformed again. And that's the way it is, January 1st, 2000. I'm Randal Schwartz. Good night.

Listings

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use XML::RSS;
        =6=     use LWP::Simple qw(mirror is_success);
        =7=     
        =8=     ## BEGIN config
        =9=     
        =10=    my $MAIL_TO = 'merlyn@XXstonehenge.comXX';
        =11=    my $MAIL_FROM = 'merlyn@XXstonehenge.comXX';
        =12=    my $MAIL_SUBJECT = "All the news that fits, we print!";
        =13=    my $MAILHOST = "localhost";
        =14=    my $DIR = "/home/merlyn/.rssnews";
        =15=    my @NEWS =
        =16=      (
        =17=       ## Perl News
        =18=       "http://www.news.perl.org/perl-news-short.rdf" => "perl-news-short.rdf",
        =19=       ## Slashdot
        =20=       "http://slashdot.org/slashdot.rdf" => "slashdot.rdf",
        =21=       ## CNET computing news
        =22=       "http://alchemy.openjava.org/rss/news-com.rdf" => "cnet-computing-news.rdf",
        =23=       ## Freshmeat
        =24=       "http://freshmeat.net/backend/fm.rdf" => "freshmeat.rdf",
        =25=       ## Linux Today
        =26=       "http://linuxtoday.com/backend/my-netscape.rdf" => "linuxtoday.rdf",
        =27=       ## Linux Planet
        =28=       "http://www.linuxplanet.com/rss" => "linuxplanet.rdf",
        =29=       ## Mac Central
        =30=       "http://www.maccentral.com/mnn.cgi" => "maccentral.rdf",
        =31=       ## Macweek.com
        =32=       "http://macweek.zdnet.com/macweek.xml" => "macweek.rdf",
        =33=       ## Moreover Computer Security
        =34=       "http://www.moreover.com/cgi-local/page?index_computersecurity+rss"
        =35=       => "moreover-computer-security.rdf",
        =36=      );
        =37=    
        =38=    ## END config
        =39=    
        =40=    chdir $DIR or die "Cannot chdir $DIR: $!";
        =41=    my @output;
        =42=    
        =43=    while (@NEWS >= 2) {
        =44=      my ($url, $localname) = splice @NEWS, 0, 2;
        =45=      dbmopen my %SAW, $localname, 0644 or warn "Cannot open %SAW for $localname: $!";
        =46=    
        =47=      next unless is_success(mirror($url, $localname));
        =48=    
        =49=      my $rss = XML::RSS->new or die "can't create XML::RSS?";
        =50=      eval {$rss->parsefile($localname)} and not $@
        =51=        or (warn "cannot parse $localname: $@"), next;
        =52=    
        =53=      my %seen;
        =54=      my @item_output;
        =55=      for my $item (@{$rss->{items}}) {
        =56=        my ($title, $link, $description) = @$item{qw(title link description)};
        =57=        $description = "" unless defined $description;
        =58=        my $tag = "$title\0$link";
        =59=        $seen{$tag} = time;
        =60=        next if $SAW{$tag};
        =61=        push @item_output, "$tag\0$description";
        =62=      }
        =63=      %SAW = %seen;
        =64=      if (@item_output) {
        =65=        push @output, "== ".($rss->channel("title"))." ==\n", map {
        =66=          my ($title, $link, $description) = split /\0/, $_, 3;
        =67=          "$title\n",
        =68=          "  <URL:$link>\n",
        =69=          (length $description ? "  $description\n" : ());
        =70=        } @item_output;
        =71=      }
        =72=    }
        =73=    
        =74=    if (@output) {
        =75=      require Net::SMTP;
        =76=      my $m = Net::SMTP->new($MAILHOST)
        =77=        or die "Cannot connect to mail $MAILHOST: $!";
        =78=      $m->mail($MAIL_FROM)
        =79=        or die "Cannot set mail from: $!";
        =80=      $m->to($MAIL_TO)
        =81=        or die "Cannot set mail to: $!";
        =82=      $m->data("From: $MAIL_FROM\n",
        =83=               "To: $MAIL_TO\n",
        =84=               "Subject: $MAIL_SUBJECT\n",
        =85=               "\n",
        =86=               @output,
        =87=               )
        =88=        or die "Cannot send contents: $!";
        =89=      $m->quit
        =90=        or die "Cannot close mail: $!";
        =91=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.