Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Web Techniques Column 45 (Jan 2000)
[suggested title: So, really, what's new?]
Well, the big news is that we survived that big new year, so let's
take a look at what other news we can also turn up. Many sites on the
net are now publishing frequently (even hourly) updated news
information, and making summaries of that news machine-parsable using
the XML Document Type Definition called RSS format. This format
was originally developed by Netscape to be able to add news channels
to ``My Netscape'' sites, but has since been adopted as the de facto
standard for distributing headlines and brief descriptions of things
that change around the net.
Now, XML isn't hard to parse with Perl's massive text wrangling
abilities, but it's even easier to parse if we don't have to write
much code. As if just in time, my buddy Jonathan Eisenzopf hacked out
a nice module called XML::RSS that could parse the various RSS
files, as well as create ones of your own. So, I decided to whip up a
little demonstration of this module.
Imagine a tool that goes out to various news sites once an hour or so
(driven from cron), downloads new RSS data if something has
changed, then pulls out only those headlines that you haven't seen
already, sending you mail with just those changes. This tool brings
together web mirroring technology (via the LWP library), parsing of
RSS files via XML::RSS, and then sending mail via Net::SMTP,
all found in the CPAN (at www.cpan.org, and hundreds of other places).
That's what we've got in [listing one, below].
Line 1 turns on warnings, letting me know where I've made obvious detectable mistakes. Line 2 turns on the standard compiler restrictions, disabling the use of soft references, forcing the declaration of variables, and disabling those troublesome barewords from Perl Poetry Mode. Line 3 unbuffers standard out, handy in testing, and not so awful for programs that produce minimal output (as this does).
Line 5 brings in the XML::RSS module, found in the CPAN. If
you don't have it, and you are on a reasonably well connected machine,
you can simply use:
    $ perl -MCPAN -eshell
    [wait for some output, answer any questions]
    cpan> install XML::RSS
    [wait for some more output, answer other questions]
    cpan> quit
    $
to install it. This module parses RSS files nicely, and also generates them (we won't be generating any RSS here, but that doesn't matter).
Line 6 similarly brings in the LWP::Simple module from the LWP
library, also found in the CPAN. If you have to install this module,
this must be your first web program, because nearly everything cool
uses LWP at some point. If you're installing using the CPAN.pm module,
as the above example shows, be sure to get the Bundle::LWP, as in:
    cpan> install Bundle::LWP
This ensures you get all the pieces of the LWP, even though it is now broken out into separate distributions.
Lines 8 through 38 delineate what is likely to be changed for this program from time to time. Now, as always, my programs are not meant as ``ready to run'' scripts, but rather ``proof of concept'' demonstrations, to inspire you to write your own code based on my examples. However, having said that, these are likely the only parts you'll need to change to get the program working on your system.
Line 10 defines the recipient of the email of the new headline news.
I've carefully added extra XX characters around parts of this email
address because I'm tired of lame people just downloading my past
scripts and running them without applying configuration changes. So,
you'll get a lot of bounced email if you run this script as-is. But
this is where your email address should go.
Similarly, line 11 defines the sender address. Mostly, this should
also be you, but if you want to have a special sender for incoming
mail filtering, you can put whatever you want here. Maybe even
president@whitehouse.gov or something, just to impress your
friends.
Line 12 defines the subject line of the email message. Again, you can put something distinct here if you are using incoming mail filtering, or just leave it as is.
Line 13 must be the name of a reachable host running SMTP mail, since
I decided to use Net::SMTP to deliver the mail. This is not an
arbitrary choice, since the mail host must permit you to connect and
specify your required sender and recipient, and most anti-spam mailers
are pretty picky these days.
Line 14 is the storage directory. This directory will contain two or
three files for each news source: the RSS file as gathered from the
remote site, and one or two files for a DBM database. (Berkeley DB
uses a single .db file, while the other DBM modules use a pair of
files that end in .dir and .pag.)
Lines 15 through 36 define the news sources, as a simple array of
pairs of items. The first item is a URL to fetch the RSS file from,
and the second item is a simple filename that will define the names
within the $DIR directory.
Now, where do we get these names? The most comprehensive source of
RSS URLs I could find came from the www.xmltree.com website, by
typing news into the search box. Now, I've also been told that
my.netscape.com and my.userland.com have lists, but they
were harder to find. Naturally, new RSS files are being brought
online all the time, so any index will be out of date. Check around
for further locations.
After scanning through the list from www.xmltree.com, I came up
with a list of things that looked interesting to me at first glance.
I copied each of the RDF (Resource Description Framework) URLs as the
first item, then made up a short distinct filename for each second
item. Often, it was just the final path component of the URL, but for
some of those that wouldn't make sense.
Now that we've told the program everything it needs to know, it's time
to do the job. Line 40 ensures that we're in the right directory.
Line 41 declares the output cache: we'll append everything that we
would like to send to the end of the @output array. At the end of
the program, if there's anything in the array, we'll open a mail
connection and send it. If it's empty, the world hasn't changed much
since we last looked, and there's no point in firing up a mailer to
tell us that, is there?
Lines 43 through 72 loop through all the news sources, as long as we
have items in the @NEWS array. Line 44 extracts the first two
items from the array into $url and $localname, destroying the
array as we go along. That's OK, because we need each value only once,
and then the program exits.
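
As a rough sketch of the pair-walking idiom described above (the two URLs and filenames here are placeholders invented for illustration, not real feeds), splice peels two elements off the front of the array on each pass until nothing is left:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sources for illustration only; the URLs are placeholders.
my @NEWS = (
    "http://example.com/a.rdf" => "a.rdf",
    "http://example.com/b.rdf" => "b.rdf",
);

my @pairs;
while (@NEWS >= 2) {
    # splice removes and returns the first two elements of @NEWS
    my ($url, $localname) = splice @NEWS, 0, 2;
    push @pairs, "$localname <- $url";
}

print "$_\n" for @pairs;
print "left over: ", scalar(@NEWS), "\n";
```

Because the loop consumes the array, this only suits data that is needed once. If the list had to be reused, iterating by index in steps of two would be one alternative.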
Line 45 is where we start to get a memory for previous runs. Using
the dbmopen call, we'll connect the %SAW hash together with an
external DBM database. The keys of this hash tell us the headlines
and URLs that we've seen already on the previous run, while the values
are simply a timestamp that we can use for documentation. This
prevents us from seeing the same headline and URL twice. The DBM name
is based on the requested local file name, with an added .db or
.dir and .pag.
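
To see how a dbmopen-tied hash carries a memory from one run to the next, here is a minimal sketch. The scratch directory, the database name, and the sample key are all assumptions made up for the demonstration; the column itself uses a fixed $DIR and one database per news source. The two bare blocks stand in for two separate invocations of the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory so the demo cleans up after itself (an assumption
# for illustration; the column keeps its files in a permanent $DIR).
my $dir    = tempdir(CLEANUP => 1);
my $dbname = "$dir/demo-news";    # hypothetical database name
my $key    = "Some headline\0http://example.com/story";

# First "run": record one tag with a timestamp, then close.
{
    dbmopen(my %SAW, $dbname, 0644) or die "Cannot open $dbname: $!";
    $SAW{$key} = time;
    dbmclose(%SAW);
}

# Second "run": a fresh hash, but the entry is still there on disk.
my $remembered;
{
    dbmopen(my %SAW, $dbname, 0644) or die "Cannot open $dbname: $!";
    $remembered = $SAW{$key} ? 1 : 0;
    dbmclose(%SAW);
}

print "remembered: $remembered\n";
```

Which files appear on disk (a single .db, or a .dir and .pag pair) depends on which DBM library your Perl was built to prefer.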
Line 47 is the magic that keeps our queries as inexpensive as possible
for the news providers. The mirror function takes a URL and a
local filename. If the file doesn't exist, LWP fetches the URL and
stores it into the file. Additionally, LWP sets the modification time
on the file to be the same as the ``last modified'' time as described by
the remote web server. On subsequent runs, LWP can note the
modification time in an ``if-modified-since'' header to the web server,
and the web server can return a quick ``not modified'' code if nothing
has changed since the last check. If the remote file cannot be fetched,
or it hasn't been modified since the last time we looked, the response
code is not a success code in either case, and the outer is_success
routine returns false. That means we can skip any more processing on
this news source.
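
The mirror-then-check behavior can be tried offline with a sketch like the following. It assumes LWP is installed and, purely so the demo needs no network, mirrors a local file through a file:// URL in a scratch directory; the column itself mirrors http URLs. As far as I know LWP's file scheme honors the ``if-modified-since'' header the same way a web server would, but treat that as an assumption to verify on your installation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::Simple qw(mirror is_success);

# Scratch directory and file names are hypothetical, for the demo only.
my $dir    = tempdir(CLEANUP => 1);
my $source = "$dir/source.rdf";    # plays the part of the remote file
my $copy   = "$dir/copy.rdf";      # the local mirror

open my $fh, '>', $source or die "Cannot write $source: $!";
print $fh "<rss/>\n";
close $fh or die "Cannot close $source: $!";

my $url = "file://$source";

# First call: no local copy yet, so a full fetch is expected.
my $first = mirror($url, $copy);

# Second call: the local copy is current, so "not modified" is expected,
# which is_success treats as false.
my $second = mirror($url, $copy);

for my $code ($first, $second) {
    printf "%d: %s\n", $code, is_success($code) ? "process it" : "skip it";
}
```

Note that mirror from LWP::Simple returns the numeric response code, which is why it can be handed straight to is_success.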
Line 49 creates a new XML::RSS object to parse this fetched file.
Lines 50 and 51 use this object to parse the file. Because XML::RSS
invokes XML::Parser, which in turn calls XML::Parser::Expat, which can
die on a bad parse, we wrap the call inside an eval block, and
note the death by checking $@. If we make it past here, we've seen
a good new file, and it's time to see if we also have new news instead
of old news.
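
The eval-and-skip pattern is easy to try in isolation. In this sketch, parse_or_die is a made-up stand-in for a parser that dies on bad input (it is not part of XML::RSS), and the loop skips the bad input instead of letting the whole program die:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a parser that dies on malformed input.
sub parse_or_die {
    my ($text) = @_;
    die "not well-formed\n" unless $text =~ /^<.*>$/s;
    return { length => length $text };
}

my @results;
for my $input ("<rss></rss>", "garbage") {
    # eval traps the die; on failure it returns undef and sets $@
    my $parsed = eval { parse_or_die($input) };
    unless ($parsed) {
        my $error = $@ || "unknown error";
        chomp $error;
        push @results, "skipped: $error";
        next;
    }
    push @results, "parsed $parsed->{length} bytes";
}

print "$_\n" for @results;
```

Testing the value returned by eval, rather than $@ alone, is a slightly more defensive variation on what the listing does.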
Line 53 sets up a hash to catch new headlines. Line 54 holds the output for this particular news source. If it's empty after the scan, there's no need for the item to appear in the mail, and also no need for the title identifying the news source.
Lines 55 through 62 pick out the news items. Each item comes back as a hashref as an element of the dereferenced arrayref. Line 56 picks out the title, url link, and the description (if any). Line 57 ensures that even if there's no description, we still have a non-undef value.
Line 58 constructs a unique ``tag'' for each item as a single string which can be the key in a hash. I arbitrarily presumed that no headline or link would have a NUL byte in it, so I could use that as a clean delimiter.
The tag is the headline joined with a URL, and I presume that I will never want to see such a pair more than once. And lines 59 and 60 handle that, by storing this tag into a new hash item, and seeing if such a tag appeared on the previous run. Note that I'm using the DBM hash for that history. If we make it to line 61, we're looking at a never-seen-before headline and URL, and it's time to note that for sending in the mail.
Line 63 leaves the bread crumbs behind so we know where we've been. We'll save all the headlines we've seen down into the DBM.
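
Here is the tag-and-history logic on its own, as a sketch with ordinary hashes. The %SAW hash below is a plain hash pre-loaded as if an earlier run had stored one story, standing in for the DBM-backed hash, and the two items are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the DBM-backed history: one story seen on an earlier run.
my %SAW = ("Old story\0http://example.com/old" => 946684800);

# Hypothetical items, shaped like the title/link pairs in the column.
my @items = (
    { title => "Old story", link => "http://example.com/old" },
    { title => "New story", link => "http://example.com/new" },
);

my (%seen, @fresh);
for my $item (@items) {
    # NUL is assumed never to occur in a title or link
    my $tag = "$item->{title}\0$item->{link}";
    $seen{$tag} = time;     # remember everything in the current file
    next if $SAW{$tag};     # already reported on an earlier run
    push @fresh, $item->{title};
}
%SAW = %seen;               # history now reflects the current file only

print "fresh: @fresh\n";
print "history size: ", scalar(keys %SAW), "\n";
```

One consequence of replacing the history wholesale: a headline that drops out of the feed and later reappears will be reported again, which seems a reasonable trade for keeping the database small.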
Lines 64 through 71 dump out the headlines and descriptions if we've
seen headlines for this news source. The call to get the channel
title begins the output, and the individual items of @item_output
are pulled apart and formatted via the map in line 65.
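
The formatting step can be sketched by itself as follows. The two sample records are invented, and the explicit limit of 3 on split is my own addition rather than something in the listing: it keeps the third field defined even when the description is empty, since split otherwise drops trailing empty fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records in the "title NUL link NUL description" shape.
my @item_output = (
    "First headline\0http://example.com/1\0A short description",
    "Second headline\0http://example.com/2\0",    # no description
);

my @lines = map {
    # A limit of 3 keeps the (possibly empty) third field defined.
    my ($title, $link, $description) = split /\0/, $_, 3;
    (
        "$title\n",
        "  <URL:$link>\n",
        # an empty list here means no line at all for this item
        (length $description ? "    $description\n" : ()),
    );
} @item_output;

print @lines;
```

Returning an empty list from inside map is what lets one input record produce either two or three output lines.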
After we've walked through all the news sources, it's time to possibly
send some email, starting in line 74. If @output has items in it,
we'll pull in the Net::SMTP module in line 75. (Using a require
here means that we don't load the code unless it is needed.)
Lines 76 and 77 establish a connection to the designated mailserver, aborting on failure. (Hopefully, cron will let us know that something broke.) Lines 78 through 81 establish the mail from and to addresses. Lines 82 through 88 send the body of the message, including the headers, which duplicate the sender, receiver, and subject from above. Lines 89 and 90 terminate the mailer connection, and we're done!
So, to get your own individualized news, set the configuration
parameters, make an empty directory, point to it with $DIR, select
your news sources, and then ask cron to invoke this program every
hour or two. The first invocation will give you a full dump of all
the headlines for each source in email, but all subsequent invocations
will have minimal output (if any), giving you just the newest news.
If you want to reset a particular news source, delete the DBM file (or files) and the RSS file, which will cause the next invocation of the program to dump all available headlines.
And now that you have this technology, there's no reason to ever be uninformed again. And that's the way it is, January 1st, 2000. I'm Randal Schwartz. Good night.
Listings
        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=
        =5=     use XML::RSS;
        =6=     use LWP::Simple qw(mirror is_success);
        =7=
        =8=     ## BEGIN config
        =9=
        =10=    my $MAIL_TO = 'merlyn@XXstonehenge.comXX';
        =11=    my $MAIL_FROM = 'merlyn@XXstonehenge.comXX';
        =12=    my $MAIL_SUBJECT = "All the news that fits, we print!";
        =13=    my $MAILHOST = "localhost";
        =14=    my $DIR = "/home/merlyn/.rssnews";
        =15=    my @NEWS =
        =16=      (
        =17=       ## Perl News
        =18=       "http://www.news.perl.org/perl-news-short.rdf" => "perl-news-short.rdf",
        =19=       ## Slashdot
        =20=       "http://slashdot.org/slashdot.rdf" => "slashdot.rdf",
        =21=       ## CNET computing news
        =22=       "http://alchemy.openjava.org/rss/news-com.rdf" => "cnet-computing-news.rdf",
        =23=       ## Freshmeat
        =24=       "http://freshmeat.net/backend/fm.rdf" => "freshmeat.rdf",
        =25=       ## Linux Today
        =26=       "http://linuxtoday.com/backend/my-netscape.rdf" => "linuxtoday.rdf",
        =27=       ## Linux Planet
        =28=       "http://www.linuxplanet.com/rss" => "linuxplanet.rdf",
        =29=       ## Mac Central
        =30=       "http://www.maccentral.com/mnn.cgi" => "maccentral.rdf",
        =31=       ## Macweek.com
        =32=       "http://macweek.zdnet.com/macweek.xml" => "macweek.rdf",
        =33=       ## Moreover Computer Security
        =34=       "http://www.moreover.com/cgi-local/page?index_computersecurity+rss"
        =35=       => "moreover-computer-security.rdf",
        =36=      );
        =37=
        =38=    ## END config
        =39=
        =40=    chdir $DIR or die "Cannot chdir $DIR: $!";
        =41=    my @output;
        =42=
        =43=    while (@NEWS >= 2) {
        =44=      my ($url, $localname) = splice @NEWS, 0, 2;
        =45=      dbmopen my %SAW, $localname, 0644 or warn "Cannot open %SAW for $localname: $!";
        =46=
        =47=      next unless is_success(mirror($url, $localname));
        =48=
        =49=      my $rss = XML::RSS->new or die "can't create XML::RSS?";
        =50=      eval {$rss->parsefile($localname)} and not $@
        =51=        or (warn "cannot parse $localname: $@"), next;
        =52=
        =53=      my %seen;
        =54=      my @item_output;
        =55=      for my $item (@{$rss->{items}}) {
        =56=        my ($title, $link, $description) = @$item{qw(title link description)};
        =57=        $description = "" unless defined $description;
        =58=        my $tag = "$title\0$link";
        =59=        $seen{$tag} = time;
        =60=        next if $SAW{$tag};
        =61=        push @item_output, "$tag\0$description";
        =62=      }
        =63=      %SAW = %seen;
        =64=      if (@item_output) {
        =65=        push @output, "== ".($rss->channel("title"))." ==\n", map {
        =66=          my ($title, $link, $description) = split /\0/;
        =67=          "$title\n",
        =68=          " <URL:$link>\n",
        =69=          (length $description ? " $description\n" : ());
        =70=        } @item_output;
        =71=      }
        =72=    }
        =73=
        =74=    if (@output) {
        =75=      require Net::SMTP;
        =76=      my $m = Net::SMTP->new($MAILHOST)
        =77=        or die "Cannot connect to mail $MAILHOST: $!";
        =78=      $m->mail($MAIL_FROM)
        =79=        or die "Cannot set mail from: $!";
        =80=      $m->to($MAIL_TO)
        =81=        or die "Cannot set mail to: $!";
        =82=      $m->data("From: $MAIL_FROM\n",
        =83=               "To: $MAIL_TO\n",
        =84=               "Subject: $MAIL_SUBJECT\n",
        =85=               "\n",
        =86=               @output,
        =87=              )
        =88=        or die "Cannot send contents: $!";
        =89=      $m->quit
        =90=        or die "Cannot close mail: $!";
        =91=    }