Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Linux Magazine Column 47 (Apr 2003)
Even though the web is getting to be roughly a decade old, Perl is still regarded by many as ``the darling language of web programming''. Perl's text-wrangling abilities still exceed those of any other popular open-source language, and the wealth of core and CPAN modules to deal with web protocols makes applications a snap to construct and maintain.
One category of frequent tasks is ``web scraping'': getting data from browser-facing websites. While the ``web services'' camp is slowly gaining a foothold, the ``web scraping'' tools will always be necessary to get access to information that isn't yet or never will be offered through some SOAP interface.
One emerging tool in the ``web scraping'' camp is WWW::Mechanize by Andy Lester, building on earlier work by Kirrily 'Skud' Robert called WWW::Automate. With this module, you get a ``virtual browser'' that can load pages, fill out form elements by name, ``click'' on submit buttons or image maps, follow links by name or position, and even press ``back'' when needed. Although Andy is primarily developing the module as a means of automating web-site testing, the actions are precisely what we need to scrape the output of existing web sites.
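To give a feel for the interface before diving in, here's a minimal sketch (separate from the listing below) that drives a hypothetical login page; the host name and field names are invented, but the method calls are the same ones used later in this column:

use WWW::Mechanize;
my $browser = WWW::Mechanize->new;
$browser->get("http://www.example.com/login");   # load a page (hypothetical URL)
$browser->field("username", "merlyn");           # fill form fields by name (invented names)
$browser->field("password", "secret");
$browser->click();                               # press the submit button
$browser->follow(qr{my account}i);               # follow a link whose text matches a regex
print $browser->uri, "\n";                       # where did we end up?
$browser->back;                                  # and press the "back" button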
As an example of this interesting tool, I picked a problem that I found myself facing the other day. I frequently pop over to the Yahoo! news pages to perform searches on the news photos, looking for photos with particular keywords. As the search progresses, I pick out the pictures in which I'm interested, and then go through a series of more-or-less routine keystrokes and clicks to save those to my hard drive for later access. (Of course, I respect the copyright of these images.) For example, I'll search on the keyword ``Oregon'' to find all the images related to my home state (usually pictures of our sports teams).
So, having been introduced to WWW::Mechanize, I thought that this would be a perfect opportunity to reduce the amount of time I spend in a day doing repetitive tasks. And that's what programming is really all about: making sure that we are presented with an overwhelming array of non-repetitive tasks, leaving the overwhelming array of repetitive tasks for the overburdened CPU.
The strategy for the automation is straightforward. I let the ``virtual browser'' visit the ``advanced search'' page, entering the keywords, selecting photos rather than news stories, and asking for 100 images per response. (Leaving that at the default 20 will still work, but we end up making more round-trips to the web site.) Then, the virtual browser clicks the submit button, and we're off.
For each response page, we'll look for any links that are primarily an image and link to the news detail page. For each of the URLs for those links, we follow the link, then locate the full-sized image URL in the response content.
Two optimizations are performed to make this easy on both our bandwidth and the Yahoo! news engine. First, every news link is noted in a DBM cache, and not followed again for another 30 days. Second, the image is downloaded using ``mirror'' logic, which means that if the image already exists and is current enough, no data is actually transferred. Using these optimizations, a ``no new images'' run will come back in a second or two: fast enough to run hourly from a cron job. An additional benefit is that the images have their timestamp altered to match the source data, so we can quickly see what images have been added recently according to the source, regardless of when we actually downloaded the data.
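The ``mirror'' logic comes straight from LWP: the request carries an If-Modified-Since header, the file is written only when the server reports newer content, and the local timestamp is then set from the Last-Modified header. A standalone sketch, with a made-up URL and local path:

use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
# fetch only if the remote copy is newer than the local file
my $response = $ua->mirror("http://www.example.com/pic.jpg", "/tmp/pic.jpg");
print $response->message, "\n";   # "OK" on a fresh download, "Not Modified" otherwise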
So, let's take a look at the program, presented in [listing one, below]. Lines 1 through 3 start nearly every program I write, enabling warnings, turning on compiler restrictions regarding variable declaration, barewords, and references, and also disabling the buffering of STDOUT.
Lines 7 through 13 hold the configuration parameters for this program. As we are typically being invoked from a cron job, command-line parameters just won't do. $BASEDIR gives the top directory in which all the images will be saved. $SEARCHES defines the various searches. Each line consists of a directory name, and then one or more keywords. For example, the line beginning with shania defines a subdirectory called shania, and then selects the keywords shania and twain for the search.
Line 17 defines a less-likely-to-change constant: the name of the DBM file within the given directory that holds the database letting us know we've already drilled down into a specific subpage. Note that it begins with a dot, so that we won't see it in a normal ls command.
Lines 21 and 22 pull in the WWW::Mechanize and File::Basename modules, the latter of which is a core module. You can install WWW::Mechanize from the CPAN, if you don't already have it.
Line 24 creates the WWW::Mechanize object: our ``virtual browser''. This object class inherits from LWP::UserAgent, so we have full control over such things as proxies, user-agent names, and cookies.
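Because of that inheritance, any LWP::UserAgent method can be called on our object. For example, if I needed to go through a web proxy, present a different user-agent string, or keep cookies between runs, something like the following would do it (the proxy host and cookie file are hypothetical):

use WWW::Mechanize;
use HTTP::Cookies;
my $m = WWW::Mechanize->new;
$m->agent("Mozilla/5.0 (compatible; merlyn-scraper)");   # change the User-Agent header
$m->proxy(['http'], 'http://proxy.example.com:3128/');   # hypothetical proxy
$m->cookie_jar(HTTP::Cookies->new(file => "$ENV{HOME}/.scraper-cookies", autosave => 1));   # persistent cookies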
Line 27 begins the outer loop. For each line in the $SEARCHES configuration string that doesn't begin with a hash (commented out), we extract the subdirectory name and the keywords in line 28. Line 30 traces our progress for the impatient invoker.
Lines 32 and 33 establish the directory to receive the images, creating it if necessary. Line 35 opens up our DBM database within this directory, as the tied hash %seen.
Lines 38 to 43 remove any stale entries in the %seen hash. While this doesn't affect the outcome of the algorithm, letting the stale entries accumulate will cause unbounded growth of the DBM files. Each key is a subpage URL we obtained from a search; each value is the expiration time, expressed as Unix internal time (seconds since the epoch). If the value is older than the current time, we nuke the entry.
Line 45 instructs our ``virtual browser'' to go fetch the given URL. I got this URL from pressing ``advanced search'' on the opening page. If I wanted to be clever, I could have simply gone to ``news.yahoo.com'' and followed the ``advanced search'' link, which might have been safer in the long run, in case this particular URL ever changes.
Lines 47 to 49 ``fill in'' various parts of the first form present on the page. The first form is the one we want, even though there's another form later on the page. Had we wanted something other than the first form, we could have requested that before starting to fill in the fields.
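For the record, selecting some other form looks roughly like this; the form name is hypothetical, and the exact method name depends on your WWW::Mechanize version (newer releases spell it form_number() and form_name(), older ones just form()):

$m->form_number(2);            # act on the second form instead of the first
$m->form_name("newssearch");   # or select it by its NAME attribute (hypothetical name)
$m->field("p", "oregon");      # field() then operates on the selected form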
The names c and p and n came from staring at a ``view source'' of the page in question. This is where screen scraping takes a bit of talent: we need to figure out exactly what gets set when a user fills in the various form elements, including the names as given in the form description, not necessarily as presented to the user. Field c is ``what kind of search'', and p is the keyword blank. Field n is ``how many responses per page''.
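One trick that saves some of that staring: WWW::Mechanize hands back HTML::Form objects, so you can ask the form itself what fields it carries. Roughly:

$m->get("http://search.news.yahoo.com/search/news/options?p=");
my $form = $m->current_form;   # the HTML::Form object for the form being filled in
print $form->dump;             # one line per input: name, type, and current value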
Once we have the form elements updated to our requested values, we'll ``click'' on the submit button in line 50. This causes the WWW::Mechanize object to encode the form and submit a GET or POST request as needed, noting any response from the web server.
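Had the form carried more than one submit button, click() could name the button to press (and even supply x/y coordinates for an image button); with no arguments, as in line 50, it simply presses the form's submit button. The button names here are hypothetical:

$m->click("search");        # press the submit button named "search"
$m->click("map", 10, 20);   # or an image button, with click coordinates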
Lines 52 to 82 process a single response page, advancing to the next page and repeating as necessary. A trace message is printed in line 53, again for the impatient observer.
Line 54 extracts the links of the page. The WWW::Mechanize object scans the response automatically for us, looking for A and FRAME elements. We'll look at this array to see if there are any links to stories in lines 56 to 59. Each link is checked to see if the URL begins with story.news.yahoo.com and is merely an image. If so, we save the array indices for those links of interest.
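In the WWW::Mechanize release used here, extract_links() returns a reference to an array of [URL, link text] pairs, which is why the listing indexes into each element with [0] and [1]. A tiny sketch of the same filtering idea on its own:

my @links = @{ $m->extract_links };   # each element: [ $url, $text ]
for my $i (0 .. $#links) {
  my ($url, $text) = @{ $links[$i] };
  print "$i: $url\n" if $text eq "[IMG]";   # image-only links, as in listing line 58
}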
The resulting image links are processed in the loop in lines 61 to 79. For each image link (a small integer, indexing into the @links array), we extract the subpage URL to follow in line 62. Lines 63 to 66 skip over the subpage URLs that we've already seen, noting them as such. This is an important step, because if we've already visited the subpage, we've already extracted the image from that page, and thus there's no new information.
Line 68 has our ``virtual browser'' follow the link indicated by the numeric index. Because we've pulled the links from the same place that the virtual browser is looking, we know that the numbers are synchronized. The subpage is then visited and parsed, as reported in line 70.
Lines 71 through 76 look for the image URL, using a convention that I ``reverse engineered'' by staring at the web page HTML (obtained by calling $m->res->content). If an image element appeared as:
<img src=http://some.place/some/path/imagename.jpg align=middle
then it was definitely the large image for the news story. It's important to distinguish between the image of interest and any other incidental images on the page, since there are almost always other images leading to other stories on the page as well. This will break if Yahoo! changes the page layout, but that's the price of screen scraping.
If we can find an image URL in line 71, we announce it in line 72. Line 73 uses the LWP::UserAgent method to ``mirror'' the URL to a local filename. The local filename is the ``basename'' of the URL preceded by the directory path. This use of basename was a quick-and-dirty shortcut, valid within the Unix world. A more portable method would have been to create a URI object from the path, and then extract the final path step, as in:
my $basename = (URI->new($image_url)->path_segments)[-1];
but this seemed like overkill for me in my Unix-like environment.
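If you do want the portable version, it needs the URI module (installed along with LWP), and plugging it into listing line 73 would look something like this:

use URI;
# portable replacement for basename(): the last path segment of the URL
my $basename = (URI->new($image_url)->path_segments)[-1];
my $response = $m->mirror($image_url, "$subdir/$basename");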
The result of the mirror ends up as an HTTP::Response object in $response in line 73. Line 74 shows the response text as part of the tracing messages. Line 75 puts a ``do not visit this URL for 30 days'' flag into the DBM database. The 30-day figure comes from knowing that Yahoo! keeps only 30 days of historical stories and images. If that figure ever increases, I'll bump this value up as well.
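If you wanted to act on the result rather than just print it, the HTTP::Response object says what happened: a 304 means the local copy was already current, a 2xx means a new file arrived, and anything else is trouble. A sketch, using the same $image_url and local filename as the listing:

my $response = $m->mirror($image_url, "$subdir/".basename($image_url));
if ($response->code == 304) {            # HTTP "Not Modified"
  print "already current\n";
} elsif ($response->is_success) {
  print "downloaded: ", $response->message, "\n";
} else {
  warn "mirror failed: ", $response->status_line, "\n";
}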
Line 78 pushes our virtual browser's ``back'' button, taking us back to the previous page. A WWW::Mechanize object remembers all pages from an initial get as a stack, so this takes us back to the query result page.
Line 81 follows any link whose text matches next \d as a regular expression. If there are more pictures, the search result page contains exactly such a link. If such a link is found, the method returns a true value, and we loop around again to line 52. Otherwise, we drop out of the block, and loop again to the next search keywords.
Now that I've set this up, and configured it, I can run it from cron once a day (although I probably want to make it a bit less noisy), or from the command line when I know that I've got good Internet connectivity. The only maintenance required will be making up my mind about the keywords to search, or perhaps some slight changes to the regular expressions if Yahoo! changes their page layout. Hope you have fun scraping the web! Until next time, enjoy.
Listings
=1= #!/usr/bin/perl -w
=2= use strict;
=3= $|++;
=4=
=5= ## user configurable parts
=6=
=7= my $BASEDIR = "/home/merlyn/Yahoo-news-images";
=8=
=9= my $SEARCHES = <<'END';
=10= oregon oregon
=11= camel camel
=12= shania shania twain
=13= END
=14=
=15= ## tinker parts
=16=
=17= my $INDEX = ".index";
=18=
=19= ## no servicable parts below
=20=
=21= use WWW::Mechanize 0.33;
=22= use File::Basename;
=23=
=24= my $m = WWW::Mechanize->new;
=25= $m->quiet(1); # I'll handle my own errors, thank you
=26=
=27= for (grep !/^\#/, split /\n/, $SEARCHES) {
=28=   my ($subdir, @keywords) = split;
=29=
=30=   print "--- updating $subdir from a search for @keywords ---\n";
=31=
=32=   $subdir = "$BASEDIR/$subdir" unless $subdir =~ m{^/};
=33=   -d $subdir or mkdir $subdir, 0755 or die "Cannot mkdir $subdir: $!";
=34=
=35=   dbmopen(my %seen, "$subdir/$INDEX", 0644) or die "cannot create index: $!";
=36=
=37=   ## clean any expired %seen tags
=38=   {
=39=     my $now = time;
=40=     for (keys %seen) {
=41=       delete $seen{$_} if $seen{$_} < $now;
=42=     }
=43=   }
=44=
=45=   $m->get("http://search.news.yahoo.com/search/news/options?p=");
=46=
=47=   $m->field("c", "news_photos");
=48=   $m->field("p", "@keywords");
=49=   $m->field("n", 100);
=50=   $m->click();
=51=
=52=   {
=53=     print "looking at ", $m->uri, "\n";
=54=     my @links = @{$m->extract_links};
=55=
=56=     my @image_links = grep {
=57=       $links[$_][0] =~ m{^http://story\.news\.yahoo\.com/} and
=58=       $links[$_][1] eq "[IMG]";
=59=     } 0..$#links;
=60=
=61=     for my $image_link (@image_links) {
=62=       my $seen_key = "$links[$image_link][0]";
=63=       if ($seen{$seen_key}) {
=64=         print " saw $seen_key\n";
=65=         next;
=66=       }
=67=
=68=       $m->follow($image_link);
=69=
=70=       print " looking at ", $m->uri, "\n";
=71=       if (my ($image_url) = $m->res->content =~ m{<img src=(http:\S+) align=middle}) {
=72=         print " mirroring $image_url... ";
=73=         my $response = $m->mirror($image_url, "$subdir/".basename($image_url));
=74=         print $response->message, "\n";
=75=         $seen{"$seen_key"} = time + 30 * 86400; # ignore for 30 days
=76=       }
=77=
=78=       $m->back;
=79=     }
=80=
=81=     redo if $m->follow(qr{next \d});
=82=   }
=83=
=84= }