Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 47 (Apr 2003)

Even though the web is getting to be roughly a decade old, Perl is still regarded by many as ``the darling language of web programming''. Perl's text-wrangling abilities still exceed those of any other popular open-source language, and the wealth of core and CPAN modules for dealing with web protocols makes applications a snap to construct and maintain.

One category of frequent tasks is ``web scraping'': getting data from browser-facing websites. While the ``web services'' camp is slowly gaining a foothold, the ``web scraping'' tools will always be necessary to get access to information that isn't yet or never will be offered through some SOAP interface.

One emerging tool in the ``web scraping'' camp is WWW::Mechanize by Andy Lester, which builds on earlier work by Kirrily 'Skud' Robert called WWW::Automate. With this module, you get a ``virtual browser'' that can load pages, fill out form elements by name, ``click'' on submit buttons or image maps, follow links by name or position, and even press ``back'' when needed. Although Andy is primarily developing the module as a means of automating web-site testing, those actions are precisely what we need to scrape the output of existing web sites.
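
To give a feel for the interface before diving into the real program, here's a minimal sketch of those moves. The URL and the form field name q are made up for illustration; the method calls are the same ones the program below relies on:

  use WWW::Mechanize;
  my $browser = WWW::Mechanize->new;
  $browser->get("http://www.example.com/search"); # fetch a page, as if typing the URL
  $browser->field("q", "llamas");                 # fill in the form field named "q" (hypothetical)
  $browser->click();                              # press the first submit button
  $browser->follow(2);                            # follow the third link (counting from zero)
  print $browser->uri, "\n";                      # where did we end up?
  $browser->back();                               # and back up, just like a real browser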

As an example of this interesting tool, I picked a problem that I found myself facing the other day. I frequently pop over to the Yahoo! news pages to perform searches on the news photos, looking for photos with particular keywords. As the search progresses, I pick out the pictures in which I'm interested, and then go through a series of more-or-less routine keystrokes and clicks to save those to my hard drive for later access. (Of course, I respect the copyright of these images.) For example, I'll search on the keyword ``Oregon'' to find all the images related to my home state (usually pictures of our sports teams).

So, having been introduced to WWW::Mechanize, I thought that this would be a perfect opportunity to reduce the amount of time I spend in a day doing repetitive tasks. And that's what programming is really all about: making sure that we are presented with an overwhelming array of non-repetitive tasks, leaving the overwhelming array of repetitive tasks for the overburdened CPU.

The strategy for the automation is straightforward. I let the ``virtual browser'' visit the ``advanced search'' page, entering the keywords, selecting photos rather than news stories, and asking for 100 images per response. (Leaving that at the default 20 will still work, but we end up making more round-trips to the web site.) Then, the virtual browser clicks the submit button, and we're off.

For each response page, we look for links whose visible content is just an image and whose URL points to a news detail page. For each such link, we follow it, then locate the full-sized image URL in the response content.

Two optimizations are performed to make this easy on both our bandwidth and the Yahoo! news engine. First, every news link is noted in a DBM cache, and not followed again for another 30 days. Second, the image is downloaded using ``mirror'' logic, which means that if the image already exists and is current enough, no data is actually transferred. Using these optimizations, a ``no new images'' run will come back in a second or two: fast enough to run hourly from a cron job. An additional benefit is that the images have their timestamp altered to match the source data, so we can quickly see what images have been added recently according to the source, regardless of when we actually downloaded the data.
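
That ``mirror'' logic isn't anything I had to write: it's the mirror method inherited from LWP::UserAgent, which performs a conditional GET keyed off the local file's timestamp. Roughly, the behavior we're relying on looks like this (the URL and filename here are made up):

  # mirror() sends If-Modified-Since based on the local file's timestamp, and on a
  # successful download sets the file's mtime from the server's Last-Modified header
  my $response = $m->mirror("http://images.example.com/full/photo.jpg", "/tmp/photo.jpg");
  if ($response->code == 304) {          # "Not Modified": nothing was transferred
    print "already up to date\n";
  } elsif ($response->is_success) {
    print "fetched a fresh copy\n";
  } else {
    warn "mirror failed: ", $response->status_line;
  }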

So, let's take a look at the program, presented in [listing one, below]. Lines 1 through 3 start nearly every program I write, enabling warnings, turning on compiler restrictions regarding variable declaration, barewords, and references, and also disabling the buffering of STDOUT.

Lines 7 through 13 hold the configuration parameters for this program. As we are typically being invoked from a cron job, command-line parameters just won't do. $BASEDIR gives the top directory in which all the images will be saved. $SEARCHES defines the various searches. Each line consists of a directory name, and then one or more keywords. For example, the line beginning with shania defines a subdirectory called shania, and then selects the keywords shania and twain for the search.
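
For example, if I wanted to park the camel search for a while without losing the line, and also track my local university's team under a directory name that isn't one of the keywords, a hypothetical configuration (not the one in the listing) might look like this; lines beginning with a hash are skipped, as we'll see at line 27:

  my $SEARCHES = <<'END';
  oregon oregon
  #camel camel
  beavers oregon state university
  END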

Line 17 defines a constant that's less likely to need changing: the name of the DBM file, within each image directory, that records which subpages we've already drilled into. Note that it begins with a dot, so it won't show up in a normal ls command.

Lines 21 and 22 pull in the WWW::Mechanize and File::Basename modules, the latter of which is a core module. You can install WWW::Mechanize from the CPAN, if you don't already have it.

Line 24 creates the WWW::Mechanize object: our ``virtual browser''. This object class inherits from LWP::UserAgent, so we have full control over such things as proxies, user-agent names, and cookies.
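
None of that control is needed for this program, but if I were behind a proxy or wanted to masquerade as a particular browser, the inherited LWP::UserAgent methods would work directly on $m. A sketch, with made-up values:

  use HTTP::Cookies;
  $m->agent("Mozilla/4.76 [en] (X11; U; Linux 2.4.2 i686)");  # report a different user-agent
  $m->env_proxy;                                              # honor http_proxy and friends
  $m->cookie_jar(HTTP::Cookies->new(file => "/tmp/lwp_cookies.dat", autosave => 1));
  $m->timeout(30);                                            # give up on slow servers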

Line 27 begins the outer loop. For each line of the $SEARCHES configuration string that doesn't begin with a hash mark (that is, isn't commented out), we extract the subdirectory name and the keywords in line 28. Line 30 traces our progress for the impatient invoker.

Lines 32 and 33 establish the directory to receive the images, creating it if necessary. Line 35 opens up our DBM database within this directory, as the tied hash %seen.

Lines 38 to 43 remove any stale entries in the %seen hash. While this doesn't affect the outcome of the algorithm, letting the stale entries accumulate would cause unbounded growth of the DBM files. Each key is a subpage URL we obtained from a search; each value is an expiration time expressed as Unix internal time (seconds since the epoch). If the value is older than the current time, we nuke the entry.

Line 45 instructs our ``virtual browser'' to go fetch the given URL. I got this URL from pressing ``advanced search'' on the opening page. If I wanted to be clever, I could have simply gone to ``news.yahoo.com'' and followed the ``advanced search'' link, which might have been safer in the long run, in case this particular URL ever changes.

Lines 47 to 49 ``fill in'' various parts of the first form present on the page. The first form is the form we want, even though there's another form later on the page. Had we wanted something other than the first form, we could have selected it before starting to fill in the fields.

The names c and p and n came from staring at a ``view source'' of the page in question. This is where screen scraping takes a bit of talent: we need to figure out exactly what gets set when a user fills in the various form elements, including the names as given in the form description, not necessarily as presented to the user. Field c is ``what kind of search'', and p is the keyword blank. Field n is ``how many responses per page''.
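
One way to cut down on the staring: the HTML::Form module (part of the LWP suite that WWW::Mechanize is built on) can enumerate a page's fields for us. This isn't part of the listing, but a quick one-off sketch like the following would reveal the c, p, and n names:

  use HTML::Form;
  for my $form (HTML::Form->parse($m->res->content, $m->res->base)) {
    for my $input ($form->inputs) {
      printf "%s field named %s\n", $input->type, ($input->name || "(unnamed)");
    }
  }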

Once we have the form elements updated to our requested values, we'll ``click'' on the submit button in line 50. This causes the WWW::Mechanize object to encode the form and submit a GET or POST request as needed, noting any response from the web server.

Lines 52 to 82 process a single response page, advancing to the next page and repeating as necessary. A trace message is printed in line 53, again for the impatient observer.

Line 54 extracts the links of the page. The WWW::Mechanize object scans the response automatically for us, looking for A and FRAME elements. We look at this array in lines 56 to 59 to find links to stories: each link is checked to see whether its URL begins with story.news.yahoo.com and whether its text is merely an image placeholder ([IMG]). If so, we save the array indices of those links of interest.
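
In the version of WWW::Mechanize used here, each entry of that array is itself an array reference, with the URL in the first slot and the link text in the second (an image-only link shows up with the text [IMG]). A small sketch, separate from the listing, of walking that structure:

  my @links = @{ $m->extract_links };
  for my $i (0..$#links) {
    my ($url, $text) = @{ $links[$i] }[0, 1];   # [ URL, link text, ... ]
    print "link $i: $text => $url\n";
  }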

The resulting image links are processed in the loop in lines 61 to 79. For each image link (a small integer, indexing into the @links array), we extract the subpage URL to follow in line 62. Lines 63 to 66 skip over the subpage URLs that we've already seen, noting them as such. This is an important step, because if we've already visited the subpage, we've already extracted the image from that page, and thus there's no new information.

Line 68 has our ``virtual browser'' follow the link indicated by the numeric index. Because we've pulled the links from the same place that the virtual browser is looking, we know that the numbers are synchronized. The subpage is then visited and parsed, as reported in line 70.

Lines 71 through 76 look for the image URL, using a convention that I ``reverse engineered'' by staring at the web page HTML (obtained by calling $m->res->content). If an image element appeared as:

  <img src=http://some.place/some/path/imagename.jpg align=middle

then it was definitely the large image for the news story. It's important to distinguish between the image of interest and any other incidental images on the page, since there are almost always other images leading to other stories on the page as well. This will break if Yahoo! changes the page layout, but that's the price of screen scraping.

If we can find an image URL in line 71, we announce it in line 72. Line 73 uses the inherited LWP::UserAgent method to ``mirror'' the URL to a local filename. The local filename is the ``basename'' of the URL, preceded by the directory path. This use of basename was a quick-and-dirty shortcut, valid within the Unix world. A more portable method would have been to create a URI object from the URL and then extract the final path segment, as in:

  my $basename = (URI->new($image_url)->path_segments)[-1];

but this seemed like overkill for me in my Unix-like environment.

The result of the mirror ends up as an HTTP::Response object in $response in line 73. Line 74 shows the response message as part of the tracing output. Line 75 puts a ``do not visit this URL for 30 days'' flag into the DBM database. The 30-day figure comes from knowing that Yahoo! keeps only 30 days of historical stories and images. If that figure ever increases, I'll bump this value up as well.

Line 78 pushes our virtual browser's ``back'' button, taking us back to the previous page. A WWW::Mechanize object remembers all pages from an initial get as a stack, so this takes us back to the query result page.

Line 81 follows any link whose text matches next \d as a regular expression. If there are more pictures, the search result page contains exactly such a link. If such a link is found, the method returns a true value, and we loop back around to line 52. Otherwise, we drop out of the block and move on to the next set of search keywords.

Now that I've set this up and configured it, I can run it from cron once a day (although I probably want to make it a bit less noisy), or from the command line when I know that I've got good Internet connectivity. The only maintenance required will be making up my mind about the keywords to search, or perhaps some slight changes to the regular expressions if Yahoo! changes their page layout. Hope you have fun scraping the web! Until next time, enjoy.

Listings

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     ## user configurable parts
        =6=     
        =7=     my $BASEDIR = "/home/merlyn/Yahoo-news-images";
        =8=     
        =9=     my $SEARCHES = <<'END';
        =10=    oregon oregon
        =11=    camel camel
        =12=    shania shania twain 
        =13=    END
        =14=    
        =15=    ## tinker parts
        =16=    
        =17=    my $INDEX = ".index";
        =18=    
        =19=    ## no serviceable parts below
        =20=    
        =21=    use WWW::Mechanize 0.33;
        =22=    use File::Basename;
        =23=    
        =24=    my $m = WWW::Mechanize->new;
        =25=    $m->quiet(1);                   # I'll handle my own errors, thank you
        =26=    
        =27=    for (grep !/^\#/, split /\n/, $SEARCHES) {
        =28=      my ($subdir, @keywords) = split;
        =29=    
        =30=      print "--- updating $subdir from a search for @keywords ---\n";
        =31=    
        =32=      $subdir = "$BASEDIR/$subdir" unless $subdir =~ m{^/};
        =33=      -d $subdir or mkdir $subdir, 0755 or die "Cannot mkdir $subdir: $!";
        =34=    
        =35=      dbmopen(my %seen, "$subdir/$INDEX", 0644) or die "cannot create index: $!";
        =36=    
        =37=      ## clean any expired %seen tags
        =38=      {
        =39=        my $now = time;
        =40=        for (keys %seen) {
        =41=          delete $seen{$_} if $seen{$_} < $now;
        =42=        }
        =43=      }
        =44=    
        =45=      $m->get("http://search.news.yahoo.com/search/news/options?p=");
        =46=    
        =47=      $m->field("c", "news_photos");
        =48=      $m->field("p", "@keywords");
        =49=      $m->field("n", 100);
        =50=      $m->click();
        =51=    
        =52=      {
        =53=        print "looking at ", $m->uri, "\n";
        =54=        my @links = @{$m->extract_links};
        =55=    
        =56=        my @image_links = grep {
        =57=          $links[$_][0] =~ m{^http://story\.news\.yahoo\.com/} and
        =58=            $links[$_][1] eq "[IMG]";
        =59=        } 0..$#links;
        =60=    
        =61=        for my $image_link (@image_links) {
        =62=          my $seen_key = "$links[$image_link][0]";
        =63=          if ($seen{$seen_key}) {
        =64=            print "  saw $seen_key\n";
        =65=            next;
        =66=          }
        =67=    
        =68=          $m->follow($image_link);
        =69=    
        =70=          print "  looking at ", $m->uri, "\n";
        =71=          if (my ($image_url) = $m->res->content =~ m{<img src=(http:\S+) align=middle}) {
        =72=            print "  mirroring $image_url... ";
        =73=            my $response = $m->mirror($image_url, "$subdir/".basename($image_url));
        =74=            print $response->message, "\n";
        =75=            $seen{"$seen_key"} = time + 30 * 86400; # ignore for 30 days
        =76=          }
        =77=    
        =78=          $m->back;
        =79=        }
        =80=    
        =81=        redo if $m->follow(qr{next \d});
        =82=      }
        =83=    
        =84=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc. of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.