Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Web Techniques Column 42 (Oct 1999)

[suggested title: Have you ever Meta-Index like this?]

HTML permits the inclusion of ``meta'' data in the header of the document. This metadata is not meant for direct human consumption, but is instead meant for programs that want additional data about the web page for automated indexing or other collation. Because the metadata categories are not defined by the HTML specification, their meanings exist merely by convention. In particular, though, two kinds of metadata are understood by most of the search engines: keywords and description.

Most of the spidering search engines will note a metadata description entry like the following, and use it when displaying a hit on your page:

        <meta name=description
          content="The Home Page of Randal L. Schwartz">

And most of the spidering search engines will also note a metadata entry listing one or more keywords separated by commas, as follows:

        <meta name=keywords
          content="Perl, Perl training, JAPH, Unix, Karaoke">

The intent of these keywords is to inform the search engine spider that these are the most important topics discussed on this page, even if the exact words are not present on the page. Abuse of both of these metadata types has led to interesting levels of gamesmanship between the spider-spammers and the people running the spiders, so don't count on being listed solely on the basis of your keyword list.

Other metadata is also in use; I found a nice chart at http://vancouver-webpages.com/META/ that appears to be fairly comprehensive. If the page moves, try searching for meta tags in any convenient search engine.

But suppose you've gone to the trouble to make good descriptions and keyword lists for all your major entry points. Why wait for a spider to direct people there? Can you use the information yourself, and build an index table? Sure! Of course, you could do it all by hand, but that might get out of date. So, let's write a program that spiders our own site, and generates a nice index of all the keywords. Such a program is presented in [listing one, below].

Lines 1 through 3 start nearly every program I write, turning on warnings, enabling compiler restrictions, and unbuffering standard output.

Lines 5 through 26 define the boundaries of the configurable parts of this program. While I really don't promote my column programs as ``ready to run'', I still like to isolate the parts people will likely want to tweak in a small section at the top. This program is meant to be an idea (as in ``web technique'', cute eh?), not the end product. That's for you to create.

Line 7 gives a list of all the URLs to start spidering at. For a well-connected site (where every page is linked from some other page somehow), there's generally only one URL in this list. Here, I've mangled my top-level URL slightly so that silly people won't go spidering my site when they run downloaded programs without even reading the docs (as has happened frequently in the past with my sample programs).

Lines 9 through 24 define a subroutine called OK_TO_FOLLOW. This subroutine will be passed a URI object, known to be some http link from an existing web page that was already scanned. The subroutine must return 1 if the link should also be scanned, or 0 if not. I've configured this particular subroutine for my site, knowing exactly where meta-data might be. As they say, your mileage may vary, and most certainly will, in this case.

Lines 11 to 13 keep my spider in only my site. Lines 14 through 16 disallow any CGI-query URLs to be used, because I don't have anything interesting that's reachable only from a query-string URL. Lines 17 to 22 skip over URLs that point at useless things, like CGI scripts, column text, pictures, and non-HTML files. We really want to keep out as many things as we can here to avoid searching needlessly, but we don't want to miss any useful pages either.

Note that for lines 11 to 13, I'm not really doing a loop, but rather aliasing $_ to a computed value temporarily. This is a nice weird-but-idiomatic use of foreach, spelled f-o-r here.
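
If you haven't run across that idiom before, here's a tiny standalone sketch (not part of the listing, and with a made-up URL) showing the aliasing in action:

        use strict;
        use URI;
        my $uri = URI->new("http://www.example.com/stuff/index.html");
        ## foreach over a one-element list aliases $_ to that value
        ## for the duration of the block:
        for ($uri->host) {
          print "on the example site\n" if /\.example\.com$/i;
        }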

Line 23 confirms the ``ok to follow'' if it made it through all the little hoops earlier.

Now for the good stuff. Lines 28 through 31 bring in the modules I need. CGI::Pretty is in the standard distribution, and from the CPAN we'll need the LWP bundle (for LWP::UserAgent and HTML::Entities) and WWW::Robot.

Lines 33 to 35 declare my global variables, used to communicate between the spider and the table-generator. %description maps canonical URLs to text descriptions (from the description metadata). %keywords maps each keyword (from the keywords metadata), always lowercased, to the URLs that contain it. And %keyword_caps records the original capitalization (or at least one instance of it) for each keyword.
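
To make that a little more concrete, here's roughly what those hashes might hold after a small spider run (the URL and keyword are made up purely for illustration):

        ## hypothetical contents, for illustration only
        my %description = (
          "http://www.example.com/" => "The Example Home Page",
        );
        my %keywords = (
          "perl" => { "http://www.example.com/" => 1 },
        );
        my %keyword_caps = (
          "perl" => "Perl",
        );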

Lines 37 to 45 set up the spider. The docs for WWW::Robot go into this a great deal better than I have room for here. I'm setting up a spider that identifies itself as MetaBot for a user-agent string, version 0.15, and an email address of something like my email address. I'll get to USERAGENT in a moment. I'm also turning off the checking of MIME types, which I found did some unnecessary probes on my site, and if you want to see what's happening, uncomment the VERBOSE line.

Now, about that USERAGENT. The default USERAGENT is an LWP::RobotUA, which I found during experimentation to be buggy in its fetching and parsing of robots.txt, at least in version 1.15 of LWP::RobotUA. I'll be reporting the bug to Gisle Aas, but in the meantime, I don't really care: since I'm spidering my own site, I don't need it to respect robots.txt. Of course, if you're spidering someone else's site, you should be a good net neighbor and wait for LWP::RobotUA to get fixed.

Line 47 enables the scanning of the proxy environment variables if needed. Oddly enough, I don't need to do this, and you shouldn't need to either if you are spidering your own site. Not sure why I have this there, then. Hmm.

Lines 49 to 54 define one of the two ``callbacks'' from the spider. As each URL is found (either from the initial list, or from the links on a page), the follow-url-test callback is invoked. We'll need to return a true value from this hook for every URL of interest, and false otherwise. Line 53 invokes the subroutine defined earlier to do the bulk of this test.

Lines 55 to 76 do the tough job of extracting the useful information on every web page. The invoke-on-contents callback gets invoked on each HTML page. Fortunately, we have access to the HTTP::Response object as the fourth argument, which gives us the meta-data because of the nice parsing already done for the response.

Lines 58 to 61 extract any meta-data of interest if present. Note that LWP puts the description metadata into a header-pseudo-field called X-Meta-Description. Line 62 returns quickly if there's no metadata of interest. Note that you could just as easily add site-specific meta-data fields here if you didn't want to preempt the spider-significant meta-data fields, leaving them for their original purpose but giving you precise control for your index.
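
If you want to see those pseudo-headers for yourself, a quick standalone test (separate from the listing, and with a placeholder URL) might look something like this:

        use strict;
        use LWP::UserAgent;
        use HTTP::Request;

        my $ua = LWP::UserAgent->new;
        my $response = $ua->request(HTTP::Request->new(GET => "http://www.example.com/"));
        ## LWP parses the <head> and exposes each named <meta> tag as an X-Meta-* header:
        print "description: ", $response->header("X-Meta-Description") || "(none)", "\n";
        print "keywords:    ", $response->header("X-Meta-Keywords") || "(none)", "\n";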

Lines 63 through 67 clean up the description, and store it into the %description hash. I found a number of sites have newlines and other junk inside their returned descriptions, so we squeeze all spaces, tabs, and newlines into single spaces.

Lines 68 to 75 similarly grab the keywords. I see the spiders want comma-separated keyword lists, but a number of sites I found used space-separated keywords. So, as always, you'll probably want to adjust this portion to whatever suits your site, or whatever floats your fleet, or whatever. The result here comes out in %keywords and %keyword_caps, defined earlier.
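
For example, if your pages use space-separated (or mixed) keyword lists, a drop-in variation of the loop in lines 69 through 74 might look like this (untested against your pages, naturally):

        for (split /[,\s]+/, $meta{Keywords}) {
          next unless length;
          $keywords{lc $_}{$url}++;
          $keyword_caps{lc $_} = $_;
        }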

Line 77 is where all the spidering gets done. This call doesn't return until all the pages, and all the pages they point at (recursively), get processed. So, we could be here for a while for a large site.

After the spider has traversed the portion of interest of our web, we can dump the data. To keep things simple for me, I elected to use the CGI.pm HTML shortcuts, because they stack nicely as Perl code even though they generate tons of HTML.

Line 79 defines an empty hash %seen_letter, used to generate the index-to-the-index links at the top of the index.

From line 81 to the end of the program is one giant print operation, printing an HTML table from the table shortcut that begins in line 82 and ends in line 114. I set some of the table's visual parameters in the anonymous hash on line 82, tweaking these until the table looked pretty.

The first row of the table comes from the Tr shortcut in lines 87 through 89. I want to generate an index for this index that looks like:

        Jump to: A B K L P R S W Z

with each of the letters being a link to an anchor on the generated page. Of course, I want to use only the letters for which I have keywords, so I track that in the hash defined in line 84. And if that hash is empty in line 86, I won't even generate an index row. Cool. I'll let you work through the HTML shortcuts for this; see the CGI.pm docs for more details.

The guts of the table come from lines 92 through 113. For each of the keywords, we'll generate one or more rows of the table. The left column (of three columns total) is the keyword column. The middle column is the link, and the right column is the description (if any). If there are multiple links for a given keyword, we'll span the keyword down the right number of rows.

The keyword ends up in $key in line 93. The corresponding value (the URL link collection) gets set up in lines 94 to 103. For each link, we extract a description (or create a default), and then provide a two-element arrayref with the second- and third-column data in it (not yet wrapped into a td shortcut, because that comes later).

So, for a given keyword, @value (from line 94) has one or more arrayrefs, each being a separate table row for that keyword. Lines 104 to 107 figure out if this needs to be a ``come here'' target from the letter index at the top of the table, and if so, wrap the target in an A tag.

Lines 109 to 112 dump out @value and the $key_text in an appropriate way. The $_ value is a small integer for the index of each element in @value. For the first element only, we'll dump out the keyword and the right ROWSPAN attribute so that it goes across all the rows. For later elements, we skip that. And then the element itself is dumped inside a td shortcut, which makes two table cells from the arrayref. (HTML shortcuts automatically distribute themselves onto arrayrefs -- another nice feature.)
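
If that distribution feature is new to you, here's a tiny demonstration, separate from the listing:

        use CGI::Pretty qw(-no_debug :html);
        ## a shortcut handed an array reference produces one element per entry,
        ## yielding (roughly): <td>alpha</td> <td>beta</td>
        print td(['alpha', 'beta']);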

And that's it. Hairy code at the end, but I just had to keep thinking of ``what is inside what'' to come up with the nesting. map and do-blocks are definitely useful in constructing HTML shortcuts, as is judicious use of temporary variables to hold and name the parts of the calculations.

To use the program, first add appropriate descriptions and keywords to all your key pages, then run the script, putting the output into a file like /your/index/file/path. Create an HTML index page that wraps that up into an HTML index, like this:

        <html><head><title>Site index</title></head><body>
        <!--#include file="/your/index/file/path" -->
        </body></html>

Or you can make this script generate the surrounding HTML itself and write its output directly to the final HTML file. You can even run the script from a cron job to keep the index up to date automatically.
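
A crontab entry for that might look like the following (the paths here are hypothetical; adjust them to match your installation):

        ## rebuild the site index every night at 4:15 am
        15 4 * * * /usr/bin/perl /home/you/bin/meta-index > /your/index/file/path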

So, never be without a site index again. Let Perl build it for you. By the way, according to my notes, this is my 42nd column for WebTechniques, and I hope that answers all of your questions about Life, the Universe, and Everything for now. Until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     ## config
        =6=     
        =7=     my @URL = qw(http://www.stonehenge.Xcom/);
        =8=     
        =9=     sub OK_TO_FOLLOW {
        =10=      my $uri = shift;              # URI object, known to be http only
        =11=      for ($uri->host) {
        =12=        return 0 unless /\.stonehenge\.Xcom$/i;
        =13=      }
        =14=      for ($uri->query) {
        =15=        return 0 if defined $_ and length;
        =16=      }
        =17=      for ($uri->path) {
        =18=        return 0 if /^\/(cgi|fors|-)/;
        =19=        return 0 if /col\d\d|index/;
        =20=        return 0 if /Pictures/;
        =21=        return 0 unless /(\.html?|\/)$/;
        =22=      }
        =23=      return 1;
        =24=    }
        =25=    
        =26=    ## end config
        =27=    
        =28=    use WWW::Robot;
        =29=    use LWP::UserAgent;
        =30=    use CGI::Pretty qw(-no_debug :html);
        =31=    use HTML::Entities;
        =32=    
        =33=    my %description;
        =34=    my %keywords;
        =35=    my %keyword_caps;
        =36=    
        =37=    my $robot = WWW::Robot->new
        =38=      (
        =39=       NAME => 'MetaBot',
        =40=       VERSION => '0.15',
        =41=       EMAIL => 'merlyn@stonehenge.Xcom',
        =42=       USERAGENT => LWP::UserAgent->new,
        =43=       CHECK_MIME_TYPES => 0,
        =44=       ## VERBOSE => 1,
        =45=       );
        =46=    
        =47=    $robot->env_proxy;
        =48=    
        =49=    $robot->addHook
        =50=      ("follow-url-test" => sub {
        =51=         my ($robot, $hook, $url) = @_;
        =52=         return 0 unless $url->scheme eq 'http';
        =53=         OK_TO_FOLLOW($url);
        =54=       });
        =55=    $robot->addHook
        =56=      ("invoke-on-contents" => sub {
        =57=         my ($robot, $hook, $url, $response, $structure) = @_;
        =58=         my %meta = map {
        =59=           my $header = $response->header("X-Meta-$_");
        =60=           defined $header ? ($_, $header) : ();
        =61=         } qw(Description Keywords);
        =62=         return unless %meta;
        =63=         if (exists $meta{Description}) {
        =64=           $_ = $meta{Description};
        =65=           tr/ \t\n/ /s;
        =66=           $description{$url} = $_;
        =67=         }
        =68=         if (exists $meta{Keywords}) {
        =69=           for (split /,/, $meta{Keywords}) {
        =70=             s/^\s+//;
        =71=             s/\s+$//;
        =72=             $keywords{lc $_}{$url}++;
        =73=             $keyword_caps{lc $_} = $_;
        =74=           }
        =75=         }
        =76=       });
        =77=    $robot->run(@URL);
        =78=    
        =79=    my %seen_letter;
        =80=    
        =81=    print
        =82=      table({ Cellspacing => 0, Cellpadding => 10, Border => 2 },
        =83=            do {
        =84=              my %letters;
        =85=              @letters{map /^([a-z])/, keys %keywords} = ();
        =86=              %letters ? 
        =87=                Tr(td({Colspan => 3},
        =88=                      p("Jump to:",
        =89=                        map a({Href => "#index_$_"}, uc $_), sort keys %letters)))
        =90=                  : ();
        =91=            },
        =92=            map {
        =93=              my $key = $_;
        =94=              my @value =
        =95=                map {
        =96=                  my $url = $_;
        =97=                  my $text = exists $description{$url} ?
        =98=                    $description{$url} : "(no description provided)";
        =99=    
        =100=                 [a({Href => encode_entities($url)}, encode_entities($url)),
        =101=                  encode_entities($text),
        =102=                 ];
        =103=               } sort keys %{$keywords{$key}};
        =104=             my $key_text = $keyword_caps{$key};
        =105=             if ($key =~ /^([a-z])/ and not $seen_letter{$1}++ ) {
        =106=               $key_text = a({ Name => "index_$1" }, $key_text);
        =107=             }
        =108=   
        =109=             map {
        =110=               Tr(($_ > 0 ? () : td({Rowspan => scalar @value}, $key_text)),
        =111=                  td($value[$_]));
        =112=               } 0..$#value;
        =113=           } sort keys %keywords
        =114=          );

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.