Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Web Techniques Column 42 (Oct 1999)
[suggested title: Have you ever Meta-Index like this?]
HTML permits the inclusion of ``meta'' data in the header of the document. This metadata is not meant for direct human consumption, but is instead meant for programs to get additional data about the web page for automated indexing or other collation. Because the metadata categories are not defined by the HTML specification, the categories exist merely by convention. But in particular, two kinds of metadata are understood by most of the search engines: keywords and description.
Most of the spidering search engines will note a metadata description entry like the following, and use it when displaying a hit on your page:
<meta name=description content="The Home Page of Randal L. Schwartz">
And most of the spidering search engines will also note a metadata listing one or more keywords separated by commas, as follows:
<meta name=keywords content="Perl, Perl training, JAPH, Unix, Karaoke">
The intent of these keywords is to inform the search engine spider that these are the most important topics discussed on this page, even if the exact words are not present on the page. Abuse of both of these metadata types has led to interesting levels of games between the spider-spammers and the spider searchers, so don't count on being listed solely based on your keyword list.
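If you build your pages with CGI.pm, one convenient way to emit both of these tags is the -meta parameter to start_html. Here's a minimal sketch; the title, description, and keyword strings are just placeholders, not anything from my site:

  use CGI qw(:standard);

  # start_html's -meta parameter takes name/content pairs and emits
  # one <meta> tag for each (these values are placeholders)
  print start_html(
    -title => 'My Home Page',
    -meta  => {
      description => 'The Home Page of Some Perl Hacker',
      keywords    => 'Perl, Perl training, Unix',
    },
  );
  print end_html;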
Other metadata is also in use; I found a nice chart at http://vancouver-webpages.com/META/ that appears to be fairly comprehensive. If the page moves, try searching for meta tags in any convenient search engine.
But suppose you've gone to the trouble to make good descriptions and keyword lists for all your major entry points. Why wait for a spider to direct people there? Can you use the information yourself, and build an index table? Sure! Of course, you could do it all by hand, but that might get out of date. So, let's write a program that spiders our own site, and generates a nice index of all the keywords. Such a program is presented in [listing one, below].
Lines 1 through 3 start nearly every program I write, turning on warnings, enabling compiler restrictions, and unbuffering standard output.
Lines 5 through 26 define the boundaries of the configurable parts of this program. While I really don't promote my column programs as ``ready to run'', I still like to isolate the parts people will likely want to tweak in a small section at the top. This program is meant to be an idea (as in ``web technique'', cute eh?), not the end product. That's for you to create.
Line 7 gives a list of all the URLs to start spidering at. For a well-connected site (where every page is linked from some other page somehow), there's generally only one URL in this list. Here, I've mangled my top-level URL slightly so that silly people won't go spidering my site when they run downloaded programs without even reading the docs (as has happened frequently in the past with my sample programs).
Lines 9 through 24 define a subroutine called OK_TO_FOLLOW. This subroutine will be passed a URI object, known to be some http link from an existing web page that was already scanned. The subroutine must return 1 if the link should also be scanned, or 0 if not. I've configured this particular subroutine for my site, knowing exactly where meta-data might be. As they say, your mileage may vary, and most certainly will, in this case.
Lines 11 to 13 keep my spider in only my site. Lines 14 through 16 disallow any CGI-query URLs to be used, because I don't have anything interesting that's reachable only from a query-string URL. Lines 17 to 22 skip over URLs that point at useless things, like CGI scripts, column text, pictures, and non-HTML files. We really want to keep out as many things as we can here to avoid searching needlessly, but we don't want to miss any useful pages either.
Note that for lines 11 to 13, I'm not really doing a loop, but rather aliasing $_ to a computed value temporarily. This is a nice weird-but-idiomatic use of foreach, spelled f-o-r here.
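If you haven't run into that idiom before, here's a tiny stand-alone sketch (with a made-up hostname) of using for to alias $_ to a value for just one block, so the regular expressions inside don't need an explicit binding:

  my $host = "www.stonehenge.Xcom";      # made-up hostname for illustration
  for ($host) {                          # aliases $_ to $host for this block only
    print "looks like my site\n" if /\.stonehenge\.Xcom$/i;
    print "but not a secure page\n" unless /^secure\./i;
  }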
Line 23 confirms the ``ok to follow'' if it made it through all the little hoops earlier.
Now for the good stuff. Lines 28 through 31 bring in the modules I need. CGI::Pretty is in the standard distribution, and from the CPAN we'll need the LWP bundle (for LWP::UserAgent and HTML::Entities) and WWW::Robot.
Lines 33 to 35 declare my global variables, used to communicate between the spider and the table-generator. %description maps canonical URLs to text descriptions (from the description metadata). %keywords maps keywords (from the keywords metadata), always made lowercase, to the URLs that contain them. And %keyword_caps records the original case (upper or lower) of a keyword, or at least one instance of that keyword.
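To make the data flow concrete, here's roughly what those three hashes might hold after a run; the URL and keywords here are invented for illustration:

  # %description: canonical URL => cleaned-up description text
  my %description = (
    'http://www.stonehenge.Xcom/' => 'The Home Page of Randal L. Schwartz',
  );
  # %keywords: lowercased keyword => hash of URLs seen with that keyword
  my %keywords = (
    'perl'    => { 'http://www.stonehenge.Xcom/' => 1 },
    'karaoke' => { 'http://www.stonehenge.Xcom/' => 1 },
  );
  # %keyword_caps: lowercased keyword => one original-case spelling
  my %keyword_caps = (
    'perl'    => 'Perl',
    'karaoke' => 'Karaoke',
  );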
Lines 37 to 45 set up the spider. The docs for WWW::Robot go into this a great deal better than I have room for here. I'm setting up a spider that identifies itself as MetaBot for a user-agent string, version 0.15, and an email address of something like my email address. I'll get to USERAGENT in a moment. I'm also turning off the checking of MIME types, which I found did some unnecessary probes on my site, and if you want to see what's happening, uncomment the VERBOSE line.
Now, about that USERAGENT. The default USERAGENT is a LWP::RobotUA, which I found during experimentation to be buggy in its fetching and parsing of robots.txt, at least in version 1.15 of LWP::RobotUA. I'll be reporting the bug to Gisle Aas, but in the meanwhile, I don't really care. Since I'm spidering my own site, I don't need it to respect robots.txt. Of course, if you're reading someone else's site, you should be a good net neighbor and wait for LWP::RobotUA to get fixed.
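When you do spider a site you don't control, you'd hand the spider a robots.txt-aware agent instead of the plain one. A sketch of that construction, with a placeholder contact address, might look like this once the bug is fixed:

  use LWP::RobotUA;

  # agent name/version plus a contact address, as robots.txt etiquette expects
  my $ua = LWP::RobotUA->new('MetaBot/0.15', 'webmaster@example.com');
  $ua->delay(1);          # wait at least one minute between requests

  # ... then hand it to WWW::Robot instead of the plain agent:
  #   USERAGENT => $ua,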
Line 47 enables the scanning of the proxy environment variables if needed. Oddly enough, I don't need to do this, and you shouldn't either if you are spidering your own site. Not sure why I have this there then. Hmm.
Lines 49 to 54 define one of the two ``callbacks'' from the spider. As each URL is found (either from the initial list, or from the links on a page), the follow-url-test callback is invoked. We'll need to return a true value from this hook for every URL of interest, and false otherwise. Line 53 invokes the subroutine defined earlier to do the bulk of this test.
Lines 55 to 76 do the tough job of extracting the useful information on every web page. The invoke-on-contents callback gets invoked on each HTML page. Fortunately, we have access to the HTTP::Response object as the fourth argument, which gives us the meta-data because of the nice parsing already done for the response.
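You can see the same parsing outside the spider, too: a plain LWP::UserAgent fetch (with its default head-parsing behavior) exposes the meta tags as response headers. A minimal sketch, using the intentionally mangled URL from the listing:

  use LWP::UserAgent;
  use HTTP::Request;

  my $ua = LWP::UserAgent->new;    # head parsing is on by default
  my $response = $ua->request
    (HTTP::Request->new(GET => 'http://www.stonehenge.Xcom/'));
  if ($response->is_success) {
    # HTML::HeadParser has already turned <meta> tags into pseudo-headers
    print "description: ", $response->header('X-Meta-Description') || "(none)", "\n";
    print "keywords:    ", $response->header('X-Meta-Keywords')    || "(none)", "\n";
  }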
Lines 58 to 61 extract any meta-data of interest if present. Note that LWP puts the description metadata into a pseudo-header field called X-Meta-Description. Line 62 returns quickly if there's no metadata of interest. Note that you could just as easily add site-specific meta-data fields here if you didn't want to preempt the spider-significant meta-data fields, leaving them for their original purpose but giving you precise control for your index.
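For instance, your pages could carry a tag with a made-up name like index-keywords, and the hook could pull that out alongside (or instead of) the standard fields. A hypothetical fragment for the inside of the invoke-on-contents hook:

  # hypothetical tag: <meta name="index-keywords" content="Perl, training">
  my $index_kw = $response->header('X-Meta-Index-Keywords');
  if (defined $index_kw) {
    for (split /,/, $index_kw) {
      s/^\s+//; s/\s+$//;                 # trim whitespace, as for Keywords
      $keywords{lc $_}{$url}++;
      $keyword_caps{lc $_} = $_;
    }
  }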
Lines 63 through 67 clean up the description, and store it into the %description hash. I found a number of sites have newlines and other junk inside their returned descriptions, so we squeeze all spaces, tabs, and newlines into single spaces.
Lines 68 to 75 similarly grab the keywords. I see the spiders want comma-separated keyword lists, but a number of sites I found used space-separated keywords. So, as always, you'll probably want to adjust this portion to whatever suits your site, or whatever floats your fleet, or whatever. The result here comes out in %keywords and %keyword_caps, defined earlier.
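For example, if you decide to treat commas and runs of whitespace alike as separators, the split in line 69 is the only thing that changes. A sketch, with a made-up keyword string (note the tradeoff: multi-word keywords like ``Perl training'' get broken apart):

  my $content = "Perl, Perl training Unix Karaoke";     # made-up keyword string
  my @words   = grep { length } split /[\s,]+/, $content;
  print join("|", @words), "\n";     # Perl|Perl|training|Unix|Karaoke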
Line 77 is where all the spidering gets done. This call doesn't return until all the pages, and all the pages they point at (recursively), get processed. So, we could be here for a while for a large site.
After the spider has traversed the portion of interest of our web, we can dump the data. To keep things simple for me, I elected to use the CGI.pm HTML shortcuts, because they stack nicely as Perl code even though they generate tons of HTML.
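If you haven't played with the HTML shortcuts before, each one is just a function returning a string of HTML, so they nest like ordinary Perl expressions. A quick sketch:

  use CGI::Pretty qw(-no_debug :html);

  # nested function calls become nested HTML elements
  print table({ Border => 1 },
              Tr(td("row 1, cell 1"), td("row 1, cell 2")),
              Tr(td("row 2, cell 1"), td("row 2, cell 2")));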
Line 79 defines an empty hash %seen_letter, used to generate the index-to-the-index links at the top of the index.
From line 81 to the end of the program is a giant print operation, printing an HTML table from the shortcut begun in line 82 and ending in line 114. I set some of the table visual parameters in the anonymous hash on line 82, tweaking these until the table looked pretty.
The first row of the table comes from the Tr shortcut in lines 87 through 89. I want to generate an index for this index that looks like:

Jump to: A B K L P R S W Z

with each of the letters being a link to an anchor on the generated page. Of course, I want to use only the letters for which I have keywords, so I track that in the hash defined in line 84. And if that hash is empty in line 86, I won't even generate an index row. Cool.
I'll let you work through the HTML shortcuts for this; see the CGI.pm docs for more details.
The guts of the table come from lines 92 through 113. For each of the keywords, we'll generate one or more rows of the table. The left column (of three columns total) is the keyword column. The middle column is the link, and the right column is the description (if any). If there are multiple links for a given keyword, we'll span the keyword down the right number of rows.
The keyword ends up in $key in line 93. The corresponding value (the URL link collection) gets set up in lines 94 to 103. For each link, we extract a description (or create a default), and then provide a two-element arrayref with the second- and third-column data in it (not yet wrapped into a td shortcut, because that comes later).
So, for a given keyword, @value (from line 94) has one or more arrayrefs, each being a separate table row for that keyword. Lines 104 to 107 figure out if this needs to be a ``come here'' target from the letter index at the top of the table, and if so, wrap the target in an A tag.
Lines 109 to 112 dump out @value and the $key_text in an appropriate way. The $_ value is a small integer for the index of each element in @value. For the first element only, we'll dump out the keyword and the right ROWSPAN attribute so that it goes across all the rows. For later elements, we skip that. And then the element itself is dumped inside a td shortcut, which makes two table cells from the arrayref. (HTML shortcuts automatically distribute themselves onto arrayrefs -- another nice feature.)
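That distributive rule deserves a tiny example of its own: handing a shortcut an array reference gives one element per item, while an ordinary list becomes a single element.

  use CGI::Pretty qw(-no_debug :html);

  print td(['left cell', 'right cell']), "\n";
  # two cells:  <td>left cell</td> <td>right cell</td>

  print td('left cell', 'right cell'), "\n";
  # one cell:   <td>left cell right cell</td>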
And that's it. Hairy code at the end, but I just had to keep thinking of ``what is inside what'' to come up with the nesting. map and do-blocks are definitely useful in constructing HTML shortcuts, as is judicious use of temporary variables to hold and name the parts of the calculations.
To use the program, first add appropriate descriptions and keywords to all your key pages, then run the script, putting the output into a file like /your/index/file/path. Create an HTML index page that wraps that up into an HTML index, like this:
<html><head><title>Site index</title></head>
<body>
<!--#include file="/your/index/file/path" -->
</body>
</html>
Or you can just make this script generate all the extra wrapper stuff itself and write the output directly to the HTML file. You can even run the script from a cron job to keep it up to date automatically.
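A sketch of that direct approach, assuming you also import CGI.pm's :standard functions so start_html and end_html are available, and writing to the same placeholder path used above:

  use CGI::Pretty qw(-no_debug :standard :html);

  # write a complete page instead of a server-side-include fragment
  open OUT, "> /your/index/file/path" or die "cannot create: $!";
  print OUT start_html(-title => 'Site index'),
            h1('Site index'),
            "... the table from the listing goes here ...",
            end_html;
  close OUT;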
So, never be without a site index again. Let Perl build it for you. By the way, according to my notes, this is my 42nd column for WebTechniques, and I hope that answers all of your questions about Life, the Universe, and Everything for now. Until next time, enjoy!
Listings
=1= #!/usr/bin/perl -w
=2= use strict;
=3= $|++;
=4=
=5= ## config
=6=
=7= my @URL = qw(http://www.stonehenge.Xcom/);
=8=
=9= sub OK_TO_FOLLOW {
=10=   my $uri = shift;                  # URI object, known to be http only
=11=   for ($uri->host) {
=12=     return 0 unless /\.stonehenge\.Xcom$/i;
=13=   }
=14=   for ($uri->query) {
=15=     return 0 if defined $_ and length;
=16=   }
=17=   for ($uri->path) {
=18=     return 0 if /^\/(cgi|fors|-)/;
=19=     return 0 if /col\d\d|index/;
=20=     return 0 if /Pictures/;
=21=     return 0 unless /(\.html?|\/)$/;
=22=   }
=23=   return 1;
=24= }
=25=
=26= ## end config
=27=
=28= use WWW::Robot;
=29= use LWP::UserAgent;
=30= use CGI::Pretty qw(-no_debug :html);
=31= use HTML::Entities;
=32=
=33= my %description;
=34= my %keywords;
=35= my %keyword_caps;
=36=
=37= my $robot = WWW::Robot->new
=38=   (
=39=    NAME => 'MetaBot',
=40=    VERSION => '0.15',
=41=    EMAIL => 'merlyn@stonehenge.Xcom',
=42=    USERAGENT => LWP::UserAgent->new,
=43=    CHECK_MIME_TYPES => 0,
=44=    ## VERBOSE => 1,
=45=   );
=46=
=47= $robot->env_proxy;
=48=
=49= $robot->addHook
=50=   ("follow-url-test" => sub {
=51=      my ($robot, $hook, $url) = @_;
=52=      return 0 unless $url->scheme eq 'http';
=53=      OK_TO_FOLLOW($url);
=54=    });
=55= $robot->addHook
=56=   ("invoke-on-contents" => sub {
=57=      my ($robot, $hook, $url, $response, $structure) = @_;
=58=      my %meta = map {
=59=        my $header = $response->header("X-Meta-$_");
=60=        defined $header ? ($_, $header) : ();
=61=      } qw(Description Keywords);
=62=      return unless %meta;
=63=      if (exists $meta{Description}) {
=64=        $_ = $meta{Description};
=65=        tr/ \t\n/ /s;
=66=        $description{$url} = $_;
=67=      }
=68=      if (exists $meta{Keywords}) {
=69=        for (split /,/, $meta{Keywords}) {
=70=          s/^\s+//;
=71=          s/\s+$//;
=72=          $keywords{lc $_}{$url}++;
=73=          $keyword_caps{lc $_} = $_;
=74=        }
=75=      }
=76=    });
=77= $robot->run(@URL);
=78=
=79= my %seen_letter;
=80=
=81= print
=82=   table({ Cellspacing => 0, Cellpadding => 10, Border => 2 },
=83=         do {
=84=           my %letters;
=85=           @letters{map /^([a-z])/, keys %keywords} = ();
=86=           %letters ?
=87=             Tr(td({Colspan => 3},
=88=                   p("Jump to:",
=89=                     map a({Href => "#index_$_"}, uc $_), sort keys %letters)))
=90=               : 0;
=91=         },
=92=         map {
=93=           my $key = $_;
=94=           my @value =
=95=             map {
=96=               my $url = $_;
=97=               my $text = exists $description{$url} ?
=98=                 $description{$url} : "(no description provided)";
=99=
=100=              [a({Href => encode_entities($url)}, encode_entities($url)),
=101=               encode_entities($text),
=102=              ];
=103=            } sort keys %{$keywords{$key}};
=104=          my $key_text = $keyword_caps{$key};
=105=          if ($key =~ /^([a-z])/ and not $seen_letter{$1}++ ) {
=106=            $key_text = a({ Name => "index_$1" }, $key_text);
=107=          }
=108=
=109=          map {
=110=            Tr(($_ > 0 ? () : td({Rowspan => scalar @value}, $key_text)),
=111=               td($value[$_]));
=112=          } 0..$#value;
=113=        } sort keys %keywords
=114=       );