Search This Site (Dec 99)

Web Techniques Column 44 (Dec 1999)

[suggested title: Search This Site]

Back in this column in April 1997, I provided a simple script that searched the text of the programs I've written for this column over the years. Recently, I've been hacking my overall web site design, and thought it would be cool to be able to search my entire site. The program of the April column could do the trick, but only if I never planned on getting anything else done with my web server box again, because it would be expensive to search everything.

But I thought to myself, hey, the big search engines have already come to my site, fetched all the pages I want to have searched, and indexed them for me. Furthermore, they have more spare CPU cycles than I do, and it'd be nice to just take advantage of that.

And then I remembered that many of the search engines provide a way to insist that the returned values have a specific URL or site value. I could use this to my advantage to create a wrapper that uses the big search engine to return hits only on my site!

The upside of this approach is that I leverage off of existing work, and someone else's disk and CPU. The downside is that the spiders don't visit very often, so new material is likely to be missed in such an index. But for mostly static or old pages, the tradeoff is often interesting.

Of course, Perl can pass the proper values into the search engine's form-response CGI programs, but the answer comes back as HTML. It looks like a mess to figure out what part of that HTML is a link to some hit, and what part is simply a link to an ad or something.

Luckily, we don't have to figure that out, because the continually maintained WWW::Search package in the CPAN lets us access the output from these engines in a sane way, and all I have to do is interface to that code. My first attempt resulted in the program in [listing 1 below].

Line 1 enables warnings and taint checking. I like taint checking on CGI scripts, because a CGI program is essentially acting on someone else's behalf using the (hopefully limited) privileges of the web server. Perl normally enables taint checking automatically on setuid programs, but we need to let Perl know that we want taint checking explicitly.

Line 2 turns on the compiler restrictions, requiring me to declare my variables, disabling the use of soft references, and preventing me from accidentally using a string where I meant a subroutine invocation. I use this on any program that is more than 10 lines long that I use for more than 10 minutes (what I call my "10 - 10" rule).

Line 3 disables output buffering. I was toying for a while with making this program an NPH program that first shoved a "working..." page to the browser (using server push), and then returned the real page later. For that, unbuffering is essential. Here, it's just a line I type frequently without thinking.

Line 5 pulls in Lincoln Stein's wonderful CGI.pm module, including all the shortcuts for generating HTML and handling forms.

Lines 7 through 12 define the configuration section, hopefully with all the things one would want to change to move this to a different site. Line 9 gives the domain name we will ask the search engines about, and line 10 defines the number of hits of interest. If you leave the settings as they are in the listing, you'll be searching live information about my site. The bigger the number of hits, the longer the connection will be tied up, possibly resulting in a timeout, so keep it appropriate.

Lines 14 through 22 define the search engines that conform with my needs (ones that can have some site-narrowing in the query string). For each of the elements of %ENGINES, the key gives the WWW::Search search engine name, and the value is a coderef to transform the search data into a query string. Note that AltaVista, HotBot, and Infoseek are the easiest: an additional restriction to the user's requested query is enough. NorthernLight was a little more odd, requiring some extra syntax to make it a full boolean query. (I also noticed that WWW::Search hasn't stayed in sync with NorthernLight's output, and it sends out an erroneous link. Hmm.)
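To see the shape of that table in isolation, here is a minimal sketch of a hash of coderefs, using two of the entries from the listing. The build_query helper is a hypothetical name that does not appear in the program, and nothing here talks to a search engine; it only builds the query string.

```perl
use strict;
use warnings;

my $SITE = "stonehenge.com";

# Each value is a coderef that wraps the user's words in
# engine-specific site-restriction syntax (copied from the listing).
my %ENGINES = (
  AltaVista     => sub { "+host:$SITE @_" },
  NorthernLight => sub { "URL:$SITE AND TEXT:\"@_\"" },
);

# Hypothetical helper: look up the engine and call its coderef.
# Returns undef for an engine that is not in the table.
sub build_query {
  my ($engine, $search_for) = @_;
  my $engine_sub = $ENGINES{$engine} or return;
  return $engine_sub->($search_for);
}

print build_query("AltaVista", "perl taint"), "\n";
print build_query("NorthernLight", "perl taint"), "\n";
```

Adding another engine is then a one-line change to the hash, provided the engine has some syntax for restricting hits to a site.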

Lines 24 through 27 create the top of the CGI response, including a nice CGI header (roughly the same as an HTTP header) and a title of "Search this site" and a similar H1.

Lines 29 through 42 create the search form, regardless of whether we're searching this time or not. Thanks to CGI.pm's sticky fields, the default values in this form will be the same as the query being acted upon, if any, allowing slightly modified queries or perhaps even the same query from different engines (something I was doing frequently while testing this program).

Lines 30 and 42 put horizontal rules around the form, one of the things I do conventionally to visually delimit a set of related input features. Lines 31 and 41 generate the HTML for the start and end of the form. I force the method to be GET rather than POST (the default) so that I can bookmark the resulting query. CGI.pm doesn't care if it's a GET or POST, but bookmarking does.
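The reason GET is bookmarkable is that the form values travel in the URL itself. As a rough sketch of what such a URL looks like, the following builds one by hand. Both helper names are hypothetical, the escaping only handles plain ASCII text, and a real browser typically sends a space as "+" rather than "%20" (and may also send a parameter for the submit button, omitted here).

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode everything except
# letters, digits, and a few safe punctuation characters.
sub uri_escape_simple {
  my ($text) = @_;
  $text =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord $1)/ge;
  return $text;
}

# Hypothetical helper: build the URL requested when a form
# with these name/value pairs is submitted using GET.
sub get_url {
  my ($action, @pairs) = @_;
  my @parts;
  while (my ($name, $value) = splice(@pairs, 0, 2)) {
    push @parts, uri_escape_simple($name) . "=" . uri_escape_simple($value);
  }
  return $action . "?" . join("&", @parts);
}

print get_url("/cgi/sitesearch",
              search_for => "taint checking",
              engine     => "HotBot"), "\n";
```

With POST, the same pairs would travel in the request body instead, so a bookmark of the result page would carry no query at all.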

Lines 32 to 40 generate a layout table to get everything to line up nicely. The table has one row with four parts:

a submit button with a label of "Search stonehenge.com for",

a text input field with a name of search_for,

the word "using" (just to fill out the sentence properly), and

a radio-button group with selections for each of the search engines.

The radio-button group is laid out vertically (using tables once again), thanks to the -columns parameter to radio_group.

When the submit button is pressed, or when return is typed in the text field (for most browsers), our script will be reinvoked with the search_for and engine parameters. Line 44 detects this, and invokes the actual search. By putting this code into a subroutine, I can clearly see what gets done every time, and what gets done only when parameters are present.

Lines 46 and 47 finish up the CGI output, ending the HTML and exiting the program with a good status.

Lines 51 to 82 handle the hard work of calling the search engine with valid parameters, and displaying the search results. Hard only in the sense that we have to get stuff validated and then interpret the results from the nifty WWW::Search family of modules, but I'm getting ahead of myself.

Lines 53 to 64 validate the form values. If anything goes awry, we exit the subroutine immediately. Line 53 gets the search string, simply fetching the value.

Lines 55 to 61 extract the search engine. If the engine is present, we ensure that it's a Perl symbol, and extract that symbol. This is needed because WWW::Search uses this engine string in a way that trips up taint checking if left tainted. If the engine is absent, we'll pretend they use AltaVista all the time. That lets me sprinkle the rest of my key pages with something like:

      

	<form action="/cgi/sitesearch">
	<input type=submit value="Search this site for:">
	<input type=text name="search_for">
	</form>

and let it default to AltaVista properly.
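The validate-and-capture step can be tried on its own. This is a minimal sketch that mirrors lines 55 to 61 of the listing; the clean_engine name is hypothetical, and the sample inputs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring lines 55 to 61 of the listing.
# Returns a clean engine name, the default when none was given,
# or undef when the value contains anything but word characters.
sub clean_engine {
  my ($engine) = @_;
  return "AltaVista" unless defined $engine;
  return unless $engine =~ /^(\w+)$/;
  return $1;    # text captured by a regex is untainted under -T
}

for my $raw ("HotBot", undef, "Hot Bot; rm -rf /") {
  my $clean = clean_engine($raw);
  print defined $clean ? "ok: $clean\n" : "rejected\n";
}
```

The important detail is returning the captured $1 rather than the original variable: under taint checking, only the captured copy counts as clean.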

Lines 63 and 64 validate the engine name one more time, ensuring that it is a key in the %ENGINES hash. The value of that element is a coderef, which we now invoke to turn the user's query into a query for the selected engine as $engine_search_for.

If we make it past all those treacherous return operations, it's time for the actual engine interaction. The require in line 66 brings in the WWW::Search module. Note that this module is not compiled if we never make it this far, so we'll be saving compile time on those invocations that are merely putting up the search form and not getting the results.
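The effect of a run-time require can be observed through %INC, which records every file Perl has loaded. In this small sketch, the core module Text::Wrap stands in for WWW::Search (an assumption made only so the example runs on a stock Perl), and wrap_it is a hypothetical name.

```perl
use strict;
use warnings;

# Text::Wrap stands in for WWW::Search here; it ships with Perl.
sub wrap_it {
  require Text::Wrap;    # compiled on the first call only
  return Text::Wrap::wrap("", "", @_);
}

print "before: ", (exists $INC{"Text/Wrap.pm"} ? "loaded" : "not loaded"), "\n";
print wrap_it("instant searchability"), "\n";
print "after:  ", (exists $INC{"Text/Wrap.pm"} ? "loaded" : "not loaded"), "\n";
```

Had the sub used "use" instead of "require", the module would be compiled at program start whether or not the sub was ever called.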

Line 68 creates the search object, passing the engine name to the new method of WWW::Search. This also compiles the appropriate code for that search engine.

Line 69 sets the number of items in which we're interested. The default is a fairly large number -- not something I want to wait for while it's being fetched.

Line 70 establishes the query. We pass the search string through an escape_query method for reasons that are not quite clear to me from reading the documentation. But once that is done, it's handed to the search engine interfacer, and we're off and running.

Lines 72 through 81 dump out a table of the results (again, using a table for some layout control). For grins, I've centered the table using an attribute in line 73.

Lines 74 through 77 label the table using a TH cell, using an internal function of CGI.pm to escape the HTML in the search string. This isn't exactly proper, but I doubt that the function will change much in future releases of CGI.pm, and if it does, it's just a five-line routine anyway.
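If that function ever did go away, a replacement could look something like the following. This is a hypothetical stand-in written for illustration, not CGI.pm's actual source, and it covers only the four characters that matter most in HTML text and attribute values.

```perl
use strict;
use warnings;

# Hypothetical stand-in for CGI.pm's escaping routine.
sub escape_html {
  my ($text) = @_;
  $text =~ s/&/&amp;/g;     # must come first, or later entities get re-escaped
  $text =~ s/</&lt;/g;
  $text =~ s/>/&gt;/g;
  $text =~ s/"/&quot;/g;
  return $text;
}

print escape_html(q{<b>"perl" & cgi</b>}), "\n";
```

Escaping matters here because the search string comes straight from the user and is echoed back into the page.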

Similarly, the map operation in lines 78 to 80 creates a table row for every result. The results come from the method call in line 80. Each of these ends up in $_ in line 78. The url method is called on the result to get the URL string, held for a moment in local variable $url. Line 79 generates an anchor link, with the text being the same as the place to which it sends the user. And that's it!
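The same map idiom can be exercised without a network connection. In this sketch, FakeResult is a hypothetical class with just enough behavior to answer a url method, the two URLs are made-up sample values, and the HTML is built with plain strings rather than CGI.pm (and without the escaping the listing performs).

```perl
use strict;
use warnings;

# Hypothetical stand-in for a search result object.
package FakeResult;
sub new { my ($class, $url) = @_; return bless { url => $url }, $class }
sub url { return $_[0]{url} }

package main;

my @results = map { FakeResult->new($_) }
  "http://www.stonehenge.com/", "http://www.stonehenge.com/merlyn/";

# One table row per result, with the URL as both link target and link text.
my @rows = map {
  my $url = $_->url;
  qq{<tr><td><a href="$url">$url</a></td></tr>};
} @results;

print "<table>\n", join("\n", @rows), "\n</table>\n";
```

Because map returns a list, the rows can be dropped directly into the argument list of an enclosing call, which is how the listing feeds them to table.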

So, you can drop this program into your CGI area, change the $SITE parameter, and there you have it, instant searchability with very little CPU power required.

I hope you find what you're looking for. And if Y2K doesn't turn us all into characters from the Mad Max movies, I'll see you next month right here. Enjoy.

Listings

      
col44.pl

	=1=	#!/usr/bin/perl -wT
	=2=	use strict;
	=3=	$|++;
	=4=	
	=5=	use CGI ":all";
	=6=	
	=7=	## CONFIG
	=8=	
	=9=	my $SITE = "stonehenge.com";
	=10=	my $MAXIMUM_HITS = 32;
	=11=	
	=12=	## END CONFIG
	=13=	
	=14=	## table of search engines
	=15=	my %ENGINES =
	=16=	  (
	=17=	   AltaVista => sub { "+host:$SITE @_" },
	=18=	   HotBot => sub { "+domain:$SITE @_" },
	=19=	   Infoseek => sub { "+site:$SITE | @_" },
	=20=	   NorthernLight => sub { "URL:$SITE AND TEXT:\"@_\"" },
	=21=	  );
	=22=	## end table
	=23=	
	=24=	print
	=25=	  header,
	=26=	  start_html("Search this site"),
	=27=	  h1("Search this site");
	=28=	
	=29=	print
	=30=	  hr,
	=31=	  start_form(-method => 'GET'),
	=32=	  table(Tr(
	=33=	           td(submit("Search $SITE for")),
	=34=	           td(textfield(-name => 'search_for')),
	=35=	           td("using"),
	=36=	           td(radio_group(-name => 'engine',
	=37=	                          -values => [sort keys %ENGINES],
	=38=	                          -columns => 1,
	=39=	                         )),
	=40=	           )),
	=41=	  end_form,
	=42=	  hr;
	=43=	
	=44=	do_search() if param;
	=45=	
	=46=	print end_html;
	=47=	exit 0;
	=48=	
	=49=	## subroutines
	=50=	
	=51=	sub do_search {
	=52=	
	=53=	  return unless defined (my $search_for = param('search_for'));
	=54=	
	=55=	  my $engine = param('engine');
	=56=	  if (defined $engine) {
	=57=	    return unless $engine =~ /^(\w+)$/;
	=58=	    $engine = $1;
	=59=	  } else {
	=60=	    $engine = "AltaVista";
	=61=	  }
	=62=	
	=63=	  return unless defined (my $engine_sub = $ENGINES{$engine});
	=64=	  my $engine_search_for = $engine_sub->($search_for);
	=65=	
	=66=	  require WWW::Search;
	=67=	
	=68=	  my $search = WWW::Search->new($engine);
	=69=	  $search->maximum_to_retrieve($MAXIMUM_HITS);
	=70=	  $search->native_query(WWW::Search::escape_query($engine_search_for));
	=71=	
	=72=	  print
	=73=	    table({-align => 'center'},
	=74=	          Tr(th("results for ",
	=75=	                code(CGI::escapeHTML($search_for)),
	=76=	                " from $engine on $SITE\n",
	=77=	               )),
	=78=	          (map { my $url = $_->url;
	=79=	                 Tr(td(a({-href => $url}, CGI::escapeHTML($url))));
	=80=	               } $search->results)
	=81=	         );
	=82=	}

About Randal L. Schwartz