Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Web Techniques Column 26 (Jun 1998)
Last month's column could have been titled ``Where did they go?'', because I explored tracking the outbound links from my site to the interesting URLs I had provided on my pages. In this month's column, I'm looking at ``Where did they come from?''.
In particular, much of the web's content is found these days not through interesting URLs posted on other sites, but by users typing search queries into the big indexing engines like Altavista and Lycos and Infoseek. If you're maintaining a ``referer log'', you may have noticed that the query string typed by the user sometimes shows up when that user follows a search-results link to your page. This happens because the indexer's search page is often a GET form, and the parameters of the search are therefore encoded into the URL of the search results page.
And having noticed that, I decided to write a program that would go through my referer log and extract just the search strings. This is more than an idle curiosity; it tells me exactly what people are looking for when they end up at my pages, and what I should be providing more of if I want my site to be popular. Especially if I'm selling ads or wanting to be famous.
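For example, a hit that followed an Altavista results link might show up with a referer URL like this one (an invented example, though q really is the query field Altavista used, as we'll see below):

http://www.altavista.digital.com/cgi-bin/query?q=perl+referer+log

Everything after the ? is the encoded form data, and the q field carries the search string the user typed.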
The ``referer log'' (enabled by a configuration parameter on most popular web servers) is merely a record of the HTTP Referer header (yes, it's spelled that way for historical reasons), which usually gives the URL of the page from which the request was made. The referer header is not supported by all browsers, and will be messed up on a bookmarked entry. But for the majority of hits, the referer can give valuable information (as you can see by looking at the results of this program on your site).
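For example, under Apache, either the old mod_log_referer module or the newer mod_log_config can produce such a log (the filenames here are just illustrations; consult your server's documentation for what you actually have compiled in):

RefererLog logs/referer_log
CustomLog logs/referer_log "%{Referer}i -> %U"

Both forms produce lines in the ``there -> here'' format that the program below expects.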
The program to extract the search strings from the referer log is given in [listing 1, below].
Line 1 contains the path to Perl, along with the command-line switches that enable ``taint'' mode and warnings. Taint mode doesn't make much sense here, but I turned it on in case I decide later to make this a CGI script. Warnings are useful, but they can occasionally get in the way.
Line 2 turns on the compiler restrictions useful for all programs
greater than ten lines or so. This includes disabling soft references
(almost always a good idea), turning off ``Perl poetry mode'', and (most
importantly) requiring all non-package variables to be declared.
Variables will thus need to be introduced with an appropriate my
directive.
Line 3 unbuffers STDOUT, causing all output to happen at the time it is printed, not when the STDIO buffer fills up. This is handy because it lets me see the output nearly immediately for a large log file, rather than having to wait until program exit time for the automatic buffer flush.
Line 5 pulls in the URI::URL module from the LWP library. This library is the all-singing, all-dancing, everything-you-wanted library to handle nearly all web-ish stuff in Perl, and can be found in the CPAN at [insert location here]. Of course, if you're doing anything with Perl and the web, you've probably already got this installed. We need this library to pull apart the referer URL.
Line 7 defines the result hash as %count, which will ultimately hold a hash of hashes of counts of how many times each query string was used from a particular search engine. Initially, it needs to be empty, so we set it to the empty list (becoming the empty hash).
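To make the shape of that structure concrete, here's roughly what %count might hold by the end of a run (with invented data):

%count = (
  "www.altavista.digital.com" => {
    "perl llama" => 3,
    "web techniques" => 1,
  },
  "www.excite.com" => { "referer log" => 1 },
);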
Lines 8 through 53 define the data gathering loop. For each line in the referer log, we'll go through this loop once, with the line in $_. The data will be taken either from standard input, or from the list of files specified on the command line.
Line 9 pulls out the referer information from the line. For a standard RefererLog-style log, this'll look like:

there -> here

And since we're only interested in there, it's simple enough to just pull out all the whitespace-separated items and grab the first one, here kept in $ref. If you have a different logfile format, you'll have to adjust this line to pull out the field you need.
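As a quick sketch of what line 9 does, suppose $_ holds an invented referer-log line:

$_ = "http://www.altavista.digital.com/cgi-bin/query?q=camel -> /merlyn/index.html\n";
my ($ref) = split;
# $ref is now "http://www.altavista.digital.com/cgi-bin/query?q=camel"

With no arguments, split breaks $_ apart on whitespace, and the list assignment keeps just the first item.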
Line 10 turns the referer string in $ref into a URI::URL object, using the subroutine url defined in that module. If $ref is empty or not a valid URL, the object may be malformed, but that'll be caught in the next step.
Line 11 verifies that we have a valid http: URL. The scheme method on the URL object returns either a string or undef. If it's not defined, the or operator (two vertical bars) selects the empty string as an alternative, to prevent the use of an undef value in a further calculation, which would trigger a warning under -w. If this URL is not an http URL, then we skip it.
Line 12 extracts the portion of the URL string after the ? as a query form, if at all possible. The eval block protects this program from an exception in the query_form method, which dies if there isn't a valid form. The result of the eval creates a new hash, %form. The keys of this hash are the query field names, and the corresponding values are the field values.
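Here's how those three steps behave on a single invented referer URL, assuming the URI::URL module works as described above:

use URI::URL;
my $url = url "http://www.altavista.digital.com/cgi-bin/query?pg=q&q=perl+books";
print $url->scheme, "\n";  # "http"
print $url->host, "\n";    # "www.altavista.digital.com"
my %form = eval { $url->query_form };
print "$_ => $form{$_}\n" for sort keys %form;
# pg => q
# q => perl books

Note that query_form has already undone the URL-encoding, so the + comes back as a space.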
Lines 13 through 39 create a value for @search_fields, specifying, for a particular search engine host, what we're guessing is the search query string. This list can take several kinds of values:
- If the list is empty, then we ignore this particular search engine. (Either it's not a search engine, or we can't find anything useful to note as a search string.)
- If the list consists of only uppercase words, then all fields of the query will be dumped (used for the catchall entry at the end).
- In the common case, if the list consists of one or more lowercase words, these represent form fields of interest, probably with the search string that brought the client here.
To construct this list, I started with a very small list, and ran it over my referer log of a few months. For every search engine that got dumped out as unrecognized, I figured out which of the fields looked like a search string, and added them in. I also got a bit of help from Teratogen on IRC (known in ``real life'' as Anthony Nemmer of EdelSys Consulting), who had apparently tackled a similar problem before, and identified a significantly larger portion of the list from his own data.
The list is incomplete, and evolves over time, so the names here are merely a good cross section. Also, there are search engines that don't use a GET method to go from the search page to the results, and thus their parameters won't show up in the URL. But as you can see, a good number of the popular ones (Altavista, Excite, Hotbot, Infoseek, Lycos, Search.com, and Webcrawler) do.
Line 14 extracts the hostname from the referer URL, and makes it lowercase. (We could have made all the comparisons case-insensitive, but this alternative was much faster.)
Lines 15 through 38 form a long if..elsif..elsif..else structure. Note that it begins with if 0, which will always be false, but permits all the remaining cases to be symmetrical. This is nice because it allows me to swap the order of the checking trivially (by exchanging lines in a text editor) or even sort them if I wish.
The hostname is compared with each of the regular expressions in turn. Note that some of the patterns look only for a particular hostname portion, while others are anchored so that the complete suffix must end the string. In particular, I found many different hosts with altavista in their names, and they all seemed to use the same query field, so writing the test for it this way made sense.
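Here's a tiny demonstration of the difference between the two styles of pattern (the hostnames are invented):

for my $host (qw(altavista.digital.com jp.altavista.net netfind.aol.com www.aol.com)) {
  print "$host: altavista-style match\n" if $host =~ /\baltavista\b/;
  print "$host: aol-style match\n" if $host =~ /\bnetfind\.aol\.com$/;
}
# altavista.digital.com: altavista-style match
# jp.altavista.net: altavista-style match
# netfind.aol.com: aol-style match

The \b-bounded pattern matches the name anywhere in the hostname, while the $-anchored pattern insists that the hostname end with the complete suffix.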
Note that they are tested in the order presented. I found some form being used in edit.my.yahoo.com that was nothing like the query form in yahoo.com (and friends), so I placed a special blocking entry ahead of the Yahoo main entry, saying ``don't bother with this one, it's not the same''. Otherwise, the ordering of this list is somewhat arbitrary, and for efficiency it should probably be arranged with the most likely candidates first.
The multiway if statement is within a do block, meaning that the last expression evaluated will be the return value. If you don't like the structure requiring the use of elsif chunks, you can write other switch statements enclosed in bare blocks, like so:
my @search_fields = "UNKNOWN";
{
  local $_ = lc $url->host;
  (@search_fields = "q"), last if /\baltavista\b/;
  (@search_fields = qw(s search)), last if /\bnetfind\.aol\.com$/;
  ...;
  (@search_fields = "p"), last if /\byahoo\b/;
}
But I didn't like the number of times I'd have to say @search_fields, and went with the do-block structure instead. Another alternative might be to call a subroutine, like:
my @search_fields = &map_to_engine($url);

sub map_to_engine {
  local $_ = lc shift->host;
  return "q" if /\baltavista\b/;
  return qw(s search) if /\bnetfind\.aol\.com$/;
  ...;
  return "p" if /\byahoo\b/;
  return "UNKNOWN";
}
And in fact, to some, that may look cleaner than what I wrote. Your choice, however. After all, the Perl motto is ``There's More Than One Way To Do It.''
In line 40, we check the result of that multiway test. If @search_fields is empty, it's the signal that this line is noisy, and we can skip it. Otherwise, in line 41, we'll translate this list into a hash to do a fast lookup. The map operator takes the elements of the list in @search_fields, interposes a single 1 after each element, and turns that into the %wanted hash, with keys being the original elements of the list.
Line 42 scans the form fields from %form, keeping only those elements that match the keys of %wanted in a case-insensitive manner. This is accomplished through the clever use of lowercasing the value of $_ before doing the lookup. Thus, @show_fields will be a list of all the form fields of interest, if any.
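Here's a tiny stand-alone illustration of lines 41 and 42, with invented form data:

my @search_fields = qw(s search);
my %wanted = map { $_, 1 } @search_fields;  # %wanted is (s => 1, search => 1)
my %form = (Search => "perl books", lang => "en");
my @show_fields = grep { $wanted{lc $_} } keys %form;
# @show_fields is ("Search"), because lc "Search" matches the "search" key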
If @show_fields has one or more elements, we found a valid search site along with an interesting field (hopefully a search string). In that case, we'll save the search string for later dumping. Lines 44 through 46 store the information into a hash-of-hashrefs, with the first level being the host, and the second level being the particular search string used at that host. A count is maintained, and for the most part will be just an increment from undef to 1. Occasionally, when the same search string is used (or repeated), we'll get multiple hits.
On the other hand, if @show_fields is empty, we were either looking at a referer URL that had a form from an unknown site, or somehow one of the known sites didn't have the proper field. In that case, we'll dump out the entire form immediately, so that you can inspect it manually to locate a search string for a future run. That's handled in lines 48 through 51, which simply dump the %form variable preceded by the search host.
Lines 55 through 63 dump the search string hash-of-hashrefs. Each of the hostnames ends up in $host in line 55. (If you don't have a relatively modern version of Perl, the for my syntax will not work. Upgrade now, because it's free and less buggy than the version you're running.)
Line 56 extracts the hashref value from the top-level hash, which is then dereferenced in line 57 to get the individual search-text items into $text. Lines 58 to 62 dump out the hostname, the text string, and the number of times each item was found (if more than once).
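So the final report ends up looking something like this (invented data again):

www.altavista.digital.com: perl llama
www.altavista.digital.com: web techniques (3 times)
www.excite.com: referer log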
And there you have it. To use this program, adjust the ``referer field'' parsing line according to the format of your referer log, and then pass the name of the log on the command line to this program. You could even wrap this up into a nightly job, and with a little work generate an HTML output file that creates links back to the search engines in question! (Sounds like an interesting additional project if I've got another hour or two.) Enjoy!
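If you want a head start on that additional project, here's a rough sketch of the HTML-generating step, assuming the %count structure from the listing is already populated. (The links here naively guess at a q query parameter, which won't be right for every engine, so treat this as a starting point, not a finished tool.)

use URI::Escape;  # for uri_escape
print "<ul>\n";
for my $host (sort keys %count) {
  for my $text (sort keys %{$count{$host}}) {
    my $link = "http://$host/?q=" . uri_escape($text);
    print qq(<li><a href="$link">$text</a> via $host</li>\n);
  }
}
print "</ul>\n";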
Listings
=1=     #!/usr/bin/perl -Tw
=2=     use strict;
=3=     $|++;
=4=
=5=     use URI::URL;
=6=
=7=     my %count = ();
=8=     while (<>) {
=9=       my ($ref) = split; ## may require adjustment
=10=      my $url = url $ref;
=11=      next unless ($url->scheme || "") eq "http";
=12=      next unless my %form = eval { $url->query_form };
=13=      my @search_fields = do {
=14=        local $_ = lc $url->host;
=15=        if (0) { () }
=16=        elsif (/\baltavista\b/) { "q" }
=17=        elsif (/\bnetfind\.aol\.com$/) { qw(s search) }
=18=        elsif (/\baskjeeves\.com$/) { "ask" }
=19=        elsif (/\bdejanews\.com$/) { () }
=20=        elsif (/\bdigiweb\.com$/) { "string" }
=21=        elsif (/\bdogpile\.com$/) { "q" }
=22=        elsif (/\bexcite\.com$/) { qw(s search) }
=23=        elsif (/\bhotbot\.com$/) { "mt" }
=24=        elsif (/\binference\.com$/) { "query" }
=25=        elsif (/\binfoseek\.com$/) { qw(oq qt) }
=26=        elsif (/\blooksmart\.com$/) { "key" }
=27=        elsif (/\blycos\b/) { "query" }
=28=        elsif (/\bmckinley\.com$/) { "search" }
=29=        elsif (/\bmetacrawler\b/) { "general" }
=30=        elsif (/\bnlsearch\.com$/) { "qr" }
=31=        elsif (/\bprodigy\.net$/) { "query" }
=32=        elsif (/\bsearch\.com$/) { qw(oldquery query) }
=33=        elsif (/\bsenrigan\.ascii\.co\.jp$/) { "word" }
=34=        elsif (/\bswitchboard\.com$/) { "sp" }
=35=        elsif (/\bwebcrawler\.com$/) { qw(search searchtext text) }
=36=        elsif (/\bedit\.my\.yahoo\.com$/) { () } ## must come before yahoo.com
=37=        elsif (/\byahoo\b/) { "p" }
=38=        else { "UNKNOWN" }
=39=      };
=40=      next unless @search_fields;
=41=      my %wanted = map { $_, 1 } @search_fields;
=42=      my @show_fields = grep { $wanted{lc $_} } keys %form;
=43=      if (@show_fields) {
=44=        for (@show_fields) {
=45=          $count{$url->host}{$form{$_}}++;
=46=        }
=47=      } else {
=48=        print $url->host, "\n";
=49=        for (sort keys %form) {
=50=          print "?? $_ => $form{$_}\n";
=51=        }
=52=      }
=53=    }
=54=
=55=    for my $host (sort keys %count) {
=56=      my $hostinfo = $count{$host};
=57=      for my $text (sort keys %$hostinfo) {
=58=        my $times = $hostinfo->{$text};
=59=        print "$host: $text";
=60=        print " ($times times)" if $times > 1;
=61=        print "\n";
=62=      }
=63=    }