Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 42 (Jul 2002)
[suggested title: ``Parsing and summarizing a logfile'']
I've recently put www.stonehenge.com
behind a caching
reverse-proxy, and rather than switch technologies, I'm using another
instance of a stripped-down Apache server to do the job. But what
kind of job was it doing? How many of my hits were being cached and
delivered by the lightweight front servers, instead of going all the
way through to the heavy mod_perl-and-everything-else backend servers?
Luckily, I've included the caching information in the access log file,
thanks to the CustomLog
and LogFormat
directives:
LogFormat "[virt=%v] %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{X-Cache}o\"" combined
CustomLog var/log/access_log combined
I've added a virtual host entry (for tracking) to the front of the
line, and the X-Cache
header of the response to the end of the
line. Of course, doing so means my access log is not in a standard
format any more, so I can't use off-the-shelf tools for log analysis.
That's OK, because I'm pretty good at writing my own data reduction
tools. A typical output line looks like this:
[virt=www.stonehenge.com] 192.168.42.69 - - [10/May/2002:01:51:50 -0700] "GET /merlyn/UnixReview/ HTTP/1.0" 200 101324 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" "MISS from www.stonehenge.com"
For my analysis, I wanted to see how many of those X-Cache
fields
began with HIT
or MISS
, indicating whether the mod_proxy
module
had served the response from its cache (a good
cachable hit), or had gone all the way to the backend server
to regenerate it. And I wanted the data
summarized on an hour-by-hour basis, in a CSV-style file so I could
pull it into my favorite spreadsheet to do graphs and formulas.
So, I had a few problems to solve. I had to locate the data. I had to read it line-by-line, parsing it into fields of interest. And then I had to figure out what hour of the day the hit represented, so I could collect it into the right bucket. And finally, I had to spit out a CSV file for all the data from the first to last entry.
Locating the data was easy:
my $DIR = "/web/stonehenge-proxy/var/log";
@ARGV = <$DIR/access_log*> unless @ARGV;
while (<>) {
  ...
My access logs are all named access_log
-something: I use a nightly
log-roller that leaves them in the same directory with a timestamp
appended. By storing the names into the @ARGV
array, a simple
``diamond'' loop will read them nicely.
Next, I needed to parse each line:
my ($server, $stamp, $meth_url, $status, $bytes, $proxy) =
  / ^ \[virt=(\S+)\] \s+   # virtual servername
    \S+ \s+                # source IP
    \S+ \s+ \S+ \s+        # basicauth
    \[(.*?)\] \s+          # request timestamp
    "(.*?)" \s+            # method and URL and optional protocol
    (\d+) \s+              # status
    (\S+) \s+              # bytes
    ".*?" \s+              # referer
    ".*?" \s+              # useragent
    "(.*?)"                # proxy status
    \s* \z                 # end of string
  /x
  or warn("cannot parse $ARGV $.: $_"), next;
Wow. That's a lot. Compare this regex to the sample line I presented earlier, and the correlation should become clear. I'm using an ``extended format'' regular expression so I can comment each piece. I'm extracting more fields than I needed, mostly because I was playing around with the different combinations of types of hits to the proxy server. Doesn't hurt to have more than you need, sometimes.
Next, the date thing needs to be sorted out. I wanted a key that identified the particular hour of the hit. I did this by delegating that job to a subroutine:
my $key = date_string_to_epoch_key($stamp);
My plan was to have a Unix epoch integer as a response from this subroutine. That was a pain, as I'll show shortly. And finally, we'll increment this hit into one of three buckets:
if ($proxy =~ /^HIT/) {
  $sums{$key}{hit}++;
} elsif ($proxy =~ /^MISS/) {
  $sums{$key}{miss}++;
} else {
  $sums{$key}{other}++;
}
And this creates a ``hash of hashes'' (actually, a hash containing
hashrefs). The first level is the timestamp, while the second level
is really a C-like struct with constant names. Missing first-level
entries are created automatically as needed, thanks to the very cool
autovivification. The ``other'' bucket is for entries that don't use
mod_proxy
, such as redirects or static images served directly by
the proxy server.
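To see autovivification at work on a smaller scale, here's a quick sketch (my own illustration, not from the column; the epoch value is an arbitrary example key):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %sums;   # starts completely empty

# Incrementing a nested element springs both levels into
# existence: the first-level key gets a fresh hashref
# automatically, and the counter starts from undef-as-zero.
$sums{1020988800}{hit}++;
$sums{1020988800}{hit}++;
$sums{1020988800}{miss}++;

print "hits: $sums{1020988800}{hit}\n";     # hits: 2
print "misses: $sums{1020988800}{miss}\n";  # misses: 1
```

No `exists` checks or explicit hashref creation needed: the act of writing to the deep element creates every level along the way.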
To round out the loop at the bottom, I want to report progress
and also keep the $.
variable in sync. Simple enough, using
the recipe from the manpages:
} continue {
  if (eof) {
    close ARGV;
    warn "done with $ARGV\n";
  }
};
There. After that loop, the file processing is done. Now to generate the report. First, I want to step through all of the hourly timestamps, regardless of whether a particular hour has a hit or not. Of course, if my webserver has a zero hit hour, it's very likely that something is seriously wrong, so I want to show a row with zeroes rather than skipping that row. The quick and dirty way to get the minimum and maximum keys was to sort the keys:
my ($minkey, $maxkey) = (sort { $a <=> $b } keys %sums)[0, -1];
And now we print the CSV header for easy labeling and import:
print "Date,Miss,Hit,Other\n";
And then we'll dump. Hmm. I have a Unix epoch time value, but I want
a human-readable format acceptable to my spreadsheet program. The
handiest tool for that job is the heavy-hitting Date::Manip
package, found in the CPAN, and its nice UnixDate
formatting
routine. Yeah, there are probably 18 other ways to do that, but
Date::Manip
usually does the job for me nicely.
use Date::Manip;
for (my $key = $minkey; $key <= $maxkey; $key += 3600) {
  my $h = $sums{$key};
  print join(",",
             UnixDate("epoch $key", "%m/%d/%Y %H:%M:%S"),
             map { $h->{$_} || 0 } qw(miss hit other)), "\n";
}
Note that I use || 0
to ensure that a missing entry actually shows
up as a zero entry.
And finally, for that sticky part. I have to translate the timestamp
into an hour epoch value. Again, Date::Manip
is helpful here.
First, I edit the string so that the minutes and seconds are zeroed
out, then pass that string to ParseDateString
to extract the value.
However, if I did that on every hit, the time would be prohibitively
expensive. So I'll cache the result like so:
BEGIN {
  my %cache;

  sub date_string_to_epoch_key {
    my $stamp = shift;
    $stamp =~ s/(\d\d):\d\d:\d\d /$1:00:00 /; # reduce to hour
    $cache{$stamp} ||= UnixDate(ParseDateString($stamp), "%s");
  }
}
If a given $stamp
exists as a key in %cache
hash, then the
||=
operator stops, returning that value. However, if the key is
missing, then the (expensive) expression on the right produces the
value, which is both stored into the hash element and returned!
Very nice idiom, although it fails if the cached value just happens to
be zero, since the expensive expression will be recalculated each
time.
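Since Perl 5.10, the defined-or assignment operator `//=` sidesteps that pitfall: it recomputes only when the cached slot is undefined, so a legitimately cached zero stays cached. Here's a sketch of the idiom with a made-up example function (my illustration, not the column's code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %cache;
my $computed = 0;   # counts how often the expensive part runs

# Like ||=, but keyed on defined-ness rather than truth, so a
# cached value of 0 does not trigger recomputation.
sub cached_square {
  my $n = shift;
  $cache{$n} //= do {
    $computed++;     # the "expensive" work happens only once per key
    $n * $n;
  };
}

print cached_square(0), "\n";  # 0 -- computed and cached
print cached_square(0), "\n";  # 0 -- cache hit, despite being false
print "computed $computed time(s)\n";
```

With `||=` in the same position, the second call would recompute because a stored 0 is false; with `//=` it is a genuine cache hit.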
And the end result is a file on standard output looking a lot like:
Date,Miss,Hit,Other
05/10/2002 01:00:00,30,2,26
05/10/2002 02:00:00,273,15,183
05/10/2002 03:00:00,224,14,108
05/10/2002 04:00:00,201,7,122
05/10/2002 05:00:00,558,28,474
05/10/2002 06:00:00,615,74,329
05/10/2002 07:00:00,819,61,352
05/10/2002 08:00:00,605,85,450
which my spreadsheet program pulls up just fine. Or, I could even import it into a database and perform more complex queries. But at least now I know how successful my proxy cache is at avoiding hits to the fat backend processes. Until next time, enjoy!