Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 42 (Jul 2002)

[suggested title: ``Parsing and summarizing a logfile'']

I've recently put www.stonehenge.com behind a caching reverse-proxy, and rather than switch technologies, I'm using another instance of a stripped down Apache server to do the job. But what kind of job was it doing? How many of my hits were being cached and delivered by the lightweight front servers, instead of going all the way through to the heavy mod_perl_and_everything_else backend servers?

Luckily, I've included the caching information in the access log file, thanks to the CustomLog and LogFormat directives:

    LogFormat "[virt=%v] %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{X-Cache}o\"" combined
    CustomLog var/log/access_log combined

I've added a virtual host entry (for tracking) to the front of the line, and the X-Cache header of the response to the end of the line. Of course, doing so means my access log is not in a standard format any more, so I can't use off-the-shelf tools for log analysis. That's OK, because I'm pretty good at writing my own data reduction tools. A typical output line looks like this:

    [virt=www.stonehenge.com] 192.168.42.69 - - [10/May/2002:01:51:50 -0700] "GET /merlyn/UnixReview/ HTTP/1.0" 200 101324 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" "MISS from www.stonehenge.com"

For my analysis, I wanted to see how many of those X-Cache fields began with HIT or MISS, indicating whether mod_proxy had served the response straight from its cache, or had gone all the way to the backend server to regenerate it. And I wanted the data summarized on an hour-by-hour basis, in a CSV-style file so I could pull it into my favorite spreadsheet to do graphs and formulas.

So, I had a few problems to solve. I had to locate the data. I had to read it line-by-line, parsing it into fields of interest. And then I had to figure out what hour of the day the hit represented, so I could collect it into the right bucket. And finally, I had to spit out a CSV file for all the data from the first to last entry.

Locating the data was easy:

  my $DIR = "/web/stonehenge-proxy/var/log";
  @ARGV = <$DIR/access_log*> unless @ARGV;
  while (<>) { ...

My access logs are all named access_log-something: a nightly log-roller leaves each rolled log in the same directory, with a timestamp appended to the name. By storing the names into the @ARGV array, a simple ``diamond'' loop will read them nicely.
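The roller itself doesn't have to be fancy; here's a minimal sketch of the rename step (the path and date format are just illustrative, not necessarily what my real roller does):

  use POSIX qw(strftime);
  my $log   = "/web/stonehenge-proxy/var/log/access_log";
  my $dated = $log . "-" . strftime("%Y%m%d", localtime);
  rename $log, $dated or die "cannot rename $log to $dated: $!";
  # ... then tell the frontend Apache to reopen its logs
  # (apachectl graceful, for example)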

Next, I needed to parse each line:

    my ($server, $stamp, $meth_url, $status, $bytes, $proxy) = /
      ^
      \[virt=(\S+)\] \s+ # virtual servername
      \S+ \s+ # source IP
      \S+ \s+ \S+ \s+ # identd and auth user
      \[(.*?)\] \s+ # request timestamp
      "(.*?)" \s+ # method and URL and optional protocol
      (\d+) \s+ # status
      (\S+) \s+ # bytes
      ".*?" \s+ # referer
      ".*?" \s+ # useragent
      "(.*?)" # proxy status
      \s* \z # end of string
      /x or warn("cannot parse $ARGV $.: $_"), next;

Wow. That's a lot. Compare this regex to the sample line I presented earlier, and the correlation should become clear. I'm using an ``extended format'' regular expression so I can comment each piece. I'm extracting more fields than I needed, mostly because I was playing around with the different combinations of types of hits to the proxy server. Doesn't hurt to have more than you need, sometimes.
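To make the correspondence concrete, for the sample line shown earlier those captures come out roughly as:

    # $server   = "www.stonehenge.com"
    # $stamp    = "10/May/2002:01:51:50 -0700"
    # $meth_url = "GET /merlyn/UnixReview/ HTTP/1.0"
    # $status   = "200"
    # $bytes    = "101324"
    # $proxy    = "MISS from www.stonehenge.com"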

Next, the date thing needs to be sorted out. I wanted a key that identified the particular hour of the hit. I did this by delegating that job to a subroutine:

    my $key = date_string_to_epoch_key($stamp);

My plan was to have a Unix epoch integer as a response from this subroutine. That was a pain, as I'll show shortly. And finally, we'll increment this hit into one of three buckets:

    if ($proxy =~ /^HIT/) {
      $sums{$key}{hit}++;
    } elsif ($proxy =~ /^MISS/) {
      $sums{$key}{miss}++;
    } else {
      $sums{$key}{other}++;
    }

And this creates a ``hash of hashes'' (actually, a hash containing hashrefs). The first level is the timestamp, while the second level is really a C-like struct with constant names. Missing first-level entries are created automatically as needed, thanks to the very cool autovivification. The ``other'' bucket is for entries that don't use mod_proxy, such as redirects or static images served directly by the proxy server.
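To see the shape of the thing, here's roughly what %sums ends up holding; the keys and counts below are illustrative, borrowed from the first two rows of the sample output further down (the epoch keys assume my Pacific timezone):

    %sums = (
      1021017600 => { miss => 30,  hit => 2,  other => 26  }, # the 01:00 hour
      1021021200 => { miss => 273, hit => 15, other => 183 }, # the 02:00 hour
      # ... one entry per hour ...
    );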

To round out the loop at the bottom, I want to report progress and also keep the $. variable in sync. Simple enough, using the recipe from the manpages:

  } continue {
    if (eof) {                  # end of the current file from @ARGV
      close ARGV;               # resets $. so line numbers restart per file
      warn "done with $ARGV\n"; # progress report
    }
  }

There. After that loop, the file processing is done. Now to generate the report. First, I want to step through all of the hourly timestamps, regardless of whether a particular hour has a hit or not. Of course, if my webserver has a zero hit hour, it's very likely that something is seriously wrong, so I want to show a row with zeroes rather than skipping that row. The quick and dirty way to get the minimum and maximum keys was to sort the keys:

  my ($minkey, $maxkey) = (sort { $a <=> $b } keys %sums)[0, -1];
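(If I'd wanted to avoid the sort, List::Util, found in the CPAN and bundled with newer Perls, has min and max routines that would do the same job; just a sketch:)

  use List::Util qw(min max);
  my ($minkey, $maxkey) = (min(keys %sums), max(keys %sums));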

And now we print the CSV header for easy labeling and import:

  print "Date,Miss,Hit,Other\n";

And then we'll dump. Hmm. I have a Unix epoch time value, but I want a human-readable format acceptable to my spreadsheet program. The handiest tool for that job is the heavy-hitting Date::Manip package, found in the CPAN, and its nice UnixDate formatting routine. Yeah, there's probably 18 other ways to do that, but Date::Manip usually does the job for me nicely.

  use Date::Manip;
  for (my $key = $minkey; $key <= $maxkey; $key += 3600) {
    my $h = $sums{$key};
    print join(",",UnixDate("epoch $key","%m/%d/%Y %H:%M:%S"),
               map { $h->{$_} || 0 } qw(miss hit other)), "\n";
  }

Note that I use || 0 to ensure that a missing entry actually shows up as a zero entry.
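For instance, assuming Date::Manip is picking up my Pacific timezone, the key for the first hour of the sample output below formats like this:

  # UnixDate("epoch 1021017600", "%m/%d/%Y %H:%M:%S")
  #   produces "05/10/2002 01:00:00"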

And finally, the sticky part I promised earlier: I have to translate the timestamp into an hour epoch value. Again, Date::Manip is helpful here. First, I edit the string so that the minutes and seconds are zeroed out, then pass that string to ParseDateString to extract the value. However, if I did that on every hit, the run time would be prohibitive. So I'll cache the result like so:

  # the BEGIN block runs at compile time, so %cache is already set up
  # by the time the main loop above calls the sub
  BEGIN {
    my %cache;    # lexical cache, visible only to the sub below
    sub date_string_to_epoch_key {
      my $stamp = shift;
      $stamp =~ s/(\d\d):\d\d:\d\d /$1:00:00 /; # reduce to hour
      $cache{$stamp} ||= UnixDate(ParseDateString($stamp), "%s");
    }
  }

If a given $stamp already has a (true) value in the %cache hash, the ||= operator short-circuits, returning that value. However, if the entry is missing, then the (expensive) expression on the right produces the value, which is both stored into the hash element and returned! Very nice idiom, although it fails if the cached value just happens to be zero (or otherwise false), since the expensive expression will then be recalculated each time.
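That can't actually bite us here, since no log timestamp maps to an epoch value of zero, but if it could, a zero-safe sketch of the same subroutine would test exists instead of relying on ||=:

  BEGIN {
    my %cache;
    sub date_string_to_epoch_key {
      my $stamp = shift;
      $stamp =~ s/(\d\d):\d\d:\d\d /$1:00:00 /; # reduce to hour
      # compute only when this key has never been seen, even if the
      # stored value happens to be zero or otherwise false
      $cache{$stamp} = UnixDate(ParseDateString($stamp), "%s")
        unless exists $cache{$stamp};
      $cache{$stamp};
    }
  }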

And the end result is a file on standard output looking a lot like:

    Date,Miss,Hit,Other
    05/10/2002 01:00:00,30,2,26
    05/10/2002 02:00:00,273,15,183
    05/10/2002 03:00:00,224,14,108
    05/10/2002 04:00:00,201,7,122
    05/10/2002 05:00:00,558,28,474
    05/10/2002 06:00:00,615,74,329
    05/10/2002 07:00:00,819,61,352
    05/10/2002 08:00:00,605,85,450

which my spreadsheet program pulls up just fine. Or, I could even import it into a database and perform more complex queries. But at least now I know how successful my proxy cache is at avoiding hits to the fat backend processes. Until next time, enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.