Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 37 (Jun 2002)

[suggested title: Looking for lossage]

In last month's column, I showed how to use my webserver's database-driven statistics log to figure out peak and average transfer rates. Having a database for a logging engine is great, but such a log consumes more space than I can reasonably afford. So, I tend to keep only a few months of backlogs in the database, and roll the rest out to a flat file for easy minimal storage and archival. Which made the task I faced a few days ago a bit harder.

The Linux box currently hosting stonehenge.com is in a rented space at a co-location facility. As with much of the internet industry shakeout, the co-lo facility was bought out by a larger networking company, and we've been having network interruptions, including complete loss of service from time to time. The administrator of the box came to me looking for evidence that these outages had lasted for some extended periods of time, so that he could take that to the new owner, get some of his money back, and pass the savings along to me.

It occurred to me that I could use the web logs from my webserver. It would have been nice to use the database-driven logs as I did in last month's column, but I needed data from a year ago. I did have the database dumps to restore the data, but not enough disk space to do so. Fortunately, I also had the older standard Apache access_log files, broken up by day, compressed separately, going all the way back to my first hosting machine. It was from these files that I was able to glean the data.

First, let's consider the requirements. I wanted an hour-by-hour hit count for a given range of dates. I also wanted to filter out any local hits, because I run a link checker on my webserver nightly from a process on the same box, and those hits don't reflect any network connectivity issues. I also wanted to be reasonably efficient about date parsing, which ruled out calling Date::Manip's general-purpose date recognition routines (or anything else that heavyweight) in the inner loop of the analyzer. The output also didn't need to be very pretty: just the hit count for each hour, from which we would apply an interpretation. And a program meeting all of my administrator's requirements is presented in [the listing below].

Lines 1 through 3 start nearly every Perl program I write, enabling warnings for development, turning on the compiler restrictions, and unbuffering standard output.

Line 5 pulls in a few selected routines from the Date::Calc module (found in the CPAN). Specifically, I needed to convert a month name to a number, a time stamp to an epoch time, and a human-readable date to a month-day-year value. Line 6 pulls in the standard File::Find module.
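For a quick feel of what those three routines do, here's a tiny sketch (not part of the listing; the sample values are only illustrations):

  use Date::Calc qw(Decode_Month Mktime Decode_Date_US);

  # Decode_Month turns a month name or abbreviation into 1..12 (0 if unrecognized)
  my $mon = Decode_Month("Mar");                    # 3

  # Decode_Date_US parses a US-style date string into (year, month, day)
  my ($yy, $mm, $dd) = Decode_Date_US("1/1/2001");  # (2001, 1, 1)

  # Mktime converts year/month/day/hour/minute/second into a Unix epoch time
  my $epoch = Mktime($yy, $mm, $dd, 12, 0, 0);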

Lines 10 through 12 provide a few configuration constants that I was tweaking during development. The @DIRS variable gives the top-level directories that contain my web logs, as starting points for a ``find'' operation. $START and $END give the approximate range endpoints for the analysis. They aren't exact because my log-roller doesn't run precisely at midnight, and the endpoints are used only to decide which logfiles to bother parsing at all. The endpoints can be given in any format that Date::Calc's Decode_Date_US routine understands, which is pretty broad.

Lines 16 to 19 convert the human-readable strings into epoch times, first by calling Decode_Date_US, and if that succeeds, by turning the resulting year-month-day value into a Unix-epoch-based value (pinned to noon on that day). I'm using the trick here of assigning to the foreach loop control variable to alter the original variable being examined.
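If that aliasing trick is unfamiliar, here's a standalone illustration (a toy example, not from the listing):

  my @dates = ("1/1/2001", "1/1/2002");
  for (@dates) {
    $_ = length $_;   # assigning to $_ rewrites the array element itself
  }
  # @dates is now (8, 8); the original array was modified in place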

Lines 21 to 28 locate all of the logfiles to be parsed. If the name matches the right pattern (line 23), and it's a file (line 24), then the stat buffer's modification time value is saved (line 25). If this value is within range (line 26), we save the full pathname to @files (line 27).
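One subtlety there deserves a note: the -f test performs a stat() whose results are cached in the special underscore filehandle, so the stat(_) on the next line costs no extra system call. Here's a condensed, commented sketch of the same callback:

  find sub {
    return unless /^access_log.*gz$/;   # only compressed access_log files
    return unless -f;                   # plain files only; also fills the "_" stat cache
    my $mtime = (stat(_))[9];           # element 9 of the stat list is the mtime
    push @files, $File::Find::name
      if $mtime >= $START and $mtime <= $END;
  }, @DIRS;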

Line 30 is the report accumulator, counting the number of hits in a given hour. The data structure within %count will be described shortly.

Line 32 declares a cache mapping month names to numbers, as returned by Decode_Month. I don't want to call that subroutine for every single log line, so I cache the results for speed.

Line 34 begins a loop that operates on 10 of the files at a time, pulling them out into a ``chunk''. The final chunk might be less than 10 files, however. The number 10 is somewhat arbitrary: a compromise between firing off zcat for every single file, versus constructing a command line with roughly 365 files for the year, possibly overflowing the maximum length of a command.
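In isolation, the splice-in-a-loop idiom looks like this (a toy example, not from the listing):

  my @work = (1 .. 25);
  while (my @chunk = splice @work, 0, 10) {
    # first pass gets 10 items, second pass 10, final pass the remaining 5
    print scalar(@chunk), " items: @chunk\n";
  }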

Line 35 launches the zcat command on the chunk of files. The ordering and clustering of the files is not relevant, since we're ultimately going to process all the files and update %count before generating any reports.

Lines 36 to 50 process each Apache access_log-style line from the logs within range. Such a line looks a bit like this:

  foo.example.com - - [14/Mar/2002:13:18:55 -0800] "GET / HTTP/1.0" 200 3340

Using an extended-syntax regular expression in lines 38 to 46, I locate the lines that do not begin with my webhost name (to ensure the hit is a remote hit), but contain a date stamp in the normal form. The six memory references (pulling out the day, month, year, hour, minute, and second) are returned as a list, assigned to the variables in line 37.
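Here's that pattern exercised against the sample line above, with my real host name swapped for a hypothetical web.example.com (a standalone sketch, not part of the listing):

  my $line =
    'foo.example.com - - [14/Mar/2002:13:18:55 -0800] "GET / HTTP/1.0" 200 3340';
  my ($day, $month, $year, $hour, $minute, $second) =
    $line =~ m{
      ^
      (?!web\.example\.com)    # reject hits from the (hypothetical) local host
      [^\[]*                   # skip ahead to the opening bracket
      \[
      (\d+) / (\w+) / (\d+)    # day / month name / year
      :
      (\d+) : (\d+) : (\d+)    # hour : minute : second
    }x or die "no match";
  print "$year $month $day $hour:$minute:$second\n";   # 2002 Mar 14 13:18:55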

Lines 47 and 48 turn the month name into a number for easier sorting. First, an attempt is made to locate it in the cache hash. If it's there, that value overwrites the previous $month value. Otherwise, Decode_Month is called, and the result padded to two digits for easy string-based sorting.
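The ||= operator is doing the caching work: it computes the right-hand side only when the hash entry is missing (or false), and month numbers are never false, so the shortcut is safe here. In isolation, the idiom looks something like this (a sketch; the subroutine is made up for illustration):

  my %month_cache;
  sub month_number {
    my $name = shift;
    # compute and store on first use; later calls return the cached value
    return $month_cache{$name} ||= sprintf "%02d", Decode_Month($name);
  }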

Line 49 creates a multi-level data structure. At the top level, the key is the year and month number. Next we have the day number, converted to a pure number by adding a zero, and finally the hour number, once again converted to a pure number by adding a zero. The appropriate entry is incremented to show one hit during this period of interest. Had I been interested in, say, 15-minute intervals, I could have incremented a final key of int(($hour * 60 + $minute)/15) instead.
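After the input loop finishes, %count is shaped roughly like this (the keys and counts here are invented purely for illustration):

  # %count = (
  #   "2001 03" => {                            # "year month" at the top level
  #     14 => { 0 => 12, 13 => 31, 23 => 7 },   # day => { hour => hits }
  #     15 => { 9 => 44, 18 => 3 },
  #   },
  #   "2001 04" => { 1 => { 5 => 18 } },
  # );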

I'm using hashes for all the levels because I believed that at least some of the levels would hold ``sparse'' data. As it turns out, nearly every hash element in sequence was being used, so a hash of arrayrefs of arrayrefs would probably have been a better choice. But this was ``fast enough'', so I stopped optimizing.
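For the record, that denser alternative would change only the increment (and the report loops to match); the increment would look something like this sketch:

  # outer hash still keyed by "year month"; day and hour become array indices
  $count{"$year $month"}[0+$day][0+$hour]++;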

Finally, it's time to show the results, using three nested foreach loops. The outer loop walks through a string-sorted list of keys, representing the year/month pairs for which data is present. The middle loop (starting in line 54) figures out which days of that month are present, and sorts them numerically. The inner loop (starting in line 55) forces a scan over every hour of that day. Some hours may have had zero hits, and those are exactly the ones we don't want to skip in the output, so by forcing every hour we can see the zero-hit results too. I knew the outage was never longer than a day, or I'd have already moved my hosting service to another location. Line 57 displays the data, and we're done!
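Because the interesting evidence is the hours with zero hits, a one-liner can pull just those out of a saved report (assuming the output was redirected to a hypothetical file named report.txt; this isn't part of the program above):

  perl -ne 'print if /: 0$/' report.txt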

With the report generated in a timely fashion, my administrator is off to the co-lo owner, hopefully resulting in some returned monies. And the task is complete. Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use Date::Calc qw(Decode_Month Mktime Decode_Date_US);
        =6=     use File::Find;
        =7=     
        =8=     ## begin config
        =9=     
        =10=    my @DIRS = glob "/home/merlyn/Logs/OLD.*/.";
        =11=    my $START = "1/1/2001";
        =12=    my $END = "1/1/2002";
        =13=    
        =14=    ## end config
        =15=    
        =16=    for ($START, $END) {
        =17=      my @y_m_d = Decode_Date_US($_) or die "Cannot decode $_";
        =18=      $_ = Mktime(@y_m_d, 12, 0, 0); # noon on that day
        =19=    }
        =20=    
        =21=    my @files;
        =22=    find sub {
        =23=      return unless /^access_log.*gz$/;
        =24=      return unless -f;
        =25=      my $mtime = (stat(_))[9];
        =26=      return unless $mtime >= $START and $mtime <= $END;
        =27=      push @files, $File::Find::name;
        =28=    }, @DIRS;
        =29=    
        =30=    my %count;
        =31=    
        =32=    my %month_name_to_number;
        =33=    
        =34=    while (my @chunk = splice @files, 0, 10) {
        =35=      open IN, "zcat @chunk |" or die;
        =36=      while (<IN>) {
        =37=        my ($day, $month, $year, $hour, $minute, $second) =
        =38=          m{
        =39=            ^
        =40=            (?!web\.stonehenge\.com)
        =41=            [^\[]*
        =42=            \[
        =43=            (\d+) \/ (\w+) \/ (\d+)
        =44=            :
        =45=            (\d+) : (\d+) : (\d+)
        =46=           }x or next;
        =47=        $month = $month_name_to_number{$month}
        =48=          ||= sprintf "%02d", Decode_Month($month);
        =49=        $count{"$year $month"}{0+$day}{0+$hour}++;
        =50=      }
        =51=    }
        =52=    
        =53=    for my $year_month (sort keys %count) {
        =54=      for my $day (sort {$a <=> $b} keys %{$count{$year_month}}) {
        =55=        for my $hour (0..23) {
        =56=          my $hits = $count{$year_month}{$day}{$hour} || 0;
        =57=          printf "%s %02d %02d: %d\n", $year_month, $day, $hour, $hits;
        =58=        }
        =59=      }
        =60=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.