Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Linux Magazine Column 37 (Jun 2002)
[suggested title: Looking for lossage]
In last month's column, I showed how to use my webserver's database-driven statistics log to figure out peak and average transfer rates. Having a database for a logging engine is great, but such a log consumes more space than I can reasonably afford. So, I tend to keep only a few months of backlogs in the database, and roll the rest out to a flat file for easy minimal storage and archival. Which made the task I faced a few days ago a bit harder.
The Linux box currently hosting stonehenge.com
is in a rented space
at a co-location facility. As with much of the internet shakeout, the co-lo facility was bought out by a larger networking company, and we've been having network interruptions, including complete loss of service from time to time. The administrator of the
box came to me looking for evidence that these outages had been for
some extended periods of time, so that he could take that to the new
owner and get some of his money back and pass the savings along to me.
It occurred to me that I could use the web logs from my webserver. It would have been nice to use the database-driven logs as I did in last month's column, but I needed data from a year ago. I did have
the database dumps to restore the data, but not enough disk space to
do so. Fortunately, I also had the older standard Apache
access_log
files, broken up by day, compressed separately, going
all the way back to my first hosting machine. It was from these files
that I was able to glean the data.
First, let's consider the requirements. I wanted an hour-by-hour hit
count, for a given range of dates. I also wanted to filter out any
local hits, because I run a link checker on my webserver nightly from
a process on the same box, and those hits don't reflect any network
connectivity issues. I also wanted to be reasonably efficient about
date parsing. That completely ruled out running Date::Manip's general date recognition routines, or anything else like them, in the inner loop of the analyzer. The output also didn't need to be very pretty: just
the hit count for an hour, from which we would apply an
interpretation. And a program meeting all of my administrator's
requirements is presented in [the listing below].
Lines 1 through 3 start nearly every Perl program I write, enabling warnings for development, turning on the compiler restrictions, and unbuffering standard output.
Line 5 pulls in a few selected routines from the Date::Calc
module
(found in the CPAN). Specifically, I needed to convert a month name
to a number, a time stamp to an epoch time, and a human-readable date
to a month-day-year value. Line 6 pulls in the standard File::Find
module.
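If you haven't used these before, here's roughly what each of the three Date::Calc routines hands back (a quick sketch, separate from the listing):

  use Date::Calc qw(Decode_Month Mktime Decode_Date_US);
  my @ymd   = Decode_Date_US("1/1/2001");   # (2001, 1, 1), or an empty list on failure
  my $epoch = Mktime(@ymd, 12, 0, 0);       # Unix epoch seconds for noon on that date
  my $num   = Decode_Month("Mar");          # 3 (0 if the name isn't recognized)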
Lines 10 through 12 provide a few configuration constants that I was
tweaking during development. The @DIRS
variable gives the
top-level directories that contain my web logs, as starting points for
a ``find'' operation. $START
and $END
give the approximate range
endpoints for the analysis. They aren't precisely accurate, because my log-roller doesn't run exactly at midnight, and the endpoints are used only to decide which logfiles are worth parsing at all. The endpoints can be given in any format that Date::Calc's date parsing understands, which is pretty broad.
Lines 16 to 19 convert each human-readable string into an epoch time, first by calling Decode_Date_US, and if that works, by turning that
year-month-day value into a Unix-epoch-based value. I'm using the
trick here of assigning to the foreach
loop control variable to
alter the original variable being examined.
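This aliasing works with any foreach loop: the control variable (here, the default $_) is an alias to each element in turn, so changing it changes the original. A tiny unrelated example:

  my ($first, $second) = ("  alpha  ", "  beta  ");
  for ($first, $second) {
    s/^\s+|\s+$//g;              # editing $_ edits $first and $second themselves
  }
  print "[$first][$second]\n";   # prints "[alpha][beta]"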
Lines 21 to 28 locate all of the logfiles to be parsed. If the name
matches the right pattern (line 23), and it's a file (line 24), then
the stat buffer's modification time value is saved (line 25). If
this value is within range (line 26), we save the full pathname to
@files
(line 27).
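If the underscore inside stat(_) on line 25 looks odd: a file test like -f performs a stat and leaves the result in a special buffer, and the lone-underscore filehandle reuses that buffer instead of hitting the filesystem a second time. In isolation (using a file that's merely assumed to exist):

  if (-f "/etc/passwd") {                   # the -f test does the stat
    my ($size, $mtime) = (stat(_))[7, 9];   # reuse that stat: no second system call
    print "size $size bytes, mtime $mtime\n";
  }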
Line 30 is the report accumulator, counting the number of hits in a
given hour. The data structure within %count
will be described
shortly.
Line 32 declares a cache of month-name-to-number mappings, as returned by Decode_Month. I don't want to call that subroutine for every single log line, so I cache the results for speed.
Line 34 begins a loop that operates on 10 of the files at a time, pulling them out into a ``chunk''. The final chunk might be less than 10 files, however. The number 10 is somewhat arbitrary: a compromise between firing off zcat for every single file, versus constructing a command line with roughly 365 files for the year, possibly overflowing the maximum length of a command.
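The splice idiom in isolation looks like this; with a made-up list of 23 items, you get two chunks of 10 and a final chunk of 3:

  my @items = (1 .. 23);
  while (my @chunk = splice @items, 0, 10) {
    print scalar(@chunk), " items: @chunk\n";
  }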
Line 35 launches the zcat command on the chunk of files. The
ordering and clustering of the files are not relevant, since we're
ultimately going to process all the files and update %count
before
generating any reports.
Lines 36 to 50 process each Apache access_log-style line from the logs within range. They look a bit like this:
foo.example.com - - [14/Mar/2002:13:18:55 -0800] "GET / HTTP/1.0" 200 3340
Using an extended-syntax regular expression in lines 38 to 46, I locate the lines that do not begin with my webhost name (to ensure the hit is a remote hit), but contain a date stamp in the normal form. The six memory references (pulling out the day, month, year, hour, minute, and second) are returned as a list, assigned to the variables in line 37.
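Here's that pattern pulled out into a standalone test against the sample line above, just to show which pieces the six sets of parentheses grab (an illustrative sketch, not part of the listing):

  my $line = 'foo.example.com - - [14/Mar/2002:13:18:55 -0800] "GET / HTTP/1.0" 200 3340';
  if (my ($day, $month, $year, $hour, $minute, $second) =
      $line =~ m{
        ^
        (?! web\.stonehenge\.com )   # reject lines that begin with my own host
        [^\[]*                       # skip ahead to the opening bracket
        \[ (\d+) / (\w+) / (\d+) : (\d+) : (\d+) : (\d+)
      }x) {
    print "$year $month $day $hour:$minute:$second\n";   # 2002 Mar 14 13:18:55
  }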
Lines 47 and 48 turn the month name into a number for easier sorting.
First, an attempt is made to locate it in the cache hash. If it's
there, that value overwrites the previous $month
value. Otherwise,
Decode_Month
is called, and the result padded to two digits for
easy string-based sorting.
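The ||= is doing double duty: on the first encounter of a month name it computes the value and stores it into the hash, and on every later encounter it simply hands back the cached copy. The same idiom in miniature:

  use Date::Calc qw(Decode_Month);
  my %month_name_to_number;
  for my $name (qw(Mar Jun Mar Mar)) {
    my $number = $month_name_to_number{$name}
      ||= sprintf "%02d", Decode_Month($name);   # Decode_Month runs only twice here
    print "$name => $number\n";                  # Mar => 03, Jun => 06, Mar => 03, Mar => 03
  }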
Line 49 creates a multi-level data structure. At the top level, the
key is the year and month number. Next we have the day number,
converted to a pure number by adding a zero, and finally the hour
number, once again converted to a pure number by adding a zero. The
appropriate entry is incremented to show one hit during this period of
interest. Had I been interested in, say, 15-minute intervals, I could have incremented a final key of int(($hour * 60 + $minute) / 15) instead.
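For the record, that hypothetical quarter-hour version of line 49 would look something like this (and the reporting loop would then run over buckets 0 through 95 instead of hours 0 through 23):

  $count{"$year $month"}{0+$day}{ int(($hour * 60 + $minute) / 15) }++;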
I'm using hashes for all the levels because I believed that at least some of the levels would hold ``sparse'' data. As it turned out, nearly every element in sequence was being used, so a hash of arrayrefs of arrayrefs would probably have been a better choice. But this was ``fast enough'', so I stopped optimizing.
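Here's a sketch of what that denser arrayref version might have looked like; this isn't the column's code, just the shape of the alternative, with stand-in values so it runs on its own:

  my %count;
  my ($year, $month, $day, $hour) = (2002, "03", 14, 13);   # stand-in values
  $count{"$year $month"}[$day][$hour]++;    # autovivifies arrayrefs for day and hour

  for my $year_month (sort keys %count) {
    my $days = $count{$year_month};
    for my $d (1 .. $#$days) {
      next unless $days->[$d];              # skip days with no entries at all
      for my $h (0 .. 23) {
        printf "%s %02d %02d: %d\n", $year_month, $d, $h, $days->[$d][$h] || 0;
      }
    }
  }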
Finally, it's time to show the results, using three nested foreach
loops. The outer loop walks through a string-sorted list of keys,
representing the year/month pairs for which data has been made
available. The middle loop (starting in line 54) figures out which
days of that month are present, and sorts them numerically. The inner
loop (starting in line 55) forces a scan for every hour of that day.
Some of the hours have 0 hits, and those are precisely the ones we don't want to skip in the output; forcing every hour means the zero-hit results show up too. I knew no outage was ever longer than a day, or I'd have
already moved my hosting service to another location. Line 57
displays the data, and we're done!
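The report is one line per hour, so an outage shows up as a run of zeroes. With invented hit counts, a slice of the output looks something like this:

  2001 06 14 01: 117
  2001 06 14 02: 0
  2001 06 14 03: 0
  2001 06 14 04: 0
  2001 06 14 05: 132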
With the report generated in a timely fashion, my administrator is off to the co-lo owner, hopefully resulting in some returned monies. And the task is complete. Until next time, enjoy!
Listing
=1=     #!/usr/bin/perl -w
=2=     use strict;
=3=     $|++;
=4=
=5=     use Date::Calc qw(Decode_Month Mktime Decode_Date_US);
=6=     use File::Find;
=7=
=8=     ## begin config
=9=
=10=    my @DIRS = glob "/home/merlyn/Logs/OLD.*/.";
=11=    my $START = "1/1/2001";
=12=    my $END = "1/1/2002";
=13=
=14=    ## end config
=15=
=16=    for ($START, $END) {
=17=      my @y_m_d = Decode_Date_US($_) or die "Cannot decode $_";
=18=      $_ = Mktime(@y_m_d, 12, 0, 0); # noon on that day
=19=    }
=20=
=21=    my @files;
=22=    find sub {
=23=      return unless /^access_log.*gz$/;
=24=      return unless -f;
=25=      my $mtime = (stat(_))[9];
=26=      return unless $mtime >= $START and $mtime <= $END;
=27=      push @files, $File::Find::name;
=28=    }, @DIRS;
=29=
=30=    my %count;
=31=
=32=    my %month_name_to_number;
=33=
=34=    while (my @chunk = splice @files, 0, 10) {
=35=      open IN, "zcat @chunk |" or die;
=36=      while (<IN>) {
=37=        my ($day, $month, $year, $hour, $minute, $second) =
=38=          m{
=39=            ^
=40=            (?!web\.stonehenge\.com)
=41=            [^\[]*
=42=            \[
=43=            (\d+) \/ (\w+) \/ (\d+)
=44=            :
=45=            (\d+) : (\d+) : (\d+)
=46=          }x or next;
=47=        $month = $month_name_to_number{$month}
=48=          ||= sprintf "%02d", Decode_Month($month);
=49=        $count{"$year $month"}{0+$day}{0+$hour}++;
=50=      }
=51=    }
=52=
=53=    for my $year_month (sort keys %count) {
=54=      for my $day (sort {$a <=> $b} keys %{$count{$year_month}}) {
=55=        for my $hour (0..23) {
=56=          my $hits = $count{$year_month}{$day}{$hour} || 0;
=57=          printf "%s %02d %02d: %d\n", $year_month, $day, $hour, $hits;
=58=        }
=59=      }
=60=    }