Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Web Techniques Column 48 (Apr 2000)
[suggested title: Web access logs with DBI]
A few months back, I was flipping through the wonderful Writing Apache Modules with Perl and C by Doug MacEachern and fellow WebTechniques columnist Lincoln Stein while trying to think of more bells and whistles to throw at my mod_perl-enabled webserver for http://PerlTraining.Stonehenge.com. I stumbled across the section that talks about using a database instead of a flat file for the access log. Now, I had just recently upgraded and cleaned up the MySQL database for my ISP, and thought it would be nice to have more advanced statistics and reporting. So, I jumped in.
Now this was quite a jump, because I hadn't even used any of Apache's custom log features. I had a simple NCSA-compatible access log, and a few specialized logs in a few directories via SSI loggers. So, I decided to go whole hog the other way. Not only am I logging the usual things (host, user agent, bytes sent, and so on), but I even log the CPU times for each transaction! This lets me see which of my URLs are burning up more CPU so I can figure out ways to cache them better, especially important on a machine that is shared with other users.
And the log is being written immediately to a DBI-based MySQL database. The power here is that I can use SQL queries to generate ad-hoc reports on an up-to-the-minute basis, as well as get the general reports canned up into a Perl data structure with relative ease. No more parsing flat files (not that this was much trouble with Perl anyway). The downside is that I'm now burning 1 megabyte per day for storage inside the MySQL database, but my sysadmin said just to let them burn, since disk is cheap.
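To give a flavor of those ad-hoc reports, here's a minimal sketch of the kind of query I can now type at will. It assumes the `requests` table and column names defined later in listing one; the one-day window is just an example value.

```perl
#!/usr/bin/perl
use strict;

# An ad-hoc report against the "requests" table from listing one:
# the ten busiest URLs over the past day. The SQL could be typed at
# the mysql prompt or handed to DBI's selectall_arrayref.
my $days = 1;
my $sql = <<END_SQL;
select count(*) as Hits, Url
from requests
where When > date_sub(now(), interval $days day)
group by Url
order by Hits desc
limit 10
END_SQL

print $sql;
```

Change the interval and the grouping column, and the same three lines of SQL answer a completely different question, with no flat-file parsing in sight.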
Part 0: What I needed to have already in place
First, I needed to have an Apache server with mod_perl installed, including handling of all the phases. This means I said something like EVERYTHING=1 during the configuration (see the mod_perl documentation for details).
Then, I needed to have access to a live MySQL database, with a valid user and password. That's documented somewhat fuzzily in the MySQL documentation, and a little more clearly in the MySQL and mSQL book from O'Reilly and Associates.
Third, I needed to have the DBI module installed, along with the Apache::DBI adaptor. The CPAN holds these, and they're relatively painless to install.
Finally, I had to have the DBD::mysql module installed, again from the CPAN. This was mostly painless, although the installation asks a lot of questions, wants a live MySQL database to do its testing, and didn't listen to the username and password that I gave it for all the testing steps.
Part 1: Gathering the Statistics
In order to have statistics to analyze, we have to have Apache generate them. Based on the code I saw in Doug and Lincoln's book, I wrote my DBI logger, presented in [listing one, below].
Line 1 sets up the unique package for this handler. I prefix all my local packages with Stonehenge:: (my company name) to completely distinguish them from any distribution. Line 2 enables the normal compiler restrictions. Line 4 is a comment to remind me that this handler is meant to be a PerlInitHandler, the first possible phase under which we can take control for each request.
Lines 8 through 10 pull in some useful modules for this handler. Apache::Constants gives us useful values for OK and DECLINED, used at the end of a handler to inform the server of the outcome of our work. DBI gets pulled in as well, although without importing anything. And Apache::Util gives us the ht_time function: a very fast way to convert a Unix timestamp into a variably formatted string.
Lines 12 through 14 give the DBI-specific information, including the DBI database specifier, the table being used, and the database username and password. Obviously, I've put some fake stuff in here for the heck of it, but you can hopefully see the format.
Lines 16 through 18 define the order in which the fields to be stored are generated, and the names of the columns for those fields.
Lines 19 through 24 define an SQL INSERT command to store a new record. The question-mark placeholders mean we don't have to worry about quoting the values ourselves.
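That construction can be sketched in isolation. This is the same join-and-repeat trick the listing uses, shown here with a shortened field list so the result fits on one line:

```perl
#!/usr/bin/perl
use strict;

# Build an INSERT with one "?" placeholder per column, exactly as
# listing one does, but with a shortened field list for illustration.
my $table  = 'requests';
my @fields = qw(when host url status);
my $insert =
  "INSERT INTO $table (".
  (join ",", @fields).
  ") VALUES(".
  (join ",", ("?") x @fields).
  ")";
print "$insert\n";
# INSERT INTO requests (when,host,url,status) VALUES(?,?,?,?)
```

Because `("?") x @fields` evaluates @fields in scalar context, we always get exactly one placeholder per column, even if the field list changes later.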
Lines 26 to 45 are actually a giant comment as far as Perl is concerned. I'm using a Perl ``POD'' directive to store an SQL statement inside the Perl program. This statement is meant to be cut and pasted into the mysql interactive command to create the table needed for this log. Otherwise, it's ignored.
Lines 47 to 102 define the handler. This subroutine will be invoked at each request, very early in the request handling phases. The first parameter (picked off in line 48) is the Apache::Request object. Line 49 ensures that we are processing only the initial request (one that should be logged), and not the later redirects or subrequests that happen in include files.
Line 51 starts the counter ticking for wall-clock time and CPU time. This variable is not referenced further in the main subroutine, but is held as a ``closure'' variable for the subroutine reference created in the lines starting at line 56. The five elements are wall-clock timestamp, user cpu, system cpu, child user cpu, and child system cpu.
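Stripped of the Apache machinery, the snapshot-then-delta trick looks like this. The five-element list is wall-clock time plus the four counters returned by Perl's built-in times:

```perl
#!/usr/bin/perl
use strict;

# Capture a five-element snapshot (wall clock plus the four CPU
# counters from times), then later replace it with deltas, the same
# way the handler does with its closed-over @times.
my @times = (time, times);

# ... the request would be served here ...

@times = map { $_ - shift @times } time, times;
print "@times\n";   # wall, user, sys, cuser, csys deltas
```

The map on the last line works because its input list (`time, times`) is built before the assignment happens, so each new value can subtract the corresponding old value shifted off the front of @times.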
Lines 53 through 99 form a log handler. This log handler is not installed using the normal PerlLogHandler directive, but simply by having been ``pushed'' here. The nice thing about doing it this way is that this log handler can now share the closure variables created above. So, the outer handler runs before anything of interest has happened for a request, and the inner pushed handler happens as just about the last possible thing in a request. That's how we are sure to get the right resources spent on this request as accurately as possible.
Lines 58 to 64 ensure that any child processes are reaped, so that the child CPU times are properly updated. Without this, the child processes don't get reaped until the cleanup phase (after the logging phase), and the child CPU values are almost always zero.
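The reaping loop can be demonstrated on its own. This sketch forks a throwaway child so there is something to reap; it spells out WNOHANG from the POSIX module where the listing hard-codes the value 1:

```perl
#!/usr/bin/perl
use strict;
use POSIX qw(WNOHANG);

# Non-blocking reap of any finished children, as the log handler
# does. A throwaway child is forked here so there is a zombie to
# collect; in the handler, the children come from CGI-ish work.
my $pid = fork;
die "fork failed: $!" unless defined $pid;
exit 0 unless $pid;     # child exits immediately

sleep 1;                # give the child time to exit

my @reaped;
{
    my $kid = waitpid(-1, WNOHANG);
    if ($kid > 0) {
        push @reaped, $kid;
        redo;           # keep reaping until none are ready
    }
}
print "reaped: @reaped\n";
```

Because WNOHANG makes waitpid return immediately when no child is ready, the redo-block collects every finished child without ever blocking the logging phase.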
Line 67 cleverly replaces the values in @times with their delta values.
Lines 69 and 70 grab the two interesting Apache::Request objects. If a request has been redirected internally, the ``original'' request is not necessarily the request to which all the loggable activity has been charged, so we need to know both.
Lines 72 to 85 construct the new record to be added to the database. The values must be in the same order as the @FIELDS were given earlier. Most of the elements are just straight extractions from the appropriate Apache::Request objects, with two exceptions. The request time is a MySQL-compatible date-time value, created with the assistance of the ht_time function. And the request URL comes from the request string, not the uri value, because the request string contains the parameter items as well.
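The regex that pulls the URL out of the raw request line is simple enough to try standalone (the sample request line here is made up):

```perl
#!/usr/bin/perl
use strict;

# Extract the URL (including any query string) from a raw request
# line, with the same regex the handler applies to $r->the_request.
my $the_request = "GET /cgi/search?q=perl HTTP/1.0";
my $url = ($the_request =~ /^\S+\s+(\S+)/)[0];
print "$url\n";   # /cgi/search?q=perl
```

The `uri` method would have returned only `/cgi/search`; grabbing the second whitespace-delimited word of the request line keeps `?q=perl` too.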
Lines 87 to 93 connect to the database, and store the data. Now, this looks expensive to do on each request, but there's some caching going on here. First, I'm using Apache::DBI (described in a moment), which caches the connect requests, so that this happens only once per Apache server process. Second, the prepare_cached method enables the prepared SQL to be reused on subsequent hits. (Thanks to Tim Bunce for pointing that out to me while reviewing the code.) And the disconnect doesn't actually happen, but I include it for consistency if I ever choose not to use Apache::DBI.
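Conceptually, prepare_cached keeps a per-handle cache of statement handles keyed by the SQL text, so the expensive prepare happens once per distinct statement. Here's a toy stand-in for that behavior (this is an illustration only, not DBI's actual code):

```perl
#!/usr/bin/perl
use strict;

# A toy illustration of what prepare_cached buys us: the costly
# "prepare" runs once per distinct SQL string; later calls get the
# cached handle back. Hypothetical stand-in, not DBI internals.
my %cache;
my $prepares = 0;

sub prepare_cached_toy {
    my $sql = shift;
    $cache{$sql} ||= do { $prepares++; { sql => $sql } };
}

my $first  = prepare_cached_toy("INSERT INTO requests ...");
my $second = prepare_cached_toy("INSERT INTO requests ...");
print "prepared $prepares time(s)\n";   # prepared 1 time(s)
```

In the handler, that means only the very first logged hit in each Apache child pays for parsing the INSERT statement.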
If there's a would-be fatal error of any kind in lines 88 to 92, lines 94 to 96 catch it and write the error to the log. For example, if the database could not be contacted, or the database was full, we'd see that in the error log.
So, I stuck this module somewhere in the @INC path for my embedded mod_perl interpreter, and then added the following lines to my main configuration file for Apache:

## DBI logging
PerlModule Apache::DBI
PerlModule DBI
PerlInitHandler Stonehenge::DBILog
The Apache::DBI module ensures a mostly-transparent caching of all DBI objects created in the server. The current design requires that Apache::DBI be loaded before the DBI module, but I understand discussion is underway to remove this restriction. In any event, the DBI module is also pulled in, and finally a PerlInitHandler is established with our logging module as the handler definition.
After restarting the server, our database begins to fill up. At this point, we can use the mysql command to perform simple ad-hoc requests, but it's more interesting to let Perl do the canned reports.
Part 2: Analyzing the results
So, now that I have more than a week's worth of data, I want to see just how crazy my server has been working over the past week. I'll do this by making a single HTML page that has some of the most interesting statistics. This is just a demo, so I don't have to be comprehensive. In real life, I'd probably have a series of pages with different views and time periods.
To grab the statistics, I'll run a Perl program periodically to update an HTML file. I do this with cron on this Linux box, with an entry that looks like:
52 0-23/3 * * * /home/merlyn/bin/AnalogDB
And cron interprets this as ``run this command every 3 hours between midnight and 11pm, at 52 minutes past the hour''. So it runs at 00:52, 03:52, 06:52, and so on up to 21:52. Running it more often would keep the page fresher, but at some point, you'll be burning more CPU talking about how you burn CPU than you actually burned doing the real work.
The contents of AnalogDB are present in [listing two, below]. I'll point out just the highlights, since it's basically a preamble, five different reports, and a postamble.
The preamble in lines 1 to 44 sets up this program to generate an HTML file on STDOUT. STDOUT has been redirected to a temporary file defined in line 32, which lives in the same directory as the ultimate output file named in line 12. The purpose of the temporary file is so that we don't confuse any existing readers until we have a complete HTML file to show, since this execution will probably take 30 seconds or so.
Lines 14 through 27 are also worth noting, as they define the subset of the records that I'm willing to describe in public. For various reasons, I have private areas of my webserver, and private accesses to the webserver for testing, and this SQL helps remove those from the public statistics.
Also note the SQL control in line 37. This MySQL-specific operation permits some queries to succeed where they might previously have failed, at the expense of having all queries be temporarily saved to disk on the server. Seems like a fine tradeoff to me, but it's not the default.
And finally note that the HTML is being generated with the CGI.pm shortcuts. I like that much better than I like typing less-than and greater-than.
A typical report is shown in lines 45 through 70. Lines 45 through 47 show the description of the report. Lines 49 through 63 dump an HTML table with the result contents. The table header gets created by lines 51 through 55, while the body is the result of the map beginning in line 56.
To read this, we go to the data source in line 63, which generates a query to the database, and results in an arrayref of rows, each of which is an arrayref of the fields in the row. This arrayref is dereferenced in line 63 (the outer @{...}) and then spoonfed into map. Each $_ inside the map is therefore an arrayref for one row, which gets dereferenced and stored into named variables in line 57. Since two of the three columns are URLs, I wrap the data cell in a call to show_link (defined later) to turn it into a clickable link.
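The shape of the data is easier to see with the database taken out of the picture. This sketch uses mock rows standing in for what selectall_arrayref would return from the live query:

```perl
#!/usr/bin/perl
use strict;

# selectall_arrayref returns a reference to an array of rows, each
# row itself an arrayref of column values. Mock data here, in the
# (hits, referer, url) shape of the "incoming links" query.
my $rows = [
  [ 42, "http://example.com/from",      "/cgi/search" ],
  [ 17, "http://example.com/elsewhere", "/perl/"      ],
];

my @lines = map {
  my ($hits, $referer, $url) = @$_;
  "$hits: $referer -> $url";
} @$rows;

print "$_\n" for @lines;
```

The real report does exactly this, except that each row becomes a Tr() of td() cells instead of a plain string.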
The postamble in lines 176 to 181 finishes up the HTML file, and renames it over the top of the previous HTML file, meaning that users will now see the updated statistics. Lines 183 to 188 define the show_link routine, returning an appropriately escaped HTML string for a link to the given URL.
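That write-to-temp-then-rename move is worth calling out on its own, since rename on the same filesystem is atomic: readers see either the old page or the new one, never a half-written file. A minimal sketch, using a throwaway temp directory in place of my real paths:

```perl
#!/usr/bin/perl
use strict;
use File::Temp qw(tempdir);

# Write the fresh page to a "~NEW~" file, then atomically rename it
# over the published name, as AnalogDB does. The temp directory here
# stands in for the real output location.
my $dir    = tempdir(CLEANUP => 1);
my $output = "$dir/stats.html";
my $tmp    = "$output~NEW~";

open my $fh, ">", $tmp or die "Cannot create $tmp: $!";
print $fh "<html><body>fresh stats</body></html>\n";
close $fh;

rename $tmp, $output or die "Cannot rename $tmp to $output: $!";
print -e $output ? "updated\n" : "missing\n";
```

Since the temporary file lives in the same directory as the target, the rename never crosses a filesystem boundary, which is what keeps it atomic.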
And that's all there is to it! If you want, you can use PNGgraph or GNUplot to make pictures from the data. I may even show that in a future column if I get ambitious. But until next month, enjoy!
Listings
=0=     #### LISTING ONE ####
=1=     package Stonehenge::DBILog;
=2=     use strict;
=3=
=4=     ## usage: PerlInitHandler Stonehenge::DBILog
=5=
=6=
=7=
=8=     use Apache::Constants qw(OK DECLINED);
=9=     use DBI ();
=10=    use Apache::Util qw(ht_time);
=11=
=12=    my $DSN = 'dbi:mysql:stonehenge_httpd';
=13=    my $DB_TABLE = 'requests';
=14=    my $DB_AUTH = 'sekretuser:sekretpassword';
=15=
=16=    my @FIELDS =
=17=      qw(when host method url user referer browser status bytes
=18=         wall cpuuser cpusys cpucuser cpucsys);
=19=    my $INSERT =
=20=      "INSERT INTO $DB_TABLE (".
=21=      (join ",", @FIELDS).
=22=      ") VALUES(".
=23=      (join ",", ("?") x @FIELDS).
=24=      ")";
=25=
=26=    =for SQL
=27=
=28=    create table requests (
=29=        when datetime not null,
=30=        host varchar(255) not null,
=31=        method varchar(8) not null,
=32=        url varchar(255) not null,
=33=        user varchar(50),
=34=        referer varchar(255),
=35=        browser varchar(255),
=36=        status smallint(3) default 0,
=37=        bytes int(8),
=38=        wall smallint(5),
=39=        cpuuser float(8),
=40=        cpusys float(8),
=41=        cpucuser float(8),
=42=        cpucsys float(8)
=43=    );
=44=
=45=    =cut
=46=
=47=    sub handler {
=48=      my $r = shift;
=49=      return DECLINED unless $r->is_initial_req;
=50=
=51=      my @times = (time, times); # closure
=52=
=53=      $r->push_handlers
=54=        (
=55=         PerlLogHandler =>
=56=         sub {
=57=           ## first, reap any zombies so child CPU is proper:
=58=           {
=59=             my $kid = waitpid(-1, 1);
=60=             if ($kid > 0) {
=61=               # $r->log->warn("found kid $kid"); # DEBUG
=62=               redo;
=63=             }
=64=           }
=65=
=66=           ## delta these times:
=67=           @times = map { $_ - shift @times } time, times;
=68=
=69=           my $orig = shift;
=70=           my $r = $orig->last;
=71=
=72=           my @data =
=73=             (
=74=              ht_time($orig->request_time, '%Y-%m-%d %H:%M:%S', 0),
=75=              $r->get_remote_host,
=76=              $r->method,
=77=              # $orig->uri,
=78=              ($r->the_request =~ /^\S+\s+(\S+)/)[0],
=79=              $r->connection->user,
=80=              $r->header_in('Referer'),
=81=              $r->header_in('User-agent'),
=82=              $orig->status,
=83=              $r->bytes_sent,
=84=              @times,
=85=             );
=86=
=87=           eval {
=88=             my $dbh = DBI->connect($DSN, (split ':', $DB_AUTH),
=89=                                    { RaiseError => 1 });
=90=             my $sth = $dbh->prepare_cached($INSERT);
=91=             $sth->execute(@data);
=92=             $dbh->disconnect;
=93=           };
=94=           if ($@) {
=95=             $r->log->error("dbi: $@");
=96=           }
=97=
=98=           return DECLINED;
=99=         });
=100=
=101=     return DECLINED;
=102=   }
=103=
=104=   1;

=0=     #### LISTING TWO ####
=1=     #!/usr/bin/perl
=2=     use strict;
=3=     $|++;
=4=
=5=     use DBI ();
=6=     use CGI::Pretty qw(:all -no_debug);
=7=
=8=     ## BEGIN CONFIG ##
=9=
=10=    my $DSN = 'dbi:mysql:stonehenge_httpd';
=11=    my $DB_AUTH = 'sekretuser:sekretpassword';
=12=    my $OUTPUT = "/home/merlyn/Html/stats.html";
=13=    my $DAY = 7;
=14=    my $COMMON = <<END_COMMON;
=15=    (
=16=      (
=17=        Url not like '/%/%'
=18=        or Url like '/perltraining/%'
=19=        or Url like '/merlyn/%'
=20=        or Url like '/cgi/%'
=21=        or Url like '/perl/%'
=22=        or Url like '/icons/%'
=23=        or Url like '/books/%'
=24=      )
=25=      and Host not like '%.stonehenge.%'
=26=      and When > date_sub(now(), interval $DAY day)
=27=    )
=28=    END_COMMON
=29=
=30=    ## END CONFIG ##
=31=
=32=    my $TMP = "$OUTPUT~NEW~";
=33=    open STDOUT, ">$TMP" or die "Cannot create $TMP: $!";
=34=    chmod 0644, $TMP or warn "Cannot chmod $TMP: $!";
=35=
=36=    my $dbh = DBI->connect($DSN, (split ':', $DB_AUTH), { RaiseError => 1 });
=37=    $dbh->do("SET OPTION SQL_BIG_TABLES = 1");
=38=
=39=    print
=40=      start_html("Web server activity"),
=41=      h1("Web server activity at ".localtime),
=42=      p("This page gives web server activity viewed in various ways,",
=43=        "updated frequently for information over the prior seven days.");
=44=
=45=    print
=46=      h2("Incoming links"),
=47=      p("The following links were the most frequent ways that people found to this site.");
=48=
=49=    print
=50=      table({Cellspacing => 0, Cellpadding => 2, Border => 1},
=51=            Tr(
=52=               th("Hits in<br>past $DAY days"),
=53=               th("Source of link"),
=54=               th("Target of link"),
=55=              ),
=56=            map {
=57=              my ($hits, $referer, $url) = @$_;
=58=              Tr(
=59=                 td($hits),
=60=                 td(show_link($referer)),
=61=                 td(show_link($url)),
=62=                );
=63=            } @{$dbh->selectall_arrayref(<<END)});
=64=    select count(*) as Hits, Referer, Url
=65=    from requests
=66=    where $COMMON and Referer not like '%.stonehenge.%'
=67=    group by Referer, Url
=68=    order by Hits desc
=69=    limit 30
=70=    END
=71=
=72=    print
=73=      h2("Outgoing links"),
=74=      p("The following links were the most frequent ways that people left this site.");
=75=
=76=    print
=77=      table({Cellspacing => 0, Cellpadding => 2, Border => 1},
=78=            Tr(
=79=               th("Hits in<br>past $DAY days"),
=80=               th("Source of link"),
=81=               th("Target of link"),
=82=              ),
=83=            map {
=84=              my ($hits, $referer, $url) = @$_;
=85=              $url =~ s#^/cgi/go/##;
=86=              Tr(
=87=                 td($hits),
=88=                 td(show_link($referer)),
=89=                 td(show_link($url)),
=90=                );
=91=            } @{$dbh->selectall_arrayref(<<END)});
=92=    select count(*) as Hits, Referer, Url
=93=    from requests
=94=    where $COMMON and Url like '/cgi/go/%'
=95=    group by Referer, Url
=96=    order by Hits desc
=97=    limit 30
=98=    END
=99=
=100=   print
=101=     h2("CPU Burning"),
=102=     p("The following hosts burnt the most cumulative CPU on the server.");
=103=
=104=   print
=105=     table({Cellspacing => 0, Cellpadding => 2, Border => 1},
=106=           Tr(
=107=              th("Total CPU seconds<br>in past $DAY days"),
=108=              th("Host making the request"),
=109=             ),
=110=           map {
=111=             my ($cpu, $host) = @$_;
=112=             Tr(
=113=                td($cpu),
=114=                td($host),
=115=               );
=116=           } @{$dbh->selectall_arrayref(<<END)});
=117=   select sum(cpuuser+cpusys+cpucuser+cpucsys) as Cpu, Host
=118=   from requests
=119=   where $COMMON
=120=   group by Host
=121=   order by Cpu desc
=122=   limit 30
=123=   END
=124=
=125=   print
=126=     h2("CPU Hogging"),
=127=     p("The following periods were the busiest in terms of total CPU used.");
=128=
=129=   print
=130=     table({Cellspacing => 0, Cellpadding => 2, Border => 1},
=131=           Tr(
=132=              th("15-minute period beginning<br>(localtime)"),
=133=              th("Total CPU seconds<br>burnt in the period"),
=134=             ),
=135=           map {
=136=             my ($period, $cpu) = @$_;
=137=             Tr(
=138=                td($period),
=139=                td($cpu),
=140=               );
=141=           } @{$dbh->selectall_arrayref(<<END)});
=142=   select
=143=     from_unixtime(15*60*floor(unix_timestamp(when)/(15*60))) as Period,
=144=     sum(cpuuser+cpusys+cpucuser+cpucsys) as Cpu
=145=   from requests
=146=   where $COMMON group by Period
=147=   order by Cpu desc
=148=   limit 30
=149=   END
=150=
=151=   print
=152=     h2("User Agent Bytesucking"),
=153=     p("The following User Agents sucked the most cumulative bytes on the server.");
=154=
=155=   print
=156=     table({Cellspacing => 0, Cellpadding => 2, Border => 1},
=157=           Tr(
=158=              th("Total Bytes<br>in past $DAY days"),
=159=              th("User Agent making the request"),
=160=             ),
=161=           map {
=162=             my ($sent, $agent) = @$_;
=163=             Tr(
=164=                td($sent),
=165=                td($agent),
=166=               );
=167=           } @{$dbh->selectall_arrayref(<<END)});
=168=   select sum(Bytes) as Sent, Browser
=169=   from requests
=170=   where $COMMON
=171=   group by Browser
=172=   order by Sent desc
=173=   limit 30
=174=   END
=175=
=176=   print end_html;
=177=
=178=   $dbh->disconnect;
=179=
=180=   close STDOUT;
=181=   rename $TMP, $OUTPUT or die "Cannot rename $TMP to $OUTPUT: $!";
=182=
=183=   sub show_link {
=184=     use HTML::Entities ();
=185=     my $url = shift;
=186=     my $html_escaped_url = HTML::Entities::encode_entities($url);
=187=     a({Href => $html_escaped_url}, $html_escaped_url);
=188=   }