Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Web Techniques Column 48 (Apr 2000)

[suggested title: Web access logs with DBI]

A few months back, I was flipping through the wonderful Writing Apache Modules with Perl and C by Doug MacEachern and fellow WebTechniques columnist Lincoln Stein while trying to think of more bells and whistles to throw at my mod_perl-enabled webserver for http://PerlTraining.Stonehenge.com. I stumbled across the section that talks about using a database instead of a flat file for the access log. Now, I had just recently upgraded and cleaned up the MySQL database for my ISP, and thought it would be nice to have more advanced statistics and reporting. So, I jumped in.

Now this was quite a jump, because I hadn't even used any of Apache's custom log features. I had a simple NCSA-compatible access log, and a few specialized logs in a few directories via SSI loggers. So, I decided to go whole hog the other way. Not only am I logging the usual things (host, user agent, bytes sent, and so on), but I even log the CPU times for each transaction! This lets me see which of my URLs are burning up more CPU so I can figure out ways to cache them better, which is especially important on a machine shared with other users.

And the log is being written immediately to a DBI-based MySQL database. The power here is that I can use SQL queries to generate ad-hoc reports on an up-to-the-minute basis, as well as get the general reports canned up into a Perl data structure with relative ease. No more parsing flat files (not that this was much trouble with Perl anyway). The downside is that I'm now burning 1 megabyte per day for storage inside the MySQL database, but my sysadmin said just to let them burn, since disk is cheap.

Part 0: What I needed to have already in place

First, I needed to have an Apache server with mod_perl installed, including handling of all the phases. This means I said something like EVERYTHING=1 during the configuration (see the mod_perl documentation for details).
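
For the record, a typical mod_perl 1.x build that enables every handler phase looks roughly like this (the Apache source path is just a placeholder for wherever your tree lives):

    perl Makefile.PL EVERYTHING=1 APACHE_SRC=../apache_1.3.x/src \
      DO_HTTPD=1 USE_APACI=1
    make
    make test
    make install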

Then, I needed to have access to a live MySQL database, with a valid user and password. That's documented somewhat fuzzily in the MySQL documentation, and a little more clearly in the MySQL and mSQL book from O'Reilly and Associates.

Third, I needed to have the DBI module installed, along with the Apache::DBI adaptor. The CPAN holds these, and they're relatively painless to install.

Finally, I had to have the DBD::mysql module installed, again from the CPAN. This was mostly painless, although the installation asks a lot of questions, wants a live MySQL database to do its testing, and doesn't use the username and password I gave it for all of the testing steps.
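
If you're starting from a bare system, the CPAN shell will fetch and build all of these; something like the following should do it (DBD::mysql is the step that asks the questions):

    perl -MCPAN -e 'install DBI'
    perl -MCPAN -e 'install Apache::DBI'
    perl -MCPAN -e 'install DBD::mysql'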

Part 1: Gathering the Statistics

In order to have statistics to analyze, we have to have Apache generate them. Based on the code I saw in Doug and Lincoln's book, I wrote my DBI logger, presented in [listing one, below].

Line 1 sets up the unique package for this handler. I prefix all my local packages with Stonehenge:: (my company name) to completely distinguish them from any distribution. Line 2 enables the normal compiler restrictions. Line 4 is a comment to remind me that this handler is meant to be a PerlInitHandler, the first possible phase under which we can take control for each request.

Lines 8 through 10 pull in some useful modules for this handler. Apache::Constants gives us useful values for OK and DECLINED, used at the end of a handler to inform the server of the outcome of our work. DBI gets pulled in as well, although we import nothing from it. And Apache::Util gives us the ht_time function: a very fast way to convert a Unix timestamp into a string with a format of our choosing.

Lines 12 through 14 give the DBI-specific information, including the DBI database specifier, the table being used, and the database username and password. Obviously, I've put some fake stuff in here for the heck of it, but you can hopefully see the format.

Lines 16 through 18 define the names of the database columns for the logged fields, and the order in which their values will be generated.

Lines 19 through 24 define an SQL INSERT command to store a new record. The question-mark placeholders mean we never have to worry about quoting the values ourselves.
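
For reference, the string that ends up in $INSERT looks like this (wrapped here to fit the page):

    INSERT INTO requests (when,host,method,url,user,referer,browser,
      status,bytes,wall,cpuuser,cpusys,cpucuser,cpucsys)
    VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?)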

Lines 26 to 45 are actually a giant comment as far as Perl is concerned. I'm using a Perl ``POD'' directive to store an SQL statement inside the Perl program. This statement is meant to be cut and pasted into the mysql interactive client to create the table needed for this log; Perl itself ignores it.
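
The general trick: perl skips everything from a POD directive at the start of a line down to the matching =cut, so any block of foreign text can ride along inside the source. A stripped-down sketch:

    =for SQL

    -- anything between the directive above and the =cut below is
    -- invisible to perl, but stays right next to the code that uses it

    =cut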

Lines 47 to 102 define the handler. This subroutine will be invoked at each request, very early in the request handling phases. The first parameter (picked off in line 48) is the Apache request object. Line 49 ensures that we are processing only the initial request (one that should be logged), and not any later internal redirects or subrequests, such as those triggered by include files.

Line 51 starts the counter ticking for wall-clock time and CPU time. This array (@times) is not referenced again in the main subroutine, but is held as a ``closure'' variable for the subroutine reference created in the lines starting at line 56. The five elements are wall-clock timestamp, user CPU, system CPU, child user CPU, and child system CPU.

Lines 53 through 99 form a log handler. This log handler is not installed using the normal PerlLogHandler directive, but simply by having been ``pushed'' here. The nice thing about doing it this way is that this log handler can now share the closure variables created above. So, the outer handler runs before anything of interest has happened for a request, and the inner pushed handler runs as just about the last possible thing in a request. That's how we can measure the resources spent on this request as accurately as possible.
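
Stripped of the details, the pattern looks like this: capture a lexical in the init handler, then push a log handler that can still see it. This is a minimal sketch of the same idea, assuming the same Apache::Constants import as the listing; it is not the listing itself:

    sub handler {
      my $r = shift;
      my $start = time;                 # lexical captured by the closure below
      $r->push_handlers(PerlLogHandler => sub {
        my $log_r = shift;
        $log_r->log->warn("request took " . (time - $start) . " seconds");
        return DECLINED;
      });
      return DECLINED;
    }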

Lines 58 to 64 ensure that any child processes are reaped, so that the child CPU times are properly updated. Without this, the child processes don't get reaped until the cleanup phase (after the logging phase), and the child CPU values are almost always zero.
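
The listing passes a literal 1 as the second argument to waitpid, which is the value of WNOHANG on Linux. Spelled with the POSIX constant, the same non-blocking reap loop can be written in one line (a sketch):

    use POSIX qw(WNOHANG);
    1 while waitpid(-1, WNOHANG) > 0;   # reap every finished child without blocking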

Line 67 cleverly replaces the values in @times with their delta values.
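
If the shift-inside-map trick looks mysterious, here's a tiny standalone example with made-up numbers: map walks the new values left to right, while each shift peels the matching old value off the front of the array.

    my @before = (100, 2, 1);                        # pretend (time, times) at the start
    my @after  = (130, 5, 3);                        # the same clocks at logging time
    my @delta  = map { $_ - shift @before } @after;  # (30, 3, 2)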

Lines 69 and 70 grab the two interesting Apache request objects. If a request has been redirected internally, the ``original'' request is not necessarily the request to which all the loggable activity has been charged, so we need to know both.

Lines 72 to 85 construct the new record to be added to the database. The values must be in the same order as the @FIELDS were given earlier. Most of the elements are straight extractions from the appropriate Apache request objects, with two exceptions. The request time is a MySQL-compatible date-time value, created with the help of the ht_time function. And the request URL comes from the raw request string, not the uri value, because the request string also contains the query parameters.
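
For illustration, here are those two extractions again with made-up values shown in the comments (the URL and timestamp are invented, not from my logs):

    # MySQL-friendly timestamp (the third argument 0 means local time),
    # producing something like "2000-02-29 09:52:07"
    my $stamp = ht_time($orig->request_time, '%Y-%m-%d %H:%M:%S', 0);

    # "GET /cgi/mylog?keyword=perl HTTP/1.0"  becomes  "/cgi/mylog?keyword=perl"
    my ($url) = $r->the_request =~ /^\S+\s+(\S+)/;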

Lines 87 to 93 connect to the database, and store the data. Now, this looks expensive to do on each request, but there's some caching going on here. First, I'm using Apache::DBI (described in a moment), which caches the connect requests, so that this happens only once per Apache server process. Second, the prepare_cached method enables the prepared SQL to be reused on subsequent hits. (Thanks to Tim Bunce for pointing that out to me while reviewing the code.) And the disconnect doesn't actually happen, but I include it for consistency in case I ever choose not to use Apache::DBI.
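
To make the caching concrete, here's what Apache::DBI effectively gives us inside one Apache child process (an illustration, not part of the listing):

    my ($user, $pass) = split /:/, $DB_AUTH;
    my $dbh1 = DBI->connect($DSN, $user, $pass, { RaiseError => 1 });
    my $dbh2 = DBI->connect($DSN, $user, $pass, { RaiseError => 1 });
    # with Apache::DBI loaded, $dbh2 is the same cached handle as $dbh1,
    # and disconnect is intercepted, so the connection stays open
    $dbh1->disconnect;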

If there's a would-be fatal error of any kind in lines 88 to 92, lines 94 to 96 catch it and write the error to the log. For example, if the database could not be reached, or the database was full, we'd see that in the error log.

So, I stuck this module somewhere in the @INC path for my embedded mod_perl interpreter, and then added the following lines to my main configuration file for Apache:

    ## DBI logging
    PerlModule Apache::DBI
    PerlModule DBI
    PerlInitHandler Stonehenge::DBILog

The Apache::DBI module ensures a mostly-transparent caching of all DBI objects created in the server. The current design requires that Apache::DBI be loaded before the DBI module, but I understand discussion is underway to remove this restriction. In any event, the DBI module is also pulled in, and finally a PerlInitHandler is established with our logging module as the handler definition.

After restarting the server, our database begins to fill up. At this point, we can use the mysql command to perform simple ad-hoc requests, but it's more interesting to let Perl do the canned reports.

Part 2: Analyzing the results

So, now that I have more than a week's worth of data, I want to see just how crazy my server has been working over the past week. I'll do this by making a single HTML page that has some of the most interesting statistics. This is just a demo, so I don't have to be comprehensive. In real life, I'd probably have a series of pages with different views and time periods.

To grab the statistics, I'll run a Perl program periodically to update an HTML file. I do this with cron on this Linux box, with an entry that looks like:

    52 0-23/3 * * * /home/merlyn/bin/AnalogDB

And cron interprets this as ``run this command every 3 hours between midnight and 11pm, at 52 minutes past the hour''. So it runs at 00:52, 03:52, 06:52 and so on, up to 21:52. Running it more often would make the statistics fresher, but at some point you'll be burning more CPU talking about how you burn CPU than you actually burn doing the real work.

The contents of AnalogDB appear in [listing two, below]. I'll point out just the highlights, since it's basically a preamble, five similar reports, and a postamble.

The preamble in lines 1 to 44 sets up this program to generate an HTML file on STDOUT. STDOUT has been redirected to a temporary file defined in line 32, which lives in the same directory as the ultimate output file named in line 12. The temporary file keeps us from confusing any current readers with a half-written page, since each run will probably take 30 seconds or so; they keep seeing the old file until a complete new one is ready.

Lines 14 through 27 are also worth noting, as they define the subset of the records that I'm willing to describe in public. For various reasons, I have private areas of my webserver, and private accesses to the webserver for testing, and this SQL helps remove those from the public statistics.

Also note the SQL control statement in line 37. This MySQL-specific option permits some queries to succeed where they would otherwise fail, at the expense of having the temporary tables for all queries saved to disk on the server. Seems like a fine tradeoff to me, but it's not the default.

And finally note that the HTML is being generated with the CGI.pm shortcuts. I like that much better than I like typing less-than and greater-than.
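
If you haven't used the CGI.pm shortcuts, each HTML element is a function: an optional leading hashref supplies attributes, and the remaining arguments become the contents. A quick made-up example:

    use CGI::Pretty qw(:all);
    print table({Border => 1},
                Tr(th('Hits'), th('URL')),
                Tr(td(42),     td('/perl/index.html')));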

A typical report is shown in lines 45 through 70. Lines 45 through 47 show the description of the report. Lines 49 through 63 dump an HTML table with the result contents. The table header gets created by lines 51 through 55, while the body is the result of the map beginning in line 56.

To read this, we go to the data source in line 63, which generates a query to the database, and results in an arrayref of rows, each of which is an arrayref of the fields in the row. This arrayref is dereferenced in line 63 (the outer @{...}) and then spoonfed into map. Each $_ inside the map is therefore an arrayref for one row, which gets dereferenced and stored into named variables in line 57. Since two of the three columns are URLs, I wrap the data cell in a call to show_link (defined later) to turn it into a clickable link.
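
If the nested map and dereferences are hard to follow, here's the same flow written longhand with a simplified, made-up query:

    my $rows = $dbh->selectall_arrayref
      ("select count(*) as Hits, Url from requests group by Url");
    for my $row (@$rows) {          # each row is itself an arrayref of columns
      my ($hits, $url) = @$row;
      print "$hits  $url\n";
    }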

The postamble in lines 176 to 181 finishes up the HTML file and renames it over the top of the previous HTML file, so readers now see the updated statistics. Lines 183 to 188 define the show_link routine, returning an appropriately escaped HTML string for a link to the given URL.

And that's all there is to it! If you want, you can use PNGgraph or GNUplot to make pictures from the data. I may even show that in a future column if I get ambitious. But until next month, enjoy!

Listings

        =0=     #### LISTING ONE ####
        =1=     package Stonehenge::DBILog;
        =2=     use strict;
        =3=     
        =4=     ## usage: PerlInitHandler Stonehenge::DBILog
        =5=     
        =6=     
        =7=     
        =8=     use Apache::Constants qw(OK DECLINED);
        =9=     use DBI ();
        =10=    use Apache::Util qw(ht_time);
        =11=    
        =12=    my $DSN = 'dbi:mysql:stonehenge_httpd';
        =13=    my $DB_TABLE = 'requests';
        =14=    my $DB_AUTH = 'sekretuser:sekretpassword';
        =15=    
        =16=    my @FIELDS =
        =17=      qw(when host method url user referer browser status bytes
        =18=         wall cpuuser cpusys cpucuser cpucsys);
        =19=    my $INSERT =
        =20=      "INSERT INTO $DB_TABLE (".
        =21=      (join ",", @FIELDS).
        =22=      ") VALUES(".
        =23=      (join ",", ("?") x @FIELDS).
        =24=      ")";
        =25=    
        =26=    =for SQL
        =27=    
        =28=    create table requests (
        =29=      when datetime not null,
        =30=      host varchar(255) not null,
        =31=      method varchar(8) not null,
        =32=      url varchar(255) not null,
        =33=      user varchar(50),
        =34=      referer varchar(255),
        =35=      browser varchar(255),
        =36=      status smallint(3) default 0,
        =37=      bytes int(8),
        =38=      wall smallint(5),
        =39=      cpuuser float(8),
        =40=      cpusys float(8),
        =41=      cpucuser float(8),
        =42=      cpucsys float(8)
        =43=    );
        =44=      
        =45=    =cut
        =46=    
        =47=    sub handler {
        =48=      my $r = shift;
        =49=      return DECLINED unless $r->is_initial_req;
        =50=    
        =51=      my @times = (time, times);    # closure
        =52=    
        =53=      $r->push_handlers
        =54=        (
        =55=         PerlLogHandler =>
        =56=         sub {
        =57=           ## first, reap any zombies so child CPU is proper:
        =58=           {
        =59=             my $kid = waitpid(-1, 1);
        =60=             if ($kid > 0) {
        =61=               # $r->log->warn("found kid $kid"); # DEBUG
        =62=               redo;
        =63=             }
        =64=           }
        =65=    
        =66=           ## delta these times:
        =67=           @times = map { $_ - shift @times } time, times;
        =68=    
        =69=           my $orig = shift;
        =70=           my $r = $orig->last;
        =71=    
        =72=           my @data =
        =73=             (
        =74=              ht_time($orig->request_time, '%Y-%m-%d %H:%M:%S', 0),
        =75=              $r->get_remote_host,
        =76=              $r->method,
        =77=              # $orig->uri,
        =78=              ($r->the_request =~ /^\S+\s+(\S+)/)[0],
        =79=              $r->connection->user,
        =80=              $r->header_in('Referer'),
        =81=              $r->header_in('User-agent'),
        =82=              $orig->status,
        =83=              $r->bytes_sent,
        =84=              @times,
        =85=             );
        =86=    
        =87=           eval {
        =88=             my $dbh = DBI->connect($DSN, (split ':', $DB_AUTH),
        =89=                                   { RaiseError => 1 });
        =90=             my $sth = $dbh->prepare_cached($INSERT);
        =91=             $sth->execute(@data);
        =92=             $dbh->disconnect;
        =93=           };
        =94=           if ($@) {
        =95=             $r->log->error("dbi: $@");
        =96=           }
        =97=    
        =98=           return DECLINED;
        =99=         });
        =100=     
        =101=     return DECLINED;
        =102=   }
        =103=   
        =104=   1;
        =0=     #### LISTING TWO ####
        =1=     #!/usr/bin/perl
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use DBI ();
        =6=     use CGI::Pretty qw(:all -no_debug);
        =7=     
        =8=     ## BEGIN CONFIG ##
        =9=     
        =10=    my $DSN = 'dbi:mysql:stonehenge_httpd';
        =11=    my $DB_AUTH = 'sekretuser:sekretpassword';
        =12=    my $OUTPUT = "/home/merlyn/Html/stats.html";
        =13=    my $DAY = 7;
        =14=    my $COMMON = <<END_COMMON;
        =15=    (
        =16=      (
        =17=        Url not like '/%/%'
        =18=        or Url like '/perltraining/%'
        =19=        or Url like '/merlyn/%'
        =20=        or Url like '/cgi/%'
        =21=        or Url like '/perl/%'
        =22=        or Url like '/icons/%'
        =23=        or Url like '/books/%'
        =24=      )
        =25=      and Host not like '%.stonehenge.%'
        =26=      and When > date_sub(now(), interval $DAY day)
        =27=    )
        =28=    END_COMMON
        =29=    
        =30=    ## END CONFIG ##
        =31=    
        =32=    my $TMP = "$OUTPUT~NEW~";
        =33=    open STDOUT, ">$TMP" or die "Cannot create $TMP: $!";
        =34=    chmod 0644, $TMP or warn "Cannot chmod $TMP: $!";
        =35=    
        =36=    my $dbh = DBI->connect($DSN, (split ':', $DB_AUTH), { RaiseError => 1 });
        =37=    $dbh->do("SET OPTION SQL_BIG_TABLES = 1");
        =38=    
        =39=    print
        =40=      start_html("Web server activity"),
        =41=      h1("Web server activity at ".localtime),
        =42=      p("This page gives web server activity viewed in various ways,",
        =43=        "updated frequently for information over the prior seven days.");
        =44=    
        =45=    print
        =46=      h2("Incoming links"),
        =47=      p("The following links were the most frequent ways that people found to this site.");
        =48=    
        =49=    print
        =50=      table({Cellspacing => 0, Cellpadding => 2, Border => 1},
        =51=            Tr(
        =52=               th("Hits in<br>past $DAY days"),
        =53=               th("Source of link"),
        =54=               th("Target of link"),
        =55=              ),
        =56=            map {
        =57=              my ($hits, $referer, $url) = @$_;
        =58=              Tr(
        =59=                 td($hits),
        =60=                 td(show_link($referer)),
        =61=                 td(show_link($url)),
        =62=                );
        =63=            } @{$dbh->selectall_arrayref(<<END)});
        =64=    select count(*) as Hits, Referer, Url
        =65=    from requests
        =66=    where $COMMON and Referer not like '%.stonehenge.%'
        =67=    group by Referer, Url
        =68=    order by Hits desc
        =69=    limit 30
        =70=    END
        =71=    
        =72=    print
        =73=      h2("Outgoing links"),
        =74=      p("The following links were the most frequent ways that people left this site.");
        =75=    
        =76=    print
        =77=      table({Cellspacing => 0, Cellpadding => 2, Border => 1},
        =78=            Tr(
        =79=               th("Hits in<br>past $DAY days"),
        =80=               th("Source of link"),
        =81=               th("Target of link"),
        =82=              ),
        =83=            map {
        =84=              my ($hits, $referer, $url) = @$_;
        =85=              $url =~ s#^/cgi/go/##;
        =86=              Tr(
        =87=                 td($hits),
        =88=                 td(show_link($referer)),
        =89=                 td(show_link($url)),
        =90=                );
        =91=            } @{$dbh->selectall_arrayref(<<END)});
        =92=    select count(*) as Hits, Referer, Url
        =93=    from requests
        =94=    where $COMMON and Url like '/cgi/go/%'
        =95=    group by Referer, Url
        =96=    order by Hits desc
        =97=    limit 30
        =98=    END
        =99=    
        =100=   print
        =101=     h2("CPU Burning"),
        =102=     p("The following hosts burnt the most cumulative CPU on the server.");
        =103=   
        =104=   print
        =105=     table({Cellspacing => 0, Cellpadding => 2, Border => 1},
        =106=           Tr(
        =107=              th("Total CPU seconds<br>in past $DAY days"),
        =108=              th("Host making the request"),
        =109=             ),
        =110=           map {
        =111=             my ($cpu, $host) = @$_;
        =112=             Tr(
        =113=                td($cpu),
        =114=                td($host),
        =115=               );
        =116=           } @{$dbh->selectall_arrayref(<<END)});
        =117=   select sum(cpuuser+cpusys+cpucuser+cpucsys) as Cpu, Host
        =118=   from requests
        =119=   where $COMMON
        =120=   group by Host
        =121=   order by Cpu desc
        =122=   limit 30
        =123=   END
        =124=   
        =125=   print
        =126=     h2("CPU Hogging"),
        =127=     p("The following periods were the busiest in terms of total CPU used.");
        =128=   
        =129=   print
        =130=     table({Cellspacing => 0, Cellpadding => 2, Border => 1},
        =131=           Tr(
        =132=              th("15-minute period beginning<br>(localtime)"),
        =133=              th("Total CPU seconds<br>burnt in the period"),
        =134=             ),
        =135=           map {
        =136=             my ($period, $cpu) = @$_;
        =137=             Tr(
        =138=                td($period),
        =139=                td($cpu),
        =140=               );
        =141=           } @{$dbh->selectall_arrayref(<<END)});
        =142=   select
        =143=     from_unixtime(15*60*floor(unix_timestamp(when)/(15*60))) as Period,
        =144=     sum(cpuuser+cpusys+cpucuser+cpucsys) as Cpu
        =145=   from requests
        =146=   where $COMMON group by Period
        =147=   order by Cpu desc
        =148=   limit 30
        =149=   END
        =150=   
        =151=   print
        =152=     h2("User Agent Bytesucking"),
        =153=     p("The following User Agents sucked the most cumulative bytes on the server.");
        =154=   
        =155=   print
        =156=     table({Cellspacing => 0, Cellpadding => 2, Border => 1},
        =157=           Tr(
        =158=              th("Total Bytes<br>in past $DAY days"),
        =159=              th("User Agent making the request"),
        =160=             ),
        =161=           map {
        =162=             my ($sent, $agent) = @$_;
        =163=             Tr(
        =164=                td($sent),
        =165=                td($agent),
        =166=               );
        =167=           } @{$dbh->selectall_arrayref(<<END)});
        =168=   select sum(Bytes) as Sent, Browser
        =169=   from requests
        =170=   where $COMMON
        =171=   group by Browser
        =172=   order by Sent desc
        =173=   limit 30
        =174=   END
        =175=   
        =176=   print end_html;
        =177=   
        =178=   $dbh->disconnect;
        =179=   
        =180=   close STDOUT;
        =181=   rename $TMP, $OUTPUT or die "Cannot rename $TMP to $OUTPUT: $!";
        =182=   
        =183=   sub show_link {
        =184=     use HTML::Entities ();
        =185=     my $url = shift;
        =186=     my $html_escaped_url = HTML::Entities::encode_entities($url);
        =187=     a({Href => $html_escaped_url}, $html_escaped_url);
        =188=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.