Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is the article as the author originally submitted it for publication, not the version that resulted after the editors applied their creativity.

Linux Magazine Column 77 (Jan 2006)

[Suggested title: ``Using a database for a web server log (part 1)'']

The server for www.stonehenge.com (and a few other sublease customers like www.geekcruises.com) runs Apache with mod_perl.

Because the mod_perl processes tend to be heavy, I have the traditional caching reverse-proxy server stack, so that the mod_perl processes can be freed the moment they've completed their task, and some lightweight hits (such as icons) can be served without touching the backend at all.
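
For the curious, the front/back split looks something like this in the front server's httpd.conf; the backend port and the icons exclusion here are illustrative guesses, not my actual configuration:

  # lightweight front server: pass dynamic hits to the heavyweight
  # mod_perl backend (assumed here to be on port 8080), but serve
  # static icons directly without proxying
  ProxyRequests    Off
  ProxyPass        /icons/ !
  ProxyPass        /       http://localhost:8080/
  ProxyPassReverse /       http://localhost:8080/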

For many years, I've been logging the backend processes directly to a database with DBI, using a custom mod_perl logger-phase handler. However, because the front processes didn't have mod_perl, the only logging I could get was a traditional flat file.

Recently, I upgraded to Apache2 and mod_perl2 (which is a bit of a story in itself, but perhaps for another time). This required updating my DBI logger a bit. At the same time, I was also asked to track users to the GeekCruises web site, so I wanted to add mod_usertrack and also a way to link front hits with back hits for complete correlation.

At first, I just enabled the DBI logger in the front servers, but within a few days, things started breaking. Because I use PostgreSQL, each child process in both the front (many) and back (few) processes was maintaining a separate connection to the database, and this was causing me to run out of memory and available database connections. Even worse, if the connections were just full enough, a new connection for the registration or administration applications for GeekCruises would fail, leaving either the customer or the admin person upset and frustrated.

So, I quickly concluded that rather than having separate connections from each child to the database, I needed some sort of aggregator. I noticed that the automatic logroller application included with Apache is basically a separate process, forked during the startup phase, whose single pipe is shared by all of the children. After studying the C source code to see how this was accomplished, I decided I could do the same thing in a few lines of Perl.
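
Stripped of the Apache machinery, the pattern is just a pipe opened before the fork, so that every child inherits the same writer end. Here's a minimal standalone sketch (the ./my-logger aggregator program is a hypothetical stand-in):

  # open a pipe to a single aggregator *before* forking any workers
  open my $pipe, "|-", "./my-logger" or die "cannot fork logger: $!";
  { my $old = select($pipe); $| = 1; select($old); }  # unbuffer it

  for my $n (1 .. 3) {                  # stand-ins for Apache children
    defined(my $pid = fork) or die "fork: $!";
    next if $pid;                       # parent keeps forking
    print $pipe "child $$ logged a hit\n";  # same inherited pipe for all
    exit 0;
  }
  wait for 1 .. 3;                      # reap the pretend children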

The result is that I now have only two connections to the database (one used by the front cluster of servers, the other by the back cluster). My PostgreSQL server is happy, and so are the folks at GeekCruises.

The code from the mod_perl handler is in [listing one, below]. The code for the separate logger program will be presented in the next part of this article.

Line 1 marks this as Perl code, so that Emacs will switch to the right editing mode. Line 2 establishes the package for my handler. I put locally developed modules into the Stonehenge:: prefix to prevent collisions with CPAN modules. Line 3 enables the traditional compiler restrictions.

Line 5 reminds me what I need to do to use this module. I try to leave hints like this to remind myself what I was thinking at the time. In this case, I'm reminded of the necessary line in httpd.conf.

Line 9 defines the pnote name that will be used to hold the initial wall-clock and CPU time values. I'll set this up as an array ref at the beginning of the service cycle, and when we arrive at the logging phase, I can pull the values back to see what to subtract from the current values to get the deltas.

Line 11 defines the external program that accepts the logging values and writes them into the database. When this file is processed through Template Toolkit (as described in [my column from July 02]) during the installation process, the value inside the brackets is replaced by the path to the root of the installation tree. Beneath that root, I have the traditional sbin directory for things like apachectl, and I've added this logger program there as well. The logger program will be described in the next part of this article.
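
If you haven't seen that install-time expansion before, it might look roughly like this with the Template module; the [* ... *] tag style matches what the code expects, but the file names and prefix below are made up for illustration:

  use Template;
  # use [* ... *] tags rather than the default [% ... %]
  my $tt = Template->new({ START_TAG => quotemeta('[*'),
                           END_TAG   => quotemeta('*]') });
  $tt->process('DBILog.pm.tt',
               { env => { PREFIX => '/opt/web' } },  # hypothetical prefix
               'DBILog.pm')
    or die $tt->error;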

Line 15 imports the constants needed later in this module; line 16 pulls in Apache2::Util, whose ht_time routine gets called (fully qualified) further down.

Line 18 declares the variable that will hold the handle for the pipe to the logger program. This handle will be set up in the parent process, before Apache forks each child, and will thus be shared by the children. This is the key to getting only one logger process for each cluster of front and back servers.

Lines 20 to 44 define the subroutine that will be triggered in the parent Apache process just after the configuration files have been read, but before Apache begins forking children. Line 21 is copied from the documentation, and provides me with things I don't understand, so I mostly ignore them. However, $s is the server handle, and this is useful.

Line 23 can be uncommented so that I can see that my handler is being executed in the right process (and only once).

Lines 25 to 33 set up the handlers for the start-of-transaction (PostReadRequest) and end-of-transaction (Log) phases. I loop through all of the servers, skipping over any that are virtual servers. There should be just one real server, and that's where I want to add the other phase handlers, which I verified by uncommenting line 28. The corresponding handlers live in the current package under names matching the phase names, rather than the usual handler.

Lines 35 and 36 establish the pipe to the child process for the logger program. Note that death here means my master server process won't start, so this logger is critical to the entire operation. Lines 37 to 41 unbuffer this pipe.

Line 43 returns an OK status, which is probably ignored anyway, because we're not in a phase which is actually serving a response.

Once the child Apache process has accepted a connection and read the headers, the handler beginning in line 46 is executed.

Line 47 grabs the request object as the only parameter to the handler. Line 48 rejects subrequests, because we don't want to reset the counters if a page uses a subrequest, such as to perform an include file or to check the MIME type of a file in a directory listing.

Line 50 creates a 5-element list, consisting of the UNIX-epoch wall-clock time, and the four accumulated CPU times (user CPU, system CPU, child-user CPU, and child-system CPU). These values will form the ``start'' time for our current hit.
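
That one line is equivalent to naming the five values explicitly:

  # epoch seconds, then user/system CPU for this process and its children
  my ($wall, $cpu_user, $cpu_sys, $cpu_cuser, $cpu_csys) = (time, times);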

Line 52 pushes these values into the request pnotes structure as an array reference, so that we may access the values later.

Line 54 may be uncommented to verify that the right values are being stored.

Line 55 returns DECLINED so that other handlers at this phase will also be run. (I don't recall for certain, but I suspect I could return OK as well; DECLINED is safer and more familiar.)

Line 58 starts the handler that is called at the end of the request cycle, during the Log phase. Line 59 grabs the request object so that we can get further information. Line 60 can be uncommented to see that we are actually getting into this handler.

Lines 62 to 70 reap any finished child processes before we perform the logging. Generally, child processes have already been reaped, so this is typically a no-op, but if some children are still waiting around that would otherwise have been reaped in the cleanup phase, I want to make sure that their child CPU values don't get charged to the wrong request.

Line 72 extracts the 5-element list of time values saved during the earlier handler. If they don't exist, then line 74 issues a warning, and line 75 returns immediately, because there are no valid values for this hit. I have seen this message trigger occasionally: for example, when a request comes in but the browser disconnects before sending a complete header.

Line 79 subtracts the saved values from the current values, using a clever mapping that I figured out one day: I build a new list by pairing each new value with the corresponding old value, one by one, and subtracting the old from the new.
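
The trick is easier to see with small numbers: map walks the new values left to right, and each shift peels the matching old value off the front of the saved list:

  my @old   = (100, 2, 3, 4, 5);               # saved at request start
  my @new   = (107, 3, 3, 6, 5);               # current (time, times)
  my @delta = map { $_ - shift @old } @new;    # (7, 1, 0, 2, 0)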

Lines 81 and 82 attempt to extract the $orig (original) and $last request record from the current request record. When a request receives an internal redirect, such as for a 404 handler, an additional request record is added to the chain. I was having difficulty finding the values I wanted, as you'll see in the code a few lines later, so I decided to try looking at most of the likely elements of the request chain.

Line 83 extracts the server information object from the original request. Line 84 similarly extracts the connection object.

Line 85 fetches the server value, so that I can tell whether this is the lightweight reverse-proxy server or the heavyweight backend server. In the httpd.conf, I have:

  PerlSetVar "DBILogServer" "[* kind *]"

and ``kind'' expands to ``P'' for the proxy server and ``B'' for the back server. This string ends up in $server for logging.

Lines 87 to 108 remind me of the SQL that I used to create the requests table in my database. To load that data, I create a hash %data in line 112 and fill it in lines 114 to 136. Each key corresponds to a column name. The logger program doesn't know the column names; it simply takes key-value pairs and creates an INSERT for those names.
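
The real logger waits for part two, but to make that division of labor concrete, the receiving end might look something like this sketch (the DSN and error handling are placeholders, not the actual program):

  use DBI;
  my $dbh = DBI->connect("dbi:Pg:dbname=web", "", "",  # hypothetical DSN
                         { RaiseError => 1 });
  while (<STDIN>) {
    next unless /^<(.*)>$/;           # insist on clean record markers
    my %field = map { my ($k, $v) = split /=/, $_, 2;
                      ($k => pack "H*", $v) } split ' ', $1;
    my @cols = sort keys %field;
    my $sql  = sprintf "insert into requests (%s) values (%s)",
      join(",", @cols), join(",", ("?") x @cols);
    $dbh->do($sql, undef, @field{@cols});
  }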

Lines 114 to 120 try to find my unique-id (added by mod_unique_id). I was having trouble finding it on some of the requests, so this is probably overkill now, but I'm impatient. Similarly, lines 121 to 124 try to find the cookie I created with my own Perl version of mod_usertrack.

Line 125 gets the timestamp of the beginning of the request. Similarly, lines 126 through 135 fetch other parts of the information about the request. Line 136 uses a hash slice to set the five values for the wallclock time and CPU times all at once.

Once the hash has been created, I need to flatten the data into a simple string so that it can be written to the logger, which expects line-oriented input. Lines 138 to 143 create a space-separated list of values of the form columnname=hexstring, where hexstring is the hex-encoded value. I have to encode the value because it might contain spaces and other odd characters.
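
The encoding is a clean round trip with pack and unpack:

  my $value = "Mozilla/5.0 (X11; Linux i686)";
  my $hex   = unpack "H*", $value;    # "4d6f7a696c6c612f352e30..."
  my $back  = pack   "H*", $hex;      # identical to $value again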

Lines 146 and 147 write the data line to the logger handle, enclosed in angle brackets, which I use as start and end markers in the logger to make sure I got a clean record. Modern UNIX-like operating systems guarantee that a write to a pipe smaller than a certain size (PIPE_BUF, typically 4096 bytes) will not be interleaved with data from other writers, and this is the key to letting many child processes share one logger process.
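
If you want to be defensive about that limit, POSIX exposes the portable minimum, and a pre-write length check is cheap. This is just a sketch of the idea, not something the listing actually does:

  use POSIX qw(:limits_h);  # _POSIX_PIPE_BUF is the 512-byte guaranteed minimum
  my $record = "<$data>\n";
  warn "record of ", length($record), " bytes may interleave\n"
    if length($record) > _POSIX_PIPE_BUF;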

Well, that's the server-side part of the process. Next time, I'll talk about the contents of logger, and show some typical queries that this database log lets me perform. Until then, enjoy!

        =1=     #! perl
        =2=     package Stonehenge::DBILog;
        =3=     use strict;
        =4=     
        =5=     ## usage: PerlPostConfigHandler Stonehenge::DBILog
        =6=     
        =7=     ## config
        =8=     
        =9=     my $PNOTE_IDENTIFIER = 'DBILog_Times';
        =10=    
        =11=    my $LOGGER = "[* env.PREFIX *]/sbin/logger";
        =12=    
        =13=    ## end config
        =14=    
        =15=    use Apache2::Const qw(DECLINED OK);
        =16=    use Apache2::Util qw(ht_time);
        =17=    
        =18=    my $logger_handle;
        =19=    
        =20=    sub handler {
        =21=      my ($conf_pool, $log_pool, $temp_pool, $s) = @_;
        =22=    
        =23=      ## warn "executed in $$\n";
        =24=    
        =25=      for (my $vs = $s; $vs; $vs = $vs->next) {
        =26=        ## set this up in the "real" server:
        =27=        next if $vs->is_virtual;
        =28=        ## warn "setting up ", $vs->server_hostname, ":", $vs->port, "\n";
        =29=        $vs->push_handlers(PerlPostReadRequestHandler =>
        =30=                           __PACKAGE__ . '::PerlPostReadRequestHandler');
        =31=        $vs->push_handlers(PerlLogHandler =>
        =32=                           __PACKAGE__ . '::PerlLogHandler');
        =33=      }
        =34=    
        =35=      open $logger_handle, "|$LOGGER"
        =36=        or die "Cannot create |$LOGGER: $!";
        =37=      {
        =38=        my $old = select($logger_handle);
        =39=        $| = 1;
        =40=        select($old);
        =41=      }
        =42=    
        =43=      return OK;
        =44=    }
        =45=    
        =46=    sub PerlPostReadRequestHandler {
        =47=      my $r = shift;
        =48=      return DECLINED unless $r->is_initial_req;
        =49=    
        =50=      my @times = (time, times);
        =51=    
        =52=      $r->pnotes($PNOTE_IDENTIFIER, \@times);
        =53=    
        =54=      ## warn "saved @times in pnote\n";
        =55=      return DECLINED;
        =56=    }
        =57=    
        =58=    sub PerlLogHandler {
        =59=      my $r = shift;
        =60=      ## warn "running the handler for ", $r->uri, "\n";
        =61=    
        =62=      ## first, reap any zombies so child CPU is proper:
        =63=      {
        =64=    
        =65=        my $kid = waitpid(-1, 1);  # 1 means WNOHANG: don't block
        =66=        if ($kid > 0) {
        =67=          # $r->warn("found kid $kid"); # DEBUG
        =68=          redo;
        =69=        }
        =70=      }
        =71=    
        =72=      my @times = @{$r->pnotes($PNOTE_IDENTIFIER) || []};
        =73=      unless (@times) {
        =74=        $r->warn($r->uri, ": DBILog: where is \@times?");
        =75=        return DECLINED;
        =76=      }
        =77=    
        =78=      ## delta these times:
        =79=      @times = map { $_ - shift @times } time, times;
        =80=    
        =81=      my $orig = $r->main || $r;
        =82=      my $last = $orig; $last = $last->next while $last->next;
        =83=      my $s = $orig->server;
        =84=      my $c = $orig->connection;
        =85=      my $server = $orig->dir_config->get("DBILogServer");
        =86=    
        =87=    =for SQL
        =88=    
        =89=    create table requests (
        =90=      uid text,
        =91=      cookie text,
        =92=      stamp timestamp with time zone default now(),
        =93=      host text not null,
        =94=      server text,
        =95=      vhost text not null,
        =96=      method text not null,
        =97=      url text not null,
        =98=      basicauth text,
        =99=      referer text,
        =100=     useragent text,
        =101=     status int default 0,
        =102=     bytes int,
        =103=     wall int,
        =104=     cpuuser real,
        =105=     cpusys real,
        =106=     cpucuser real,
        =107=     cpucsys real
        =108=   );
        =109=   
        =110=   =cut
        =111=   
        =112=     my %data;
        =113=   
        =114=     $data{uid} =
        =115=       $orig->headers_in->{"x-stonehenge-unique-id"} ||
        =116=         $r->headers_in->{"x-stonehenge-unique-id"} ||
        =117=           $last->headers_in->{"x-stonehenge-unique-id"} ||
        =118=             $orig->subprocess_env->{UNIQUE_ID} ||
        =119=               $r->subprocess_env->{UNIQUE_ID} ||
        =120=                 $last->subprocess_env->{UNIQUE_ID};
        =121=     $data{cookie} =
        =122=       $orig->notes->{"cookie"} ||
        =123=         $r->notes->{"cookie"} ||
        =124=           $last->notes->{"cookie"};
        =125=     $data{stamp} = Apache2::Util::ht_time($r->pool, $orig->request_time);
        =126=     $data{host} = $c->get_remote_host;
        =127=     $data{server} = $server;
        =128=     $data{vhost} = join(":", $s->server_hostname, $s->port);
        =129=     $data{method} = $orig->method;
        =130=     $data{url} = ($orig->the_request =~ /^\S+\s+(\S+)/)[0];
        =131=     $data{basicauth} = $orig->user;
        =132=     $data{referer} = $orig->headers_in->{'Referer'};
        =133=     $data{useragent} = $orig->headers_in->{'User-agent'};
        =134=     $data{status} = $orig->status;
        =135=     $data{bytes} = $r->bytes_sent;
        =136=     @data{qw(wall cpuuser cpusys cpucuser cpucsys)} = @times;
        =137=   
        =138=     my $data = join(" ",
        =139=                     map { defined $data{$_} ?
        =140=                             "$_=" . unpack("H*", $data{$_}) :
        =141=                               () }
        =142=                     sort keys %data
        =143=                    );
        =144=   
        =145=     ## we hope we get an atomic write here...
        =146=     print $logger_handle "<$data>\n"
        =147=       or die "Write error on logger pipe: $!";
        =148=     ## hopefully, angle bracket markers will tell us if we got scrambled
        =149=   
        =150=     return OK;
        =151=   }
        =152=   
        =153=   1;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc. of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.