Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Linux Magazine Column 78 (Feb 2006)
[Suggested title: ``Using a database for a web server log (part 2)'']
In [last month's column], I showed how my Apache2 processes are using mod_perl2 to write logging information to a logger process. Let's continue that description with information about the logger process.
The logger is launched with its STDIN
connected to a pipe created
by the master Apache process. As Apache forks each server child, this
pipe is shared, thus making the logger's input available to all
children.
During the logging phase, each child writes a line to this shared pipe that looks like:
<field1=hexvalue field2=hexvalue field3=hexvalue>
where each field name corresponds to a column in the logging database, and each hexvalue is the hex-encoded value destined for that column.
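For reference, here's a minimal sketch of what a child's write might look like; the field names and values are hypothetical, and last month's real handler naturally pulls them from the Apache request object rather than hard-coding them:

  #!/usr/bin/perl -w
  use strict;

  # Hypothetical field values; the real handler gets these from the request.
  my %info = (url => '/index.html', status => 200, useragent => 'Mozilla/5.0');

  # Hex-encode every value so the record is one safe, whitespace-free token per field.
  my $record = join " ", map { "$_=" . unpack("H*", $info{$_}) } sort keys %info;

  # In the real handler this goes down the shared logging pipe, not STDOUT.
  print "<$record>\n";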
Each server child process writes to this pipe asynchronously, without performing any sort of locking. However, most modern Unix-like systems guarantee that writes under a certain size (typically 4096 bytes) are performed atomically, which means the logger process should receive each record intact, with no fear of intermixed data. Note, though, that each record is bracketed by a less-than and greater-than mark, so we can double-check that we've gotten a complete record by looking for those marks. In the months that I've been using this system, I have not yet seen a failed record, so I believe this system to be reasonably safe.
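If you were more paranoid than I am, you could add a guard (purely hypothetical; my handler doesn't bother) that warns when a record approaches that atomic-write limit before it goes down the pipe:

  # Continuing the sketch above: $record is the encoded payload.
  my $line = "<$record>\n";
  my $PIPE_BUF = 4096;    # typical atomic pipe-write limit; check your own system
  warn length($line), "-byte record may not be written atomically\n"
    if length($line) > $PIPE_BUF;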
The logger process needs to remain alive as long as the Apache children are alive. To stop the logger process, we can simply terminate the program when we finally get EOF on the pipe. But what about errors? If the logger process dies, there's no easy way to replace it (other than restart the Apache server) because the pipe from the parent Apache process is not a named pipe. (Hmm, making the interface through a named pipe might be an interesting idea for a future version of this program.)
Because this input pipe was ``sacred'', and the logger needed to run
``forever'', I decided to implement a watchdog process pattern. The
logger program forks a child to do the actual work, and watches for
the logger-child to exit. If the child exits cleanly, so does the
watchdog (such as when the child reads EOF because Apache is shutting
down). However, if the child dies badly, the watchdog respawns a new
child. Because the watchdog retains the STDIN
from the Apache
logging pipe, the new child also inherits this connection, including
any unprocessed data, and can pick up where the exited child left off.
As long as the watchdog itself doesn't have any unusual exit conditions, we're guaranteed a fair degree of safety, so it's important to make sure the watchdog does as little as possible. My resulting program, which performs both the watchdog and the database writer, is found in [listing one, below].
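Distilled to a skeleton, the watchdog half of that listing looks roughly like this, where do_the_real_work() is a stand-in for the database loop described below:

  {
    defined(my $pid = fork) or do { warn "can't fork: $!"; sleep 5; redo };
    if ($pid) {                    # watchdog (parent)
      waitpid($pid, 0);
      last unless $?;              # child exited cleanly, so we can too
      warn "worker died with $?, restarting...";
      sleep 5;                     # don't thrash
      redo;
    }
    do_the_real_work();            # child: inherits STDIN from the watchdog
    exit 0;
  }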
Lines 1-3 create my traditional program start. Because this program is
processed as a Template Toolkit file during the server build process,
anything marked with [*..*]
will be processed and replaced with the
appropriate text. In this case, the PERL
variable from the environment
provides the appropriate ``she-bang'' line. This variable is set
in the GNU Makefile as:
export PERL ?= $(shell which perl)
Thus, by default, an appropriate path to perl
is determined. This
is important because I've left the Mac OS X Perl alone on my box,
preferring instead to install local modules into a local bleeding-edge
Perl that I've installed in /opt/perl/snap/bin
on my machine.
However, when I push this code up to the server, I need to use the
traditional /usr/bin/perl
path instead.
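So, for example, after templating the first line of the listing comes out as one of these (paths as described above):

  #!/opt/perl/snap/bin/perl -w     (on my development box)
  #!/usr/bin/perl -w               (on the deployed server)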
Lines 5 through 8 create a header on the resulting templated file, giving the original file name, processing time, and the modification time. I've discovered that when I generate files that look like source files, it's important to mark the file as such. Otherwise, I might try editing it and then wonder why my changes disappear later.
Lines 11 through 15 provide the database information to connect to my PostgreSQL server. Obviously, the username and password here are fictionalized. These items are used in line 31 to connect to the database.
Line 17 pulls in the DBI module.
Lines 19 to 57 form a loop for the basic watchdog algorithm.
Line 20 attempts to create a child process to do the real work. If
this should fail, a message is logged to STDERR; this error
message appears in the Apache error log. After five seconds, we try
again. The five-second pause ensures that we don't get into a tight
loop of trying to fork, failing, and burning up a lot of CPU trying
that again and again immediately.
If we reach line 21, we have both a parent (the watchdog) and the
child (the actual logging process) running in parallel. However,
$pid
will be zero in the child, and non-zero in the parent, so it's
time for them to start acting differently. If we're the parent, we'll
wait for the child (line 23).
If the child process exits cleanly, then $?
will be zero, and the
parent process breaks out of the loop (line 24), resulting in a clean
exit. However, if the child broke somehow, we reach line 25,
triggering a warning message, and an attempt to restart the child.
Again, we wait 5 seconds to ensure that there's no thrashing because
of something really horribly wrong out there.
The parent code will execute either the last
in line 24, or the
redo
in line 27, so it can never make it down to line 29, which
marks the beginning of the child code. Now we can see where the real
work gets done.
Lines 31 and 32 set up my database handle (called $dbh
for
traditional reasons). Because I've set RaiseError
, any bad
operations (including setting up the initial connection) trigger an
exception. Because these exceptions aren't caught (no eval
blocks
in this code), the entire child process will die, and the watchdog
will replace it. Simple, but effective.
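If instead I wanted the child to shrug off one bad insert rather than die and be respawned, I could have wrapped the prepare and execute (listing lines 51 and 52) in an eval block, something like this sketch:

  eval {
    my $sth = $dbh->prepare_cached($INSERT);
    $sth->execute(map $_->[1], @fields);
  };
  warn "insert failed, skipping record: $@" if $@;

For this logger, though, dying and being respawned is exactly the recovery I want.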
Lines 34 through 53 read a line at a time from standard input.
Because this child process inherits STDIN
from the parent
(watchdog) process, and the parent process got its STDIN
as the
pipe connection to Apache for logging, this loop will execute once for
each log record.
Line 35 was useful during debugging to ensure that the right data was being read by the child logger. The Apache error log showed that I was getting the right information at the right time. Of course, in production, this would be excessively noisy (turning the error log into an access log), so I commented the line out, keeping it around in case I need it again in the future.
Lines 36 and 37 extract the keyword and hex value payload from the line. If the line doesn't have the proper angle bracket markers, we report it and ignore it. This might happen if the writes weren't atomic, for example.
Line 38 sets up an array to hold the list of fields to be inserted.
Lines 39 to 44 break up the line by whitespace, causing $field
to
repeatedly contain a column name and hex value for that column. Lines
40 and 41 break up the $field
into the column name and hex value as
$1
and $2
. If that fails, the item is logged and ignored.
Line 42 grabs the column name into $key
, and turns the hex value
into the original string for $value. Line 43 pushes that into the
@fields
array for final insertion.
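That pack("H*", ...) is just the inverse of the unpack("H*", ...) encoding done on the Apache side; a quick round trip shows the idea:

  my $hex = unpack "H*", "/cgi-bin/view";   # yields "2f6367692d62696e2f76696577"
  my $str = pack   "H*", $hex;              # back to "/cgi-bin/view"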
Once we've completed the value extraction, it's time to create the insert handle. Lines 45 to 50 create a string that looks like:
INSERT INTO requests (stamp, url, status) VALUES (?, ?, ?)
The number of question marks corresponds to the number of fields that we saw on this request. These question marks are DBI placeholders that will be given values during the execute step.
Line 51 prepares the SQL, using prepare_cached
. We hope that most
of these inserts will actually use the same column names, so the
prepare_cached
step will be very fast. However, for maximum
flexibility, each request is considered separately. Note that the
logger doesn't even have to know the names of the columns.
Once we have a good statement handle in $sth
, it's time to perform
the insert in line 52. Each of the values extracted from @fields
provides the value for the corresponding column selected in the SQL
earlier.
When the loop reading from STDIN
exits, it means we've seen the end
of the input. At this point, the Apache server is no longer talking
to us, so our usefulness has ceased, and it's time to leave quietly,
in line 56. The 0 exit code ensures that the watchdog also knows
to exit quietly.
And there you have it: a logging aggregator that can take log records from nearly any number of Apache child processes and write them all to the database using a single connection handle.
Now that this logging agent is in place, I can get near-real-time information from my database about the performance of my webserver.
For convenience, I've created a couple of views to select the front or back server cluster:
CREATE VIEW front AS SELECT requests.* FROM requests WHERE server = 'P';
CREATE VIEW back AS SELECT requests.* FROM requests WHERE server = 'B';
From these views, I can ask about all front requests that have resulted in 404s in the past 24 hours:
SELECT referer, url, count(*) FROM front
WHERE stamp > 'yesterday' AND status = 404
GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10;
Or, I can see which user agents (types of browsers or robots) have burned the most CPU on my system in the past five minutes (possibly pointing out some badly behaved spider):
SELECT SUM(cpuuser+cpusys+cpucuser+cpucsys), useragent FROM requests
WHERE stamp > now() - interval '5 minutes'
GROUP BY 2 ORDER BY 1 DESC LIMIT 10;
I could also implement some bandwidth restrictions by penalizing any heavy downloaders by IP address using a query similar to:
SELECT SUM(bytes), host FROM front
WHERE stamp > now() - interval '1 minutes'
GROUP BY 2 ORDER BY 1 DESC LIMIT 10;
Since my pages are dynamically generated, it's nice to know which of my pages could use some optimization:
SELECT SUM(cpuuser+cpusys+cpucuser+cpucsys), vhost || url FROM requests
WHERE stamp > now() - interval '1 day'
GROUP BY 2 ORDER BY 1 DESC LIMIT 10;
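Any of these can also be run from a small Perl program, using the same connection information as the logger; here's a rough sketch (with the same fictionalized credentials) that prints the ten most CPU-hungry pages of the past day:

  #!/usr/bin/perl -w
  use strict;
  use DBI;

  my $dbh = DBI->connect('dbi:Pg:dbname=web', 'Username', 'Password',
                         { RaiseError => 1 });

  my $sth = $dbh->prepare(q{
    SELECT SUM(cpuuser+cpusys+cpucuser+cpucsys) AS cpu, vhost || url AS page
    FROM requests
    WHERE stamp > now() - interval '1 day'
    GROUP BY 2 ORDER BY 1 DESC LIMIT 10
  });
  $sth->execute;
  while (my ($cpu, $page) = $sth->fetchrow_array) {
    printf "%10.2f %s\n", $cpu, $page;
  }
  $dbh->disconnect;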
Lots of useful queries! I'm still discovering more every day.
I hope you find the information useful. Until next time, enjoy!
Listing
=1=     #![* env.${"PERL"} *] -w
=2=     use strict;
=3=     $|++;
=4=
=5=     [* USE date; FILTER format('### %-70s ###') -*]
=6=     Processed automatically at [* date.format(date.now) *] from
=7=     [* template.name *] edited at [* date.format(component.modtime) *]
=8=     [*- END *]
=9=
=10=
=11=    ## config
=12=    my $DSN = 'dbi:Pg:dbname=web';
=13=    my $DB_TABLE = 'requests';
=14=    my $DB_AUTH = 'Username:Password';
=15=    ## end config
=16=
=17=    use DBI;
=18=
=19=    {
=20=      defined (my $pid = fork) or do { warn "can't fork: $!"; sleep 5; redo };
=21=      if ($pid) {
=22=        ## parent does...
=23=        waitpid($pid, 0);
=24=        last unless $?;         # clean exit means I can go away too
=25=        warn "logger died with $?, restarting...";
=26=        sleep 5;                # do not thrash!
=27=        redo;
=28=      }
=29=      ## child does...
=30=
=31=      my $dbh = DBI->connect($DSN, (split ':', $DB_AUTH),
=32=                             {RaiseError => 1 });
=33=
=34=      while (<STDIN>) {
=35=        ## warn "$0 saw $_";
=36=        s/^<(.*)>$/$1/
=37=          or warn ("skipping $_\n"), next;
=38=        my @fields;
=39=        for my $field (split) {
=40=          $field =~ /^(\w+)=([0-9a-fA-F]*)$/
=41=            or warn ("skipping $_ in $field\n"), next;
=42=          my ($key, $value) = ($1, pack("H*", $2));
=43=          push @fields, [$key, $value];
=44=        }
=45=        my $INSERT =
=46=          "INSERT INTO $DB_TABLE (".
=47=          (join ",", map $_->[0], @fields).
=48=          ") VALUES(".
=49=          (join ",", ("?") x @fields).
=50=          ")";
=51=        my $sth = $dbh->prepare_cached($INSERT);
=52=        $sth->execute(map $_->[1], @fields);
=53=      }
=54=
=55=      $dbh->disconnect;
=56=      exit 0;
=57=    }