Linux Magazine Column 28 (Sep 2001)
[suggested title: Cleaning out a logging database]
The Apache web server that handles www.stonehenge.com logs its transactions directly to a MySQL database, using a mod_perl handler. This is really cool, because I can perform statistical correlations on hits over the past few months, including such complex things as the average CPU time used for a particular URL (to see if some of my dynamic pages need better caching) or the greatest number of referers to a particular page.
But it's a bit uncool because even for a moderate site like mine, the storage requirement to hold information about each of the many hits over time can grow rapidly, about 100 megabytes per month at the current hit rate.
Every once in a while, my ISP admin would yell at me and tell me to reduce the size of my MySQL usage. So I'd type a few commands to roll the oldest data out into a second temporary table, then mysqldump that table into a flat file, compress the file, and then delete the information from the original table. If I was feeling particularly nice, I'd perform an OPTIMIZE TABLE as well. The compressed file takes far less space than the original data (about 1/15th or so), so I can safely store the historical data for a lot longer, should I ever want to restore it all for a grand archaeological trip.
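For the record, the by-hand ritual looked roughly like this sketch (reconstructed from memory, with stand-in table and file names matching the configuration you'll see in the listing):

  mysql> CREATE TABLE requests_2001_06 SELECT * FROM requests
      ->     WHERE when < '2001-07-01';
  $ mysqldump --opt merlyn_httpd requests_2001_06 | gzip -9 >mysql_requests_2001_06.gz
  mysql> DELETE FROM requests WHERE when < '2001-07-01';
  mysql> DROP TABLE requests_2001_06;
  mysql> OPTIMIZE TABLE requests;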
But there are many steps to this procedure, and because I performed it infrequently and wasn't smart enough to take notes about what I had done the last time, I always found myself staring at the MySQL online documentation for 10 to 15 minutes first. And once I was sure of the commands I was typing, I would proceed, cursing my ISP for not providing me unlimited disk space.
So, it got to be ``that time of the month'' again a few days ago, and I decided once and for all to automate the procedure. Of course, this is a bit silly, because the work it took to automate this once-a-month rollout took about as much time as a year's worth of doing the procedure by hand. But just in case a few of you can benefit from this technique (or perhaps a similar one), I'll pass along the program to recoup my investment.
I've tried to make the program rather generic. Although it's specific to MySQL, it should work with any table that has a date-time stamp that increases with time.
So, without further delay, I introduce my so-called time-saver in [listing one, below].
Line 1 turns on warnings. Line 2 turns on compiler restrictions (no symbolic references, variables must be declared, and barewords are not permitted). Line 3 disables output buffering: what little output we generate, we want to see right away.
Lines 5 to 10 define the configuration parameters, making this program somewhat reusable for other applications. $DIR is the directory in which the output files will be stored. $DB is the MySQL database name, and $TABLE defines the table within that database that will be rolled out. $STAMP is a date-time field within the $TABLE table that will be used to select the records of interest for archival. Finally, $DB_USER and $DB_PASS define the MySQL user that has table creation and access privileges for that database. Of course, I've replaced some of my live data with stand-in values: I'll let you guess which ones.
Line 12 pulls in the DBI module (installed from the CPAN). In addition, you'll also need DBD::mysql installed, which DBI pulls in.
Line 14 puts us into the right directory, permitting my program to be run from cron without a special directory change in the invocation.
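For example, assuming the script is saved as /home/merlyn/bin/rollout-log (a made-up path), a crontab entry to run it in the wee hours of the first of each month might look like this:

  5 4 1 * * /home/merlyn/bin/rollout-log

Thanks to the chdir in line 14, no cd wrapper or output-path argument is needed in that entry.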
Lines 16 and 17 connect up to the database, using the selected username, password, and database name. In addition, I've enabled the RaiseError parameter, causing any significant error to automatically throw an exception (die), and disabled the automatic printing of error messages. This keeps me from having to add an ``or die ...'' after each database call.
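In other words, with RaiseError enabled, a plain $dbh->do($some_sql) acts roughly like this hand-rolled version (illustrative only; $some_sql is a stand-in):

  $dbh->do($some_sql)
    or die $dbh->errstr;

except that DBI issues the die from inside the call, so there's no check for me to forget.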
Line 18 is a bit of voodoo programming. One time when I was dealing with large tables in SQL, I got some sort of table overflow, and it aborted my program. I found this command in the manual, and now ritually include it in each program as a gift to the deities to keep from bombing out. I have no idea when it's needed, the manual isn't clear about that, and I still don't know why it isn't the default. But there it is.
Lines 21 to 33 determine the temporary table name and the upper bound for the date range, both derived dynamically by looking at the oldest entry in the current records.
First, lines 22 and 23 get the Unix epoch time value for the oldest entry as the number of seconds since the Unix base value. Line 25 dumps that out for logging purposes, by using localtime in a scalar context.
Then, lines 27 to 30 compute a ``current month'' and ``next month'' pair of values based on that stamp, by again using localtime, but now in a list context. The year number from localtime is offset by 1900, and the month by 1, so we add those back in line 28. The ending month is initially the start month plus 1, but if that exceeds the end of the year, we adjust the year and month appropriately.

Lines 31 and 32 format the current month values for the archival table name, and the next month as the MySQL date string for the first of the month. These two values end up in the outer $table_name and $before variables, respectively, as shown in line 35.
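To make that concrete, suppose the oldest record is stamped June 17, 2001. Here's roughly how the values fall out (a worked example, not code from the listing):

  my @low = localtime($lowest_epoch);   # June 17, 2001
  # $low[5] is 101 (years since 1900); $low[4] is 5 (months count from 0)
  my ($start_year, $start_month) = (101 + 1900, 5 + 1);   # (2001, 6)
  my ($end_year, $end_month) = (2001, 6 + 1);             # (2001, 7): no wrap needed
  # a December stamp would yield an end month of 13, wrapping to (2002, 1)
  # results: $table_name is "requests_2001_06", $before is "2001-07-01"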
Lines 38 to 48 get the table definition (or at least a portion of it) so that we can create the archival table similar to the original table. First, line 39 creates a statement handle asking MySQL for the table structure, which is executed in line 40.
Line 41 is a bit of magic, so let's read it slowly. First, from right to left, I'm declaring a %col variable. A reference to that variable forms the sole element of a list that the backwards for(-each) statement is now processing. So $_ is equal to that hash reference as we execute the bind_columns call. From the inside out there, we'll start with the NAME_lc element of the $info hashref, which has the names of the columns, lowercased for consistency. This returns an array ref, which we dereference to form the keys of a slice created by dereferencing $_, which means a hash slice of %col. Whew! Almost there. The ``take a reference to'' operator is applied to this hash slice, which gets a list of references to the values of that slice, and those are fed as a list to bind_columns.
So, the net effect is, $col{field} is automatically bound to the Field column on each fetch! Wow. OK, there are 17 other ways to do this, but it's pretty slick (or sick?) when you see how it works.
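If that's too clever for comfort, one of those other ways is a plain fetchrow_hashref loop, which I could have used instead (a straightforward rewrite, not the code in the listing):

  my @coldefs;
  while (my $row = $info->fetchrow_hashref("NAME_lc")) {
    push @coldefs, "$row->{field} $row->{type}";
  }
  my $cols = join(",", @coldefs);

Same result, at the cost of building a fresh hash on every fetch.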
Lines 43 to 46 grab the column definitions, by constructing an array of the fields and types for each of the columns in the target table. This is not a complete representation of the table, but it gives us a safe place to copy the data, and restore it later, without losing any precision or meaning.
Line 47 returns the column definition, which ends up in $cols from line 38. I use a do block here because it gives me the opportunity to create some temporary variables while I'm deciding what the initial (or only) value of a variable should be, and to discard those variables once I know the answer.
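The pattern in miniature, with made-up names:

  my $answer = do {
    my $scratch = some_intermediate_step();  # visible only inside the block
    $scratch * 2;                            # the last expression becomes $answer
  };
  # $scratch is out of scope here; only $answer remains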
Beginning in line 50, we've got the code to start moving real data around. First, we make the archival table using the definition computed earlier. If anything breaks in lines 51 or 52, we'll be out of the program with a fatal untrapped die.
But once we've got the temporary archival table, we want to be nice to the database and clean it up if something breaks, so starting in line 53, we'll stay within an eval block to trap those errors.
Lines 54 to 57 copy the data from the live table to the archival table, and show how many rows were affected.
Lines 59 to 62 verify that in fact there are that many lines in the table, just to be double safe.
Lines 64 to 66 show the date range of the data to be archived. This is not needed for the calculations; it's just to keep me comfy about how much data I'm storing away.
Lines 69 to 81 call the mysqldump utility with the necessary arguments to extract the archival table into a compressed flat file. Since we'll want to control the output of the child process precisely to feed it through the compressor, we'll have to fork ourselves manually. Line 75 computes the output filename, and line 76 aborts if the file already exists. All through this program we're playing it safe, perhaps excessively so. The arguments in line 79 define the username, password, database, and table name for the dump, which will end up being passed through an aggressive gzip (set up in line 77) to generate the file.
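Stripped of the mysqldump specifics, that fork-and-compress pattern looks like this (a minimal sketch; some_command and output.gz are stand-ins):

  defined(my $kid = fork) or die "Cannot fork: $!";
  if ($kid) {                     # parent: wait and check the exit status
    waitpid($kid, 0);
    die "bad exit status: $?" if $?;
  } else {                        # child: route STDOUT through gzip, then exec
    open STDOUT, "|gzip -9 >output.gz" or die "gzip: $!";
    exec "some_command", "arg1";  # stand-in command and argument
    die "Cannot exec: $!";
  }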
If all went well, we continue through lines 84 to 86, deleting the data from the original live table, and finally dropping the temp table in line 89. If that's all good, then a final OPTIMIZE TABLE in line 92 hands some of that new-found empty space back to the filesystem.
If there was any abort in the middle, we'll pop down to line 93 and show the error as a warning. Line 95 tries to drop the table one more time, in case line 89 was skipped, and then we're outta there in line 97.
So, we've now got a program that automates all those steps I was doing by hand. When my ISP admin yells at me for space, I can peel off the oldest month's worth automatically with this script, and spend that much more time responding to all my email instead. And there ya have it. Until next time, enjoy!
Listing
=1=  #!/usr/bin/perl -w
=2=  use strict;
=3=  $|++;
=4=  
=5=  my $DIR = "/merlyn/web/OldLog";
=6=  my $DB = "merlyn_httpd";
=7=  my $TABLE = "requests";
=8=  my $STAMP = "when";
=9=  my $DB_USER = "--guess--";
=10= my $DB_PASS = "--not--";
=11= 
=12= use DBI;
=13= 
=14= chdir $DIR or die "Cannot chdir to $DIR: $!";
=15= 
=16= my $dbh = DBI->connect("dbi:mysql:$DB", $DB_USER, $DB_PASS,
=17=                        { PrintError => 0, RaiseError => 1 });
=18= $dbh->do("SET OPTION SQL_BIG_TABLES = 1");
=19= 
=20= ## first, get start and end for dates
=21= my ($table_name, $before) = do {
=22=   my $lowest_epoch = $dbh->selectrow_array
=23=     ("SELECT UNIX_TIMESTAMP(MIN($STAMP)) FROM $TABLE");
=24= 
=25=   print "$lowest_epoch = ".localtime($lowest_epoch)."\n";
=26= 
=27=   my @low = localtime($lowest_epoch);
=28=   my ($start_year, $start_month) = ($low[5] + 1900, $low[4] + 1);
=29=   my ($end_year, $end_month) = ($start_year, $start_month + 1);
=30=   $end_year++, $end_month = 1 if $end_month > 12;
=31=   sprintf("${TABLE}_%04d_%02d", $start_year, $start_month),
=32=     sprintf("%04d-%02d-01", $end_year, $end_month);
=33= };
=34= 
=35= print "table is $table_name, before is $before\n";
=36= 
=37= ## now get table definition
=38= my $cols = do {
=39=   my $info = $dbh->prepare("DESCRIBE $TABLE");
=40=   $info->execute;
=41=   $info->bind_columns(\@$_{@{$info->{NAME_lc}}}) for \my %col;
=42= 
=43=   my @coldefs;
=44=   while ($info->fetch) {
=45=     push @coldefs, "$col{field} $col{type}";
=46=   }
=47=   join(",", @coldefs);
=48= };
=49= 
=50= ## transfer live data to temp table
=51= $dbh->do("DROP TABLE IF EXISTS $table_name");
=52= $dbh->do("CREATE TABLE $table_name ($cols)");
=53= eval {
=54=   my $count_insert =
=55=     $dbh->do("INSERT INTO $table_name SELECT * FROM $TABLE WHERE $STAMP < ?",
=56=              undef, $before);
=57=   print "inserted $count_insert rows\n";
=58= 
=59=   my $count_selected =
=60=     $dbh->selectrow_array("SELECT count(*) FROM $table_name");
=61=   print "transferred $count_selected rows\n";
=62=   die "mismatch!" unless $count_selected == $count_insert;
=63= 
=64=   my ($min_date, $max_date) =
=65=     $dbh->selectrow_array("SELECT MIN($STAMP), MAX($STAMP) FROM $table_name");
=66=   print "dates range from $min_date to $max_date\n";
=67= 
=68=   ## use mysqldump to create file
=69=   print "dumping...\n";
=70=   defined(my $kid = fork) or die "Cannot fork: $!";
=71=   if ($kid) { # parent
=72=     waitpid($kid,0);
=73=     die "bad exit status: $?" if $?;
=74=   } else { # kid
=75=     my $file = "mysql_$table_name.gz";
=76=     die "$file exists, aborting\n" if -e $file;
=77=     open STDOUT, "|gzip -9 >$file" or die "gzip: $!";
=78=     exec "mysqldump", "--opt",
=79=       "--user=$DB_USER", "--password=$DB_PASS", $DB, $table_name;
=80=     die "Cannot exec: $!";
=81=   }
=82= 
=83=   ## delete original data, and drop table
=84=   print "deleting...\n";
=85=   my $count_delete =
=86=     $dbh->do("DELETE FROM $TABLE WHERE $STAMP < ?", undef, $before);
=87=   print "deleted $count_delete rows\n";
=88= 
=89=   $dbh->do("DROP TABLE IF EXISTS $table_name");
=90= 
=91=   print "optimizing...\n";
=92=   $dbh->do("OPTIMIZE TABLE $TABLE");
=93= }; warn "insert block: $@" if $@;
=94= 
=95= $dbh->do("DROP TABLE IF EXISTS $table_name"); # in case it didn't happen
=96= 
=97= $dbh->disconnect;