Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 28 (Sep 2001)

[suggested title: Cleaning out a logging database]

The Apache web server that handles www.stonehenge.com logs its transactions directly to a MySQL database, using a mod_perl handler. This is really cool, because I can perform statistical correlations on hits over the past few months, including such complex things as the average CPU time used for a particular URL (to see if some of my dynamic pages need better caching) or the greatest number of referers to a particular page.
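
Just to make that concrete: a correlation like the average-CPU one is a single aggregate query away. Here's a rough sketch of the kind of thing I mean; the column names url and cpuuser are stand-ins for illustration, not necessarily what my real table calls them.

    use DBI;

    ## a rough sketch only: the column names (url, cpuuser) are stand-ins,
    ## not necessarily what my real logging table uses
    my ($user, $pass) = ("--guess--", "--not--");
    my $dbh = DBI->connect("dbi:mysql:merlyn_httpd", $user, $pass,
                           { PrintError => 0, RaiseError => 1 });
    my $rows = $dbh->selectall_arrayref
      ("SELECT url, AVG(cpuuser) AS avg_cpu, COUNT(*) AS hits
          FROM requests GROUP BY url ORDER BY avg_cpu DESC LIMIT 20");
    for my $row (@$rows) {
      my ($url, $avg_cpu, $hits) = @$row;
      printf "%8.3f sec %6d hits  %s\n", $avg_cpu, $hits, $url;
    }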

But it's a bit uncool because even for a moderate site like mine, the storage requirement to hold information about each of the many hits over time can grow rapidly, about 100 megabytes per month at the current hit rate.

Every once in a while, my ISP admin would yell at me and tell me to reduce the size of my MySQL usage. So I'd type a few commands to roll out the oldest data into a second temporary table, then mysqldump that table into a flat file, compress the file, and then delete the information from the original table. If I was feeling particularly nice, I'd also perform an OPTIMIZE TABLE as well. The compressed file takes far less space than the original file (about 1/15th or so), so I can safely store the historical data for a lot longer should I ever want to restore it all for a grand archaeological trip.

But there are many steps to this procedure, and because it was performed infrequently, and I wasn't smart enough to take notes about what I had done the last time, I found myself always staring for 10 to 15 minutes at the MySQL online documentation. And once I was sure of the commands I was typing, I would proceed, cursing my ISP for not providing me unlimited disk space.

So, it got to be ``that time of the month'' again a few days ago, and I decided once and for all to automate the procedure. Of course, this is a bit silly, because the work it took to automate this once-a-month rollout has now taken about the same amount of time as it would take to do this procedure by hand for a year. But just in case a few of you can benefit from this technique (or perhaps a similar one), I'll pass along the program to recoup my investment.

I've tried to make the program rather generic. Although it's specific to MySQL, it should work with any table that has a date-time stamp that increases with time.
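
For example, aiming it at some other application's table should be just a matter of swapping the configuration values, something like this entirely made-up set:

    ## hypothetical configuration for some other application; every name
    ## here is made up, but any table with an increasing date-time column
    ## should roll out the same way
    my $DIR     = "/home/someone/archive";   # where the .gz dumps land
    my $DB      = "orders_db";               # database name
    my $TABLE   = "orders";                  # table to be rolled out
    my $STAMP   = "placed_at";               # its DATETIME column
    my $DB_USER = "archiver";
    my $DB_PASS = "secret";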

So, without further delay, I introduce my so-called time-saver in [listing one, below].

Line 1 turns on warnings. Line 2 turns on compiler restrictions (no symbolic references, variables must be declared, and barewords are not permitted). Line 3 disables output buffering: what little output we generate, we want to see right away.

Lines 5 to 10 define the configuration parameters, making this program somewhat reusable for other applications. $DIR is the directory in which the output files will be stored. $DB is the MySQL database name, and $TABLE defines the table within that database that will be rolled out. $STAMP is a date-time field within the $TABLE table that will be used to select the records of interest for archival. Finally, $DB_USER and $DB_PASS define the MySQL user that has table creation and access privileges for that database. Of course, for some of these values I've replaced my live data with stand-ins: I'll let you guess which ones.

Line 12 pulls in the DBI module (installed from the CPAN). In addition, you'll also need DBD::mysql installed, which DBI pulls in.

Line 14 puts us into the right directory, permitting my program to be run from cron without a special directory change in the invocation.

Lines 16 and 17 connect up to the database, using the selected username and password and database name. In addition, I've enabled the RaiseError parameter, causing any significant error to automatically throw an exception (die), and disabled the automatic printing of error messages. This keeps me from having to add an ``or die ...'' after each database call.
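
To see what that buys me, here's a sketch of the same opening moves with RaiseError left off, where every single call would need its own check:

    use DBI;

    ## a sketch of the noisier style that RaiseError avoids; the stand-in
    ## values match the listing's configuration
    my ($DB, $DB_USER, $DB_PASS) = ("merlyn_httpd", "--guess--", "--not--");
    my $dbh = DBI->connect("dbi:mysql:$DB", $DB_USER, $DB_PASS,
                           { PrintError => 0 })
      or die "connect failed: $DBI::errstr";
    $dbh->do("SET OPTION SQL_BIG_TABLES = 1")
      or die "cannot set option: " . $dbh->errstr;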

Line 18 is a bit of voodoo programming. One time when I was dealing with large tables in SQL, I got some sort of table overflow, and it aborted my program. I found this command in the manual, and now ritually include it in each program as a gift to the deities to keep from bombing out. I have no idea when it's needed, and the manual isn't clear about that, and I still don't know why it isn't the default. But there it is.

Lines 21 to 33 determine the temporary table name and the upper bound for the date range. These are determined dynamically by looking at the oldest entry in the current records.

First, lines 22 to 23 get the Unix Epoch time value for the oldest entry as the number of seconds since the Unix base value. Line 25 dumps that out for logging purposes, by using localtime in a scalar context.

Then, lines 27 to 30 compute a ``current month'' and ``next month'' pair of values based on that stamp, by again using localtime, but now in a list context. The year number from localtime is offset by 1900, and the month by 1, so we add those back in line 28. The ending month is initially the start month plus 1, but if that exceeds the end of the year, we adjust the year and month appropriately.
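
As a standalone illustration of that arithmetic (using a made-up epoch value from mid-September 2001), it goes something like this:

    ## a standalone sketch of the month arithmetic; the epoch value is
    ## made up, somewhere in mid-September 2001
    my $lowest_epoch = 1000166400;
    my @low = localtime($lowest_epoch);
    my ($start_year, $start_month) = ($low[5] + 1900, $low[4] + 1); # undo the offsets
    my ($end_year, $end_month) = ($start_year, $start_month + 1);
    $end_year++, $end_month = 1 if $end_month > 12;                 # December wraps to January
    printf "archive %04d-%02d; everything before %04d-%02d-01 gets rolled out\n",
      $start_year, $start_month, $end_year, $end_month;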

Lines 31 and 32 format the current month values into the archival table name, and the next month into the MySQL date string for the first of that month. These two values end up in the outer $table_name and $before variables, respectively, and line 35 prints them.

Lines 38 to 48 get the table definition (or at least a portion of it) so that we can create the archival table similar to the original table. First, line 39 creates a statement handle asking MySQL for the table structure, which is executed in line 40.

Line 41 is a bit of magic, so let's read it slowly. First, from right to left, I'm declaring a %col variable. A reference to that variable forms the sole element of a list that the backwards for(-each) statement is now processing. So $_ is equal to that hash reference as we execute the bind_columns call. From the inside-out there, we'll start with NAME_lc element of the $info hashref, which has the names of the columns, lowercased for consistency. This returns an array ref which we dereference to form the keys of a slice created by dereferencing $_, which means a hash slice of %col. Whew! Almost there. The ``take a reference to'' operator is applied to this hash slice, which gets a list of references to the values of that slice, and those are fed as a list to bind_columns.

So, the net effect is, $col{field} is automatically bound to the Field column on each fetch! Wow. OK, there's 17 other ways to do this, but it's pretty slick (or sick?) when you see how it works.
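
For the record, here's a sketch of one of those other ways: fetchrow_hashref can be asked for lowercased keys directly, at the cost of building a fresh hash on every fetch.

    ## one of the "other ways": ask fetchrow_hashref for lowercased keys,
    ## trading the bind_columns gymnastics for a new hash on each fetch
    my $info = $dbh->prepare("DESCRIBE $TABLE");
    $info->execute;
    my @coldefs;
    while (my $col = $info->fetchrow_hashref("NAME_lc")) {
      push @coldefs, "$col->{field} $col->{type}";
    }
    my $cols = join ",", @coldefs;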

Lines 43 to 46 grab the column definitions, by constructing an array of the fields and types for each of the columns in the target table. This is not a complete representation of the table, but permits us a safe place to copy the data, and restore it later, without losing any precision or meaning.

Line 47 returns the column definition, which ends up in $cols from line 38. I use a do block here because it permits me the opportunity to create some temporary variables while I'm deciding what the initial (or only) value of a variable should be, and to discard those variables once I know the answer.
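
In miniature, the idiom looks like this:

    ## the do-block idiom in miniature: $hour exists only while we decide
    my $greeting = do {
      my $hour = (localtime)[2];
      $hour < 12 ? "good morning" : "good afternoon";
    };
    print "$greeting\n";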

Beginning in line 50, we've got the code to start moving real data around. First, we make the archival table using the definition computed earlier. If anything breaks in lines 51 or 52, we'll be out of the program with a fatal untrapped die.

But once we've got the temporary archival table, we want to be nice to the database and clean it up if something breaks, so starting in line 53, we'll stay within an eval block to trap those errors.
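
Stripped to a skeleton, the pattern is simply the following (both subroutine names are placeholders, not anything from the listing):

    ## skeleton of the trap-and-clean-up pattern; both subroutines here
    ## are placeholders, not anything from the listing
    sub do_risky_work   { die "simulated failure\n" }
    sub drop_temp_table { print "cleanup ran anyway\n" }

    eval {
      do_risky_work();          # any die() in here is trapped, not fatal
    };
    warn "archive step failed: $@" if $@;
    drop_temp_table();          # cleanup runs whether or not the eval died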

Lines 54 to 57 copy the data from the live table to the archival table, and show how many rows were affected.

Lines 59 to 62 verify that there are in fact that many rows in the table, just to be doubly safe.

Lines 64 to 66 show the date range of the data to be archived. This is not needed for the calculations; it's just to keep me comfy about how much data I'm storing away.

Lines 69 to 81 call the mysqldump utility with the necessary arguments to extract the archival table into a compressed flat file. Since we'll want to control the output of the child process precisely to feed it through the compressor, we'll have to fork ourselves manually. Line 75 computes the output filename, and line 76 aborts if the file already exists. All through this program we're playing it safe, perhaps excessively so. The arguments in line 79 define the username, password, database, and table name for the dump, which will end up being passed through an aggressive gzip (set up in line 77) to generate the file.
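
An alternative, by the way, which is not what the listing does: skip the manual fork and let the parent read mysqldump's output itself, pushing it through gzip via two piped filehandles. The list form of piped open in this sketch assumes a sufficiently recent Perl.

    ## an alternative sketch, not the listing's approach: the parent reads
    ## mysqldump and feeds gzip directly; the list-form piped open assumes
    ## a sufficiently recent Perl
    my $file = "mysql_$table_name.gz";
    die "$file exists, aborting\n" if -e $file;
    open my $dump, "-|", "mysqldump", "--opt",
      "--user=$DB_USER", "--password=$DB_PASS", $DB, $table_name
      or die "Cannot start mysqldump: $!";
    open my $zip, "|-", "gzip -9 > $file"
      or die "Cannot start gzip: $!";
    print $zip $_ while <$dump>;
    close $dump or die "mysqldump failed: $?";
    close $zip  or die "gzip failed: $?";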

If all went well, we continue through lines 84 to 86, deleting the data from the original live data table, and finally dropping the temp table in line 89. If that's all good, then a final OPTIMIZE TABLE in line 92 hands some of that newly freed space back to the system.

If there was any abort in the middle, we'll pop down to line 93, and show the error as a warning. Line 95 tries to delete the table one more time, in case line 89 was skipped, and then we're outta there in line 97.

So, we've now got a program that automates all those steps I was doing by hand. When my ISP admin yells at me for space, I can peel off the oldest month's worth automatically with this script, and spend that much more time responding to all my email instead. And there ya have it. Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     my $DIR = "/merlyn/web/OldLog";
        =6=     my $DB = "merlyn_httpd";
        =7=     my $TABLE = "requests";
        =8=     my $STAMP = "when";
        =9=     my $DB_USER = "--guess--";
        =10=    my $DB_PASS = "--not--";
        =11=    
        =12=    use DBI;
        =13=    
        =14=    chdir $DIR or die "Cannot chdir to $DIR: $!";
        =15=    
        =16=    my $dbh = DBI->connect("dbi:mysql:$DB", $DB_USER, $DB_PASS,
        =17=                           { PrintError => 0, RaiseError => 1 });
        =18=    $dbh->do("SET OPTION SQL_BIG_TABLES = 1");
        =19=    
        =20=    ## first, get start and end for dates
        =21=    my ($table_name, $before) = do {
        =22=      my $lowest_epoch = $dbh->selectrow_array
        =23=        ("SELECT UNIX_TIMESTAMP(MIN($STAMP)) FROM $TABLE");
        =24=    
        =25=      print "$lowest_epoch = ".localtime($lowest_epoch)."\n";
        =26=    
        =27=      my @low = localtime($lowest_epoch);
        =28=      my ($start_year, $start_month) = ($low[5] + 1900, $low[4] + 1);
        =29=      my ($end_year, $end_month) = ($start_year, $start_month + 1);
        =30=      $end_year++, $end_month = 1 if $end_month > 12;
        =31=      sprintf("${TABLE}_%04d_%02d", $start_year, $start_month),
        =32=        sprintf("%04d-%02d-01", $end_year, $end_month);
        =33=    };
        =34=    
        =35=    print "table is $table_name, before is $before\n";
        =36=    
        =37=    ## now get table definition
        =38=    my $cols = do {
        =39=      my $info = $dbh->prepare("DESCRIBE $TABLE");
        =40=      $info->execute;
        =41=      $info->bind_columns(\@$_{@{$info->{NAME_lc}}}) for \my %col;
        =42=    
        =43=      my @coldefs;
        =44=      while ($info->fetch) {
        =45=        push @coldefs, "$col{field} $col{type}";
        =46=      }
        =47=      join(",", @coldefs);
        =48=    };
        =49=    
        =50=    ## transfer live data to temp table
        =51=    $dbh->do("DROP TABLE IF EXISTS $table_name");
        =52=    $dbh->do("CREATE TABLE $table_name ($cols)");
        =53=    eval {
        =54=      my $count_insert =
        =55=        $dbh->do("INSERT INTO $table_name SELECT * FROM $TABLE WHERE $STAMP < ?",
        =56=                 undef, $before);
        =57=      print "inserted $count_insert rows\n";
        =58=    
        =59=      my $count_selected =
        =60=        $dbh->selectrow_array("SELECT count(*) FROM $table_name");
        =61=      print "transferred $count_selected rows\n";
        =62=      die "mismatch!" unless $count_selected == $count_insert;
        =63=    
        =64=      my ($min_date, $max_date) =
        =65=        $dbh->selectrow_array("SELECT MIN($STAMP), MAX($STAMP) FROM $table_name");
        =66=      print "dates range from $min_date to $max_date\n";
        =67=    
        =68=      ## use mysqldump to create file
        =69=      print "dumping...\n";
        =70=      defined(my $kid = fork) or die "Cannot fork: $!";
        =71=      if ($kid) {                   # parent
        =72=        waitpid($kid,0);
        =73=        die "bad exit status: $?" if $?;
        =74=      } else {                      # kid
        =75=        my $file = "mysql_$table_name.gz";
        =76=        die "$file exists, aborting\n" if -e $file;
        =77=        open STDOUT, "|gzip -9 >$file" or die "gzip: $!";
        =78=        exec "mysqldump", "--opt",
        =79=          "--user=$DB_USER", "--password=$DB_PASS", $DB, $table_name;
        =80=        die "Cannot exec: $!";
        =81=      }
        =82=    
        =83=      ## delete original data, and drop table
        =84=      print "deleting...\n";
        =85=      my $count_delete =
        =86=        $dbh->do("DELETE FROM $TABLE WHERE $STAMP < ?", undef, $before);
        =87=      print "deleted $count_delete rows\n";
        =88=    
        =89=      $dbh->do("DROP TABLE IF EXISTS $table_name");
        =90=    
        =91=      print "optimizing...\n";
        =92=      $dbh->do("OPTIMIZE TABLE $TABLE");
        =93=    }; warn "insert block: $@" if $@;
        =94=    
        =95=    $dbh->do("DROP TABLE IF EXISTS $table_name"); # in case it didn't happen
        =96=    
        =97=    $dbh->disconnect;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.