Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 36 (May 2002)

[suggested title: Filling a pipe]

Recently, I attended a presentation at the Portland Linux User's Group (http://www.pdxlinux.org/) by Michael Rasmussen (http://www.patch.com/), about building Linux clusters using open-source technology (http://www.linuxvirtualserver.org/).

Michael mentioned at one point that he needed to analyze the traffic of his company's already active web server, and was surprised that many of the commercial and freely available web traffic tools don't report anything satisfactory about bytes per second or hits per second during bursts. But, being reasonably fluent in Perl, Michael wrote a quick script to crawl through the text web log, and got the data he needed.

Well, that got me thinking about how much (or little) traffic I was shooting out of the webserver for www.stonehenge.com. And since I'm already storing the data into a MySQL database (via DBI and mod_perl), I thought it shouldn't be tough to analyze not only the bytes and hits per second (average and burst) as he had done, but also the CPU usage per second, which I'm also recording for each hit in my log.

To make the result more realistic, I decided to spread out the total bytes and CPU usage over the duration of the hit, which I'm also recording as wall-clock seconds. For example, if 100K has been downloaded over 10 seconds, each second in the span should get 10K of that allocation. And because this is a quick-and-dirty program to get some typical stats, I decided to handle this by simply having an array with a separate slot per second, limiting the reasonable size of analysis to just a few days. But that was enough to cover a useful time period, and actually gave me some information about usage that I didn't have from my other reports.
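
To make that concrete, here's a tiny self-contained sketch of the apportioning idea (not the listing's code; the numbers are just the example above):

    ## sketch only: spread one hit's total evenly over every second it touched
    my @bytes_per_second;                # one slot per second of the report window
    sub apportion {
      my ($start, $wall, $total) = @_;   # starting second, wall-clock seconds, total bytes
      my $share = $total / ($wall + 1);  # +1 so even a zero-second hit lands somewhere
      $bytes_per_second[$_] += $share for $start .. $start + $wall;
    }
    apportion(0, 9, 100_000);            # 100K over a ten-second span: 10K to each second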

So, let's look at how I did all that in [listing one, below].

Lines 1 through 3 start nearly every program I write, enabling warnings and compiler restrictions (no barewords, symbol references, or undeclared variables), and disabling the normal buffering of standard output.

Line 5 brings in the standard DBI module. For this program, I also had to have the DBD::mysql module installed to access the database. Line 6 pulls in the Statistics::Descriptive module, just so I don't have to remember how to do the mean and max values, and because I was also considering the standard deviation, but ended up not using it. All these modules are found in the CPAN.
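
If you haven't met Statistics::Descriptive before, its interface is about as simple as it gets; a minimal sketch (with made-up numbers) looks like this:

    use Statistics::Descriptive;
    my $stat = Statistics::Descriptive::Full->new;
    $stat->add_data(3, 1, 4, 1, 5, 9);                       # feed it any list of numbers
    printf "mean %.3f max %.3f\n", $stat->mean, $stat->max;  # 3.833 and 9.000
    ## standard_deviation() is there too, though it goes unused below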

Lines 10 through 12 provide the configuration constants. In particular, the $DSN value defines the DBI connection information, while $DB_AUTH gives the username and password for the particular database. Obviously, these values have been slightly smudged for security reasons. And $DAYS gives the number of whole days this report will cover. We're making and processing three arrays of this value times 86400 (the number of seconds in a day), so don't get too anxious to crank it up to a large value.

Lines 16 and 17 set up the database connection, including enabling automatic ``die on error'' handling, so I don't have to check the valid return status of each and every database call. I routinely enable MySQL's big table mode, because I've been burned when I didn't do that, although it might not make much of a difference for this program.

Line 18 is for debugging. When I'm not sure what I'm passing to DBI, or what values are getting returned, I can uncomment this line to get a tracing of operations.

Lines 20 to 25 prepare the SQL query to fetch the data of interest. The return value is a ``statement handle'', which allows me to carry out the calculation and obtain the results. The parameter is a double-quoted string, using the qq{...} notation so I can easily balance the brackets nicely with my text editor. The one value that is interpolated is the number of days to fetch backwards in time. MySQL's date functions are a bit odd to wrangle, and took me a few tries to get right.

For my web server log, the when column is a date/time stamp of the transaction, wall is the wall-clock seconds (as an integer) from start to end, bytes is the number of bytes transferred for the hit, and the four cpu values represent the user, system, child user, and child system CPU as floating point numbers.

The request grabs the Unix timestamp value (seconds since the beginning of 1970 in London), the wall clock seconds, and the sum of the cpu usage in seconds, for all entries that have been posted in the past $DAYS number of days. The result is ordered for no apparent reason, except that during testing I was adding things like LIMIT 10 and wanted a consistently understandable output.

Line 27 computes the Unix timestamp corresponding to the beginning of the result. Note that if the clocks of the database server and the machine this process runs on are out of sync, this is probably not going to work well. Luckily, I was running this on the very same machine, so there's no chance of the values being out of sync.

Line 29 starts the database query humming, and may even fetch all the results into a temporary area.

Line 30 declares the result to have four output values for the columns, stored directly into the four named scalars. This keeps the interface simple, as well as very fast.
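
The \my(...) in line 30 may look odd: a backslash in front of a list distributes itself across the list, yielding one reference per freshly declared scalar, which is exactly the list of references bind_columns wants. A tiny demonstration of just that idiom:

    ## the backslash distributes over the list: same as (\$a, \$b, \$c)
    my @refs = \my ($a, $b, $c);
    ${ $refs[1] } = "filled in later";   # writing through the ref updates $b
    print $b, "\n";                      # prints "filled in later"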

Lines 32 to 34 declare the three accumulating arrays, for number of hits, total bytes transferred, and total CPU seconds used. These arrays will grow to be 86400 times $DAYS elements each. Since @bytes and @hits are always integers, I considered playing with a vec()-based data structure, but decided to optimize along those lines only if the straight solution didn't fit in memory. Never optimize prematurely, especially when you are facing deadlines.
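
For what it's worth, the vec()-based structure I was contemplating would have looked something like this (a sketch only, since I never went down that road):

    ## pack per-second hit counts into one string, 32 bits per counter,
    ## rather than keeping a full Perl scalar for every second
    my $hits_packed = "";
    vec($hits_packed, 42, 32) += 1 for 1 .. 3;   # bump the counter for second 42, three times
    print vec($hits_packed, 42, 32), "\n";       # prints 3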

Lines 36 to 46 process the database response data. As each row is returned (indicated by a true value back from fetch), the values are placed directly into the four scalar variables named in line 30. The next step is to apportion the hits, bytes, and CPU seconds across the wall-clock spread. First, line 37 computes the initial array index. Line 38 divides the single hit evenly across one more second than the wall count. Since we're counting integer seconds, we've actually got a rounding error here, but the kludge of simply flattening out the request over a slightly larger rounded-up time period was satisfactory to me. And the +1 prevented a potential divide-by-zero error nicely.

Lines 39 and 40 adjust the bytes and CPU seconds to their per-second values, making it easier for lines 41 to 45 to accumulate that value across the spread of the range of seconds for this hit. I actually don't like the repetition of the code: I'm saying the same thing with different variables far too often here, and could probably have factored that out with a bit more effort. I certainly would have done that if there had been three more parameters per second to deal with, but the payback for just the items I have here didn't seem quite worth it.
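
Had I bothered with that factoring, the inner part of the loop might have collapsed into something like this (a sketch against the listing's variables, not code I actually ran):

    ## sketch: one pass over (accumulator array, total for this hit) pairs,
    ## standing in for listing lines 38 through 45
    my $span = $wall + 1;                      # seconds this hit is spread across
    for my $pair ([\@hits, 1], [\@bytes, $bytes], [\@cpu, $cpu]) {
      my ($array, $total) = @$pair;
      $array->[$_] += $total / $span for $when .. $when + $wall;
    }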

Once the database scan is complete, we disconnect in line 48 to let the server know we're done there.

Lines 51 through 53 fix the undefs that appear in the arrays of data for the timespans when there's no activity on my web server. Leaving these undef values in the array will trigger warnings later, as the undefs are treated as zero. Right result, but noisy, so a quick search-and-destroy takes care of it.
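
Since for flattens its list and aliases $_ to each element, those three statements could just as easily have been one; a trivial alternative:

    ## same effect as listing lines 51 through 53
    defined $_ or $_ = 0 for @bytes, @hits, @cpu;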

Lines 55 to 60 dump out the raw data in a pretty format. Note that the code is inside an if (0) control, so there's no execution of this code in the final program, but while I was testing, this code was handy, and I tend to leave my paint scaffolding lying around in case there's a chance I'll be doing more painting later. Note the use of the scalar value of localtime to quickly turn a Unix timestamp into a human-readable string.
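
That scalar matters: in list context, localtime hands back nine separate numbers, while in scalar context it returns a ready-made human-readable string. A quick illustration (the timestamp is arbitrary, and the rendering depends on your timezone):

    my $stamp = 1020000000;                 # some Unix timestamp
    my @fields = localtime($stamp);         # (sec, min, hour, mday, mon, year, ...)
    print scalar(localtime($stamp)), "\n";  # e.g. "Sun Apr 28 15:06:40 2002", give or take your timezone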

The real work happens in the final loop, in lines 62 to 68. This loop is executed three times (and yes, it was originally written as a cut-n-paste nightmare where I had the same thing three times with three different variables). Each time through the loop, $pair is a reference to a two-element array holding a label and its associated data arrayref. Inside the loop, we create a new statistics object (line 63), load the data into the object (line 64), and then compute the mean and the maximum by calling two different methods against the statistics object: $thing holds the method name, so $stat->$thing() calls mean or max as needed. Each output line is labeled with that method name and with the label of the item being processed from the outer loop, and the printf ensures that the output is nicely columnized.

And there you have it. The output for my webserver as I just ran this looks like:

    mean bytes/sec   2014.457
     max bytes/sec 207162.000
    mean  hits/sec      0.198
     max  hits/sec     25.524
    mean   cpu/sec      0.008
     max   cpu/sec      1.212

showing that at peak, I'm pushing out roughly a T-1's worth of data (207K bytes per second is about 1.6 megabits per second, and a T-1 carries 1.544 megabits per second), and yet on average, I'm using about 1% of the CPU for web service. That last value shouldn't be greater than one: there is obviously some discarded data about the specific allocation of CPU to particular seconds during the day, yielding a slightly bogus max CPU seconds per second.

I hope you enjoyed this discussion of how to make sure you're filling your pipes with your webserver. Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use DBI ();
        =6=     use Statistics::Descriptive;
        =7=     
        =8=     ## BEGIN CONFIG ##
        =9=     
        =10=    my $DSN = 'dbi:mysql:stonehenge_httpd';
        =11=    my $DB_AUTH = 'do_not:try_this_at_home';
        =12=    my $DAYS = 2;
        =13=    
        =14=    ## END CONFIG ##
        =15=    
        =16=    my $dbh = DBI->connect($DSN, (split ':', $DB_AUTH), { RaiseError => 1 });
        =17=    $dbh->do("SET OPTION SQL_BIG_TABLES = 1");
        =18=    ## $dbh->trace(2);
        =19=    
        =20=    my $sth = $dbh->prepare(qq{
        =21=      SELECT unix_timestamp(when), wall, bytes, cpuuser + cpusys + cpucuser + cpucsys
        =22=      FROM requests
        =23=      WHERE when >= DATE_SUB(NOW(), INTERVAL $DAYS DAY)
        =24=      ORDER BY when
        =25=    });
        =26=    
        =27=    my $offset = time - 86400 * $DAYS;
        =28=    
        =29=    $sth->execute();
        =30=    $sth->bind_columns(\my($when, $wall, $bytes, $cpu));
        =31=    
        =32=    my @hits;
        =33=    my @bytes;
        =34=    my @cpu;
        =35=    
        =36=    while ($sth->fetch) {
        =37=      $when -= $offset;
        =38=      my $hits = 1 / ($wall + 1);
        =39=      $bytes *= $hits;
        =40=      $cpu *= $hits;
        =41=      for ($when..($when + $wall)) {
        =42=        $hits[$_] += $hits;
        =43=        $bytes[$_] += $bytes;
        =44=        $cpu[$_] += $cpu;
        =45=      }
        =46=    }
        =47=    
        =48=    $dbh->disconnect;
        =49=    
        =50=    ## fix the undef's
        =51=    defined $_ or $_ = 0 for @bytes;
        =52=    defined $_ or $_ = 0 for @hits;
        =53=    defined $_ or $_ = 0 for @cpu;
        =54=    
        =55=    if (0) {                        # dump raw data
        =56=      for (0..$#bytes) {
        =57=        printf "%30s %10.1f %4.1f %4.1f\n",
        =58=          scalar(localtime($offset+$_)), $bytes[$_], $hits[$_], $cpu[$_];
        =59=      }
        =60=    }
        =61=    
        =62=    for my $pair ([bytes => \@bytes], [hits => \@hits], [cpu => \@cpu]) {
        =63=      my $stat = Statistics::Descriptive::Full->new;
        =64=      $stat->add_data(@{$pair->[1]});
        =65=      for my $thing (qw(mean max)) {
        =66=        printf "%4s %5s/sec %10.3f\n", $thing, $pair->[0], $stat->$thing();
        =67=      }
        =68=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.