Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 17 (Oct 2000)

[Suggested title: Throttling your web server]

The webserver for www.stonehenge.com is a nicely configured Linux box (of course) located at a nice co-location facility and maintained by my ISP. I share the box with a dozen other e-commerce clients (mostly because I've been too lazy to move the server to a new solitary box), and that keeps me and everyone else on our toes about overloading the server, because we all have to share.

I bought a digital camera some large number of months ago, and started putting nearly every picture I took up on the site. I've got a nice mod_perl picture handler to show the thumbnails, provide the navigation, and even generate half-size images on the fly using PerlMagick.

However, as I put more and more pictures online, I started to notice some pretty creepy CPU loads from time to time. Worse than that, my ISP neighbors were also starting to complain. After investigation, I determined that I was getting hit by not-so-nice ``spiders'': web programs that recursively (and rapidly) fetch the contents of many pages given a few starting points. I believe most of these to be people on fast data connections (like my current cable modem that brings the equivalent of 2 T-1's into my house for $40 per month, yes!) innocently asking their web browser to download a whole area.

So, rather than pull my pictures offline, I decided to implement throttling. I didn't care so much about transfer bandwidth as I did about CPU, so I chose to track recent CPU activity for each visitor. Of course, HTTP has no concept of a ``session'', so I took a very easy shortcut: tracking by IP address. Yes, I know, I've ranted in discussion forums a lot about how an IP address is not a user. But for the purpose of throttling, it seemed the most expedient choice.

Once I put my throttler in place, no IP address is allowed to suck more than 7% of my CPU over a period of 15 seconds. Once the CPU threshold is reached, any additional request is met with a 503 error (service unavailable), which according to RFC2616 (the HTTP/1.1 specification) also allows me to give a ``retry after'' value of 15 seconds to advise the program that this was a temporary condition.

The throttler consists of two related mod_perl handlers: an ``access'' handler to note whether or not the IP address is currently permitted, and a ``log'' handler to track the CPU used by the transfer. Additionally, there's an external program triggered by cron to clean up the status files needed by the handlers.

So, let's take a look at the handlers in [LISTING ONE, below].

Line 1 puts the module into Stonehenge::Throttle. I use Stonehenge as a private prefix for all my local mod_perl goodies, to keep it separate from any CPAN-installed modules. Because mod_perl shares the namespace across all modules, it's very important to have a workable naming allocation to keep things from colliding.

Line 2 selects the critically important compiler restrictions. Designing code for mod_perl handlers requires careful attention to details, and the use strict restrictions are a good start to that.

Line 4 reminds me that this module needs to be installed as a PerlAccessHandler by giving the appropriate syntax. I have it selected at the top-level configuration file of my site, but if I had wanted it only for the pictures directory, I could have put the access handler inside a Directory or Files restriction, or even an .htaccess file in a subdirectory.

Lines 6 through 9 define some configuration constants. Line 6 is a directory that must be writable by the web userid (in my case, nobody). This directory will hold the historical information about CPU usage.

Line 8 defines the seconds in which we compute CPU history. If we make this too large, the throttling will be slow to react. If we make it too small, it'll be a knee-jerk reaction. I've tweaked this number up and down from time to time, but the current number is 15 as shown here. Line 9 defines how much CPU a particular IP address is allowed to consume, in percent, over the period of time given by $WINDOW. I found the 7 percent solution to be appropriate.
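
To make the units concrete, here is a minimal sketch of the arithmetic behind those two constants. The helper name cpu_budget_hundredths is hypothetical (it does not appear in the listing); the values are the ones from lines 8 and 9. Because the history is kept in hundredths of a CPU second, multiplying the window in seconds by the percentage gives the budget directly in those units.

```perl
use strict;
use warnings;

# Same values as the listing's configuration constants.
my $WINDOW = 15;               # seconds of interest
my $DECLINE_CPU_PERCENT = 7;   # CPU percent allowed within the window

# Hypothetical helper: CPU budget for one window, in hundredths of a
# CPU second. Seconds times percent is already in hundredths:
# 7% of 15 seconds = 1.05 seconds = 105 hundredths.
sub cpu_budget_hundredths {
    my ($window, $percent) = @_;
    return $window * $percent;
}

my $budget = cpu_budget_hundredths($WINDOW, $DECLINE_CPU_PERCENT);
printf "budget: %d hundredths (%.2f CPU seconds per %d-second window)\n",
    $budget, $budget / 100, $WINDOW;
```

This is the same product that the listing compares against in line 72.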

Lines 11 and 12 define a version string, which can be queried using the mod_perl maintenance tools, as well as being in the right format should I ever get around to submitting this to CPAN. The string comes from an RCS keyword, so I just check the file out and in and get the right version number automatically.

Lines 14 through 16 pull in some standard constants and modules from the mod_perl interface.

Line 18 begins the handler called on each requested transfer. Line 19 is commented out, but when enabled, uses my Stonehenge::Reload module to automatically reload this module whenever it changes. Since I'm pretty happy with the stability of this module, I've commented the line out. (Stonehenge::Reload hasn't been published, even though I've now referred to it in a few of my other published works. Perhaps someday soon I should talk about it, I suppose.)

Line 21 fetches the incoming request. This will be the Apache request object, as defined by the mod_perl interface. Line 22 ignores anything that is not the initial request generated by an external query. This keeps internal lookups (like to get the MIME type for a directory index) from accidentally triggering the throttler. Line 23 grabs a log object for later use.

Lines 25 to 28 get the hostname of the remote client, and perform some slight massaging. If the hostname is my ISP, it means I'm performing some request directly, and I sure don't want to be throttling myself. Also, I decided that all Google fetches should be charged to the same host, even though they appear to be coming from different hosts. Yes, I throttle even Google if it gets too sucky on my pages.
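
The same massaging can be tried outside the server. The following is only a sketch: throttle_key is a hypothetical helper name, and the sample hostnames are made up for illustration. The patterns themselves are the ones from lines 26 through 28.

```perl
use strict;
use warnings;

# Hypothetical helper: returns nothing for hosts that are never
# throttled, otherwise the name under which CPU usage is charged.
sub throttle_key {
    my ($host) = @_;
    return if $host =~ /\.(holdit|stonehenge)\.com$/;        # exempt (line 26)
    return if $host =~ /\.metronomicon\.com$/;               # exempt (line 27)
    return "googlebot.com" if $host =~ /\.googlebot\.com$/;  # one shared bucket
    return $host;
}

# Sample hostnames, invented for this demonstration.
for my $host (qw(crawl1.googlebot.com www.stonehenge.com 10.1.2.3)) {
    my $key = throttle_key($host);
    print "$host => ", (defined $key ? $key : "(exempt)"), "\n";
}
```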

Lines 30 through 33 set up a few variables that will be needed for both this handler, and the ``log'' handler that will be set up later. We'll note the filename of the CPU history file, the flagfile indicating the host is currently blocked, and the current CPU usage for both this process and its children.

Lines 35 through 90 ``push'' a log handler. This technique allows one handler phase to create a handler for another phase ``on the fly''. More importantly, it allows me to share the values of some of the variables into the later phase.
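
The variable sharing is ordinary Perl closure behavior, and it can be seen without a web server. In this sketch the array @log_phase is only a stand-in for the server's queue of log-phase handlers, and access_phase is a hypothetical name; none of this is the mod_perl API itself.

```perl
use strict;
use warnings;

my @log_phase;   # stand-in for the server's log-phase handler queue

# Hypothetical "access phase": it queues a closure to run later.
sub access_phase {
    my ($host) = @_;
    my @start = times;            # captured by the closure below
    push @log_phase, sub {
        my @now = times;
        my $used = 0;
        $used += $now[$_] - $start[$_] for 0 .. 3;
        return "$host used $used CPU seconds";
    };
    return "access phase done for $host";
}

print access_phase("10.1.2.3"), "\n";
# Later, the "log phase" runs whatever was queued. The closure still
# sees the $host and @start that existed when it was created.
print $_->(), "\n" for @log_phase;
```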

Line 40 subtracts the earlier output of the times operator (saved in line 32) from its current output, element by element. Lines 41 to 43 compute the sum total of CPU used, and round it off to the nearest hundredth of a second. Line 44 posts a notice in the error log, which I used for debugging, but have commented out now.
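
Here is a small standalone sketch of that measurement. The helper name cpu_hundredths is hypothetical; it takes two snapshots of the four values that times returns (user, system, child user, child system) and applies the same rounding as line 43. The busy loop is there only to burn some CPU for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: total the four CPU deltas between two "times"
# snapshots and convert to whole hundredths of a second.
sub cpu_hundredths {
    my ($before, $after) = @_;    # array refs, each holding a "times" result
    my $total = 0;
    $total += $after->[$_] - $before->[$_] for 0 .. 3;
    return int(100 * ($total + 0.005));
}

my @before = times;
my $x = 0;
$x += sqrt $_ for 1 .. 200_000;   # burn a little CPU
my @after = times;
print "used ", cpu_hundredths(\@before, \@after), " hundredths of a CPU second\n";
```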

Lines 45 to 48 add this CPU usage as an eight-byte value to the end of a history file. The first four bytes define the timestamp second at which the observation is being taken, and the last four bytes are the CPU seconds in units of hundredths of a second. The advantage of this format is that it's very easy to go back from that to a value (no decimal conversion) and an append will always be atomic, so there's no need to flock the file!
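
A minimal sketch of the record format follows. The helper name append_record and the temporary file are assumptions made for the demonstration; the pack template is the one from line 47. Each record goes out in a single syswrite to a file opened for append, which is the condition the no-locking argument relies on.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: append one (timestamp, CPU hundredths) record
# as two 32-bit unsigned values in native byte order.
sub append_record {
    my ($file, $time, $cpu_hundredths) = @_;
    open my $fh, ">>", $file or return;
    my $written = syswrite $fh, pack("LL", $time, $cpu_hundredths);
    close $fh;
    return $written;              # 8 on success
}

# A throwaway file stands in for the per-host history file.
my (undef, $file) = tempfile(UNLINK => 1);
append_record($file, time, 42);
append_record($file, time, 7);
print "history file is ", (-s $file), " bytes\n";   # two 8-byte records
```

Since the records are in native byte order, the file is only meant to be read back on the same machine that wrote it.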

The rest of the log handler determines whether future requests should be blocked or not. First, line 50 defines the beginning of the window of interest. If there's already a current blockfile, lines 52 through 59 note that and exit the log handler, so we don't even have to think very hard.

Lines 62 to 70 walk the history file, grabbing each eight byte string as a separate entry, converting it back to the timestamp and CPU used. For all the entries that occur within the window, we'll figure a total CPU. Older entries are ignored.
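
The walk over the history file can be sketched on its own like this. The helper name cpu_in_window, the temporary file, and the sample timestamps are all made up for illustration; the eight-byte read and the unpack template follow lines 65 and 66.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: total the CPU (in hundredths) recorded at or
# after $startwindow.
sub cpu_in_window {
    my ($file, $startwindow) = @_;
    open my $fh, "<", $file or return;
    binmode $fh;
    my $total = 0;
    while ((read $fh, my $buf, 8) == 8) {
        my ($time, $cpu) = unpack "LL", $buf;
        next if $time < $startwindow;     # older than the window: ignore
        $total += $cpu;
    }
    close $fh;
    return $total;
}

# Build a small sample history of (timestamp, CPU hundredths) pairs.
my ($out, $file) = tempfile(UNLINK => 1);
binmode $out;
print {$out} pack("LL", @$_) for [100, 50], [110, 30], [120, 25];
close $out;

print cpu_in_window($file, 105), "\n";    # prints 55: only 110 and 120 count
```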

Lines 72 to 76 determine if the CPU is below the throttling percentage, and if so, remove any blockfile that may be present, thus letting future transactions proceed unthrottled (until the CPU is overused again).

But if we make it to line 78, we've got an IP address out there that has exceeded our threshold. Lines 79 to 81 grab the load average for logging purposes only. Line 83 likewise grabs the user agent for the log. (I've used this to determine if I should categorically deny bad user agents based on name rather than action.) And line 86, well, 86's them from the establishment by creating an empty blockfile. (The presence or absence of the blockfile is all that matters to the access handler.)

So, that's it for the log handler. Back in the access handler starting in line 94, we look for the blockfile that the log handler manages. If it's there, and new enough, we're blocking. Line 97 adds a clue for the client that we do indeed want them to come back, but just not right away. Line 98 triggers the 503 error and aborts any further access within this transfer.
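
The blockfile test is just a comparison of the file's modification time against the window. Here is a sketch under a few assumptions: is_blocking is a hypothetical helper, the blockfile lives in a temporary directory, and the print statements merely stand in for the real 503 response.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: true if $blockfile exists and was modified
# within the last $window seconds.
sub is_blocking {
    my ($blockfile, $window) = @_;
    my @stat = stat $blockfile or return 0;      # no file, no block
    return $stat[9] > time() - $window ? 1 : 0;  # element 9 is the mtime
}

my $dir = tempdir(CLEANUP => 1);
my $blockfile = "$dir/10.1.2.3-blocked";
open my $fh, ">", $blockfile or die "Cannot create $blockfile: $!";
close $fh;

print is_blocking($blockfile, 15) ? "503, Retry-After: 15\n" : "pass\n";

# Pretend the block was set a minute ago.
utime time() - 60, time() - 60, $blockfile;
print is_blocking($blockfile, 15) ? "503, Retry-After: 15\n" : "pass\n";
```

An old blockfile is harmless: it simply stops matching once its modification time falls outside the window.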

And that's the mod_perl side of things. But now we have these neat little CPU history files being created in $HISTORYDIR, and there's nothing in either handler to clean them up. And I can't add anything there, because the only time the file should be removed is when there's nothing happening, but the only time I'm in a handler is when something is happening!

So, there's a little program invoked from cron on a regular basis, using a crontab entry similar to:

  3-59/10 * * * * /home/merlyn/lib/Apache/throttle-cleaner

which invokes the program I present in [LISTING TWO, below] every 10 minutes on minutes that end on 3 (3, 13, 23, etc). I try to invoke my cron stuff on unlikely minutes to avoid crowding with all those lusers that use precise multiples of 5 or 15. Bleh.
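
To double-check which minutes a field like 3-59/10 selects, here is a one-line sketch that mimics the range-with-step rule in plain Perl (it does not parse crontab syntax; the numbers are copied from the entry above):

```perl
use strict;
use warnings;

# Start at 3, stay within 3..59, and step by 10.
my @minutes = grep { ($_ - 3) % 10 == 0 } 3 .. 59;
print "@minutes\n";   # prints: 3 13 23 33 43 53
```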

Because this is a standalone program, we've got the ``sh-bang'' line, with warnings turned on in line 1. Line 2 is the normal compiler restrictions.

Line 6 defines the same directory as the $Stonehenge::Throttle::HISTORYDIR, so if I change one, I need to change the other. It won't help to delete files that aren't in the same place. Line 7 similarly needs to be at least twice as large as the throttling window.

Lines 9 through 17 skip through the directory, looking for any file that has not been accessed in at least $SECS. For blocking files, this means that we've not seen a transaction since the blocking started. (Good, they went away permanently.) For history files, it means that we've not seen a transaction recently. In either case, the information is no longer of use, so we can destroy the file (in line 16).

And there you have it: a mechanism to keep people from making your ISP-neighbors mad at you. As a testimony to its value, I recently got ``slashdotted'' by having my pictures archive for ``YAPC 19100'' mentioned on www.slashdot.org. My hits per hour went to 20 times their normal pace for about 36 hours after the mention, and yet the loadaverage never got above 1 or 2 during the entire ordeal. So, I've now survived a slashdot attack.

Another success story comes from one of my clients: a Very Large on-line toys and games e-tailer. They told me that they had seen an earlier version of my throttler mentioned on the mod_perl mailing list, and had put it in place (with some modifications) during the past Christmas buying rush. And amazingly enough, it caught many attempts by people accidentally or deliberately attempting to download their entire online catalog for offline browsing: something that would be both useless and prohibitively expensive. Without the throttle, they might have lost literally millions of dollars. They did in fact buy me dinner for that. Thank you.

I'm interested to hear how this kind of code saved your bacon, so if you adapt it, let me know. Until next time, enjoy!

Listings

        =0=     ################ LISTING ONE ################
        =1=     package Stonehenge::Throttle;
        =2=     use strict;
        =3=     
        =4=     ## usage: PerlAccessHandler Stonehenge::Throttle
        =5=     
        =6=     my $HISTORYDIR = "/home/merlyn/lib/Apache/Throttle";
        =7=     
        =8=     my $WINDOW = 15;                # seconds of interest
        =9=     my $DECLINE_CPU_PERCENT = 7; # CPU percent in window before we 503 error
        =10=    
        =11=    use vars qw($VERSION);
        =12=    $VERSION = (qw$Revision$ )[-1];
        =13=    
        =14=    use Apache::Constants qw(OK DECLINED);
        =15=    use Apache::File;
        =16=    use Apache::Log;
        =17=    
        =18=    sub handler {
        =19=      ## use Stonehenge::Reload; goto &handler if Stonehenge::Reload->reload_me;
        =20=    
        =21=      my $r = shift;                # closure var
        =22=      return DECLINED unless $r->is_initial_req;
        =23=      my $log = $r->server->log;    # closure var
        =24=    
        =25=      my $host = $r->get_remote_host; # closure var
        =26=      return DECLINED if $host =~ /\.(holdit|stonehenge)\.com$/;
        =27=      return DECLINED if $host =~ /\.metronomicon\.com$/; # poor purl
        =28=      $host = "googlebot.com" if $host =~ /\.googlebot\.com$/;
        =29=    
        =30=      my $historyfile = "$HISTORYDIR/$host-times"; # closure var
        =31=      my $blockfile = "$HISTORYDIR/$host-blocked"; # closure var
        =32=      my @delta_times = times;      # closure var
        =33=      my $fh = Apache::File->new;   # closure var
        =34=    
        =35=      $r->push_handlers
        =36=        (PerlLogHandler =>
        =37=         sub {
        =38=    
        =39=           ## record this CPU usage
        =40=           @delta_times = map { $_ - shift @delta_times } times;
        =41=           my $cpu_hundred = 0;
        =42=           $cpu_hundred += $_ for @delta_times;
        =43=           $cpu_hundred = int 100*($cpu_hundred + 0.005);
        =44=           ## $log->notice("throttle: $host got $cpu_hundred/100 in this slot"); # DEBUG
        =45=           open $fh, ">>$historyfile" or return DECLINED;
        =46=           my $time = time;
        =47=           syswrite $fh, pack "LL", $time, $cpu_hundred;
        =48=           close $fh;
        =49=    
        =50=           my $startwindow = $time - $WINDOW;
        =51=    
        =52=           if (my @stat = stat($blockfile)) {
        =53=             if ($stat[9] > $startwindow) {
        =54=               ## $log->notice("throttle: $blockfile is already blocking"); # DEBUG
        =55=               return OK;           # nothing further to see... move along
        =56=             } else {
        =57=               ## $log->notice("throttle: $blockfile is old, ignoring"); # DEBUG
        =58=             }
        =59=           }
        =60=    
        =61=           # figure out if we should be blocking
        =62=           my $totalcpu = 0;        # scaled by 100
        =63=    
        =64=           open $fh, $historyfile or return DECLINED;
        =65=           while ((read $fh, my $buf, 8) > 0) {
        =66=             my ($time, $cpu) = unpack "LL", $buf;
        =67=             next if $time < $startwindow;
        =68=             $totalcpu += $cpu;
        =69=           }
        =70=           close $fh;
        =71=    
        =72=           if ($totalcpu < $WINDOW * $DECLINE_CPU_PERCENT) {
        =73=             ## $log->notice("throttle: $host got $totalcpu/100 CPU in $WINDOW secs"); # DEBUG
        =74=             unlink $blockfile;
        =75=             return OK;
        =76=           }
        =77=    
        =78=           ## about to be nasty... let's see how bad it is:
        =79=           open $fh, "/proc/loadavg";
        =80=           chomp(my $loadavg = <$fh>);
        =81=           close $fh;
        =82=    
        =83=           my $useragent = $r->header_in('User-Agent') || "unknown";
        =84=    
        =85=           $log->notice("throttle: $host got $totalcpu/100 CPU in $WINDOW secs, enabling block [loadavg $loadavg, agent $useragent]");
        =86=           open $fh, ">$blockfile";
        =87=           close $fh;
        =88=    
        =89=           return OK;
        =90=         });
        =91=    
        =92=      ## back in the access handler:
        =93=    
        =94=      if (my @stat = stat($blockfile)) {
        =95=        if ($stat[9] > time - $WINDOW) {
        =96=          $log->warn("throttle access: $blockfile is blocking");
        =97=          $r->header_out("Retry-After", $WINDOW);
        =98=          return 503;               # Service Unavailable
        =99=        } else {
        =100=         ## $log->notice("throttle access: $blockfile is old, ignoring"); # DEBUG
        =101=         return DECLINED;
        =102=       }
        =103=     }
        =104=   
        =105=     return DECLINED;
        =106=   }
        =107=   1;
        =0=     ################ LISTING TWO ################
        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     
        =4=     # $Id$
        =5=     
        =6=     my $DIR = "/home/merlyn/lib/Apache/Throttle";
        =7=     my $SECS = 360;                 # more than Stonehenge::Throttle $WINDOW
        =8=     
        =9=     chdir $DIR or die "Cannot chdir $DIR: $!";
        =10=    opendir DOT, "." or die "Cannot opendir .: $!";
        =11=    my $when = time - $SECS;
        =12=    while (my $name = readdir DOT) {
        =13=      next unless -f $name;
        =14=      next if (stat($name))[8] > $when;
        =15=      ## warn "unlinking $name\n";
        =16=      unlink $name;
        =17=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.