Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 20 (Jan 2001)

[suggested title: 'Caching in your chips']

Dynamic content on a web site is cool. It keeps people coming back, and gives the appearance that there are people behind the scenes actively updating the site to provide new and improved information.

However, in the real world, you regret the day you added that SSI include directive to your homepage when you finally do something important enough that Slashdot notices you. Running a CGI script on every single webhit is a great way to get your Quake game interrupted.

So, what do you do? You could set up an include that pulls in a file updated from cron. But how often? Perhaps you get hits during the day but not so many at night, so why waste all that CPU updating a file that won't be examined for hours? Or maybe you haven't quite figured out that arcane crontab syntax yet. Also, you have to master the weirdness of having a script that runs under your own user ID create and update a file that the web user can read. Fun fun fun.
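
(For the record, such a cron-driven setup might look something like the following sketch. The schedule, the paths, and the make-snippet script are all hypothetical.)

  # crontab entry: rebuild the snippet every 15 minutes, running as you,
  # into a file that the webserver (and thus the SSI include) can read
  */15 * * * * /home/merlyn/bin/make-snippet >/home/merlyn/public_html/snippet.html

and then in the homepage itself:

  <!--#include file="snippet.html" -->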

Or, somewhere in between, you can have the CGI script itself maintain a cache. If the cached value is recent enough, you pop it up; if it's too old, you generate the new value, saving it for the next hit. The problem there is that this leads to nearly the worst of both worlds, because you must trade off a long cache time against an occasional expensive delay (possibly frustrating the visitor).
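
Here's a minimal sketch of that in-between, two-state approach, so you can see what we're improving on. The cache path, the maximum age, and the regenerate() routine are all made up for illustration.

  #!/usr/bin/perl -w
  use strict;

  my $CACHE = "/tmp/my-two-state-cache";        # hypothetical cache file
  my $MAX_AGE = 60;                             # seconds before it's "too old"

  sub regenerate { sleep 5; scalar localtime }  # stand-in for the expensive part

  my $mtime = (stat $CACHE)[9];                 # undef if the cache doesn't exist
  if (defined $mtime and time - $mtime < $MAX_AGE) {
    open CACHE, "<$CACHE" or die "open: $!";
    print "Content-type: text/plain\n\n", <CACHE>;
  } else {
    my $value = regenerate();                   # the visitor waits right here
    open CACHE, ">$CACHE" or die "create: $!";
    print CACHE $value;
    close CACHE;
    print "Content-type: text/plain\n\n", $value;
  }

The delay in regenerate() lands on whichever unlucky visitor arrives first after the cache expires, which is exactly the tradeoff the rest of this column is about softening.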

So, a while back, I came up with a three-state caching idea. (As I started finalizing this column, I realized I wasn't the only one to have thought of it, but it had been rattling around in my head for quite a while.)

Keep a cache of the expensive computation. For the ``good'' phase, the cache is served as-is. For the ``bad'' phase, the computation is performed in-line, cached, and the new value returned.

But here's the fun part. Define a ``stale'' period between the two, for as long as you can stand it, where the current cache value is served, but a background process gets launched to update the cache value.

Then, when your system is fairly busy, an occasional background process gets triggered to refresh the cache, but your customers are all served in real time. And during the periods when your page gets less-frequent hits, an occasional visit gets delayed, but not many of them if you pick the parameters correctly.

Sounds hard? Not at all, once you see how it's done. And that's covered in [listing one, below].

Lines 1 through 3 start most of the programs I write, enabling taint checking, turning on warnings, enabling compiler restrictions, and unbuffering standard output.

Line 5 I added for debugging. Any fatal errors are sent as HTML to the browser. This line should definitely be removed for production, however, as it is a security risk, revealing potentially damaging information to a random remote user.

Lines 7 to 24 form the configuration part of this program. My programs are not intended to be ready-to-run scripts, but are instead demonstrations of the technology. However, I tend to put anything I'm likely to change at the beginning of the program for easy access.

Line 9 gives the location of a writable file within a writable directory (writable for the CGI user ID, that is). This is where the results will be cached. The directory must be writable by the CGI user ID (typically nobody for Apache under Linux) because a new file will be renamed over this file. Here, I'm using /tmp. Do not include $$ in this value, because the name must be consistent regardless of the process ID of the invocation!

Lines 10 and 11 define the windows of time (in seconds) for the two phases of age relating to caching. If the file is less than $STALE seconds old, it's current, and we simply show it and move along. If the file is between $STALE and $DEAD seconds old, we show it, but put an updating process in the background, so that the next hit (after the computation is complete) will get a current file. If the file is older than $DEAD seconds, we force an update in the foreground. Choose these numbers wisely, and you'll have both happy visitors and a reasonable load average.
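
For example (and these numbers are pulled out of the air, not a recommendation), a page whose underlying data changes a few times an hour might use:

  my $STALE = 300;              # serve the cache untouched for five minutes
  my $DEAD = 3600;              # after an hour with no refresh, make someone wait

With those settings, a steady stream of visitors costs you at most one background recomputation every five minutes, and only a page that has sat idle for more than an hour makes anyone wait through the delay.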

Lines 13 to 20 generate the content. The MIME type is defined in line 13, while lines 15 to 20 define a subroutine that should put the content into the filehandle passed as a parameter. For this demo, we're just putting the time of day into a plain text file after delaying a few seconds to simulate computation time.

Line 22 will rarely be touched, but I put it here in the configuration section in case I wanted to play with it. It's the name of a file that can be renamed atomically (within the same filesystem) to $CACHE. Adding .tmp on the end is a common trick of mine for this task.

And now we get down to the main guts. I wrote the code in lines 26 to 59 almost as you see it on the first pass, creating a ``pseudocode'' directly in Perl. And it reads like the top-level algorithm. If the cache is good, we show it. Otherwise, if it's stale (but not yet dead), we see if we're the only one about to update it, and if so, fork. The parent shows the stale cache, while the child updates the cache in the background. If the fork fails, or the cache is older than stale, we update the cache in the foreground instead, and show that as current information. We hope this is rare, because it means the client will be waiting a while.

In the rare case that the cache is dead, but we can't get the lock to update it, it means someone else has also noticed the dead cache and is trying to fix it, so we loop back after a five second delay and hope all is now well. Perhaps I should count the number of times through that loop and do some error recovery, but I didn't think of that in time for this writing.
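
If you want to add that yourself, one way (just a sketch, not part of the listing as published) is to count outside the block and give up after a few passes, reusing the close_cache() routine defined later in the listing:

  my $retries = 0;      # would go just above line 26
  {
    ## ... lines 27 through 54 exactly as in the listing ...

    ## we cannot do anything about a bad cache, so retry, but not forever
    close_cache();
    die "cache never recovered\n" if ++$retries >= 5;
    sleep 5;
    redo;
  }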

And then we come to the implementation of all those not-so-pseudo-code subroutine calls. I've broken the implementations into logical chunks, each inside its own BEGIN block to permit initialization of private variables shared by the routines in that chunk (much like static variables in C). Lines 61 to 131 provide routines that relate to the existing cache, lines 133 to 155 relate to the temporary output cache, and finally lines 157 to 177 handle the fork-related activities.

Lines 64 and 65 define the cache variables. The $cache_handle holds the IO object for the filehandle. And $cache_mtime holds the modification time as the internal Unix Epoch offset in seconds.

Lines 67 through 80 provide the predicates to test whether the cache is good enough for the caller. As a side effect, the cache also gets opened, so that the modification time we test stays consistent with the contents we may or may not eventually dump.

Lines 82 through 92 open and close the cache file, and set the cached modification time upon open.

Lines 94 through 112 deal with the HTTP protocol and caching at the browser level. When a web response includes a ``last-modified'' header, browsers typically remember that time, and include it on later requests to the same URL as an ``if-modified-since'' header. Most CGI scripts ignore these headers, but for a few extra lines of code, you can prevent unnecessary computations and save transmission bytes.
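
A typical exchange (with a made-up date) looks like this. Our first response carries

  Last-modified: Mon, 06 Nov 2000 08:49:37 GMT

and the browser's next request for the same URL carries

  If-Modified-Since: Mon, 06 Nov 2000 08:49:37 GMT

which shows up to a CGI script as the HTTP_IF_MODIFIED_SINCE environment variable, and which lets us answer with a tiny ``304 Not Modified'' response instead of the whole body.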

Lines 94 through 100 check the incoming ``if-modified-since'' header to see if it's at least as recent as the modification timestamp of our current cache. As the date format, while RFC'ed, is somewhat messy to parse, we'll pull in the HTTP::Date module from LWP (in the CPAN) to handle it for us. The return value in $time is again in Unix Epoch offset seconds, making it easy to compare against the modification time. A true return value means the browser should be told that it already has the current version, and to run with it.
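
If you haven't seen HTTP::Date before, here's the flavor of it, using the well-known example date from the HTTP specification:

  use HTTP::Date;                    # part of the LWP distribution

  my $epoch = str2time("Sun, 06 Nov 1994 08:49:37 GMT");
  print "$epoch\n";                  # 784111777, a plain Unix timestamp
  print time2str($epoch), "\n";      # back to "Sun, 06 Nov 1994 08:49:37 GMT"

str2time also copes with the two older date formats that the HTTP RFCs still permit, which is precisely the messiness we're happy to hand off.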

The other two routines (at lines 102 and 108, respectively) generate the appropriate ``last-modified'' and ``expires'' header dates. Again, these are clues to the browsers that the data can and should be cached, and for how long.

And then we get to the real workhorse. Lines 114 to 131 define the show_cache_and_exit routine. The single input parameter hints at whether we are showing fresh data or settling for leftovers; it's saved away in line 115, and added to the output in line 119 as a non-standard (X- prefixed) response header.

Although the cache should already be open, we verify this in line 117. A little defensive programming never hurt.

Lines 121 to 128 decide whether the browser will get a full display or just a notification that they've already got the goods. If the qualifications for the incoming ``if-modified-since'' header are met, a 304 response is all they get. Otherwise, we show a last-modified time, an expires time, the content-type for the response, and then dump the content.

Now, back to creating the cache, on the handle defined in line 136. Both this handle and the other handle are created using a reference to a localized glob. Smoke and mirrors abound. If you feel more comfortable using the normal IO::File module, you may do so.
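
For instance, an IO::File version of this temporary-cache handle might look like the sketch below (with the $TMP value from the configuration section); the listing itself sticks with the glob reference.

  use IO::File;                 # bundled with modern Perls

  my $TMP = "/tmp/merlyn-cacheddate.tmp";       # as in the configuration
  my $cache_tmp_handle = IO::File->new(">> $TMP")
    or die "Cannot create $TMP: $!";

Everything downstream (the flock, the seek, the print, the close) works on it unchanged, because an IO::File object is still just a filehandle underneath.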

Lines 138 to 144 attempt to get an exclusive lock on the temporary file, creating it in the process if needed. First, we pull in the Fcntl module to define the flocking constants (in lines 139 and 140). This is nearly the equivalent of:

  use Fcntl qw(LOCK_EX LOCK_NB);

except that, because this happens at runtime, the compiler doesn't know that those two names are subroutines, so we use parens (in line 143) to force them to be parsed as subroutine calls.

The return value is true if we managed to get an exclusive flock, and false if someone else has it. The non-blocking flock ensures we won't wait around for it.

Lines 146 to 153 handle the cache updating. First, we presume we've got an exclusive flock before we get here, so it's safe to empty the file. In fact, most of the time, there shouldn't be anything in the file, unless a process that was trying to write to the file aborted before renaming it, but to be safe, we clean it out in lines 147 and 148.

The task defined in the configuration is next called in line 149, being passed the handle to the temporary file. There's no protocol here to indicate failure, so it's just full speed ahead regardless of breakage.

Once the file has been written, an atomic rename brings the temporary file over the top of the cache file in one fell swoop, and we close the two handles after that, also releasing the flock. There's a small window here where a second process might have opened the cache file, decided it was too old, then flocked a brand-new temporary file while this process was doing its rename, but the result will be excessive computation, not a failure of the cache invariants. It's always important to err on the side of safety with regard to caches.

Finally, a few utility subroutines to handle the background process, starting in line 158. Line 160 holds the process ID of the child process from the parent's perspective, simply for testing.

Lines 162 to 164 cause the process to fork (if possible) returning a true value if successful.

Line 167 identifies the parent process, presuming it's called after the previous subroutine (not checked for). I almost called this subroutine who's_your_daddy, but decided that would peg the cute-o-meter, and reconsidered.

Finally, lines 170 to 175 handle the child process requirements for the webserver. First, we ignore any signals (to keep from getting triggered by an untimely SIGPIPE or other nasty thing) and then we'll close STDOUT. Failure to close STDOUT generally keeps Apache waiting around for us to finish before releasing the browser, and that'd be nasty.

So there you have it, a model of how to have three-phase cached data. Consider using it for those little slowly-changing status icons on your page: an image/png is just as valid in the little cache as a text/plain.
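
For example, if you happen to have the GD module installed (an assumption on my part; any code that can write a PNG to a filehandle would do), the configuration section might become something like this, with the rest of the program untouched:

  my $CONTENT_TYPE = "image/png";

  sub WRITE_TASK_TO {
    my $handle = shift;                 # must write to this handle

    require GD;                         # assumes GD is available
    my $im = GD::Image->new(140, 20);
    my $white = $im->colorAllocate(255, 255, 255);  # first color is the background
    my $black = $im->colorAllocate(0, 0, 0);
    $im->string(GD::Font->Small, 2, 2, scalar localtime, $black);
    binmode $handle;
    print $handle $im->png;
  }

The cache neither knows nor cares what kind of bytes it's holding. Until next time, enjoy!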

Listings

        =1=     #!/usr/bin/perl -Tw
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use CGI::Carp qw(fatalsToBrowser); # development only
        =6=     
        =7=     ## CONFIG
        =8=     
        =9=     my $CACHE = "/tmp/merlyn-cacheddate";
        =10=    my $STALE = 5;
        =11=    my $DEAD = 60;
        =12=    
        =13=    my $CONTENT_TYPE = "text/plain";
        =14=    
        =15=    sub WRITE_TASK_TO {
        =16=      my $handle = shift;           # must write to this handle
        =17=    
        =18=      sleep 5;                      # simulate some computation
        =19=      print $handle scalar localtime;
        =20=    }
        =21=    
        =22=    my $TMP = "$CACHE.tmp";         # probably don't need to mess with
        =23=    
        =24=    ## END CONFIG
        =25=    
        =26=    {
        =27=      ## main loop
        =28=    
        =29=      if (cache_is_good()) {
        =30=        show_cache_and_exit("current");
        =31=      }
        =32=      if (cache_is_stale()) {
        =33=        if (i_am_the_writer()) {
        =34=          if (i_can_fork()) {
        =35=            if (i_am_the_parent()) {
        =36=              show_cache_and_exit("stale");
        =37=            }
        =38=            ## child does:
        =39=            be_a_child();
        =40=            update_cache();
        =41=            exit 0;
        =42=          }
        =43=          ## cannot fork, so it's up to me
        =44=          update_cache();
        =45=          show_cache_and_exit("current");
        =46=        }
        =47=        ## I'm not the writer, so show old cache
        =48=        show_cache_and_exit("stale");
        =49=      }
        =50=      ## cache is dead
        =51=      if (i_am_the_writer()) {
        =52=        update_cache();
        =53=        show_cache_and_exit("current");
        =54=      }
        =55=      ## we cannot do anything about a bad cache, so retry
        =56=      close_cache();
        =57=      sleep 5;
        =58=      redo;
        =59=    }
        =60=    
        =61=    BEGIN {
        =62=      ## current-cache-related stuff
        =63=    
        =64=      my $cache_handle = \do {local *HANDLE};
        =65=      my $cache_mtime;
        =66=    
        =67=      sub cache_is_good {
        =68=        cache_not_older_than($STALE);
        =69=      }
        =70=    
        =71=      sub cache_is_stale {
        =72=        not cache_is_good() and
        =73=          cache_not_older_than($DEAD);
        =74=      }
        =75=    
        =76=      sub cache_not_older_than {
        =77=        my $seconds = shift;
        =78=        open_cache() or return 0;
        =79=        defined $cache_mtime and time - $cache_mtime < $seconds;
        =80=      }
        =81=    
        =82=      sub open_cache {
        =83=        return 0 unless
        =84=          defined fileno($cache_handle) or
        =85=            open $cache_handle, $CACHE;
        =86=        ($cache_mtime) = (stat $cache_handle)[9];
        =87=        1;
        =88=      }
        =89=    
        =90=      sub close_cache {
        =91=        close $cache_handle;
        =92=      }
        =93=    
        =94=      sub not_modified {
        =95=        return 0 unless defined(my $ims = $ENV{HTTP_IF_MODIFIED_SINCE});
        =96=        require HTTP::Date;
        =97=    
        =98=        my $time = HTTP::Date::str2time($ims);
        =99=        $time >= $cache_mtime;
        =100=     }
        =101=   
        =102=     sub modified_date {
        =103=       require HTTP::Date;
        =104=   
        =105=       HTTP::Date::time2str($cache_mtime);
        =106=     }
        =107=   
        =108=     sub expires_date {
        =109=       require HTTP::Date;
        =110=   
        =111=       HTTP::Date::time2str($cache_mtime + $DEAD);
        =112=     }
        =113=   
        =114=     sub show_cache_and_exit {
        =115=       my $status = shift;
        =116=   
        =117=       open_cache() or die "cache missing: $!";
        =118=   
        =119=       print "X-cache-status: $status\n";
        =120=   
        =121=       if (not_modified()) {
        =122=         print "Status: 304 Not Modified\n\n";
        =123=       } else {
        =124=         print "Last-modified: ", modified_date(), "\n";
        =125=         print "Expires: ", expires_date(), "\n";
        =126=         print "Content-type: $CONTENT_TYPE\n\n";
        =127=         print while <$cache_handle>;
        =128=       }
        =129=       exit 0;
        =130=     }
        =131=   }
        =132=   
        =133=   BEGIN {
        =134=     ## output-cache-related related stuff
        =135=   
        =136=     my $cache_tmp_handle = \do {local *HANDLE};
        =137=   
        =138=     sub i_am_the_writer {
        =139=       require Fcntl;
        =140=       Fcntl->import(qw(LOCK_EX LOCK_NB));
        =141=   
        =142=       open $cache_tmp_handle, ">>$TMP" or die "Cannot create $TMP: $!";
        =143=       flock $cache_tmp_handle, LOCK_EX() | LOCK_NB();
        =144=     }
        =145=   
        =146=     sub update_cache {
        =147=       truncate $TMP, 0 or die "truncate: $!";
        =148=       seek $cache_tmp_handle, 0, 0;
        =149=       WRITE_TASK_TO($cache_tmp_handle);
        =150=       rename $TMP, $CACHE or die "Cannot rename: $!";
        =151=       close $cache_tmp_handle;
        =152=       close_cache();
        =153=     }
        =154=   
        =155=   }
        =156=   
        =157=   BEGIN {
        =158=     ## forking-related stuff
        =159=   
        =160=     my $kid_pid;
        =161=       
        =162=     sub i_can_fork {
        =163=       defined ($kid_pid = fork);
        =164=     }
        =165=   
        =166=     sub i_am_the_parent {
        =167=       $kid_pid > 0;
        =168=     }
        =169=   
        =170=     sub be_a_child {
        =171=       require sigtrap;
        =172=       sigtrap::->import(qw(handler __IGNORE__ any));
        =173=   
        =174=       close STDOUT;
        =175=     }
        =176=   
        =177=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.