Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 34 (Mar 2002)

[suggested title: Caching those database accesses]

Sometimes, solving little problems can be fun. You stare at the project requirements, then stare at the available tools, and figure out how to bridge the gap from the tools to the problem solution.

But sometimes I get frustrated when I'm treading new ground, because the task needs to be done ``yesterday''. So I'm always on the lookout for little snippets showing how to solve a particular style of problem that I can reuse. With this in mind, I'd like to share with you some snippets that I hope you can reuse, since I spent a bit of time inventing them in the first place.

In one of my many roles, I'm working with Lincoln Stein (best known as the creator of CGI.pm) to create and modify software to support data exchange for genetics research, including the famous Human Genome project. The Gene-Ontology database (http://www.geneontology.org) provides a ``dictionary'' so all the researchers can use the same terminology when talking about similar discoveries. I was tasked with providing an ``explorer-style'' interface to browsing and searching this database. What I had available to me was an interface through the Ace Perl modules (maintained by Lincoln) to the database, and I had to respond to browsers using an Apache::Registry-compatible ``CGI'' script.

Each entry in the database is a term with four aspects. The details of the aspects are not important, except that each linked to zero or more other terms. Each term also has a unique, unchanging ID. The database needed to be ``connected'', and once connected, could return a term object for a given ID, or perform a query to return a list of term objects meeting a particular condition. The object contained a reference back to the database connection, and therefore could not be marshalled properly to disk for sharing; this was the start of my problems. Also, to get the objects along each of the four aspects, I had to start with an object, not an ID.

To make this fast, I wanted to minimize the number of hits to the database. But since I was working in a web environment, that meant sharing the object information, and that meant marshalling the information to disk. I couldn't marshall the object itself (since it contained the reference to the database). But I could at least marshall the aspects. Since the database was slow-moving, I could retain the list of other IDs for a given aspect on disk for, say, 10 minutes, and then additional identical queries would use my cache instead of a database query. Within a particular Apache process, I could also cache the mapping between an ID and a particular object, keeping the recently used objects in memory. I employed both of these techniques.

For cleanliness, I also wanted to ensure that my interface module never returned an actual object to the caller, but just some object IDs. That way, the caller could safely store the value as a key in a hash, which was very handy when writing the tree-walking code, but I'll save that for another discussion.

Wow. So many constraints, and so few lines of code. To summarize, I'll have a connection to the database that I can use to convert an ID into a term object, and then cache that mapping in memory. I'll also provide ways of taking a term ID and querying for the four aspects, caching that to disk so that many processes share it.

Let's see how that works, by looking at [Listing one, below].

Line 1 defines the package name for this module, expected to be pulled in with use GO in the main code, and therefore found as GO.pm somewhere along the active @INC value.

Line 2 turns on the compiler restrictions.

Lines 4 and 5 perform the equivalent of use Exporter qw(import), but I ran into problems doing it that way (which I can't recall now), so I did each step behind the scenes.
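The effect can be sketched in a few lines of standalone Perl. Here MyModule and hello are hypothetical names used only for this sketch, not part of the GO module:

```perl
package MyModule;
use strict;
require Exporter;
*import = \*Exporter::import;   # alias Exporter's import() into this package
use vars qw(@EXPORT);
@EXPORT = qw(hello);
sub hello { "hello from MyModule" }

package main;
MyModule->import;               # what "use MyModule" would do at compile time
print hello(), "\n";            # hello() is now visible in main
```

The glob aliasing makes MyModule->import dispatch straight to Exporter's import routine, which then copies the names in @EXPORT into the caller's package.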

Line 7 pulls in the Ace module, with connection parameters given in line 8. Obviously, if you don't have an Ace database running, this won't connect on your box.

Line 10 sets up the @EXPORT variable. Most modules initialize this variable all at once. But I was adding a few subroutines at a time to my module during testing, so I decided to add them to the export list right next to their definition instead of at the top of the file.

Lines 12 through 18 set up the on-disk cache to share the computed aspects for a given term, using the wonderful Cache::Cache package from the CPAN. Here, I'm specifying that a given cache item is valid for only 10 minutes. Additionally, once an hour, any expired items that haven't been fetched are purged to save disk space. The $cache object will be used to interface to this cache.

Lines 20 to 28 manage the database connection, starting it up if needed. The value of $db is held privately, but returned any time the db routine is called. The call to status ensures that the handle is not only defined but also still active.
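Stripped of the Ace specifics, this is a lazily created, self-checking handle held in a closure. In this sketch, connect_to_backend and is_alive are hypothetical stand-ins for Ace->connect and the status check:

```perl
use strict;

{
  my $handle;                          # private to this block, like $db

  sub handle {
    # reuse the handle only if it exists and still looks healthy
    return $handle if $handle and is_alive($handle);
    $handle = connect_to_backend() or die "Cannot connect!";
  }
}

# hypothetical stand-ins so the sketch runs on its own
sub connect_to_backend { return { connected => 1 } }
sub is_alive { return $_[0]->{connected} }

my $first  = handle();               # connects
my $second = handle();               # reuses the same handle, no reconnect
```

Because $handle lives in the bare block, nothing outside can touch it except through the handle routine, which reconnects transparently if the old connection has gone stale.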

Lines 30 to 50 manage the in-memory cache between ID names and the objects from the database. The $timeout value defines the next time value at which this cache is flushed, ensuring that new queries don't return excessively stale information. Again, there are trade-offs here between performance, space, and freshness of information that should have been carefully thought through, but I just typed 600 (10 minutes worth of seconds) and moved on.
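The flush itself is a small list-assignment trick: assigning a single scalar to a list whose first element is $timeout and whose remainder is the hash resets the deadline and empties the cache in one statement. A minimal standalone sketch:

```perl
use strict;

my %cache    = (a => 1, b => 2);    # pretend these are cached objects
my $interval = 600;                 # ten minutes, as in the listing
my $timeout  = 0;                   # pretend the deadline has already passed

# One statement does two jobs: the single value on the right goes to
# $timeout, leaving nothing for %cache, which is therefore emptied.
($timeout, %cache) = time + $interval if $timeout < time;

print scalar(keys %cache), "\n";    # prints 0 - the cache was flushed
```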

The node_to_name routine takes an object returned from the Ace database and extracts its string name, while remembering it in the cache. If we have an object and return its string name, the caller will very likely perform a query against that name soon, so this cache is important.

Similarly, name_to_node looks for the object with that name, going out to the database to fetch it if it's not in the cache. Note that in both cases, if %name_to_node were zeroed on each call, the routines would still work. This is important, because that's the initial state, and caching should work even with nothing in the cache.

Lines 52 through 54 define a little helper routine that came in handy when I wanted to return a list in a list context, or the first (possibly only) element of that list in a scalar context. This routine is not exported, because that's not consistent with the purpose of the GO module interface.
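wantarray makes this a one-liner; here's a standalone sketch of the same helper:

```perl
use strict;

sub one_or_many {
  wantarray ? @_ : $_[0];   # whole list in list context, first item otherwise
}

my @all   = one_or_many(qw(alpha beta gamma));   # list context: all three
my $first = one_or_many(qw(alpha beta gamma));   # scalar context: "alpha"
print "@all\n$first\n";
```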

Similarly, lines 56 to 58 define another helper routine to ensure that a returned list is entirely nice strings rather than nasty node objects.

Lines 60 to 72 define the common pattern for cache management for the later routines that take a given node ID and return a list of node IDs along one of the aspects or meeting a particular condition. The $cacheid parameter defines a unique string under which the return value is cached. Often, this was the node ID plus some distinguishing identifier, but occasionally, just the identifier. This should become clearer as we see the subroutine being used. The $coderef parameter defines how to generate these values if the cache isn't valid.

The basic strategy is to take the cache ID and try to fetch it. If the cache is valid (meaning that something was stored into the cache under that particular cache ID within the past 10 minutes), then the cached value is an array ref, which is dereferenced to get a list of node IDs (in line 71). However, if the cache misses, we call the coderef to generate the list of return values, and also store that into the cache so that identical queries in the next 10 minutes (in this process or others) can avoid calling the coderef. A simpler version of this with a lot less hassle can be found in the Memoize module from the CPAN, but I wanted precise control over the storage used, and over the mapping from arguments to objects and back.
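Reduced to its skeleton, the routine is a ``fetch, or compute and store'' wrapper. In this sketch, a plain hash stands in for the Cache::FileCache object, and the expensive lookup is a hypothetical coderef:

```perl
use strict;

my %fake_cache;        # a plain hash standing in for Cache::FileCache
my $calls = 0;         # counts how often the expensive coderef actually runs

sub cache_list_at {
  my $cacheid = shift;                   # unique key for this query
  my $coderef = shift;                   # how to compute the list on a miss
  my $cached  = $fake_cache{$cacheid};
  unless ($cached) {
    $cached = [ $coderef->(@_) ];        # compute the list...
    $fake_cache{$cacheid} = $cached;     # ...and remember it for next time
  }
  wantarray ? @$cached : $cached->[0];
}

# a hypothetical slow lookup
my $expensive = sub { $calls++; map "node_$_", @_ };

my @first  = cache_list_at("demo 1 2", $expensive, 1, 2);   # miss: runs coderef
my @second = cache_list_at("demo 1 2", $expensive, 1, 2);   # hit: from cache
print "coderef ran $calls time(s)\n";                       # prints 1
```

The real version adds expiry (handled by Cache::FileCache) and the node-to-name conversion, but the control flow is the same.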

So, let's see how this is used, starting with root_names in line 74. The cache ID is root_names, no matter what the arguments to this subroutine are. If cache_one_or_many_names_at finds a valid cache, it returns that. However, if the cache misses, the anonymous subroutine defined in line 78 is called. This subroutine uses an Ace query to get the terms without ``parents'' at the top of the tree. The query returns objects, which the nodes_to_names call in line 68 turns into names, caching the objects in memory. Those names are then cached to disk in line 69, and the list is returned to the caller.

Similarly, the find_terms routine starting in line 82 takes a query for a particular term name, and caches the resulting query. This term can contain wildcards, so all things ending in wall can be found with a query of *wall, for example. Note that the cache ID now contains find_terms followed by the query pattern. This is because the cache must be specific to the query being performed.

The four aspects (i_parents, c_parents, i_children, and c_children) come next, with far too much repetition in the code for me to be entirely happy. Had there been two or three more of these, I would have made the items table-driven. But in each case, we take a node ID, convert it to the object, make the object call to get the list of other objects, then convert those back to node IDs to return to the caller, caching the result in the on-disk cache to save hits to the Ace database.
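Here's roughly what that table-driven version might have looked like: a hash maps each exported name to the Ace method it calls, and a loop installs the subs into the symbol table. This is a sketch, not code from the listing; name_to_node and cache_one_or_many_names_at are dummy stand-ins here so it runs on its own:

```perl
use strict;

# hypothetical stand-ins for the real helpers in the listing;
# they just echo what would have been done
sub name_to_node { return { name => $_[0] } }
sub cache_one_or_many_names_at {
  my ($cacheid, $coderef, @args) = @_;
  return $coderef->(@args);              # caching omitted in this sketch
}

my %aspect_method = (
  i_parents  => 'Instance_of',
  c_parents  => 'Component_of',
  i_children => 'Instance',
  c_children => 'Component',
);

while (my ($subname, $method) = each %aspect_method) {
  no strict 'refs';                      # needed to install subs by name
  *{$subname} = sub {
    cache_one_or_many_names_at
      ("$subname $_[0]",
       sub { "would call $method on " . name_to_node(shift)->{name} }, $_[0]);
  };
}

print i_parents("GO:0005618"), "\n";   # prints "would call Instance_of on GO:0005618"
```

Four nearly identical subroutines collapse into one loop, and adding a fifth aspect would be a one-line change to the table.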

Finally, the code to convert a node ID to a text description comes last, starting in line 120. Again, the node ID is converted to an object, which is then asked for its Term, which is another object, which again gets cached.

So, while the code is probably not reusable directly, I hope you can see the strategies I used to perform a two-level cache (both memory for the unmarshallable objects, and disk for the sharing of IDs), as well as some nice tricks to regularize the code. Until next time, enjoy!

Listing

        =1=     package GO;
        =2=     use strict;
        =3=     
        =4=     require Exporter;
        =5=     *import = \*Exporter::import;
        =6=     
        =7=     use Ace;
        =8=     my @CONNECT = qw(-host localhost -port 2005);
        =9=     
        =10=    use vars qw(@EXPORT); @EXPORT = ();
        =11=    
        =12=    use Cache::FileCache;
        =13=    my $cache = Cache::FileCache->new
        =14=      ({namespace => 'GO',
        =15=        username => 'nobody',
        =16=        default_expires_in => '10 minutes',
        =17=        auto_purge_interval => '1 hour',
        =18=       });
        =19=    
        =20=    {
        =21=      my $db;                       # database handle - cannot be cached
        =22=    
        =23=      push @EXPORT, 'db';
        =24=      sub db {
        =25=        return $db if $db and $db->status->{directory};
        =26=        $db = Ace->connect(@CONNECT) or die "Cannot connect!";
        =27=      }
        =28=    }
        =29=    
        =30=    {
        =31=      my %name_to_node;             # object cache - cannot be memoized
        =32=      my $timeout_interval = 600;   # constant
        =33=      my $timeout = time + $timeout_interval; # "flush the cache" sentinel
        =34=    
        =35=      push @EXPORT, 'node_to_name';
        =36=      sub node_to_name {
        =37=        ($timeout, %name_to_node) = time + $timeout_interval if $timeout < time;
        =38=        my $node = shift;
        =39=        my $name = $node->name;
        =40=        $name_to_node{$name} = $node;
        =41=        $name;
        =42=      }
        =43=    
        =44=      push @EXPORT, 'name_to_node';
        =45=      sub name_to_node {
        =46=        ($timeout, %name_to_node) = time + $timeout_interval if $timeout < time;
        =47=        my $name = shift;
        =48=        $name_to_node{$name} ||= db->fetch(GO_term => $name);
        =49=      }
        =50=    }
        =51=    
        =52=    sub one_or_many {
        =53=      wantarray ? @_ : $_[0];
        =54=    }
        =55=    
        =56=    sub nodes_to_names {
        =57=      map node_to_name($_), @_;
        =58=    }
        =59=    
        =60=    sub cache_one_or_many_names_at {
        =61=      my $cacheid = shift;          # prefix
        =62=      my $coderef = shift;          # given names, returns 0..n nodes
        =63=      my $cached = $cache->get($cacheid);
        =64=      if ($cached) {
        =65=        ## warn "** cache hit at $cacheid\n";
        =66=      } else {
        =67=        ## warn "** cache miss at $cacheid\n";
        =68=        $cached = [nodes_to_names($coderef->(@_))];
        =69=        $cache->set($cacheid, $cached);
        =70=      }
        =71=      one_or_many(@$cached);
        =72=    }
        =73=    
        =74=    push @EXPORT, 'root_names';
        =75=    sub root_names {
        =76=      cache_one_or_many_names_at
        =77=        ("root_names",
        =78=         sub { db->fetch(-query => 'find GO_term !Parent', -fill => 1)});
        =79=    }
        =80=    
        =81=    push @EXPORT, 'find_terms';
        =82=    sub find_terms {
        =83=      my $term = shift;
        =84=      $term =~ tr/"//d;
        =85=      cache_one_or_many_names_at
        =86=        ("find_terms $term",
        =87=         sub { db->fetch(-query => 'find GO_term where term = "'.shift().'"') },
        =88=         $term);
        =89=    }
        =90=    
        =91=    push @EXPORT, 'i_parents';
        =92=    sub i_parents {
        =93=      cache_one_or_many_names_at
        =94=        ("i_parents $_[0]",
        =95=         sub { (name_to_node shift)->Instance_of }, $_[0]);
        =96=    }
        =97=    
        =98=    push @EXPORT, 'c_parents';
        =99=    sub c_parents {
        =100=     cache_one_or_many_names_at
        =101=       ("c_parents $_[0]",
        =102=        sub { (name_to_node shift)->Component_of }, $_[0]);
        =103=   }
        =104=   
        =105=   push @EXPORT, 'i_children';
        =106=   sub i_children {
        =107=     cache_one_or_many_names_at
        =108=       ("i_children $_[0]",
        =109=        sub { (name_to_node shift)->Instance }, $_[0]);
        =110=   }
        =111=   
        =112=   push @EXPORT, 'c_children';
        =113=   sub c_children {
        =114=     cache_one_or_many_names_at
        =115=       ("c_children $_[0]",
        =116=        sub { (name_to_node shift)->Component }, $_[0]);
        =117=   }
        =118=   
        =119=   push @EXPORT, 'a_term';
        =120=   sub a_term {
        =121=     cache_one_or_many_names_at
        =122=       ("a_term $_[0]",
        =123=        sub { (name_to_node shift)->Term }, $_[0]);
        =124=   }
        =125=   
        =126=   1;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.