Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not the version produced after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Linux Magazine Column 34 (Mar 2002)
[suggested title: Caching those database accesses]
Sometimes, solving little problems can be fun. You stare at the project requirements, then stare at the available tools, and figure out how to bridge the gap from the tools to the problem solution.
But sometimes, I get frustrated when I'm treading new ground, because the task needed to be done ``yesterday''. So I'm always on the lookout for little snippets showing how to solve a particular style of problem that I can reuse. With this in mind, I'd like to share with you some snippets that I hope you can reuse, since I spent a bit of time inventing them in the first place.
In one of my many roles, I'm working with Lincoln Stein (best known as the creator of CGI.pm) to create and modify software to support data exchange for genetics research, including the famous Human Genome project. The Gene Ontology database (http://www.geneontology.org) provides a ``dictionary'' so all the researchers can use the same terminology when talking about similar discoveries. I was tasked with providing an ``explorer-style'' interface for browsing and searching this database. What I had available to me was an interface to the database through the Ace Perl modules (maintained by Lincoln), and I had to respond to browsers using an Apache::Registry-compatible ``CGI'' script.
Each entry in the database is a term with four aspects. The details of the aspects are not important, except that each links to zero or more other terms. Each term also has a unique, unchanging ID. The database needs to be ``connected'', and once connected, can return a term object for a given ID, or perform a query to return a list of term objects meeting a particular condition. The object contains a reference back to the database connection, and therefore cannot be marshalled properly to disk for sharing, and this was the start of my problems. Also, to get the objects along each of the four aspects, I had to start with an object, not an ID.
To make this fast, I wanted to minimize the number of hits to the database. But since I was working in a web environment, that meant sharing the object information, and that meant marshalling the information to disk. I couldn't marshall the object itself (since it contained the reference to the database). But I could at least marshall the aspects. Since the database was slow-moving, I could retain the list of other IDs for a given aspect on disk for say, 10 minutes, and then additional identical queries would use my cache instead of a database query. Within a particular Apache process, I could also cache the mapping between an ID and a particular object, keeping the recently used objects in memory. I employed both of these techniques.
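The two-level arrangement can be sketched generically. The following is a simplified illustration of the strategy (the names related_ids and fetch_related_objects are hypothetical stand-ins, not the column's actual code):

```perl
use strict;
use Cache::FileCache;

my $disk_cache = Cache::FileCache->new({ default_expires_in => '10 minutes' });
my %memory_cache;   # per-process: ID => live object (cannot be marshalled to disk)

# Hypothetical: fetch the list of related IDs for $id, consulting the
# shared on-disk cache first, then the database only on a miss.
sub related_ids {
  my $id = shift;
  my $cached = $disk_cache->get("related $id");
  return @$cached if $cached;                  # disk hit: no database access
  my @ids = map { $_->name } fetch_related_objects($id);  # hypothetical db call
  $disk_cache->set("related $id", \@ids);      # now shared with other processes
  @ids;
}
```

The key point is that only plain ID strings ever reach the disk cache; the live objects, which cannot survive marshalling, stay in the per-process hash.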
For cleanliness, I also wanted to ensure that my interface module never returned an actual object to the caller, but just some object IDs. That way, the caller could safely use the value as a key in a hash, which was very handy when writing the tree-walking code, but I'll save that for another discussion.
Wow. So many constraints, and so few lines of code. To summarize: I'll have a connection to the database, which I can use to convert an ID into a term object, caching that mapping in memory. I'll also provide ways of taking a term ID and querying for the four aspects, caching that to disk so that many processes share it.
Let's see how that works, by looking at [Listing one, below].
Line 1 defines the package name for this module, expected to be pulled in with use GO in the main code, and therefore found as GO.pm somewhere along the active @INC value.
Line 2 turns on the compiler restrictions.
Lines 4 and 5 perform the equivalent of use Exporter qw(import), but I ran into problems doing it that way (which I can't recall now), so I did each step behind the scenes.
Line 7 pulls in the Ace module, with connection parameters given in line 8. Obviously, if you don't have an Ace database running, this won't connect on your box.
Line 10 sets up the @EXPORT variable. Most modules initialize this variable all at once. But I was adding a few subroutines at a time to my module during testing, so I decided to add them to the export list right next to their definitions instead of at the top of the file.
Lines 12 through 18 set up the on-disk cache to share the computed aspects for a given term, using the wonderful Cache::Cache package from the CPAN. Here, I'm specifying that a given cache item is valid for only 10 minutes. Additionally, once an hour, any unfetched item that has also expired is purged to save disk space. The $cache object will be used to interface to this cache.
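In isolation, the Cache::Cache interface boils down to get and set. A minimal stand-alone use looks roughly like this (the namespace and key are just illustrative):

```perl
use strict;
use Cache::FileCache;

my $cache = Cache::FileCache->new
  ({namespace          => 'demo',        # keeps this app's entries separate
    default_expires_in => '10 minutes',  # entries silently vanish after this
   });

$cache->set(answer => 42);        # marshalled to disk, visible to other processes
my $value = $cache->get('answer');  # 42, or undef once the entry has expired
```

Because the backing store is the filesystem, every Apache child sees the same entries, which is exactly what the module-level cache here relies on.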
Lines 20 to 28 manage the database connection, starting it up if needed. The value of $db is held privately, but is returned any time the db routine is called. The query call to status ensures that the handle is not only defined but active as well.
Lines 30 to 50 manage the in-memory cache between ID names and the objects from the database. The $timeout value defines the next time value at which this cache is flushed, ensuring that new queries don't return excessively stale information. Again, there are trade-offs here between performance, space, and freshness of information that should have been carefully thought through, but I just typed 600 (10 minutes worth of seconds) and moved on.
The node_to_name routine takes an object returned from the Ace database and extracts its string name, while remembering the object in the cache. If we have an object and return its string name, the caller will very likely perform a query against that name very soon, so this cache is important.
Similarly, name_to_node looks for the object with that name, going out to the database to fetch it if it's not in the cache. Note that in both cases, if %name_to_node were zeroed on each call, the routines would still work. This is important, because that's the initial state, and caching should work even with nothing in the cache.
Lines 52 through 54 define a little helper routine that came in handy when I wanted to return a list in a list context, or the first (possibly only) element of that list in a scalar context. This routine is not exported, because that's not consistent with the purpose of the GO module interface.
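As a reminder of how such a context-sensitive helper behaves, here is a tiny stand-alone illustration of the wantarray idiom:

```perl
use strict;

sub one_or_many {
  wantarray ? @_ : $_[0];   # whole list in list context, first item otherwise
}

my @all   = one_or_many(qw(alpha beta gamma));  # ("alpha", "beta", "gamma")
my $first = one_or_many(qw(alpha beta gamma));  # "alpha"
```

The assignment target determines the context of the call, so the same subroutine serves both the ``give me everything'' and ``give me the one answer'' callers.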
Similarly, lines 56 to 58 define another helper routine to ensure that a returned list is entirely nice strings rather than nasty node objects.
Lines 60 to 72 define the common pattern of cache management for the later routines that take a given node ID and return a list of node IDs along one of the aspects or meeting a particular condition. The $cacheid parameter defines a unique string under which the return value is cached. Often, this is the node ID plus some distinguishing identifier, but occasionally just the identifier. This should become clearer as we see the subroutine being used.
The $coderef parameter defines how to generate the values if the cache isn't valid.
The basic strategy is to take the cache ID and try to fetch it. If the cache is valid (meaning that something was stored into the cache under that particular cache ID within the past 10 minutes), then the cached value is an array-ref, which is dereferenced to get a list of node IDs (in line 71). However, if the cache misses, we call the coderef to generate the list of return values, and also store that into the cache so that identical queries in the next 10 minutes (in this process or others) can avoid calling the coderef. A simpler version of this with a lot less hassle can be found in the Memoize module in the CPAN, but I wanted precise control over the storage used, and over the mapping from arguments to objects and back.
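For comparison, the hassle-free Memoize version would look roughly like this (expensive_query is a hypothetical stand-in for the Ace lookup, not code from the listing):

```perl
use strict;
use Memoize;

# Hypothetical stand-in for the expensive database lookup.
sub expensive_query {
  my $id = shift;
  sleep 1;                    # pretend this hits the database
  return "result for $id";
}

memoize('expensive_query');   # same-argument calls now return the cached result
```

But Memoize caches in memory per process, with no expiration and no sharing between Apache children, which is why the hand-rolled version with Cache::FileCache was worth the extra effort here.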
So, let's see how this is used, starting with root_names in line 74. The cache ID is root_names, no matter what the arguments to this subroutine are. If cache_one_or_many_names_at finds a valid cache, then it returns that. However, if the cache misses, the anonymous subroutine defined in line 78 is called. This subroutine uses an Ace query to get the terms without ``parents'' at the top of the tree. The query returns objects, which the nodes_to_names call in line 68 then turns into names, caching the objects in memory. Those names are then cached to disk in line 69, and the list is returned to the caller.
Similarly, the find_terms routine starting in line 82 takes a query for a particular term name, and caches the resulting query. This term can contain wildcards, so all things ending in wall can be found with a query of *wall, for example. Note that the cache ID now contains find_terms followed by the query pattern, because the cache must be specific to the query being performed.
The four aspects (i_parents, c_parents, i_children, and c_children) come next, with far too much repetition in the code for me to be entirely happy. Had there been two or three more of these, I would have made the items table-driven. But in each case, we take a node ID, convert it to the object, make the object call to get the list of other objects, then convert those back to node IDs to return to the caller, caching the result in the on-disk cache to save hits to the Ace database.
Finally, the code to convert a node ID to a text description comes last, starting in line 120. Again, the node ID is converted to an object, which is then asked for its Term, which is another object, which again gets cached.
So, while the code is probably not reusable directly, I hope you can see the strategies I used to perform a two-level cache (both memory for the unmarshallable objects, and disk for the sharing of IDs), as well as some nice tricks to regularize the code. Until next time, enjoy!
Listing
=1=     package GO;
=2=     use strict;
=3=
=4=     require Exporter;
=5=     *import = \*Exporter::import;
=6=
=7=     use Ace;
=8=     my @CONNECT = qw(-host localhost -port 2005);
=9=
=10=    use vars qw(@EXPORT); @EXPORT = ();
=11=
=12=    use Cache::FileCache;
=13=    my $cache = Cache::FileCache->new
=14=      ({namespace => 'GO',
=15=        username => 'nobody',
=16=        default_expires_in => '10 minutes',
=17=        auto_purge_interval => '1 hour',
=18=       });
=19=
=20=    {
=21=      my $db; # database handle - cannot be cached
=22=
=23=      push @EXPORT, 'db';
=24=      sub db {
=25=        return $db if $db and $db->status->{directory};
=26=        $db = Ace->connect(@CONNECT) or die "Cannot connect!";
=27=      }
=28=    }
=29=
=30=    {
=31=      my %name_to_node; # object cache - cannot be memoized
=32=      my $timeout_interval = 600; # constant
=33=      my $timeout = time + $timeout_interval; # "flush the cache" sentinel
=34=
=35=      push @EXPORT, 'node_to_name';
=36=      sub node_to_name {
=37=        ($timeout, %name_to_node) = time + $timeout_interval if $timeout < time;
=38=        my $node = shift;
=39=        my $name = $node->name;
=40=        $name_to_node{$name} = $node;
=41=        $name;
=42=      }
=43=
=44=      push @EXPORT, 'name_to_node';
=45=      sub name_to_node {
=46=        ($timeout, %name_to_node) = time + $timeout_interval if $timeout < time;
=47=        my $name = shift;
=48=        $name_to_node{$name} ||= db->fetch(GO_term => $name);
=49=      }
=50=    }
=51=
=52=    sub one_or_many {
=53=      wantarray ? @_ : $_[0];
=54=    }
=55=
=56=    sub nodes_to_names {
=57=      map node_to_name($_), @_;
=58=    }
=59=
=60=    sub cache_one_or_many_names_at {
=61=      my $cacheid = shift; # prefix
=62=      my $coderef = shift; # given names, returns 0..n nodes
=63=      my $cached = $cache->get($cacheid);
=64=      if ($cached) {
=65=        ## warn "** cache hit at $cacheid\n";
=66=      } else {
=67=        ## warn "** cache miss at $cacheid\n";
=68=        $cached = [nodes_to_names($coderef->(@_))];
=69=        $cache->set($cacheid, $cached);
=70=      }
=71=      one_or_many(@$cached);
=72=    }
=73=
=74=    push @EXPORT, 'root_names';
=75=    sub root_names {
=76=      cache_one_or_many_names_at
=77=        ("root_names",
=78=         sub { db->fetch(-query => 'find GO_term !Parent', -fill => 1)});
=79=    }
=80=
=81=    push @EXPORT, 'find_terms';
=82=    sub find_terms {
=83=      my $term = shift;
=84=      $term =~ tr/"//d;
=85=      cache_one_or_many_names_at
=86=        ("find_terms $term",
=87=         sub { db->fetch(-query => 'find GO_term where term = "'.shift().'"') },
=88=         $term);
=89=    }
=90=
=91=    push @EXPORT, 'i_parents';
=92=    sub i_parents {
=93=      cache_one_or_many_names_at
=94=        ("i_parents $_[0]",
=95=         sub { (name_to_node shift)->Instance_of }, $_[0]);
=96=    }
=97=
=98=    push @EXPORT, 'c_parents';
=99=    sub c_parents {
=100=     cache_one_or_many_names_at
=101=       ("c_parents $_[0]",
=102=        sub { (name_to_node shift)->Component_of }, $_[0]);
=103=   }
=104=
=105=   push @EXPORT, 'i_children';
=106=   sub i_children {
=107=     cache_one_or_many_names_at
=108=       ("i_children $_[0]",
=109=        sub { (name_to_node shift)->Instance }, $_[0]);
=110=   }
=111=
=112=   push @EXPORT, 'c_children';
=113=   sub c_children {
=114=     cache_one_or_many_names_at
=115=       ("c_children $_[0]",
=116=        sub { (name_to_node shift)->Component }, $_[0]);
=117=   }
=118=
=119=   push @EXPORT, 'a_term';
=120=   sub a_term {
=121=     cache_one_or_many_names_at
=122=       ("a_term $_[0]",
=123=        sub { (name_to_node shift)->Term }, $_[0]);
=124=   }
=125=
=126=   1;