Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 63 (Sep 2004)

[suggested title: ``Caching Proxy Servers with Template Toolkit'']

In the previous three articles, I introduced my templating system of choice, Template Toolkit (TT). Because those articles were an overview, I didn't have much space to go into meaty, real examples. In this article, I'll look at how I use TT every day to help me manage the www.stonehenge.com website.

In the July 2002 edition of this column, I described in detail how I had placed www.stonehenge.com under CVS management, and had chosen to use TT to manage the slight variations required for the files between development versions and production versions. However, I glossed over (for lack of room) how I use TT to manage the variations between the front caching-reverse-proxy server and the backend heavy Apache mod_perl server, including making it easy to have many virtual servers with similar configurations. Let's look at that now.

A mod_perl-enabled application server generally caches many Perl subroutines and data structures in memory, trading that memory for the delay of reloading such structures from disk or recomputing them each time. In my case, a typical mod_perl process uses about 20 to 30 megabytes of memory. In the ``old days'', I let these fat 20MB processes take care of every request from a browser, including requests for images or static HTML files that really didn't need any mod_perl involved. Besides just wasting resources, this gets particularly nasty when large images are being downloaded over a slow link: the fat process is tied up for a number of seconds, not milliseconds.

Modern best practices suggest that requests that must be handled by mod_perl processes be somehow separated from those that needn't be. One popular strategy is the caching-reverse-proxy server. In this model, the incoming requests are handed to a thin proxy server. Some people use Squid for this, but I like using Apache to keep everything consistent. This proxy server sits in front of the real mod_perl server, and caches any results from the server if possible. Thus, the first hit might be a bit slow and wasteful, but every hit to the cached item is much faster because the backend mod_perl server isn't even consulted (or is consulted only to verify that the cache is up to date). And, if the proxy servers have a low memory footprint (I've been keeping mine around 1 to 2 megabytes per process), we can run many of them to reduce delivery latency.

Even fully dynamic pages benefit, because the data is squirted quickly from the backend server to the proxy server, freeing up the backend process to handle a different request. The proxy server can then dole out the page over the network. This makes a big difference with large dynamic pages being handed out over a slow network link, such as to a dialup end user or over a satellite connection.
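To put rough numbers on that argument, here is a back-of-the-envelope calculation written in Python purely for illustration. The page size, link speeds, client count, and backend count are all assumptions picked for the example; only the per-process memory figures come from the ranges quoted above.

```python
# Back-of-the-envelope figures; every number below is an illustrative assumption
# except the per-process sizes, which are midpoints of the ranges quoted above.

def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Time needed to push size_bytes over a link of the given speed."""
    return size_bytes * 8 / link_bits_per_second

PAGE_BYTES = 200 * 1024          # assumed size of a large dynamic page
DIALUP_BPS = 56_000              # assumed dialup link speed
LOOPBACK_BPS = 1_000_000_000     # assumed backend-to-proxy speed on one machine

dialup_seconds = transfer_seconds(PAGE_BYTES, DIALUP_BPS)
loopback_seconds = transfer_seconds(PAGE_BYTES, LOOPBACK_BPS)

FAT_MB = 25.0                    # midpoint of 20 to 30 MB per mod_perl process
THIN_MB = 1.5                    # midpoint of 1 to 2 MB per proxy process
SLOW_CLIENTS = 50                # assumed number of simultaneous slow downloads
BACKENDS = 5                     # assumed number of backend processes behind the proxy

# Without a proxy, each slow client pins one fat process for the whole download.
all_fat_mb = SLOW_CLIENTS * FAT_MB
# With a proxy, slow clients pin thin processes; a few fat ones do the real work.
with_proxy_mb = SLOW_CLIENTS * THIN_MB + BACKENDS * FAT_MB

print(f"fat process tied up without a proxy: {dialup_seconds:.1f} s")
print(f"fat process tied up behind a proxy:  {loopback_seconds * 1000:.1f} ms")
print(f"memory, every client on a fat process: {all_fat_mb:.0f} MB")
print(f"memory, thin proxies plus backends:    {with_proxy_mb:.0f} MB")
```

The exact figures matter less than the shape of the result: the slow link ties up whichever process is feeding it, so it pays to make that process the cheap one.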

Additionally, some requests can be diverted by the proxy server to be served directly from local files. If the proxy server and backend server are on the same machine, this is trivial when both servers share the same DocumentRoot directory: I simply let the request be satisfied directly instead of proxied.

But getting this proxy and backend configuration correct, including multiple virtual hosts, can be a real pain, especially when you consider that we have to make this work in a development environment as well. Thankfully, Template Toolkit lets me write the configuration once and reuse it in slightly different variations as needed.

My first convenient tool to manage this process is what I call my ``PB&J filter'' (with the obvious reference to Peanut Butter and Jelly rather intentional). I have a simple filter wrapper like so:

    [* BLOCK pbj_filter;
      # uncomment lines that begin with a marker such as #P#, #BJ# or #PJ#
      FOR line = content.split("\n");
        IF kind == "P";
          line = line.replace('^#\w*P\w*#\s*','');
        ELSIF kind == "J";
          line = line.replace('^#\w*J\w*#\s*','');
        ELSE; # kind == "B";
          line = line.replace('^#\w*B\w*#\s*','');
        END;
        line; "\n";
      END;
    END; # BLOCK pbj_filter
    *]

The variable kind is set (using code not shown here) to one of three values by examining various environment variables: ``P'' for the front proxy server, ``B'' for the backend server, and ``J'' for a joined single-server version (sometimes used for testing, but not in my production servers). I use the filter like this:

    [* MACRO module(mod_name_c, name_module, mod_name_so) BLOCK;
      IF env.MODULES_INTERNAL.search(mod_name_c);
        "AddModule "; mod_name_c;
      ELSE;
        "LoadModule "; name_module; " "; env.MODULES; "/"; mod_name_so;
      END;
    END;
    *]
    [* WRAPPER pbj_filter *]
    # [* module('mod_vhost_alias.c', 'vhost_alias_module', 'mod_vhost_alias.so') *]
    # [* module('mod_env.c', 'env_module', 'mod_env.so') *]
    [* module('mod_log_config.c', 'log_config_module', 'mod_log_config.so') *]
    # [* module('mod_log_agent.c', 'log_agent_module', 'mod_log_agent.so') *]
    # [* module('mod_log_referer.c', 'log_referer_module', 'mod_log_referer.so') *]
    # [* module('mod_mime_magic.c', 'mime_magic_module', 'mod_mime_magic.so') *]
    [* module('mod_mime.c', 'mime_module', 'mod_mime.so') *]
    #BJ# [* module('mod_negotiation.c', 'negotiation_module', 'mod_negotiation.so') *]
    [* module('mod_status.c', 'status_module', 'mod_status.so') *]
    [* module('mod_info.c', 'info_module', 'mod_info.so') *]
    #BJ# [* module('mod_include.c', 'include_module', 'mod_include.so') *]
    #BJ# [* module('mod_autoindex.c', 'autoindex_module', 'mod_autoindex.so') *]
    #BJ# [* module('mod_dir.c', 'dir_module', 'mod_dir.so') *]
    #BJ# [* module('mod_cgi.c', 'cgi_module', 'mod_cgi.so') *]
    # [* module('mod_asis.c', 'asis_module', 'mod_asis.so') *]
    # [* module('mod_imap.c', 'imap_module', 'mod_imap.so') *]
    # [* module('mod_actions.c', 'actions_module', 'mod_actions.so') *]
    # [* module('mod_speling.c', 'speling_module', 'mod_speling.so') *]
    # [* module('mod_userdir.c', 'userdir_module', 'mod_userdir.so') *]
    #BJ# [* module('mod_alias.c', 'alias_module', 'mod_alias.so') *]
    [* module('mod_rewrite.c', 'rewrite_module', 'mod_rewrite.so') *]
    #BJ# [* module('mod_access.c', 'access_module', 'mod_access.so') *]
    #BJ# [* module('mod_auth.c', 'auth_module', 'mod_auth.so') *]
    # [* module('mod_auth_anon.c', 'auth_anon_module', 'mod_auth_anon.so') *]
    # [* module('mod_auth_dbm.c', 'auth_dbm_module', 'mod_auth_dbm.so') *]
    # [* module('mod_auth_db.c', 'auth_db_module', 'mod_auth_db.so') *]
    # [* module('mod_digest.c', 'digest_module', 'mod_digest.so') *]
    #P# [* module('mod_proxy.c', 'proxy_module', 'libproxy.so') *]
    # [* module('mod_cern_meta.c', 'cern_meta_module', 'mod_cern_meta.so') *]
    #BJ# [* module('mod_expires.c', 'expires_module', 'mod_expires.so') *]
    # [* module('mod_headers.c', 'headers_module', 'mod_headers.so') *]
    # [* module('mod_usertrack.c', 'usertrack_module', 'mod_usertrack.so') *]
    # [* module('mod_unique_id.c', 'unique_id_module', 'mod_unique_id.so') *]
    #PJ# [* module('mod_setenvif.c', 'setenvif_module', 'mod_setenvif.so') *]
    #PJ# [* module('mod_ssl.c', 'ssl_module', 'libssl.so') *]
    #BJ# [* module('mod_perl.c', 'perl_module', 'libperl.so') *]
    [* END *]

Here, I'm defining the modules that will be used in my front or backend server. Any line that begins with a normal comment will remain commented for all versions, such as mod_speling. Any line that doesn't begin with a comment is live in all versions, such as mod_rewrite. But any line that begins with a hash mark, one or more alphabetic letters, and another hash mark, is essentially a conditional comment that will be uncommented on that particular variation of the file. Thus, mod_perl will be enabled in backend and single-server variations, while mod_ssl will be enabled only in proxy and single-server variations. The resulting last few lines for my backend server look like:

    #P# AddModule mod_proxy.c
    # AddModule mod_cern_meta.c
    AddModule mod_expires.c
    # AddModule mod_headers.c
    # AddModule mod_usertrack.c
    # AddModule mod_unique_id.c
    #PJ# AddModule mod_setenvif.c
    #PJ# AddModule mod_ssl.c
    AddModule mod_perl.c

and for the front proxy server look like:

    AddModule mod_proxy.c
    # AddModule mod_cern_meta.c
    #BJ# AddModule mod_expires.c
    # AddModule mod_headers.c
    # AddModule mod_usertrack.c
    # AddModule mod_unique_id.c
    AddModule mod_setenvif.c
    AddModule mod_ssl.c
    #BJ# AddModule mod_perl.c

This is a powerful way of expressing exactly how my front and back servers are similar, and yet different. TT helps me tremendously here, keeping me from having to maintain two separate files and try to keep them in sync.
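If you want to check the marker-matching rule outside of TT, the following Python sketch restates the filter's per-line logic. It is an illustration only, not part of my build: the sample lines are made up to mirror the module list above, and `splitlines()` only approximates how TT's `split` treats trailing newlines.

```python
import re

def pbj_filter(content: str, kind: str) -> str:
    """Strip the leading #...# marker from lines whose marker names this kind."""
    # As in the TT block, anything other than "P" or "J" is treated as "B".
    letter = kind if kind in ("P", "J") else "B"
    # "#", any word characters, the kind letter, any word characters, "#", spaces.
    marker = re.compile(r"^#\w*" + letter + r"\w*#\s*")
    return "".join(marker.sub("", line) + "\n" for line in content.splitlines())

# Hypothetical input, written as it would look after the module macro has run.
sample = "\n".join([
    "#P# AddModule mod_proxy.c",
    "# AddModule mod_headers.c",
    "#PJ# AddModule mod_ssl.c",
    "#BJ# AddModule mod_perl.c",
    "AddModule mod_rewrite.c",
])

backend = pbj_filter(sample, "B")
proxy = pbj_filter(sample, "P")
joined = pbj_filter(sample, "J")

print("--- backend ---")
print(backend, end="")
print("--- proxy ---")
print(proxy, end="")
```

A plain comment such as `# AddModule mod_headers.c` never matches, because the space after the first hash mark is not a word character, so the pattern cannot reach a closing hash mark.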

But the real gain comes when I get to a virtual server. For the proxy server, I use mod_rewrite to decide whether to serve a request locally, proxy it to the backend, or just forbid it. In the proxy server for www.stonehenge.com, I end up with:

    RewriteEngine On
    RewriteRule ^/icons/ - [last]
    RewriteRule ^/tt2/images/ - [last]
    RewriteMap escape int:escape
    RewriteRule ^/(.*)$ http://127.0.0.1:8081/${escape:$1} [proxy,noescape]
    ProxyPassReverse / http://127.0.0.1:8001/

This configuration causes requests under /icons/ and /tt2/images/ to be served directly by the proxy server (from a shared DocumentRoot). All other requests are repackaged as a proxy request to 127.0.0.1:8081, which is where the ``real'' www.stonehenge.com server lives, including the mod_perl support. The result will be cached if possible (that is, if it has a last-modified date or an expires date), and then returned to the client that made the request. In the backend server, these lines are absent for this virtual host.
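The decision those rewrite rules make can be modeled in a few lines of Python. This is only a sketch of the logic, not how mod_rewrite works internally: `route` and `LOCAL_PREFIXES` are names invented for the example, and `urllib.parse.quote` merely stands in for the `int:escape` map, whose exact set of escaped characters may differ.

```python
from urllib.parse import quote

# Path prefixes the proxy answers itself, taken from the two [last] rules.
LOCAL_PREFIXES = ("/icons/", "/tt2/images/")

def route(path: str, backend: str = "http://127.0.0.1:8081"):
    """Return ("local", path) or ("proxy", backend_url) for a request path."""
    if path.startswith(LOCAL_PREFIXES):
        # RewriteRule ... - [last]: leave the path alone and stop rewriting.
        return ("local", path)
    # Catch-all rule: re-escape everything after the leading slash and proxy it.
    return ("proxy", backend + "/" + quote(path[1:], safe="/"))

for path in ("/icons/folder.gif", "/merlyn/My Column.html", "/"):
    print(path, "->", route(path))
```

Rule order matters in the real configuration for the same reason the `if` comes first here: the catch-all pattern would otherwise swallow the local paths too.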

Every virtual host is built with a simple invocation of a make_virtual_server wrapper, like so:

    [* WRAPPER make_virtual_server
      name = "www.stonehenge.com" develport = 8001
      host = "www.stonehenge.com" port = 80
      backhost = "127.0.0.1" backport = 8081;
      WRAPPER pbj_filter *]
    #PJ# ## block bad robots like evil sitescooper
    #PJ# RewriteCond %{HTTP_USER_AGENT} ^sitescooper
    #PJ# RewriteRule ^ - [forbidden]
    #PJ# ## local services:
    #PJ# RewriteRule ^/icons/ - [last]
    #PJ# RewriteRule ^/tt2/images/ - [last]

    #BJ#  <IfModule mod_perl.c>
    #BJ#  <Perl>
    #BJ#  use lib "[* env.PREFIX *]/perl-lib";
    #BJ#  do "startup.pl";
    #BJ#  </Perl>
    #BJ#  </IfModule>
    #BJ#  ErrorDocument 404 /404.html
    [* END; END *]

The parameters at the top control the common Hostname and Listen and mod_rewrite lines added to proxy servers. The pbj_filter material provides additional per-virtual-host additions. With this in place, adding the virtual server for www.geekcruises.com was as simple as adding a few more lines to the configuration:

    [* WRAPPER make_virtual_server
      name = "www.geekcruises.com" develport = 8005
      alias = "geekcruises.com geekcruises.stonehenge.com"
      host = "geekcruises.stonehenge.com" port = 80
      backhost = "127.0.0.1" backport = 8083;
      WRAPPER pbj_filter *]
      DocumentRoot [* env.GEEKCRUISES_ROOT *]
    #BJ#  <IfModule mod_perl.c>
    #BJ#  <Perl>
    #BJ#  use lib "[* env.PREFIX *]/perl-lib";
    #BJ#  </Perl>
    #BJ#  PerlPostReadRequestHandler Stonehenge::MyPostReadRequest
    #BJ#  </IfModule>
    [* END; END *]

And thus geekcruises.com can use the same proxy caching servers and mod_perl backend servers as stonehenge.com, with minimal effort.

While I don't have room to include the full definition of make_virtual_server here, it was really just a matter of figuring out how to plug in the values given as parameters. For example, two lines from the mod_rewrite section look like:

    RewriteRule ^/(.*)$ http://[* "$backhost:$backport" *]/${escape:$1} [proxy,noescape]
    ProxyPassReverse / http://[* name *]:[* port *]/

And once that information is captured in a template, I can reuse the pattern again and again for each virtual server.
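The idea of plugging per-host parameters into one shared pattern can be shown with a small Python sketch. It is not the make_virtual_server definition, which is not reproduced here; `proxy_rule` is a hypothetical helper that fills in only the RewriteRule line, using the backhost and backport values from the two wrapper invocations above.

```python
def proxy_rule(backhost: str, backport: int) -> str:
    """Fill the backend address into the catch-all proxy RewriteRule."""
    # Doubled braces keep the literal ${escape:$1} out of Python's formatting.
    return (
        f"RewriteRule ^/(.*)$ http://{backhost}:{backport}/"
        f"${{escape:$1}} [proxy,noescape]"
    )

# Backend addresses as given in the two make_virtual_server invocations.
virtual_hosts = [
    {"name": "www.stonehenge.com", "backhost": "127.0.0.1", "backport": 8081},
    {"name": "www.geekcruises.com", "backhost": "127.0.0.1", "backport": 8083},
]

for vhost in virtual_hosts:
    print("#", vhost["name"])
    print(proxy_rule(vhost["backhost"], vhost["backport"]))
```

Each new virtual host then costs one more entry of data rather than another hand-edited copy of the rule.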

I hope this description has inspired you to consider an alternate means of maintaining those ``similar but different'' files, using the powerful Template Toolkit as your tool of choice. Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
