Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 64 (Oct 2004)

[suggested title: ``Introduction to mod_perl (part 1)'']

Last month, I talked a bit about mod_perl, in terms of one of the things I need to manage about my web server. But I was reminded by a few of my reviewers that I had never really provided a good overview of mod_perl yet in any of my columns! Time to fix that.

As its name might suggest, mod_perl is an Apache module, which can be built into a statically or dynamically linked Apache server, or can be added later via the APXS mechanism. Generally, if your flavor of Unix has Apache, you probably have simple instructions for adding mod_perl already available. Macintoshes running OS X already have mod_perl built in, and there are also pre-built versions of Apache with mod_perl for Windows machines.

The Apache server goes through a number of phases while processing a request: parsing the incoming data stream; determining the resource requested by the URL; controlling access, authentication, and authorization; determining the MIME type of the response; serving the content; and logging what happened.

Most mod_whatevers apply to only one phase of the Apache process. For example, mod_cgi, which handles CGI scripts, deals exclusively with the content phase. And mod_auth_dbm deals with authentication using DBM files during the authentication phase. But mod_perl can be used to alter the behavior of all phases of the request! In other words, mod_perl exposes nearly the entire Apache API to Perl code, to extend and embed complex behaviors easily. This makes mod_perl far more powerful and useful than, say, mod_cgi or mod_php, which can impact only the content phase.

For example, Perl code can be used to manipulate the URL-to-filename mapping during the translation phase, allowing different content to be served depending on the time of day or origin of the requestor. Or you may want to control whether a given URL is even permitted based on the origin of the requestor, during the access phase. Or maybe authenticate against an LDAP database, or authorize based on a combination of specialized authentication and session tickets. Or maybe even log the CPU time of each request to a database, to see which dynamic requests are burdening your overloaded machine.

While some of this can be accomplished through the existing specialized modules, being able to script this in Perl gives a lot more flexibility, and is arguably easier to use and extend.

And not to be left out, mod_perl can also deliver dynamic content, and this is probably the primary use of mod_perl. One huge advantage of using mod_perl rather than mod_cgi to deliver content is that the Apache process does not have to fork and exec a new Perl program, causing that program to be parsed from scratch. Both the forking and new parsing take time and resources, substantially reducing the effectiveness of a given hardware configuration. I've heard stories of between 10x and 100x speed improvements just switching from mod_cgi to mod_perl (using Apache::Registry, which I'll get to in a moment). Additionally, database connections can be re-used, preventing costly authentication handshaking on each hit.

Though, as they say in the recent Spider-Man movies (and original comics, if I'm told correctly), ``With great power comes great responsibility''. Perl code that is used and re-used repeatedly must ``behave well'', and this does require a particular discipline and cooperation from all involved. Thus, mod_perl is not generally a suitable candidate for shared-hosting situations where you cannot trust every other user of a given Apache process.

By default, installing mod_perl into an Apache server does absolutely nothing. You have to tell Apache to hand some (or all) of the requests to one or more Perl handlers at the appropriate phase, by placing the appropriate configuration directives into httpd.conf and various .htaccess files (if so enabled). Some of these directives control the state of the entire embedded Perl interpreter. For example:

  PerlWarn On

enables the equivalent of the -w command-line flag, turning on warnings for all processed code. And

  PerlTaintCheck On

is like -T on the perl command-line, enabling taint checking. Because these two flags affect all code for this embedded Perl interpreter, they should be used cautiously.

When we place

  PerlModule My::Module

into a configuration file, Perl acts as if we said:

  require My::Module;

pulling in the module according to the @INC path (more on that in a moment). And

  PerlRequire /some/path/foo.pl

is like

  require "/some/path/foo.pl";

allowing us to pull in arbitrary code. Note that this code can set @INC or execute anything we desire. Also be aware that if the server is running as root to be able to bind to a low-numbered port, this code is also executed as root. Caveat executor!
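As a minimal sketch of what such a required file might contain (the library path and the choice of preloaded modules here are only placeholders, not anything mod_perl demands):

```perl
# Hypothetical contents of /some/path/foo.pl, pulled in via PerlRequire.
use strict;
use warnings;

# Add a private library directory to the front of @INC.
use lib "/my/place";

# Preload a few modules at server startup, so they are compiled once
# rather than on first use in each child. These two are just examples.
use POSIX ();
use Data::Dumper ();

# Like any file loaded with require, this one must return a true value.
1;
```

The trailing true value matters: a required file that ends with a false value makes the require fail.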

If you don't want to create a separate file of Perl code, you can also embed the code directly into the configuration file:

  <perl>
  use lib "/my/place";
  $ENV{FOO} = "bar";
  </perl>

This Perl code is executed at the time the server is started (or restarted), again with root privileges if available.

If the code in one of these ``perl'' sections sets either of the variables $PerlConfig or @PerlConfig, these variables are then interpreted as if they were lines in the configuration file. Thus, Perl code can generate configuration directives on the fly. For example, to set the listening port dynamically based on the presence at server startup of an environment variable named TESTING, we could use:

  <perl>
  my $port = $ENV{TESTING} ? 8080 : 80;
  $PerlConfig = "Port $port\n";
  </perl>

If TESTING is set, we get Port 8080 as a directive. If not, we get Port 80. One cool use I've heard of this feature is configuring a series of virtual hosts based on reading a database with DBI.
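In the same spirit as that virtual-host idea, here is a rough sketch of building directives from data. The host list, the vhost_config name, and the exact directive layout are all made up for illustration. A real setup would presumably fetch the rows with DBI and assign the resulting string to $PerlConfig inside a ``perl'' section rather than printing it.

```perl
use strict;
use warnings;

# Hypothetical host list; a real version might read these rows from a database.
my @hosts = (
    { name => 'www.example.com',  root => '/var/www/example' },
    { name => 'test.example.com', root => '/var/www/test' },
);

# Turn a list of host records into a block of configuration text.
sub vhost_config {
    my @sites = @_;
    my $config = '';
    for my $site (@sites) {
        $config .= "<VirtualHost *>\n"
                 . "  ServerName $site->{name}\n"
                 . "  DocumentRoot $site->{root}\n"
                 . "</VirtualHost>\n";
    }
    return $config;
}

# Inside a perl section this string would be assigned to $PerlConfig;
# here it is simply printed so the result can be inspected.
print vhost_config(@hosts);
```

Keeping the generation in a plain subroutine means the output can be checked from the command line before the server ever sees it.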

Associating Perl code with content delivery is rather straightforward. Within an .htaccess file, or a Location or Files or Directory section of some configuration file, we simply add both a SetHandler and a PerlHandler directive.

One of the most common content handlers is the Apache::Registry handler, which takes a file (typically an entire CGI.pm-based Perl program), turns it into a Perl subroutine, then caches that subroutine to provide dynamic content for that URL. For example:

  <Location /perl>
  SetHandler perl-script
  PerlHandler Apache::Registry
  </Location>

causes all scripts located below the perl directory within the document root to be treated with Apache::Registry. Now, when we visit /perl/myproggy, the file myproggy is turned into a Perl subroutine, and executed in a manner similar to a CGI script. However, we do this without forking, and we cache the resulting subroutine in memory. On the next hit to the same Apache process, we've already parsed the file, and things move much quicker. If the file changes, Apache::Registry reparses the file on the next hit transparently.
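The compile-once, reuse-many-times idea can be pictured with a small standalone sketch. This is not Apache::Registry's actual code, just an illustration of the technique: the compiled_script name and the %cache layout are assumptions of mine, and the real module does considerably more (such as giving each script its own package).

```perl
use strict;
use warnings;

# Cache of compiled scripts: filename => { mtime => ..., code => ... }
my %cache;

# Return a code reference for the script in $file, recompiling only
# when the file is new to us or its modification time has changed.
sub compiled_script {
    my ($file) = @_;
    my $mtime = (stat $file)[9];
    die "cannot stat $file: $!" unless defined $mtime;

    my $entry = $cache{$file};
    if (!$entry || $entry->{mtime} != $mtime) {
        open my $fh, '<', $file or die "cannot open $file: $!";
        my $source = do { local $/; <$fh> };   # slurp the whole script
        close $fh;

        # Wrap the script text in an anonymous subroutine and compile it.
        my $code = eval "sub {\n$source\n}";
        die "cannot compile $file: $@" unless $code;

        $entry = $cache{$file} = { mtime => $mtime, code => $code };
    }
    return $entry->{code};
}
```

Calling compiled_script($path)->() twice in a row parses the file only once; touching the file forces a fresh compile on the next call.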

We can also get the same behavior based on a file extension rather than a particular location using:

  <Files *.pl>
  SetHandler perl-script
  PerlHandler Apache::Registry
  </Files>

However, I do not recommend the use of extensions as a trigger, as it gives away the implementation technology too easily, inviting a possible security exploit.

Because the subroutine created from the script is used repeatedly, we have to ensure that the code works well when reused. Package variables are not reinitialized on every hit, for example, nor are open filehandles automatically closed and reopened. Most of the common traps are documented in the cgi_to_mod_perl and mod_perl_traps manpages, included with the mod_perl distribution.
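The package-variable trap is easy to reproduce outside of Apache. In this small sketch, script_body is a made-up stand-in for the subroutine that a script gets turned into, and each call plays the role of one hit:

```perl
use strict;
use warnings;

our $counter;        # package variable: keeps its value between calls

sub script_body {
    $counter++;      # never reset, so it grows with every "hit"
    my $fresh = 0;   # lexical variable: reinitialized on every call
    $fresh++;
    return "counter=$counter fresh=$fresh";
}

# Simulate three hits on the same cached subroutine.
print script_body(), "\n" for 1 .. 3;
```

This prints counter=1 fresh=1, then counter=2 fresh=1, then counter=3 fresh=1. Under plain CGI both values would be 1 on every hit, because each hit gets a brand new process.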

The caching of the scripts by Apache::Registry is a nice feature. If the file changes, the new script is automatically sucked in. However, once a Perl interpreter has loaded a require file or module, that file is marked as having been already loaded, and is never examined again. In a development environment, this can be frustrating, because you might be updating a module during testing, and yet only some of your Apache processes will be loading the new code, while others hang on to your old code. One workaround is the use of Apache::StatINC. Adding:

  PerlInitHandler Apache::StatINC

to your configuration file causes Perl during the initial phase of each request to walk through the %INC hash (containing the already loaded require files and modules) to see if any files have been updated since they were last loaded. If so, they are flushed and reloaded as needed. While this module is great during development, you should not use it during production, as it adds a number of additional filesystem system calls on each hit.
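To make the %INC walk concrete, here is a rough standalone sketch of the idea. It is not the source of Apache::StatINC; the reload_changed_modules name and the %last_seen bookkeeping are my own assumptions for illustration.

```perl
use strict;
use warnings;

# Modification time of each loaded file when we last looked at it,
# keyed the same way as %INC (for example "My/Module.pm").
my %last_seen;

# Walk %INC, reload anything whose file is newer than when we last
# saw it, and return the list of keys that were reloaded.
sub reload_changed_modules {
    my @reloaded;
    for my $key (sort keys %INC) {
        my $path = $INC{$key};
        next unless defined $path && -f $path;
        my $mtime = (stat _)[9];

        # First sighting: just remember the timestamp.
        if (!exists $last_seen{$key}) {
            $last_seen{$key} = $mtime;
            next;
        }
        next if $mtime <= $last_seen{$key};

        # Forget that the file was loaded, then load it again.
        # Reloading may trigger "Subroutine redefined" warnings.
        delete $INC{$key};
        require $key;
        $last_seen{$key} = $mtime;
        push @reloaded, $key;
    }
    return @reloaded;
}
```

Every call performs one stat per loaded file, which is exactly the per-hit filesystem cost that makes this approach a poor fit for production.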

Another core module that can be useful to determine the state of things during development is Apache::Status, enabled like:

  <Location /perl-status>
  SetHandler  perl-script
  PerlHandler Apache::Status
  </Location>

Now, when you visit /perl-status, you'll get some status information about that particular embedded Perl process, including loaded modules, environment variables, and so on. Again, this is too much information for a would-be intruder during production, so be sure to enable this only on development machines.

Another core module is Apache::Resource, which can be used to limit the resources used by a child server process. For example:

  PerlSetEnv PERL_RLIMIT_CPU 120
  PerlChildInitHandler Apache::Resource

If a child process now takes more than 120 CPU seconds, it is aborted immediately. This is a hard abort, returning a 500-type error to the client, but at least you won't have a runaway Apache process. Also, note that this is not per-request, but rather per-child, so you'll want to set the appropriate MaxRequestsPerChild to a low enough number so as not to trigger this limit in normal execution.

I've run out of space for this month's article, so next month, I'll continue this introduction to mod_perl, including the complete API from Perl back into Apache, and some nice CPAN modules as well. Until then, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
