Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Linux Magazine Column 75 (Nov 2005)

[Suggested title: ``Profiling in Template Toolkit via overriding'']

I've talked a lot about Andy Wardley's Template Toolkit in this column over the past year or so. This isn't surprising, since I've been doing quite a bit of work lately for my clients (such as www.geekcruises.com) building web sites using Template Toolkit. Of course, ``work'' means ``solving problems'', and some of the problems having me a bit stumped until that ``a-ha!'' moment comes along.

For example, I recently completed a major phase of the deliverable web site, at least in terms of functionality. But as Neil (the customer) started playing with it, one of his first comments was ``this part is too slow, and I spend all my time on this screen''. That wasn't a show-stopping comment, because I had deliberately chosen to program the application with flexibility and ease-of-implementation first, knowing that I'd need to make a performance pass later to optimize anything that was too slow. Failing to do so would have put me in the realm of ``premature optimization'': making decisions about performance without actually understanding where the bottlenecks might eventually be.

The next step to keep Neil paying my paycheck was to make that particular operation faster. ``No problem'', I said to Neil, ``because I'll just trot out Devel::DProf to find the hotspots and slow spots, and optimize those!'' But, alas, adding Devel::DProf caused the program to core dump. Ugh. Even moving the code to a different architecture (from my MacOSX development platform to the OpenBSD deployment platform) still caused it to coredump. This was not looking good.

After fretting a bit, and asking online in all the usual places for some help, I decided that I still needed to know where the program was spending its time. I suspected that 90% or more of the time was in the ``rendering'' phase of CGI::Prototype, so I could accept a set of timings on a per-template basis as a first-order diagnostic tool.

Template Toolkit doesn't have any profiling tools ``out of the box''. But this is Perl, and thus the code is open. Also, Andy designed Template to be easily subclassed to alter portions of the behavior. I spent about an hour tracing the flow of a template being processed, and decided that my per-template statistics could be obtained by noting the execution time of each invocation of the process method of Template::Context. The first parameter to process seemed to be the same kind of value as was handed in as the first parameter of the process method of the top-level Template object, or the template handed to the PROCESS or INCLUDE or WRAPPER Template directives. Thus, I could use that value to aggregate results over different invocations of the same template.

My strategy was simple. I'd create a subclass of Template::Context, called My::Template::Context. I'd override the process method to ``start the timer'' by noting the wall clock time (via time), and the four CPU times (user CPU, system CPU, child user CPU, and child system CPU) accumulated thus far in the process (via times). Then I'd call the ``real'' process method to do the actual work, capturing its return value. When the method returned, I'd subtract the initial timer values from the new values, and that would give me the amount of time spent processing this template. Finally, I'd tell Template to use my new class instead of the original class by setting $Template::Config::CONTEXT to my class name (another nice configuration hook provided by Andy).

Or at least it looked like it'd be simple. But then I realized that what I really wanted was time spent only in that template, not in all the templates that it calls. This is important to me because I'm also interested in the immediate cost of all the callbacks made by a template, which I could measure only indirectly. (To measure those more directly, I would have had to perform a similar trick with Template::Stash, and that looked even hairier than this.)

So, I decided that I'd simply maintain a stack of calls into process, and on each return, compute the deltas for this call, but also subtract that value from each of the parent calls. After all, if the time was being charged to this invocation, it's no longer affecting the parent. To do that, I push forward the starting time counters of every parent call, pretending they started later. It all works out in the end, although I got the math wrong at least twice while I was hacking.

And the result of that is in [listing one, below], which I will now describe in my traditional manner.

Line 1 defines the beginning of my module. Obviously, in a mod_perl enviroment, the name My:: is not a good choice. But we're still in CGI at this point, so I didn't have trouble with this for now.

Line 2 both pulls in the Template::Context module, and declares the current module to be a derived class by setting @ISA automatically.

Line 4 declares the @stack of times for each nested invocation. Initially, the empty stack shows that we're at the top level. As each new invocation is entered, a new arrayref is pushed onto the stack. When the stack again becomes empty, we're at the top-level, and it's time to dump the results (simply to STDERR and thus into my web log).

Line 5 accumulates the results. The key is the template name, with the value being an arrayref of the accumulated sums of the times. This hash is reset as each top-level is reached.

Lines 7 to 47 define the process method that we're overriding from Template::Context.

Line 8 pulls out the $self object from the arguments.

Lines 10 to 13 attempt to figure out the name for this template. The first parameter to process can either be a template name, or a Template::Document if it's a BLOCK inside another template, or even a coderef. Luckily, I didn't have any coderefs in my application, but I still had to do something sensible with BLOCK values. So, line 11 sees if this is an object that inherits from Template::Document. If it does, then we can ask it its name with the name method. Another way that might have worked just as well would be:

  my $template = $_[0];
  eval { $template = $template->name };

This simpler version tries the name method, and if it doesn't work, leaves the template name alone. Again, I coded what worked first, and moved on.

Line 15 captures the initial wall clock time and the CPU times into a 5 element list, then pushes that as an arrayref onto the stack. This starts our timer.

Lines 17 to 19 call SUPER::process on the arguments to do the job. However, I was unsure of the invocation context of process, so I had to be careful to pass that same context on to the super call. If we're in a list context, my @return ... is already good. If not, then the scalar prefix ensures the proper context. I'm not paying attention to the void context here, as it would have complicated the code even further.

Now that we've done the real work, it's time to figure out how long that took. Line 21 pops the stack, retrieving the five starting values. These might have been altered by nested calls to process: we'll get to that in a moment.

Line 22 subtracts the old values from the current values, using a crazy little idiom I invented a while back. The map executes the block five times. Each value is subtracted from the next element shifted from @delta_times, creating the delta value. But the map accumulates these five deltas, which then get assigned... right back into @delta_times. OK, slightly obfuscated, admittedly. But I've used it enough times now that it begins to feel automatically like the right thing to reach for.

Lines 23 through 28 process each of the five deltas (wall clock, and the four CPU times). First, line 24 adds them into the appropriate accumulator, keyed by the name in $template. Then, the loop from line 25 to 27 pushes each of the ``start time'' counters forward by that same amount in the stack, thus discounting that amount of time from those accumulations. Trickiest part of the code, I'd say. But now that I got it, it looks straightforward.

Line 29 accumulates a count of the calls, just so I can see if something was called once or a hundred times. In general, you worry about total CPU time used, not the per-usage CPU time, but having the call count might be useful to understand ``big O'' scaling issues to determine an asymptotic upper bound.

If we're at the top of the stack, it's time to show how much the hit costs us. Lines 31 to 43 dump out the totals. The dump goes out to STDERR so that I see it on my web error log, and I'm using printf to try to make it at least slightly pretty. Line 42 clears out the totals (the stack is already clear) so that a new invocation starts it all over again.

Line 46 returns an appropriate value to the caller: if we're in a scalar context, only $return[0] has the proper value. Otherwise, we use the entire list.

Finally, line 49 tells Template that this is the kind of Template::Context that we want in our program. Line 51 provides the traditional true value for the require embedded in the invoking use.

Speaking of which, using this module is now as simple as putting a few extra lines at the top of my CGI script:

  #!/usr/bin/perl
  use lib "/my/local/lib"; # location of this module
  use My::Template::Context; # turn on logging
  use My::App; # CGI::Prototype
  My::App->activate;

Now, as this CGI program is invoked, I get extra lines in my web log showing exactly where my time is being spent. Yeay!

Within a few hours, I was able to cut the CPU time by over 70%, and Neil was happy, and continues to employ me to this day. Happy endings. Until next time, enjoy!

Listing

        =1=     package My::Template::Context;
        =2=     use base qw(Template::Context);
        =3=     
        =4=     my @stack;
        =5=     my %totals;
        =6=     
        =7=     sub process {
        =8=       my $self = shift;
        =9=     
        =10=      my $template = $_[0];
        =11=      if (UNIVERSAL::isa($template, "Template::Document")) {
        =12=        $template = $template->name || $template;
        =13=      }
        =14=    
        =15=      push @stack, [time, times];
        =16=    
        =17=      my @return = wantarray ?
        =18=        $self->SUPER::process(@_) :
        =19=          scalar $self->SUPER::process(@_);
        =20=    
        =21=      my @delta_times = @{pop @stack};
        =22=      @delta_times = map { $_ - shift @delta_times } time, times;
        =23=      for (0..$#delta_times) {
        =24=        $totals{$template}[$_] += $delta_times[$_];
        =25=        for my $parent (@stack) {
        =26=          $parent->[$_] += $delta_times[$_] if @stack; # parent adjust
        =27=        }
        =28=      }
        =29=      $totals{$template}[5] ++;     # count of calls
        =30=    
        =31=      unless (@stack) {
        =32=        ## top level again, time to display results
        =33=        print STDERR "-- $template at ". localtime, ":\n";
        =34=        printf STDERR "%3s %3s %6s %6s %6s %6s %s\n",
        =35=          qw(cnt clk user sys cuser csys template);
        =36=        for my $template (sort keys %totals) {
        =37=          my @values = @{$totals{$template}};
        =38=          printf STDERR "%3d %3d %6.2f %6.2f %6.2f %6.2f %s\n",
        =39=            $values[5], @values[0..4], $template;
        =40=        }
        =41=        print STDERR "-- end\n";
        =42=        %totals = ();               # clear out results
        =43=      }
        =44=    
        =45=      # return value from process:
        =46=      wantarray ? @return : $return[0];
        =47=    }
        =48=    
        =49=    $Template::Config::CONTEXT = __PACKAGE__;
        =50=    
        =51=    1;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.