Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Download this listing! | ||
Linux Magazine Column 75 (Nov 2005)
[Suggested title: ``Profiling in Template Toolkit via overriding'']
I've talked a lot about Andy Wardley's Template Toolkit in this column
over the past year or so. This isn't surprising, since I've been
doing quite a bit of work lately for my clients (such as
www.geekcruises.com
) building web sites using Template Toolkit. Of
course, ``work'' means ``solving problems'', and some of the problems
having me a bit stumped until that ``a-ha!'' moment comes along.
For example, I recently completed a major phase of the deliverable web site, at least in terms of functionality. But as Neil (the customer) started playing with it, one of his first comments was ``this part is too slow, and I spend all my time on this screen''. That wasn't a show-stopping comment, because I had deliberately chosen to program the application with flexibility and ease-of-implementation first, knowing that I'd need to make a performance pass later to optimize anything that was too slow. Failing to do so would have put me in the realm of ``premature optimization'': making decisions about performance without actually understanding where the bottlenecks might eventually be.
The next step to keep Neil paying my paycheck was to make that
particular operation faster. ``No problem'', I said to Neil, ``because
I'll just trot out Devel::DProf
to find the hotspots and slow
spots, and optimize those!'' But, alas, adding Devel::DProf
caused
the program to core dump. Ugh. Even moving the code to a different
architecture (from my MacOSX development platform to the OpenBSD
deployment platform) still caused it to coredump. This was not
looking good.
After fretting a bit, and asking online in all the usual places for
some help, I decided that I still needed to know where the program was
spending its time. I suspected that 90% or more of the time was in
the ``rendering'' phase of CGI::Prototype
, so I could accept a set of
timings on a per-template basis as a first-order diagnostic tool.
Template Toolkit doesn't have any profiling tools ``out of the
box''. But this is Perl, and thus the code is open. Also, Andy
designed Template to be easily subclassed to alter portions of the
behavior. I spent about an hour tracing the flow of a template being
processed, and decided that my per-template statistics could be
obtained by noting the execution time of each invocation of the
process
method of Template::Context
. The first parameter to
process
seemed to be the same kind of value as was handed in as the
first parameter of the process
method of the top-level Template
object, or the template handed to the PROCESS or INCLUDE or WRAPPER
Template directives. Thus, I could use that value to aggregate results
over different invocations of the same template.
My strategy was simple. I'd create a subclass of
Template::Context
, called My::Template::Context
. I'd override
the process
method to ``start the timer'' by noting the wall clock
time (via time
), and the four CPU times (user CPU, system CPU,
child user CPU, and child system CPU) accumulated thus far in the
process (via times
). Then I'd call the ``real'' process
method to
do the actual work, capturing its return value. When the method
returned, I'd subtract the initial timer values from the new values,
and that would give me the amount of time spent processing this
template. Finally, I'd tell Template
to use my new class instead
of the original class by setting $Template::Config::CONTEXT
to my
class name (another nice configuration hook provided by Andy).
Or at least it looked like it'd be simple. But then I realized that
what I really wanted was time spent only in that template, not in
all the templates that it calls. This is important to me because I'm
also interested in the immediate cost of all the callbacks made by a
template, which I could measure only indirectly. (To measure those
more directly, I would have had to perform a similar trick with
Template::Stash
, and that looked even hairier than this.)
So, I decided that I'd simply maintain a stack of calls into
process
, and on each return, compute the deltas for this call, but
also subtract that value from each of the parent calls. After all, if
the time was being charged to this invocation, it's no longer
affecting the parent. To do that, I push forward the starting time
counters of every parent call, pretending they started later. It all
works out in the end, although I got the math wrong at least twice
while I was hacking.
And the result of that is in [listing one, below], which I will now describe in my traditional manner.
Line 1 defines the beginning of my module. Obviously, in a
mod_perl
enviroment, the name My::
is not a good choice. But
we're still in CGI at this point, so I didn't have trouble with this
for now.
Line 2 both pulls in the Template::Context
module, and declares the
current module to be a derived class by setting @ISA
automatically.
Line 4 declares the @stack
of times for each nested invocation.
Initially, the empty stack shows that we're at the top level. As each
new invocation is entered, a new arrayref is pushed onto the stack.
When the stack again becomes empty, we're at the top-level, and it's
time to dump the results (simply to STDERR
and thus into my web
log).
Line 5 accumulates the results. The key is the template name, with the value being an arrayref of the accumulated sums of the times. This hash is reset as each top-level is reached.
Lines 7 to 47 define the process
method that we're overriding
from Template::Context
.
Line 8 pulls out the $self
object from the arguments.
Lines 10 to 13 attempt to figure out the name for this template. The
first parameter to process
can either be a template name, or a
Template::Document
if it's a BLOCK inside another template, or even
a coderef. Luckily, I didn't have any coderefs in my application, but
I still had to do something sensible with BLOCK values. So, line
11 sees if this is an object that inherits from Template::Document
.
If it does, then we can ask it its name with the name
method.
Another way that might have worked just as well would be:
my $template = $_[0]; eval { $template = $template->name };
This simpler version tries the name
method, and if it doesn't work,
leaves the template name alone. Again, I coded what worked first, and
moved on.
Line 15 captures the initial wall clock time and the CPU times into a 5 element list, then pushes that as an arrayref onto the stack. This starts our timer.
Lines 17 to 19 call SUPER::process
on the arguments to do the job.
However, I was unsure of the invocation context of process
, so I
had to be careful to pass that same context on to the super call. If
we're in a list context, my @return ...
is already good. If not,
then the scalar
prefix ensures the proper context. I'm not paying
attention to the void context here, as it would have complicated the
code even further.
Now that we've done the real work, it's time to figure out how long
that took. Line 21 pops the stack, retrieving the five starting
values. These might have been altered by nested calls to process
:
we'll get to that in a moment.
Line 22 subtracts the old values from the current values, using
a crazy little idiom I invented a while back. The map executes the
block five times. Each value is subtracted from the next element
shifted from @delta_times
, creating the delta value. But the
map
accumulates these five deltas, which then get assigned... right
back into @delta_times. OK, slightly obfuscated, admittedly. But
I've used it enough times now that it begins to feel automatically
like the right thing to reach for.
Lines 23 through 28 process each of the five deltas (wall clock, and
the four CPU times). First, line 24 adds them into the appropriate
accumulator, keyed by the name in $template
. Then, the loop from
line 25 to 27 pushes each of the ``start time'' counters forward by that
same amount in the stack, thus discounting that amount of time from
those accumulations. Trickiest part of the code, I'd say. But now
that I got it, it looks straightforward.
Line 29 accumulates a count of the calls, just so I can see if something was called once or a hundred times. In general, you worry about total CPU time used, not the per-usage CPU time, but having the call count might be useful to understand ``big O'' scaling issues to determine an asymptotic upper bound.
If we're at the top of the stack, it's time to show how much the hit
costs us. Lines 31 to 43 dump out the totals. The dump goes out to
STDERR
so that I see it on my web error log, and I'm using
printf
to try to make it at least slightly pretty. Line 42 clears
out the totals (the stack is already clear) so that a new invocation
starts it all over again.
Line 46 returns an appropriate value to the caller: if we're in a
scalar context, only $return[0]
has the proper value. Otherwise,
we use the entire list.
Finally, line 49 tells Template
that this is the kind of
Template::Context
that we want in our program. Line 51 provides
the traditional true value for the require
embedded in the invoking
use
.
Speaking of which, using this module is now as simple as putting a few extra lines at the top of my CGI script:
#!/usr/bin/perl
use lib "/my/local/lib"; # location of this module use My::Template::Context; # turn on logging
use My::App; # CGI::Prototype My::App->activate;
Now, as this CGI program is invoked, I get extra lines in my web log showing exactly where my time is being spent. Yeay!
Within a few hours, I was able to cut the CPU time by over 70%, and Neil was happy, and continues to employ me to this day. Happy endings. Until next time, enjoy!
Listing
=1= package My::Template::Context; =2= use base qw(Template::Context); =3= =4= my @stack; =5= my %totals; =6= =7= sub process { =8= my $self = shift; =9= =10= my $template = $_[0]; =11= if (UNIVERSAL::isa($template, "Template::Document")) { =12= $template = $template->name || $template; =13= } =14= =15= push @stack, [time, times]; =16= =17= my @return = wantarray ? =18= $self->SUPER::process(@_) : =19= scalar $self->SUPER::process(@_); =20= =21= my @delta_times = @{pop @stack}; =22= @delta_times = map { $_ - shift @delta_times } time, times; =23= for (0..$#delta_times) { =24= $totals{$template}[$_] += $delta_times[$_]; =25= for my $parent (@stack) { =26= $parent->[$_] += $delta_times[$_] if @stack; # parent adjust =27= } =28= } =29= $totals{$template}[5] ++; # count of calls =30= =31= unless (@stack) { =32= ## top level again, time to display results =33= print STDERR "-- $template at ". localtime, ":\n"; =34= printf STDERR "%3s %3s %6s %6s %6s %6s %s\n", =35= qw(cnt clk user sys cuser csys template); =36= for my $template (sort keys %totals) { =37= my @values = @{$totals{$template}}; =38= printf STDERR "%3d %3d %6.2f %6.2f %6.2f %6.2f %s\n", =39= $values[5], @values[0..4], $template; =40= } =41= print STDERR "-- end\n"; =42= %totals = (); # clear out results =43= } =44= =45= # return value from process: =46= wantarray ? @return : $return[0]; =47= } =48= =49= $Template::Config::CONTEXT = __PACKAGE__; =50= =51= 1;