Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Linux Magazine Column 54 (Dec 2003)
[suggested title: ``Checking your website's health, part 2'']
In last month's column, I showed how to create a testing program based on Perl's own testing framework together with WWW::Mechanize to test the health of a website. For reference, I've reproduced the entire code developed in last month's article in [listing one, below]. The test code is run on demand to verify the proper operation of our target website: in this case, search.cpan.org. The output looks something like:
  1..9
  ok 1 - The object isa WWW::Mechanize
  ok 2 - fetched /
  ok 3 - / title matches
  ok 4 - follow FAQ link
  ok 5 - fetched FAQ page
  ok 6 - FAQ content matches
  ok 7 - select query form
  ok 8 - query returned good for 'author'
  ok 9 - found Andy Lester
While invoking this program directly certainly gives us immediate status, it'd be more useful to run this program automatically and frequently. For example, I could invoke this program every five minutes from cron, and then mail the results to my cell phone or pager. However, because there's a lot of output even when everything is OK, I'll be getting a lot of useless interruptions just to say ``everything is OK''.
Running the program under the standard Test::Harness module helps a bit. This module interprets the ok and not ok values appropriately, providing a nice summary at the end, resulting in output something like:
  01-search-cpan....ok
  All tests successful.
  Files=1, Tests=9,  2 wallclock secs ( 0.53 cusr +  0.05 csys =  0.58 CPU)
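If you just want that summary on demand, a tiny driver script is enough. This is a minimal sketch, assuming the test file from [listing one] has been saved as 01-search-cpan.t (the name implied by the output above):

  #!/usr/bin/perl
  # minimal on-demand driver: run one test file under Test::Harness
  use Test::Harness qw(runtests);
  runtests("01-search-cpan.t");

Recent Test::Harness releases also install a prove command-line utility that does much the same thing without a driver script.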
However, it's still hard to reduce the text to pinpoint just the errors, or to know at a glance whether the run was a success, a partial failure, or a total failure. Also, one thing I hate is being told the same thing over and over again when a failure occurs, or not being told when something has cleared up. And, the text doesn't squish well into a nice SMS message for my phone or pager.
So, let's take this one step further. The Test::Harness module inherits its core functionality from Test::Harness::Straps, which is still in development. We can use a Test::Harness::Straps object to invoke the test script and interpret its output in a way that is consistent with the Test::Harness interpretation, and programmatically determine which tests failed.
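To make that concrete, here's a rough sketch of the Straps interface in isolation, again assuming the test file is named 01-search-cpan.t; the details and wait entries shown here are the only parts of the result hash the full program below relies on:

  #!/usr/bin/perl
  # sketch: analyze one test file and report each test's pass/fail status
  use strict;
  use Test::Harness::Straps ();
  my $strap   = Test::Harness::Straps->new;
  my %results = $strap->analyze_file("01-search-cpan.t");
  for my $i (0 .. $#{$results{details}}) {
    my $detail = $results{details}[$i];
    printf "test %d %s: %s\n",
      $i + 1, ($detail->{ok} ? "ok" : "NOT ok"), $detail->{name};
  }
  print "child wait status was $results{wait}\n" if $results{wait};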
If we have that, we can tailor the output. One strategy might be to test every five minutes (from cron), but page a message only when things are broken, and then only once every thirty minutes. This message can be cut down to just precisely the failing tests, and perhaps the associated error output and exit status of the test program. Once the error clears up, the program can page on the next round with a single ``all clear'' signal so that we can turn back around and head home again instead of finishing our trek into the office in the middle of the night to fix the problem.
Of course, there's more going on than just ``all OK'', and ``something broken''. We can consider each different combination of ``something broken'' to be a different thing worthy of paging. Let's ensure that only one page per unique combination of events gets sent, and one all clear signal only when everything clears up.
Sounds difficult? Not at all, especially when we use Cache::FileCache as a lightweight time-oriented database. The resulting cron-job program is shown in [listing two, below].
Lines 1 through 3 start nearly every program I write, turning on warnings and compiler restrictions, and disabling the buffering of STDOUT.
Lines 5 to 16 define the configuration section: things I'd be likely to change and want to locate quickly. The two time constants are defined in Cache::Cache-compatible units, which understands things like ``15 seconds'' or ``4 hours''. See the documentation for details. The $ALL_CLEAR_INTERVAL defines how often a repeat page saying ``everything is OK'' gets sent. If you set this to ``1 day'', you'll get a single page a day saying everything is OK as a nice meta-check that your health-check is OK. By setting it to never, you get one page when the monitoring starts the very first time, but never again unless it's after a failure has been fixed. Similarly, $TEST_FAIL_INTERVAL defines how often a page is sent for an identical combination of failures.
Lines 10 to 14 define the callback subroutine of what to do when an event of significance occurs. If this subroutine is called with no parameters, then it's the ``all clear'' signal. Otherwise, it's the current error text. For debugging, I'm simply displaying this text to STDOUT, but by reopening standard output to a pipe to sendmail, I could just as easily send this to my cell phone or pager, presuming there's a mail gateway.
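For example, here's one way SEND_REPORT might hand the text to a local mailer instead of printing it; the recipient address and the sendmail path are placeholders, so adjust both for your own setup:

  sub SEND_REPORT {                 # hypothetical mailing variant
    @_ = "ALL CLEAR\n" unless @_;
    open my $mail, '| /usr/sbin/sendmail pager@example.com'
      or die "sendmail: $!";
    print $mail "Subject: website health\n\n", @_;
    close $mail or warn "sendmail exited with status $?";
  }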
Lines 18 to 21 pull in the four modules needed by the rest of the program. Three of the four modules come with all recent Perl distributions, although you might have to upgrade your Test::Harness from the CPAN if you get an error because of Test::Harness::Straps. The fourth is Cache::FileCache, part of the Cache::Cache distribution.
Lines 23 and 24 create a temporary file and associated filehandle to let me capture the STDERR output from the various test programs being run (such as the one in [listing one]). The error output is usually diagnostic explanations, and often elaborates on the reasons for failure. We also save the current STDERR so that we can reset it after every child process, so that our own die and warn messages end up in the right place.
Line 26 sets up the Cache::FileCache object, giving us a memory between invocations.
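If you haven't used Cache::FileCache before, the whole trick is that entries written by one run are visible to the next, and each entry can be given an expiration time. Here's a minimal sketch (the namespace and key are arbitrary examples):

  #!/usr/bin/perl
  # sketch: remember across invocations that we already sent a page
  use strict;
  use Cache::FileCache ();
  my $cache = Cache::FileCache->new({namespace => 'demo'});
  if ($cache->get("already_paged")) {
    print "seen this recently, staying quiet\n";
  } else {
    $cache->set("already_paged", 1, "30 minutes");  # expires automatically
    print "first time (or interval expired), would page now\n";
  }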
Line 28 ensures that our current directory is the same as the directory containing the running script. This permits the paging program to be called as an absolute path in cron without having to manage the location of the test scripts, as long as they're all nearby.
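A crontab entry along these lines is all the scheduling we need (the path is only an example of wherever you keep the reporter and its *.t files):

  # run the health-check reporter every five minutes (path is an example)
  */5 * * * * /home/merlyn/healthcheck/reporter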
Line 30 creates the Test::Harness::Straps object to interpret the result of one or more testing program invocations.
Line 32 collects together the failure information that will eventually decide what we report. The array will contain arrayrefs: each referenced array has two elements. The first element is an identifying (hopefully unique) key of a failing condition, and the second is its associated text. We'll see how this gets created and reported later.
Lines 34 to 46 loop over every test file in either the current directory, or a subdirectory named t, similar to the normal Test::Harness module-installation operation. Each of these tests is run separately, although the results will all be gathered for a single page. Instead of limiting the location of these *.t files to the current directory and one subdirectory, I might also want to use File::Find or a configuration file to define which tests are run.
Line 35 defines %results, with the same meaning that the Test::Harness::Straps documentation gives that variable.
Lines 36 to 40 run a specific test file, using the Test::Harness::Straps object. Because we want to capture the STDERR output from each invocation, we must first redirect our own STDERR into the temporary file, then call the testing harness, then restore it back. It's a bit messy, but necessary. Perhaps future versions of Test::Harness::Straps will provide a hook to do this directly.
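If the dup-and-restore dance looks unfamiliar, here's the same pattern in a standalone sketch, with a warn standing in for a child's diagnostic output:

  #!/usr/bin/perl
  # sketch: capture STDERR in a temp file, then put it back
  use strict;
  use File::Temp qw(tempfile);
  my $errors = tempfile();
  open SAVE_STDERR, ">&STDERR" or warn "saving STDERR: $!";
  open STDERR, ">&", $errors or warn "redirecting STDERR: $!";
  warn "this line lands in the temporary file\n";   # stands in for child output
  open STDERR, ">&", \*SAVE_STDERR or warn "restoring STDERR: $!";
  warn "this line goes to the real STDERR again\n";
  seek $errors, 0, 0;
  print "captured: ", <$errors>;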
Once we have the results, we're concerned with two things. First, did any of the tests fail? And second, did the test child exit in some unusual manner?
To see if any of the tests failed, we look at $results{details}, which references an array of individual tests and their results. Within each element of the array, the ok element of the hashref will be true if the test succeeded. If that's true, we simply ignore the test. If it's false, we'll add to @failed a new arrayref that contains an identification name for the test (offset by one because element zero of the array is test one) and the name the test gives itself (usually the text after the comment mark). This is all handled nicely by the map in lines 41 to 44.
If the child exited badly, we'll add another element with the wait status to @failed in line 45.
Lines 48 to 52 look at the standard error output that has accumulated from running all of the tests. If any output exists, it's gathered into another @failed entry keyed by errors.
When we get to line 54 in the program, we've run all of our tests, possibly from many different test programs, and have the results in @failed. We next create a ``current problems key'' in $key, resulting from joining all of the error tags into a space-separated list. If this string is empty, everything went OK, but otherwise we'll end up with a list like "health.t:4 health.t:wait errors" showing that test 4 failed, the wait status was bad, and we also had some text on STDERR from one or more of the children.
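To make the shape of @failed concrete, here's roughly what it might contain for that hypothetical health.t failure (the test name, wait status, and captured text are all invented for illustration):

  @failed = (
    ["health.t:4"    => "follow FAQ link"],                  # test 4 failed
    ["health.t:wait" => 256],                                # child exited nonzero
    ["errors"        => "diagnostic text captured on STDERR\n"],
  );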
Based on this error key string, we now decide whether to page or not. In line 56, we'll distinguish between ``everything is OK'', and ``something is wrong''.
If something is wrong, we'll execute the code in lines 57 to 73. First, we'll remove any marker for ``everything is OK'' in the cache. This ensures that the next time everything is OK, we'll send a page to say so.
Line 58 sees if we have recently sent a page with this particular error combination. If so, the cache returns a true value, and we'll do nothing further in this invocation. (I've left the commented-out debugging print at line 59 so you can see where this happens.)
Otherwise, it's time to send a page. First, in line 61, we'll ensure that we don't duplicate this particular page within the $TEST_FAIL_INTERVAL time window.
Line 63 defines a holder for the report. Lines 65 to 70 process the @failed array, extracting out each key/value pair, and then prepending each line of the value with the key for careful labelling. Even if the value is empty, at least one line containing the key is generated.
Line 72 passes this report list into the SEND_REPORT callback, defined at the top of the program. This sends the appropriate report with just the broken pieces of the collective tests.
Lines 75 to 82 deal with an ``everything is OK'' run. First, if there's already been an ``all OK'' signal recently enough, there's nothing to do (again, noted in a commented-out debugging print in line 76). Otherwise, we throw away all the recently seen failure tags in line 78 by clearing out the entire cache, and then set a flag in line 79 to prevent an additional ``everything is OK'' message until the $ALL_CLEAR_INTERVAL has passed.
Line 81 passes an empty list to SEND_REPORT, a signal that it's time to send the all-clear message.
Although this simple test reporting tool doesn't have a lot of fancy features, it illustrates how the basics are accomplished and how the reporting can be kept to the essentials, and would probably work fine for a single system administrator on a typical personal or small-business site. If you want more, there are larger, more complex, and even commercial solutions for being notified when things go wrong. And as always, until next time, enjoy!
Listings
=0=     #### LISTING ONE ####
=1=     #!/usr/bin/perl
=2=     use Test::More tests => 9;
=3=     use WWW::Mechanize;
=4=     isa_ok(my $a = WWW::Mechanize->new, "WWW::Mechanize");
=5=
=6=     $a->timeout(1);
=7=     $a->get("http://search.cpan.org/");
=8=     is($a->status, 200, "fetched /");
=9=     like($a->title, qr/The CPAN Search Site/, "/ title matches");
=10=    SKIP: {
=11=      ok($a->follow_link( text => 'FAQ' ), "follow FAQ link")
=12=        or skip "missing FAQ link", 2;
=13=      SKIP: {
=14=        is($a->status, 200, "fetched FAQ page")
=15=          or skip "bad FAQ fetch", 1;
=16=        like($a->content, qr/Frequently Asked Questions/, "FAQ content matches");
=17=        $a->back;
=18=      }
=19=    }
=20=    SKIP: {
=21=      ok($a->form_number(1), "select query form")
=22=        or skip "cannot select query form", 2;
=23=      $a->set_fields(query => "PETDANCE", mode => 'author');
=24=      $a->click();
=25=      SKIP: {
=26=        is($a->status, 200, "query returned good for 'author'")
=27=          or skip "missing author page", 1;
=28=        like($a->content, qr/Andy Lester/, "found Andy Lester");
=29=        $a->back;
=30=      }
=31=    }
=0=     #### LISTING TWO ####
=1=     #!/usr/bin/perl -w
=2=     use strict;
=3=     $|++;
=4=
=5=     ## CONFIG
=6=
=7=     my $ALL_CLEAR_INTERVAL = "never"; # how often to repeat "all clear" signal
=8=     my $TEST_FAIL_INTERVAL = "30 minutes"; # how often to repeat test failed
=9=
=10=    sub SEND_REPORT {           # what do I do with a report?
=11=      ## open STDOUT, "|sendmail 5035551212\@vtext.com" or die "sendmail: $!";
=12=      @_ = "ALL CLEAR\n" unless @_;
=13=      print @_;
=14=    }
=15=
=16=    ## END CONFIG
=17=
=18=    use File::Temp qw(tempfile);    # core
=19=    use File::Basename qw(dirname); # core
=20=    use Test::Harness::Straps ();   # core
=21=    use Cache::FileCache ();        # CPAN
=22=
=23=    my $errors = tempfile();
=24=    open SAVE_STDERR, ">&STDERR" or warn "dup 2 to SAVE_STDERR: $!";
=25=
=26=    my $cache = Cache::FileCache->new({namespace => 'healthcheck_reporter'});
=27=
=28=    chdir dirname($0) or warn "Cannot chdir to dirname of $0: $!";
=29=
=30=    my $strap = Test::Harness::Straps->new;
=31=
=32=    my @failed;
=33=
=34=    for my $test_file (glob "*.t t/*.t") {
=35=      my %results;
=36=      {
=37=        open STDERR, ">&", $errors or print "dup $errors to STDERR: $!";
=38=        %results = $strap->analyze_file($test_file);
=39=        open STDERR, ">&", \*SAVE_STDERR or print "dup SAVE_STDERR TO STDERR: $!";
=40=      };
=41=      push @failed, map {
=42=        $results{details}[$_]{ok} ? () :
=43=          ["$test_file:".($_+1) => $results{details}[$_]{name}]
=44=      } 0..$#{$results{details}};
=45=      push @failed, ["$test_file:wait" => $results{wait}] if $results{wait};
=46=    }
=47=
=48=    if (-s $errors) {
=49=      seek $errors, 0, 0;
=50=      local $/;
=51=      push @failed, ["errors" => <$errors>];
=52=    }
=53=
=54=    my $key = join " ", map $_->[0], @failed;
=55=
=56=    if ($key) {                   # bad report
=57=      $cache->remove("");         # blow away good report stamp
=58=      if ($cache->get($key)) {    # seen this recently?
=59=        ## print "ignoring duplicate report for $key\n";
=60=      } else {
=61=        $cache->set($key, 1, $TEST_FAIL_INTERVAL);
=62=
=63=        my @report;
=64=
=65=        for (@failed) {
=66=          my ($key, $value) = @$_;
=67=          my @values = split /\n/, $value;
=68=          @values = ("") unless @values; # ensure at least one line
=69=          push @report, "$key = $_\n" for @values;
=70=        }
=71=
=72=        SEND_REPORT(@report);
=73=      }
=74=    } else {                      # good report
=75=      if ($cache->get("")) {      # already said good?
=76=        ## print "ignoring good report\n";
=77=      } else {
=78=        $cache->clear();          # all is forgiven
=79=        $cache->set("", 1, $ALL_CLEAR_INTERVAL);
=80=
=81=        SEND_REPORT();            # empty means good report
=82=      }
=83=    }