Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Linux Magazine Column 54 (Dec 2003)

[suggested title: ``Checking your website's health, part 2'']

In last month's column, I showed how to create a testing program based on Perl's own testing framework together with WWW::Mechanize to test the health of a website. For reference, I've reproduced the entire code developed in last month's article in [listing one, below].

The test code is run on demand to verify the proper operation of our target website: in this case, search.cpan.org. The output looks something like:

    1..9
    ok 1 - The object isa WWW::Mechanize
    ok 2 - fetched /
    ok 3 - / title matches
    ok 4 - follow FAQ link
    ok 5 - fetched FAQ page
    ok 6 - FAQ content matches
    ok 7 - select query form
    ok 8 - query returned good for 'author'
    ok 9 - found Andy Lester

While invoking this program directly certainly gives us immediate status, it'd be more useful to run this program automatically and frequently. For example, I could invoke this program every five minutes from cron, and then mail the results to my cell phone or pager. However, because there's a lot of output even when everything is OK, I'll be getting a lot of useless interruptions just to say ``everything is OK''.

Running the program under the standard Test::Harness module helps a bit. This module interprets the ok and not ok values appropriately, providing a nice summary at the end, resulting in output something like:

    01-search-cpan....ok
    All tests successful.
    Files=1, Tests=9,  2 wallclock secs ( 0.53 cusr +  0.05 csys = 0.58 CPU)

However, it's still hard to reduce the text to just pinpoint the errors, or know whether things were successful or a partial or total failure. Also, one thing I hate is being told the same thing over and over again when a failure occurs, or not being told when something has cleared up. And, the text doesn't squish well into a nice SMS message for my phone or pager.

So, let's take this one step further. The Test::Harness module inherits its core functionality from Test::Harness::Straps, which is still in development. We can use a Test::Harness::Straps object to invoke the test script and interpret its output in a way that is consistent with the Test::Harness interpretation, and programmatically determine which tests failed.

If we have that, we can tailor the output. One strategy might be to test every five minutes (from cron), but page a message only when things are broken, and then only once every thirty minutes. This message can be cut down to just precisely the failing tests, and perhaps the associated error output and exit status of the test program. Once the error clears up, the program can page on the next round with a single ``all clear'' signal so that we can turn back around and head home again instead of finishing our trek into the office in the middle of the night to fix the problem.

Of course, there's more going on than just ``all OK'', and ``something broken''. We can consider each different combination of ``something broken'' to be a different thing worthy of paging. Let's ensure that only one page per unique combination of events gets sent, and one all clear signal only when everything clears up.

Sounds difficult? Not at all, especially when we use Cache::FileCache as a lightweight time-oriented database. The resulting cron-job program is shown in [listing two, below].

Lines 1 through 3 start nearly every program I write, turning on warnings and compiler restrictions, and disabling the buffering of STDOUT.

Lines 5 to 16 define the configuration section: things I'd be likely to change and want to locate quickly. The two time constants are defined in Cache::Cache-compatible units, which understands things like ``15 seconds'' or ``4 hours''. See the documentation for details. The $ALL_CLEAR_INTERVAL defines how often a repeat page saying ``everything is OK'' gets sent. If you set this to ``1 day'', you'll get a single page a day saying everything is OK as a nice meta-check that your health-check is OK. By setting it to never, you get one page when the monitoring starts the very first time, but never again unless it's after a failure has been fixed. Similarly, $TEST_FAIL_INTERVAL defines how often a page is sent for an identical combination of failures.

Lines 10 to 14 define the callback subroutine of what to do when an event of significance occurs. If this subroutine is called with no parameters, then it's the ``all clear'' signal. Otherwise, it's the current error text. For debugging, I'm simply displaying this text to STDOUT, but by reopening standard output to a pipe to sendmail, I could just as easily send this to my cell phone or pager, presuming there's a mail gateway.

Lines 18 to 21 pull in the four modules needed by the rest of the program. Three of the four modules come with all recent Perl distributions, although you might have to upgrade your Test::Harness from the CPAN if you get an error because of Test::Harness::Straps. The fourth is Cache::FileCache, part of the Cache::Cache distribution.

Lines 23 and 24 create a temporary file and associated filehandle to let me capture the STDERR output from the various test programs being run (such as the one in [listing one]). The error output is usually diagnostic explanations, and often elaborates on the reasons for failure. We also save the current STDERR so that we can reset it after every child process, so that our own die and warn messages end up in the right place.

Line 26 sets up the Cache::FileCache object, giving us a memory between invocations.

Line 28 ensures that our current directory is the same as the running script. This permits the paging program to be called as an absolute path in cron without having to manage the location of the test scripts, as long as they're all nearby.

Line 30 creates the Test::Harness::Straps object to interpret the result of one or more testing program invocations.

Line 32 collects together the failure information that will eventually decide what we report. The array will contain arrayrefs: each referenced array has two elements. The first element is an identifying (hopefully unique) key of a failing condition, and the second is its associated text. We'll see how this gets created and reported later.

Lines 34 to 46 loop over every test file in either the current directory, or a subdirectory named t, similar to the normal Test::Harness module-installation operation. Each of these tests is run separately, although the results will all be gathered for a single page. Instead of limiting the location of these *.t files to the current directory and one subdirectory, I might also want to use File::Find or a configuration file to define which tests are run.

Line 35 defines %results, having the same meaning as the Test::Harness::Straps documentation gives to the variable.

Lines 36 to 40 run a specific test file, using the Test::Harness::Straps object. Because we want to capture the STDERR output from each invocation, we must first redirect our own STDERR into the temporary file, then call the testing harness, then restore it back. It's a bit messy, but necessary. Perhaps future versions of Test::Harness::Straps will provide a hook to do this directly.

Once we have the results, we're concerned with two things. First, did any of the tests fail? And second, did the test child exit in some unusual manner.

To see if any of the tests failed, we look at $results{details}, which references an array of individual tests and their results. Within each element of the array, the ok element of the hashref will be true if the test succeeded. If that's true, we simply ignore the test. If it's false, we'll add to @failed a new arrayref that contains an identification name for the test (offset by one because element zero of the array is test one) and the name the test gives itself (usually the text after the comment mark). This is all handled nicely by the map in lines 41 to 44.

If the child exited badly, we'll add another element with the wait status to @failed in line 45.

Lines 48 to 52 look at the standard error output that has accumulated from running all of the tests. If any output exists, it's gathered into another @failed entry keyed by errors.

When we get to line 54 in the program, we've run all of our tests, possibly from many different test programs, and have the results in @failed. We next create a ``current problems key'' in $key, resulting from joining all of the error tags into a space separated list. If this string is empty, everything went OK, but otherwise we'll end up with a list like "health.t:4 health.t:wait errors" showing that test 4 failed, the wait status was bad, and we also had some text on STDERR from one or more of the children.

Based on this error key string, we now decide whether to page or not. In line 56, we'll distinguish between ``everything is OK'', and ``something is wrong''.

If something is wrong, we'll execute the code in lines 57 to 73. First, we'll remove any marker for ``everything is OK'' in the cache. This ensures that the next time everything is OK, we'll send a page to say so.

Line 58 sees if we have recently sent a page with this particular error combination. If so, the value returns true from the cache, and we'll do nothing further in this invocation. (I've left the commented-out debugging print at line 59 so you can see where this happens.)

Otherwise, it's time to send a page. First, in line 61, we'll ensure that we don't duplicate this particular page within the $TEST_FAIL_INTERVAL time window.

Line 63 defines a holder for the report. Lines 65 to 70 process the @failed array, extracting out each key/value pair, and then prepending each line of the value with the key for careful labelling. Even if the value is empty, at least one line containing the key is generated.

Line 72 passes this report list into the SEND_REPORT callback, defined at the top of the program. This sends the appropriate report with just the broken pieces of the collective tests.

Lines 75 to 82 deal with an ``everything is OK'' run. First, if there's already been an ``all OK'' signal recently enough, there's nothing to do (again, noted in a commented-out debugging print in line 76). Otherwise, we throw away all the recently seen failure tags in line 78 by clearing out the entire cache, and then setting a flag to prevent an additional ``everything is OK'' message until the $ALL_CLEAR_INTERVAL has passed in line 79.

Line 81 passes an empty list to SEND_REPORT, a signal that it's time to send the all-clear message.

Although this simple test reporting tool doesn't have a lot of fancy features, it illustrates how the basics are accomplished and how the reporting can be kept to the essentials, and would probably work fine for a single system administrator on a typical personal or small-business site. If you want more, there's larger, more complex, and even commercial solutions to being notified when things go wrong. And as always, until next time, enjoy!

Listings

        =0=     #### LISTING ONE ####
        =1=     #!/usr/bin/perl
        =2=     use Test::More tests => 9;
        =3=     use WWW::Mechanize;
        =4=     isa_ok(my $a = WWW::Mechanize->new, "WWW::Mechanize");
        =5=     
        =6=     $a->timeout(1);
        =7=     $a->get("http://search.cpan.org/";);
        =8=     is($a->status, 200, "fetched /");
        =9=     like($a->title, qr/The CPAN Search Site/, "/ title matches");
        =10=    SKIP: {
        =11=      ok($a->follow_link( text => 'FAQ' ), "follow FAQ link")
        =12=        or skip "missing FAQ link", 2;
        =13=     SKIP: {
        =14=        is($a->status, 200, "fetched FAQ page")
        =15=          or skip "bad FAQ fetch", 1;
        =16=        like($a->content, qr/Frequently Asked Questions/, "FAQ content matches");
        =17=        $a->back;
        =18=      }
        =19=    }
        =20=    SKIP: {
        =21=      ok($a->form_number(1), "select query form")
        =22=        or skip "cannot select query form", 2;
        =23=      $a->set_fields(query => "PETDANCE", mode => 'author');
        =24=      $a->click();
        =25=     SKIP: {
        =26=        is($a->status, 200, "query returned good for 'author'")
        =27=          or skip "missing author page", 1;
        =28=        like($a->content, qr/Andy Lester/, "found Andy Lester");
        =29=        $a->back;
        =30=      }
        =31=    }
        =0=     #### LISTING TWO ####
        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     ## CONFIG
        =6=     
        =7=     my $ALL_CLEAR_INTERVAL = "never"; # how often to repeat "all clear" signal
        =8=     my $TEST_FAIL_INTERVAL = "30 minutes"; # how often to repeat test failed
        =9=     
        =10=    sub SEND_REPORT {               # what do I do with a report?
        =11=      ## open STDOUT, "|sendmail 5035551212\@vtext.com" or die "sendmail: $!";
        =12=      @_ = "ALL CLEAR\n" unless @_;
        =13=      print @_;
        =14=    }
        =15=    
        =16=    ## END CONFIG
        =17=    
        =18=    use File::Temp qw(tempfile);    # core
        =19=    use File::Basename qw(dirname); # core
        =20=    use Test::Harness::Straps ();   # core
        =21=    use Cache::FileCache ();        # CPAN
        =22=    
        =23=    my $errors = tempfile();
        =24=    open SAVE_STDERR, ">&STDERR" or warn "dup 2 to SAVE_STDERR: $!";
        =25=    
        =26=    my $cache = Cache::FileCache->new({namespace => 'healthcheck_reporter'});
        =27=    
        =28=    chdir dirname($0) or warn "Cannot chdir to dirname of $0: $!";
        =29=    
        =30=    my $strap = Test::Harness::Straps->new;
        =31=    
        =32=    my @failed;
        =33=    
        =34=    for my $test_file (glob "*.t t/*.t") {
        =35=      my %results;
        =36=      {
        =37=        open STDERR, ">&", $errors or print "dup $errors to STDERR: $!";
        =38=        %results = $strap->analyze_file($test_file);
        =39=        open STDERR, ">&", \*SAVE_STDERR or print "dup SAVE_STDERR TO STDERR: $!";
        =40=      };
        =41=      push @failed, map {
        =42=        $results{details}[$_]{ok} ? () :
        =43=          ["$test_file:".($_+1) => $results{details}[$_]{name}]
        =44=      } 0..$#{$results{details}};
        =45=      push @failed, ["$test_file:wait" => $results{wait}] if $results{wait};
        =46=    }
        =47=    
        =48=    if (-s $errors) {
        =49=      seek $errors, 0, 0;
        =50=      local $/;
        =51=      push @failed, ["errors" => <$errors>];
        =52=    }
        =53=    
        =54=    my $key = join " ", map $_->[0], @failed;
        =55=    
        =56=    if ($key) {                     # bad report
        =57=      $cache->remove("");           # blow away good report stamp
        =58=      if ($cache->get($key)) {      # seen this recently?
        =59=        ## print "ignoring duplicate report for $key\n";
        =60=      } else {
        =61=        $cache->set($key, 1, $TEST_FAIL_INTERVAL);
        =62=    
        =63=        my @report;
        =64=    
        =65=        for (@failed) {
        =66=          my ($key, $value) = @$_;
        =67=          my @values = split /\n/, $value;
        =68=          @values = ("") unless @values; # ensure at least one line
        =69=          push @report, "$key = $_\n" for @values;
        =70=        }
        =71=    
        =72=        SEND_REPORT(@report);
        =73=      }
        =74=    } else {                        # good report
        =75=      if ($cache->get("")) {        # already said good?
        =76=        ## print "ignoring good report\n";
        =77=      } else {
        =78=        $cache->clear();            # all is forgiven
        =79=        $cache->set("", 1, $ALL_CLEAR_INTERVAL);
        =80=    
        =81=        SEND_REPORT();              # empty means good report
        =82=      }
        =83=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.