Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 39 (Aug 2002)

[suggested title: Watching long processes through CGI]

The CGI protocol is wonderful for the remote execution of short tasks. But how do you execute a longer task? You can't just have the task slowly executing without giving some kind of feedback to the user, and eventually Apache will get bored and have the connection time out anyway.

I've seen (and written) some solutions that depend on the browser understanding ``server push'', but that's not a universal feature. And then there are the solutions that write simple enough HTML that the page, rendered incrementally, shows some signs of activity. Again, you can't count on that across the browser spectrum.

But one solution that minimizes server overhead and client browser dependence is the use of ``client pull'', also called ``meta refresh''. The initial request sets up a forked process to perform the real work, and redirects the browser to a new URL which will ``pull'' the results obtained so far. If the results are incomplete, an additional header instructs the browser to ``refresh'' the data after some number of seconds.
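
In CGI.pm terms (which is what the listing below uses), the ``pull'' side of this is nothing exotic. Here's a minimal sketch of an incomplete-results page; the meta tag tells the browser to re-fetch the same URL in five seconds:

    #!/usr/bin/perl
    use strict;
    use CGI qw(:all);

    print header;                   # normal CGI header
    print start_html(-title => "Still working",
                     -head => ["<meta http-equiv=refresh content=5>"]);
    print p("Here's what I know so far...");
    print end_html;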

Now, this sounds like it might be messy, at least in terms of managing the inter-process communication. How will the new CGI invocations know which data to display? This is handled by creating a unique ``session key'' which should be hard to guess but easy to hand around. For my sample implementation, I'm using the MD5 hash of some mostly unpredictable data. And where will this data be? Sure, I could use temporary files, which would then require some sort of cleaner to zap out the old stale files, but an easier solution is to use the Cache::Cache module from the CPAN, whose praises I've sung here in the past.

So, the basic strategy is this: the browser hits the form, and the user fills out that form; the browser submits the form, and after verifying good information, the response forks to run the task and redirects the browser back with a session key; the forked process runs the task, taking output as it arrives and updating a cache value, flagging when it is complete; the CGI script pulls from the cache and displays it, sending a refresh as long as the data is not complete.

For purposes of demonstration, we'll use a typical system administration task: running a traceroute. Obviously, this consumes system and network resources, so you should not set this up exactly as I've written it [in the listing below] in a public place unless you want angry glares from your net neighbors.

Lines 1 through 3 begin nearly every CGI program I write: enabling taint checking, compiler restrictions, and disabling the buffering of standard output.

Line 5 sets the shell execution path. Because we're running in taint mode, any launch of a child command will be flagged unless the PATH itself is also untainted, and the simplest way to do that is to set it directly.

Line 7 pulls in the CGI shortcuts, including a couple of unusual entries that don't get pulled in with :all for reasons I don't fathom.

Lines 9 to 57 form the three-way switch to have this CGI program decide what its personality will be for this particular invocation. Since they are actually listed in the reverse order of their normal invocation sequence, I'll start at the bottom and work backwards.
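
Stripped of the details, that switch is a single if/elsif/else chain on the incoming parameters. In this sketch, show_results() and launch_job() are just placeholder names standing in for the inline code of the real listing:

    # show_results() and launch_job() are stand-ins for the listing's inline code
    if (my $session = param('session')) {     # third phase: pull results so far
      show_results($session);
    } elsif (my $host = param('host')) {      # second phase: fork and run the job
      launch_job($host);
    } else {                                  # first phase: nothing yet, show the form
      show_form();
    }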

Line 56 shows a web form to accept the single parameter: the host to which we are tracerouting. This comes from a subroutine defined in lines 61 to 66. Simply put, we print the HTTP (actually CGI) header, the beginning of the HTML, titling the page as Traceroute, and then a first-level header of the same. The form comes next (with an action that defaults to the same script again), with a single submit button and a textfield. The fieldname is host, which we note for the next part of the description. Then the form is closed, and the HTML completed. This is your standard trivial form.
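
That subroutine is short enough to repeat here, straight from lines 61 through 66 of the listing:

    sub show_form {
      print header, start_html("Traceroute"), h1("Traceroute");
      print start_form;             # action defaults to this same script
      print submit('traceroute to this host:'), " ", textfield('host');
      print end_form, end_html;
    }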

When the user submits this form, we come back to the same script, and end up in the code starting in line 24. And here's where it gets interesting.

First, lines 25 and 26 validate the input parameter, and untaint it by extracting the match of a known, narrow regex. Note that I limit the size of the hostname to 100 characters to prevent a denial-of-service or buffer-overflow attack, and restrict the range of characters to prevent other messiness. Be very conservative when accepting web form parameters. If the validation fails, we redisplay the form in line 53.
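
The test itself is one anchored match, and the untainted value is whatever the capture grabbed:

    if ($host =~ /^([a-zA-Z0-9.\-]{1,100})\z/) {
      $host = $1;                 # untainted: known characters, bounded length
    } else {
      show_form();                # anything else just gets the form again
    }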

Line 27 fetches a unique session ID. This will be 32 hex characters that should be reasonably hard to predict. The subroutine in lines 80 through 84 pulls in the Digest::MD5 module (found in the CPAN) and hashes some mostly unpredictable data to generate such a value. I stole the routine from Apache::Session, so if it's good enough for them, it's good enough for me.
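
For reference, that routine (lines 80 through 84) is a double MD5 over the time, a freshly allocated anonymous hashref (whose stringified address is quasi-random), a random number, and the process ID:

    sub get_session_id {
        require Digest::MD5;

        # hash of time, an anon hashref's address, rand(), and the PID
        Digest::MD5::md5_hex(Digest::MD5::md5_hex(time().{}.rand().$$));
    }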

Line 28 gets a Cache::Cache object to hold the information for the interprocess communication. The subroutine beginning in line 68 defines this object: we'll cache in the filesystem, naming the application tracerouter. The data will be good for 30 minutes before expiring, and a purging run will be executed automatically on the first hit after 4 hours have passed. Look ma, no maintenance run. If you decide you want shared memory instead, a simple change to this subroutine will create the cache there.
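
Here's the constructor (lines 68 through 78); the ``simple change'' for shared memory would presumably be substituting Cache::SharedMemoryCache (from the same Cache::Cache distribution) in this one spot:

    sub get_cache_handle {
      require Cache::FileCache;               # part of the Cache::Cache distribution

      Cache::FileCache->new
          ({
            namespace => 'tracerouter',         # keep our keys to ourselves
            username => 'nobody',
            default_expires_in => '30 minutes', # data goes stale after this
            auto_purge_interval => '4 hours',   # self-cleaning: no cron job needed
           });
    }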

Line 29 puts the initial load into the cache. The cached value is always a two-element arrayref. The first element is a flag: true if the output is complete. The second element is the data so far.

And now the fun part. We're going to fork starting in line 31. This permits the parent process to tell Apache that we're done responding to this request, while letting the child go off to perform the long traceroute. If we're the parent, we need to construct a URL that points back to us, but with just the session ID. So, we clear all the stored CGI parameters (line 32), set the session ID (line 33), and then print a CGI-redirect of ``ourself'' (as modified), which becomes an external redirect to the browser (line 34), and we're done.
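
Here's the parent's half of that fork (lines 31 through 34), with the browser-facing redirect:

    if (my $pid = fork) {         # parent: hand the browser off and finish
      delete_all();               # clear the submitted parameters (drops 'host')
      param('session', $session); # leave only the session key
      print redirect(self_url()); # external redirect back to this same script
    }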

The child goes on, but it must first close STDOUT, because otherwise Apache will think there might still be some output coming for the browser, and won't respond to the browser or release the connection until this is all resolved. Next, we have to launch a child process of the child to execute the traceroute.

We'll do this with a pipe-open, which includes an implicit fork, in line 37. The grandchild process merges STDERR onto STDOUT, and then executes traceroute, passing it the validated host parameter from before. If line 40 is ever reached, the exec has failed, and the die message becomes the single line of output that eventually shows up as the result.

The child (that is, the parent of the traceroute) reads from the filehandle opened from the STDOUT (and STDERR) of the traceroute starting in line 42. We declare a buffer ($buf), and as each line is read (line 43), the line is added to the buffer (line 44) and shoved into the cache storage (line 45). When the command is complete, we get end-of-file, drop out of the loop, and store the entire buffer again with an ``I'm done'' flag (line 46) and exit (line 47).
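
Putting the child's side together (lines 36 through 48): close STDOUT, pipe-open a grandchild to run the command, and feed the cache as each line arrives:

    close STDOUT;                 # so Apache can finish answering the browser
    unless (open F, "-|") {       # implicit fork: false in the grandchild
      open STDERR, ">&=1";        # merge traceroute's STDERR onto its STDOUT
      exec "/usr/sbin/traceroute", $host;
      die "Cannot execute traceroute: $!";
    }
    my $buf = "";
    while (<F>) {                 # one line at a time, as traceroute produces it
      $buf .= $_;
      $cache->set($session, [0, $buf]);   # partial data, flag still says "not done"
    }
    $cache->set($session, [1, $buf]);     # same data, but now flagged complete
    exit 0;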

In short, the child process scurries off to execute the command. And the parent has told the server to tell the browser to ``please revisit me with this session key''. So the browser comes back of its own volition, and ends up starting in line 9, for the third and final part of this program.

Line 10 gets the cache handle, opening the same cache to which the forked child is writing. Line 11 gets the cache data for that session key. Now if the data is missing, either it has expired, or someone is trying to jimmy up a session key to hijack someone else's session. In either case, we show the form (again), and stop.

Line 16 generates the CGI header. Lines 17 to 19 follow that with the HTML header. If the ``data complete'' flag is not set, then we need to keep going after this display, so we'll add a meta-refresh tag to the head info. This instructs the browser to poll the same URL in some number of seconds (here 5 seconds).

Line 20 labels the output with a first-level header, and dumps the data (nicely HTML escaped and wrapped in a PRE element) that we have so far. If the data is incomplete, an italicized ``continuing'' paragraph is appended, to let the user know that we're still working on the answer. And that's it!
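
Condensed, the whole display branch (lines 10 through 23) looks like this; the meta refresh appears only while the completion flag is still false:

    my $cache = get_cache_handle();
    my $data = $cache->get($session);
    unless ($data and ref $data eq "ARRAY") {   # expired, or a made-up key
      show_form();
      exit 0;
    }
    print header;
    print start_html(-title => "Traceroute Results",
                     ($data->[0] ? () :         # done? then no refresh needed
                      (-head => ["<meta http-equiv=refresh content=5>"])));
    print h1("Traceroute Results");
    print pre(escapeHTML($data->[1]));          # the output so far, safely escaped
    print p(i("... continuing ...")) unless $data->[0];
    print end_html;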

So, that's a basic strategy for watching a long-running program do its job, remotely through CGI invocations. Again, be aware of the resources this web script would let your remote visitors consume, and of the range of actions you'd really want to permit. Also, the child process has no awareness of whether anyone is still watching: if the user wanders off, it continues merrily chugging away to produce a result that no one will see. Perhaps that can be fixed in another revision of the program. But until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -T
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     $ENV{PATH} = "/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin";
        =6=     
        =7=     use CGI qw(:all delete_all escapeHTML);
        =8=     
        =9=     if (my $session = param('session')) { # returning to pick up session data
        =10=      my $cache = get_cache_handle();
        =11=      my $data = $cache->get($session);
        =12=      unless ($data and ref $data eq "ARRAY") { # something is wrong
        =13=        show_form();
        =14=        exit 0;
        =15=      }
        =16=      print header;
        =17=      print start_html(-title => "Traceroute Results",
        =18=                       ($data->[0] ? () :
        =19=                        (-head => ["<meta http-equiv=refresh content=5>"])));
        =20=      print h1("Traceroute Results");
        =21=      print pre(escapeHTML($data->[1]));
        =22=      print p(i("... continuing ...")) unless $data->[0];
        =23=      print end_html;
        =24=    } elsif (my $host = param('host')) { # returning to select host
        =25=      if ($host =~ /^([a-zA-Z0-9.\-]{1,100})\z/) { # create a session
        =26=        $host = $1;                 # untainted now
        =27=        my $session = get_session_id();
        =28=        my $cache = get_cache_handle();
        =29=        $cache->set($session, [0, ""]); # no data yet
        =30=    
        =31=        if (my $pid = fork) {       # parent does
        =32=          delete_all();             # clear parameters
        =33=          param('session', $session);
        =34=          print redirect(self_url());
        =35=        } elsif (defined $pid) {    # child does
        =36=          close STDOUT;             # so parent can go on
        =37=          unless (open F, "-|") {
        =38=            open STDERR, ">&=1";
        =39=            exec "/usr/sbin/traceroute", $host;
        =40=            die "Cannot execute traceroute: $!";
        =41=          }
        =42=          my $buf = "";
        =43=          while (<F>) {
        =44=            $buf .= $_;
        =45=            $cache->set($session, [0, $buf]);
        =46=          }
        =47=          $cache->set($session, [1, $buf]);
        =48=          exit 0;
        =49=        } else {
        =50=          die "Cannot fork: $!";
        =51=        }
        =52=      } else {
        =53=        show_form();
        =54=      }
        =55=    } else {                        # display form
        =56=      show_form();
        =57=    }
        =58=    
        =59=    exit 0;
        =60=    
        =61=    sub show_form {
        =62=      print header, start_html("Traceroute"), h1("Traceroute");
        =63=      print start_form;
        =64=      print submit('traceroute to this host:'), " ", textfield('host');
        =65=      print end_form, end_html;
        =66=    }
        =67=    
        =68=    sub get_cache_handle {
        =69=      require Cache::FileCache;
        =70=    
        =71=      Cache::FileCache->new
        =72=          ({
        =73=            namespace => 'tracerouter',
        =74=            username => 'nobody',
        =75=            default_expires_in => '30 minutes',
        =76=            auto_purge_interval => '4 hours',
        =77=           });
        =78=    }
        =79=    
        =80=    sub get_session_id {
        =81=        require Digest::MD5;
        =82=    
        =83=        Digest::MD5::md5_hex(Digest::MD5::md5_hex(time().{}.rand().$$));
        =84=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.