Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Web Techniques Column 20 (December 1997)

Suppose you've got a search engine attached to a web page form that takes a little while to come up with a result. Now, you could just tie up the user's browser with a r-e-a-l-l-y l-o-n-g response time, but most people are into instant gratification, and will probably just abort the query.

It'd be nice if you could return some message right away that says ``Hey, I'm working on it. Gimme a second.'' Well, that's actually pretty easy to do, using the technique shown in this month's column.

Put simply, the CGI request will be split into two separate paths:

  1. The first path returns a redirect to a URL for a page that initially says ``search in progress... reload this page to check the status''. The user sees this page quickly after making the query.

  2. The second path actually performs the search in the background. When the search is complete, the page the user is reloading is updated to contain not ``search in progress'', but the results of the search.

This keeps the communication between the browser and the server to a minimum. When the user wants to see if the results are in, they just hit reload. In fact, on browsers that support client-pull, you can even make it autoreload every 5 or 10 seconds instead, quitting when the result is done! Of course, a manual reload will also suffice, so this method is compatible with non-client-pull browsers as well.

Some people suggest that you use a server-push for something like this. I don't like that because not all browsers support server-push, and it also ties up communication resources while the calculation is proceeding. But it does depend on the application (and I won't rule out that technique entirely... maybe a future column will speak to that).

In order to make this work, you have to create a ``results'' directory that is visible in your web tree, and writeable by the CGI script. Also, this results directory will have to be cleaned up regularly. (More on that later.)

So, I've hacked out this magical two-pronged CGI script, presented in [Listing One, below].

Lines 1 through 3 begin most of the programs I write, enabling taint checks, warnings, compile-time restrictions, and unbuffering standard output.

Lines 5 and 6 pull in the two modules I'll be needing: the CGI module (for CGI operation and HTML generation) and IO::File. Both of these modules are standard with the current release of Perl. If you have an older version of Perl, it's time to upgrade anyway.

Lines 8 to 15 create a subroutine &maybe_print_header, along with a static local variable $header_printed. Initially, this variable is 0, so the first time that the subroutine is invoked, an HTTP header is generated (using a routine from the CGI module). After that, it'll be skipped. I used a trick similar to this in earlier columns, but I stumbled upon this strategy while writing this program, and like it much better now.
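The same print-once trick works anywhere, not just for CGI headers. Here's a minimal standalone sketch (the names maybe_print_banner and $printed are mine, not from the listing):

```perl
#!/usr/bin/perl
use strict;

# A bare block gives $printed a scope that only the sub can see,
# making it act like a private "static" variable: the flag survives
# between calls, but nothing outside the block can touch it.
{
  my $printed = 0;

  sub maybe_print_banner {
    print "Content-type: text/html\n\n" unless $printed;
    $printed = 1;
  }
}

maybe_print_banner();   # prints the header
maybe_print_banner();   # second call prints nothing
```

The listing wraps its version in a BEGIN block so the flag is initialized at compile time, before any code could possibly call the sub.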

Lines 17 to 22 define &ent, which creates the appropriate HTML entities for the arbitrary text strings presented in the argument. It'd be nice to be able to use CGI->escapeHTML, but that's listed as a private function (not on the manpage), so I can't count on it being available in future releases. (See the source for CGI.pm to see what I mean.)
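If you'd rather not lean on a private CGI.pm routine at all, a standalone escaper in the same spirit takes only a few lines. This sketch captures with parentheses instead of using $&, which historically slowed down every regex in the program:

```perl
#!/usr/bin/perl
use strict;

# Replace each HTML-significant character with its numeric entity,
# &#NNN;, where NNN is the character's code.
sub ent {
  my $text = shift;
  $text =~ s/([<>&"])/"&#" . ord($1) . ";"/ge;
  return $text;
}

print ent(q{Fish & "Chips" <cheap>}), "\n";
# prints: Fish &#38; &#34;Chips&#34; &#60;cheap&#62;
```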

Lines 24 through 32 define a ``death handler'', taking care of all user-created die and warn operators, as well as system deaths and warnings. Note that we print a header if one has not already been printed, and then a clean HTML version of the message passed in. If an already printed header said we were a text/plain file, this'll look ugly, but we have no way of finding out here.

Lines 34 and 35 define a couple of configuration constants pointing to the ``results directory'' I talked about earlier. The script needs to know both its position in URL space (for the browser redirect) as well as the filesystem path (to tell the search engine where to put the data). The exact mapping must be determined by consulting the webserver configuration. This directory must also be writable by this script, as well as any cleanup agent that will delete old queries after a sufficient time.

Line 37 grabs the only form parameter that we are expecting.

Lines 38 through 47 handle the case where the invocation of this script has not come from filling out the form (or perhaps filling out the form with an empty search). In this case, we want to dump out a clean form to be filled out.

Line 39 prints the header, with MIME type defaulting to text/html. (If the header had already been printed, then this will do nothing.)

Lines 40 through 47 print the bulk of the response, using the HTML shortcuts provided by the CGI module.

Line 41 generates the page title as well as the initial header stuff. Note that we're calling this a ``simulated search'', because I'm not actually doing anything useful.

Line 42 prints the H1 header with the right title.

Lines 43 through 46 print the form, bounded by horizontal rules (hr).

Line 43 prints the beginning of the form. Note that we are overriding the CGI module's default of a POST method, changing it to a GET method. This is so that the response redirect (which will be a GET method) doesn't trigger an error message on standards-compliant browsers. Nonstandard browsers (such as Netscape Communicator and MSIE) pay no attention to the apparent mismatch between POST and GET from a redirect. So testing with merely these two browsers will not reveal a problem. (Good thing some of us actually read the specifications in order to write model code.)

Line 44 prints out a fill-in field for the search string. Line 45 generates the submit button, with no special labels.

Line 47 ends the HTML output, and this is the last line of execution when we're generating the form. (The structure of if statements keeps us from executing any further.)

When we have a good search string, the rest of the code in lines 48 to 88 kicks in.

Line 49 computes a ``session'' identifier from the current time of day and the process ID, compressed into a hex string. One of my first columns did this as well, so I just clipped the code from there. This is not a terribly secure value (it can easily be guessed), so in real code you'd probably want to throw something from rand() in there as well. It just has to be unique every time we hit this particular statement, no matter what process we are running in.
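Here's one hypothetical way to follow that advice: mix a 16-bit random value in alongside the time and process ID. It's still guessable in principle, so don't treat it as a security boundary.

```perl
#!/usr/bin/perl
use strict;

# Time (4 bytes) + low 16 bits of the pid + 16 random bits,
# hex-encoded: 8 bytes in, 16 hex characters out.
sub make_session {
  return unpack "H*", pack "Nnn", time, $$ % 65536, int rand 65536;
}

print make_session(), "\n";   # 16 hex characters, different each run
```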

Lines 50 through 53 construct various URLs and file paths based on this session ID. $session_html will be the name of the HTML file of the final result, and for a short time, the intermediate page that says that we're not done yet. $DIR_html contains this name with a proper path, while $DIR_html_tmp has this same name with ".tmp" attached so we can update the file cleanly. Finally, $URL_html contains the same name as $DIR_html, but mapped into the webserver's space so that we can redirect the client's browser to the results file.

Lines 54 through 65 create an HTML file containing the ``search in progress'' page. The name of this file is one of the strings we just created, and the filehandle to the file is created in lines 55 and 56. This is a lexically-scoped filehandle, and will automatically be flushed and closed at the end of this block.

Lines 57 through 64 dump the right HTML code into the page. Lines 58 through 60 create the HEAD portion of the file.

The generated HTML includes a client-pull refresh header. The ``5'' here refers to a client fetch every 5 seconds until the result is complete. You can increase this number for fewer hits to the server, or decrease it to get less latency between the completion of the search and the time the browser notices. For browsers that do not understand refresh, the user can manually ``reload'' the page until the real search information has appeared.

Lines 61 through 63 generate the visible portion of the HTML in the page.

Once the ``search in progress'' page has been generated, line 66 ``forks'' the CGI script into two separate processes. If the fork fails, we die immediately, saying so.

Line 67 detects which of the two processes we've now become. In the parent process, $childpid will be a non-zero value, and in the child process, $childpid will be 0.

The parent process does one simple thing. It sends back an external redirect header (HTTP status 302) so that the browser is sent to the ``search in progress'' page. It then exits so that the connection between the CGI script and the web server is finished, releasing the browser to view the page. This is performed in line 68.

The child process has a bit more work. (Then again, making your kids work for you while you delegate seems to be a good thing anyway.) Lines 70 through 87 contain the child's tasks.

First, and very important, lines 70 and 71 close STDIN and STDOUT by reopening them onto /dev/null. Once these handles are closed, the web server can accurately determine that no further communication with the browser is desired. Forgetting this step is the most common mistake I've seen in the CGI newsgroup for this kind of task. Notice that STDERR is left alone, so that die messages can still end up in the webserver log, which is a good thing.
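The fork-and-detach choreography can be sketched on its own. In this toy version (the messages and the sleep are mine), the parent stands in for the redirect-and-exit path, and the child stands in for the background search:

```perl
#!/usr/bin/perl
use strict;

defined(my $childpid = fork) or die "Cannot fork: $!";
if ($childpid) {                # parent: answer the client right away
  print "parent: sent the redirect, my job is done\n";
} else {                        # child: detach, then keep working
  open STDIN,  "</dev/null";    # web server now sees no more input...
  open STDOUT, ">/dev/null";    # ...and no more output from us
  print STDERR "child: searching in the background\n";
  sleep 1;                      # stand-in for the real search
  exit 0;
}
waitpid $childpid, 0;           # demo only: a real CGI parent exits instead
```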

Line 72 simulates the actual search. In a real program, we'd have some calculations going on here... perhaps some backquotes around some long-running process to get the result. Since this is only a demo, we'll just go away for 10 seconds.

Lines 73 through 84 create the search results page in a temporary file next to where we want the final result to be. It must be on the same filesystem as the final result, because we replace the ``search in progress'' page generated earlier with a rename, and rename cannot move a file across filesystems.

Lines 76 through 82 generate the content of this page, similar to the content of the ``search in progress'' page. Note that for this demo, I'm merely echoing back the parameters with a cute message. In a real program, this section of code would take the results of the search or other action and format it properly.

Note the use of the &ent routine in line 80. This ensures that HTML-significant characters in the search string are not going to mess up the output, even for this demo. Remember this when you are writing the real output generator.

Lines 85 and 86 rename the newly created file over the top of the existing ``search in progress'' file. The next client pull (or manually initiated reload) will then see the file, and we're done.
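This write-then-rename idiom generalizes to any file that readers may fetch mid-update. Here's a self-contained sketch, using File::Temp (standard in modern Perl) for a scratch directory so it's safe to run anywhere:

```perl
#!/usr/bin/perl
use strict;
use File::Temp qw(tempdir);

# Write the new content to a temp file in the SAME directory, then
# rename() it over the old file.  rename is atomic within a
# filesystem, so a reader always sees either the complete old page
# or the complete new page, never a half-written one.
my $dir  = tempdir(CLEANUP => 1);   # stand-in for the results dir
my $file = "$dir/result.html";
my $tmp  = "$file.tmp";

open my $out, ">", $file or die "Cannot create $file: $!";
print $out "search in progress\n";
close $out;

open $out, ">", $tmp or die "Cannot create $tmp: $!";
print $out "search results\n";
close $out;

rename $tmp, $file or die "Cannot rename $tmp to $file: $!";
```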

Don't forget to set up the cron job to clean up $DIR from time to time. On a busy system, you may want to run it hourly, deleting any queries older than an hour. On a slow system, a daily preening will probably suffice.
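Such a preening job could itself be a few lines of Perl. Here's a hypothetical version (the sub name and ages are mine), exercised on a scratch directory so it's safe to run anywhere; from cron you'd call preen() on the real $DIR instead:

```perl
#!/usr/bin/perl
use strict;
use File::Temp qw(tempdir);

# Remove any result page (or leftover .tmp file) older than the
# given age.  -M gives a file's age in days relative to script start.
sub preen {
  my ($dir, $max_age_days) = @_;
  opendir my $dh, $dir or die "Cannot opendir $dir: $!";
  for my $name (readdir $dh) {
    next unless $name =~ /\.html(\.tmp)?$/;
    my $path = "$dir/$name";
    unlink $path if -M $path > $max_age_days;
  }
  closedir $dh;
}

# Demo: one fresh file, one backdated two hours, then a one-hour preen.
my $dir = tempdir(CLEANUP => 1);
for my $name ("old.html", "new.html") {
  open my $fh, ">", "$dir/$name" or die "Cannot create $name: $!";
  close $fh;
}
utime time - 7200, time - 7200, "$dir/old.html";  # backdate two hours

preen($dir, 1/24);   # one hour, expressed in days for -M
print -e "$dir/old.html" ? "old survived\n" : "old removed\n";
print -e "$dir/new.html" ? "new survived\n" : "new removed\n";
```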

Thanks to Dale Bewley of Bewley Internet Solutions (dale@bewley.net) for some help with testing and the idea to use client pull. He said he wanted to see his name in the credits, so here it is.

Also, a special thanks to my fellow Perl trainer, the unstoppable Tom Phoenix (rootbeer@teleport.com), for not only his continued promotion of these Web Techniques columns for solutions to Usenet queries, but also for noticing that I hadn't yet written a column like this one.

Listing One

        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     $| = 1;
        =4=     
        =5=     use CGI qw/:standard/;
        =6=     use IO::File;
        =7=     
        =8=     BEGIN {
        =9=       my $header_printed = 0;
        =10=    
        =11=      sub maybe_print_header {
        =12=        print header(@_) unless $header_printed;
        =13=        $header_printed = 1;
        =14=      }
        =15=    }
        =16=    
        =17=    ## return $_[0] encoded for HTML entities
        =18=    sub ent {
        =19=      local $_ = shift;
        =20=    s/["<&>]/"&#".ord($&).";"/ge;  # entity escape
        =21=      $_;
        =22=    }
        =23=    
        =24=    ## death handler
        =25=    $SIG{"__DIE__"} = $SIG{"__WARN__"} = sub {
        =26=      my $why = shift;
        =27=      chomp $why;
        =28=      $why = ent $why;
        =29=      maybe_print_header();
        =30=      print "ERROR: $why\n";
        =31=      exit 0;
        =32=    };
        =33=    
        =34=    my $DIR = "/home/merlyn/Html/pic/results/";
        =35=    my $URL = "http://www.stonehenge.com/pic/results/";
        =36=    
        =37=    my $searchstring = param("search"); # the search item
        =38=    unless (defined $searchstring and length $searchstring) {
        =39=      maybe_print_header();
        =40=      print
        =41=        start_html("-title" => "Simulated Search"),
        =42=        h1("Simulated Search"),
        =43=        hr, start_form("-method" => "GET"),
        =44=        p, "Search for: ", textfield("-name" => "search"),
        =45=        p, submit,
        =46=        end_form, hr,
        =47=        end_html;
        =48=    } else {
        =49=      my $session = unpack("H*", pack("Nn", time, $$)); # 12 hex chars
        =50=      my $session_html = "$session.html";
        =51=      my $DIR_html = "$DIR$session_html";
        =52=      my $DIR_html_tmp = "$DIR$session_html.tmp";
        =53=      my $URL_html = "$URL$session_html";
        =54=      {
        =55=        my $out = IO::File->new($DIR_html,"w") or
        =56=          die "Cannot create $DIR_html: $!";
        =57=        print $out
        =58=          start_html("-title" => "Search in progress",
        =59=                     "-head" => ["<meta http-equiv=refresh content=5>"],
        =60=                     ),
        =61=          h1("Search in progress"),
        =62=          p("The search is still in progress.  Please reload this page."),
        =63=          end_html;
        =64=        ## implicit close
        =65=      }
        =66=      defined(my $childpid = fork) or die "Cannot fork: $!";
        =67=      if ($childpid) {              # parent does:
        =68=        print redirect($URL_html);
        =69=      } else {                      # child does:
        =70=        open STDIN, "</dev/null";
        =71=        open STDOUT, ">/dev/null";
        =72=        sleep 10;                   # simulate search time
        =73=        {
        =74=          my $out = IO::File->new($DIR_html_tmp,"w") or
        =75=            die "Cannot create $DIR_html_tmp: $!";
        =76=          print $out
        =77=            start_html("Search results"),
        =78=            h1("Search results"),
        =79=            p("I've found the item ",
        =80=              ent($searchstring),
        =81=              ", but I can't tell you where :-)."),
        =82=            end_html;
        =83=          ## implicit close
        =84=        }
        =85=        rename $DIR_html_tmp, $DIR_html or
        =86=          die "Cannot rename $DIR_html_tmp to $DIR_html: $!";
        =87=      }
        =88=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.