Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Web Techniques Column 34 (Feb 1999)

Back in this column in February 1997, I wrote an anonymizing proxy server in roughly 100 lines of Perl code. You could point your web browser at it, and your requests would go out under the proxy's IP address instead of your browser's, with all telltale identifying information (like cookies) removed. The downside of this tiny proxy server was that every fetched URL caused a process to be forked.

If a real web server were to do that, it would be limited to fairly light duty, so most modern web servers perform preforking instead. That is, they fork a number of servers ready to respond to incoming requests. Then, as the requests come in, there's no latency from the time it takes to fork and get ready to listen.
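
The shape of a preforking server can be sketched in a few lines of Perl. This sketch is mine, not taken from the listing below; in a real server each child would loop on accept() against an inherited listen socket, where here each child just reports in through a pipe:

```perl
use strict;

my $SLAVES = 4;

pipe(my $reader, my $writer) or die "pipe: $!";

my %kids;
for (1 .. $SLAVES) {
    defined(my $pid = fork) or die "fork: $!";
    if ($pid == 0) {                # child: a real slave would accept() here
        close $reader;
        print $writer "child $$ ready\n";
        close $writer;
        exit 0;
    }
    $kids{$pid} = "slave";          # parent: remember each slave by PID
}
close $writer;                      # so reading below sees EOF

my @ready = <$reader>;              # one line per child
waitpid($_, 0) for keys %kids;      # reap all the slaves

my $started = scalar @ready;
print "$started slaves started\n";
```

The key point is that all the forking happens up front, before any request arrives, so no per-request fork latency remains.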

Perl has enough tools to do this as well, so I decided to rewrite that proxy server as a preforking server. That alone would make it cool enough for a new column here, but I went one better.

Most modern web browsers accept compressed data (most commonly in the gzip format), automatically uncompressing it to display to the user. Because text compresses very well (especially large HTML pages), we can make this proxy server automatically detect a gzip-savvy browser, and compress all text downloads on the fly!
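
The negotiation amounts to checking the browser's Accept-Encoding header before swapping in a compressed body. Here's a minimal sketch using Compress::Zlib; the $accept_encoding string stands in for the real request header:

```perl
use strict;
use Compress::Zlib qw(memGzip memGunzip);

# Sketch: compress a text body only if the client said it accepts gzip.
my $accept_encoding = "gzip, deflate";
my $body = "<html><body>" . ("hello world " x 200) . "</body></html>";

my $wire_body = $body;
if ($accept_encoding =~ /\bgzip\b/) {
    my $gz = memGzip($body);
    $wire_body = $gz if defined $gz;   # fall back to plain text on failure
}

printf "plain: %d bytes, on the wire: %d bytes\n",
    length($body), length($wire_body);
```

Repetitive HTML like this squeezes down dramatically, which is exactly why the trick pays off on a modem link.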

That way, if I'm running this proxy server on a machine with good net connections (like a T-1 or better), I can use it as a speed-enhancing proxy on my normal modem links. And in fact, in practice, I found it to make downloading noticeably zippier -- almost spooky, actually.

And this new improved proxy server code is in [listing one, below]. Please note that because of the length of the listing, I'll be a little more terse than usual in describing individual constructs, concentrating instead on the overall strategy and flow of control.

Lines 1 through 4 begin most of the programs I write, making the program taint-safe, enabling compile-time restrictions, and unbuffering standard output.

Lines 6 and 7 are partially generated automatically by the RCS version control software. I'm creating two globals that hold the version information. The variable $VERSION has just the version number, handy to print out when the server starts up and to add to the LWP user-agent string when we fetch other pages.

Lines 9 through 11 are my copyright information. It's a good idea to make the licensing terms explicit, because the default these days is pretty restrictive. Here, I'm declaring that this program can be adapted and included just like the Perl source code.

Lines 14 through 18 define a little utility function that takes each line of text in possibly multiple arguments and prefixes it with the current time and the process-ID number. This utility function permits the normal die and warn operators to have better logging information, as reconfigured in lines 19 and 20. Line 21 sets up the signals using a subroutine defined later.
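
The same trick, reduced to its essentials: a prefix function plus a __WARN__ hook means every log line automatically carries a timestamp and the process ID:

```perl
use strict;

# Sketch of the logging setup: each physical line of a warning gets
# prefixed with the current time and the process ID.
sub prefix {
    my $now = localtime;
    join "", map { "[$now] [$$] $_\n" } split /\n/, join "", @_;
}

$SIG{__WARN__} = sub { print STDERR prefix(@_) };

warn "one message\nsplit over two lines";
# each line comes out looking like "[Tue Feb  2 10:00:00 1999] [12345] one message"
```

With die and warn reconfigured this way, every part of the server gets consistent logging for free, without touching any of the calling code.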

Lines 23 through 29 define configuration constants to control logging of various events. These flags are all examined for their true/false value. At a minimum, logging the child management and the transaction management is interesting -- the other items are generally for debugging (or if you're insanely curious and don't mind reams of output).

Lines 31 through 35 control other configuration constants. Line 32 defines the hostname to which this server will bind. Obviously, you won't want to use www.stonehenge.com for your binding, so change this to something appropriate. For testing, you can make this localhost, which prevents anyone other than your own machine from connecting to it. Line 33 selects the port number. If you leave it at 0, the system picks a random port number, which you can determine from the startup messages. You can also force it to be a particular port number, but if that port is already held by another process, the server cannot start.

Lines 34 and 35 trade performance against resource usage. The more servers you start, the more hits you can likely handle effectively, but you also use up valuable process slots and swap memory. Line 35 is mostly out of paranoia, telling each process to quit and restart after a certain number of uses. You want to keep this fairly high, because the first hit for each child process takes about three to five times as much CPU as the remaining hits.

The first real line of output is line 38. Not much to say there. Line 40 invokes &main to do the real work, which should never return. Again, defensive programming tells me to put an exit after that, just in case.

Lines 44 through 64 define the top-level algorithm as subroutine &main. The main code needs the assistance of the HTTP::Daemon module (from LWP) and creates a local hash called %kids (described below).

Lines 48 through 50 establish the master socket. This is an instance of HTTP::Daemon. The resulting proxy URL is dumped to standard error in line 50, and will be something like http://www.stonehenge.com:44242/. This socket (and the containing object information) are inherited by all the kids.

Speaking of the kids, we create them in lines 52 through 54. A simple foreach loop (spelled f..o..r but pronounced ``foreach'') invokes &fork_a_slave repeatedly. The process ID number becomes the key to the hash, with a corresponding value being the word slave. We can then detect any erroneous wait values that way, since they won't be one of our slaves. Not likely, but paranoid programming is handy and scales well.

Lines 55 to 63 loop forever, waking up whenever a kid dies, and restarting it so that there's always $SLAVE_COUNT kids waiting for web connections. There's a sleep in here so that if we go belly up while trying to fork, at least we don't thrash.
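
Stripped of the proxy details, that supervision loop looks like this. This is a sketch with stand-in slaves that exit on their own, plus a restart cap so the demonstration terminates; the real loop runs forever and sleeps a second before each replacement:

```perl
use strict;

my %kids;
my $spawn = sub {
    defined(my $pid = fork) or die "fork: $!";
    if ($pid == 0) { select undef, undef, undef, 0.05; exit 0 }  # stand-in slave
    $kids{$pid} = "slave";
};
$spawn->() for 1 .. 2;              # initial pool of two

my $restarts = 0;
while ($restarts < 3) {             # the real server loops forever here
    my $pid = wait;                 # wake only when a child dies
    my $was = delete($kids{$pid}) || "?unknown?";
    if ($was eq "slave") {
        # the real server sleeps 1 here, so a failing fork can't thrash
        $spawn->();
        $restarts++;
    }
}
kill 'TERM', keys %kids;            # tidy up the last generation
wait for keys %kids;
print "restarted $restarts slaves\n";
```

The %kids hash is what lets the parent distinguish its own slaves from any stray PID that wait might hand back.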

Lines 66 to 75 set up a signal group so that if any individual process takes a hit with SIGHUP or SIGINT or SIGTERM, the whole group takes the same hit. This way we can kill any process to bring down the entire family without having to use some hack like killall.

Lines 77 to 84 create a particular child slave. Line 81 forks (or perhaps not if there are no remaining process slots), saving the $pid to determine the parent versus the child. If the $pid is zero, the child runs off to the &child_does routine. The parent returns the non-zero PID value so that the top-level routines can record this as a good slave.

Speaking of &child_does, this routine (defined in lines 86 through 113) handles the top-level processing in each child slave. We're handed the master socket (kept in $master), which we'll use to get the incoming requests from the web clients that are using us as a proxy. Line 89 defines a counter to make sure we don't handle more requests than necessary.

Lines 92 through 110 define a repeating loop executing for $MAX_PER_SLAVE times. Line 93 is executed in parallel by all children. The operating system lets only one of the many children have an exclusive ``flock'' on the master socket. This is needed because we can have only one child at a time execute the accept on the socket in line 95. So, in an idle moment, we've got one child waiting in line 95, and the rest waiting in line 93. When the connection comes in, the child waiting in line 95 moves on, unlocking the port in line 97. The operating system then non-deterministically selects one of the other kids waiting at line 93 to move forward. It's pretty slick.
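
Here's that locking pattern on its own, with an ordinary temp file standing in for the master socket: each child takes the exclusive lock, does its "accept" (here, just an append to a log file), and releases it for the next child:

```perl
use strict;
use Fcntl qw(:flock);
use File::Temp qw(tempfile);

my ($lock_fh, $lockfile) = tempfile();   # stands in for the master socket
my ($log_fh,  $logfile)  = tempfile();
close $log_fh;

for my $n (1 .. 3) {
    defined(my $pid = fork) or die "fork: $!";
    next if $pid;                        # parent keeps forking
    open my $lock, "<", $lockfile or die "open: $!";
    flock($lock, LOCK_EX);               # only one child at a time gets past here
    open my $log, ">>", $logfile or die "open: $!";
    print $log "child $$ had the lock\n";
    close $log;                          # write finishes while we hold the lock
    flock($lock, LOCK_UN);               # now another child may proceed
    exit 0;
}
wait for 1 .. 3;                         # reap all three children

open my $in, "<", $logfile or die "open: $!";
my @lines = <$in>;
print scalar(@lines), " children took a turn\n";
```

In the proxy, the critical section guarded this way is the accept() call itself, so exactly one idle child is ever waiting on the socket.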

But, let's get back to that incoming connection. In line 98, we'll save away the CPU times and wall-clock time for later statistics. And line 100 tells the log where we got a connection from. The bulk of the work is done in &handle_one_connection (described below). But then we're ready to see how much work it took, so lines 103 to 107 compute the differential CPU user and system time, child versions of the same, and the difference in wall clock seconds. When that's all displayed, we loop back up and become one of the kids waiting in line 93 again.

By the way, there's nothing necessarily fair about the algorithm. It's quite possible that a particular child will get back up to line 93 at exactly the right moments often enough to handle many more requests than others. But that's OK, because the kids are all basically the same anyway. It does have a slight bearing on how often a child gets killed and restarted, though.

When the child finally dies, it logs that in line 111, and then exits with a nice error code in line 112.

Lines 115 to 129 provide the top-level code for a particular single connection to a web client. The request comes in via line 119, which should be an HTTP::Request object. If not, line 120 causes us to die. This is somewhat incomplete though, because we could just reset and try again, or try to figure out what was wrong. But this was close enough for my testing purposes. It does permit a denial-of-service attack on this proxy though, since some 3v1l d00d could connect and spit garbage at my proxy, forcing the kid to die. Oh well.

Line 122 calls down to get the actual response, and lines 123 to 126 log it if necessary. Line 127 sends the response back to the browser, and the connection gets closed down in line 128, permitting the browser to know that we're all through.

Lines 131 to 151 validate the request before performing the real action, using some basic checks. If the scheme isn't something that LWP can handle, it gets rejected in lines 139 through 141. If it's not really a proxy request (showing up as a relative URL), that's also rejected, in lines 143 through 145. Otherwise, we go fetch it via the call in line 149.
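
The two checks boil down to something like this sketch, using plain regexes in place of the URI methods that LWP provides (the function name check_url is mine):

```perl
use strict;

# Sketch of the proxy's validation: reject relative URLs, then reject
# any scheme LWP shouldn't be asked to fetch.
sub check_url {
    my $url = shift;
    my ($scheme) = $url =~ m{^([a-z][a-z0-9+.-]*):}i;
    return "relative URL not permitted" unless defined $scheme;
    return "bad scheme: $scheme" unless $scheme =~ /^(https?|gopher|ftp)$/i;
    return "ok";
}

print check_url("http://www.stonehenge.com/"), "\n";    # ok
print check_url("mailto:merlyn\@stonehenge.com"), "\n"; # bad scheme: mailto
print check_url("/relative/path"), "\n";                # relative URL not permitted
```

A relative URL arriving at a proxy means the client isn't really treating us as a proxy at all, which is why it earns its own rejection.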

Note the comment in line 135: this proxy is promiscuous. Any web client can use it if they can connect to it on the net. This means they also inherit my IP addresses for IP-based authorization, and that could be seriously bad. If you're gonna use this program for real, either hide it well (heh) or add some code in here to notice the peer address or whatever you'd like to do.

Well, it's about time that we really fetched the request, now that we know it's good, legal, and ready to fetch. Lines 156 through 192 define the routine to handle this, along with a static variable to hold the user agent.

Lines 159 to 165 set up the user agent (the virtual web client) once per child process. We'll give it an agent type of the LWP version number concatenated with the version number of this proxy server. Also, if there are any environment variables defining a further real proxy server, they'll be consulted in line 163.

Lines 167 to 170 handle some logging if needed. Then line 172, as simple as it looks, does 90% of the work of this program: fetching the requested URL (finally!).

Now that we've got a response, let's hack it a bit. If it's a good response (line 174) that's a text-ish file (line 175) and it's not already encoded (line 176) and the browser accepts gzip-encoded transfers (line 177), we'll replace the content with a gzip-equivalent, thus reducing the transmission time at the slight expense of some CPU horsepower at both ends.

Line 178 brings in the Compress::Zlib module (found in the CPAN). This defines the memGzip routine used in line 180 to compress the content. If the routine succeeds, then we'll update the content, length, and encoding in lines 182 through 184, and log it (perhaps) in lines 185 through 187.

And that brings us to the end! Whew.

If you want to try this program out:

  1. change the $HOST value to something appropriate,

  2. fire it up in a window,

  3. note the proxy address URL dumped out after master is ... right at the startup,

  4. put that address into your proxy settings for your browser, and

  5. go surfing!

You should see the proxy server as it takes each hit, along with the CPU usage for each transaction. The first hit will take a little longer than the rest, because some of the LWP code (and possibly the Compress::Zlib code) needs to be brought in.

There are several possible areas of expansion here, but they would all have made the program even longer than it is, so I'll leave those as an exercise for you (or perhaps for a future column). Enjoy!

Listing One

        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     $ENV{PATH} = join ":", qw(/usr/ucb /bin /usr/bin);
        =4=     $|++;
        =5=     
        =6=     my $VERSION_ID = q$Id$;
        =7=     my $VERSION = (qw$Revision$ )[-1];
        =8=     
        =9=     ## Copyright (c) 1996, 1998 by Randal L. Schwartz
        =10=    ## This program is free software; you can redistribute it
        =11=    ## and/or modify it under the same terms as Perl itself.
        =12=    
        =13=    ### debug management
        =14=    sub prefix {
        =15=      my $now = localtime;
        =16=    
        =17=      join "", map { "[$now] [${$}] $_\n" } split /\n/, join "", @_;
        =18=    }
        =19=    $SIG{__WARN__} = sub { warn prefix @_ };
        =20=    $SIG{__DIE__} = sub { die prefix @_ };
        =21=    &setup_signals();
        =22=    
        =23=    ### logging flags
        =24=    my $LOG_PROC = 1;               # begin/end of processes
        =25=    my $LOG_TRAN = 1;               # begin/end of each transaction
        =26=    my $LOG_REQ_HEAD = 0;           # detailed header of each request
        =27=    my $LOG_REQ_BODY = 0;           # header and body of each request
        =28=    my $LOG_RES_HEAD = 0;           # detailed header of each response
        =29=    my $LOG_RES_BODY = 0;           # header and body of each response
        =30=    
        =31=    ### configuration
        =32=    my $HOST = 'www.stonehenge.com';
        =33=    my $PORT = 0;                   # pick next available user-port
        =34=    my $SLAVE_COUNT = 8;            # how many slaves to fork
        =35=    my $MAX_PER_SLAVE = 20;         # how many transactions per slave
        =36=    
        =37=    ### main
        =38=    warn("running version ", $VERSION);
        =39=                   
        =40=    &main();
        =41=    exit 0;
        =42=    
        =43=    ### subs
        =44=    sub main {                      # return void
        =45=      use HTTP::Daemon;
        =46=      my %kids;
        =47=    
        =48=      my $master = HTTP::Daemon->new(LocalPort => $PORT, LocalAddr => $HOST)
        =49=          or die "Cannot create master: $!";
        =50=      warn("master is ", $master->url);
        =51=      ## fork the right number of children
        =52=      for (1..$SLAVE_COUNT) {
        =53=        $kids{&fork_a_slave($master)} = "slave";
        =54=      }
        =55=      {                             # forever:
        =56=        my $pid = wait;
        =57=        my $was = delete ($kids{$pid}) || "?unknown?";
        =58=        warn("child $pid ($was) terminated status $?") if $LOG_PROC;
        =59=        if ($was eq "slave") {      # oops, lost a slave
        =60=          sleep 1;                  # don't replace it right away (avoid thrash)
        =61=          $kids{&fork_a_slave($master)} = "slave";
        =62=        }
        =63=      } continue { redo };          # semicolon for cperl-mode
        =64=    }
        =65=    
        =66=    sub setup_signals {             # return void
        =67=    
        =68=      setpgrp;                      # I *am* the leader
        =69=      $SIG{HUP} = $SIG{INT} = $SIG{TERM} = sub {
        =70=        my $sig = shift;
        =71=        $SIG{$sig} = 'IGNORE';
        =72=        kill $sig, 0;               # death to all-comers
        =73=        die "killed by $sig";
        =74=      };
        =75=    }
        =76=    
        =77=    sub fork_a_slave {              # return int (pid)
        =78=      my $master = shift;           # HTTP::Daemon
        =79=    
        =80=      my $pid;
        =81=      defined ($pid = fork) or die "Cannot fork: $!";
        =82=      &child_does($master) unless $pid;
        =83=      $pid;
        =84=    }
        =85=    
        =86=    sub child_does {                # return void
        =87=      my $master = shift;           # HTTP::Daemon
        =88=    
        =89=      my $did = 0;                  # processed count
        =90=    
        =91=      warn("child started") if $LOG_PROC;
        =92=      {
        =93=        flock($master, 2);          # LOCK_EX
        =94=        warn("child has lock") if $LOG_TRAN;
        =95=        my $slave = $master->accept or die "accept: $!";
        =96=        warn("child releasing lock") if $LOG_TRAN;
        =97=        flock($master, 8);          # LOCK_UN
        =98=        my @start_times = (times, time);
        =99=        $slave->autoflush(1);
        =100=       warn("connect from ", $slave->peerhost) if $LOG_TRAN;
        =101=       &handle_one_connection($slave); # closes $slave at right time
        =102=       if ($LOG_TRAN) {
        =103=         my @finish_times = (times, time);
        =104=         for (@finish_times) {
        =105=           $_ -= shift @start_times; # crude, but effective
        =106=         }
        =107=         warn(sprintf "times: %.2f %.2f %.2f %.2f %d\n", @finish_times);
        =108=       }
        =109=   
        =110=     } continue { redo if ++$did < $MAX_PER_SLAVE };
        =111=     warn("child terminating") if $LOG_PROC;
        =112=     exit 0;
        =113=   }
        =114=   
        =115=   sub handle_one_connection {     # return void
        =116=     use HTTP::Request;
        =117=     my $handle = shift;           # HTTP::Daemon::ClientConn
        =118=   
        =119=     my $request = $handle->get_request;
        =120=     defined($request) or die "bad request"; # XXX
        =121=   
        =122=     my $response = &fetch_request($request);
        =123=     warn("response: <<<\n", $response->headers_as_string, "\n>>>")
        =124=       if $LOG_RES_HEAD and not $LOG_RES_BODY;
        =125=     warn("response: <<<\n", $response->as_string, "\n>>>")
        =126=       if $LOG_RES_BODY;
        =127=     $handle->send_response($response);
        =128=     close $handle;
        =129=   }
        =130=   
        =131=   sub fetch_request {             # return HTTP::Response
        =132=     use HTTP::Response;
        =133=     my $request = shift;          # HTTP::Request
        =134=   
        =135=     ## XXXX needs policy here
        =136=     my $url = $request->url;
        =137=   
        =138=     if ($url->scheme !~ /^(https?|gopher|ftp)$/) {
        =139=       my $res = HTTP::Response->new(403, "Forbidden");
        =140=       $res->content("bad scheme: @{[$url->scheme]}\n");
        =141=       $res;
        =142=     } elsif (not $url->rel->netloc) {
        =143=       my $res = HTTP::Response->new(403, "Forbidden");
        =144=       $res->content("relative URL not permitted\n");
        =145=       $res;
        =146=     } else {
        =147=       ## validated request, get it!
        =148=       warn("processing url is $url") if $LOG_TRAN;
        =149=       &fetch_validated_request($request);
        =150=     }
        =151=   }
        =152=   
        =153=   BEGIN {                         # local static block
        =154=     my $agent;                    # LWP::UserAgent
        =155=   
        =156=     sub fetch_validated_request { # return HTTP::Response
        =157=       my $request = shift;                # HTTP::Request
        =158=   
        =159=       $agent ||= do {
        =160=         use LWP::UserAgent;
        =161=         my $agent = LWP::UserAgent->new;
        =162=         $agent->agent("proxy/$VERSION " . $agent->agent);
        =163=         $agent->env_proxy;
        =164=         $agent;
        =165=       };
        =166=       
        =167=       warn("fetch: <<<\n", $request->headers_as_string, "\n>>>")
        =168=         if $LOG_REQ_HEAD and not $LOG_REQ_BODY;
        =169=       warn("fetch: <<<\n", $request->as_string, "\n>>>")
        =170=         if $LOG_REQ_BODY;
        =171=   
        =172=       my $response = $agent->simple_request($request);
        =173=   
        =174=       if ($response->is_success and
        =175=           $response->content_type =~ /text\/(plain|html)/ and
        =176=           not ($response->content_encoding || "") =~ /\S/ and
        =177=           ($request->header("accept-encoding") || "") =~ /gzip/) {
        =178=         require Compress::Zlib;
        =179=         my $content = $response->content;
        =180=         my $new_content = Compress::Zlib::memGzip($content);
        =181=         if (defined $new_content) {
        =182=           $response->content($new_content);
        =183=           $response->content_length(length $new_content);
        =184=           $response->content_encoding("gzip");
        =185=           warn("gzipping content from ".
        =186=                (length $content)." to ".
        =187=                (length $new_content)) if $LOG_TRAN;
        =188=         }
        =189=       }
        =190=   
        =191=       $response;
        =192=     }
        =193=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.