Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Web Techniques Column 18 (October 1997)

Sometimes, people come in through the back door. No, I don't mean at your house. But suppose you set up a nice web-site with a great front-page graphic, and then people browse away, looking at all the stuff, and they start bookmarking some of the later pages. No big deal, you say, but then they start handing out those URLs to their friends, or worse yet, put their hotlist directly on the web.

Now, people from all over are coming in to your site, never having seen your wonderful front page. OK, so only vanity requires you to have them see that graphic. But there are more real-world situations as well.

Suppose the front page contains a legal disclaimer that applies to the entire site. Sure, you can copy that same note to each page, but that'll just aggravate people.

Or suppose the entire site is sponsored (for which the sponsor expects or requires some credit), or made possible by the work of others? Surely, those sponsor notices must be seen. In particular, suppose you have ads of some sort (ugh) that need to be acknowledged in some way.

Well, then you must make sure that these URLs that don't point at the front door are never used, so that people don't ``come in through the back door''.

How? Well, one technique is to mangle the URLs slightly, so that each URL points at a place in the tree for a limited amount of time. After that time has passed, the URL ``expires'', and cannot be used again. This is possible if you don't serve the URLs directly, but instead are willing to let every document served be handled through a CGI script. Where might you find such a script? Read on.

Put simply, every unexpired URL ends up looking like:

    http://www.stonehenge.com/cgi/WT/1234567/col01.html

where 1234567 will be the Unix time of day at which this URL was generated. The script at /cgi/WT gets invoked, and the rest of the URL is made available to it as data. This gives the script a chance to make a policy decision about the URL. If the URL is recent enough, the CGI script tickles the server into handing the client a page from a secret tree, like http://www.stonehenge.com/merlyn/WebTechniques/col01.html.

If the URL isn't recent enough, the extra path is ignored, and the script generates an error message coaxing the user back to the top page. If the script is invoked without any extra path, the top page is shown anyway, so that gives us a place to link to from other parts of the site.
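To make the mangling concrete, here's a tiny sketch (separate from the listing below) of how such a stamped URL might be built and later checked; the $SCRIPT_URL and $MAX_AGE names are invented just for this illustration:

        use strict;
        my $SCRIPT_URL = "http://www.stonehenge.com/cgi/WT";    # where the script is installed
        my $MAX_AGE    = 60 * 60 * 4;                           # four hours, matching the listing

        ## stamping: wedge the current time between the script name and the document path
        my $stamped = join "/", $SCRIPT_URL, time, "col01.html";
        ## e.g. http://www.stonehenge.com/cgi/WT/872454236/col01.html

        ## checking: pull the timestamp back out and compare it against the clock
        my ($when, $path) = $stamped =~ m!/(\d+)/(.*)$!;
        my $fresh = defined $when && time <= $when + $MAX_AGE;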

The details of this script are given in [Listing one, below].

Lines 1-3 start nearly every non-trivial program I write, enabling warnings, taint-checks, and turning on compiler restrictions.

Line 4 pulls in the URI::URL module, so that I can compute relative and absolute URLs relatively easily.

Lines 6 through 9 provide a few constants that I might want to change if I put this script in different environments. Nothing below this should require customization.

Line 7 defines a base URL (it must end in a slash) where the documents are actually kept. Note that this URL must be accessible to the ultimate readers, just not publicized. In fact, if it ever leaks out, the users can go to this URL directly.

For the purposes of testing my program, I pointed this particular script at my own on-line article archive on my server.

Line 8 defines the number of seconds that a valid URL can be used as-is. From this period up to double this period, a URL goes through a ``soft failure'' -- a new URL is generated that points at the same place, but has been made ``up to date''. After that time, the URL is simply invalid, and will merely bring up an error document pointing the user to the top of the tree again.

Lines 11 to 16 define an entity-encoding algorithm that turns an arbitrary string into something that is safe to send to a web browser. The &ent routine encodes double-quotes, less-thans, ampersands, and greater-thans. (The two double-quotes in the search string are merely to make it symmetric... a subtle touch of aesthetics.)
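For example, a quick test of the routine (assuming the ent routine from the listing; the output appears in the comment) behaves like this:

        print ent('<A HREF="x&y">');
        ## prints: &#60;A HREF=&#34;x&#38;y&#34;&#62;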

Lines 18 through 20 create $info, containing a clean path of the particular desired document within the secret tree. This variable initially comes from the PATH_INFO environment variable, which will be everything after the script name (if any). If the extra path stuff is missing, / is substituted, and the name is adjusted to always begin with dot-slash.
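For instance, here are a few hypothetical requests and the $info value each one ends up with after that cleanup:

        /cgi/WT                       (PATH_INFO undefined)              $info is "./"
        /cgi/WT/col01.html            (PATH_INFO "/col01.html")          $info is "./col01.html"
        /cgi/WT/1234567/col01.html    (PATH_INFO "/1234567/col01.html")  $info is "./1234567/col01.html"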

Lines 22 through 26 construct a URL from the invocation parameters for this particular CGI. This is needed to perform an external redirect back to this script to add or update the timestamp part of the URL. If I had used CGI.pm, I could have simply referred to $query->self_url, but I wanted to avoid the overhead of pulling that entire library in just for this one operation.

Line 22 creates a URI::URL object using the url function, and each of the following lines just adds something to that object. Finally, in line 26, the URI::URL object is converted into a string, restoring it to a canonical form, which is then concatenated with a slash.
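For comparison, had I been willing to load CGI.pm, roughly the same value could be had in a couple of lines (a sketch only; the module's url method returns the script's own URL without the extra path):

        use CGI;
        my $query    = CGI->new;
        my $self_url = $query->url . "/";       # script URL plus trailing slash, like line 26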

So far, we have the address of ourself ($self_url) and a path to a particular virtual document ($info). If we glued those two strings back together, it'd look a lot like the original query. But that's not the point of this CGI. Instead, we'll look at $info as a pointer into the real document tree (beginning relative to $BASE), and do the right thing with it.

First, we need to see if it's a current URL, or an expired URL. That's done by stripping out the leading digits from $info, putting them into $when in lines 28 and 29. If there were no leading digits, then $when remains 0.
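Here's that substitution in isolation, with a made-up path, so you can see both results:

        my $info = "./1234567/col01.html";          # as left by the earlier cleanup
        my $when = 0;
        $when = $1 if $info =~ s!^\./(\d+)/!./!;    # the substitution from line 29
        ## now $info is "./col01.html" and $when is 1234567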

Next, lines 33 through 51 determine whether we have an expired URL, using the expression in lines 34 through 36. There are two ways for a URL to be considered expired. If the URL has illegal pathname components (backing up could reveal the pathname of the $BASE tree), we treat it as expired to get the right error message. And if the URL is not the ``top-of-tree'' URL (here, ./) and its timestamp is more than double the expiration period in the past, then it's definitely expired as well. Note that a $when containing 0 falls into this category, so random URL pointers into the tree are automatically expired pointers.
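To make that concrete, requests like these (paths and timestamps invented) would all land in the expired branch:

        an extra path of /../secret.html    -- contains an illegal ".." component
        an extra path of /col01.html        -- carries no timestamp, so $when stays 0
        a timestamp of 870000000 or so      -- well over eight hours in the past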

If the URL is expired, we need to tell the user what to do. Lines 37 and 38 construct HTML-safe strings from the requested URL and the ``self'' URL, and lines 40 through 49 print a response text based on that information. Note that the entry in line 47 both displays the proper URL to return to, and generates it as a link so the user can go directly there without retyping. Nice touch.

Line 50 exits the program if the response text was generated. (This would be a bad thing if this script were used with Apache's mod_perl, but if you're using that, you're probably clever enough to look for stuff like this anyway.)

If we make it to line 52, it's time to deliver some document via redirection. If the requested URL is ``fresh enough'', line 56 provides an initial prefix in front of $info (line 53) to perform an ``internal redirect''. In this case, the web server fetches and returns the URL as if that was the originally requested URL. If the requested URL is ``stale'' (but not expired, which was handled above), then we want to freshen up the URL, by sending an external redirect. In this case, we need to refer to this script, together with a correct timestamp (the current time) inserted in the middle. The result is a URI::URL in $location that reflects the appropriate prefix and suffix based on the freshness of the original requested URL.

And finally, lines 59 and 60 print the redirection header.
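For a request of ./col01.html, say (the document name and timestamps here are invented), the printed header comes out along these lines:

        Location: /merlyn/WebTechniques/col01.html

for a fresh timestamp (a local path, so the server performs the internal redirect itself), or

        Location: http://www.stonehenge.com/cgi/WT/877776000/col01.html

for a stale one (a full URL back through this script, carrying a newly minted timestamp, which the client fetches via an external redirect).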

So, presuming I've installed this at a URL like http://www.stonehenge.com/cgi/WT, I can now invoke this URL to get to the top of the real tree at http://www.stonehenge.com/merlyn/WebTechniques/. And that would let me browse all the pages below that (presuming they use relative URLs) without any change to the existing pages.

Except for a little hitch. What if the user fakes up a URL that has the right timestamp but a bad path after it? The server will gleefully hand back an error message that reveals that oh-so-protected virtual tree! So much for security.

Well, there's a quick and easy solution with Apache: define alternate Error documents. The details about this are in the Apache documentation, in the description of the Core features. For this example, it suffices to add these lines to the .htaccess file at the root of the tree being served:

        ErrorDocument 401 "That document is not available
        ErrorDocument 403 "That document is not available
        ErrorDocument 404 "That document is not available

Sure, it's not very helpful, but at least you won't see the real URL.

Because every new invocation of /cgi/WT generates a new (probably unique) time-oriented URL, I could use this to correlate a ``visit'' to my website by noticing all the similar timestamps in successive hits, and derive statistics like ``number of hits per visit'' and ``number of visits per day'' instead of the more meaningless ``number of hits per day''. Some sites do this already (I believe www.pathfinder.com does this, for example, because I can see the ugly URLs).
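If you wanted to try that, a rough sketch (not part of this column's listing) might group access-log hits on the embedded timestamp, something like this:

        use strict;

        ## count hits per visit by grouping on the timestamp embedded in
        ## the /cgi/WT/<seconds>/ URLs; assumes a common-format access_log
        ## on standard input (the log format and script path are assumptions)
        my %hits;
        while (<>) {
          next unless m!"GET /cgi/WT/(\d+)/!;
          $hits{$1}++;
        }
        my $visits = keys %hits;
        my $total  = 0;
        $total += $_ for values %hits;
        printf "%d visits, %d hits, %.1f hits/visit\n",
          $visits, $total, $visits ? $total / $visits : 0;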

While I was writing this article, I took a brief survey of my friends about the technique, and nearly all of them agreed that they hate it when sites do this. But if you have a compelling reason, here's one way to make it work. Enjoy!

Listing One

        =1=     #!/home/merlyn/bin/perl -wT
        =2=     
        =3=     use strict;
        =4=     use URI::URL;
        =5=     
        =6=     ## configuration
        =7=     my $BASE = "/merlyn/WebTechniques/"; # must end in slash
        =8=     my $VALID_SECONDS = 60 * 60 * 4; # four hours
        =9=     ## end configuration
        =10=    
        =11=    ## return $_[0] encoded for HTML entities
        =12=    sub ent {
        =13=      local $_ = shift;
        =14=      $_ =~ s/["<&>"]/"&#".ord($&).";"/ge;  # entity escape
        =15=      $_;
        =16=    }
        =17=    
        =18=    my $info = $ENV{PATH_INFO};
        =19=    $info = "/" unless defined $info;
        =20=    $info = ".$info";               # always "./" prefix
        =21=    
        =22=    my $self_url = url("http:");
        =23=    $self_url->host($ENV{SERVER_NAME}) if defined $ENV{SERVER_NAME};
        =24=    $self_url->port($ENV{SERVER_PORT}) if defined $ENV{SERVER_PORT};
        =25=    $self_url->path($ENV{SCRIPT_NAME} || "/cgi/$0");
        =26=    $self_url = "$self_url/";       # note that $self_url is a string now
        =27=    
        =28=    my $when = 0;
        =29=    $when = $1 if $info =~ s!^\./(\d+)/!./!;
        =30=    
        =31=    ## catchall if illegal url (attempt to back up over top)
        =32=    ## or expired (and not one of the entries into the tree)
        =33=    if (
        =34=        (index("/$info/", "/../") > -1) or
        =35=        $info ne "./" and
        =36=        time > $when + 2 * $VALID_SECONDS) { # hard expired URL, say so
        =37=      my $r_html = ent("$self_url$info");
        =38=      my $s_html = ent($self_url);
        =39=    
        =40=      print <<"EOF";
        =41=    Content-type: text/html
        =42=    Status: 404 Not Found
        =43=    
        =44=    <HTML><HEAD><TITLE>Expired URL</TITLE></HEAD>
        =45=    <BODY><H1>Expired URL</H1>
        =46=    The requested URL $r_html has expired.  Please return to
        =47=    <A HREF="$s_html">$s_html</A> to start with a new unexpired URL.
        =48=    </BODY></HTML>
        =49=    EOF
        =50=      exit 0;
        =51=    }
        =52=    my $location =
        =53=      url($info,                    # $info is relative to...
        =54=          (time > $when + $VALID_SECONDS) ? # if too old...
        =55=          $self_url.time."/" :      # this script and time (external redirect)
        =56=          $BASE                     # or use as-is (internal redirect)
        =57=         )->abs;                    # made absolute
        =58=    
        =59=    print "Location: $location\n";
        =60=    print "\n";
        =61=

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.