Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 33 (Aug 2000)

[suggested title: Taint so easy, is it?]

If you've been reading my columns for any length of time, you've probably seen me mention ``taint mode'', usually briefly while I'm describing a ``hash-bang'' line of something like:

        #!/usr/bin/perl -Tw

which turns on warnings (the -w) and ``taint mode'' (the -T). But what is taint mode?

Taint mode is a security feature of Perl, and includes two levels of operation. First, while taint mode is in effect, some operations are forbidden. One of these is that $ENV{PATH} cannot contain any world-writeable directories when firing off a child process (as with backticks or system). Should your program attempt an unsafe action, the program aborts (via die) immediately, before the action has a chance to create a potential security violation. You could have included code to check this yourself, but having Perl perform the checks ensures a consistency and a ``best practices'' level of competence that you may not have the capability or resources to include explicitly.
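
As a minimal sketch of how a program usually satisfies the PATH check, the perlsec manpage suggests setting $ENV{PATH} to a known value yourself and discarding a few other shell-related variables before running any child process. The directory list here is only an assumption; pick the directories that are appropriate (and not writeable by others) on your own system:

```perl
use strict;
use warnings;

# Assumed-safe directories for illustration; adjust for your system.
$ENV{PATH} = '/bin:/usr/bin';

# Drop variables that a shell may consult when it runs a command.
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

print "PATH is now $ENV{PATH}\n";
```

Because the new value of $ENV{PATH} comes from a literal string inside the program rather than from the outside world, it is not tainted.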

The second level of operation is much more interesting, and unique to Perl (amongst all the popular languages I know of): Perl keeps track of a ``distrust'' of each scalar value in the program. Every item of data coming from input sources (command line arguments, environment variables, locale information, some system calls, and all file input) is marked ``tainted''.

For example, the following operations all generate tainted data:

  $t1 = <STDIN>;
  $t2 = $ENV{USER};
  $t3 = $ARGV[2];
  @t4 = <*.txt>;

In each of these examples, the data has come ``from the outside world'', and is therefore treated as potentially dangerous. Once data is tainted, the taint propagates to any data derived from the tainted data:

  $t5 = $t4[0];
  $t6 = "/home/$t2";
  chomp($t1);
  @x = ("help", "me", $t3, "please");

Note that tainting is on a per-scalar basis. So $x[2] is tainted, not the entire array @x.
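
If you want to watch the per-scalar bookkeeping for yourself, one option is the tainted function from the standard Scalar::Util module. The following is only a sketch: taint_report is a hypothetical helper name of my own, and you need to run the script with perl -T to see anything reported as tainted (without -T, every value is reported clean):

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper: given label => value pairs, return one line per
# label saying whether that value is currently marked tainted.
sub taint_report {
    my (%values) = @_;
    return map {
        "$_: " . (tainted($values{$_}) ? "tainted" : "clean")
    } sort keys %values;
}

my $user = $ENV{USER};                        # from the environment
my @x = ("help", "me", $user, "please");      # only $x[2] derives from it

print "$_\n" for taint_report(
    'x[0]' => $x[0],
    'x[2]' => $x[2],
);
```

Under -T, and assuming USER is set in your environment, only the x[2] line should say tainted, since the literal strings in the list never came from outside the program.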

Once data is marked tainted, nearly any attempt to use the data to affect the outside world will be blocked, causing an immediate die with a taint violation. For example, invoking rename where either the source name or destination name is tainted is considered dangerous. This permits normal operations:

  rename $x[0], $x[1];

But not operations that involve tainted data (recall that $x[2] is tainted from earlier):

  rename $x[0], $x[2];

What this means is that data that comes in from the outside world cannot trivially affect the outside world as well. Why is this important?

Well, the typical use of taint mode is to enable programs that act on behalf of other users to operate in a safer manner. For example, a ``setuid'' or ``setgid'' program borrows the privileges of its owner for the duration of execution, allowing an ordinary user to act as root (or some other user) for a selected set of operations. Or a CGI program, executing as the web server ID (typically nobody), is acting with that user's privileges on behalf of a request from any web client, generally without direct access to the server except through the web server.

In both of these cases, it's important that input data be checked, to prevent the user who invokes the program from borrowing the privileges of the executing user ID to perform unintended actions.

For example, it'd be pretty dangerous to rename a file based on the input from a CGI form:

  use CGI qw(param);
  ...
  my $source = param('source');
  my $dest = param('destination');
  rename $source, $dest;

Now perhaps the author of this CGI script believed that this would be a safe program, since the form contained only radio buttons or pop-up menus with clearly defined values. But in reality, a person with intent to damage or break in could just as easily invoke this script passing arbitrary data in source and destination, and potentially rename any file to which the web userid has access!

With taint mode enabled, the CGI parameters (having been derived from either reading STDIN or an environment variable) are marked tainted, and therefore the rename operation would fail before it has committed potential damage. (To enable taint mode on a CGI script, just include -T in the #! line, as shown earlier.) And that's exactly the safest thing to do here.

But obviously, there are times when input data must in fact legitimately affect the outside world. Here's where the next feature of taint mode comes in. As a sole exception, the results of a regular expression memory reference (usually accessed as the numeric variables like $1 and $2 and so on) are never tainted, even though the match may have been performed on tainted data. This gives us the ``carefully guarded gate in the fence'', when used properly. For example:

  my $source = param('source');
  unless ($source =~ /^(gilligan|skipper|professor)$/) {
    die "unexpected source $source\n";
  }
  $safe_source = $1;

Here, $source is expected to be one of gilligan, skipper, or professor. If not, we'll die before executing the next statement, which copies the captured memory into $safe_source. (Note the parens in the regular expression match are performing double duty, needed for both proper precedence regarding the vertical bar and the beginning and ending of string anchors, as well as having the side-effect of setting up the first backreference memory. Sometimes, you get lucky.)

The value of $safe_source is now legitimate to be used in the rename operation earlier, as it came from a regular expression memory, and not directly from input data. In fact, we could even have assigned it back over $source (a common thing to do):

  $source = $1; # source now untainted

Of course, we'd have to perform a similar check on $dest (the destination parameter) before the rename could go ahead.

So, if someone attempts to give us an incorrect value for the source parameter, like ginger, the program aborts. Certainly, this program would have aborted with or without taint mode, but in taint mode it works only because we added the extra code to perform a regular expression match, during which we needed to think about what the possible legal values for the string might have been.

And that brings up the next point: we typically can't perform an explicit match against a known list of values. More often, the data is a user specified value that needs to fit a general description, but again, regular expressions are pretty good at matching many things.

So, let's say the $source there came from a text field box, rather than a pop-up menu, permitting an arbitrary string. How do we pass that along to the rename operator? Well, first we have to decide what a legitimate string might be. For example, let's restrict to filenames that contain only \w-matching characters, including a dot (as long as the dot is not the first character). That'd be like this:

  $source = param('source');
  $source =~ /^(\w[\w.]*)$/ or die;
  $source = $1;

Once again, if the string is not as expected, we die. And only if we haven't died will we continue on to use $1 which has now been verified to be a name of the form that we expect.
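
If you have several such values to check, the same match can be wrapped in a small subroutine. This is just one possible arrangement, and untaint_filename is a hypothetical name of my own; it returns the captured (and therefore untainted) name on success, or undef so that the caller can decide how to complain:

```perl
use strict;
use warnings;

# Hypothetical helper: return an untainted copy of $name if it looks
# like a plain filename (\w characters and dots, no leading dot),
# otherwise return undef.
sub untaint_filename {
    my ($name) = @_;
    return unless defined $name;
    if ($name =~ /^(\w[\w.]*)$/) {
        return $1;    # a capture, so never tainted
    }
    return;           # no match: caller sees undef
}

for my $candidate ('report.txt', '.profile', '../etc/passwd') {
    my $safe = untaint_filename($candidate);
    print defined $safe ? "ok: $safe\n" : "rejected: $candidate\n";
}
```

Only report.txt gets through here; the other two names are rejected because they begin with a dot, and the last one also contains slashes.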

Note that it's very important to test the result of the regular expression match, because $1 (and the other memory variables) is set only when you have a successful regular expression match. Otherwise, you get whatever was left over from an earlier successful match, and that's definitely bad news:

  ## bad code do not use ##
  param('source') =~ /^(\w[\w.]*)$/;
  $source = $1;
  ## bad code do not use ##
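
To see why an unchecked match is trouble, here is a small sketch of the failure. The subroutine name stale_capture_demo and the sample strings are made up for illustration. The first match succeeds and sets $1; the second match fails, so $1 still holds the value from the first:

```perl
use strict;
use warnings;

# Hypothetical demonstration: both matches happen in the same scope,
# and the result of the second match is never checked.
sub stale_capture_demo {
    my ($earlier, $input) = @_;
    $earlier =~ /^(\w[\w.]*)$/;    # succeeds, sets $1
    $input   =~ /^(\w[\w.]*)$/;    # fails for bad input; result ignored
    return $1;                     # left over from the first match
}

my $source = stale_capture_demo('trusted.txt', '../etc/passwd');
print "source is '$source'\n";
```

The program goes on to use trusted.txt, a name that has nothing to do with the input it was supposed to be validating, and it never notices that the input was bad.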

A slightly more compact way of writing this correctly might be:

  my ($source) = param('source') =~ /^(\w[\w.]*)$/
    or die "bad source";

Here, I'm using $1 implicitly as the list context result of the regular expression match, and declaring the variable that will hold it, and checking for errors, all in one compact statement.

The regular expression pattern should be as restrictive as you can get. For example, if you use something like /(.*)/s, you've effectively removed any of the benefits of taint mode for that particular data, making it potentially possible for someone to hijack your program in unintended ways.

So, I hope this gives you a bit of insight into how to use taint mode, and why it is useful. If this column 'taint enough for you, I suggest you check out the perlsec manpage (perhaps using the command perldoc perlsec at a prompt). Until next time, enjoy your new security knowledge.


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.