Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 26 (Jun 1999)

There's no doubt that Perl is excellent at gluing together outside-world programs and data, but Perl is also pretty good at pure programming problems that you might find in a beginning computer science course. For example, let's take a look at that common task in your typical first year computer course, finding prime numbers.

As I recall from my early math years, a prime number is any integer greater than 1 that is divisible by only 1 and itself, and no other integers. Warning: I'm not a math geek, so if that's not precisely it, I'm sure someone will let me know.

So, let's generate some prime numbers using a very direct translation of this specification:

    GUESS: for (my $guess = 2; $guess <= 10000; $guess++) {
      for (my $divisor = 2; $divisor < $guess; $divisor++) {
        next GUESS unless $guess % $divisor;
      }
      print "$guess\n";
    }

The outer loop is named GUESS, so that we can skip forward with respect to this loop even from within the inner loop. The variable $guess is my try at each prime number. I'm establishing a lexical local variable for this loop, initializing it to 2, and incrementing it by one until it hits 10,000.

The inner loop tries all the possible divisors from 2 through the number we're checking in $guess. The expression $guess % $divisor is true when there's a remainder, meaning that $guess is not divisible by $divisor.

On my system, this program takes about 13 CPU seconds just to get all the primes up to 10000. Obviously, to do 100,000 numbers this way would be prohibitively expensive.

But, let's apply a little math to the selection of numbers to speed things up a bit. For one thing, it's silly to try any divisor that's bigger than the square root of the guess. (The higher divisor would have needed another factor that was smaller than the square root in order for their product to be less than the guess.) So, let's change that inner loop a bit:

    GUESS: for (my $guess = 2; $guess <= 10000; $guess++) {
      for (my $divisor = 2; $divisor * $divisor <= $guess; $divisor++) {
        next GUESS unless $guess % $divisor;
      }
      print "$guess\n";
    }

Here, notice that the ending condition of the inner loop has been changed, so that we'll drop out of the $divisor checking when a divisor is more than the square root of the potential prime in $guess.

And yes, that now gets it down to under a half CPU second on my system. If I bump the upper limit to 100,000, we're back up to 8 CPU seconds again, though. Let's try another optimization. After the prime number '2', there are no further prime numbers that are even, and there is also no reason to try any divisors that are even. (As my mathbooks often say, ``proof of this is left to the reader''.) So, let's change the loops a bit:

    print "2\n";
    GUESS: for (my $guess = 3; $guess <= 100_000; $guess += 2) {
      for (my $divisor = 3; $divisor * $divisor <= $guess; $divisor += 2) {
        next GUESS unless $guess % $divisor;
      }
      print "$guess\n";
    }

Here, I'll have to print 2 explicitly, because it would make the algorithm rather messy to try to include it inside. Then we'll try every odd number starting with 3 as potential primes. Similarly, the inner loop tries all odd numbers starting with 3 as potential divisors, but still stopping when the square of the divisor exceeds the candidate prime.

And, as I suspected, that cut the time down in about half, to under 5 CPU seconds. That's still not gonna scale well for getting primes under say, a million, or 10 million. To do that, we need to do something other than brute force.

The next major hack is to introduce the classic ``Sieve of Eratosthenes'' for finding primes. To do this, we first make an array that is all zeros up to the highest prime we'll be finding. Then, starting at the number 2, we set every multiple of 2 to a non-zero value, indicating that there's no way that the number is prime. Sliding forward, we do the same for the next item, 3, setting all multiples of 3 to non-zero. That looks like this:

    my $UPPER = 100_000;
    my @sieve = (0) x ($UPPER + 1);
    GUESS: for (my $guess = 2; $guess <= $UPPER; $guess++) {
      next GUESS if $sieve[$guess];
      print "$guess\n";
      for (my $mults = $guess * 2; $mults <= $UPPER; $mults += $guess) {
        $sieve[$mults] = 1;
      }
    }

The upper bound of potential primes is needed in a few places, so I set up the variable $UPPER to hold it. This variable is uppercase, the traditional indication that we're really looking at a read-only variable established as a program-wide constant. Next, the @sieve variable is initialized to contain the number 0 from $sieve[0] to $sieve[$UPPER]. Because this list is one larger than the value of $UPPER, I need to add one. The x operator replicates the list on the left (just a 0) enough times to make the @sieve long enough.

The GUESS loop again tries potential candidates from 2 to the upper limit. If $sieve[$guess] is already set, it must have been set on a prior candidate, so we're looking at a multiple of a previously found prime number, so we'll just bounce on to the next guess. Otherwise, this guess is printed, and the fun begins.

For all the multiples of this newly found prime, I'm walking through the array, setting the value of that multiple to 1. Note that there's no multiplication or division here (if you ignore the $guess * 2, which I could have written as $guess + $guess. That's what makes this algorithm particularly CPU-friendly.

And, as I suspected, we're down under 2 CPU seconds here again. Upping this program to the primes under a million puts me into the 15 CPU second region. Additionally, I'm starting to page fault quite a bit, because we're creating a million scalars in memory. Really, all we needed was a single bit. Luckily, we can represent a bit-vector directly, using the vec operator:

    my $UPPER = 1_000_000;
    my $sieve = "";
    GUESS: for (my $guess = 2; $guess <= $UPPER; $guess++) {
      next GUESS if vec($sieve,$guess,1);
      print "$guess\n";
      for (my $mults = $guess * 2; $mults <= $UPPER; $mults += $guess) {
        vec($sieve,$mults,1) = 1;
      }
    }

Here, $UPPER is like the previous program, but we're now using a single scalar $sieve to hold the bit data. This scalar's string value will be treated as a series of bits, initially all 0.

The rest of the program structure remains unchanged from the previous version, but now we've changed the array operations to vec operations. vec($sieve,$guess,1) is true when the corresponding single bit at position $guess is true. And we can use the same construct on the left side of an assignment to get that bit set or cleared. Fairly slick, and fairly compact.

Oddly enough, we're still at 15 CPU seconds for this program, but the pagefaults are now greatly reduced. This still seemed high, but then I realized that I'm setting a bunch of already set bits. We don't need to hit every multiple of $guess, because any multiple that was less than the square of $guess had already been taken out earlier!

Fixing that, we get

    my $UPPER = 1_000_000;
    my $sieve = "";
    GUESS: for (my $guess = 2; $guess <= $UPPER; $guess++) {
      next GUESS if vec($sieve,$guess,1);
      print "$guess\n";
      for (my $mults = $guess * $guess; $mults <= $UPPER; $mults += $guess) {
        vec($sieve,$mults,1) = 1;
      }
    }

which indeed reduces down to around 13 CPU seconds. Not a huge gain, but every bit helps as we get bigger. We're finding 78,498 primes, after all!

If you're curious about knowing a lot more about prime numbers (perhaps more than you wanted to know), I stumbled across the web page at http://www.utm.edu/research/primes/, which seems to be a great resource.

So, as you can now see, a first-year computer science problem can be solved handily by Perl, demonstrating that Perl is good for both theoretical things, and practical things. Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.

Worldwide training and consulting by Perl experts

Copyright Notice

Unix Review Column 26 (Jun 1999)