Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 6 (January 1996)

One of the most important tasks in managing data is getting it into some sort of sensible order. Perl provides a fairly powerful sort operator, which has a tremendous amount of flexibility. I'm going to talk about some sorts of sorts, and hopefully you'll sort everything out by the time you're finished reading. (And no, despite the similarity to my name, I will not talk about ``random sorts''.)

Let's take a simple case. I have some words in a list somewhere in the Perl program, and I want to sort them into alphabetical (technically, ascending ASCII) order. Easy enough:

        @somelist = ("Fred","Barney","Betty","Wilma");
        @sortedlist = sort @somelist;

This puts the value of (``Barney'',``Betty'',``Fred'',``Wilma'') into @sortedlist. If I had had these names in a file, I could have read them from the file:

        #!/usr/bin/perl
        @somelist = <>; # read everything
        @sortedlist = sort @somelist;
        print @sortedlist;

In this case, @somelist (and thus @sortedlist) will also have newlines at the end of each name. That's OK here, because it won't affect the sort order, and it makes printing them out that much easier.

Of course, I can shorten this a bit:

        #!/usr/bin/perl
        @somelist = <>;
        print sort @somelist;

Or even further:

        #!/usr/bin/perl
        print sort <>;

(I suppose this is what gives Perl its reputation for being cryptic.) Here, I've used no variables at all. However, it does indeed sort everything being read, and print the result.

These sorts of sorts are fine if the data is textual. However, if the data is numeric, we'll get a bad order. That's because comparing 15 with 3 as strings will place 15 before 3, not after it. Because the default sort is textual, we need some other way to tell sort to sort numerically, not textually.
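To see the problem concretely, here's a minimal sketch using a made-up list of numbers (the values and the variable names are mine, chosen just for illustration). The default sort treats each number as a string and compares character by character:

```perl
use strict;
use warnings;

# Default sort compares elements as strings, character by character.
my @numbers = (15, 3, 20, 100);
my @as_text = sort @numbers;

print "@as_text\n";   # prints "100 15 20 3"
```

Since "1" sorts before "2" and "3", both 100 and 15 land ahead of 3.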

Anything besides a textual sort of the values of the elements of the list has to be handled with a ``sort subroutine''. The way this works is simple -- at some point, when perl is looking at two elements from the larger list, trying to figure out how to order those two elements, it has to perform a comparison of some sort (heh). By default, this is an ASCII string comparison. However, you can give your own comparison function using a sort subroutine, with the following interface rules:

1. your sort subroutine will be called repeatedly with two elements of the larger list.

2. the two elements will appear in the scalar variables $a and $b. (No need to make them local or look at @_.)

3. you need to ``compare'' $a and $b in the sort subroutine, and decide which is bigger.

4. you need to return -1, 0, or +1, depending on whether $a is ``less than'', ``equal to'', or ``greater than'' $b, using your comparison operation. (Strictly, any negative number, zero, or any positive number will do.)

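Those of you familiar with the qsort() library function from C should recognize this stuff. In fact, Perl as of this writing uses the qsort() function internally, so it's no surprise. (Later Perl releases replaced it with their own sort implementation, but the interface is unchanged.)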

So, here's a sort subroutine that does the job in comparing $a and $b numerically, rather than as text:

        sub numerically {
                if ($a < $b) { return -1; }     # $a sorts first
                elsif ($a == $b) { return 0; }  # a tie
                else { return +1; }             # $b sorts first
        }

Now, all we have to do is tell Perl to use this sort subroutine as the comparison function, rather than the built-in ASCII ascending sort. That's done by placing the name of the subroutine (without any leading ampersand) between the keyword ``sort'' and the list of things to be sorted. For example:

        @newlist = sort numerically 32,1,4,8,16,2;

And now, instead of the list coming out in ASCII order (as it would if I had left out the ``numerically'' word), I get the powers of two in proper numeric sequence in @newlist.

The comparison of $a and $b numerically to generate one of -1, 0, or +1, is performed often enough that Larry Wall believed it warranted its own operator, <=>, which has come to be known as the ``spaceship operator'' for reasons I would rather not discuss. So, I can shorten ``numerically'' down to this:

        sub numerically {
                $a <=> $b;
        }

Now this is short enough that it seems a waste to have to define a separate subroutine, and in fact Perl allows an even more compact notation: the inline sort block, which looks like this:

        @newlist = sort { $a <=> $b; } @oldlist;

The interface to this inline sort block is exactly as I've described for the subroutine above. It's just a little more compact. Personally, I use this style whenever the sort subroutine is under 40 characters or so, and break down to create a real subroutine above that.
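As a quick check that the two notations really are interchangeable, here's a small sketch that sorts the powers of two from earlier both ways, and also prints what <=> returns in each of its three cases. The variable names @by_name, @by_block, and @results are mine:

```perl
use strict;
use warnings;

my @oldlist = (32, 1, 4, 8, 16, 2);

# Named sort subroutine, as in the text.
sub numerically { $a <=> $b }

my @by_name  = sort numerically @oldlist;
my @by_block = sort { $a <=> $b } @oldlist;

print "@by_name\n";    # prints "1 2 4 8 16 32"
print "@by_block\n";   # prints "1 2 4 8 16 32"

# The operator itself yields -1, 0, or +1.
my @results = (3 <=> 15, 15 <=> 15, 15 <=> 3);
print "@results\n";    # prints "-1 0 1"
```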

Let's look at reading a list of numbers from the input again:

        #!/usr/bin/perl
        print sort numerically <>;
        sub numerically { $a <=> $b; }

Now, if I present this program with a list of numbers, I'll get the sorted list of numbers. This is functionally equivalent to a Unix ``sort'' command with a ``-n'' switch.

Let's get a little crazier. Suppose I have a file that has people's names in the first column, and bowling scores in the second column:

        Fred 210
        Barney 195
        Betty 200
        Wilma 170
        Dino 30

and that I want to sort this file based on bowling scores. Well, getting the data into the program is pretty simple:

        #!/usr/bin/perl
        @data = <>;

but each element of @data looks like: ``Fred 210\n'', and so on. How do I sort this list @data, but look only at the number and not the name?

Well, I'd need to pull the number out of the string. How do I do that? One way is with split:

        $a = "Fred 210\n";
        ($name,$score) = split /\s+/, $a;

Here, I split $a by whitespace, yielding a two element list. The first element goes into $name (which I really don't care about) and the second element goes into $score. There. Now all I have to do is tell Perl to look at just the score:

        sub scores {
                ($name_a,$score_a) = split /\s+/, $a;
                ($name_b,$score_b) = split /\s+/, $b;
                $score_a <=> $score_b;
        }

and in fact, this will do it!

        #!/usr/bin/perl
        sub scores { ... } # as above
        print sort scores <>;

So, what's wrong with this picture? Well, it'd be just fine if we only looked at each entry in the list once. However, after we're done comparing Fred's score to Barney's (and decide Fred is better), we also have to compare Fred's score to Betty's score. That means that we've had to split Fred's data twice so far. In fact, for a huge list, it'll have to perform the very same split over and over and over again.
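If you want to see the waste for yourself, here's a rough sketch that counts the splits. The $splits counter is a hypothetical addition of mine, not part of the program above, and the data is held in the program rather than read from a file. The exact count depends on how your Perl's sort goes about its comparisons, so I won't promise a number, but it will be more than one split per line:

```perl
use strict;
use warnings;

my @data = ("Fred 210\n", "Barney 195\n", "Betty 200\n", "Wilma 170\n", "Dino 30\n");
my $splits = 0;   # hypothetical counter, not part of the original program

sub scores {
    $splits += 2;                          # one split for $a, one for $b
    my (undef, $score_a) = split /\s+/, $a;
    my (undef, $score_b) = split /\s+/, $b;
    return $score_a <=> $score_b;
}

my @sorted = sort scores @data;
print @sorted;
printf "%d splits for %d lines\n", $splits, scalar @data;
```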

There are a few ways out of this. One is to precompute a separate array that has only the scores, and then look the scores up there during the sort. Let's look at that first.

The goal is to first read the data, and then compute an associative array whose keys are the elements of the original array, and whose values are the precomputed scores. That reduces each comparison to an associative array lookup instead of a (perhaps) expensive split.

        @data = <>; # read data
        foreach (@data) {
                ($name,$score) = split; # get score
                $score{$_} = $score; # record it
        }

Now, $score{``Fred 210\n''} will be just 210, and so on, for each of the original elements of @data.

Next, we have to use the information. We need a subroutine that, given two elements of @data in $a and $b, looks up the corresponding scores in %score, and compares those numerically:

        sub score {
                $score{$a} <=> $score{$b};
        }

and this indeed does it. Let's put it all together:

        #!/usr/bin/perl
        @data = <>; # read data
        foreach (@data) {
                ($name,$score) = split; # get score
                $score{$_} = $score; # record it
        }
        print sort {
                $score{$a} <=> $score{$b};
        } @data;

Note that in this version, I recoded the sort subroutine as an inline block instead. (I'm just trying to give you a lot of alternative notations to play with.)

Another way to tackle the problem is to massage the list into a list of pairs of things. The second element of each pair (actually, an anonymous list of two elements) will be the computed sort determinant, and the first element will be the original data value (so we can get back to the original data). This is best handled with the ``map'' operator (not available in older Perl versions).

        @pairs = map {
                ($name, $score) = split;
                [ $_, $score ];
        } @data;

Here, the block of code is executed for each element of @data, with $_ set to the element. This causes each element to be split into $name and $score, and then I build a two-element anonymous list from the original value $_ and the $score. These are collected into a new list. If @data had five elements, then @pairs has five elements, each of which is a reference to a two-element anonymous list. Ouch!
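If the references feel slippery, it may help to poke at one. This is a small sketch with a two-line stand-in for @data (my sample values, kept in the program instead of read from a file), showing how to reach inside the first pair:

```perl
use strict;
use warnings;

my @data = ("Fred 210\n", "Barney 195\n");

my @pairs = map {
    my ($name, $score) = split;
    [ $_, $score ];
} @data;

# Each element is a reference; -> with [index] reaches inside it.
print "original: $pairs[0]->[0]";           # prints "original: Fred 210" (the line's own newline is still attached)
print "score:    $pairs[0]->[1]\n";         # prints "score:    210"
print "count:    ", scalar(@pairs), "\n";   # prints "count:    2"
```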

The next step is to sort the @pairs list. Within the sort subroutine, $a and $b will be references to two-element lists. The second element of each list is the sort key, and is addressed like $a->[1]. So, we get a sort subroutine like this:

        sub mangle {
                $a->[1] <=> $b->[1];
        }

and the sort looks like this:

        @sorted = sort mangle @pairs;

Now, @sorted is still the same pairs of data, but sorted according to the scores (did you forget we were still working with the scores?). I have to peel away the anonymous lists to get back the original data, while still preserving the order. Easy -- map to the rescue again:

        @finally = map {
                $_->[0];
        } @sorted;

This works because $_ will be each element of @sorted -- a reference to an anonymous list, and therefore $_->[0] will fetch the first element of the anonymous list pointed to by $_, which is the original data. Whew!

Of course, in the Perl tradition, I can shove all this stuff together in a very lisp-like way. You've got to read this back to front to see what is happening:

        #!/usr/bin/perl
        print
                map { $_->[0] }
                sort { $a->[1] <=> $b->[1] }
                map {
                        ($name,$score) = split;
                        [$_,$score];
                } <>;

Eeek. But hey, it works!
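To try the pipeline without preparing a file, here's a sketch of the same three stages run over the bowling data from earlier, held in an array (the @data and @sorted names here are just for this illustration):

```perl
use strict;
use warnings;

my @data = ("Fred 210\n", "Barney 195\n", "Betty 200\n", "Wilma 170\n", "Dino 30\n");

# Read from the bottom up: wrap each line with its score,
# sort on the score, then unwrap the original line.
my @sorted =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  {
        my ($name, $score) = split;
        [ $_, $score ];
    } @data;

print @sorted;
```

This prints the lines for Dino, Wilma, Barney, Betty, and Fred, in that order.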

One last optimization: I can put the split directly inside the anonymous list creation:

        #!/usr/bin/perl
        print
                map { $_->[0] }
                sort { $a->[1] <=> $b->[1] }
                map { [$_, (split)[1] ] }
                <>;

which works because the split here is being pulled apart by a ``literal slice'' -- only the second element of the list remains after we slice it up.
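The literal slice is easier to see on its own. Here's a minimal sketch on a single sample line, with the equivalent long-hand version next to it; $line, $score, and @fields are names I've made up for the demonstration:

```perl
use strict;
use warnings;

my $line = "Fred 210\n";

# (LIST)[INDEX] slices a list in place; index 1 is the second field.
my $score = (split ' ', $line)[1];
print "$score\n";       # prints "210"

# The same thing written out with a temporary array.
my @fields = split ' ', $line;
print "$fields[1]\n";   # prints "210"
```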

Perl provides some powerful sorting techniques, which can really be a boon once mastered. I hope I have inspired you more than I've confused you. Next time, something entirely different.

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
