Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Unix Review Column 11 (November 1996)

One of the things that Perl excels at is manipulating lists of data. In fact, with the addition of references to Perl 5.0, list manipulation became even easier. In this column, I will show you some simple list manipulation techniques that you may find handy. After all, there is no reason to reinvent the wheel -- someone else has probably already done the hard work. (And often when you reinvent the wheel, you end up with casters.)

One of the most common tasks to perform on a list is to remove duplicates. This question comes up a lot on the ``comp.lang.perl.misc'' newsgroup. The most efficient way (and even the simplest to write, go figure!) is to use a hash (formerly called ``associative array'') to keep track of the list elements seen so far, and reject additional copies of those elements. That'd look something like this:

        %temp = ();
        @list = grep ++$temp{$_} < 2, @list;

Here, I clear out a hash called %temp by assigning it the empty list. The next statement does all the hard work, and there's a lot going on here, so let's take it apart bit by bit.

First, I'm presuming the data I want to reduce to unique elements is in the @list array. I next perform a ``grep'' operation on the current value of this list. Similar to its Unix command counterpart, the grep operator selects and rejects various elements of this list based on a true/false expression. Here, the expression is an evaluation of a hash element access, using $_ as the key. The grep operator gives each value in @list one at a time to $_, and the evaluations are sequential. So, the first time $_ is, say, the string ``fred'', the value of ++$temp{``fred''} is an increment from undef (because the element cannot be found in the hash) to ``1''. Now, since this value is less than 2, the full expression returns true, and the element is included at that point.

On second and subsequent occurrences of ``fred'', the value returned by ++$temp{``fred''} is 2 or more. Because the expression then evaluates to false, those elements are rejected. The result of the grep operator is then assigned to the @list variable, and we have accomplished the task at hand.
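
To see this in action, here's a quick run with some made-up data:

        @list = qw(fred barney fred dino barney fred);
        %temp = ();
        @list = grep ++$temp{$_} < 2, @list;
        # @list is now ("fred", "barney", "dino") -- the first
        # occurrence of each element survives, in the original order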

Someone asked recently on the comp.unix.shell newsgroup how to remove duplicate names from a list of mailing addresses. Now, while a simple sort command with a ``-u'' switch would have performed this operation, I decided to tackle the problem with a Perl script, to give an added-value solution. Namely, the uniqueness check on mailing addresses should be ``case insensitive'', since fred@BIG.COM is the same address as fred@big.com. With just a little more programming than the previous example, we can accomplish this rather easily. Here's a sample of that:

        @list = split /\n/, <<'END';    # single-quoted here-doc, so the @'s aren't interpolated
        fred@big.com
        barney@little.com
        fred@BIG.COM
        barney@ivy.edu
        END
        # ... later in the program ...
        {
                my %temp;
                foreach (@list) {
                        $temp{lc $_} = $_;      # lc $_, so the key is the lowercased address
                }
                @list = sort values %temp;
        }

Now, the first few lines set up the @list variable. The block formed by the last seven lines creates a temporary scope where I can declare a new %temp variable. This variable is guaranteed to start out empty, just as in the previous code snippet. However, by wrapping the declaration in a block, the variable automatically goes away at the end of the block, reclaiming its memory, *and* if the name conflicts with an existing variable, our newly created local variable takes precedence. Neat trick for temporary variables.
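
Here's that scoping trick in isolation, with a made-up variable name just to show the effect:

        %temp = (leftover => "junk");   # an outer %temp with old data in it
        {
                my %temp;               # brand new, empty, and hides the outer one
                $temp{fresh} = 1;       # work happens on the private copy
        }
        # back outside the block, %temp is (leftover => "junk") again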

After the empty local variable %temp is created, I then loop through the @list variable. Each iteration through the loop sets a local $_ to a different element. Within the loop is a simple hash element assignment: the key is the lowercased contents of $_ (using the lc function), and the value is $_ itself (the original email address). By doing it this way, I force two email addresses that are identical except for case (upper vs. lower) to try to occupy the same hash element. Since only one value can live there, the second email address overwrites the first, and I end up with a single value.

After the loop is completed, I pull out the values, sorted ``ascii-betically''. Why not the keys? Well, the keys have all been lowercased, but the values will preserve at least *one* of the preferred capitalizations of the email address. Pretty cool, eh?
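
If you find yourself doing this a lot, it could be wrapped up as a subroutine. Here's one way that might look (the name unique_addresses is my own invention, not anything standard):

        sub unique_addresses {
                my %temp;
                foreach (@_) {
                        $temp{lc $_} = $_;      # key is lowercased, value keeps its case
                }
                sort values %temp;              # last expression is the return value
        }
        # and later:
        # @list = unique_addresses(@list);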

Now, for something slightly different. Another task that comes up frequently is removing a list of things from another list of things. This might be called ``set subtraction'' or ``set difference'' if you are into those sorts of words. Well, once again, the hash comes to the rescue. In fact, generally whenever you think of set ``somethings'', you can almost immediately think ``the set is the keys of a hash'' and be pretty close. So, here's some code to perform a set subtraction between @list and @stoplist:

        %temp = ();
        @temp{@list} = ();
        foreach (@stoplist) {
                delete $temp{$_};
        }
        @list = sort keys %temp;

Once again, I have the temporary %temp hash variable, clearing it out so that I can work from a clean slate. The next step creates an element in %temp for each element of @list, using a hash slice operation. The values will all be undef (the undefined value), but that's irrelevant for this particular solution. The foreach loop causes every element present in @stoplist to be deleted from the %temp hash. It's OK if some of these elements aren't present -- the delete operator just ignores them.
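
If the hash slice step seems mysterious, here's what it does with a tiny made-up list:

        @small = qw(fred barney betty);
        %temp = ();
        @temp{@small} = ();
        # %temp now has the keys "fred", "barney", and "betty",
        # each with an undef value (exists() is true, defined() is false)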

At one time, the syntax of:

        delete @temp{@stoplist};

was being discussed amongst the Perl developers, which would eliminate the loop, but it hasn't been implemented to my knowledge. Maybe in Perl 6.0?

Finally, the new list is gathered by taking the remaining keys from %temp, sorting them ``ascii-betically''. This will be the contents of @list, minus the elements of @stoplist. As a side effect, any duplicate elements of @list will be removed as well.

We can wrap up this routine into a subroutine, like so:

        sub remove {
                my($listref,$stopref) = @_;
                my(%temp);
                
                @temp{@$listref} = ();
                foreach (@$stopref) {
                        delete $temp{$_};
                }
                sort keys %temp;
        }

Notice here that the first two parameters are *references* to lists, not lists themselves. This allows for rather efficient passing of the list into the subroutine. The first parameter is stored into $listref, and the second into $stopref, and %temp is created as an empty, local hash.

The $listref is de-referenced in the next statement, using a notation of @$listref, which asks for the list denoted by $listref. Similarly, the $stopref is dereferenced with @$stopref. The last expression evaluated in this subroutine (sort keys %temp) becomes the return value for the subroutine.
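
As a quick refresher on that notation, using a made-up array:

        @stuff = qw(fred barney);
        $ref = \@stuff;         # a reference to the whole array
        @copy = @$ref;          # dereference: ("fred", "barney")
        $first = $$ref[0];      # one element: "fred" (also writable as $ref->[0])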

Here's a sample invocation of this subroutine:

        @input = qw(fred barney betty);
        @guys = remove(
                \@input,
                [qw(betty wilma pebbles)]
        );

We have @input being a few different names, and we're invoking the ``remove'' routine, passing it a reference (made with backslash) to the @input variable, and an anonymous list created from the three names. The result in @guys will be the set difference between @input and the literal list -- in this case, (``barney'', ``fred''), since ``betty'' is removed and the survivors come back sorted.

How about set intersection, where an element has to be in both lists to make it through? That'd be something like this:

        %temp = ();
        @temp{@list_one} = (1) x @list_one;
        @result = grep $temp{$_}, @list_two;

The first step clears our now-famous %temp hash. (Or, if this were truly a temporary variable, you wouldn't need this.) The next step assigns to a hash slice of %temp, with an interesting property: the left side is a slice keyed by the elements of @list_one, and the right side is a list of 1's, replicated (with the x operator) to be exactly as long as @list_one. This has the effect that every element of %temp whose key is an element of @list_one is now set to a value of 1.
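
Here's the replication step spelled out with a small made-up list:

        @list_one = qw(fred barney betty);
        # (1) x @list_one is (1, 1, 1), because @list_one is 3 in scalar context
        @temp{@list_one} = (1) x @list_one;
        # %temp is now (fred => 1, barney => 1, betty => 1)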

The result is then computed by noting which elements of @list_two have corresponding elements in @list_one, using those elements in %temp which have been set in the previous step.
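
And just as with the set subtraction, this packages up nicely into a subroutine taking two list references. Here's a sketch; the name intersect is my own choice:

        sub intersect {
                my($one_ref,$two_ref) = @_;
                my(%temp);

                @temp{@$one_ref} = (1) x @$one_ref;
                grep $temp{$_}, @$two_ref;
        }
        # @both = intersect(\@list_one, \@list_two);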

I hope this gives you some sets of ideas to manipulate lists. By the way, this column, along with all my prior Perl columns in this magazine, is available on my website at:

        http://www.teleport.com/~merlyn/UnixReview/

Yes, the capitalization is important. Enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.