Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 51 (Sep 2003)

[suggested title: ``The basics of hashes'']

When I first started playing with Awk more than two decades ago, I was amazed at the ease with which common tasks could be easily solved through the use of its ``array'' datatype. Prior to that, I had experienced only BASIC and C arrays, where the index was a small integer. But these Awk arrays could have arbitrary strings as keys!

I remained mystified at how Awk could manage to find the value for my key quickly. It took some study later to learn about hash tables and complex data structures, but luckily I didn't need to know how it was done ``under the hood'' to be able to use them effectively.

As Larry Wall copied a lot of what was right about Awk when designing and building the first version of Perl, one of the things he brought along was Awk's idea of an array with arbitrary string keys. He chose to call those associative arrays, to distinguish them from the more traditional small-integer-index arrays which he also implemented in this new Perl language.

After a half-dozen years of Perl development, the term associative array was optimized to just hash for the Perl version 5 releases. I'm glad for that, because I use the term frequently when describing nearly every complex program I write.

A hash is a mapping of keys to values. We create elements in hashes by telling the hash that ``this value goes with that key''. Later on, we can ask the hash ``now what value goes with this key?'', and the hash can respond quickly. Even though that sounds simple, just being able to do that helps us solve lots of problems with relative efficiency.

The values of a hash can be any scalar or reference, including strings, numbers, the special undef value, or a reference to another data structure. Let's put the references aside for a moment, and stay with just the basics.

The keys of a hash must be a string. If we try to use a number as a key, it gets ``stringified''. So, using any of 2 or 2.0 or 2e0 as keys would result in accessing the same hash element because they both produce the same string. However, using the strings "2" and "2.0" and "2e0" will access different elements, because the strings themselves are different. Also, if undef is used as a key, it becomes the empty string. If you have warnings enabled, you'll likely get a warning message on that.

The keys of a hash are always unique, because the keys uniquely identify a particular element. If you try to insert a new element that has an existing key, it'll simply overwrite the previous value. The values of a hash may be duplicated, however. More than one element of a hash can have a value of fred, for example.

The name of a hash is like all other Perl variables: alphanumerics and underscore, of any length, possibly including package markers (double-colon). The hash name is preceded by a percent-sign when referring to the entire hash (for certain functions that access the hash as a whole), such as %age. The hash name is preceded by a dollar sign and followed by a key expression enclosed in curly-braces when referring to an individual element, such as $age{"fred"}.

Inserting items into a hash is as simple as assigning them:

  $age{"fred"} = 27;
  $age{"wilma"} = 24;

If a key is reused, the new value overwrites the old value:

  $age{"fred"} = 28;

If the key is entirely an alphanumeric string (including underscore), the quotes can often be dropped:

  $age{fred} = 28; # same

But don't get into this habit in general, because this breaks when there's no longer a simple alphanumeric string:

  $age{Mr. Slate} = 35; # BROKEN
  $age{"Mr. Slate"} = 35; # correct

And now for the magic of the hash: to get the value back, just ask for the key:

  print "Fred is ", $age{fred}, " years old.\n";

The hash can quickly locate the 28 value corresponding to the fred key. OK, so I was mystified at how it did this when I was younger, and I can still pretend to be mystified by it now. In practical terms, while it will get a bit slower and slower as you put hundreds and thousands of items into the hash, it won't get nearly as slow as it would be if you were to simply do a linear search for the item in a list of items. It just works, and works well.

If you ask for a key that isn't in the hash, the response is an undef value. And unlike awk, this doesn't change the hash at all. But undef can also be a value in the hash! To tell the difference between the undef that is returned when the key is absent from the undef that is returned because you stored it there, use exists:

  if (exists $age{dino}) {
    print "We know dino's age... it's $age{dino}.\n";
  } else {
    print "We don't know dino's age.\n";
  }

If we had a lot of ages to initialize into the hash, we could certainly assign each one individually, as shown above. A faster way to initialize an entire hash is to assign a list to the %-form of the hash, as in:

  %age = ('fred', 28, 'wilma', 24, 'Mr. Slate', 35);

Here, the list of six items is broken into three pairs of items. Within each pair, the first item is a key, and the second is a corresponding value. Later items override earlier items, so this assignment:

  %age = ('fred', 27, 'wilma', 24, 'fred', 28);

results in Fred having the age of 28. As an alternative syntax, we can write the original initialization using the fat-arrow:

  %age = ('fred' => 28, 'wilma' => 24, 'Mr. Slate' => 35);

which clarifies which items are paired. As a further simplification, any item to the left of a fat arrow which is simply an alphanumeric-and-underscore name can have its quotes removed:

  %age = (fred => 28, wilma => 24, 'Mr. Slate' => 35);

Again note that Mr. Slate required special quoting.

We can process all the names without knowing which names got inserted by using the keys() function:

  foreach my $key (keys %age) {
    print "$key is $age{$key} years old.\n";
  }

The list of keys comes out as some random ordering of the three keys. The foreach loop puts each of these names in turn into the $key variable, and then the corresponding value is looked up inside the double-quoted string.

Why is it a random order? Well, the primary purpose in life for a hash is to find a value for a given key. To do that, it organizes the keys internally into ``buckets'', and when ask for all the keys, it just dumps the buckets in their internal order. We can fix the indeterminacy by adding a sort in front of the keys operator:

  foreach my $key (sort keys %age) {
    print "$key is $age{$key} years old.\n";
  }

This output is now sorted alphabetically by keys (the names). Another common task is to sort by values, not by keys. Let's say we wanted the output sorted in terms of increasing ages. We can do that with a user-defined sort:

  sort { $age{$a} <=> $age{$b} ) keys %age

Here, we're taking the list of keys (the names), and sorting them according to their corresponding age. If we put this expression inside the foreach loop, we get our desired ``sort by age'' display.

A useful bonus effect of having a missing key return undef for a value is that undef is the ``null element'' for both addition and concatenation. For example, let's increase an element by three:

  $age{wilma} += 3;

The existing value of $age{wilma} is consulted (24), incremented by 3 (yielding 27), and then the result is stored back as the updated value for the given key. But what if the key is absent?

  $age{dino} += 3;

The existing value of $age{dino} doesn't exist, so we start with undef. But undef plus 3 is 3, so we store a 3 back into $age{dino}, creating an element where an element didn't previous exist!

This is very useful when counting or summarizing. For example, counting words that appear in a file is as simple as:

  %count = (); # clear the hash out
  while (<>) {
    my @words = split /\W+/, $_; # break on non-alphanums
    foreach my $word (@words) {
      $count{$word} += 1;
    }
  }

The first time a word is encountered, the hash lookup for that word as a key returns undef, which is incremented to 1, and the value of 1 is stored for that word. When the same word is encountered a second time, the existing value of 1 is incremented and updated to 2, and so on. Typically, we'd write that middle line using the ++ operator, as in:

      $count{$word}++;

without any change in meaning. When this loop is done, we can then display the results:

  foreach my $word (
    sort { $count{$b} <=> $count{$a} } keys %count
  ) {
    print "$word was seen $count{$word} times\n";
  }

We can also take advantage of the uniqueness property of hash keys to filter out duplicates from a list. For example, suppose we need to examine lines of a file, generating an output that consists of only one line even if the line is duplicated. Easy enough:

  %seen = ();
  while (<>) {
    $seen{$_} = 1;
  }
  foreach (sort keys %seen) {
    print;
  }

For each line being read, we set a hash element in %seen with a key of that line to a value of 1. If the same line is seen more than once, we're just overwriting the one with another one. When we're done looking at all the lines, we dump all the keys. Of course, this loses the original input order. If that's important, we can print each line as we see it, then ignore any subsequent appearances:

  %seen = ();
  while (<>) {
    next if $seen{$_};
    print;
    $seen{$_} = 1;
  }

On the first appearance of a line, $seen{$_} will be undef, which appears to be false, so we won't take the next operation. After printing the line, we set the value to 1, and remaining lines will be ignored.

And for today's trivia question, how do we print only the last occurrence of each line? Again, a little magic helps. First, we create a hash that holds the line number of each line, updated and overwritten as we keep seeing each line:

  %seen = ();
  while (<>) {
    $seen{$_} = $.;
  }

Here, we're using the $. magic variable to hold the line number of the most recently read line. Now all we need to do is sort by line number:

  print foreach sort { $seen{$a} <=> $seen{$b} } keys %seen;

And we're done!

I hope that this little overview of a small but powerful feature of Perl has inspired and enlightened you. Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.

Worldwide training and consulting by Perl experts

Copyright Notice

Linux Magazine Column 51 (Sep 2003)