Figuring phone-y words (dec 99)

Unix Review Column 29 (Dec 1999)

[suggested title: Figuring Phone-y Words]

Every day, we seem barraged with phone numbers that "spell" things, like "please dial 1-ZZZ-HE-MUST-PAY to force your older brother to pay for the call!". That's because since nearly day one of dial phone service (back when it was really a dial), we have these letters that go along with each of the digits.

On one of the many mailing lists I follow, a question came up the other day: someone wanted to know if there were any large sets of words that collided all to the same number. I thought that'd be a perfect job for Perl, and in a short time, came up with a quick program to scan through the standard Unix dictionary in /usr/dict/words to find the longest such collision list (or lists, if more than one). Since the program illustrates some basic data reduction techniques, I thought I'd pass it along to you as well.

First comes the most critical part. Given an arbitrary string, like "merlyn", what are the digits used to construct that? Well, we need to know that m is 6, and e is 3, and so on. Now, there are many slow ways to do this, but the fastest way is a tr operator:

      
col29.pl

  $_ = "merlyn";
  tr[abcdefghijklmnoprstuvwxy]
    [222333444555666777888999];
  print;

which prints 637596. Here, I've used the feature that permits tr operands to be delimited with arbitrary balancing punctuation, together with the optional whitespace that can be between the old list and the new list when we use such delimiters. This gives a nice visual layout to let me verify that I've got the right characters with the right translation.

Let's wrap this up in a subroutine, adding two additional features: (1) uppercase will be treated as lowercase, and (2) if the string has anything other than this list (like a Q or Z or punctuation), we'll return an undef:

      
col29.pl

  sub translate {
    local $_ = lc shift;
    return unless
      tr[abcdefghijklmnoprstuvwxy]
	[222333444555666777888999]
      == length;
    $_;
  }

Mapping uppercase to lowercase was easy, using the lc operator on the result of shift-ing the @_ argument array.

The "bad character" provision is handled by noting that the number of characters translated by tr (its return value) should be equal to the length of the string, and returning undef if not.

Now we need to walk the dictionary. That's not particularly hard; we just need to translate each word, and then record the results. If we use a hash keyed by the translated number, each element of the hash can have a value of an arrayref of all the words that matched. That'd look like this:

      
col29.pl

  my %num_to_words;
  @ARGV = "/usr/dict/words" unless @ARGV;
  while (<>) {
    chomp;
    next unless
      my $translate = translate($_);
    push @{$num_to_words{$translate}}, $_;
  }

We'll take the command line arguments in @ARGV as the list of files to process, defaulting to /usr/dict/words if none. And we'll treating the value of $num_to_words{$translate} as an arrayref, pushing each new found word onto the end. If there are no entries (such as initially), Perl stuffs an empty arrayref into the value, allowing the push to proceed.

So, if the dictionary consisted entirely of merlyn, Randal, and pamfan, we'd have a data structure like this:

      
col29.pl

  %num_to_words = (
    "637596" => ["merlyn"],
    "725325" => ["Randal", "pamfan"],
  );

Every key in this hash is the translated number. Every value is an arrayref, consisting of the words that had that particular translated number. If there's more than one entry, we have had a collision. The longer that collision list is, the more we're interested in it.

For example, note that the made-up word "pamfan" collides with Randal, making a list of two items. That's more interesting to us than merlyn which seems to map into its own universe there.

So now it's time to walk the resulting data structure and find the longest of those collision lists. First, we'll need a loop, and the length of each item:

      
col29.pl

  for my $number (keys %num_to_words) {
    my $length = @{$num_to_words{$number}};
    ...
  }

This uses the arrayref from the value of the hash, dereferences it as an array name in a scalar context, which results in the number of items in that array. If there were five items that collided to the value in $number, we'd have 5 here.

And we'll want to keep knowing about the longest as we scan through. One way to do this is to keep two things up-to-date as we scan, first initialized outside the loop:

      
col29.pl

  my $maxlength = 0;
  my @longest;

We'll use $maxlength to say what the longest length is, and keep pushing items onto @longest that meet that length. If we get a new $length that's bigger, we start over, like this:

      
col29.pl

  ...
    if ($length > $maxlength) {
      $maxlength = $length;
      @longest = $number;
    } elsif ($length == $maxlength) {
      push @longest, $number;
    }
  ...

This means that if the length of the item we're looking at is longer than the length of the longest item we've seen so far (initially true because $maxlength is initialized to 0 outside the loop), then set the longest length to this new length, and remember the item in the array of longest translated numbers. However, if it's only a tie with the longest item we've seen so far, then push it onto the end of the list, along with the others.

Now we've gotten all the numbers that have the largest set of collisions in @longest. As it turns out, there's only one in the standard /usr/dict/words, but let's keep pretending there might be many, to continue through the end of the code.

Time to dump the data out:

      
col29.pl

  for my $number (sort @longest) {
    print
      "$number: ",
      join(" ", sort @{$num_to_words{$number}}),
      "\n";
  }

For each of the translated numbers in the longest array, we'll print the number itself, and then a list of all the items that collided to that number.

And when you put it all together, you get the final code:

      
col29.pl

  use strict;

      
col29.pl

  my %num_to_words;
  @ARGV = "/usr/dict/words" unless @ARGV;
  while (<>) {
    chomp;
    next unless
      my $translate = translate($_);
    push @{$num_to_words{$translate}}, $_;
  }

      
col29.pl

  my $maxlength = 0;
  my @longest;
  for my $number (keys %num_to_words) {
    my $length = @{$num_to_words{$number}};
    if ($length > $maxlength) {
      $maxlength = $length;
      @longest = $number;
    } elsif ($length == $maxlength) {
      push @longest, $number;
    }
  }

      
col29.pl

  for my $number (sort @longest) {
    print
      "$number: ",
      join(" ", sort @{$num_to_words{$number}}),
      "\n";
  }

      
col29.pl

  sub translate {
    local $_ = lc shift;
    return unless
      tr[abcdefghijklmnoprstuvwxy]
	[222333444555666777888999]
      == length;
    $_;
  }

And now for the answer to the original puzzle. The greatest number of words in /usr/dict/words that map into the same phone digits is the one list consisting of:

      
col29.pl

  22737: acres bards barer bares baser bases caper capes cards cares cases

And that's no phone-y baloney! Until next time, enjoy!

Figuring phone-y words (dec 99)

Copyright Notice

Unix Review Column 29 (Dec 1999)

About Randal L. Schwartz