Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 29 (Dec 1999)
[suggested title: Figuring Phone-y Words]
Every day, we seem barraged with phone numbers that ``spell'' things, like ``please dial 1-ZZZ-HE-MUST-PAY to force your older brother to pay for the call!''. That's because since nearly day one of dial phone service (back when it was really a dial), we have these letters that go along with each of the digits.
On one of the many mailing lists I follow, a question came up the
other day: someone wanted to know if there were any large sets of
words that collided all to the same number. I thought that'd be a
perfect job for Perl, and in a short time, came up with a quick
program to scan through the standard Unix dictionary in
/usr/dict/words
to find the longest such collision list (or lists,
if more than one). Since the program illustrates some basic data
reduction techniques, I thought I'd pass it along to you as well.
First comes the most critical part. Given an arbitrary string, like
``merlyn'', what are the digits used to construct that? Well, we need
to know that m
is 6, and e
is 3, and so on. Now, there are many
slow ways to do this, but the fastest way is a tr
operator:
$_ = "merlyn"; tr[abcdefghijklmnoprstuvwxy] [222333444555666777888999]; print;
which prints 637596. Here, I've used the feature that permits tr
operands to be delimited with arbitrary balancing punctuation,
together with the optional whitespace that can be between the old list
and the new list when we use such delimiters. This gives a nice
visual layout to let me verify that I've got the right characters with
the right translation.
Let's wrap this up in a subroutine, adding two additional features:
(1) uppercase will be treated as lowercase, and (2) if the string has
anything other than this list (like a Q or Z or punctuation), we'll
return an undef
:
sub translate { local $_ = lc shift; return unless tr[abcdefghijklmnoprstuvwxy] [222333444555666777888999] == length; $_; }
Mapping uppercase to lowercase was easy, using the lc
operator
on the result of shift
-ing the @_
argument array.
The ``bad character'' provision is handled by noting that the number of
characters translated by tr
(its return value) should be equal to
the length of the string, and returning undef
if not.
Now we need to walk the dictionary. That's not particularly hard; we just need to translate each word, and then record the results. If we use a hash keyed by the translated number, each element of the hash can have a value of an arrayref of all the words that matched. That'd look like this:
my %num_to_words; @ARGV = "/usr/dict/words" unless @ARGV; while (<>) { chomp; next unless my $translate = translate($_); push @{$num_to_words{$translate}}, $_; }
We'll take the command line arguments in @ARGV
as the list of files
to process, defaulting to /usr/dict/words
if none. And we'll
treating the value of $num_to_words{$translate}
as an arrayref,
pushing each new found word onto the end. If there are no entries
(such as initially), Perl stuffs an empty arrayref into the value,
allowing the push
to proceed.
So, if the dictionary consisted entirely of merlyn
, Randal
, and
pamfan
, we'd have a data structure like this:
%num_to_words = ( "637596" => ["merlyn"], "725325" => ["Randal", "pamfan"], );
Every key in this hash is the translated number. Every value is an arrayref, consisting of the words that had that particular translated number. If there's more than one entry, we have had a collision. The longer that collision list is, the more we're interested in it.
For example, note that the made-up word ``pamfan'' collides with Randal,
making a list of two items. That's more interesting to us than
merlyn
which seems to map into its own universe there.
So now it's time to walk the resulting data structure and find the longest of those collision lists. First, we'll need a loop, and the length of each item:
for my $number (keys %num_to_words) { my $length = @{$num_to_words{$number}}; ... }
This uses the arrayref from the value of the hash, dereferences it as
an array name in a scalar context, which results in the number of
items in that array. If there were five items that collided to the
value in $number
, we'd have 5 here.
And we'll want to keep knowing about the longest as we scan through. One way to do this is to keep two things up-to-date as we scan, first initialized outside the loop:
my $maxlength = 0; my @longest;
We'll use $maxlength
to say what the longest length is, and keep
pushing items onto @longest
that meet that length. If we get
a new $length
that's bigger, we start over, like this:
... if ($length > $maxlength) { $maxlength = $length; @longest = $number; } elsif ($length == $maxlength) { push @longest, $number; } ...
This means that if the length of the item we're looking at is longer
than the length of the longest item we've seen so far (initially true
because $maxlength
is initialized to 0 outside the loop), then set
the longest length to this new length, and remember the item in the
array of longest translated numbers. However, if it's only a tie with
the longest item we've seen so far, then push it onto the end of the
list, along with the others.
Now we've gotten all the numbers that have the largest set of
collisions in @longest
. As it turns out, there's only one in the
standard /usr/dict/words
, but let's keep pretending there might be
many, to continue through the end of the code.
Time to dump the data out:
for my $number (sort @longest) { print "$number: ", join(" ", sort @{$num_to_words{$number}}), "\n"; }
For each of the translated numbers in the longest array, we'll print the number itself, and then a list of all the items that collided to that number.
And when you put it all together, you get the final code:
use strict;
my %num_to_words; @ARGV = "/usr/dict/words" unless @ARGV; while (<>) { chomp; next unless my $translate = translate($_); push @{$num_to_words{$translate}}, $_; }
my $maxlength = 0; my @longest; for my $number (keys %num_to_words) { my $length = @{$num_to_words{$number}}; if ($length > $maxlength) { $maxlength = $length; @longest = $number; } elsif ($length == $maxlength) { push @longest, $number; } }
for my $number (sort @longest) { print "$number: ", join(" ", sort @{$num_to_words{$number}}), "\n"; }
sub translate { local $_ = lc shift; return unless tr[abcdefghijklmnoprstuvwxy] [222333444555666777888999] == length; $_; }
And now for the answer to the original puzzle. The greatest number of
words in /usr/dict/words
that map into the same phone digits is the
one list consisting of:
22737: acres bards barer bares baser bases caper capes cards cares cases
And that's no phone-y baloney! Until next time, enjoy!