Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 44 (Nov 2002)
[suggested title: ``Automatically extending your data'']
Perl is great at parsing data, and bringing it into memory-based data structures for reformatting or analysis (``data reduction'').
One of Perl's features that permits relatively easy creation of complex data structures is ``autovivification'', which is a mouthful to say, but roughly means ``data structures get expanded as necessary''.
A frequent first reaction when I present autovivification in the courses I teach is ``isn't that dangerous?'' and ``how can I turn that off?'' and ``doesn't that violate 'use strict'?''. Well, the answers are ``no'' and ``you don't'' and ``no''. But let me explain, by first going back to something simpler.
Ever since the earliest versions of Perl, you've been able to say
$count[3] = 14;
as the first statement of your program. What happens here is that Perl realizes that @count doesn't exist, and creates it, and then this new @count isn't long enough to hold 4 elements yet (elements 0 through 3), so Perl extends the array to include those elements, and puts the value 14 at $count[3].
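As a minimal sketch of that array extension, the following script checks what @count looks like afterward. It assumes a script running under use strict, so the array is declared with my first; without strict, the assignment alone would be enough.

```perl
use strict;
use warnings;

my @count;          # declared only because "use strict" is in effect
$count[3] = 14;     # Perl extends @count to hold elements 0 through 3

print "length: ", scalar(@count), "\n";     # length: 4
print "last index: $#count\n";              # last index: 3
print "element 0 is ", (defined $count[0] ? "defined" : "undef"), "\n";
```

The elements below index 3 are simply undef until something is stored in them.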
Similarly, the code
$seen{'Dino'} = 7;
has also always worked, first creating the hash %seen if it didn't exist, then adding a hash element with a key of Dino, and finally putting 7 as the corresponding value.
The value of this ``automatic creation of variables'' is that I don't need to first predetermine all the possible index values for a given hash or array, create the structure properly, and then run my program. I can simply run my program, and the data structures expand as necessary to hold the values.
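To make that concrete, here is a small sketch that tallies names without setting up any keys in advance. The @names list is hypothetical sample data, and the tally uses ++, which treats a missing (undef) value as 0 the first time a key is seen.

```perl
use strict;
use warnings;

my @names = qw(Dino Fred Dino Barney Dino);   # hypothetical sample data
my %seen;

# No keys are prepared ahead of time; each one appears on first use.
$seen{$_}++ for @names;

print "$_ => $seen{$_}\n" for sort keys %seen;
```

The hash ends up with exactly the keys that occurred in the data, and nothing had to be known about them before the program ran.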
Admittedly, language purists cringe at even this Perl capability, but let's just presume that they don't understand that the ``P'' in Perl stands for 'Practical' (at least in one telling of the story).
When Perl references were added mid-way through Perl's life, Larry Wall extended the definition of this autoextension of variables to include not-yet-defined references, and this is the action defined as ``autovivification''. This is best illustrated with an example:
$myreference = undef;
$myreference->[3] = 14;
Here, $myreference is being treated as an array reference in the second statement. However, at the moment, that's undef. But just as Perl creates variables where necessary, and extends arrays and hashes as needed, Perl will also plug in a pointer to an empty anonymous array here. It's as if these statements were rewritten as:
$myreference = undef;
$myreference = []; # inserted via autovivification
$myreference->[3] = 14;
Now the 14 is being inserted as the 4th element (element index 3) of the anonymous array, pointed to by the $myreference variable.
Again, this is really just a continuation of the prior behavior: ``extend the data structures as necessary so that the show can go on''. Formally, the rule is:
If a variable containing undef is being used in an assignment as if it were a reference to a data structure, a reference to an empty data structure of the appropriate type is placed into that variable before the operation continues.
And the result is that we create data structures as needed. For example, this also works:
$myreference = undef;
$myreference->{'Dino'} = 7;
Note however that we'll end up with a hash reference in $myreference, not an array reference. This hash reference initially points at an anonymous empty hash, which is then nearly immediately extended to include an element with a key of Dino and a value of 7.
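One way to watch the rule in action is to ask Perl what kind of reference ended up in each variable. This sketch uses the built-in ref function; the variable names $array_side and $hash_side are made up for illustration. It also runs under use strict, which is consistent with the earlier answer that autovivification does not violate it.

```perl
use strict;
use warnings;

my $array_side;                  # starts out undef
my $hash_side;                   # starts out undef

$array_side->[3]     = 14;       # used as if it were an array reference
$hash_side->{'Dino'} = 7;        # used as if it were a hash reference

print ref($array_side), "\n";    # ARRAY
print ref($hash_side),  "\n";    # HASH
```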
The type of reference is determined by the type of the object we're trying to point at, not by the previous contents of the variable. In fact, the previous contents of the variable must be undef, or the rule given above doesn't apply. So, this sequence is guaranteed to fail:
$myreference = undef;
$myreference->[3] = 14;
$myreference->{'Dino'} = 7; # fails
We're trying to use the now-present array reference in $myreference as if it were a hash reference. This can't work (ignoring the soon-to-be-removed pseudohash feature, anyway), and will throw a runtime exception.
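The failure can be observed without killing the script by wrapping the offending statement in eval. This is only a sketch: the names $ok and $error are illustrative, and the exact wording of the message depends on the Perl version (on Perls without pseudohashes it reads ``Not a HASH reference'').

```perl
use strict;
use warnings;

my $myreference;
$myreference->[3] = 14;          # autovivifies an array reference

# eval traps the runtime exception so the script can report it.
my $ok    = eval { $myreference->{'Dino'} = 7; 1 };
my $error = $@;

print "failed: $error" unless $ok;
print "still holds: ", ref($myreference), "\n";   # still an ARRAY reference
```

The variable keeps its array reference; the failed statement does not replace it with a hash reference.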
The examples above deliberately put an undef into the variable, but the undef that is present in a newly-created variable would have worked just as well:
my $newreference;
$newreference->[3] = 14;
And recall that a new element of an array or hash also has this same sort of undef:
my @pointers;
$pointers[42]->{'Dino'} = 7;
Here, $pointers[42] doesn't exist, so Perl first extends the @pointers array to include that element. But then the element is being used as if it were a hash ref, so Perl places an anonymous hash reference into $pointers[42], and continues the operation. If we consistently placed only hash references into this array, we'd have a dynamically allocated array of hashrefs.
Of course, you can drop that arrow, because it's between two ``subscripty kind of things'' (technical terms), so it's more commonly written as $pointers[42]{'Dino'}. And even the quotes aren't necessary there, since the hash element is an alphanumeric symbol, so we can reduce that further to $pointers[42]{Dino} safely.
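A short sketch confirms that the three spellings name the same element, and shows what the rest of @pointers looks like after the assignment:

```perl
use strict;
use warnings;

my @pointers;
$pointers[42]->{'Dino'} = 7;     # extends @pointers, then autovivifies a hash

print scalar(@pointers), "\n";            # 43 (elements 0 through 42)
print ref($pointers[42]), "\n";           # HASH
print $pointers[42]->{'Dino'}, "\n";      # 7
print $pointers[42]{'Dino'}, "\n";        # 7, arrow dropped
print $pointers[42]{Dino}, "\n";          # 7, quotes dropped as well
```

Only element 42 holds a hash reference; elements 0 through 41 remain undef.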
An action might invoke multiple levels of autovivification. For example, let's look at the following code:
my $source = "red";
my $destination = "yellow";
my $length = 35;
$lengths{$source}{$destination} = $length;
The hash element $lengths{red} is being used as a hash reference, dereferenced, and the element with a key of yellow of that hash is being given the value 35. Now, if these are the first few steps of the program, %lengths won't even exist, so it first gets created. Then, since $lengths{red} doesn't exist, it gets installed with a value of a reference to an empty hash (via autovivification). Finally, the element with a key of yellow in that hash is given the value of 35, and we're done. This is more commonly encountered in a loop:
while (<DATA>) {
    my ($source, $destination, $length) = split;
    $lengths{$source}{$destination} = $length;
}
# more code here later
__END__
red yellow 35
red green 19
purple blue 12
blue orange 18
Note that once the first line is processed, creating a hash reference for $lengths{red}, the second line doesn't create a new hash reference, because $lengths{red} is already defined. So the elements with keys of yellow and green are both in the same hash, referenced by the hash element of $lengths{red}.
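To see the resulting structure, the sketch below builds the same hash and then walks both levels. As an assumption made purely for illustration, the records are held in a @records list rather than read from the DATA filehandle.

```perl
use strict;
use warnings;

# Same sample records as the loop, kept in a list (an assumption for this sketch).
my @records = ("red yellow 35", "red green 19", "purple blue 12", "blue orange 18");

my %lengths;
for my $record (@records) {
    my ($source, $destination, $length) = split ' ', $record;
    $lengths{$source}{$destination} = $length;   # inner hash appears on first use
}

# Walk the outer hash, then each inner hash it points at.
for my $source (sort keys %lengths) {
    for my $destination (sort keys %{ $lengths{$source} }) {
        print "$source -> $destination: $lengths{$source}{$destination}\n";
    }
}
```

There are three outer keys (blue, purple, and red), and the inner hash for red holds both green and yellow.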
A variant on this for tabulation purposes involves the automatic initialization to undef for a variable with respect to an operator like +=. For example, the following code sums a list of numbers:
while (<DATA>) {
    my ($number) = split;
    $sum += $number;
}
print "$sum\n";
__END__
3
5
19
The first time through the loop, $sum is uninitialized, and therefore guaranteed to be undef, but this happens to be the perfect base value for +=, treating the undef like a 0 because addition is a mathematical operation. We can apply this to a complex data reduction:
while (<DATA>) {
    my ($source, $destination, $hits) = split;
    $total_hits{$source}{$destination} += $hits;
}
# more code here later
__END__
red yellow 35
red green 19
red yellow 12
blue red 18
blue red 8
As in the previous summing example, we're accumulating totals, but now one total is kept for each pair of source and destination. Looking at the first invocation:
$total_hits{red}{yellow} += 35;
Since %total_hits is empty at this point, Perl first extends the hash to include a hashref at $total_hits{red}. This hashref initially points to an empty hash, but then gets extended to include an element at the key of yellow. However, since the value at this key is being used in a +=, the initial undef value is treated as 0, and then 35 gets added, resulting in 35. This 35 is then stored in place of the initial undef, and we're done. When the third line of data is processed:
$total_hits{red}{yellow} += 12;
the 12 is added to the existing value of 35, yielding 47, and that becomes the updated value.
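Putting the whole reduction together, the following sketch runs the same five records and prints the totals. The records are held in a @records list instead of the DATA section, an assumption made so the example stands alone. It runs with warnings enabled, since += is among the operators that do not warn about an undef starting value.

```perl
use strict;
use warnings;

# Same sample records as the loop, kept in a list (an assumption for this sketch).
my @records = (
    "red yellow 35", "red green 19", "red yellow 12",
    "blue red 18",   "blue red 8",
);

my %total_hits;
for my $record (@records) {
    my ($source, $destination, $hits) = split ' ', $record;
    $total_hits{$source}{$destination} += $hits;   # undef counts as 0 the first time
}

for my $source (sort keys %total_hits) {
    for my $destination (sort keys %{ $total_hits{$source} }) {
        print "$source -> $destination: $total_hits{$source}{$destination}\n";
    }
}
```

The totals come out as 26 for blue to red, 19 for red to green, and 47 for red to yellow.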
The important point here is that you write what you want it to do, and it just works. That's the nice thing about Perl. It very often just Does The Right Thing. So, be mystified by autovivification no more: learn to embrace it, use it, and like it! Until next time, enjoy!