Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Unix Review Column 44 (Nov 2002)

[suggested title: ``Automatically extending your data'']

Perl is great at parsing data, and bringing it into memory-based data structures for reformatting or analysis (``data reduction'').

One of Perl's features that permits relatively easy creation of complex data structures is ``autovivification'', which is a mouthful to say, but roughly means ``data structures get expanded as necessary''.

A frequent first reaction when I present autovivification in the courses I teach is ``isn't that dangerous?'' and ``how can I turn that off?'' and ``doesn't that violate 'use strict'?''. Well, the answers are ``no'' and ``you don't'' and ``no''. But let me explain, by first going back to something simpler.

Ever since the earliest versions of Perl, you've been able to say

  $count[3] = 14;

as the first statement of your program. What happens here is that Perl realizes that @count doesn't exist, and creates it, and then this new @count isn't long enough to hold 4 elements yet (elements 0 through 3), so Perl extends the array to include those elements, and puts the value 14 at $count[3].
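To watch the array grow, here's a minimal sketch. The names $size and @undefined are mine, and the my declaration is only there to keep use strict quiet; without strict, the bare assignment works as described above.

```perl
use strict;
use warnings;

# Declared only to satisfy "use strict"; the array starts out empty.
my @count;
$count[3] = 14;

# How many elements does the array hold now?
my $size = scalar @count;

# Which of the skipped-over indexes hold undef?
my @undefined = grep { !defined $count[$_] } 0 .. 2;

print "size: $size\n";                       # size: 4
print "undefined indexes: @undefined\n";     # undefined indexes: 0 1 2
print "last element: $count[3]\n";           # last element: 14
```

The three elements in front of the 14 exist only as placeholders holding undef.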

Similarly, the code

  $seen{'Dino'} = 7;

has also always worked, first creating the hash %seen if it didn't exist, then adding a hash element with a key of Dino, and finally putting 7 as the corresponding value.
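The same kind of check works for the hash. This is just a sketch: $before, $after, and @keys are illustrative names of my own, and again the my is there for use strict.

```perl
use strict;
use warnings;

# Declared only to satisfy "use strict"; the hash starts out empty.
my %seen;

my $before = exists $seen{'Dino'} ? 1 : 0;   # key is not there yet
$seen{'Dino'} = 7;                           # assignment creates the element
my $after  = exists $seen{'Dino'} ? 1 : 0;   # now it is

my @keys = keys %seen;

print "before: $before, after: $after\n";    # before: 0, after: 1
print "keys: @keys\n";                       # keys: Dino
```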

The value of this ``automatic creation of variables'' is that I don't need to first predetermine all the possible index values for a given hash or array, create the structure properly, and then run my program. I can simply run my program, and the data structures expand as necessary to hold the values.

Admittedly, language purists cringe at even this Perl capability, but let's just presume that they don't understand that the ``P'' in Perl stands for 'Practical' (at least in one telling of the story).

When Perl references were added mid-way through Perl's life, Larry Wall extended the definition of this autoextension of variables to include not-yet-defined references, and this is the action defined as ``autovivification''. This is best illustrated by an example:

  $myreference = undef;
  $myreference->[3] = 14;

Here, $myreference is being treated as an array reference in the second statement. However, at the moment, that's undef. But just as Perl creates variables where necessary, and extends arrays and hashes as needed, Perl will also plug in a pointer to an empty anonymous array here. It's as if these statements were rewritten as:

  $myreference = undef;
  $myreference = []; # inserted via autovivification
  $myreference->[3] = 14;

Now the 14 is being inserted as the 4th element (element index 3) of the anonymous array, pointed to by the $myreference variable.

Again, this is really just a continuation of the prior behavior: ``extend the data structures as necessary so that the show can go on''. Formally, the rule is:

If a variable containing undef is being used in an assignment as if it were a reference to a data structure, a reference to an empty data structure of the appropriate type is placed into that variable before the operation continues.

And the result is that we create data structures as needed. For example, this also works:

  $myreference = undef;
  $myreference->{'Dino'} = 7;

Note however that we'll end up with a hash reference in $myreference, not an array reference. This hash reference initially points at an anonymous empty hash, which is then nearly immediately extended to include an element with a key of Dino and a value of 7.
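One way to confirm what ended up in each variable is the built-in ref function, which reports the type of thing a reference points at. Here's a small sketch under use strict (which autovivification doesn't violate); the two variable names are mine, chosen so that the array case and the hash case can sit side by side.

```perl
use strict;
use warnings;

my ($as_array, $as_hash);        # both start out as undef

$as_array->[3]     = 14;         # used like an array reference
$as_hash->{'Dino'} = 7;          # used like a hash reference

my $array_type = ref $as_array;  # what kind of reference got installed?
my $hash_type  = ref $as_hash;

my $elements = scalar @$as_array;
my @keys     = keys %$as_hash;

print "as_array: $array_type with $elements elements\n";   # as_array: ARRAY with 4 elements
print "as_hash: $hash_type with keys @keys\n";             # as_hash: HASH with keys Dino
```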

The type of reference is determined by the type of the object we're trying to point at, not by the previous contents of the variable. In fact, the previous contents of the variable must be undef, or the rule given above doesn't apply. So, this sequence is guaranteed to fail:

  $myreference = undef;
  $myreference->[3] = 14;
  $myreference->{'Dino'} = 7;  # fails

We're trying to use the now-present array reference in $myreference as if it were a hash reference. This can't work (ignoring the soon-to-be-removed pseudohash feature, anyway), and will throw a runtime exception.
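If you'd like to see the exception without killing your program, you can trap it with eval. This is only a sketch: $ok and $error are my own names, and the exact wording of the message depends on your Perl version (on releases without pseudohashes, I'd expect it to begin with ``Not a HASH reference'').

```perl
use strict;
use warnings;

my $myreference;
$myreference->[3] = 14;          # autovivified as an array reference

# Trap the runtime exception so we can look at it instead of dying.
my $ok = eval {
  $myreference->{'Dino'} = 7;    # wrong kind of reference: this dies
  1;
};
my $error = $ok ? '' : $@;

print $ok ? "unexpectedly succeeded\n" : "failed: $error";
```

Note that the array reference is left untouched by the failed attempt: $myreference->[3] is still 14 afterward.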

The examples above deliberately put an undef into the variable, but the undef that is present in a newly-created variable would have worked just as well:

  my $newreference;
  $newreference->[3] = 14;

And recall that a new element of an array or hash also has this same sort of undef:

  my @pointers;
  $pointers[42]->{'Dino'} = 7;

Here, $pointers[42] doesn't exist, so Perl first extends the @pointers array to include that element. But then the element is being used as if it were a hash ref, so Perl places a reference to an empty anonymous hash into $pointers[42], and continues the operation. If we consistently placed only hash references into this array, we'd have a dynamically allocated array of hashrefs.

Of course, you can drop that arrow, because it's between two ``subscripty kind of things'' (technical terms), so it's more commonly written as $pointers[42]{'Dino'}. And even the quotes aren't necessary there, since the hash key is a simple identifier (letters, digits, and underscores, not starting with a digit), so we can reduce that further to $pointers[42]{Dino} safely.
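As a quick sanity check that all three spellings reach the same element, here's a short sketch. The scalar names are mine, purely for illustration.

```perl
use strict;
use warnings;

my @pointers;
$pointers[42]->{'Dino'} = 7;                 # explicit arrow, quoted key

my $with_arrow = $pointers[42]->{'Dino'};    # the long form
my $no_arrow   = $pointers[42]{'Dino'};      # arrow dropped between subscripts
my $bareword   = $pointers[42]{Dino};        # quotes dropped as well

my $type = ref $pointers[42];                # what got autovivified at index 42
my $size = scalar @pointers;                 # the array was extended to fit

print "$with_arrow $no_arrow $bareword\n";             # 7 7 7
print "element 42 is a $type ref, size is $size\n";    # element 42 is a HASH ref, size is 43
```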

An action might invoke multiple levels of autovivification. For example, let's look at the following code:

  my $source = "red";
  my $destination = "yellow";
  my $length = 35;
  $lengths{$source}{$destination} = $length;

The hash element $lengths{red} is being used as a hash reference, dereferenced, and the element with a key of yellow of that hash is being given the value 35. Now, if these are the first few steps of the program, %lengths won't even exist, so it first gets created. Then, since $lengths{red} doesn't exist, it gets installed with a value of a reference to an empty hash (via autovivification). Finally, the element with a key of yellow in that hash is given the value of 35, and we're done. This is more commonly encountered in a loop:

  while (<DATA>) {
    my ($source, $destination, $length) = split;
    $lengths{$source}{$destination} = $length;
  }
  # more code here later
  __END__
  red yellow 35
  red green 19
  purple blue 12
  blue orange 18

Note that once the first line is processed, creating a hash reference for $lengths{red}, the second line doesn't create a new hash reference, because $lengths{red} is already defined. So the elements with keys of yellow and green are both in the same hash, referenced by the hash element of $lengths{red}.
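To see what that loop builds, you might walk both levels of the result. The sketch below is an adaptation rather than the original program: it keeps the same four records in a list I've called @records (instead of reading DATA), declares %lengths for use strict, and collects the output lines in @report.

```perl
use strict;
use warnings;

# The same sample records, held in a list instead of the DATA filehandle.
my @records = (
  "red yellow 35",
  "red green 19",
  "purple blue 12",
  "blue orange 18",
);

my %lengths;
for my $record (@records) {
  my ($source, $destination, $length) = split ' ', $record;
  $lengths{$source}{$destination} = $length;   # inner hashes appear as needed
}

# Walk both levels in sorted order to show what was built.
my @report;
for my $source (sort keys %lengths) {
  for my $destination (sort keys %{ $lengths{$source} }) {
    push @report, "$source -> $destination: $lengths{$source}{$destination}";
  }
}
print "$_\n" for @report;
# blue -> orange: 18
# purple -> blue: 12
# red -> green: 19
# red -> yellow: 35
```

There are three top-level keys, not four, because both red records landed in the same inner hash.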

A variant on this for tabulation purposes involves the automatic initialization to undef for a variable with respect to an operator like +=. For example, the following code sums a list of numbers:

  while (<DATA>) {
    my ($number) = split;
    $sum += $number;
  }
  print "$sum\n";
  __END__
  3
  5
  19

The first time through the loop, $sum is uninitialized, and therefore guaranteed to be undef, but this happens to be the perfect base value for +=, treating the undef like a 0 because addition is a mathematical operation. We can apply this to a complex data reduction:
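This holds even with warnings turned on: += on an undefined variable is exempt from the ``uninitialized value'' warning, while an ordinary + is not. Here's a sketch that counts warnings with a __WARN__ handler to show the difference; the handler and the names @warnings, $plain, and $result are scaffolding of my own, not part of the summing program.

```perl
use strict;
use warnings;

# Collect any warnings instead of printing them, so we can count them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $sum;                          # undef
$sum += $_ for 3, 5, 19;          # += treats the initial undef as 0, silently
my $after_sum = @warnings;        # number of warnings so far

my $plain;                        # also undef
my $result = $plain + 3;          # ordinary + also treats undef as 0, but warns
my $after_plain = @warnings;

print "sum: $sum, warnings so far: $after_sum\n";          # sum: 27, warnings so far: 0
print "result: $result, warnings so far: $after_plain\n";  # result: 3, warnings so far: 1
```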

  while (<DATA>) {
    my ($source, $destination, $hits) = split;
    $total_hits{$source}{$destination} += $hits;
  }
  # more code here later
  __END__
  red yellow 35
  red green 19
  red yellow 12
  blue red 18
  blue red 8

Just like the previous summing example, we're adding up numbers. But now the totals are organized by the pair of source crossed with destination. Looking at the first invocation:

  $total_hits{red}{yellow} += 35;

Since %total_hits is empty at this point, Perl first extends the hash to include a hashref at $total_hits{red}. This hashref initially points to an empty hash, but then gets extended to include an element at the key of yellow. However, since the value at this key is being used in a +=, the initial undef value is treated as 0, and then 35 gets added, resulting in 35. This 35 is then stored in place of the initial undef, and we're done. When the third step is executed:

  $total_hits{red}{yellow} += 12;

12 is added to the existing value of 35, yielding 47, and that becomes the updated value.
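Running all five records through and printing the totals makes the result concrete. As before, this is a sketch under my own assumptions: the records sit in a list named @records rather than DATA, %total_hits is declared for use strict, and @report just collects the output lines.

```perl
use strict;
use warnings;

# The same five sample records, held in a list instead of DATA.
my @records = (
  "red yellow 35",
  "red green 19",
  "red yellow 12",
  "blue red 18",
  "blue red 8",
);

my %total_hits;
for my $record (@records) {
  my ($source, $destination, $hits) = split ' ', $record;
  $total_hits{$source}{$destination} += $hits;   # autovivify, then accumulate
}

# Report each source/destination pair with its total.
my @report;
for my $source (sort keys %total_hits) {
  for my $destination (sort keys %{ $total_hits{$source} }) {
    push @report, "$source -> $destination: $total_hits{$source}{$destination}";
  }
}
print "$_\n" for @report;
# blue -> red: 26
# red -> green: 19
# red -> yellow: 47
```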

The important point here is that you write what you want it to do, and it just works. That's the nice thing about Perl. It very often just Does The Right Thing. So, be mystified by autovivification no more: learn to embrace it, use it, and like it! Until next time, enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.