Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 30 (Feb 2000)
[suggested title: Deep copying, not Deep secrets]
One of the modules I encountered in the CPAN had a problem with creating multiple objects. Each object got created fine, but when they were used simultaneously in the program, some of the data from one object mysteriously showed up in the second one, even after the first one was freed!
Upon inspection, I found that the object was being initialized from a shallow copy of a template, and I told the author that he needed to deep copy the template instead. He was puzzled by the suggestion, and if you aren't familiar with these two terms, I bet you are a little confused now as well.
What is deep copying and why do we need it? Let's start with a simple example, and work back to the problem I posed a moment ago.
For example, let's grab all the password information into a simple hash, with the key being the username, and the value being an array ref pointing to the nine-element list returned by getpwent(). On a first cut, we quickly hack out something like this:
    while (@x = getpwent()) {
      $info{$x[0]} = \@x;
    }
    for (sort keys %info) {
      print "$_ => @{$info{$_}}\n";
    }
What? Where did all the data go? We stored a reference to the data into the hash value. Well, maybe this will make it clearer:
    while (@x = getpwent()) {
      $info{$x[0]} = \@x;
      print "$info{$x[0]}\n";
    }
On my machine, this printed ARRAY(0x80ce7fc) dozens of times, once for every user on the system. So what does that mean? It means that we are reusing the same array repeatedly, and therefore we don't have dozens of arrayrefs pointing to distinct arrays; we have a single array with many references to the same data. So on the last pass through the gathering loop, @x is emptied, and therefore all the array references are pointing to the identical empty data.
This is because we made a shallow copy of the reference to @x into the hash value: a copy of only the top-level pointer, but not the contents. What we really needed was not only a copy of the reference, but also a copy of what the reference pointed to. That's simple enough here:
    while (@x = getpwent()) {
      $info{$x[0]} = [@x];
    }
    for (sort keys %info) {
      print "$_ => $info{$_} => @{$info{$_}}\n";
    }
And now notice, we've got a distinct arrayref for each hash element, pointing to an independent copy of the nine elements of the array originally contained in @x. This worked because we created a new anonymous array with the expression [@x], which also gives this anonymous array an initial value made of copies of the elements of @x.
So that's a basic deep copy: copying not only the top level pointer, but also all the things within the data structure to maintain complete independence.
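The distinction can be seen in miniature with a plain array. Here's a small sketch (using made-up data, not the password example):

```perl
use strict;
use warnings;

my @x = (1, 2, 3);
my $shallow = \@x;   # a reference to the very same array
my $deep    = [@x];  # a new anonymous array holding copies of the elements

$x[0] = 99;
print $shallow->[0], "\n";  # 99: the change shows through the reference
print $deep->[0], "\n";     # 1: the independent copy is unaffected
```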
Actually, there was one other way to ensure unique subelements in this example, and I'll show it for completeness lest my Perl-hacking friends get irritated. You don't need to copy anything if you just generate the data in a distinct array in the first place:
    while (my @x = getpwent()) {
      $info{$x[0]} = \@x;
    }
    for (sort keys %info) {
      print "$_ => $info{$_} => @{$info{$_}}\n";
    }
Here, each pass through the loop starts with a brand-new, completely distinct lexical @x rather than reusing the old existing variable. So when a reference is taken to it and it falls out of scope at the bottom of the loop, the variable automatically remains behind as an anonymous array.
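That scoping trick rests on Perl's reference counting: data lives as long as something refers to it, even after its name is gone. Here's a minimal sketch of that behavior:

```perl
use strict;
use warnings;

my $ref;
{
  my @list = (1, 2, 3);  # a fresh lexical array for this block
  $ref = \@list;         # taking a reference keeps the data alive
}                        # the name @list goes away here; its data does not
print "@$ref\n";         # prints "1 2 3"
```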
But let's get back to deep copying. Here's another example. Let's suppose Fred and Barney are sharing a house:
    $place = {
      Street => "123 Shady Lane",
      City => "Granite Junction",
    };
    $fred = {
      First_name => "Fred",
      Last_name => "Flintstone",
      Address => $place,
    };
    $barney = {
      First_name => "Barney",
      Last_name => "Rubble",
      Address => $place,
    };
Now note that $fred->{Address}{City} is ``Granite Junction'', just as we might expect, as is $barney->{Address}{City}. But we've done a shallow copy from $place into both of the Address element values. This means that there are not two copies of the data, but just one. We can see this when we change one of the values.
Let's let Fred move to his own place:
    $fred->{Address}{Street} = "456 Palm Drive";
    $fred->{Address}{City} = "Bedrock";
Looks safe enough. But what happened to Barney? He moved along with Fred!
    print "@{$barney->{Address}}{qw(Street City)}\n";
This prints Fred's new address! Why did that happen? Once again, the assignment of $place as the address in both cases made a shallow copy: both data structures shared a common pointer to the common street and city data. Again, a deep copy would have helped:
    $place = {
      Street => "123 Shady Lane",
      City => "Granite Junction",
    };
    $fred = {
      First_name => "Fred",
      Last_name => "Flintstone",
      Address => {%$place},
    };
    $barney = {
      First_name => "Barney",
      Last_name => "Rubble",
      Address => {%$place},
    };
    $fred->{Address}{Street} = "456 Palm Drive";
    $fred->{Address}{City} = "Bedrock";
    print "@{$barney->{Address}}{qw(Street City)}\n";
There... now each Address field is a completely disconnected copy, so when we update one, the other stays pure. This works because, just like the [@x] construct, we are creating a new independent anonymous hash and taking a reference to it.
But what if $place were itself a deeper structure? That is, suppose the street address was made up of a number and a name:
    $place = {
      Street => {
        Number => 123,
        Name => "Shady Lane",
      },
      City => "Granite Junction",
    };
    $fred = {
      First_name => "Fred",
      Last_name => "Flintstone",
      Address => {%$place},
    };
    $barney = {
      First_name => "Barney",
      Last_name => "Rubble",
      Address => {%$place},
    };
We've now done something that's not quite a deep copy, but also not quite a shallow copy. Certainly, the hash at $fred->{Address} is different from $barney->{Address}. But they both contain a value that is identical to the $place->{Street} hashref! So if we move Fred just down the street:
    $fred->{Address}{Street}{Number} = 456;
then Barney moves along with him again! Now, we could fix this problem by applying the logic for copying the address one more time to the street structure:
    $fred = {
      First_name => "Fred",
      Last_name => "Flintstone",
      Address => {
        Street => {%{$place->{Street}}},
        City => $place->{City},
      },
    };
But as you can see, it's getting more and more convoluted. And what if we change City to be another structure, or add another level to Street? Bleh.
Fortunately, we can write a general-purpose deep copier as a recursive subroutine. Here's a simple little deep copy routine:
    sub deep_copy {
      my $this = shift;
      if (not ref $this) {
        $this;
      } elsif (ref $this eq "ARRAY") {
        [map deep_copy($_), @$this];
      } elsif (ref $this eq "HASH") {
        +{map { $_ => deep_copy($this->{$_}) } keys %$this};
      } else {
        die "what type is $this?";
      }
    }
This subroutine expects a single item: the top of a tree of hashrefs, arrayrefs, and scalars. If the item is a scalar, it is simply returned, since a shallow copy of a scalar is also a deep copy. If it's an arrayref, we create a new anonymous array from the data. However, each element of this array could itself be a data structure, so we need a deep copy of it. The solution is straightforward: simply call deep_copy on each item. Similarly, a new hashref is constructed by copying each element, including a deep copy of its value. (The hash key is always a simple scalar, so it needs no copy, although that would have been easy enough to add.) To see it work, let's give it some data:
    $place = {
      Street => {
        Number => 123,
        Name => [qw(Shady Lane)],
      },
      City => "Granite Junction",
      Zip => [97007, 4456],
    };
    $place2 = $place;
    $place3 = {%$place};
    $place4 = deep_copy($place);
Hmm. How do we see what we've done, and what's being shared? Let's add a call to the standard library module, Data::Dumper:
    use Data::Dumper;
    $Data::Dumper::Indent = 1;
    print Data::Dumper->Dump(
      [$place, $place2, $place3, $place4],
      [qw(place place2 place3 place4)]
    );
And that generates on my system:
    $place = {
      'City' => 'Granite Junction',
      'Zip' => [
        97007,
        4456
      ],
      'Street' => {
        'Name' => [
          'Shady',
          'Lane'
        ],
        'Number' => 123
      }
    };
    $place2 = $place;
    $place3 = {
      'City' => 'Granite Junction',
      'Zip' => $place->{'Zip'},
      'Street' => $place->{'Street'}
    };
    $place4 = {
      'City' => 'Granite Junction',
      'Zip' => [
        97007,
        4456
      ],
      'Street' => {
        'Number' => 123,
        'Name' => [
          'Shady',
          'Lane'
        ]
      }
    };
Hey, look at that. Data::Dumper let me know that $place2 is a shallow copy of $place, while $place3 is an intermediately copied value; notice the elements of $place inside $place3. And since $place4 contains no previously seen references, we know it's a completely independent deep copy. Success! (The ordering of the hash elements is inconsistent, but that's immaterial and undetectable in normal use.)
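Another way to verify what's shared, instead of eyeballing Data::Dumper output, is to compare the addresses of the references directly. Here's a sketch using refaddr from the standard Scalar::Util module, on a cut-down version of the data:

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my $place  = { Zip => [97007, 4456] };
my $place2 = $place;                         # shallow: the very same hash
my $place3 = {%$place};                      # new hash, but Zip still shared
my $place4 = { Zip => [@{$place->{Zip}}] };  # inner array copied as well

print refaddr($place)        == refaddr($place2),        "\n"; # 1: shared
print refaddr($place->{Zip}) == refaddr($place3->{Zip}), "\n"; # 1: shared
print refaddr($place->{Zip}) == refaddr($place4->{Zip}), "\n"; # empty: independent
```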
Now, this simple deep_copy routine will break if there are recursive data pointers (references that point back to already-seen data higher in the tree). For that, you might look at the dclone method of the Storable module, found in the CPAN.
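Here's a sketch of what that looks like; Storable tracks references it has already seen, so a self-referential structure that would send our deep_copy into an infinite loop clones cleanly:

```perl
use strict;
use warnings;
use Storable qw(dclone);

my $place = { Street => { Number => 123 } };
$place->{Self} = $place;   # a recursive pointer: deep_copy above would loop forever

my $copy = dclone($place);
$copy->{Street}{Number} = 456;

print $place->{Street}{Number}, "\n";  # 123: the original is untouched
print $copy->{Self} == $copy ? "cycle preserved\n" : "cycle broken\n";
```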
So, when you use = with a structure, be sure you know what you are doing. You may need a deep copy instead of that shallow copy. For further information, check out your online documentation with perldoc perldsc and perldoc perllol, and even perldoc perlref and perldoc perlreftut for the basics. Until next time, enjoy!