Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 30 (Feb 2000)
[suggested title: Deep copying, not Deep secrets]
One of the modules I encountered in the CPAN had a problem with creating multiple objects. Each object got created fine, but when they were used simultaneously in the program, some of the data from one object mysteriously showed up in the second one, even after the first one was freed!
Upon inspection, I found that the object was being initialized from a shallow copy of a template, and I told the author that he needed to deep copy the template instead. He was puzzled by the suggestion, and if you aren't familiar with these two terms, I bet you are a little confused now as well.
What is deep copying and why do we need it? Let's start with a simple example, and work back to the problem I posed a moment ago.
For example, let's grab all the password information into a simple hash, with the key being the username, and the value being an array ref pointing to the nine-element list returned by getpwent(). On a first cut, we quickly hack out something like this:
    while (@x = getpwent()) {
      $info{$x[0]} = \@x;
    }
    for (sort keys %info) {
      print "$_ => @{$info{$_}}\n";
    }
What? Where did all the data go? We stored a reference to the data into the hash value. Well, maybe this will make it clearer:
    while (@x = getpwent()) {
      $info{$x[0]} = \@x;
      print "$info{$x[0]}\n";
    }
On my machine, this printed ARRAY(0x80ce7fc) dozens of times, once for every user on the system. So what does that mean? It means that we are reusing the same array repeatedly, and therefore we don't have dozens of arrayrefs pointing to distinct arrays; we have a single array with many references to the same data. So on the last pass through the gathering loop, @x is emptied, and therefore all the array references are pointing to the identical empty data.
This is because we made a shallow copy of the reference to @x into the hash value: a copy of only the top-level pointer, but not the contents. What we really needed was not only a copy of the reference, but also a copy of what the reference pointed to. That's simple enough here:
    while (@x = getpwent()) {
      $info{$x[0]} = [@x];
    }
    for (sort keys %info) {
      print "$_ => $info{$_} => @{$info{$_}}\n";
    }
And now notice, we've got a distinct arrayref for each hash element, pointing to an independent copy of the nine elements of the array originally contained in @x. This worked because we created a new anonymous array with the expression [@x], which also gives this anonymous array an initial value made of copies of the elements of @x.
So that's a basic deep copy: copying not only the top level pointer, but also all the things within the data structure to maintain complete independence.
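The distinction can be seen in miniature with a plain array. Here's a small sketch (using made-up data, not the password example):

```perl
use strict;
use warnings;

my @x = (1, 2, 3);
my $shallow = \@x;   # a reference to the very same array
my $deep    = [@x];  # a new anonymous array holding copies of the elements

$x[0] = 99;
print $shallow->[0], "\n";  # 99: the change shows through the reference
print $deep->[0], "\n";     # 1: the independent copy is unaffected
```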
Actually, there was one other way to ensure unique subelements in this example, and I'll show it for completeness lest my Perl-hacking friends get irritated. You don't need to copy anything if you just generate the data in a distinct array in the first place:
    while (my @x = getpwent()) {
      $info{$x[0]} = \@x;
    }
    for (sort keys %info) {
      print "$_ => $info{$_} => @{$info{$_}}\n";
    }
Here, each pass through the loop starts with a brand-new, completely distinct lexical @x rather than reusing the old existing variable. So when a reference is taken to it and it falls out of scope at the bottom of the loop, the variable automatically remains behind as an anonymous array.
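That scoping trick rests on Perl's reference counting: data lives as long as something refers to it, even after its name is gone. Here's a minimal sketch of that behavior:

```perl
use strict;
use warnings;

my $ref;
{
  my @list = (1, 2, 3);  # a fresh lexical array for this block
  $ref = \@list;         # taking a reference keeps the data alive
}                        # the name @list goes away here; its data does not
print "@$ref\n";         # prints "1 2 3"
```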
But let's get back to deep copying. Here's another example. Let's suppose Fred and Barney are sharing a house:
    $place = {
      Street => "123 Shady Lane",
      City => "Granite Junction",
    };
    $fred = {
      First_name => "Fred",
      Last_name => "Flintstone",
      Address => $place,
    };
    $barney = {
      First_name => "Barney",
      Last_name => "Rubble",
      Address => $place,
    };
Now note that $fred->{Address}{City} is ``Granite Junction'', just as we might expect, as is $barney->{Address}{City}. But we've done a shallow copy from $place into both of the Address element values. This means that there are not two copies of the data, but just one. We can see this when we change one of the values.
Let's let Fred move to his own place:
    $fred->{Address}{Street} = "456 Palm Drive";
    $fred->{Address}{City} = "Bedrock";
Looks safe enough. But what happened to Barney? He moved along with Fred!
    print "@{$barney->{Address}}{qw(Street City)}\n";
This prints Fred's new address! Why did that happen? Once again, the assignment of $place as the address in both cases made a shallow copy: both data structures shared a common pointer to the common street and city data. Again, a deep copy would have helped:
    $place = {
      Street => "123 Shady Lane",
      City => "Granite Junction",
    };
    $fred = {
      First_name => "Fred",
      Last_name => "Flintstone",
      Address => {%$place},
    };
    $barney = {
      First_name => "Barney",
      Last_name => "Rubble",
      Address => {%$place},
    };
    $fred->{Address}{Street} = "456 Palm Drive";
    $fred->{Address}{City} = "Bedrock";
    print "@{$barney->{Address}}{qw(Street City)}\n";
There... now each Address field is a completely disconnected copy, so when we update one, the other stays pure. This works because, just like the [@x] construct, we are creating a new independent anonymous hash and taking a reference to it.
But what if $place were itself a deeper structure? That is, suppose the street address was made up of a number and a name:
    $place = {
      Street => {
        Number => 123,
        Name => "Shady Lane",
      },
      City => "Granite Junction",
    };
    $fred = {
      First_name => "Fred",
      Last_name => "Flintstone",
      Address => {%$place},
    };
    $barney = {
      First_name => "Barney",
      Last_name => "Rubble",
      Address => {%$place},
    };
We've now done something that's not quite a deep copy, but also not quite a shallow copy. Certainly, the hash at $fred->{Address} is different from $barney->{Address}. But they both contain a value that is identical to the $place->{Street} hashref! So if we move Fred just down the street:
    $fred->{Address}{Street}{Number} = 456;
then Barney moves along with him again! Now, we could fix this problem by applying the logic for copying the address one more time to the street structure:
    $fred = {
      First_name => "Fred",
      Last_name => "Flintstone",
      Address => {
        Street => {%{$place->{Street}}},
        City => $place->{City},
      },
    };
But as you can see, it's getting more and more convoluted. And what if we change City to be another structure, or add another level to Street? Bleh.
Fortunately, we can write a general-purpose deep copier as a recursive subroutine. Here's a simple little deep copy routine:
    sub deep_copy {
      my $this = shift;
      if (not ref $this) {
        $this;
      } elsif (ref $this eq "ARRAY") {
        [map deep_copy($_), @$this];
      } elsif (ref $this eq "HASH") {
        +{map { $_ => deep_copy($this->{$_}) } keys %$this};
      } else {
        die "what type is $this?";
      }
    }
This subroutine expects a single item: the top of a tree of hashrefs, arrayrefs, and scalars. If the item is a scalar, it is simply returned, since a shallow copy of a scalar is also a deep copy. If it's an arrayref, we create a new anonymous array from the data. However, each element of this array could itself be a data structure, so we need a deep copy of it. The solution is straightforward: simply call deep_copy on each item. Similarly, a new hashref is constructed by copying each element, including a deep copy of its value. (The hash key is always a simple scalar, so it needs no copy, although that would have been easy enough to add.) To see it work, let's give it some data:
    $place = {
      Street => {
        Number => 123,
        Name => [qw(Shady Lane)],
      },
      City => "Granite Junction",
      Zip => [97007, 4456],
    };
    $place2 = $place;
    $place3 = {%$place};
    $place4 = deep_copy($place);
Hmm. How do we see what we've done, and what's being shared? Let's add a call to the standard library module, Data::Dumper:
    use Data::Dumper;
    $Data::Dumper::Indent = 1;
    print Data::Dumper->Dump(
      [$place, $place2, $place3, $place4],
      [qw(place place2 place3 place4)]
    );
And that generates on my system:
    $place = {
      'City' => 'Granite Junction',
      'Zip' => [
        97007,
        4456
      ],
      'Street' => {
        'Name' => [
          'Shady',
          'Lane'
        ],
        'Number' => 123
      }
    };
    $place2 = $place;
    $place3 = {
      'City' => 'Granite Junction',
      'Zip' => $place->{'Zip'},
      'Street' => $place->{'Street'}
    };
    $place4 = {
      'City' => 'Granite Junction',
      'Zip' => [
        97007,
        4456
      ],
      'Street' => {
        'Number' => 123,
        'Name' => [
          'Shady',
          'Lane'
        ]
      }
    };
Hey, look at that. Data::Dumper let me know that $place2 is a shallow copy of $place, while $place3 is an intermediately copied value; notice the elements of $place inside $place3. And since $place4 contains no previously seen references, we know it's a completely independent deep copy. Success! (The ordering of the hash elements is inconsistent, but that's immaterial and undetectable in normal use.)
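Another way to verify what's shared, instead of eyeballing Data::Dumper output, is to compare the addresses of the references directly. Here's a sketch using refaddr from the standard Scalar::Util module, on a cut-down version of the data:

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my $place  = { Zip => [97007, 4456] };
my $place2 = $place;                         # shallow: the very same hash
my $place3 = {%$place};                      # new hash, but Zip still shared
my $place4 = { Zip => [@{$place->{Zip}}] };  # inner array copied as well

print refaddr($place)        == refaddr($place2),        "\n"; # 1: shared
print refaddr($place->{Zip}) == refaddr($place3->{Zip}), "\n"; # 1: shared
print refaddr($place->{Zip}) == refaddr($place4->{Zip}), "\n"; # empty: independent
```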
Now, this simple deep_copy routine will break if there are recursive data pointers (references that point back to already-seen data higher in the tree). For that, you might look at the dclone method of the Storable module, found in the CPAN.
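Here's a sketch of what that looks like; Storable tracks references it has already seen, so a self-referential structure that would send our deep_copy into an infinite loop clones cleanly:

```perl
use strict;
use warnings;
use Storable qw(dclone);

my $place = { Street => { Number => 123 } };
$place->{Self} = $place;   # a recursive pointer: deep_copy above would loop forever

my $copy = dclone($place);
$copy->{Street}{Number} = 456;

print $place->{Street}{Number}, "\n";  # 123: the original is untouched
print $copy->{Self} == $copy ? "cycle preserved\n" : "cycle broken\n";
```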
So, when you use = with a structure, be sure you know what you are doing. You may need a deep copy instead of that shallow copy. For further information, check out your online documentation with perldoc perldsc and perldoc perllol, and even perldoc perlref and perldoc perlreftut for the basics. Until next time, enjoy!