Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.This text has appeared in an edited form in Perl Journal magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Download this listing! | ||
Perl Journal Column 10 (Mar 2004)
[Suggested title: ``A Better Data::Dumper'']
A few years ago, I stared quite heavily at the source to the core
module Data::Dumper
, enough to make my eyes hurt. I was trying to
``reverse engineer'' the output, so that I could write an undumper that
would evaluate the resulting string of Perl code and get the original
values back, without unleashing the full Perl expression evaluator. I
succeeded in that, although the results were far too slow to be useful
in a practical sense.
However, while I was staring at Data::Dumper
's guts, I noticed that
there seemed to be no provision for noticing that a scalar reference
was a reference to a scalar that existed as the value of another array
or hash element, and thus dumped those values incorrectly. For
example:
use Data::Dumper; $Data::Dumper::Purity = 1; # try your hardest my @values = qw(zero one two three); my $ref_to_element = \$values[1]; my $all = [$ref_to_element, \@values]; print Dumper($all);
which results in:
$VAR1 = [ \'one', [ 'zero', ${$VAR1->[0]}, 'two', 'three' ] ];
The problem is that $VAR1->[0]
is a reference to one copy
of 'one'
, while $VAR->[1]->[1]
is a different copy
of 'one'
, so changing one won't change the other. The link
between the two elements has been severed.
I immediately reported the bug to the Perl developers, but this
three-year-old bug has not yet been fixed. Rather than simply raise
the issue again, I decided it was time to whip out the coding palette,
and provide some reference code that can do references correctly,
especially since the problem also seems to exist in the YAML
library and the Perl debugger's x
function as well. (Only
Storable
seemed to do the right thing: good for them.)
And I'll have to say it was a fun exercise, which I bring to you as [listing one, below]. Because the listing is rather long, I'll focus on some of the key points rather than my usual rambling style.
The goal is simple: write an uneval
routine, such that the sequence
of:
use Data::Stringer qw(uneval); my $string = uneval(@some_list); my @new_list = eval $string;
results in @new_list
being a deep copy of @some_list
, even if
the list contains scalars, references to arrays, references to hashes,
and blessed references of those. And of course, references to the
thing must not result in the thing being copied, but being referenced
instead. For example, the above data gets dumped as:
use Data::Stringer; my @values = qw(zero one two three); my $ref_to_element = \$values[1]; my $all = [$ref_to_element, \@values]; print uneval $all;
which results in the string of:
my (@X806f84, @X810114, @X8133a4); @X806f84 = ('zero', 'one', 'two', 'three'); @X810114 = (00, \@X806f84); @X8133a4 = (\@X810114); $X810114[0] = \$X806f84[1]; @X8133a4;
Although this string isn't quite as pretty as the Data::Dumper
version, it's more accurate. Notice the next-to-last line, which
forces the first element of the result array to be a reference to the
second element of the nested array. That's the crucial piece missing
in the Data::Dumper
version.
The dumping strategy is rather simple-minded, and broken into two main
passes. On the first pass, we walk the supplied list of values,
recursively, creating a symbol table %stab
, declared in line 17.
This is accomplished with a queue of values to be processed in line
31. The %stab
hash will end up being populated with three kinds of
entries. Scalars have a key of $X
followed by a hex address of the
actual symbol table address (as returned by stringifying a reference
to that item). Similarly, arrays and hashes have @X
and %X
followed by the hex address, respectively.
Both arrays and hashes hold the reference to the value as the value in
the %stab
hash. The scalars are a bit different: their value is a
one or three element arrayref. The first element is a reference to
the original scalar value. The second and third elements are
populated when we find a scalar with that address as a value within
an array or hash that we're scanning. The second element is a name
like the keys of %stab
(and should map to an entry when pass 1 is
complete), and the third element is the array index or hash key. This
is the missing piece of Data::Dumper
and friends: the record of
where a scalar might live if not as a separate symbol table location.
The recursion comes about from the core of pass_1_item
, defined in
lines 35 to 68. Each item to be dumped is a reference to a scalar,
array, or hash. Line 38 constructs the appropriate %stab
key using
the ref_to_label
routine. This routine is defined down in lines
202 to 218, and uses overload::StrVal
to ensure that we can extract
an unoverloaded string value for the reference even if the class has a
stringification overload method. $id
is the hex address, usually
beginning in 0x
. Line 208 converts this string into a suitable
identifier component. Lines 209 to 217 sort out the core type (not
considering whether or not the reference is blessed), and return back
a variable name of the appropriate type to hold the value.
Back up in pass_1_item
, we check this string again (line 41) to see
its native type. If it's a scalar, line 42 stores the value (possibly
autovivifying an array ref: thank you Perl!), and move on. If the
scalar value is also a reference, then we need to dump the referenced
scalar, so the reference is pushed onto the working @queue
(line
43).
For an array or hash, things get a bit more complicated, because we
must keep track of any elements in case they are referenced from
somewhere else. The code is similar. First, store the reference into
%stab
(lines 45 or 54), then walk through the values (beginning in
lines 46 and 55). For each element, we take its address and create a
%stab
entry, noting the containing data structure and key or index
used to access the value (lines 49 and 58). And, if the element is a
reference to somewhere, we also add it to the work queue (lines 50 and
59).
Speaking of the work queue, we have to allow for the possibility of mutually recursive and self-recursive data structures:
my @one = qw(won one); my @two = qw(two too to); push @one, \@two; push @two, \@one; my $string = uneval(@one);
As we're scanning @one
, we'll need to follow the reference to
@two
at the end. But when we get to the end of @two
, we don't
want to scan over @one
again. Line 39 handles the duplicate
scanning rejection, by simply refusing to scan any particular scalar,
array, or hash more than once.
The first item dumped is the input parameters. Because the input
parameters need to be dumped as the output, we retain the %stab
key
in line 30 being returned from the first invocation of pass_1_item
at line 66. This particular array name will be the designated output
array as well.
Once pass 1 is complete, every scalar, array, and hash that belongs to
the dumped set has been identified and copied to the virtual symbol
table. To dump the data, we merely need to walk this virtual symbol
table. The pass_2
routine (lines 74 to 85) manages the process.
The steps can be seen as: declaring the variables, initializing those
variables (except for deferred entries), handling the deferred items,
blessing any blessed references, and then evaluating the designated
top-level array as a result.
First, the declarations are dumped, using pass_2_declarations
defined in lines 118 to 126. A single my
construct encloses all
scalar, array, and hash names, except for those scalars that exist as
elements of another array or hash.
Then the bulk of the work comes out of the initialization phase,
starting with pass_2_initializations
defined in lines 128 to 133.
Key-value pairs from %stab
are passed in to
pass_2_initialization
, defined starting in line 142. If it's a
scalar (line 146), it's a simple assignment, unless the value was an
element of a larger data structure, in which case it simply
disappears.
The value for any scalar (variable or element) might be a reference to
an element of an array or hash, however, and this is where
pass_2_value
comes in to help. Looking back to the definition
(starting in line 87), we see that references to scalars are handled
specially. If the reference is to a scalar is an element of an array
or hash already seen, then line 95 will have a 3-element list, setting
$place
and $index
to the actual scalar's location. In that
case, we can't provide a scalar value for this initialization.
Instead, we add a @deferred
element which does the initialization
after all other initializations are complete, and return a 00
value
instead. This double-0 value is just a 0, but gives an indication to
me staring at the output that this value will be replaced during the
deferred stage, just as we saw in the example earlier.
Array and hash initialization works similarly in lines 152 through 165, except that we have to keep track of which element we are looking at, in case the deferred initialization needs to reference an element of a larger structure (as the example earlier did).
Once the core initializations are complete, we go back to dump out the deferred initializations (if any). This patches up all the values that were dumped as ``00'' during the initialization sub-pass, to point at the elements of arrays and elements of hashes as needed.
Then, it's time for a blessing or two, perhaps. Lines 135 to 140 call
pass_2_blessing
for each %stab
entry, defined in lines 173 to
188. If it's a scalar, we need to get the actual element out of the
array ref, noting it's location for the proper blessing if it's also
an array or hash element.
Lines 182 to 187 determine the proper blessed class, getting around any issues with an overloaded stringification once again. And if the value is blessed, the proper blessing is generated in line 184.
The only thing left to describe is quote_scalar
which generates a
nice printable representation of a scalar. The undef
value is a
simple undef
return. Otherwise, if the value is safe as a number,
the number form is preferred. Otherwise, a single-quoted string is
conjured up. I seem to recall that there are numbers that do not
stringify well, but I couldn't figure out how to construct one in time
for this article deadline. But the die
check at the end is the
protection in that case anyway.
So, there it is. A better Data::Dumper
that handles references to
arrays and references to hashes. Of course, the real Data::Dumper
has a lot more bells and whistles, so I hope that the authors of
data-dumping routines will use this code as a model, rather than
hoping that I will replace their code eventually. Until next time,
enjoy!
Listings
=1= package Data::Stringer; =2= =3= use 5.006; =4= use strict; =5= use warnings; =6= =7= require Exporter; =8= =9= our @ISA = qw(Exporter); =10= =11= our @EXPORT = qw(uneval); =12= =13= our $VERSION = '0.01'; =14= =15= require overload; =16= =17= my %stab; =18= =19= ## $stab{'@x0x123456'} = \@thevalue =20= ## $stab{'%x0x123456'} = \%thevalue =21= ## $stab{'$x0x123456'} = [\$thevalue] =22= ## $stab{'$x0x123456'} = [\$thevalue, $aggregate, $index] # for elements =23= =24= BEGIN { =25= =26= my @queue; =27= =28= sub uneval { =29= %stab = @queue = (); =30= my $label = pass_1_item(\@_); # prime the pump =31= pass_1_item(shift @queue) while @queue; # drain the pump =32= return pass_2($label); # dump the result =33= } =34= =35= sub pass_1_item { =36= my $ref = shift; =37= =38= my $label = ref_to_label($ref); =39= return $label if $stab{$label}; # already seen =40= =41= if ($label =~ /^\$/) { # scalar =42= $stab{$label}[0] = $ref; =43= push @queue, $$ref if ref $$ref; =44= } elsif ($label =~ /^\@/) { # array =45= $stab{$label} = $ref; =46= for my $index (0..$#$ref) { =47= for ($ref->[$index]) { # carefully creating alias, not copy =48= my $thislabel = ref_to_label(\$_); =49= $stab{$thislabel} = [\$_, $label, $index]; =50= push @queue, $_ if ref $_; =51= } =52= } =53= } elsif ($label =~ /^%/) { # hash =54= $stab{$label} = $ref; =55= for my $key (keys %$ref) { =56= for ($ref->{$key}) { # carefully creating alias, not copy =57= my $thislabel = ref_to_label(\$_); =58= $stab{$thislabel} = [\$_, $label, $key]; =59= push @queue, $_ if ref $_; =60= } =61= } =62= } else { =63= die "Cannot process $label yet"; =64= } =65= =66= return $label; =67= } =68= } =69= =70= BEGIN { =71= =72= my @deferred; =73= =74= sub pass_2 { =75= my $result_label = shift; =76= =77= @deferred = (); =78= return join("", =79= pass_2_declarations(), =80= pass_2_initializations(), =81= map("$_\n", @deferred), =82= pass_2_blessings(), =83= "$result_label;\n", =84= ); =85= } =86= =87= sub pass_2_value { =88= my $value = shift; =89= my $set_place = shift; =90= my $set_index = shift; =91= =92= if (ref $value) { =93= my $label = ref_to_label($value); =94= if ($label =~ /^\$/) { # it is a scalar, so it might be an element =95= (my ($value, $place, $index) = @{$stab{$label}}) >= 1 or die; =96= if ($place) { =97= if ($place =~ /^[@%]/) { =98= push(@deferred, =99= element_of($set_place, $set_index) . " = \\" . =100= element_of($place, $index) . ";"); =101= return "00"; # placeholder for a deferred action =102= } else { =103= die "dunno place $place"; =104= } =105= } else { =106= return "\\$label"; # no place in particular =107= } =108= } else { =109= return "\\$label"; =110= } =111= } else { =112= return quote_scalar($value); =113= } =114= } =115= =116= } =117= =118= sub pass_2_declarations { =119= return join("", =120= "my (", =121= join(", ", =122= grep { =123= /^[\@%]/ or /^\$/ and not $stab{$_}[1] =124= } keys %stab), =125= ");\n"); =126= } =127= =128= sub pass_2_initializations { =129= return join("", =130= map(pass_2_initialization($_, $stab{$_}), =131= sort keys %stab), =132= ); =133= } =134= =135= sub pass_2_blessings { =136= return join("", =137= map(pass_2_blessing($_, $stab{$_}), =138= sort keys %stab), =139= ); =140= } =141= =142= sub pass_2_initialization { =143= my $label = shift; =144= my $value = shift; =145= =146= if ($label =~ /^\$/) { # scalar =147= if (@$value > 1) { # it's an element: =148= return ""; =149= } else { =150= return "$label = ".pass_2_value(${$value->[0]}).";\n"; =151= } =152= } elsif ($label =~ /^\@/) { # array =153= return "$label = (".join(", ", =154= map { =155= pass_2_value($value->[$_], $label, $_); =156= } 0..$#$value, =157= ).");\n"; =158= } elsif ($label =~ /^%/) { # hash =159= return "$label = (".join(", ", =160= map { =161= pass_2_value($_) . =162= " => " . =163= pass_2_value($value->{$_}, $label, $_); =164= } keys %$value, =165= ).");\n"; =166= } else { =167= die "Cannot process $label yet"; =168= } =169= =170= } =171= =172= =173= sub pass_2_blessing { =174= my $label = shift; =175= my $value = shift; =176= =177= ## get to the proper location of an element for scalars =178= if ($label =~ /^\$/) { =179= $label = element_of($value->[1], $value->[2]) if @$value > 1; =180= $value = $value->[0]; =181= } =182= my ($package) = overload::StrVal($value) =~ /^(.*)=/; =183= if (defined $package) { # it's blessed =184= return "bless \\$label, ".quote_scalar($package), ";\n"; =185= } else { =186= return ""; =187= } =188= } =189= =190= sub element_of { =191= my $label = shift; =192= my $index = shift; =193= if ($label =~ s/^\@/\$/) { =194= return "$label\[".quote_scalar($index)."\]"; =195= } elsif ($label =~ s/^%/\$/) { =196= return "$label\{".quote_scalar($index)."\}"; =197= } else { =198= die "Cannot take element_of($label, $index)"; =199= } =200= } =201= =202= sub ref_to_label { =203= my $ref = shift; =204= =205= ## eventually do something with $realpack =206= my ($realpack, $realtype, $id) = =207= (overload::StrVal($ref) =~ /^(?:(.*)\=)?([^=]*)\(([^\(]*)\)$/) or die; =208= s/^0x/X/ or s/^/X/ for $id; =209= if ($realtype eq "SCALAR" or $realtype eq "REF") { =210= return "\$$id"; =211= } elsif ($realtype eq "ARRAY") { =212= return "\@$id"; =213= } elsif ($realtype eq "HASH") { =214= return "%$id"; =215= } else { =216= die "dunno $ref => $realpack $realtype $id"; =217= } =218= } =219= =220= sub quote_scalar { =221= local $_ = shift; =222= if (!defined($_)) { =223= return "undef"; =224= } =225= { =226= no warnings; =227= if ($_ + 0 eq $_) { # safe as a number... =228= return $_; =229= } =230= if ("$_" == $_) { # safe as a string... =231= s/([\\\'])/\\$1/g; =232= return '\'' . $_ . '\''; =233= } =234= } =235= die "$_ is not safe as either a number or a string"; =236= } =237= =238= 1;