Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Perl Journal magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Perl Journal Column 10 (Mar 2004)

[Suggested title: ``A Better Data::Dumper'']

A few years ago, I stared quite heavily at the source to the core module Data::Dumper, enough to make my eyes hurt. I was trying to ``reverse engineer'' the output, so that I could write an undumper that would evaluate the resulting string of Perl code and get the original values back, without unleashing the full Perl expression evaluator. I succeeded in that, although the results were far too slow to be useful in a practical sense.

However, while I was staring at Data::Dumper's guts, I noticed that there seemed to be no provision for noticing that a scalar reference was a reference to a scalar that existed as the value of another array or hash element, and thus dumped those values incorrectly. For example:

    use Data::Dumper;
    $Data::Dumper::Purity = 1; # try your hardest
    my @values = qw(zero one two three);
    my $ref_to_element = \$values[1];
    my $all = [$ref_to_element, \@values];
    print Dumper($all);

which results in:

    $VAR1 = [
              \'one',
              [
                'zero',
                ${$VAR1->[0]},
                'two',
                'three'
              ]
            ];

The problem is that $VAR1->[0] is a reference to one copy of 'one', while $VAR->[1]->[1] is a different copy of 'one', so changing one won't change the other. The link between the two elements has been severed.

I immediately reported the bug to the Perl developers, but this three-year-old bug has not yet been fixed. Rather than simply raise the issue again, I decided it was time to whip out the coding palette, and provide some reference code that can do references correctly, especially since the problem also seems to exist in the YAML library and the Perl debugger's x function as well. (Only Storable seemed to do the right thing: good for them.)

And I'll have to say it was a fun exercise, which I bring to you as [listing one, below]. Because the listing is rather long, I'll focus on some of the key points rather than my usual rambling style.

The goal is simple: write an uneval routine, such that the sequence of:

    use Data::Stringer qw(uneval);
    my $string = uneval(@some_list);
    my @new_list = eval $string;

results in @new_list being a deep copy of @some_list, even if the list contains scalars, references to arrays, references to hashes, and blessed references of those. And of course, references to the thing must not result in the thing being copied, but being referenced instead. For example, the above data gets dumped as:

    use Data::Stringer;
    my @values = qw(zero one two three);
    my $ref_to_element = \$values[1];
    my $all = [$ref_to_element, \@values];
    print uneval $all;

which results in the string of:

    my (@X806f84, @X810114, @X8133a4);
    @X806f84 = ('zero', 'one', 'two', 'three');
    @X810114 = (00, \@X806f84);
    @X8133a4 = (\@X810114);
    $X810114[0] = \$X806f84[1];
    @X8133a4;

Although this string isn't quite as pretty as the Data::Dumper version, it's more accurate. Notice the next-to-last line, which forces the first element of the result array to be a reference to the second element of the nested array. That's the crucial piece missing in the Data::Dumper version.

The dumping strategy is rather simple-minded, and broken into two main passes. On the first pass, we walk the supplied list of values, recursively, creating a symbol table %stab, declared in line 17. This is accomplished with a queue of values to be processed in line 31. The %stab hash will end up being populated with three kinds of entries. Scalars have a key of $X followed by a hex address of the actual symbol table address (as returned by stringifying a reference to that item). Similarly, arrays and hashes have @X and %X followed by the hex address, respectively.

Both arrays and hashes hold the reference to the value as the value in the %stab hash. The scalars are a bit different: their value is a one or three element arrayref. The first element is a reference to the original scalar value. The second and third elements are populated when we find a scalar with that address as a value within an array or hash that we're scanning. The second element is a name like the keys of %stab (and should map to an entry when pass 1 is complete), and the third element is the array index or hash key. This is the missing piece of Data::Dumper and friends: the record of where a scalar might live if not as a separate symbol table location.

The recursion comes about from the core of pass_1_item, defined in lines 35 to 68. Each item to be dumped is a reference to a scalar, array, or hash. Line 38 constructs the appropriate %stab key using the ref_to_label routine. This routine is defined down in lines 202 to 218, and uses overload::StrVal to ensure that we can extract an unoverloaded string value for the reference even if the class has a stringification overload method. $id is the hex address, usually beginning in 0x. Line 208 converts this string into a suitable identifier component. Lines 209 to 217 sort out the core type (not considering whether or not the reference is blessed), and return back a variable name of the appropriate type to hold the value.

Back up in pass_1_item, we check this string again (line 41) to see its native type. If it's a scalar, line 42 stores the value (possibly autovivifying an array ref: thank you Perl!), and move on. If the scalar value is also a reference, then we need to dump the referenced scalar, so the reference is pushed onto the working @queue (line 43).

For an array or hash, things get a bit more complicated, because we must keep track of any elements in case they are referenced from somewhere else. The code is similar. First, store the reference into %stab (lines 45 or 54), then walk through the values (beginning in lines 46 and 55). For each element, we take its address and create a %stab entry, noting the containing data structure and key or index used to access the value (lines 49 and 58). And, if the element is a reference to somewhere, we also add it to the work queue (lines 50 and 59).

Speaking of the work queue, we have to allow for the possibility of mutually recursive and self-recursive data structures:

  my @one = qw(won one);
  my @two = qw(two too to);
  push @one, \@two;
  push @two, \@one;
  my $string = uneval(@one);

As we're scanning @one, we'll need to follow the reference to @two at the end. But when we get to the end of @two, we don't want to scan over @one again. Line 39 handles the duplicate scanning rejection, by simply refusing to scan any particular scalar, array, or hash more than once.

The first item dumped is the input parameters. Because the input parameters need to be dumped as the output, we retain the %stab key in line 30 being returned from the first invocation of pass_1_item at line 66. This particular array name will be the designated output array as well.

Once pass 1 is complete, every scalar, array, and hash that belongs to the dumped set has been identified and copied to the virtual symbol table. To dump the data, we merely need to walk this virtual symbol table. The pass_2 routine (lines 74 to 85) manages the process. The steps can be seen as: declaring the variables, initializing those variables (except for deferred entries), handling the deferred items, blessing any blessed references, and then evaluating the designated top-level array as a result.

First, the declarations are dumped, using pass_2_declarations defined in lines 118 to 126. A single my construct encloses all scalar, array, and hash names, except for those scalars that exist as elements of another array or hash.

Then the bulk of the work comes out of the initialization phase, starting with pass_2_initializations defined in lines 128 to 133. Key-value pairs from %stab are passed in to pass_2_initialization, defined starting in line 142. If it's a scalar (line 146), it's a simple assignment, unless the value was an element of a larger data structure, in which case it simply disappears.

The value for any scalar (variable or element) might be a reference to an element of an array or hash, however, and this is where pass_2_value comes in to help. Looking back to the definition (starting in line 87), we see that references to scalars are handled specially. If the reference is to a scalar is an element of an array or hash already seen, then line 95 will have a 3-element list, setting $place and $index to the actual scalar's location. In that case, we can't provide a scalar value for this initialization. Instead, we add a @deferred element which does the initialization after all other initializations are complete, and return a 00 value instead. This double-0 value is just a 0, but gives an indication to me staring at the output that this value will be replaced during the deferred stage, just as we saw in the example earlier.

Array and hash initialization works similarly in lines 152 through 165, except that we have to keep track of which element we are looking at, in case the deferred initialization needs to reference an element of a larger structure (as the example earlier did).

Once the core initializations are complete, we go back to dump out the deferred initializations (if any). This patches up all the values that were dumped as ``00'' during the initialization sub-pass, to point at the elements of arrays and elements of hashes as needed.

Then, it's time for a blessing or two, perhaps. Lines 135 to 140 call pass_2_blessing for each %stab entry, defined in lines 173 to 188. If it's a scalar, we need to get the actual element out of the array ref, noting it's location for the proper blessing if it's also an array or hash element.

Lines 182 to 187 determine the proper blessed class, getting around any issues with an overloaded stringification once again. And if the value is blessed, the proper blessing is generated in line 184.

The only thing left to describe is quote_scalar which generates a nice printable representation of a scalar. The undef value is a simple undef return. Otherwise, if the value is safe as a number, the number form is preferred. Otherwise, a single-quoted string is conjured up. I seem to recall that there are numbers that do not stringify well, but I couldn't figure out how to construct one in time for this article deadline. But the die check at the end is the protection in that case anyway.

So, there it is. A better Data::Dumper that handles references to arrays and references to hashes. Of course, the real Data::Dumper has a lot more bells and whistles, so I hope that the authors of data-dumping routines will use this code as a model, rather than hoping that I will replace their code eventually. Until next time, enjoy!

Listings

        =1=     package Data::Stringer;
        =2=     
        =3=     use 5.006;
        =4=     use strict;
        =5=     use warnings;
        =6=     
        =7=     require Exporter;
        =8=     
        =9=     our @ISA = qw(Exporter);
        =10=    
        =11=    our @EXPORT = qw(uneval);
        =12=    
        =13=    our $VERSION = '0.01';
        =14=    
        =15=    require overload;
        =16=    
        =17=    my %stab;
        =18=    
        =19=    ## $stab{'@x0x123456'} = \@thevalue
        =20=    ## $stab{'%x0x123456'} = \%thevalue
        =21=    ## $stab{'$x0x123456'} = [\$thevalue]
        =22=    ## $stab{'$x0x123456'} = [\$thevalue, $aggregate, $index] # for elements
        =23=    
        =24=    BEGIN {
        =25=    
        =26=      my @queue;
        =27=    
        =28=      sub uneval {
        =29=        %stab = @queue = ();
        =30=        my $label = pass_1_item(\@_);       # prime the pump
        =31=        pass_1_item(shift @queue) while @queue; # drain the pump
        =32=        return pass_2($label);                # dump the result
        =33=      }
        =34=    
        =35=      sub pass_1_item {
        =36=        my $ref = shift;
        =37=    
        =38=        my $label = ref_to_label($ref);
        =39=        return $label if $stab{$label}; # already seen
        =40=    
        =41=        if ($label =~ /^\$/) {      # scalar
        =42=          $stab{$label}[0] = $ref;
        =43=          push @queue, $$ref if ref $$ref;
        =44=        } elsif ($label =~ /^\@/) { # array
        =45=          $stab{$label} = $ref;
        =46=          for my $index (0..$#$ref) {
        =47=            for ($ref->[$index]) {  # carefully creating alias, not copy
        =48=              my $thislabel = ref_to_label(\$_);
        =49=              $stab{$thislabel} = [\$_, $label, $index];
        =50=              push @queue, $_ if ref $_;
        =51=            }
        =52=          }
        =53=        } elsif ($label =~ /^%/) {  # hash
        =54=          $stab{$label} = $ref;
        =55=          for my $key (keys %$ref) {
        =56=            for ($ref->{$key}) {    # carefully creating alias, not copy
        =57=              my $thislabel = ref_to_label(\$_);
        =58=              $stab{$thislabel} = [\$_, $label, $key];
        =59=              push @queue, $_ if ref $_;
        =60=            }
        =61=          }
        =62=        } else {
        =63=          die "Cannot process $label yet";
        =64=        }
        =65=    
        =66=        return $label;
        =67=      }
        =68=    }
        =69=    
        =70=    BEGIN {
        =71=    
        =72=      my @deferred;
        =73=    
        =74=      sub pass_2 {
        =75=        my $result_label = shift;
        =76=    
        =77=        @deferred = ();
        =78=        return join("",
        =79=                    pass_2_declarations(),
        =80=                    pass_2_initializations(),
        =81=                    map("$_\n", @deferred),
        =82=                    pass_2_blessings(),
        =83=                    "$result_label;\n",
        =84=                   );
        =85=      }
        =86=    
        =87=      sub pass_2_value {
        =88=        my $value = shift;
        =89=        my $set_place = shift;
        =90=        my $set_index = shift;
        =91=    
        =92=        if (ref $value) {
        =93=          my $label = ref_to_label($value);
        =94=          if ($label =~ /^\$/) { # it is a scalar, so it might be an element
        =95=            (my ($value, $place, $index) = @{$stab{$label}}) >= 1 or die;
        =96=            if ($place) {
        =97=              if ($place =~ /^[@%]/) {
        =98=                push(@deferred,
        =99=                     element_of($set_place, $set_index) . " = \\" .
        =100=                    element_of($place, $index) . ";");
        =101=               return "00";        # placeholder for a deferred action
        =102=             } else {
        =103=               die "dunno place $place";
        =104=             }
        =105=           } else {
        =106=             return "\\$label";    # no place in particular
        =107=           }
        =108=         } else {
        =109=           return "\\$label";
        =110=         }
        =111=       } else {
        =112=         return quote_scalar($value);
        =113=       }
        =114=     }
        =115=   
        =116=   }
        =117=   
        =118=   sub pass_2_declarations {
        =119=     return join("",
        =120=                 "my (",
        =121=                 join(", ",
        =122=                      grep {
        =123=                        /^[\@%]/ or /^\$/ and not $stab{$_}[1]
        =124=                      } keys %stab),
        =125=                 ");\n");
        =126=   }
        =127=   
        =128=   sub pass_2_initializations {
        =129=     return join("",
        =130=                 map(pass_2_initialization($_, $stab{$_}),
        =131=                     sort keys %stab),
        =132=                );
        =133=   }
        =134=   
        =135=   sub pass_2_blessings {
        =136=     return join("",
        =137=                 map(pass_2_blessing($_, $stab{$_}),
        =138=                     sort keys %stab),
        =139=                );
        =140=   }
        =141=   
        =142=   sub pass_2_initialization {
        =143=     my $label = shift;
        =144=     my $value = shift;
        =145=   
        =146=     if ($label =~ /^\$/) {        # scalar
        =147=       if (@$value > 1) {          # it's an element:
        =148=         return "";
        =149=       } else {
        =150=         return "$label = ".pass_2_value(${$value->[0]}).";\n";
        =151=       }
        =152=     } elsif ($label =~ /^\@/) {   # array
        =153=       return "$label = (".join(", ",
        =154=                                map {
        =155=                                  pass_2_value($value->[$_], $label, $_);
        =156=                                } 0..$#$value,
        =157=                               ).");\n";
        =158=     } elsif ($label =~ /^%/) {    # hash
        =159=       return "$label = (".join(", ",
        =160=                                map {
        =161=                                  pass_2_value($_) .
        =162=                                    " => " .
        =163=                                      pass_2_value($value->{$_}, $label, $_);
        =164=                                } keys %$value,
        =165=                               ).");\n";
        =166=     } else {
        =167=       die "Cannot process $label yet";
        =168=     }
        =169=   
        =170=   }
        =171=   
        =172=   
        =173=   sub pass_2_blessing {
        =174=     my $label = shift;
        =175=     my $value = shift;
        =176=   
        =177=     ## get to the proper location of an element for scalars
        =178=     if ($label =~ /^\$/) {
        =179=       $label = element_of($value->[1], $value->[2]) if @$value > 1;
        =180=       $value = $value->[0];
        =181=     }
        =182=     my ($package) = overload::StrVal($value) =~ /^(.*)=/;
        =183=     if (defined $package) {       # it's blessed
        =184=       return "bless \\$label, ".quote_scalar($package), ";\n";
        =185=     } else {
        =186=       return "";
        =187=     }
        =188=   }
        =189=   
        =190=   sub element_of {
        =191=     my $label = shift;
        =192=     my $index = shift;
        =193=     if ($label =~ s/^\@/\$/) {
        =194=       return "$label\[".quote_scalar($index)."\]";
        =195=     } elsif ($label =~ s/^%/\$/) {
        =196=       return "$label\{".quote_scalar($index)."\}";
        =197=     } else {
        =198=       die "Cannot take element_of($label, $index)";
        =199=     }
        =200=   }
        =201=   
        =202=   sub ref_to_label {
        =203=     my $ref = shift;
        =204=   
        =205=     ## eventually do something with $realpack
        =206=     my ($realpack, $realtype, $id) =
        =207=       (overload::StrVal($ref) =~ /^(?:(.*)\=)?([^=]*)\(([^\(]*)\)$/) or die;
        =208=     s/^0x/X/ or s/^/X/ for $id;
        =209=     if ($realtype eq "SCALAR" or $realtype eq "REF") {
        =210=       return "\$$id";
        =211=     } elsif ($realtype eq "ARRAY") {
        =212=       return "\@$id";
        =213=     } elsif ($realtype eq "HASH") {
        =214=       return "%$id";
        =215=     } else {
        =216=       die "dunno $ref => $realpack $realtype $id";
        =217=     }
        =218=   }
        =219=   
        =220=   sub quote_scalar {
        =221=     local $_ = shift;
        =222=     if (!defined($_)) {
        =223=       return "undef";
        =224=     }
        =225=     {
        =226=       no warnings;
        =227=       if ($_ + 0 eq $_) {         # safe as a number...
        =228=         return $_;
        =229=       }
        =230=       if ("$_" == $_) {           # safe as a string...
        =231=         s/([\\\'])/\\$1/g;
        =232=         return '\'' . $_ .  '\'';
        =233=       }
        =234=     }
        =235=     die "$_ is not safe as either a number or a string";
        =236=   }
        =237=   
        =238=   1;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.