Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 48 (May 2003)

A Perl program alters the outside world in some manner. Otherwise, there'd be no point in running the program. But sometimes, our Perl programs need a little memory to do their job: something that persists from one invocation to the next. How do we keep such values around?

If the value is simple enough, we can just write it out to a file:

  my $MEMORY = "memory-file";
  ... at beginning of program ...
  open M, "<$MEMORY" or die "Cannot open $MEMORY for reading: $!";
  { local $/; $value = <M> }
  close M;
  ... at end of program ...
  open M, ">$MEMORY" or die "Cannot open $MEMORY for writing: $!";
  print M $value;
  close M;

But there are a few problems with this technique.

First, the value has to be a simple scalar. That means it's not very interesting, and you really don't want to try to scale this up to multiple values by using separate files.

Second, because we're writing the string version of the value to a file, we'll run into slight problems storing floating point numbers accurately: a binary floating point value does not always survive a round trip through its decimal string representation.
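A quick sketch of that precision loss (the exact digits depend on your platform's floating point format, so take the comments as typical rather than guaranteed):

```perl
use strict;
use warnings;

# Perl stringifies floats with roughly 15 significant digits, but a
# 64-bit double carries closer to 17.  So a value can change when it
# makes a round trip through its string form.
my $x      = 1/3;
my $string = "$x";             # typically "0.333333333333333"

if ($string == $x) {
    print "round trip preserved the value\n";
} else {
    print "round trip lost precision\n";
}
```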

Third, this technique breaks down if there are multiple instances of the program using the data. For example, two instances might both read the value, update it, and then write their respective new values back to the file. The last one to write wins, and no trace remains of the other invocation's update. And for a brief moment between opening the file for writing and closing the filehandle, the file is empty (or holds only part of the data), so any other instance reading at just the wrong time will get wrong results.
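The partial-data window, at least, has a well-known cure that the code above doesn't show: write the new contents to a temporary file, then rename it into place. A sketch (the temporary-file naming scheme here is my own choice):

```perl
use strict;
use warnings;

my $MEMORY = "memory-file";
my $value  = "new contents\n";

# rename() within a single filesystem is atomic, so any reader sees
# either the complete old file or the complete new one, never a
# half-written file.
my $tmp = "$MEMORY.tmp.$$";           # $$ is our process ID
open my $out, ">", $tmp or die "Cannot open $tmp for writing: $!";
print $out $value;
close $out           or die "Cannot close $tmp: $!";
rename $tmp, $MEMORY or die "Cannot rename $tmp to $MEMORY: $!";
```

Note that this fixes only the partial-write problem; two updaters can still silently overwrite each other's changes unless they also coordinate, with something like flock.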

Just to keep it simple, let's ignore the multiple reader/writer problem for the moment. What can we do to save and restore complex data?

The oldest method of storing complex Perl data that is still in use today is the Data::Dumper module, which has enough power to convert a nearly arbitrary data structure into the Perl code that it would take to recreate the data. The usage is rather simple:

  my $MEMORY = "memory-file";
  ... assign values to $complex_variable ...
  use Data::Dumper;
  open M, ">$MEMORY" or die;
  print M Data::Dumper->Dump([$complex_variable], ['$complex_variable']);
  close M;

Then later to restore the value, it's simply:

  do $MEMORY;

which will recreate the value of $complex_variable into a package variable with the same name. We have to use a package variable because a lexical variable defined in the $MEMORY file would not persist beyond the do operation.
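Putting the two halves together, a complete round trip looks something like this (the sample data is my own invention):

```perl
use strict;
use warnings;
use Data::Dumper;

my $MEMORY = "memory-file";

# The package variable that do() will repopulate for us later.
our $complex_variable = { name => "fred", scores => [ 90, 80 ] };

# Save: write Perl source code that recreates the structure.
open my $out, ">", $MEMORY or die "Cannot open $MEMORY for writing: $!";
print $out Data::Dumper->Dump([$complex_variable], ['complex_variable']);
close $out;

# Restore: do() compiles and runs the file, which assigns to the
# package variable $main::complex_variable.
undef $complex_variable;
do $MEMORY or die "Cannot restore from $MEMORY: $@";
print $complex_variable->{name}, "\n";    # prints "fred"
```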

The data within the $MEMORY file is reasonably human readable. In fact, one of the most common uses of Data::Dumper is to dump the data for people to interpret the values.

While Data::Dumper is powerful and can reconstruct nearly any complex data structure, it is also limited by its design.

First, because complete Perl code is used to recreate the data, we must invoke a full Perl parser on the file to restore it. This can be slow in some cases, as well as a security risk. (I addressed the security aspect of this problem in the Oct 2001 column.)

Second, because a dump is Perl code, it cannot easily be shared across languages: a Python, Ruby, Java, or C program has no simple way to read it.

Third, a dump is not really conducive to human editing for, say, a configuration file, any more than editing any other Perl code would be.

To address these issues, Brian Ingerson (of Inline fame), along with Clark Evans and Oren Ben-Kiki, created ``Yet Another Markup Language'', also known as YAML. A YAML file contains a serialization of complex data, similar to the Data::Dumper file. However, instead of using Perl syntax constructs to delimit the data and define its structure, YAML uses a simple-to-parse set of punctuation and indentation to define the data. This structure is rich enough to represent arrays, hashes, and even blessed objects.

Because the YAML markup is not Perl code, there's no potential security problem with the execution of arbitrary code. Also, YAML handlers have already been written for Python, Ruby, Java, and C, as well as Perl. A complex data structure can be constructed in Ruby, and then YAML-stored and read into a Perl program, modified, and then written back out for a Python program to read!

The simplest interface to YAML from the Perl world is the YAML module. We can drop it in place of the Data::Dumper version quite simply:

  my $MEMORY = "memory-file";
  ... assign values to $complex_variable ...
  use YAML qw(Dump);
  open M, ">$MEMORY" or die;
  print M Dump($complex_variable);
  close M;

But it's a bit simpler to use the DumpFile interface:

  use YAML qw(DumpFile);
  ... assign values to $complex_variable ...
  DumpFile($MEMORY, $complex_variable);

And restoring it is nearly as easy:

  use YAML qw(LoadFile);
  my ($complex_variable) = LoadFile($MEMORY);

The resulting YAML file is human readable and somewhat human editable. It's actually a quite nice design.
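To see what that readability means, here's a small structure and roughly the YAML text it produces; this assumes the YAML module from the CPAN is installed, and the exact layout may vary slightly by version:

```perl
use strict;
use warnings;
use YAML qw(Dump Load);

my $data = {
    name   => "fred",
    scores => [ 90, 80, 100 ],
};

print Dump($data);
# Prints something like:
#   ---
#   name: fred
#   scores:
#     - 90
#     - 80
#     - 100

# Load() reverses the process exactly.
my $copy = Load(Dump($data));
print $copy->{scores}[2], "\n";    # prints "100"
```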

One disadvantage of YAML is that YAML is not included in the Perl core distribution. You must install YAML from the CPAN.

The YAML interface still has the disadvantage that floating point numbers are not precisely represented. And some overhead is added to a complex data structure so that the structure can be recognized by humans. To solve these two issues, let's look at Yet Another Way to store complex data: the Storable module.

Like Data::Dumper, Storable is included as part of the core Perl installation for recent versions of Perl. (Older versions of Perl should either be upgraded, or you can install Storable from the CPAN.)

But unlike Data::Dumper and YAML, the Storable interface produces a serialization that is not intended to be read by humans. Instead, it is a compact byte stream that accurately records scalars (both strings and numbers), complex data structures, and blessed objects. The usage is similar to YAML:

  use Storable qw(store retrieve);
  ... change $complex_variable ...
  store $complex_variable, $MEMORY;
  ... later ...
  $complex_variable = retrieve $MEMORY;

Note that the data is written with some sensitivity to the endianness of the processor architecture running the code. At a slight speed penalty, you can replace store with nstore, which writes the data in an architecture independent manner. And retrieve is smart enough to recognize that this has happened.
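A sketch of the portable variant (the sample data here is my own):

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

my $MEMORY = "memory-file";
my $complex_variable = { counts => [ 1, 2, 3 ], label => "demo" };

# nstore writes in network byte order, readable on any architecture.
nstore $complex_variable, $MEMORY;

# retrieve notices the portable format on its own; no flag needed.
my $copy = retrieve $MEMORY;
print $copy->{label}, "\n";    # prints "demo"
```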

So far, we've been storing an entire complex value in the entire data file. This requires that any changes made to the data must rewrite the entire file. There are times when we know that we will be predominantly accessing and updating only a portion of the data. For these cases, we can segment the data using the DBM mechanism.

The DBM mechanism essentially puts a hash out on disk. In its simplest form (which has worked all the way back to versions of Perl that existed long before the web), we can associate a hash with a diskfile using dbmopen:

  dbmopen %db, $MEMORY, 0644 or die "Cannot tie $MEMORY: $!";

From this point until the end of the program, any access to the %db hash is mapped into DBM calls against the disk file. Because it's a hash, we can update arbitrary keys with new values, remove key/value pairs, and even iterate over the entire hash.
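For example, a tiny run counter (the key names are my own invention):

```perl
use strict;
use warnings;

my $MEMORY = "memory-file";

my %db;
dbmopen %db, $MEMORY, 0644 or die "Cannot open $MEMORY: $!";

$db{runs} = ($db{runs} || 0) + 1;    # update one key in place
$db{last_run} = scalar localtime;    # add or replace another
delete $db{obsolete};                # remove a key/value pair

while (my ($key, $value) = each %db) {    # iterate the whole hash
    print "$key => $value\n";
}

dbmclose %db;
```

Each run touches only the keys it changes; the rest of the file stays put on disk.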

The DBM hash is implemented using the tie mechanism, using one of the many DBM modules, as listed in the AnyDBM_File manpage. The dbmopen finds a suitable DBM implementation, then ties the variable to the appropriate interface. Depending on the DBM implementation, you may end up with a file with a name that ends in .db, or a pair of files that end in .dir and .pag. Also, some implementations have a limited key or limited key/value pair size, as small as 1024 bytes.

For example, if SDBM is the best DBM implementation available to your Perl, the dbmopen call above is automatically translated into:

  use Fcntl;
  use SDBM_File;
  tie %db, SDBM_File, $MEMORY, O_RDWR|O_CREAT, 0644 or die "...";

But unless you need to tweak some of these settings, the dbmopen call is often easier to type.

Using a DBM allows us to store a hash onto the disk. But the hash must have simple scalars for its keys and values, and cannot contain nested complex structures.

But if the values of a hash can be arbitrary scalars, couldn't we take a complex data structure, serialize it, and then store that value as an element of the hash stored within a DBM? Certainly. But before you scurry off to write the code, you might want to look at the MLDBM module, available from the CPAN. This module does precisely that.

We use the tie interface again, replacing the SDBM_File module with MLDBM:

  use Fcntl;
  use MLDBM;
  tie %db, MLDBM, $MEMORY, O_RDWR|O_CREAT, 0644 or die "...";

Every time an assignment is made to an element of %db, the scalar is serialized using Data::Dumper (by default). If it's a simple scalar, then nothing much happens. But if it's a reference to a complex data structure, we get a single string that recreates that data structure, and that string is stored into the DBM.

When a value is pulled from an element of %db, the process is reversed: the value is eval'ed, resulting in a complex data structure in memory.

Because Data::Dumper is a bit slow, a bit dangerous, and a bit noisy, a better choice for the serializer is Storable. We can get that by replacing the use line above with:

  use MLDBM qw(Storable);

and now we use Storable and not Data::Dumper. We can also pick a specific DBM module with:

  use MLDBM qw(DB_File Storable);

It's important to note with MLDBM that the DBM is updated only when an element of the hash is written. If you assign a complex data structure as an element of the hash, and then later update a part of that complex data structure, the change is not reflected in the DBM! The MLDBM manpage gives some examples of what to do (and not do) to make this work best.
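The safe pattern is a read-modify-write of the whole value. A sketch, assuming MLDBM from the CPAN with SDBM_File and Storable underneath (the key and field names are my own):

```perl
use strict;
use warnings;
use Fcntl;
use MLDBM qw(SDBM_File Storable);   # DBM first, serializer second

my $MEMORY = "memory-file";

my %db;
tie %db, 'MLDBM', $MEMORY, O_RDWR|O_CREAT, 0644 or die "Cannot tie: $!";

## WRONG: updates only an in-memory copy; the DBM never hears of it.
# $db{fred}{age} = 30;

## RIGHT: fetch the whole value, change it, store the whole value back.
my $entry = $db{fred} || {};
$entry->{age} = 30;
$db{fred} = $entry;      # this assignment triggers the serialization

untie %db;
```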

Well, hopefully that'll get you started on your persistent data quest. If these means are not sophisticated enough for you, be sure to check the CPAN for cool tools like Tie::DBI (linking your tied hash to a DBI table), SPOPS and Tangram (mapping complex objects directly to a collection of tables), Attribute::Persistence (for a dirt-easy simple persistent interface), and Inline::Files (for a novel rewrite-your-program persistent storage). Until next time, enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.