Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 81 (May 2006)

[Suggested title: ``Packing it in'']

We dropped a few things between the second and third editions of the wildly popular Learning Perl book, simply because we wanted to make room for a few more relevant topics. As Perl matured, Perl users migrated from being primarily system administrators to authoring complete mission-critical applications, such as some of the code behind many of the websites that you visit frequently. One of the sections that didn't make the cut was the section on the confusing-but-important pack and unpack functions. Let's take a look at this small corner of Perl functionality.

The pack function's primary job is to turn Perl-manageable data (numbers and strings) into a sequence of bits that might make sense to some external application. And unpack generally goes the other direction: take a bag-o-bits handed to us by some hostile real-world interface, and turn it into nice strings and numbers for further processing.

The options for pack and unpack are dizzying. In fact, as I was researching this article, I realized that I hadn't read the documentation for a few Perl releases, and it seems that they've snuck in about twice as many features as when I last looked. Too bad a pack format isn't quite Turing-complete, although I'm happy they aren't self-aware, as regular expressions seem to have become.

The easiest way to get into how pack and unpack work is to dive right into it. Take, for example, packing a character:

  my $string = pack "CC", 65, 66;

If we print $string, we see an uppercase A followed by an uppercase B, presuming a nice ASCII environment, and not something odd like EBCDIC. This pack invocation works similarly to sprintf: the first argument is a template, which defines how to interpret the remaining arguments, and the function returns the result.

In this case, the template consists of two C characters, each of which denotes an unsigned character. For each one of these characters, we'll take the next element from the arguments (first 65, then 66), and ``pack'' them into the result. If we put a 65 value into an ASCII byte, we get a capital letter A, just as if we had said chr(65), and so that's the first byte of the result.

We can continue this:

  my $string = pack "CCCC", 65, 66, 67, 68;

but when we have the same item repeated like this, we can add a shortcut:

  my $string = pack "C4", 65, 66, 67, 68;

For many of the formatting letters, a trailing numeric value means ``repeat this many times''. (Some of the others interpret it as a width, but we'll get to that in a moment.)

As with sprintf, if we ask for more values than we have, we get 0 padding. And if we don't use up everything, those extra values are simply ignored. To keep from having to count the number of elements, a special value of * means ``as many as you need'':

  my $string = pack "C*", @some_numbers;
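To see this in action, here's a minimal sketch. The contents of @some_numbers are sample values I picked purely for illustration. It also checks that packing with C gives the same result as calling chr on each number, and that leftover values are quietly dropped:

```perl
use strict;
use warnings;

# Sample byte values, chosen only for illustration.
my @some_numbers = (80, 101, 114, 108);

# "C*" packs every number in the list as one unsigned byte.
my $string = pack "C*", @some_numbers;

# The same string, built one character at a time with chr.
my $same = join "", map { chr } @some_numbers;

# "C2" consumes only the first two values; the rest are ignored.
my $short = pack "C2", @some_numbers;

print "$string\n";
print "$short\n";
print $string eq $same ? "same\n" : "different\n";
```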

We use unpack to go the other way, from a string to a list of individual numbers:

  my @numbers = unpack "C*", "Hello world!";

In this case, I end up with a series of 12 numbers beginning with 72 and ending with 33, being the ASCII values of each character of the string. If I want to skip the first two characters, I can use an ``x'' to skip a byte, and either ``xx'' or ``x2'' to skip two bytes:

  my @part = unpack "x2 C*", "Hello world!";

Now I get the values starting with the third character. What if I wanted only every other character? Like a regular expression, I can group parts of the format in parentheses (which can be nested):

  print pack "C*", unpack "(x C)*", "Hello world!";

And now I've picked out every other character, resulting in el ol!. Had I swapped the x and C, I'd get Hlowrd instead.

Note the use of whitespace in the last two formats. Whitespace can be introduced between format constructs for clarity. In fact, you can even add Perl-style comments, beginning with a pound-sign and terminated by a newline.
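As a rough sketch of what a commented template might look like, here's the every-other-character format again, spread over several lines. The variable names are my own, and the here-document is just one convenient way to hold a multi-line template:

```perl
use strict;
use warnings;

# Whitespace and #-comments inside the template are ignored by unpack.
my $template = <<'END';
  (     # start a group
    x   # skip one byte
    C   # read one byte as an unsigned number
  )*    # repeat the group until the string runs out
END

my @codes = unpack $template, "Hello world!";
my $every_other = pack "C*", @codes;
print "$every_other\n";
```

Note that the repeat count still has to sit right up against the closing parenthesis; only the pieces between format constructs can be spaced out.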

Another common format is n, which stands for a 16-bit integer in ``network'' (big-endian) order. The corresponding data element is again expected to be a numeric value, but the result is now two bytes of the string instead of one. The first byte is the ``high'' part of the 16-bit value, while the second byte is the ``low'' byte. For example, both of:

  my $data = pack "n", 1;
  my $data = pack "C*", 0, 1;

result in the same string, with a NUL byte followed by a byte having the ASCII value of 1 (a control-A). Similarly:

  my $data2 = pack "n", 256;
  my $data2 = pack "C*", 1, 0;

result in the same string as well: namely, a byte with the ASCII value of 1 (control-A) followed by a NUL byte. The first byte represents the high half of the 16-bit value.
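To make the high-byte and low-byte arithmetic concrete, here's a small sketch using a value of my own choosing (258, which is 1 * 256 + 2). It pulls the two bytes apart with C2 and then reassembles the number with n:

```perl
use strict;
use warnings;

my $data = pack "n", 258;               # 258 is 1 * 256 + 2

# Look at the two bytes individually: high byte first, then low byte.
my ($high, $low) = unpack "C2", $data;

# Unpacking with the same format recovers the original number.
my ($value) = unpack "n", $data;

print "high=$high low=$low value=$value\n";
print "arithmetic agrees\n" if $high * 256 + $low == $value;
```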

The N value is similar, but packs a 32-bit value into 4 bytes, most significant byte first. Again, both of these result in the same value:

  my $data = pack "N",     65536 + 256 * 2 + 3;
  my $data = pack "C*", 0,     1,        2,  3;

If we want the little end first, we can use v and V in place of n and N:

  my $data_reversed = pack "V", 65536 + 256 * 2 + 3;
  my $data_reversed = pack "C*", 3, 2, 1, 0;

Here, the low byte comes first, followed by successively more significant bytes. The letter ``v'' comes from ``vax'' ordering, since little-endian order was used on the DEC VAX computer system. And probably because ``v'' was one of the few characters not already taken.
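Putting the two orderings side by side may help. This is only a sketch, with variable names I made up, that packs the same number both ways and lists the resulting bytes in string order:

```perl
use strict;
use warnings;

my $number = 65536 + 256 * 2 + 3;

my @big    = unpack "C*", pack "N", $number;   # most significant byte first
my @little = unpack "C*", pack "V", $number;   # least significant byte first

print "N: @big\n";
print "V: @little\n";

# Each format reads back its own layout, whatever machine we're running on.
my ($from_n) = unpack "N", pack "C*", 0, 1, 2, 3;
my ($from_v) = unpack "V", pack "C*", 3, 2, 1, 0;
print "both decode to $number\n" if $from_n == $number && $from_v == $number;
```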

The L letter is a ``native'' unsigned 32-bit value. On big-endian machines, this letter will act like N, but on little-endian machines, this letter will act like V. You can use this to figure out your native byte order:

  print unpack "C*", pack "L", 0x04030201;

On little-endian machines, this prints 1234, but on big-endian machines, this prints 4321.
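If you'd rather have a label than a string of digits, the same trick can be wrapped up like this. It's a sketch only: the $order variable and its wording are my own, and I've left a fallback for any machine that answers with something else entirely:

```perl
use strict;
use warnings;

# Pack a known 32-bit value natively, then read the bytes back in string order.
my $bytes = join "", unpack "C*", pack "L", 0x04030201;

my $order = $bytes eq "1234" ? "little-endian"
          : $bytes eq "4321" ? "big-endian"
          :                    "unknown ($bytes)";

print "This machine is $order\n";
```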

And as long as I introduced a hex value there, let's look at how to get that hex value into and out of a string. The only one I really use is H*, which unpacks a single string into its corresponding hex representation (of any length), or packs the hex string back into the original string:

  my $hello_as_hex = unpack "H*", "hello"; # "68656c6c6f"
  print pack "H*", $hello_as_hex; # say hello!

If we want those as pairs of hex digits, one pair per byte, we can use a group with a repetition marker:

  my @hexes = unpack "(H2)*", "hello";

Now we have qw(68 65 6c 6c 6f) as five separate elements. Joy.
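One place this might come in handy is a quick hex dump. Here's a minimal sketch, with names of my own choosing, that spaces out the pairs for display and then glues them back together to recover the original string:

```perl
use strict;
use warnings;

my @hexes = unpack "(H2)*", "hello";     # one two-digit element per byte
my $dump  = join " ", @hexes;
print "$dump\n";

# Going back: glue the digits together and let H* rebuild the bytes.
my $original = pack "H*", join "", @hexes;
print "$original\n";
```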

Similarly, I can break a string into bits with B*:

  print unpack "B*", "hi!"; # "011010000110100100100001"

The first eight bits represent the letter h from highest to lowest bit. The other two characters similarly follow. Again, to see this more easily, use a grouping:

  print "$_\n" for unpack "(B8)*", "hello world!";

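which results in:

  01101000
  01100101
  01101100
  01101100
  01101111
  00100000
  01110111
  01101111
  01110010
  01101100
  01100100
  00100001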

Wow... with a bit more work, I could turn that into an old-style 8-level paper tape.
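For the curious, that ``bit more work'' might look something like the following sketch. The layout is entirely my own invention (an o for a punched hole, a dot for blank tape, and no sprocket holes), so don't feed it to a real tape reader:

```perl
use strict;
use warnings;

my @rows;
for my $bits (unpack "(B8)*", "hi!") {
    (my $holes = $bits) =~ tr/01/.o/;   # 1 becomes a punched hole
    push @rows, "|$holes|";
}
print "$_\n" for @rows;
```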

Another really useful format is A, denoting a space-padded ASCII string, used nearly always with a specific width:

  my $value = pack "A10", "some";

The output value will be the string some followed by six spaces. The value is truncated if necessary. Replacing the A with a lowercase a results in a NUL-padded value instead. Using Z insists that there be at least one NUL, so Z10 packs up to nine characters from the value, reserving the last byte for a NUL.

When we use these formats in an unpack, the corresponding value will have trailing spaces (and NULs) trimmed for A, and everything from the first NUL onward trimmed for Z. The a format does no trimming at all. For example:

  my ($hello, $world) = unpack "A6 A5", "Hello world";

In this case, Hello (with a trailing space) is considered for the first output, but the trailing space is stripped, resulting in Hello and world for the two values. For a6 a5, the trailing space would have been kept.
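The three string formats are easy to mix up, so here's a sketch that runs the same value through each of them. The six-byte width and the hash names are just my choices for the demonstration, and NUL bytes are displayed as \0 so the padding shows up:

```perl
use strict;
use warnings;

my %packed;
my %unpacked;
for my $letter (qw(A a Z)) {
    # Pack "some" into a six-byte field with each padding style.
    $packed{$letter} = pack "${letter}6", "some";

    # Unpack that same field with the same letter.
    ($unpacked{$letter}) = unpack "${letter}6", $packed{$letter};

    # Show NUL bytes as \0 so the padding is visible.
    (my $shown = $packed{$letter}) =~ s/\0/\\0/g;
    printf "%s6: packed [%s], unpacked length %d\n",
        $letter, $shown, length $unpacked{$letter};
}
```

Only the a format hands back all six bytes; A and Z both return the original four characters.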

What if we wanted five characters, skip one, and get the next five? Just throw in an x to skip over the unwanted byte.

  my ($hello, $world) = unpack "a5 x a5", "Hello-world!";

We can also skip to an absolute position with @:

  my ($world) = unpack '@6 a5', "Hello world";

By skipping to position six (numbered starting from 0), we'll start picking up characters with the w.

We can skip to the end of the string with x*, and then back up one or more characters with X:

  my ($last) = unpack 'x* X a', "Hello world!";

And $last ends up with the last character of the string. We can even use X to interpret the same byte two different ways:

  my @pairs = unpack '(a X C)*', "Hello";

Now we'll get pairs out for each character in the string, consisting of the original character (from a), and then its byte value (from C) because we back up between interpreting the two formats.
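Here's a short sketch of what could be done with those pairs, along with the @ jump from earlier for comparison. The @report array and its char=code layout are my own additions:

```perl
use strict;
use warnings;

my @pairs = unpack '(a X C)*', "Hello";

# Peel the flat list apart two elements at a time.
my @report;
while (my ($char, $code) = splice @pairs, 0, 2) {
    push @report, "$char=$code";
}
print "@report\n";

# @ jumps to an absolute offset, counted from 0.
my ($world) = unpack '@6 a5', "Hello world";
print "$world\n";
```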

The output of the Unix who command consists of an 8-character trailing-space-padded username field, followed by an 8-character trailing-space-padded terminal (tty) field, followed by the date and time of login. Using unpack, we can easily pull the lines apart:

  foreach (`who`) {
    chomp; # throw away trailing newline
    my ($user, $tty, $time) = unpack "A8 A8 A*", $_;
    ...
  }

However, who is merely interpreting the information from the utmp file, which on my system is defined by a C struct that looks like:

   #define UT_NAMESIZE     8
   #define UT_LINESIZE     8
   #define UT_HOSTSIZE     16
   struct utmp {
           char    ut_line[UT_LINESIZE];
           char    ut_name[UT_NAMESIZE];
           char    ut_host[UT_HOSTSIZE];
           time_t  ut_time;
   };

We can translate this into a pack format rather directly. Because we're talking about NUL-terminated strings, we'll use Z, and a time_t is a native ``long'' (generally), so that's something like Z8 Z8 Z16 L.

We can open up the utmp file (/var/run/utmp on my system), read it 36 bytes at a time, and unpack it to get at the data:

  open my $utmp, "<", "/var/run/utmp" or die "Cannot open utmp: $!";
  binmode $utmp; # these are raw bytes, not text
  while (read($utmp, my $buf, 36) > 0) {
    my ($line, $name, $host, $time) = unpack "Z8 Z8 Z16 L", $buf;
    next unless $name;
    printf "%-8s %-8s %s", $name, $line, scalar localtime $time;
    printf " (%s)", $host if $host;
    print "\n";
  }

And there we have a working who program. Of course, your utmp structure is likely different from mine, but the principles will be similar.
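If you want to try the template without touching a real utmp file, you can build a record yourself and take it apart again. This is only a sketch under the same layout assumptions as above: the tty, user name, host, and timestamp are all made up.

```perl
use strict;
use warnings;

my $template = "Z8 Z8 Z16 L";

# A made-up login record; none of these values come from a real utmp file.
my $record = pack $template, "ttyp1", "merlyn", "example.com", 1_146_500_000;

# Three NUL-padded strings plus a 32-bit integer: 8 + 8 + 16 + 4 bytes.
my $size = length $record;

my ($line, $name, $host, $time) = unpack $template, $record;
printf "%d bytes: %-8s %-8s %s (%s)\n",
    $size, $name, $line, scalar localtime($time), $host;
```

Packing and unpacking with the same template is also a handy sanity check that the record length you pass to read matches what the template actually describes.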

I hope you've enjoyed this little trip into the world of pack and unpack. For more information, pack and unpack are described in mind-numbing detail in perlfunc, and recent versions of Perl include perlpacktut as a gentle tutorial. Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
