Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 55 (Nov 2004)

[suggested title: ``Commonly used parsing techniques'']

I write a lot of Perl code that contains a significant amount of text parsing -- to find the interesting bits of the text (data reduction), or break the text into pieces (lexical analysis). Let's take a look at how I use regular expressions to accomplish those tasks.

The simplest style of matching and parsing is the regular expression being tested simply for its truth value, most commonly used against the value in $_. For example, a line of the output of the Unix who command looks like this:

      merlyn    tty42   Dec 7 19:41

which gives the username, login terminal, and the date and time of login. If I want to count how many times I've logged in, I can use:

  my $count;
  for (`who`) { # one line is in $_
    if (/^merlyn\b/) { # line begins with merlyn
      $count++; # count it
    }
  }
  print "merlyn is logged in $count times\n";

The regular expression returns true if merlyn is found at the beginning of the line, otherwise false. Although a simple true/false value return is nice and useful, more often I'll want to know something about the match. In this case, the match variables come in handy:

  if (/^(merlyn|root)\b/) { # it matches
    $admins{$1}++;
  }

Here, the match is looking for a line beginning with either merlyn or root, and that word ends up in $1, because of a valid match. If I want to capture the location as well, I can grab a bit more:

  if (/^(merlyn|root)\s+(\S+)/ {
    $admins{$1} .= "$2\n";
  }

If I have a successful match, the username is in $1, and the terminal is in $2. Each terminal is accumulated into a newline-terminated string for the corresponding user.

Now, at this point, the regex is messy enough that I probably should add some comments for the poor chap who has to maintain my code when I'm gone. Luckily, I can pop in to extended-regex mode, and include some insignificant whitespace:

  if (m{
    ^           # beginning of line
    (           # start $1
      merlyn    # merlyn
      |         # or
      root      # root
    )           # end $1
    \s+         # some whitespace
    (           # start $2
      \S+       # one or more non-whitespace
    )           # end $2
  }x) {
    $admins{$1}++;
  }

And now my illustrative text is alongside the parts being referenced, to hopefully make it more clear about exactly what's happening.

Rather than trying to remember what I meant for $1 and $2, I can name them at the time of the match:

  if (my ($user, $tty) = /^(merlyn|root)\s+(\S+)/) {
    $admins{$user} .= "$tty\n";
  }

Each of the $1, $2 (and so on) variables are returned as a list from a successful match, and I can assign those directly to easy-to-remember variable names. If the match fails, an empty list is returned, causing the variables to become undef. More importantly, a failed match means I have a zero-length list-assignment, which when viewed in a scalar context (as in when I am looking for a true/false value for the if), so the if also fails. I can use this to create a multi-way attempt to match text:


  if (my ($user, $tty) = /^(merlyn|root)\s+(\S+)/) {
    $admins{$user} .= "$tty\n";
  } elsif (my ($user, $tty) = /^(\S+)\s+(\S+)/) {
    $users{$user} .= "$tty\n";
  } else {
    warn "I don't understand $_";
  }

If the first match succeeds, I have the named variables set appropriately, localized to that branch of the match. If that fails, Perl goes on to try the next (more general) match, executing the appropriate code if that works. If the more general match had appeared before the specific match, then I'd never have had the specific match be triggered, so it's important to get those in the right order. If both matches fail, I get a nice error message so I know I've failed to account for some kind of line in the input.

The output line of who has three fields that are separated by some amount whitespace, although the third field (the date and time of login) also contains embedded whitespace. If I wanted to grab the entire line into its three parts, I'd have to use a full regular expression:

  for (`who`) {
    my ($user, $tty, $time) = m{
       ^       # beginning of line
       (\S+)   # user as $1
       \s+     # whitespace gap
       (\S+)   # tty as $2
       \s+     # whitespace gap
       (.*)    # date/time of login as $3
       $       # end of line
     }x or die "Weird who line: $!";
     ...; # rest of loop
   }

Here, the (.*) grabs the entire remainder of the line, including the embedded whitespace. Note that I'm careful to bail out of the program on unusual lines. A more gentler approach would have been to just issue a warn and then go on to the next line. Always consider what happens when text does not match.

If the line I'm parsing isn't ``mixed delimiters'' like that last example, but rather something consistent, like whitespace always delimiting non-whitespaced data, then I choose the simpler split operator:

  my @words = split /\s+/, $line;

In this case, the regular expression provides the description of the parts I throw away, leading one of my friends to call that the ``deliminator''. Here, I'm using \s+, and not a simple \s, so that a run of consecutive whitespace is considered one big delimiter, rather than a series of delimiters around empty fields. For an example, let's look at the differences between these two entries:

  my @x = split /:/,  "ab:cd:::ef:gh";
  my @y = split /:+/, "ab:cd:::ef:gh";

In the first case, I get two empty fields between cd and ef, because Perl found three delimiters there. In the second case, Perl found one big fat delimiter between cd and ef, so there is no empty field present.

Sometimes, it's easier to talk about what to keep instead of what to throw away. For that, I grab a //g match in a list context:

  my @words = "What did he say?" =~ /([A-Za-z']+)/g;

Here, the regular expression is dragged through string, picking up all matches, ignoring the gaps between that don't match. While I could have written this with split with some work, it gets harder to do so for the more general case.

Another common task is to take a string and break it down into its individual parts for further analysis. For example, suppose I wanted to pull apart that previous sentence as a series of four words followed by a punctuation mark, noting each category of thing as I go along? In other words, I want to end up with an array of arrayrefs that looks like:

     my @sentence = (
       [word => "What"],
       [word => "did"],
       [word => "he"],
       [word => "say"],
       [punct => "?"],
     );

There are two main ways to do this. From back in the old days, I did this by destructively nibbling away at the string:

  $_ = "What did he say?";
  my @sentence;
  while (length $_) {
    if (s/^([A-Za-z']+)//) {
      push @sentence, [word => $1];
    } elsif (s/^([,.?!])//) {
      push @sentence, [punct => $1];
    } elsif (s/^\s+//) {
      # ignore whitespace
    } else {
      die "I cannot parse the remainder of $_";
    }
  }

As each ``thing'' of interest is noted, it gets removed from the front of the string. Again, if I can't figure out what's there, the loop aborts.

The downside to this approach is that I'm destroying the string as I am parsing it. While that's fine for small strings (presuming I'm working on a copy and not the original), it's a real pain for larger strings for Perl to be continually removing bits and pieces from the front. But with modern Perl implementations, I can instead use the pos() value of the string as a virtual pointer, anchoring each new match there, and inchworming along in a similar way. Compare:

  $_ = "What did he say?";
  my @sentence;
  until (/\G$/gc) { # until pos at end of string
    if (/\G([A-Za-z']+)/gc) {
      push @sentence, [word => $1];
    } elsif (/\G([,.?!])/gc) {
      push @sentence, [punct => $1];
    } elsif (/\G\s+/gc) {
      # ignore whitespace
    } else {
      die "I cannot parse the remainder of ", /\G(.*)/gc;
    }
  }

On each of these matches, the pos() pointer for $_ is advanced if the match succeeds, or left alone if the match fails (thanks to the /c option). The pos() pointer provides the anchor for the start of each match as well, acting like the beginning of string did for the nibbling version. This is a very speedy technique, and a good one to keep in mind.

I hope you enjoyed this little journey into commonly used parsing techniques. Until next time, enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.