Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 40 (Dec 2001)
[suggested title: ``Parsing interesting things'']
Someone recently popped into one of the newsgroups I frequent, and asked how to parse an INI file. You might have seen those before, with sections and keyword=value lines, like:
[login]
timeout=30
remote=yes
[password]
minlength=6
I think they started in the Microsoft world, since no sane Unix hacker would have come up with something like that. No, we come up with things like .Xdefaults and sendmail.cf and termcap. But the request seemed simple: parse the file, and gather the information into a hash for quick access, two levels deep of course.
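To make ``two levels deep'' concrete, here is a sketch of the hash we are aiming for, written out by hand for the sample file above. The variable name %config is my own choice for illustration; nothing has been parsed yet.

```perl
use strict;
use warnings;

# Hand-built version of the two-level hash we want a parser to produce
# for the sample INI file. The name %config is hypothetical.
my %config = (
    login    => { timeout => 30, remote => 'yes' },
    password => { minlength => 6 },
);

# The first level selects the section, the second selects the keyword.
print "login timeout: $config{login}{timeout}\n";
print "password minlength: $config{password}{minlength}\n";
```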
Now, I usually carry the banner here for ``use the CPAN'', and in fact, there are numerous CPAN modules that parse INI files (too many, I think). But let's take a different route here. Suppose we were parsing a file that wasn't already CPANned to death. What tools could we use?
Well, certainly Perl's regular expressions are pretty powerful in the first place, and this task really wouldn't be that difficult with hand-written code, but we can go a bit further and pull out a nifty tool from the CPAN: the ``madman of Perl'' Damian Conway's Parse::RecDescent. This module permits extremely complex parsers to be built by specifying a nice hierarchical description of the data (as a grammar), and a series of actions to be taken as each portion of the data is returned. I find it very simple to use, and whipped up a parser in no time.
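For comparison, the hand-written route might look something like the following. This is only a rough sketch using plain regular expressions; the subroutine name parse_ini_by_hand is made up, and I have assumed that a later duplicate keyword simply overwrites an earlier one.

```perl
use strict;
use warnings;

# Hypothetical helper: parse INI-like text line by line with plain regexes.
# Returns a hashref of hashrefs, or undef if a line does not fit.
sub parse_ini_by_hand {
    my ($text) = @_;
    my (%data, $section);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;                  # ignore blank lines
        if ($line =~ /^\s*\[(.*)\]\s*$/) {         # section marker
            $section = $1;
            $data{$section} ||= {};
        }
        elsif (defined $section and $line =~ /^\s*(\w+)=(.*)$/) {
            $data{$section}{$1} = $2;              # assumption: last one wins
        }
        else {
            return;                                # not an INI-like line
        }
    }
    return \%data;
}

my $by_hand = parse_ini_by_hand(
    "[login]\ntimeout=30\nremote=yes\n[password]\nminlength=6\n"
);
print "timeout is $by_hand->{login}{timeout}\n";
```

It works, but the structure of the file is buried in the control flow, which is the part a grammar makes explicit.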
The key to a useful grammar is getting the description right, and then deciding what to do once each piece has been matched. First, let's look at a file. A file is a series of sections, so in the grammar language, that's given as:
file: sections
Actually, a file is a bit more than that. If we just used that, the grammar would match any input that merely starts with something that looks like sections, and ignore whatever follows. So, we need to anchor that:
file: sections /\z/
Which says, match sections, and when you're done matching sections, match the end of the string. If you're not at the end of the string when you are done matching sections, this isn't a file that we want.
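The same prefix-versus-whole-input distinction shows up with ordinary Perl regexes, which may make the job of the /\z/ easier to see. This sketch does not use Parse::RecDescent at all; the toy pattern and the two sample strings are invented for illustration.

```perl
use strict;
use warnings;

# A toy stand-in for "sections": one or more key=value lines.
my $lines = qr/(?:\w+=.*\n)+/;

my $good = "a=1\nb=2\n";
my $bad  = "a=1\nb=2\n!!! trailing junk\n";

# Not anchored at the end: a valid prefix is enough to succeed.
my $prefix_good = $good =~ /\A$lines/   ? 1 : 0;
my $prefix_bad  = $bad  =~ /\A$lines/   ? 1 : 0;

# Anchored with \z: the match must reach the end of the string.
my $whole_good  = $good =~ /\A$lines\z/ ? 1 : 0;
my $whole_bad   = $bad  =~ /\A$lines\z/ ? 1 : 0;

print "prefix only: good=$prefix_good bad=$prefix_bad\n";
print "with \\z:     good=$whole_good bad=$whole_bad\n";
```

Only the anchored form rejects the string with junk at the end.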
And now, sections is zero or more sections, which we write as:
sections: section(s?)
with the (s?) suffix meaning ``zero or more''. Very readable so far.
A section is a section marker (the square-bracket line) and some definitions:
section: section_marker definitions
definitions: definition(s?)
And we've defined the definitions as well. So far, we've managed to capture the essence of an INI-like file, but we've not actually matched anything (except the end of string). That's because we've been constructing ``non-terminals''. Grammar rules can also contain ``terminals'' (like the end-of-string token above) to define specific things to match. Let's start with a section marker:
section_marker: /\[.*\]/
There. A section marker is a square-bracketed thingy. And what's a definition?
definition: key /=/ value
Yeah, it's a key and a value, separated by an equals. But what are those? Why, more terminals!
key: /\w+/
value: /.*/
And already with just a few lines of code, we've defined most of the grammar. But now we need to introduce a bit more knowledge about Parse::RecDescent. Between each of the items of the rules, the generated parser will be permitted to skip over the current skip string, which is ``whitespace'' by default. This is fine for section markers: we don't mind if any preceding whitespace gets tossed. But it's a pain if whitespace gets in between the key and the rest of the line. Fortunately, we can define that the skip string be altered for the remainder of a rule:
definition: key <skip: ''> /=/ value
which means that the string '' (the empty string) is now the skip string, meaning that the equals must be adjacent to the end of the key, and the value starts immediately after the equals. Good!
We could stick all the rules above into a string $GRAMMAR, and then create a parser $PARSER using these rules as:
use Parse::RecDescent;
my $PARSER = Parse::RecDescent->new($GRAMMAR) or die;
This $PARSER can then be used repeatedly to see if a file fits the specifications. To do that, we call the top-level rule (file) as a method, passing it $INPUT, the contents of the file in question:
if (defined(my $result = $PARSER->file($INPUT))) {
  print "It's a valid INI file!\n";
} else {
  print "No good.\n";
}
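Pulled together, the pieces so far might be assembled into a small checker like this. Treat it as a sketch: it assumes Parse::RecDescent is installed from the CPAN, the heredoc tag and the two sample inputs ($good_input and $bad_input) are my own inventions, and the grammar has no actions yet, so it only says yes or no.

```perl
use strict;
use warnings;
use Parse::RecDescent;    # from the CPAN, not a core module

# The rules developed so far. The single-quoted heredoc keeps Perl
# from interpolating anything inside the grammar text.
my $GRAMMAR = <<'END_OF_GRAMMAR';
file: sections /\z/
sections: section(s?)
section: section_marker definitions
definitions: definition(s?)
section_marker: /\[.*\]/
definition: key <skip: ''> /=/ value
key: /\w+/
value: /.*/
END_OF_GRAMMAR

my $PARSER = Parse::RecDescent->new($GRAMMAR) or die "bad grammar";

# Hypothetical inputs: the second has a definition before any section
# marker, so the end-of-string anchor cannot be reached.
my $good_input = "[login]\ntimeout=30\nremote=yes\n[password]\nminlength=6\n";
my $bad_input  = "timeout=30\n[login]\nremote=yes\n";

for my $input ($good_input, $bad_input) {
    if (defined(my $result = $PARSER->file($input))) {
        print "It's a valid INI file!\n";
    } else {
        print "No good.\n";
    }
}
```

The test has to be defined() rather than plain truth here, because the last item matched by the file rule is the empty string at the end of the input.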
Now, if all we were doing was verifying well-formedness, that's enough. But we wanted to also use the data as it was parsed. To do that, we need to also know that every rule is like a subroutine call, and passes back the last value evaluated. By default, that's the string matching the terminal (or $1 if it's included), or whatever value the last subrule returns. (For the repetitions above, an array ref is returned of all the matches, if any.) However, we can include some Perl code enclosed in a block as the last item of a rule, and then the value of that block will be the return value.
For example, we really don't want the brackets included in the section marker, so we can select (using $1) them away:
section_marker: /\[(.*)\]/
There. Now the brackets are not part of the return value. If we didn't know that $1 is automatically returned, we could return it explicitly:
section_marker: /\[(.*)\]/ { $1 }
which says to perform the regex match, and if it succeeds, evaluate the block. As long as the block doesn't return undef, it's also considered a ``match'', and as the last thing in a rule, it's also the overall value of the rule.
But what about the definitions? We want to note both the key and the value, so we'll use some sort of Perl block at the end of the rule. And we can return an arrayref of the two items just fine, but we need to access the ``value'' of the key and value subrules through the magical %item hash. The keys to this hash are the names of the subrules. (Sorry for the overloading of the key/value terms here.)
definition: key <skip: ''> /=/ value { [$item{key}, $item{value}] }
And now a definition is an arrayref, consisting of the found key, and its found value. (If there's more than one item called ``key'', then you must resort to positional syntax, but it's almost always easier and clearer to just invent a new non-terminal name for that particular slot.)
And similarly, a section needs the name of the section and all of the definitions of that section.
section: section_marker definitions { [$item{section_marker}, $item{definitions}] }
Note that definitions will already be an arrayref of individual definitions, which are themselves references to two-element arrays. All this stacking is taken care of automatically by the parser built by Parse::RecDescent!
And finally, the fun part. A file wants to be all the sections. And we could just punt and return that:
file: sections /\z/ { $item{sections} }
which will then be an arrayref pointing to a list of sections, each section being an arrayref pointing to a list of definitions in that section, each definition being an arrayref pointing to a key/value tuple. But let's convert this into a hash for quick access:
file: sections /\z/
  { my %return;
    my $sections = $item{sections};
    for my $section (@$sections) {
      my ($section_marker, $definitions) = @$section;
      for my $definition (@$definitions) {
        my ($key, $value) = @$definition;
        for ($return{$section_marker}{$key}) {
          if (not defined $_) {
            $_ = $value;
          } elsif (not ref $_) {
            $_ = [$_, $value];
          } else {
            push @$_, $value;
          }
        }
      }
    }
    \%return;
  }
Wow. What was that? Well, first we define a hash to be returned (as a hashref), and then walk the multiple levels of the arrayrefs of arrayrefs of tuples. The interesting part starts in the middle, which is merely aliasing $return{$section_marker}{$key} to $_ for the rest of the inner loop. If that value isn't defined, then this is the first time we've seen a keyword under a given section, so we stuff the value. If it's already defined, then we've seen the same keyword twice. In this case, I decided to turn the value into an arrayref, so that the values are individually extractable. And finally, if it's already an arrayref, then we just push the latest hit onto the end.
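The scalar-to-arrayref promotion can be tried on its own, outside any grammar. In this sketch the helper name add_value and the sample section, keys, and values are mine; the three-way test in the middle is the same one used in the action above.

```perl
use strict;
use warnings;

my %return;

# Hypothetical helper wrapping the three-way test from the grammar action.
sub add_value {
    my ($section, $key, $value) = @_;
    for ($return{$section}{$key}) {      # alias the hash slot to $_
        if (not defined $_) {
            $_ = $value;                 # first sighting: plain scalar
        } elsif (not ref $_) {
            $_ = [$_, $value];           # second sighting: promote to arrayref
        } else {
            push @$_, $value;            # third and later: append
        }
    }
}

add_value(login => server  => 'alpha');
add_value(login => server  => 'beta');
add_value(login => server  => 'gamma');
add_value(login => timeout => 30);

print "servers: @{ $return{login}{server} }\n";
print "timeout: $return{login}{timeout}\n";
```

Because the one-element foreach aliases $_ to the hash slot itself, assigning to $_ stores straight into %return, with no need to repeat the long subscript expression.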
The return value of calling the file method is now either this hashref, or undef. So to get the ``timeout'' parameter from the example INI file above, we'd say:
my $timeout = $result->{login}{timeout};
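One consequence of that design is that a looked-up value may be either a plain scalar or an arrayref, depending on how often the keyword appeared. A caller that wants to treat both cases alike could use something along these lines. The helper name values_of and the hand-built $sample_result are hypothetical stand-ins, not part of the parser.

```perl
use strict;
use warnings;

# Hypothetical helper: always hand back a list, whether the stored value
# is a plain scalar (keyword seen once) or an arrayref (seen several times).
sub values_of {
    my ($result, $section, $key) = @_;
    return unless exists $result->{$section}
              and exists $result->{$section}{$key};
    my $value = $result->{$section}{$key};
    return ref $value ? @$value : ($value);
}

# Hand-built stand-in for the hashref returned by the parser.
my $sample_result = {
    login => { timeout => 30, server => ['alpha', 'beta'] },
};

my @timeouts = values_of($sample_result, 'login', 'timeout');
my @servers  = values_of($sample_result, 'login', 'server');
print "timeouts: @timeouts\n";
print "servers: @servers\n";
```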
Because the names are case sensitive, we might want to add a few other things to force all the section names and keys to lowercase, or perhaps we could do that while we were building the hash.
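As one possible take on the second option, here is a sketch that lowercases the names after the fact by copying the result into a new hash. The subroutine name lowercase_names and the sample data are my own, and it assumes that when two names differ only in case, it is acceptable for one to silently replace the other.

```perl
use strict;
use warnings;

# Hypothetical post-processing step: copy a parsed result into a new
# two-level hash with lowercased section names and keys.
sub lowercase_names {
    my ($result) = @_;
    my %lower;
    for my $section (keys %$result) {
        $lower{lc $section} ||= {};      # keep sections that have no keys
        for my $key (keys %{ $result->{$section} }) {
            # Assumption: if two names differ only in case,
            # whichever is visited last wins.
            $lower{lc $section}{lc $key} = $result->{$section}{$key};
        }
    }
    return \%lower;
}

my $mixed_case = { Login => { TimeOut => 30, Remote => 'yes' }, Empty => {} };
my $normalized = lowercase_names($mixed_case);
print "timeout is $normalized->{login}{timeout}\n";
```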
And there you have it: an INI-like file parser made with Parse::RecDescent. Hopefully, this brief intro to this powerful module will get you interested enough to read the rest of the documentation and study its amazing array of features. And you'll never fear parsing an odd-looking file again. Until next time, enjoy!