Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 35 (Dec 2000)

[suggested title: So what's the difference?]

A lot of common programming is dealing with things that change. And things do indeed change, and sometimes we'd like to know how they changed.

For example, if we have a list of items:

  @one = qw(a b c d e f g);

and then later, we look at it again, and there's a different set of items:

  @two = qw(b c e h i j);

How can we tell what's new, what's old, and what's gone?

We could certainly try to do it by brute force:

  @one = qw(a b c d e f g);
  @two = qw(b c e h i j);
  foreach $one (@one) {
    if (grep $one eq $_, @two) {
      print "$one is in both old and new\n";
    } else {
      print "$one has been deleted\n";
    }
  }
  foreach $two (@two) {
    unless (grep $two eq $_, @one) {
       print "$two has been added\n";
    }
  }

And this in fact gives us an appropriate response:

  a has been deleted
  b is in both old and new
  c is in both old and new
  d has been deleted
  e is in both old and new
  f has been deleted
  g has been deleted
  h has been added
  i has been added
  j has been added

But this is incredibly inefficient. The computation time will rise in proportion to the product of sizes of both the lists. This is because every element of one list is being compared to every element of the other list (twice, in fact). The grep operator is a loop over each item, so we've effectly got nested loops, and that should nearly always be a danger sign.

The perlfaq4 manpage approaches this subject, giving a solution of something like:

  @union = @intersection = @difference = ();
  %count = ();
  foreach $element (@one, @two) { $count{$element}++ }
  foreach $element (keys %count) {
      push @union, $element;
      push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
  }

with the caveat that we're assuming one item of each kind within each list. While that works for our input data as well, we'll run into trouble on more general data. However, with a slight modification, we can handle even duplicate items in each list:

  @one = qw(a a a a b c d e f g);
  @two = qw(b c e h i i i i j);
  my %tracker = ();
  $tracker{$_} .= 1 for @one;
  $tracker{$_} .= 2 for @two;
  for (sort keys %tracker) {
    if ($tracker{$_} !~ /1/) {
      print "$_ has been added\n";
    } elsif ($tracker{$_} !~ /2/) {
      print "$_ has been deleted\n";
    } else {
      print "$_ is in both old and new\n";
    }
  }

Success. Correct output, and reasonably efficient. If you're doing a lot of these, check into the CPAN modules starting with Set::.

And then we come to the problem of telling the difference between two sequences, where the ordering matters. The very nice Algorithm::Diff in the CPAN computes a reasonably short difference-list, similar to the Unix diff command, to tell us how to transform one list into another. There are a number of interfaces. The most interesting one I found was traverse_sequences, which gives me all of the elements of the two lists in sequence, but marked in a way that I can tell which of the two lists (or both) the item belongs.

Let's look at a simple example:

  use Algorithm::Diff qw(traverse_sequences);
  @one = qw(M N a b P Q c d e f V W g h);
  @two = qw(a b R S c d T U e f g h X Y);
  traverse_sequences(\@one, \@two, {
    MATCH => sub { show($one[$_[0]], $two[$_[1]]) },
    DISCARD_A => sub { show($one[$_[0]], "---") },
    DISCARD_B => sub { show("---", $two[$_[1]]) },
  });
  sub show {
    printf "%10s %10s\n", @_;
  }

Here we've given two token sequences in @one and @two. Using traverse_sequences, we'll print out common sequences (via the MATCH callback, removed material (via the DISCARD_A callback), and new material (via the DISCARD_B callback). Changed material shows up as a series of deletes followed by a series of inserts.

The callbacks are defined as references to anonymous subroutines, more commonly known as ``coderefs''. The two parameters passed to each of the callbacks are the current indicies within the @one and @two arrays. As this isn't the actual value, I need to take the index and look it up in the appropriate array.

The result is something like:

         M        ---
         N        ---
         a          a
         b          b
         P        ---
         Q        ---
       ---          R
       ---          S
         c          c
         d          d
       ---          T
       ---          U
         e          e
         f          f
         V        ---
         W        ---
         g          g
         h          h
       ---          X
       ---          Y

Notice the common sequences. The printf operation lines up the columns nicely.

Well, this is a nice text-mode tabular output, but we can get a bit nicer if we know we're sending the result to HTML. Let's color-code all deletions in red, and insertions in green.

A first cut at the algorithm generates far too many font tags:

  use Algorithm::Diff qw(traverse_sequences);
  @one = qw(M N a b P Q c d e f V W g h);
  @two = qw(a b R S c d T U e f g h X Y);
  traverse_sequences(\@one, \@two, {
    MATCH => sub { colorshow("", $one[$_[0]]) },
    DISCARD_A => sub { colorshow("red", $one[$_[0]]) },
    DISCARD_B => sub { colorshow("green", $two[$_[1]]) },
  });
  sub colorshow {
    my $color = shift;
    my $string = shift;
    if (length $color) {
      print "<font color=$color>$string</font>\n";
    } else {
      print "$string\n";
    }
  }

This generates a correct result, but excessive output:


    <font color=red>M</font>
    <font color=red>N</font>
    a
    b
    <font color=red>P</font>
    <font color=red>Q</font>
    <font color=green>R</font>
    <font color=green>S</font>
    c
    d
    <font color=green>T</font>
    <font color=green>U</font>
    e
    f
    <font color=red>V</font>
    <font color=red>W</font>
    g
    h
    <font color=green>X</font>
    <font color=green>Y</font>

What we need is some tracking of state information to figure out if we're already in red or green mode:

  use Algorithm::Diff qw(traverse_sequences);
  @one = qw(M N a b P Q c d e f V W g h);
  @two = qw(a b R S c d T U e f g h X Y);
  traverse_sequences(\@one, \@two, {
    MATCH => sub { colorshow("", $one[$_[0]]) },
    DISCARD_A => sub { colorshow("red", $one[$_[0]]) },
    DISCARD_B => sub { colorshow("green", $two[$_[1]]) },
  });
  colorshow(""); # reset back to 
  BEGIN {
    my $currentcolor = "";

    sub colorshow {
      my $color = shift;
      my $string = shift;
      if ($color ne $currentcolor) {
        print "</font>\n" if length $currentcolor;
        print "<font color=$color>\n" if length $color;
        $currentcolor = $color;
      }
      if (defined $string and length $string) {
        print "$string\n";
      }
    }
  }

Here, I'm tracking the state of the current HTML color in the $currentcolor static variable. As it changes, I send out the end-font or begin-font tags as needed. The only oddness now is that I need to make one final call to colorshow with the uncolored tag to close off any final begin-font tag. This call is harmless if we were already outside a colored region.

And that's much better, resulting in:

    <font color=red>
    M
    N
    </font>
    a
    b
    <font color=red>
    P
    Q
    </font>
    <font color=green>
    R
    S
    </font>
    c
    d
    <font color=green>
    T
    U
    </font>
    e
    f
    <font color=red>
    V
    W
    </font>
    g
    h
    <font color=green>
    X
    Y
    </font>

Although my web-hacking friends might prefer to see that as:

    <span style="background: red; color: black">
    M
    N
    </span>
    a
    b
    <span style="background: red; color: black">
    P
    Q
    </span>
    <span style="background: green; color: black">
    R
    S
    </span>
    c
    d
    <span style="background: green; color: black">
    T
    U
    </span>
    e
    f
    <span style="background: red; color: black">
    V
    W
    </span>
    g
    h
    <span style="background: green; color: black">
    X
    Y
    </span>

And that'd be a pretty easy small change, but I'll leave that to you. There's a little extra whitespace in the output here than I like, but at least the job is getting done with minimal hassle.

So, now when someone asks you ``what the difference?'', you can show different ways of answering that question! Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.

Worldwide training and consulting by Perl experts

Copyright Notice

Unix Review Column 35 (Dec 2000)