Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 67 (Nov 2006)
[Suggested title: ``Formatting reports with Template Toolkit'']
Recently, a Usenet posting (yes, I still read that) discussed using Perl to determine the disk block usage for various users on the system. Apparently, the poster had noticed a discrepancy between the number of blocks reported by du and the total returned by summing Perl's -s function applied to each file. I replied that you can't use -s, because that's merely the length of the file, and thanks to sparse Unix files, that value isn't necessarily the number of blocks used. Also, a large file consumes indirect blocks to locate its data blocks as needed.
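To see the difference concretely, here's a small sketch (not part of the column's listing) that creates a file with a hole in a temporary directory and compares -s against the stat block count. Whether the file actually ends up sparse depends on your filesystem; the point is that the two numbers measure different things.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create a file with a 1 MB hole: seek past the end and write one byte.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/sparse";
open my $fh, '>', $file or die "open: $!";
seek $fh, 1024 * 1024, 0 or die "seek: $!";
print $fh "x";
close $fh or die "close: $!";

my $length = -s $file;          # apparent length: 1048577 bytes
my $blocks = (stat $file)[12];  # blocks actually allocated (system-specific unit)
printf "length=%d blocks=%d\n", $length, $blocks;
```

On a filesystem that supports sparse files, the allocated blocks account for far less than a megabyte, even though -s dutifully reports the full length.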
Because of sparse files and indirect blocks, the accurate way to determine the actual cost of a file is to call stat and use the blocks value (element index 12). As I was thinking about this, I thought it'd be interesting to profile my own disk usage, organized into logarithmic buckets. Once I had that working, I started tinkering some more (as I often do), and wanted a nice HTML table output, organized by user, of the disk blocks and file count for a given directory hierarchy. The result is in [listing one, below].
Line 5 provides a constant to help me scale a given block count into the relevant bucket. If I take the natural log of the block count, divide it by the natural log of 2, and truncate that downward to an integer, I'll get buckets for 1 block, 2-3 blocks, 4-7 blocks, 8-15 blocks, and so on, in nice powers of two. Rather than keep dividing by the natural log of 2, I'll use the reciprocal and multiply, which in hindsight is a completely unnoticeable micro-optimization. Oh well.
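In other words, the bucket index is just the floor of log2 of the block count. A tiny standalone sketch of the computation (the bucket() helper is mine, not in the listing):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant ONEOVERLOG2 => 1 / log(2);  # 1/ln(2): log($n) * ONEOVERLOG2 == log2($n)

# Map a block count to its power-of-two bucket:
# 1 block -> bucket 0, 2-3 -> 1, 4-7 -> 2, 8-15 -> 3, ...
# A zero block count can't go through log(), so it shares bucket 0.
sub bucket {
    my $blocks = shift;
    return $blocks ? int(log($blocks) * ONEOVERLOG2) : 0;
}

printf "%4d blocks -> bucket %d\n", $_, bucket($_) for 1, 3, 7, 100, 1000;
```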
Lines 7 and 8 pull in the File::Find module (found in the core distribution) and the Template module of Template Toolkit (found in the CPAN).
Line 10 defines the summary data structure which will hold the results of our data gathering. The hash is keyed by user ID number, pointing at an array organized by logarithmic block buckets, each holding a hashref that accumulates the number of blocks and the count of files.
Line 14 defaults the directories to be searched to include the current directory.
Line 16 creates a hash to enable us to ignore the subsequent times that we see the same multiply-linked file. This hash has to be defined outside the subroutine used by File::Find.
Lines 18 to 28 recurse through the directories using File::Find. The inline anonymous subroutine is called for every entry: it's our job to return quickly if the entry is not needed. Hence, lines 19 and 20 return for symbolic links or items that are neither files nor directories. (Directories consume disk blocks, and thus I chose to count those in my storage summary.)
Lines 21 to 23 pick out the relevant parts of the stat call: the device and inode numbers, the user ID of the file's owner, and the number of blocks. If the device and inode number pair has been seen before, line 23 rejects the entry quickly.
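The device/inode trick is easy to demonstrate outside of File::Find. This sketch (the filenames are mine) makes a hard link and shows that the second name for the same file gets skipped:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/original" or die "open: $!";
print $fh "some data\n";
close $fh or die "close: $!";
link "$dir/original", "$dir/hardlink" or die "link: $!";

# Key on "device inode" so a multiply-linked file is counted only once.
my (%seen, $counted);
for my $name ("$dir/original", "$dir/hardlink") {
    my ($dev, $ino) = stat($name) or next;
    next if $seen{"$dev $ino"}++;
    $counted++;
}
print "counted $counted of 2 names\n";   # counted 1 of 2 names
```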
If we make it to line 25, it's the first time we've seen a given file, so we next need to determine the bucket in which this entry should be accumulated. If the block count is 0, we can't compute the logarithm, so I decided to arbitrarily put that in with the entries with a block count of 1. Otherwise, the natural log of the block count is divided by the log of 2. We don't need to take the integer value of the result because we're using it as an array index, which automatically truncates the value.
Lines 26 to 27 accumulate the information for this file into the appropriate bucket.
Once the data is gathered, I create two auxiliary data structures to help the rendering a bit. Line 30 creates a mapping from user IDs to usernames. Lines 31 to 36 create the top-of-table column headers by computing the powers of 2 as needed. The @labels array is compared against each user's data and, if needed, extended until it's long enough.
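The expression 2**@labels in line 34 looks odd at first: @labels in numeric context is its element count, so each push appends the next power of two. A quick sketch of the idiom:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# @labels used as a number is its length, so pushing 2**@labels
# appends 2**0, 2**1, 2**2, ... in turn.
my @labels;
while (@labels < 5) {
    push @labels, 2 ** @labels;
}
print "@labels\n";   # 1 2 4 8 16
```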
And for the rendering, I reach for my favorite templating system, Template Toolkit. I need to pass the data into the template using a hash, which I create in lines 38 to 41. Inside the template, sums refers to a reference to my %sums hash, and so on.
Lines 43 and 44 create the top-level Template object, configured to invoke post-chomping (to avoid spurious newlines in the output), and invoke the template on my data. The result appears on STDOUT.
This is the end of the Perl code. The rest of the file is a Template Toolkit template in a giant ``here'' document. Line 45 declares that the blocks returned by stat are in 1K blocks, which I determined by experimentation. (The unit of the blocks value is system-specific; 512-byte blocks are also common, so check your own platform before trusting the totals.)
Lines 46 to 49 are a slight tweak to make the table cells look reasonable in my browser. I'm no CSS expert, so if this is broken, just remember that I never claimed to be ``Just another CSS hacker''. It does seem to do what I want though.
Lines 51 to 67 define a macro to take a number and format it nicely into 4 characters or less (up to a terabyte, and after that, you're on your own, you lucky dog). I tried using Template::Plugin::Number::Format, but it didn't do what I wanted, so I wrote my own. I use this both for the byte counts and for the file counts, simply by applying the BLOCKSIZE scaling factor as needed.
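For readers who find the TT macro syntax hard to scan, here's a rough plain-Perl equivalent of the nice() macro. This is my own translation, not part of the listing, and it collapses the six ELSIF branches into a loop over the units:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Roughly the template's nice() macro: format a number into 4 characters
# or less, switching from %.2g to %d once the value reaches 9.995 units
# (so we never print five characters like "10.2K").
sub nice {
    my $n = shift;
    for ([2**30, 'G'], [2**20, 'M'], [2**10, 'K']) {
        my ($unit, $suffix) = @$_;
        return sprintf '%d%s',   $n / $unit, $suffix if $n >= $unit * 9.995;
        return sprintf '%.2g%s', $n / $unit, $suffix if $n >= $unit;
    }
    return $n;    # small numbers pass through unchanged
}

print nice($_), "\n" for 512, 1536, 10240, 2 * 2**20;
```

For example, 1536 renders as 1.5K and 10240 as 10K, matching the behavior the template's thresholds are aiming for.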
Lines 68 to 70 define a macro to show bytes and file count in a consistent way from column to column, on separate lines in the HTML.
Lines 71 to 107 define the main loop, walking through the results user by user.
We need to preface the first user's output with all of the table header information, including the column headers. This is detected in lines 72 to 84.
Lines 74 to 83 produce the column heads by walking through the labels array. Ahead of the first column, I insert the username column (line 76), and after the last column, I append the totals column (line 80). The remaining labels are scaled by the block size and displayed using the nice number-formatting routine (line 78).
Once the headers are displayed, it's time to display each row. Lines 85 to 103 iterate over the columns of a given row, using the labels to define the columns over which we will iterate.
As with the column headers, we'll insert the username before the first data column, and initialize the total accumulators (lines 88 to 91). And after the data is complete, the totals are dumped (lines 99 to 102).
For each data cell, we'll pull out the detailed hashref in line 87. If this cell exists (checked in line 92), then we show the data (lines 93 to 95). Otherwise, we need to include a non-breaking space so that the browser renders the cell anyway (line 97).
Okay, this started out being a 10-line program, but there it is. A sample run on my laptop's library directory, as rendered by my browser, is shown in [figure one]. I hope you find this tool useful, and that you learned a few things about creating data-reduction tools. Until next time, enjoy!
Listing One
=1=     #!/usr/bin/perl
=2=     use strict;
=3=     use warnings;
=4=
=5=     use constant ONEOVERLOG2 => 1/log(2);
=6=
=7=     use File::Find;
=8=     use Template;
=9=
=10=    my %sums;
=11=    ## $sums{$uid}[$log_2_blocks]{blocks} += $blocks;
=12=    ## $sums{$uid}[$log_2_blocks]{count} ++;
=13=
=14=    @ARGV = qw(.) unless @ARGV;
=15=
=16=    my %seen;
=17=
=18=    find sub {
=19=      return if -l;
=20=      return unless -f _ or -d _;
=21=      my @stat = stat(_) or return;
=22=      my ($dev, $ino, $uid, $blocks) = @stat[0, 1, 4, 12];
=23=      return if $seen{"$dev $ino"}++;
=24=
=25=      my $slot = $blocks ? log($blocks) * ONEOVERLOG2 : 0;
=26=      $sums{$uid}[$slot]{blocks} += $blocks;
=27=      $sums{$uid}[$slot]{count}++;
=28=    }, @ARGV;
=29=
=30=    my %usernames = map { $_ => scalar getpwuid $_ } keys %sums;
=31=    my @labels;
=32=    for (values %sums) {
=33=      while (@labels < @$_) {
=34=        push @labels, 2**@labels;
=35=      }
=36=    }
=37=
=38=    my %VARS =
=39=      (sums => \%sums,
=40=       usernames => \%usernames,
=41=       labels => \@labels);
=42=
=43=    Template->new({ POST_CHOMP => 1 })
=44=      ->process(\<<'END_OF_TEMPLATE', \%VARS) or die Template->error;
=45=    [% BLOCKSIZE = 1024 %]
=46=    <style type="text/css">
=47=    th, td {
=48=      margin: 0; border: 1px dotted black; padding: 0.2em; text-align: center;
=49=    }
=50=    </style>
=51=    [% MACRO nice(n) BLOCK;
=52=       IF n >= 1073741824 * 9.995;
=53=         n / 1073741824 | format('%dG');
=54=       ELSIF n >= 1073741824;
=55=         n / 1073741824 | format('%.2gG');
=56=       ELSIF n >= 1048576 * 9.995;
=57=         n / 1048576 | format('%dM');
=58=       ELSIF n >= 1048576;
=59=         n / 1048576 | format('%.2gM');
=60=       ELSIF n >= 1024 * 9.995;
=61=         n / 1024 | format('%dK');
=62=       ELSIF n >= 1024;
=63=         n / 1024 | format('%.2gK');
=64=       ELSE;
=65=         n;
=66=       END;
=67=    END %]
=68=    [% MACRO bc(b,c) BLOCK %]
=69=    <td>[% bytes = b * BLOCKSIZE; nice(bytes) %]<br/>[% nice(c) %]</td>
=70=    [% END %]
=71=    [% FOR uid = sums.keys.nsort %]
=72=    [% IF loop.first %]
=73=    <table>
=74=    [% FOR label = labels %]
=75=    [% IF loop.first %]
=76=    <tr><th>User</th>
=77=    [% END %]
=78=    <th>[% label_bytes = label * BLOCKSIZE; nice(label_bytes) %]</th>
=79=    [% IF loop.last %]
=80=    <th>Total</th>
=81=    </tr>
=82=    [% END %]
=83=    [% END %]
=84=    [% END %]
=85=    [% labels_max = labels.max %]
=86=    [% FOR column = [0..labels_max] %]
=87=    [% cell = sums.$uid.$column %]
=88=    [% IF loop.first %]
=89=    [% total_blocks = 0; total_count = 0 %]
=90=    <tr><th>[% usernames.$uid || "?"; " ("; uid; ")" %]</th>
=91=    [% END %]
=92=    [% IF cell %]
=93=    [% total_blocks = total_blocks + cell.blocks %]
=94=    [% total_count = total_count + cell.count %]
=95=    [% bc(cell.blocks, cell.count) %]
=96=    [% ELSE %]
=97=    <td>&nbsp;</td>
=98=    [% END %]
=99=    [% IF loop.last %]
=100=   [% bc(total_blocks, total_count) %]
=101=   </tr>
=102=   [% END %]
=103=   [% END %]
=104=   [% IF loop.last %]
=105=   </table>
=106=   [% END %]
=107=   [% END %]
=108=   END_OF_TEMPLATE