Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Web Techniques Column 36 (Apr 1999)
Well, last month's column was a pretty heavy piece of work, clocking in at 312 lines of code, and a correspondingly large amount of descriptive text. This month, I decided to get back to basics and tackle a simple but annoying problem on your typical web page: making it load faster.
One of the things you can do to make a web page appear to load faster is to give the browser hints about the ultimate size of its images. In modern HTML, the IMG tag accepts WIDTH and HEIGHT attributes to give the pixel dimensions of the image. The browser can use this to leave a hole of the appropriate size while the rest of the HTML is still loading, and even while the picture image data is being fetched in a separate HTTP transaction. While this doesn't actually make the page load any faster, it seems to calm the users down a bit more, since things aren't jumping around for as long.
But, to make this work, you've got to get the actual pixel sizes into the HTML code. Doing this by hand means downloading the image into your favorite image manipulation tool, looking at the information for the picture, noting the pixel size, and then invoking your favorite text editor to hack the HTML. Bleh. No wonder it doesn't get done as often as it could.
But, thanks to the nice Image::Size module (available from the CPAN at http://www.cpan.org/CPAN.html and other places), I was able to write a program to automatically fetch the image, compute its size, and then edit that data right into the HTML! No more excuses: my web pages will now have sizes on them! The Image::Size module handles all the common image formats, such as GIF, JPEG, and PNG, as well as some that you probably won't be using on the web.
To fix an index.html and reference.html file, for example, I can now enter:

    addsize -i index.html reference.html
The -i switch here says to edit these files in place, which means that the files will be changed, with the old versions saved under the original names plus an appended tilde.
So, let's examine together the program presented in Listing One, below.
Line 1 turns on warnings, while line 2 enables all compiler restrictions. These selections make writing any program longer than about ten lines easier to get right the first time.
Line 4 pulls in the URI::file module, from the new URI distribution. This class (or the class it superseded) was formerly part of the huge LWP distribution (in the CPAN). Now it's a separate piece. You can still install this piece and any other former LWP pieces using the Bundle::LWP installation from CPAN.pm like so:

    $ perl -MCPAN -eshell
    cpan> install Bundle::LWP
    [lots of output]
    cpan> quit
    $
The URI::file module creates objects that represent a URI for a disk file (usually starting with a scheme of file:). This will be used later to translate the command line arguments into an appropriate object for relative and absolute addressing.
Lines 6 through 50 create a subclass from the HTML::Filter class. Once again, the base class is a part of the LWP bundle described above. These lines are wrapped in a BEGIN block, both to localize the effects of setting the package in line 7 and to ensure that any needed modules are brought in and initialized before the rest of the program is parsed.
Line 7 sets the class name as a package name: MyFilter.
Line 8 pulls in the HTML::Filter module, and sets MyFilter's inheritance to include that module. If you're running a version of Perl prior to 5.005, you'll need to replace that line with:

    use HTML::Filter;
    @MyFilter::ISA = qw(HTML::Filter);

which seems like more work, but that's why I used base instead.
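As a minimal sketch of why the two spellings are interchangeable, here are two made-up subclasses of a made-up base class. The names Animal, Dog, and Cat are assumptions for illustration only (Animal stands in for HTML::Filter); one subclass sets @ISA by hand and the other lets base do it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical base class, standing in for HTML::Filter.
{
    package Animal;
    sub new   { my $class = shift; return bless {}, $class }
    sub speak { return "generic noise" }
}

# Older style: set the inheritance list by hand.
{
    package Dog;
    @Dog::ISA = ('Animal');
}

# Newer style: let the base pragma set @ISA for us.
# (base would also load Animal.pm if the class lived in its own file.)
{
    package Cat;
    use base qw(Animal);
}

# Both subclasses find speak() in Animal.
print "Dog says: ", Dog->new->speak, "\n";
print "Cat says: ", Cat->new->speak, "\n";
```

Either way, method lookup walks @ISA, so the subclass behaves the same.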
Lines 9 through 11 bring in three other useful classes. Image::Size is described above. HTML::Entities is found in the LWP bundle, and lets us provide proper escaping for the HTML attribute values. And finally, LWP::Simple gives us an easy way to fetch remote images so that their sizes can be added as well.
Lines 13 through 19 define an overridden constructor method called new. The first parameter to this class method will be the class name (package name), which I shift off the @_ array in line 14.
The second parameter will be an object of type URI (or one of its subclasses). This object will be used to construct proper absolute pathnames (or URLs) when we're given a relative URL in an image source URL. We'll shift this off in line 15, saving it for a moment in a temporary variable.
Line 16 calls the superclass (in this case, HTML::Filter) to construct the base object, passing along any other parameters (usually none, in this case). The SUPER syntax here ensures that we don't need to know the inheritance path currently established, although for this example, the path is trivial to determine upon inspection. The result is our object to be returned, saved for the moment into $self.
Line 17 stores the saved URI in an instance variable called _uri. Note that I've determined by inspection that this name is available. If I weren't sure, I'd pick something like _MyFilter_uri_, which would be very unlikely to conflict.
Finally, the newly constructed object is returned in line 18.
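The constructor pattern just described can be tried without HTML::Filter installed. This is only a sketch: BaseParser and MyParser are made-up class names, and the assumption that the parent object is a blessed hash is mine (it is what makes a key like _uri usable).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical parent class, standing in for HTML::Filter.
# Assumption: the object is a blessed hash.
{
    package BaseParser;
    sub new {
        my $class = shift;
        my %args  = @_;
        return bless { %args }, $class;
    }
}

{
    package MyParser;
    our @ISA = ('BaseParser');

    sub new {
        my $package = shift;                     # class name
        my $uri     = shift;                     # our extra leading argument
        my $self    = $package->SUPER::new(@_);  # the rest goes to the parent
        $self->{_uri} = $uri;                    # keep it in an instance variable
        return $self;
    }
}

my $obj = MyParser->new("file:/tmp/index.html", verbose => 1);
print ref($obj), " $obj->{_uri} $obj->{verbose}\n";
```

The subclass peels off only the argument it owns and hands everything else up the chain, so the parent never sees the URI.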
Lines 21 through 50 define the start method. This method is called automatically by the HTML::Filter class whenever a start tag is seen, such as <img src=...>. We're overriding the default method (in HTML::Filter::start), which simply dumps the tag to the output. We override it because for some tags (namely the image tag), we're gonna make some changes and decisions.
Line 22 grabs the current object into a local variable $self.
Line 23 captures the incoming parameters: the tag name, a hashref for the attributes, a listref that gives the original sequence of those attributes, and the original untouched text (for a quick passthrough if no editing is required).
Lines 24 through 26 detect a base tag, used in the HTML header to define an alternate URL for relative references. This is important to notice, because we need to fetch images according to this tag as well. If we see one of these, we'll grab the URL and stuff it into the _uri instance variable.
Lines 27 through 47 handle an IMG tag that needs to be rewritten. Lines 28 through 31 first determine if we're looking at something that needs to be hacked (must be an IMG tag, must have a SRC attribute, must not already have WIDTH and HEIGHT attributes). If anything doesn't pass muster, the last operator breaks us out to line 48.
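The bare block with last is worth a closer look: a bare block acts as a loop that runs once, so last jumps to just past its closing brace. Here is a minimal sketch of the same guard sequence; the decide helper and its return strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: "rewrite" when every guard passes,
# "passthrough" when any guard bails out of the bare block.
sub decide {
    my ($tag, $attr) = @_;
    {
        last unless $tag eq 'img';        # not an image tag
        last unless exists $attr->{src};  # nothing to measure
        last if exists $attr->{width};    # already sized
        last if exists $attr->{height};
        return "rewrite";
    }
    return "passthrough";                 # where "last" lands
}

print decide('img', { src => 'a.gif' }), "\n";
print decide('img', { src => 'a.gif', width => 10 }), "\n";
print decide('p',   {}), "\n";
```

This keeps the fall-through case in exactly one place, rather than repeating it after every failed test.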
Lines 32 and 33 compute the URI for the SRC attribute. We'll take the given attribute value, and compute an absolute URI based on the _uri instance variable. We'll use this either to open a local file or fetch a remote URL to determine the image size.
Lines 34 through 37 compute the image size. If the source URI scheme is file, then we're looking at what was originally a relative URL, because we've made it absolute against a file URL representing this particular HTML file. Of course, when this document is ultimately fetched, it'll be an HTTP URL, but that's not relevant here.
If we're looking at a file URL, then line 36 calls the imgsize routine (imported from the Image::Size package) on the filename. Otherwise, we'll use the get routine (from LWP::Simple) to fetch the contents of the remote URL, and pass a reference to that scalar data to imgsize. The imgsize routine distinguishes filenames from actual data by noting that actual data is always passed as a scalar reference. The return value from one of these two calls to imgsize ends up in @xy.
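The filename-or-scalar-reference convention is easy to imitate. The sketch below is a hand-rolled stand-in, not Image::Size's own code: png_size is a made-up helper that understands only PNG, relying on the PNG layout of an 8-byte signature followed by an IHDR chunk whose first two fields are the width and height as big-endian 32-bit integers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Image::Size): takes a filename or a
# reference to a scalar holding image bytes, and returns (width, height)
# for a PNG, or an empty list on any failure.
sub png_size {
    my $source = shift;
    my $header;
    if (ref($source) eq 'SCALAR') {
        $header = substr($$source, 0, 24);      # in-memory data
    } else {
        open my $fh, '<', $source or return;    # otherwise, a filename
        binmode $fh;
        read($fh, $header, 24);
        close $fh;
    }
    return unless defined $header and length($header) == 24;
    return unless substr($header, 0, 8) eq "\x89PNG\r\n\x1a\n";
    return unless substr($header, 12, 4) eq 'IHDR';
    return unpack 'NN', substr($header, 16, 8); # width, height
}

# Build just the first 24 bytes of a 640x480 PNG in memory.
my $data = "\x89PNG\r\n\x1a\n" . pack('N', 13) . 'IHDR' . pack('NN', 640, 480);
my ($w, $h) = png_size(\$data);
print "$w x $h\n";
```

Dispatching on ref means a caller never has to say which kind of argument it is passing; a plain string is a filename, and a reference is data.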
If imgsize fails for any reason, the first two elements of @xy are undef, which I'll test for in line 38. If we don't get a good value, we'll bail, and just dump out the original text below.
Line 39 takes the X and Y values, and stores those as new attributes in the hash pointed to by $attr, using the hashref-slice notation here.
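If the hashref-slice notation is unfamiliar, this sketch shows it in isolation. The attribute hash and the numbers in @xy are invented values, shaped like a width, a height, and a trailing type identifier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $attr = { src => "logo.gif" };
my @xy   = (120, 60, "GIF");    # made-up values: width, height, type

# Assign two hash elements at once through the reference;
# same effect as setting $attr->{width} and $attr->{height} separately.
@$attr{qw(width height)} = @xy[0, 1];

print join(", ", map { "$_=$attr->{$_}" } sort keys %$attr), "\n";
```

The array slice on the right picks out only the first two elements, so anything after them is ignored.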
Lines 40 through 44 build up a new tag-attribute string. Each entry in the list pointed to by $attrseq, plus the new width and height values, is dumped. Note that we need to encode the HTML-significant entities from the attribute values, so we're calling encode_entities (from HTML::Entities) to handle that.
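Here is the tag-rebuilding loop pulled out into a standalone sketch, so the effect of encode_entities is visible. The build_tag wrapper and the sample attribute values are made up for illustration; the loop body follows the listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

# Hypothetical wrapper around the loop: original attributes in their
# original order, then width and height, each value entity-encoded.
sub build_tag {
    my ($tag, $attr, $attrseq) = @_;
    my $tmp = "<$tag";
    for (@$attrseq, qw(width height)) {
        $tmp .= qq/ $_="/ . encode_entities($attr->{$_}) . q/"/;
    }
    return "$tmp>";
}

my $attr = { src => "a.gif", alt => 'Tom & "Jerry"', width => 10, height => 20 };
print build_tag("img", $attr, [qw(src alt)]), "\n";
```

Without the encoding step, the double quotes inside the ALT text would end the attribute value early.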
Once the string is built up, we'll dump it in line 45 by calling the output method. By default, this is merely a print to the default filehandle, which is exactly where we want it to go. But we'll call the method anyway in case someone subclasses my filter routine, overriding output (it could happen, but not in this program).
Line 48 is selected only when the start tag needs to be output exactly as it was input. This happens nearly all the time, so this line gets called a lot.
And that defines the class MyFilter, a subclass of HTML::Filter, with specific instructions to read HTML data, look for image tags, determine the size of the corresponding images, and rewrite those items as necessary. Now all we have to do is call an object of that class.
Line 52 undefines the $/ variable. When this variable is undef, any "line" read operation becomes an entire file read operation. Very useful here, as you'll see a few lines down.
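A quick way to see the effect of $/ is to read the same handle both ways. This sketch reads from an in-memory filehandle so it needs no files on disk, and it uses local to limit the change to one block, which is a variation on the program's global undef rather than what the listing does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "line one\nline two\nline three\n";

# An in-memory filehandle, so the example is self-contained.
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";

my $first_line = <$fh>;     # $/ is still "\n": one line comes back

my $rest = do {
    local $/;               # undef, but only inside this block
    <$fh>;                  # one read now returns everything left
};
close $fh;

print "first read: ", length($first_line), " bytes\n";
print "second read: ", length($rest), " bytes\n";
```

In a short script that slurps every file it touches, the global undef is simpler; local matters more when other code in the same program still expects line-at-a-time reads.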
Line 53 notes a -i option on the command line. If the option is present, we'll enable in-place editing mode, affecting the way the diamond operator in line 54 opens up a new file. The backup file extension is set to tilde by default. However, if there's an extension present after the -i parameter, then we'll use that instead.
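The option handling can be exercised on its own. In this sketch, split_inplace_switch is a made-up helper that works on a copy of the argument list instead of touching @ARGV and $^I directly; the matching logic mirrors the one-liner in the listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the backup extension (undef when -i is
# absent) followed by the remaining arguments.
sub split_inplace_switch {
    my @args = @_;
    my $ext;
    if (@args and $args[0] =~ /^-i(.*)/) {
        $ext = $1 || "~";   # default to tilde when nothing follows -i
        shift @args;        # drop the switch itself
    }
    return ($ext, @args);
}

my ($ext, @files) = split_inplace_switch('-i.bak', 'index.html', 'reference.html');
print "backup extension: $ext\n";
print "files: @files\n";
```

In the real program the extension goes into $^I, which is what makes the diamond operator rename each input file and redirect output to its replacement.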
Lines 54 through 58 form a diamond loop, reading through the filenames now present in @ARGV. As each file is read, the entire contents end up in $_, and the filename is in $ARGV.
Line 55 creates a new URI object: actually, in this case a URI::file object. We'll dump out the filename in line 56 for safe keeping (or a progress indicator).
Finally, the major work gets done in line 57. A call to the new method returns the parsing object, on which we then invoke the parse method, passing it the contents of the file, and then signal end of file by calling eof. This will result in a bunch of stuff being dumped to the currently selected filehandle (either STDOUT, or ARGVOUT if we're in in-place editing mode), and we're done!
And there you have it. A small-sized program to do a giant-sized service to your web site visitors. Until next time, enjoy!
Listing One
    =1=     #!/usr/bin/perl -w
    =2=     use strict;
    =3=
    =4=     use URI::file;
    =5=
    =6=     BEGIN {
    =7=       package MyFilter;
    =8=       use base qw(HTML::Filter);
    =9=       use Image::Size;
    =10=      use HTML::Entities;
    =11=      use LWP::Simple;
    =12=
    =13=      sub new {
    =14=        my $package = shift;
    =15=        my $uri = shift;
    =16=        my $self = $package->SUPER::new(@_);
    =17=        $self->{_uri} = $uri;
    =18=        $self;
    =19=      }
    =20=
    =21=      sub start {
    =22=        my $self = shift;
    =23=        my($tag, $attr, $attrseq, $origtext) = @_;
    =24=        if ($tag eq 'base' and exists $attr->{href}) {
    =25=          $self->{_uri} = URI->new($attr->{href});
    =26=        }
    =27=        {
    =28=          last unless $tag eq 'img';
    =29=          last unless exists $attr->{src};
    =30=          last if exists $attr->{width};
    =31=          last if exists $attr->{height};
    =32=          my $src = $attr->{src};
    =33=          my $src_uri = URI->new_abs($src, $self->{_uri});
    =34=          my @xy =
    =35=            $src_uri->scheme eq "file" ?
    =36=              imgsize($src_uri->path) :
    =37=                imgsize(\get($src_uri));
    =38=          last unless defined $xy[0];
    =39=          @$attr{qw(width height)} = @xy[0,1];
    =40=          my $tmp = "<$tag";
    =41=          for (@$attrseq, qw(width height)) {
    =42=            $tmp .= qq/ $_="/.encode_entities($attr->{$_}).q/"/;
    =43=          }
    =44=          $tmp .= ">";
    =45=          $self->output($tmp);
    =46=          return;
    =47=        }
    =48=        $self->output($origtext);
    =49=      }
    =50=    }
    =51=
    =52=    undef $/;
    =53=    shift, $^I = ($1 || "~") if @ARGV and $ARGV[0] =~ /^-i(.*)/;
    =54=    while (<>) {
    =55=      my $file = URI::file->new_abs($ARGV);
    =56=      print STDOUT "===== $ARGV =====\n";
    =57=      MyFilter->new($file)->parse($_)->eof;
    =58=    }