Cleaning white spaces from compiled html


arquitecto

Aug 24, 2010, 5:51:17 AM
to nanoc
Hi!
I noticed that after compiling the site's content I get a lot of
whitespace and blank lines in the output.
Of course this doesn't make any difference in the browser!
Anyway, aren't we all obsessive-compulsive about clean code?
;)
So, does anyone know of a filter that finds and removes all
unnecessary whitespace, which we could put in the rules file?
Thanks and happy summer!

Denis Defreyne

Aug 24, 2010, 6:55:40 AM
to na...@googlegroups.com

Hi,

What exactly do you want to achieve? Do you want an HTML prettifier or an HTML minifier?

One tool that might be useful for both is Nokogiri. “Nokogiri::HTML.parse(content).to_s” should get rid of most unnecessary whitespace and return something that is both fairly compact and readable. There are probably better solutions out there though…

Denis

Denis Defreyne

Aug 24, 2010, 7:38:56 AM
to na...@googlegroups.com
Alternatively, Tidy: <http://tidy.sourceforge.net/>

Denis

arquitecto

Aug 24, 2010, 9:54:49 AM
to nanoc
Hi Denis,
Thanks.
The idea was to erase the blank lines that appear between links generated
when the site is compiled.
I just wanted to know if anyone had already solved this issue and how.
I will try the options you mentioned; they seem nice.


arquitecto
http://www.utopia-projectos.com

Arno Hautala

Aug 24, 2010, 5:17:54 PM
to na...@googlegroups.com
On Tue, Aug 24, 2010 at 09:54, arquitecto <rtedi...@gmail.com> wrote:
>
> The idea was to erase the blank lines that appear between links generated
> when the site is compiled.

I've been using two Perl scripts to achieve something similar to this.

The first (strip_blank_lines) simply removes blank lines and performs
no other formatting.
The second (compact) removes as much whitespace as possible
(between tags, at the end and beginning of lines that aren't between
<pre> or <script> blocks, etc.).

They're quite simple, but seem to have worked fine for me so far.

I can post them if you're interested.

--
arno  s  hautala    /-|   ar...@alum.wpi.edu

pgp eabb6fe6 d47c500f b2458f5d a7cc7abb f81c4e00

arquitecto

Aug 25, 2010, 11:48:48 AM
to nanoc
Hi Arno!
Thank you very much.
You are very kind, but don't bother posting them.
I'll stick with Nokogiri, at least for now...
It's interesting to see that nanoc users have multiple languages in
their pockets... Perl, PHP, etc...
I thought they were all "ruby talibans"!
:)
Not that I am one of the "ruby talibans"!
:o
Oh boy..., now all my e-mails will be checked!
Run Arno, Run!


Arquitecto
http://www.utopia-projectos.com

arquitecto

Aug 30, 2010, 5:04:40 PM
to nanoc
Nokogiri and Tidy are great tools, but I always prefer low-tech for
small jobs.
So, for those interested, here is the solution to my problem.
First, a file in the lib directory with this code:
module Nanoc3::Filters

  # This filter removes blank lines
  class Clean < Nanoc3::Filter

    identifier :clean

    type :text => :text

    def run(content, params={})
      # Delete every line that is empty or contains only whitespace
      content.gsub(/^\s*\n/, "")
    end

  end
end

After that, you just have to add this to the rules file wherever you
want to clean the HTML items:

filter :clean

Et voilà! A simple gsub call with a regexp solved it! I'm in love
with Ruby!

Susana

Aug 30, 2010, 10:32:45 AM
to nanoc
Hi,
Same issue here and a few questions:
1- What about using Hpricot? Is it worse?
2- I read on the wiki a nice way to parse our content and add
permalinks to headers http://projects.stoneship.org/trac/nanoc/wiki/Tips/AddingHeaderPermalinks
So, like in this example for Hpricot, in order to use Nokogiri, I
should create a new filter, and call it from the rules file, right?
3- Wouldn't it be a good idea to have the default nanoc configuration
apply one of these filters after all the layouts? Or am I missing
something?
Thanks, and great, great app you got there!

Arno Hautala

Aug 30, 2010, 6:24:11 PM
to na...@googlegroups.com
On Mon, Aug 30, 2010 at 17:04, arquitecto <rtedi...@gmail.com> wrote:
>
> Nokogiri and Tidy are great tools but i always prefer low-tech for
> small jobs.
> Therefore for those interested here is the solution for my problem.

Somewhat similar to my solution. Mine is in Perl and should be easy
enough to port to Ruby, but it works fine as is.

lib/filters/strip_blank_lines.rb:

module Nanoc3::Filters
  class StripBlankLines < Nanoc3::Filter

    identifier :strip_blank_lines

    def run(content, params={})
      open('|/usr/bin/perl lib/scripts/strip_blank_lines.pl', 'r+') do |io|
        io.write(content)
        io.close_write
        io.read
      end
    end

  end
end


lib/scripts/strip_blank_lines.pl:

#!/usr/bin/perl

use warnings;
use strict;

my $count_pre = 0;

while (<>) {
    if ($_ =~ m|<pre>|) { $count_pre ++; }
    if ($_ =~ m|</pre>|) { $count_pre --; }

    # if we're inside any number of <pre> blocks, skip the reduction
    if ( 0 == $count_pre ) {
        $_ =~ s/^\s*$//; # zap empty lines
    }

    # print the result
    print ;
}


As you can see, it also skips removing blank lines if they're within
one or more <pre> blocks. Presumably, those have some meaning and
should stay.
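For readers who would rather stay in Ruby, the same idea can be sketched without shelling out to Perl. This is only a rough port under my own assumptions, not the script above: the method name strip_blank_lines and the variable pre_depth are made up for illustration, and like the Perl version it only recognizes a bare <pre> tag (no attributes) and works line by line.

```ruby
# Hypothetical Ruby sketch of the strip_blank_lines idea (not the original script).
# Removes whitespace-only lines, except while inside one or more <pre> blocks.
def strip_blank_lines(content)
  pre_depth = 0
  content.each_line.map do |line|
    # Track how many <pre> blocks are currently open.
    pre_depth += 1 if line.include?('<pre>')
    pre_depth -= 1 if line.include?('</pre>')

    # Outside <pre>, drop lines that contain only whitespace; keep everything else.
    (pre_depth.zero? && line.strip.empty?) ? '' : line
  end.join
end

html = "<p>a</p>\n\n<pre>\nx\n\n  y\n</pre>\n   \n<p>b</p>\n"
puts strip_blank_lines(html)
```

In a nanoc filter, the body of run could then simply call strip_blank_lines(content), assuming the method is defined somewhere under lib.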

My compact filter is similar, but removes quite a bit more whitespace.
The result is smaller and harder to read.

lib/scripts/compact.pl:

#!/usr/bin/perl

use warnings;
use strict;

my $count_pre = 0;

while (<>) {
    if ($_ =~ m|<pre>|) { $count_pre ++; }
    if ($_ =~ m|</pre>|) { $count_pre --; }

    if ($_ =~ m|<script|) { $count_pre ++; }
    if ($_ =~ m|</script>|) { $count_pre --; }

    # if we're inside any number of <pre> or <script> blocks, skip the reduction
    if ( 0 == $count_pre ) {
        $_ =~ s/[\r\n]/ /g; # turn line endings into spaces
        $_ =~ s/\s+/ /g; # reduce serial whitespace to 1 character
        $_ =~ s/>\s+</></g; # zap whitespace between tags
        $_ =~ s/>\s+$/>/g; # zap whitespace after tags
        $_ =~ s/^\s+</</g; # zap whitespace before tags
        $_ =~ s/^\s*$//; # zap empty lines

    # if we're inside a <pre> or <script> block, only zap whitespace
    # before the opening tag
    } else {
        $_ =~ s/^\s+<pre>/<pre>/;
        $_ =~ s/^\s+<script/<script/;
    }

    # print the result
    print ;
}


Here I'm also avoiding the removal of spacing within <script> blocks,
because this could cause problems in some cases (e.g. comment lines).
I'm also fairly certain that there are cases which could cause
problems if the opening and closing <pre> or <script> tags aren't on
separate lines. I'm fairly consistent in how I generate these, so I
haven't run into that yet.

This could also be ported to Ruby. Again, I haven't taken the time,
but probably should at some point, if only for the exercise.
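As a starting point for such a port, here is a rough Ruby sketch of the compact logic. It is an illustration under assumptions, not a tested replacement: the method name compact_html and the variable depth are hypothetical, and it shares the limitation mentioned above that opening and closing <pre> or <script> tags are expected on separate lines.

```ruby
# Hypothetical Ruby sketch of the compact idea (not the original Perl script).
# Collapses whitespace line by line, except inside <pre> or <script> blocks.
def compact_html(content)
  depth = 0
  content.each_line.map do |line|
    # One shared counter for open <pre> and <script> blocks.
    depth += 1 if line.include?('<pre>')
    depth -= 1 if line.include?('</pre>')
    depth += 1 if line.include?('<script')
    depth -= 1 if line.include?('</script>')

    if depth.zero?
      line = line.gsub(/[\r\n]/, ' ')    # turn line endings into spaces
      line = line.gsub(/\s+/, ' ')       # reduce serial whitespace to 1 character
      line = line.gsub(/>\s+</, '><')    # zap whitespace between tags
      line = line.sub(/>\s+\z/, '>')     # zap whitespace after a trailing tag
      line = line.sub(/\A\s+</, '<')     # zap whitespace before a leading tag
      line.sub(/\A\s*\z/, '')            # zap empty lines
    else
      # Inside a protected block: only strip whitespace before the opening tag.
      line.sub(/\A\s+<pre>/, '<pre>').sub(/\A\s+<script/, '<script')
    end
  end.join
end

html = "<ul>\n  <li>a</li>\n\n  <li>b</li>\n</ul>\n  <pre>\n  keep   this\n</pre>\n"
puts compact_html(html)
```

Since each line is handled on its own, \A and \z are used where the Perl version uses ^ and $; within a single line without its newline they should behave the same way.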

Denis Defreyne

Aug 31, 2010, 5:11:45 AM
to na...@googlegroups.com
On 30 Aug 2010, at 16:32, Susana wrote:

> Same issue here and a few questions:
> 1- What about using Hpricot? Is it worse?

Both nokogiri and Hpricot will work, but I’m more of a nokogiri fan (I used Hpricot in the past, but I find nokogiri to be easier to use). I’m not sure how Hpricot’s whitespace removal works, or whether it even does that at all…

> 2- I read on the wiki a nice way to parse our content and add
> permalinks to headers http://projects.stoneship.org/trac/nanoc/wiki/Tips/AddingHeaderPermalinks
> So, like in this example for Hpricot, in order to use Nokogiri, I
> should create a new filter, and call it from the rules file, right?

Yep, that’s correct. The filter would parse the content with nokogiri/Hpricot and then turn the parsed document back into (X)HTML and return that.

> 3- Wouldn't it be a good idea to have the default nanoc configuration
> apply one of these filters after all the layouts? Or am I missing
> something?

I’d rather keep the default nanoc configuration as minimal as possible. A lot of people would probably not see the use for such a filter. Adding such a filter would also create an unnecessary dependency on Hpricot or nokogiri. You can use it if you want, but it shouldn’t be enabled/included by default.

Denis

Denis Defreyne

Aug 31, 2010, 5:15:07 AM
to na...@googlegroups.com
On 30 Aug 2010, at 23:04, arquitecto wrote:

> Nokogiri and Tidy are great tools, but I always prefer low-tech for
> small jobs.
> So, for those interested, here is the solution to my problem.

> First a file on lib directory with this code: [snip]

Hi,

First of all, “type :text => :text” can be shortened to “type :text” and, even better, can be left out entirely because it is the default anyway.

I don’t think your filter will always work as expected. Imagine running the filter on this piece of HTML code:

<pre>
class Foo
  def moo
    puts "hello"
  end
end
</pre>

Your filter would strip all leading whitespace, which will ruin the indentation of the code block. The only way to make sure that this doesn’t happen is to use an HTML parser; there’s no way to properly parse HTML and XML with regular expressions (because HTML and XML are not regular languages).

Denis

Denis Defreyne

Aug 31, 2010, 5:17:03 AM
to na...@googlegroups.com
On 31 Aug 2010, at 00:24, Arno Hautala wrote:

> On Mon, Aug 30, 2010 at 17:04, arquitecto <rtedi...@gmail.com> wrote:
>>
>> Nokogiri and Tidy are great tools, but I always prefer low-tech for
>> small jobs.
>> So, for those interested, here is the solution to my problem.
>
> Somewhat similar to my solution. Mine is in Perl and should be easy
> enough to port to Ruby, but it works fine as is. [..]

Ahh, that’s not a bad solution. It doesn’t have the issues with <pre> elements that I mentioned in my previous mail, so this filter would certainly work better. Thanks for sharing!

Denis

arquitecto

Aug 31, 2010, 6:14:28 AM
to nanoc

> My compact filter is similar, but removes quite a bit more whitespace.
> The result is smaller and harder to read.

Hi, Arno!
Cool script.
Thanks for sharing!
I will test your script to see the differences. But it seems quite
nice.
I also wish I knew more Perl so I could study it better...!
I only have poor Ruby skills and I am lost in this great Babylon...
:(

Denis, you can test the regexp for yourself and see that text
indentation is not deleted! Only blank lines without text are deleted!
At least in my files! But I am definitely not a Ruby master like you!
You may be seeing things I missed!

Susana, I only tried Nokogiri and I was a little disappointed...
I don't know if it helps but, once Nokogiri is installed, you can take my
filter and add these lines to it:
require 'rubygems'
require 'nokogiri'
Then replace the line with the gsub call by the code Denis suggested:
Nokogiri::HTML.parse(content).to_s
Unfortunately, and I don't know why, all my files came out blank. Maybe my
installation was damaged or I need to pass some options. But try it for
yourself and see.

Denis Defreyne

Aug 31, 2010, 6:34:17 AM
to na...@googlegroups.com
On 31 Aug 2010, at 12:14, arquitecto wrote:

> Denis, you can test the regexp for yourself and see that text
> indentation is not deleted! Only blank lines without text are deleted!
> At least in my files! But I am definitely not a Ruby master like you!
> You may be seeing things I missed!

Ahh, never mind me. You’re right; only blank lines will be removed. Regexes are easy to write but hard to read. :)

Denis
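The behaviour of the gsub(/^\s*\n/, "") call discussed in this thread can be checked in plain Ruby, outside nanoc. The snippet below is a small sketch with a made-up sample string: it suggests that indentation on non-blank lines survives, while blank lines are removed everywhere, including inside a <pre> block (which is the case the Perl script guards against).

```ruby
# Sample input (made up for illustration): an indented code block inside <pre>,
# with one blank line inside the block and one blank line after it.
html = "<pre>\nclass Foo\n  def moo\n    puts \"hello\"\n\n  end\nend\n</pre>\n\n<p>done</p>\n"

# Same regexp as in the :clean filter: delete lines that are empty or whitespace-only.
cleaned = html.gsub(/^\s*\n/, "")

puts cleaned
# Indentation of "def moo", "puts" and "end" is kept;
# both blank lines are gone, including the one inside <pre>.
```

If blank lines inside <pre> matter for a site, a line-by-line approach that tracks <pre> depth, or an HTML parser, is probably the safer choice.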
