Hi,
What exactly do you want to achieve? Do you want an HTML prettifier or an HTML minifier?
One tool that might be useful for both is Nokogiri. “Nokogiri::HTML.parse(content).to_s” should get rid of most unnecessary whitespace and return something that is fairly compact yet still readable. There are probably better solutions out there though…
Denis
I've been using two perl scripts to achieve something similar to this.
The first (strip_blank_lines) simply removes blank lines and performs
no other formatting.
The second (compact) removes as much white space as is possible
(between tags, at the end and beginning of lines that aren't between
<pre> or <script> blocks, etc.).
They're quite simple, but seem to have worked fine for me so far.
I can post them if you're interested.
--
arno s hautala /-| ar...@alum.wpi.edu
pgp eabb6fe6 d47c500f b2458f5d a7cc7abb f81c4e00
Somewhat similar to my solution. Mine is in perl and should be easy
enough to port to ruby, but it works fine as is.
lib/filters/strip_blank_lines.rb:
module Nanoc3::Filters
  class StripBlankLines < Nanoc3::Filter
    identifier :strip_blank_lines

    def run(content, params={})
      # pipe the content through the perl script and return its output
      IO.popen(['/usr/bin/perl', 'lib/scripts/strip_blank_lines.pl'], 'r+') do |io|
        io.write(content)
        io.close_write
        io.read
      end
    end
  end
end
lib/scripts/strip_blank_lines.pl:
#!/usr/bin/perl
use warnings;
use strict;

my $count_pre = 0;
while (<>) {
    if ($_ =~ m|<pre>|)  { $count_pre++; }
    if ($_ =~ m|</pre>|) { $count_pre--; }
    # if we're inside any number of <pre> blocks, skip the reduction
    if (0 == $count_pre) {
        $_ =~ s/^\s*$//; # zap empty lines
    }
    # print the result
    print;
}
As you can see, it also skips removing blank lines if they're within
one or more <pre> blocks. Presumably, those have some meaning and
should stay.
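For anyone who would rather avoid the perl dependency, here is a minimal sketch of what a pure-Ruby version of the same logic might look like. It is not code from this thread: the method name strip_blank_lines is made up for illustration, and like the perl script it only recognises a literal <pre> tag (no attributes).

```ruby
# Hypothetical Ruby port of strip_blank_lines.pl (a sketch, not the original script).
def strip_blank_lines(content)
  pre_depth = 0 # how many <pre> blocks we are currently inside
  content.each_line.map do |line|
    pre_depth += 1 if line.include?('<pre>')
    pre_depth -= 1 if line.include?('</pre>')
    # Outside <pre> blocks, drop whitespace-only lines (including their newline).
    if pre_depth.zero? && line.match?(/\A\s*\z/)
      ''
    else
      line
    end
  end.join
end
```

Under that assumption, the filter's run method could simply return strip_blank_lines(content) instead of piping the content through perl.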
My compact filter is similar, but removes quite a bit more whitespace.
The result is smaller and harder to read.
lib/scripts/compact.pl:
#!/usr/bin/perl
use warnings;
use strict;

my $count_pre = 0;
while (<>) {
    if ($_ =~ m|<pre>|)     { $count_pre++; }
    if ($_ =~ m|</pre>|)    { $count_pre--; }
    if ($_ =~ m|<script|)   { $count_pre++; }
    if ($_ =~ m|</script>|) { $count_pre--; }
    # if we're inside any number of <pre> or <script> blocks, skip the reduction
    if (0 == $count_pre) {
        $_ =~ s/[\r\n]/ /g;  # turn line endings into spaces
        $_ =~ s/\s+/ /g;     # reduce serial whitespace to 1 character
        $_ =~ s/>\s+</></g;  # zap whitespace between tags
        $_ =~ s/>\s+$/>/g;   # zap whitespace after tags
        $_ =~ s/^\s+</</g;   # zap whitespace before tags
        $_ =~ s/^\s*$//;     # zap empty lines
    } else {
        # inside a <pre> or <script> block, only zap the whitespace
        # before the opening tag
        $_ =~ s/^\s+<pre>/<pre>/;
        $_ =~ s/^\s+<script/<script/;
    }
    # print the result
    print;
}
Here I'm also avoiding the removal of spacing within <script> blocks,
because this could cause problems in some cases (e.g. comment lines).
I'm also fairly certain that there are cases which could cause
problems if the opening and closing <pre> or <script> tags aren't on
separate lines. I'm fairly consistent about this in the markup I
generate, so I haven't run into that yet.
This could also be ported to Ruby. Again, I haven't taken the time,
but probably should at some point, if only for the exercise.
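As a rough idea of what such a port might look like, here is a sketch that mirrors the substitutions in compact.pl one for one. It is an illustration only, not tested against a real site: the method name compact is hypothetical, and it shares the script's limitation that opening and closing <pre> or <script> tags are expected on separate lines.

```ruby
# Hypothetical Ruby port of compact.pl (a sketch, not the original script).
def compact(content)
  depth = 0 # nesting count of <pre> and <script> blocks
  content.each_line.map do |line|
    depth += 1 if line.include?('<pre>')
    depth -= 1 if line.include?('</pre>')
    depth += 1 if line.include?('<script')
    depth -= 1 if line.include?('</script>')

    if depth.zero?
      line = line.gsub(/[\r\n]/, ' ')  # turn line endings into spaces
      line = line.gsub(/\s+/, ' ')     # reduce serial whitespace to 1 character
      line = line.gsub(/>\s+</, '><')  # zap whitespace between tags
      line = line.sub(/>\s+\z/, '>')   # zap whitespace after a trailing tag
      line = line.sub(/\A\s+</, '<')   # zap whitespace before a leading tag
      line.sub(/\A\s*\z/, '')          # zap whitespace-only lines
    else
      # inside a <pre> or <script> block, only strip what precedes the opening tag
      line.sub(/\A\s+<pre>/, '<pre>').sub(/\A\s+<script/, '<script')
    end
  end.join
end
```

Note that the line holding the closing </pre> or </script> tag is processed with the counter back at zero, so its trailing newline is removed, just as in the perl version.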
> Same issue here and a few questions:
> 1- What about using Hpricot? Is it worse?
Both nokogiri and Hpricot will work, but I’m more of a nokogiri fan (I used Hpricot in the past, but I find nokogiri to be easier to use). I’m not sure how Hpricot’s whitespace-removing works, or whether it even does that at all…
> 2- I read on the wiki a nice way to parse our content and add
> permalinks to headers http://projects.stoneship.org/trac/nanoc/wiki/Tips/AddingHeaderPermalinks
> So, like in this example for Hpricot, in order to use Nokogiri, I
> should create a new filter, and call it from the rules file, right?
Yep, that’s correct. The filter would parse the content with nokogiri/Hpricot and then turn the parsed document back into (X)HTML and return that.
> 3- Wouldn't it be a good idea to have the default nanoc configuration
> applying one of these filters after all the layouts? Or am I missing
> something?
I’d rather keep the default nanoc configuration as minimal as possible. A lot of people would probably not see the use for such a filter. Adding such a filter would also create an unnecessary dependency on Hpricot or nokogiri. You can use it if you want, but it shouldn’t be enabled/included by default.
Denis
> Nokogiri and Tidy are great tools but I always prefer low-tech for
> small jobs.
> Therefore for those interested here is the solution for my problem.
> First a file on lib directory with this code: [snip]
Hi,
First of all, “type :text => :text” can be shortened to “type :text” and, even better, can be left out entirely because it is the default anyway.
I don’t think your filter will always work as expected. Imagine running the filter on this piece of HTML code:
<pre>
class Foo
  def moo
    puts "hello"
  end
end
</pre>
Your filter would strip all leading whitespace, which will ruin the indentation of the code block. The only way to make sure that this doesn’t happen is to use an HTML parser; there’s no way to properly parse HTML and XML with regular expressions (because HTML and XML are not regular languages).
Denis
> On Mon, Aug 30, 2010 at 17:04, arquitecto <rtedi...@gmail.com> wrote:
>>
>> Nokogiri and Tidy are great tools but I always prefer low-tech for
>> small jobs.
>> Therefore for those interested here is the solution for my problem.
>
> Somewhat similar to my solution. Mine is in perl and should be easy
> enough to port to ruby, but it works fine as is. [..]
Ahh, that’s not a bad solution. It doesn’t have the issues with <pre> elements that I mentioned in my previous mail, so this filter would certainly work better. Thanks for sharing!
Denis
> Denis, you can test the regexp for yourself and see that text
> indentation is not deleted! Only blank lines without text are deleted!
> At least on my files! But I am definitely not a ruby master like you!
> You may be watching other things I missed!
Ahh, never mind me. You’re right; only blank lines will be removed. Regexes are easy to write but hard to read. :)
Denis
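To make the difference concrete, here is a small sketch. The regexp from the original filter was snipped from this thread, so BLANK_LINE below is only an assumed stand-in for "delete lines that contain nothing but whitespace", and remove_blank_lines is a made-up helper name.

```ruby
# Assumed stand-in pattern (the original regexp is not shown in the thread):
# a line made up only of spaces or tabs, together with its line ending.
BLANK_LINE = /^[ \t]*\r?\n/

# Hypothetical helper: delete every blank line, leave all other lines untouched.
def remove_blank_lines(html)
  html.gsub(BLANK_LINE, '')
end

sample = "<pre>\nclass Foo\n  def moo\n    puts \"hello\"\n  end\n\nend\n</pre>\n"
puts remove_blank_lines(sample)
```

With a pattern like this the indentation inside the <pre> block survives, because indented lines still contain text and never match. The blank line inside the <pre> block is removed as well, though, which is the case the perl scripts guard against by counting <pre> tags.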