cleaning up msword HTML export

webdev_aw_ucsb

Sep 14, 2009, 6:43:49 PM

to nokogiri-talk

greetings --

I am trying to clean up an MSWord HTML export into something much cleaner
for consumption by TinyMCE.

I would like to take the original dirty HTML file, remove some generic
MSWord classes, and in a few instances create my own wrapper
classes ...

Here are my 2 problems:

1) I want to clean up the HTML by removing unwanted attributes & styles
and then output a simplified HTML page:
    unwanted_class_node = doc.search("//p[@class ='MsoNormal']")
    unwanted_class_node.each do |n|
      n.remove_attribute('class')
    end
    @builder = Nokogiri::HTML::Builder.new do |dest|
      dest.html {
        dest.head {}
        dest.script {}
        dest.style {}
        dest.body {
          dest.text(
            #doc.xpath('//body').children.to_html
            doc.xpath('//body').inner_html
          )
        }
      }
    end
    File.open('cleaned.html', 'w') {|f| f.write(@builder.to_html) }

But with the code above, the resulting cleaned.html contains escaped
entities (&lt; instead of <) inside the destination file's <body>
tag.

2) For some block element(s) in the source HTML there is:
blah blhal bhalb blah blah <span style="mso-spacerun: yes">     </span>foo foo foo foo foo foo foo

I would like to transform the above to essentially:
blah blhal bhalb blah blah foo foo
foo foo foo foo foo foo

But I'm having trouble accomplishing (2) as this is my first exposure
to XPath / Nokogiri -- but I believe the tools are in my hands to get
this done ... and I think this "might work" (TM).

Thanks for your time,
David

Aaron Patterson

Sep 14, 2009, 7:47:04 PM

to nokogi...@googlegroups.com

Yes, when you call the text method, it assumes that you want to insert
text and will escape characters for you. I would recommend copying
the nodes from one document to another like this:

    doc1 = Nokogiri::HTML(<<-eohtml)
      <html>
        <body><h1>Hello World</h1> how are you?</body>
      </html>
    eohtml

    dest = Nokogiri::HTML::Builder.new do |b|
      b.html do
        b.head do
          b.script
          b.style
        end

        b.body
      end
    end.doc

    # Get the destination body
    body = dest.at('body')

    # Add the children of the source body to the destination body
    doc1.at('body').children.each { |c| body << c }

    puts dest

> 2) For some block element(s) in the source HTML there is:
> blah blhal bhalb blah blah <span style="mso-spacerun: yes">     </span>foo foo foo foo foo foo foo
>
> I would like to transform the above to essentially:
> blah blhal bhalb blah blah foo foo
> foo foo foo foo foo foo
>
> But I'm having trouble accomplishing (2) as this is my first exposure
> to XPath / Nokogiri -- but I believe the tools are in my hands to get
> this done ... and I think this "might work" (TM).

If you're more comfortable with CSS, you should use CSS. Here is a
CSS query that says "find all span tags that have an attribute named
style whose value is 'mso-spacerun: yes'":

    doc.css('span[style = "mso-spacerun: yes"]').each do |span|
      span['style'] = 'poetry-line-odd'
    end

Here is the same query using XPath:

    doc.xpath('//span[@style = "mso-spacerun: yes"]').each do |span|
      span['style'] = 'poetry-line-odd'
    end

Hope that helps!

--
Aaron Patterson
http://tenderlovemaking.com/

Mike Dalessio

Sep 15, 2009, 9:24:38 AM

to nokogiri-talk

You may want to try Loofah, an HTML sanitizer based on Nokogiri. It has a filter, 'whitewash', which does precisely what you are asking for.

http://github.com/flavorjones/loofah

If Loofah doesn't do exactly what you want, please let me know; I'm happy to improve it!

2009/9/14 webdev_aw_ucsb <david...@gmail.com>

--
mike dalessio
mi...@csa.net
