Displaying Word converted to HTML - looks terrible

Chris W

Apr 13, 2009, 6:44:29 PM
to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini
I know that MS Word generates code that is not true, clean HTML (it makes it
look terrible). However, I still need to display the HTML version of a
Word document.

I have several hundred documents that I need to 'clean up' so they can
be seen correctly when searched for in Google Mini. The problem is the
tools I've tried remove more code than I want, causing me to lose some
of the formatting I want to keep. I've tried Notepad++, Word Cleaner
from Zapadoo, and the Save As tool in Word that filters the document.
None did exactly what I need.

This will be an ongoing problem for me since the documents are edited
regularly.

Has anyone had success with displaying Word files saved as HTML? (all
our editors use Word, so we need to keep using it).

brianb

Apr 15, 2009, 12:35:40 AM
Hi Chris,

I am not sure I understand exactly what you are trying to do. But the
Mini will crawl documents in Word format and before it indexes them,
it will convert them to html. You can see this if you check out the
cache page. The converted html of course is not an exact replica of
the Word file, but it is generally not that bad. From your description
above, though, it sounds like you are using outside tools to convert to
html and then crawling the results. Can you give us a little background
on what you are trying to do?

Brian

Chris W

Apr 15, 2009, 11:09:53 AM
Our editors in each department create their procedures in MS Word.
There is specific formatting they follow (fonts, tables, headings,
bullet points, etc.). We want to display the document as HTML so it
loads faster for our users. With a previous tool this wasn't a
problem.

To make the Word document HTML we are using the Save As feature in MS
Word and saving the file as HTML. Opening the document directly with
Internet Explorer looks exactly correct, but when we search for the document
via Google Mini and view it there, it has extra spaces between the
bullet point and the text filled with question marks, the dashes
apostrophes and quotes aren't recognized so there is a box in their
place, there is extra line spacing between paragraphs, and so on.

A tool we are trying is Word Cleaner. It removes the Microsoft
proprietary coding, but it cleans out some of the formatting I want to
keep; and I don't want to have to go in and fix each document every
time they are updated.

My question is why does the document look fine when viewed in IE, but
when it is searched for via Google Mini and viewed in IE it has
all that junk showing and the formatting has changed? Is there
something we can do about it?
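
The kind of selective clean-up asked about here (drop the
Microsoft-specific markup, keep fonts, tables, and headings) could be
scripted so it can be rerun whenever a document is updated. The
following is only a minimal sketch in Python, not what Word Cleaner or
any other tool mentioned in this thread does. The function name and the
three rules are assumptions chosen for illustration, and real Word
output may need more rules than these.

```python
import re

# style="..." or style='...' attributes (Word emits both quote styles).
STYLE_ATTR = re.compile(r'style=(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL)

def _drop_mso(match):
    """Keep ordinary CSS declarations, drop the mso-* ones."""
    quote, body = match.group(1), match.group(2)
    kept = [d.strip() for d in body.split(';')
            if d.strip() and not d.strip().lower().startswith('mso-')]
    return 'style=' + quote + ';'.join(kept) + quote

def strip_word_markup(html: str) -> str:
    """Hypothetical helper: remove Word-only constructs, keep the rest."""
    # 1. Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r'<!--\[if[^\]]*\]>.*?<!\[endif\]-->', '', html,
                  flags=re.DOTALL)
    # 2. Office-namespaced tags such as <o:p> (any text inside is kept)
    html = re.sub(r'</?[ovw]:[^>]*>', '', html)
    # 3. mso-* declarations inside style attributes
    return STYLE_ATTR.sub(_drop_mso, html)
```

Because ordinary tags and non-mso CSS are left alone, headings, tables,
and font settings should survive. Run it on a copy of a few documents
first and compare the result in the browser.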

demodulated

Apr 15, 2009, 3:21:20 PM
"when we search for the document via Google Mini and view it there, it
has extra spaces"

Are you talking about viewing the cached version of the Word DOC on
the Google Mini? I'm kind of struggling to understand what your
question has to do with search.

As for converting text documents to HTML, I bet you'd have the most
luck just pasting the Word document into a WYSIWYG web authoring
tool. As previously mentioned, though, I don't think this would help
or hinder search in any way versus crawling your Word DOCs, unless
for some reason you are trying to optimize how the cached copy looks
on the Mini. Perhaps you could clarify your reasoning behind this
exercise?

miguev

Apr 16, 2009, 5:44:49 AM

Hi Chris,

When you say "dashes apostrophes and quotes aren't recognized so there
is a box in their place" I suspect an encoding mismatch: something in
the HTML document generated by MS Word is causing the Mini to
misunderstand the character encoding. Maybe you have an option to
specify a different encoding when you Save As HTML; try a few (e.g.
UTF-8, ISO-8859-1) and see if that helps. Otherwise you may have to
write a script to "correct" the HTML code before crawling it.
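
As a rough illustration of such a "correcting" script, here is a
minimal Python sketch that re-encodes a Word-generated HTML file as
UTF-8 and updates its meta charset declaration to match. It assumes
the file declares its encoding in a meta tag and falls back to
windows-1252 otherwise; that fallback, and the function name, are
assumptions for illustration and have not been verified against the
files discussed in this thread.

```python
import re

# Finds the charset value in a <meta ... charset=...> declaration.
META_CHARSET = re.compile(rb'(<meta[^>]*charset=["\']?)([\w-]+)',
                          re.IGNORECASE)

def reencode_to_utf8(raw: bytes, fallback: str = "windows-1252") -> bytes:
    """Decode with the declared charset, then re-emit the page as UTF-8."""
    match = META_CHARSET.search(raw)
    declared = match.group(2).decode("ascii") if match else fallback
    try:
        text = raw.decode(declared)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: use the assumed fallback instead.
        text = raw.decode(fallback, errors="replace")
    # Make the declaration agree with the bytes we are about to write.
    text = re.sub(r'(<meta[^>]*charset=["\']?)[\w-]+', r'\g<1>utf-8',
                  text, count=1, flags=re.IGNORECASE)
    return text.encode("utf-8")
```

To use it, read each file in binary mode (`open(path, "rb")`), pass the
bytes through `reencode_to_utf8`, and write the result back in binary
mode before the next crawl.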

Chris W

Apr 16, 2009, 12:25:22 PM
Sorry for the confusion.

I am not referring to the cached version, it's the current version.
The issue isn't with Search, it is with Google (I'm assuming) since
that is the only change I am introducing. Here is an example of what
I'm doing.

Scenario A: I double click on the HTML file on the server and it opens
in IE. The formatting looks correct.
Scenario B: I open IE and search using Google Mini for the exact same
file, the formatting looks off (has the problems I've listed above).

I'm using the same document and the same application to view the
document, the only thing different is that Google is involved with the
one that doesn't display the formatting correctly. Is that because
when Word converts the file to HTML via the Save As feature, it puts
in proprietary coding, and if so, why isn't that displayed properly?

~Chris

Thiru

Apr 16, 2009, 2:51:49 PM
Hi Chris,

How are you crawling the Word document into the Mini? If you are
performing a web crawl & serve then the Google appliance simply
redirects the user to the web server when the user clicks the link in
the search results.
However, if you are crawling the docs from an SMB file share then
Google Mini will be proxying the request to the SMB file server when
the user clicks the search result. You can override the proxying part
by editing the frontend xslt.

In the xslt code, look for the entry:
select="concat($protocol,'/',$temp_url)"/>

and change it to:
select="concat('file://///',$temp_url)"/>
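
To see what this change does to a result link, the two concat()
expressions can be mimicked in a few lines of Python. This is only an
illustration: the values given to `protocol` and `temp_url` below are
hypothetical (the real ones are set elsewhere in the front end
stylesheet), and the server and file names are made up.

```python
def concat(*parts: str) -> str:
    """Stand-in for the XPath concat() function used in the stylesheet."""
    return "".join(parts)

# Hypothetical values, assumed for illustration only.
protocol = "smb"
temp_url = "fileserver/procedures/safety.htm"

# Original expression: a link that still goes through the appliance.
proxied = concat(protocol, "/", temp_url)

# Edited expression: a file URL the browser opens from the share itself.
direct = concat("file://///", temp_url)

print(proxied)
print(direct)
```

Under these assumptions the first link is `smb/fileserver/procedures/safety.htm`
and the second is `file://///fileserver/procedures/safety.htm`, so with
the edited stylesheet the browser would fetch the file from the share
rather than through the Mini.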

Cheers,
Thiru