socrtwo
Feb 28, 2009, 12:16:04 AM
to 102 Business Ideas
Office 2007 files are zip archives containing XML plus images or other
multimedia. The text is in the document.xml file for Word 2007, in the
sharedStrings.xml file for Excel 2007, and in each slide's own XML file
for PowerPoint 2007, such as slide1.xml, slide2.xml, or slide3.xml.
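As a rough illustration of where that text lives, here is a small Python sketch that picks the text-bearing members out of an archive listing. It assumes the usual package layout (word/document.xml, xl/sharedStrings.xml, ppt/slides/slideN.xml); the function names text_parts and list_text_parts are made up for this example.

```python
import re
import zipfile

# Archive members that normally hold the visible text (assumed usual layout).
TEXT_PART = re.compile(
    r"word/document\.xml|xl/sharedStrings\.xml|ppt/slides/slide\d+\.xml"
)

def text_parts(names):
    """Return only the member names that hold the document text."""
    return [name for name in names if TEXT_PART.fullmatch(name)]

def list_text_parts(path):
    """Open an intact docx/xlsx/pptx file and list its text-bearing members."""
    with zipfile.ZipFile(path) as archive:
        return text_parts(archive.namelist())
```

list_text_parts only works while the zip is still readable; a damaged archive needs the repair step first.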
To extract the strings from these files even if they are corrupt, you
can first repair the zip structure of the docx, xlsx, or pptx file, then
extract the files mentioned in the first paragraph, and then feed those
XML files through Tidy HTML. Tidy HTML will make nice Web pages of the
text even if the XML is corrupt and no longer well formed.
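One possible way to get at the members of a damaged zip, sketched below in Python, is to skip the central directory entirely and scan the raw bytes for local file headers, inflating whatever follows each one. This is only an assumption about how a repair step could work, not a description of any particular repair tool, and salvage_members is a name invented for this example. It gives up on a member if its deflate stream is invalid, and it can be fooled if the header signature happens to occur inside damaged data.

```python
import struct
import zlib

SIGNATURE = b"PK\x03\x04"
LOCAL_HEADER = struct.Struct("<4s5H3I2H")  # 30-byte zip local file header

def salvage_members(data):
    """Yield (name, content) for each local file entry found in raw zip bytes."""
    pos = data.find(SIGNATURE)
    while pos != -1 and pos + LOCAL_HEADER.size <= len(data):
        (_sig, _version, _flags, method, _time, _date, _crc,
         comp_size, _size, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, pos)
        name_start = pos + LOCAL_HEADER.size
        body_start = name_start + name_len + extra_len
        name = data[name_start:name_start + name_len].decode("utf-8", "replace")
        content = b""
        next_from = body_start
        if method == 8:  # deflate: inflate until the stream ends or the data runs out
            inflater = zlib.decompressobj(-15)  # -15 selects raw deflate, no zlib header
            try:
                content = inflater.decompress(data[body_start:])
                if inflater.eof:
                    # Resume the scan right after the bytes this stream consumed.
                    next_from = len(data) - len(inflater.unused_data)
            except zlib.error:
                content = b""  # invalid stream: skip this member
        elif method == 0:  # stored: copy the bytes as they are
            content = data[body_start:body_start + comp_size]
            next_from = body_start + len(content)
        yield name, content
        pos = data.find(SIGNATURE, next_from)
```

Calling dict(salvage_members(open(path, "rb").read())) would give a name-to-bytes mapping from which document.xml, sharedStrings.xml, or the slide files can be picked.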
One could easily make a web service that does those three steps: 1.
Repair the zip. 2. Extract the relevant text-containing XML file. 3.
Run Tidy HTML to extract the text even if the file is no longer well
formed XML.
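For the third step, a quick way to try the idea without installing Tidy HTML is Python's built-in html.parser, which also tolerates markup that is not well formed. To be clear, this is a stand-in for Tidy HTML, not Tidy itself, and the names TextCollector and lenient_text are invented for this sketch. It returns plain text rather than a Web page.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the text between tags, ignoring whether tags are balanced."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Keep only runs that contain something other than whitespace.
        if data.strip():
            self.chunks.append(data.strip())

def lenient_text(markup):
    """Return the text content of possibly malformed XML as one string."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)
```

A web service would only need to accept an uploaded file, run the repair and extraction steps, and pass the resulting XML through something like lenient_text.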