Extract Text from Corrupt DOCX, XLSX, and PPTX Office 2007 Files


socrtwo

Feb 28, 2009, 12:16:04 AM
to 102 Business Ideas
Office 2007 files are zip archives containing XML plus images or other
multimedia. The text is in word/document.xml for Word 2007, in
xl/sharedStrings.xml for Excel 2007, and in one XML file per slide for
PowerPoint 2007, such as ppt/slides/slide1.xml, slide2.xml, or
slide3.xml.
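
As a minimal sketch of where that text lives, the following Python reads the text-bearing parts out of a package that still opens as a normal zip. The part paths are the conventional defaults that Office writes; the names `TEXT_PARTS` and `read_text_parts` are made up for this illustration.

```python
import re
import zipfile

# Conventional locations of the text-bearing parts (assumed defaults;
# a package could in principle point to other part names).
TEXT_PARTS = {
    "docx": re.compile(r"^word/document\.xml$"),
    "xlsx": re.compile(r"^xl/sharedStrings\.xml$"),
    "pptx": re.compile(r"^ppt/slides/slide\d+\.xml$"),
}


def read_text_parts(package, kind):
    """Return {part name: raw XML bytes} for an intact docx/xlsx/pptx.

    `package` is a path or a seekable file object, `kind` is one of the
    keys of TEXT_PARTS. This only works while the zip is still readable.
    """
    pattern = TEXT_PARTS[kind]
    with zipfile.ZipFile(package) as zf:
        return {
            name: zf.read(name)
            for name in sorted(zf.namelist())
            if pattern.match(name)
        }
```

For example, `read_text_parts("report.docx", "docx")` would return a one-entry dict keyed by `word/document.xml`, where `report.docx` is a placeholder file name.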

To extract the strings from these files even if they are corrupt, you
can first repair the zip structure of the docx, xlsx, or pptx file, then
extract the XML files mentioned in the first paragraph, and then
feed those XML files through HTML Tidy. HTML Tidy will make readable Web
pages of the text even if the XML is corrupt and no longer well
formed.
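
One way the zip step could be approached, when the archive's central directory is damaged and ordinary unzip tools refuse the file, is to scan the raw bytes for local file headers and inflate each entry directly. This is only a sketch under that assumption, not how any particular repair tool works; `salvage_members` and `inflate_lenient` are hypothetical names.

```python
import struct
import zlib

# 30-byte zip local file header: signature, version, flags, method,
# mod time, mod date, CRC-32, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")
SIGNATURE = b"PK\x03\x04"


def inflate_lenient(raw):
    """Inflate a raw deflate stream, keeping whatever decodes before an error."""
    inflater = zlib.decompressobj(-15)  # -15 selects raw deflate, no zlib header
    out = []
    for i in range(0, len(raw), 4096):
        try:
            out.append(inflater.decompress(raw[i:i + 4096]))
        except zlib.error:
            break  # damaged stream: keep the output recovered so far
        if inflater.eof:
            break  # clean end of this member's data
    return b"".join(out)


def salvage_members(data):
    """Yield (name, content) for each local file entry found in zip bytes."""
    pos = data.find(SIGNATURE)
    while pos != -1 and pos + LOCAL_HEADER.size <= len(data):
        (_, _, _, method, _, _, _, csize, _, nlen, xlen) = LOCAL_HEADER.unpack_from(data, pos)
        name_start = pos + LOCAL_HEADER.size
        name = data[name_start:name_start + nlen].decode("utf-8", "replace")
        start = name_start + nlen + xlen
        if method == 8:  # deflated
            body = inflate_lenient(memoryview(data)[start:])
        elif method == 0:  # stored
            body = bytes(data[start:start + csize])
        else:  # other compression methods are not handled in this sketch
            body = b""
        yield name, body
        # A signature can also occur by chance inside compressed data,
        # so callers should look entries up by name and ignore the rest.
        pos = data.find(SIGNATURE, pos + 4)
```

Typical use would be `dict(salvage_members(open(path, "rb").read()))` followed by a lookup of `word/document.xml` or the other text parts by name.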

One could easily make a web service that does those three steps: 1.
Repair the zip. 2. Extract the relevant text-containing XML file. 3. Run
HTML Tidy to extract the text even if the file is no longer well-formed
XML.
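
For the last step, here is a rough plain-Python stand-in for what HTML Tidy is used for here. It is not Tidy itself: it simply pulls the text runs out with a regular expression, which keeps working when the XML is cut off or no longer well formed. It assumes the text sits in `<w:t>` (Word), `<t>` (Excel shared strings) or `<a:t>` (PowerPoint) elements; `extract_runs` is a name made up for this sketch.

```python
import html
import re

# An opening text-run tag, with or without a namespace prefix and
# attributes, followed by everything up to the next "<". The closing
# tag does not need to have survived.
TEXT_RUN = re.compile(r"<(?:\w+:)?t(?:\s[^>]*)?>([^<]*)")


def extract_runs(xml_bytes):
    """Return the non-empty text runs found in possibly broken Office XML."""
    text = xml_bytes.decode("utf-8", errors="replace")
    return [html.unescape(run) for run in TEXT_RUN.findall(text) if run]
```

Joining the result with `"".join(...)` gives the running text; paragraph breaks are lost in this sketch because it ignores the surrounding paragraph elements.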

socrtwo

Aug 18, 2014, 7:01:19 PM
to 102-busin...@googlegroups.com
Here are my implementations of these kinds of software. I developed them, got help developing them, or sponsored their development:
  • Previous Version File Recoverer - recover previous versions of files if the System Restore service is turned on in your Vista, Windows 7 or Windows 8 machine.
  • Corrupt DOCX Salvager - recover text from corrupt Word DOCX files. If this doesn't work or is unsatisfactory, many free alternatives are available on an alternatives menu.
  • S2 Recovery Tools for Microsoft Word - invoke almost all of Microsoft's recommended methods for recovering from Microsoft Word corruption. The program adds many of its own methods.
  • Savvy DOCX Recovery - one-button access to current algorithms for recovering from DOCX corruption. The algorithm for recovering from "Unspecified errors" will be updated soon.
  • Corrupt XLSX Salvager - simple data recovery from corrupt XLSX files.
  • S2 Recovery Tools for Microsoft Excel - invoke almost all of Microsoft's recommended methods for recovering from Microsoft Excel corruption. The program adds several of its own methods.
  • Excel Recovery Add-In - an add-in for Excel with many of the methods of S2 Recovery Tools for Excel.
  • Corrupt PPTX Salvager - recover at least the text from corrupt PPTX files.
  • S2 Recovery Tools for MS PowerPoint - invoke almost all of Microsoft's recommended methods for recovering from Microsoft PowerPoint corruption. The program adds several of its own methods.
  • Corrupt Open Office Recovery - recover the text and maybe the formatting from corrupt Open Office files.
  • Corrupt Office Salvager - recover the text from corrupt DOCX, XLSX, PPTX, ODT, ODS and ODP files. Possibly recover the formatting from ODT, ODS and ODP files.
  • Corrupt Extractor for Microsoft Office - recover the text and media such as pictures, videos (maybe) and sound files from corrupt DOCX, XLSX and PPTX files. It also has a basic XML editor for fixing bad XML files and a zip repair facility for the zip structure of DOCX, XLSX and PPTX files. Note that most of the programs above will automatically attempt to fix the zip structure of the corrupt file before trying to recover it.