Secrets of Recovering Corrupt Office Open DOCX, XLSX and PPTX Files

socrtwo

Mar 23, 2012, 11:13:58 AM
to datarecove...@googlegroups.com
Docx, xlsx and pptx files, the so-called "Office Open XML" format of Microsoft Office 2007 and 2010, are in reality zipped collections of mostly XML files. Normally each file inside a zip archive is compressed separately, so if the archive as a whole is partly corrupt, the undamaged files within it can still be recovered. Even within a single compressed file inside the archive, corruption of one part doesn't prevent the healthy part from being recovered by some zip repair software.
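The per-file independence described above can be illustrated with a minimal Python sketch. It assumes the archive's central directory is still readable (which is what a zip repair pass is meant to restore) and uses only the standard zipfile module. The function name salvage_members is hypothetical and is not part of any tool mentioned in this post.

```python
import io
import zipfile
import zlib


def salvage_members(zip_bytes):
    """Try to read every member of a zip archive independently.

    Returns (recovered, damaged): a dict of {name: bytes} for members that
    decompressed and passed their CRC check, and a list of names that failed.
    Assumption: the central directory is intact enough for zipfile to open.
    """
    recovered = {}
    damaged = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            try:
                # Each member has its own header, data and CRC, so one
                # bad member does not stop the others from being read.
                recovered[info.filename] = zf.read(info)
            except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
                damaged.append(info.filename)
    return recovered, damaged
```

Calling it with the bytes of a docx file, for example salvage_members(open("gwn_repaired.docx", "rb").read()), would show which XML parts are still intact. Note that this only reports all-or-nothing per member; it does not attempt the partial recovery inside a damaged member that some repair tools offer.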

The most effective scheme I have found for recovering corrupt docx, xlsx and pptx files is as follows:
  1. First repair the zip structure of the file with InfoZip's zip -FF command followed by the zip -F command. For example, a command line might be zip -FF gwn.docx --out gwn_repaired.docx. You can download InfoZip's zip here: http://sourceforge.net/projects/infozip/files/
  2. Try to open your docx, xlsx or pptx file in Word, Excel or PowerPoint respectively. Choose the second step of salvaging text, formulas or data if and when it is offered to you. If the program ultimately tells you the file is corrupt and can't be opened, click on the "more details" button and carefully note which XML file, and what line and column, the program is choking on. If no XML file is referenced in this "more details" window, then unfortunately you are unlikely to recover any text, formulas or data.
  3. Extract your docx, xlsx or pptx file with 7zip's 7z.exe command line program using the x command. A possible hangup here is that the name of the folder you wish to extract to must follow the "-o" parameter with no space in between. So a command might be 7z.exe x gwn.docx -ogwnoutput. 7zip can be downloaded here: http://www.7-zip.org/download.html. Note that I found it more effective than InfoZip's Unzip for recovering XML files from partially corrupt Office Open files.
  4. In your extraction folder, find the XML file referenced in the "more details" error in step 2. Open it with the freeware Notepad++ ( http://notepad-plus-plus.org/ ) or some other XML editor that shows line and column numbers. Locate the error and remove the rest of the XML from the error to the end of the file (unless you know how to correct the XML error, but where there is one error there are usually others). Also remove any partial tag just upstream of where the error is indicated, so that the file now ends only in text, formulas or data. In a later iteration of this step, you may need to remove a complicated XML tag if it now ends the file.
  5. Now use the xmllint --recover command on the truncated XML file. An example command might be xmllint --recover word/document.xml -o word/document.xml. You can install xmllint and its DLL support files by installing the free Strawberry Perl: http://strawberryperl.com/. Xmllint usually gets installed into the C:\strawberry\c\bin directory, from where you can copy it along with the support files libiconv-2_.dll, libxml2-2_.dll and libz_.dll. However, I wouldn't bother most of the time, because I think the C:\strawberry\c\bin folder gets added to the path variable in the Control Panel's System app during Strawberry Perl installation, after which you can use the xmllint command from any folder. There are other ways to install xmllint, which you can find by Googling.
  6. What xmllint --recover does is add the correct ending tags to the truncated XML file you made in step 4. So now we can rezip our XML files. Rezip all the files within the folder you made with the 7z.exe extraction in step 3, but don't zip the folder itself, just its contents. Change the extension of the rezipped file to docx, xlsx or pptx and try to open it in the appropriate MS Office program. If you still get an error, repeat steps 3-6. One alternative, if you keep getting an XML error referring to a file that is not integral to the content (say styles.xml rather than document.xml in a Word docx file, or styles.xml rather than a worksheet#.xml file in Excel), is to just delete that file and rezip. Sometimes Word, and presumably the other programs (with which I have less experience), won't get hung up then and will rebuild the missing file, say styles.xml, automatically. In my experience, Word will usually not care about styles.xml corruption as long as there is a well-constructed document.xml file with proper ending tags as produced by xmllint, although one file I recovered did choke on styles.xml until I removed it.
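The manual edits in steps 4 and 6 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method used by any tool named above: it assumes the line and column reported by Office are 1-based and count characters, and the function names truncate_at_error and zip_folder_contents are hypothetical. The xmllint --recover pass of step 5 is not reproduced here and would still be run between the two.

```python
import os
import zipfile


def truncate_at_error(xml_text, line, column):
    """Cut xml_text at a 1-based (line, column) error position (assumed
    convention), then drop any partial tag left dangling at the end."""
    lines = xml_text.split("\n")
    # Characters before the error line (+1 per line for the newline),
    # plus the characters before the error column on that line.
    offset = sum(len(l) + 1 for l in lines[:line - 1]) + (column - 1)
    kept = xml_text[:offset]
    # If a "<" was opened after the last ">", the text ends inside a tag:
    # remove that unfinished tag so the file ends in plain content.
    last_open = kept.rfind("<")
    if last_open > kept.rfind(">"):
        kept = kept[:last_open]
    return kept


def zip_folder_contents(folder, out_path):
    """Zip the contents of folder (not the folder itself) into out_path.

    out_path should live outside folder so the archive does not include
    itself. Entry names use forward slashes relative to folder.
    """
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, folder).replace(os.sep, "/")
                zf.write(full, arcname)
```

With the folder name from step 3, a call might look like zip_folder_contents("gwnoutput", "gwn_rebuilt.docx"), where gwn_rebuilt.docx is a made-up output name. Using os.walk means files whose names begin with a dot, such as _rels/.rels, are included in the archive.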
So that is it. I'm working to make this process automatic in my freeware. An early version is available here: https://sourceforge.net/projects/quickwordrecovr/. So far it doesn't work in Vista for some reason...