1. Change the docx extension to zip. Docx, like all the new MS Office
2007 document formats, is a set of XML documents packed into one zip
file.
2. Repair the file with a zip repair program. Document corruption
often seems to be caused by corruption of the underlying zip file. One
of the best repair programs I have found is the one included in the
zip suite Ccy's HaHaZip:
http://www.ccyjchk.com/catalog/hahazip.php
3. Look in the "word" folder of the zip file and extract the file
"document.xml", or extract the whole archive. The text of your
original document will probably be found only in the document.xml
file.
4. Extract the text from the document.xml file with the online tool at
http://taporware.mcmaster.ca/~taporware/xmlTools/xmlquery.shtml. Use
the HTML output. If you get an error message about malformed XML, you
have more work to do.
5. The best setup I have found for fixing malformed XML is a
combination of Microsoft Expressions or FrontPage, and XMLShell, found
at:
http://www.softgauge.com/xmlshell/index.htm. Opening the
document.xml file in Microsoft Expressions I or II will highlight
malformed XML in yellow. What will probably happen is that either a
section toward the end of what the zip repair recovered will be
malformed, or a whole middle section of the XML file will be.
Excise the malformed part until you no longer see the yellow
highlighting that indicates illegal formatting. After this, open the
file in XMLShell. It will immediately tell you the first XML error it
encounters. If you are lucky, it will indicate only which XML
elements are missing to properly close up the file. If you then type
the characters "</", it will start completing the closing tags that
are missing, such as </w:rPr>, </w:r>, </w:p>, </w:body> and
</w:document>. If you are unlucky, you may be stuck with a cryptic XML
error. If you can't figure out how to fix it, consider cutting out the
entire section from the error to the end of the file and turning just
that section into a new XML file in FrontPage or MS Expressions to
experiment with.
6. Hopefully you now have a file in XMLShell that won't give
errors. You can check this for sure by choosing "Check well formed"
on the Tools menu of XMLShell. Once you have a well-formed XML
document, try step 4 again. If that doesn't work, e-mail me the file
at
soc...@s2services.com. I charge $22 for the text extraction.
Another possibility is to open the well-formed XML in Excel, copy
the text column (usually the 10th or so column) into Word using Paste
Special as unformatted text, and then remove the line breaks.
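
If you would rather script the extraction (steps 3 and 4) and the
well-formedness check (step 6), the following is a minimal Python
sketch using only the standard library. It is not what the tools
mentioned in the steps do internally, and it does not repair a damaged
zip or fix malformed XML; it only reads word/document.xml, reports the
first XML error if there is one, and pulls out the text. The function
names (read_document_xml, first_xml_error, extract_text) are my own,
made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx (zip) file.

    No renaming to .zip is needed here; zipfile ignores the extension.
    A badly damaged archive will raise zipfile.BadZipFile, in which
    case it still has to be repaired first (step 2).
    """
    with zipfile.ZipFile(docx_path) as zf:
        return zf.read("word/document.xml")


def first_xml_error(xml_bytes):
    """Return None if the XML is well formed, otherwise a tuple
    (line, column, message) describing the first parse error."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        line, column = exc.position
        return line, column, str(exc)
    return None


def extract_text(xml_bytes):
    """Return the plain text of a well-formed document.xml,
    one line per <w:p> paragraph, with all formatting dropped."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Text lives in <w:t> elements inside the paragraph's runs.
        pieces = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Typical use would be data = read_document_xml("broken.docx") (the file
name is just an example), then first_xml_error(data) to see where the
XML breaks, and extract_text(data) once it is well formed. Paragraphs
nested inside other paragraphs (text boxes, for instance) may show up
twice with this simple approach.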