Note that there is also a short thread in the archivematica-tech group documenting an equivalent normalization rule that uses LibreOffice in headless mode.
https://groups.google.com/d/topic/archivematica-tech/onaG67k3ADY/discussion
This wasn't immediately obvious to me, so for anyone else trying this I'll point out that the script linked above needs to be added as a normalization command in the FPR (with the relevant packages installed). The normalization itself is working for me, although I'm currently getting a verification error; I don't know whether something needs to be adjusted in the script or whether I've configured the rule incorrectly.
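For anyone who wants a rough picture of what such a command does, here is a minimal Python sketch of a headless LibreOffice conversion. It is not the script from the thread: the function names, the `pdf` default target and the output check are my own assumptions for illustration. The only parts taken from LibreOffice itself are the `soffice` command-line options `--headless`, `--convert-to` and `--outdir`. The final check (output exists and is non-empty) is just the kind of thing I would look at first when chasing a verification error, not a statement of what Archivematica's verification actually runs.

```python
import subprocess
from pathlib import Path


def build_convert_command(input_path, output_dir, target_format="pdf", soffice="soffice"):
    """Build the argument list for a headless LibreOffice conversion.

    Assumption: `soffice` is on the PATH and `target_format` is a plain
    extension such as "pdf" or "odt" (no explicit export filter).
    """
    return [
        soffice,
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(output_dir),
        str(input_path),
    ]


def expected_output_path(input_path, output_dir, target_format="pdf"):
    """Path where the converted file should appear.

    LibreOffice keeps the input base name and swaps the extension.
    """
    return Path(output_dir) / (Path(input_path).stem + "." + target_format)


def normalize(input_path, output_dir, target_format="pdf"):
    """Run the conversion and do a basic sanity check on the result."""
    cmd = build_convert_command(input_path, output_dir, target_format)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit

    out = expected_output_path(input_path, output_dir, target_format)
    # Hypothetical check: fail loudly if nothing (or an empty file) was written.
    if not out.is_file() or out.stat().st_size == 0:
        raise RuntimeError("conversion produced no usable output: %s" % out)
    return out
```

Calling `normalize("/path/to/report.doc", "/path/to/outdir")` should leave `report.pdf` in the output directory if LibreOffice is installed. In an actual FPR command the input and output locations would come from whatever the rule passes in, so treat the paths here as placeholders.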
After some testing, I'm coming around to the view that having this as a default normalization rule for office documents would be worthwhile - you can always do manual normalization instead. In the spirit of "good enough" preservation ... the current alternative is to do nothing (and of course you still have the original).
I think the spacing issue is largely due to fonts (even with the Windows fonts loaded), although the result is not bad. Footnotes seem to be rendered using the LibreOffice default (roman numerals for endnotes), and the page number location also changes. The main problems I've come across are with older WordPerfect and Word formats:
- Word for DOS 5.x - via Archivematica (i.e. on Linux), the headers and footers are garbled. This is more than an encoding problem: the conversion is bringing in the document signature, stylesheet reference etc. On Windows these files are handled quite well, including from the command line. I'm assuming there's an input filter that could be used, but so far I haven't had any luck.
- WordPerfect 4.2 - a different encoding was used for (e.g.) line spacing, hyphens and centering. The hyphen is perhaps the most problematic for readability. In this case the Windows version of LibreOffice has the same limitation. The WordPerfect 4.2 spec is documented, so in theory this could be improved through the LibreOffice libraries.
Tim