Unoconv document conversion tool in Archivematica


Robert Gillesse

Jul 24, 2017, 7:00:10 AM
to archivematica
Hello everybody,

I would like to know if any of you is currently using, or has used, the Unoconv document conversion tool (https://github.com/dagwieers/unoconv) for normalisation purposes within Archivematica. As the tool now seems more stable than it was a few years ago (https://wiki.archivematica.org/Normalizing_Office_Documents), maybe it's in a state where it could work within Archivematica.

It would be great if anybody could share some of their experiences with using Unoconv! 

Thanks,
Robert Gillesse

Digital Archivist

 

international institute of social history

Alex Garnett

Jul 24, 2017, 3:59:52 PM
to archivematica
Hi Robert,

I think, but am not certain, that upstream Archivematica currently ships with an unoconv microservice to produce "access" copies of PDFs from Word docs; if it's not actually there, we can share one from our environment.

It works OK -- the issue has never necessarily been that the tool itself isn't stable (it's mostly just a Python wrapper on LibreOffice's "headless" converter functionality, which is notoriously difficult to use on its own due to legacy Java stuff that isn't well-understood), but that LibreOffice's conversion fidelity isn't necessarily up to snuff for preservation use. The layout behaviour of Word and PowerPoint is notoriously difficult to replicate despite what is basically a decades-long reverse engineering effort and the file formats themselves being open sourced at one point, so it's still generally best used when you *have* to recover content from an old document and don't care about getting it 100%, or conversely when you really want Office content in a web-embeddable format like PDF for display purposes.

Lachlan Glanville

Jul 24, 2017, 7:32:20 PM
to archivematica
I've been using unoconv to manually normalise Word docs where we have accessibility issues. If there is an implementation of Archivematica that incorporates it as a microservice I'd love to see it.
I'd echo Alex's sentiments re fidelity. Most documents we've converted have been quite true to the original, but we do notice changes in spacing and formatting, especially in more complex documents with footnotes etc. I'd only use it for dissemination files where the original is no longer suitable.

Alex Garnett

Jul 25, 2017, 4:08:57 PM
to archivematica

Tim Hutchinson

Aug 1, 2017, 12:08:46 PM
to archivematica
Note that there is also a short thread in the archivematica-tech group documenting an equivalent normalization rule using the headless libreoffice. https://groups.google.com/d/topic/archivematica-tech/onaG67k3ADY/discussion

This wasn't immediately obvious to me, so for anyone else trying this I'll point out that the script linked above needs to be added as a normalization command in the FPR (with the relevant packages installed). The normalization itself is working for me, although I'm currently getting a verification error; I don't know if something needs to be adjusted in the script or if something is incorrect in the way I've configured the rule.
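For anyone who wants to see roughly what such a command does, here is a minimal Python sketch of a headless LibreOffice conversion followed by a simple output check. It is not the script from the linked thread: the function names are made up for illustration, the binary may be called `soffice` or `libreoffice` depending on the install, and the "output file exists and is non-empty" check is only my assumption about what a verification step might look for.

```python
import os
import subprocess


def build_soffice_command(src, outdir, fmt="pdf"):
    """Argument list for a headless LibreOffice conversion.

    Assumes the binary is on PATH as `soffice`; some installs use `libreoffice`.
    """
    return ["soffice", "--headless", "--convert-to", fmt, "--outdir", outdir, src]


def expected_output_path(src, outdir, fmt="pdf"):
    """LibreOffice writes <input basename>.<fmt> into the output directory."""
    stem = os.path.splitext(os.path.basename(src))[0]
    return os.path.join(outdir, stem + "." + fmt)


def convert_and_check(src, outdir, fmt="pdf", timeout=120, runner=subprocess.run):
    """Convert src and return the output path, or None if the check fails.

    The check (exit code 0, file exists, file is non-empty) is a hypothetical
    stand-in for a verification command, not Archivematica's own.
    """
    result = runner(build_soffice_command(src, outdir, fmt), timeout=timeout)
    out = expected_output_path(src, outdir, fmt)
    ok = result.returncode == 0 and os.path.isfile(out) and os.path.getsize(out) > 0
    return out if ok else None
```

If a verification step looks for the output under a different name or location than the one LibreOffice actually writes, the conversion can succeed while verification fails, so that is one thing worth comparing.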

After some testing, I'm coming around to the view that having this as a default normalization rule for office documents would be worthwhile - you can always do manual normalization instead. In the spirit of "good enough" preservation ... the current alternative is to do nothing (and of course you still have the original).

I think the spacing issue is largely due to fonts (even if the Windows fonts are loaded), although it's not bad. Footnotes seem to be implemented with the LibreOffice default - roman numerals for endnotes; and the page number location also changes. The main problems I've come across are with older WordPerfect and Word formats:
- Word for DOS 5.x - via Archivematica (i.e. on linux), the headers and footers are garbled. I.e. more than an encoding problem, it's bringing in the document signature, stylesheet reference etc. On Windows these files are handled quite well, including with the command line. I'm assuming there's an input filter that could be used but so far I haven't had any luck.
- WordPerfect 4.2 - different encoding was used for (e.g.) line spacing, hyphens, centering. The hyphen is perhaps most problematic for readability... In this case the Windows version of LibreOffice has the same limitation. The WordPerfect 4.2 spec is documented, so in theory this could be improved through the LibreOffice libraries.

Tim

Robert Gillesse

Aug 3, 2017, 4:18:13 AM
to archivematica
Thanks for all your answers - most helpful. For our specific situation I think we could live with a certain amount of layout loss, as long as the textual content isn't changed. But the examples Nick mentions are worrisome, as they seem to change more than just the layout (and where does "layout" end and "textual content" begin?).

Any experiences with the performance of Unoconv when it comes to converting large-scale archives?

Robert


On Tuesday, 1 August 2017 18:08:46 UTC+2, Tim Hutchinson wrote:

Alex Garnett

Aug 3, 2017, 9:24:18 AM
to archivematica
Performance per se isn't a problem, but because it technically spins up its own listener every time it runs, there can be some "job failed" issues around request timing -- you'll notice that the unmerged code I linked makes several requests to unoconv before giving up to work around this.
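To illustrate the retry idea in general terms, here is a minimal Python sketch. It is not the unmerged code mentioned above; the function names, attempt count, delay and timeout are all illustrative assumptions. Only the `-f` and `-o` options of unoconv are used.

```python
import subprocess
import time


def build_unoconv_command(src, dst, fmt="pdf"):
    """Argument list for a single unoconv conversion."""
    return ["unoconv", "-f", fmt, "-o", dst, src]


def convert_with_retries(cmd, attempts=3, timeout=60, delay=1.0, runner=subprocess.run):
    """Run cmd up to `attempts` times and return the 1-based attempt that succeeded.

    A non-zero exit code or a timeout counts as a failed attempt.
    Raises RuntimeError once every attempt has failed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = runner(cmd, timeout=timeout)
            if result.returncode == 0:
                return attempt
            last_error = "exit code {}".format(result.returncode)
        except subprocess.TimeoutExpired:
            last_error = "timed out after {}s".format(timeout)
        if attempt < attempts:
            # Give the listener a moment to settle before trying again.
            time.sleep(delay)
    raise RuntimeError(
        "conversion failed after {} attempts: {}".format(attempts, last_error)
    )
```

Usage would be something like `convert_with_retries(build_unoconv_command("in.doc", "out.pdf"))`. unoconv also has a `-T`/`--timeout` option for how long it waits to connect to a listener it started itself, which may be worth raising before adding retries.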

Lachlan Glanville

Aug 9, 2017, 9:49:51 PM
to archivematica
Tried using Alex's script and got some indefinite hangs which meant I had to restart the MCP server. The job usually converted 4-5 documents before stopping entirely, not sure why it doesn't fail when it can't connect to a listener. I might try reducing max tries to see if that helps. There's probably a way of keeping a listener open and ready, but will require a bit of digging. 
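One possible shape for the "keep a listener open" idea, as an untested Python sketch rather than a known fix: start `unoconv --listener` once, then run each conversion with `-n` (`--no-launch`) so the client fails instead of spawning its own temporary listener, and put a hard timeout on every call so a stuck job cannot block the batch. The function names, the startup wait and the timeout values are assumptions.

```python
import subprocess
import time


def listener_command():
    """Start unoconv as a long-running listener."""
    return ["unoconv", "--listener"]


def client_command(src, dst, fmt="pdf"):
    """Convert via an existing listener; -n aborts if none is running."""
    return ["unoconv", "-n", "-f", fmt, "-o", dst, src]


def convert_batch(pairs, timeout=120, startup_wait=5.0,
                  popen=subprocess.Popen, run=subprocess.run):
    """Convert (src, dst) pairs through one shared listener.

    Returns the list of sources that failed or timed out.
    """
    listener = popen(listener_command())
    failed = []
    try:
        # Crude wait for the listener to come up; the right value is a guess.
        time.sleep(startup_wait)
        for src, dst in pairs:
            try:
                ok = run(client_command(src, dst), timeout=timeout).returncode == 0
            except subprocess.TimeoutExpired:
                ok = False
            if not ok:
                failed.append(src)
    finally:
        # Always shut the listener down, even if a conversion raised.
        listener.terminate()
        listener.wait()
    return failed
```

Whether this avoids the hangs inside the MCP server is something I can't confirm; the timeout at least turns an indefinite hang into a recorded failure.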
Santiago's command directly to libreoffice doesn't have this issue, but it has created some odd formatting with occasional page breaks in the middle of sentences. 

On Monday, 24 July 2017 21:00:10 UTC+10, Robert Gillesse wrote:
 