convert doc/ppt files to pdf

Santiago Rodríguez Collazo

Jun 20, 2017, 8:46:05 PM
to Archivematica Tech
Hi

After a request from a client, I have been working on an access normalization command that converts Microsoft Word and PowerPoint files to PDF using LibreOffice. The steps to create it are:

- First, sudo apt-get install libreoffice

- Then, in Preservation planning -> Format Policy Registry -> Tools, created a new one with the following parameters:
  Description: libreoffice
  Version: 4.2.8.2

- In Normalization -> Commands, create the command using:
  Tool: libreoffice
  Description: Convert office file to pdf
  Command: libreoffice --headless --invisible --convert-to pdf --outdir "%outputDirectory%" "%fileFullName%"
  Output location: %outputDirectory%%fileName%%postfix%.pdf
  Script type: command line
  Output File Format: Acrobat PDF
  Command usage: normalization
  Verification command: Standard verification command (non zero filesize)

- In Normalization -> Rules, created two rules, one for Word and one for PowerPoint:
  Purpose: Access
  Format: Word Processing: Microsoft Word (Generic)
  Command: "Convert office file to pdf"



Is a rule like this interesting enough to be added to the FPR?

/santi

--
Santiago Rodríguez
DevOps, Artefactual Systems Inc.

Timothy Walsh

Jun 21, 2017, 1:51:27 PM
to Archivematica Tech
Hi Santi,

IIRC, this was one of the default rules in the earlier days of Archivematica, no? I think it would be great to have normalization (for preservation and access) of word processing files -- not just doc, but also WordPerfect, WordStar, Pages, and so on. In testing though, I haven't found the results of batch libreoffice conversions to be very reliable, so I'm a bit wary of integrating it as a tool into the fpr at this point.

Tim

Robert Gillesse

Aug 3, 2017, 3:28:17 AM
to Archivematica Tech
Thanks Santiago. Sounds really interesting. I agree with Tim that it would be really interesting to see this working for a whole range of word processing files, and maybe also older spreadsheet files.

@Tim: where, in your opinion, does the unreliability of batch libreoffice conversion lie: in the authenticity of the representation or in the performance? As the layout is not top priority for us, we could probably live with the former. The latter would be more worrisome.

Robert

Lachlan Glanville

Aug 28, 2017, 10:21:01 PM
to Archivematica Tech
Hi Santiago, 

This rule does not add a UUID to the converted file. I attempted to append a rename command on the end: 
libreoffice --headless --invisible --convert-to pdf && "%inputFile%" mv "%inputFile%.pdf" %outputDirectory%%prefix%%fileName%%postfix%.pdf"

but had no luck getting this to work, though it works in the terminal.

Lachlan Glanville

Aug 28, 2017, 10:21:54 PM
to Archivematica Tech
Sorry, that should be:
libreoffice --headless --invisible --convert-to pdf "%inputFile%" && mv "%inputFile%.pdf" "%outputDirectory%%prefix%%fileName%%postfix%.pdf"

tim.hut...@usask.ca

Sep 5, 2017, 5:37:53 PM
to Archivematica Tech
Hi Lachlan,

Good catch. I got this to work:

libreoffice --headless --invisible --convert-to pdf --outdir "%outputDirectory%" "%fileFullName%"
mv "%outputDirectory%%fileName%%postfix%.pdf" "%outputDirectory%%prefix%%fileName%%postfix%.pdf"

But I had to change the script type to bash script instead of command line. I also changed output location to:
%outputDirectory%%prefix%%fileName%%postfix%.pdf

Tim

Timothy Walsh

Jan 17, 2018, 4:53:23 PM
to Archivematica Tech
Hi all,

Just reviving this thread a bit. After some more evaluation, we've decided to try giving libreoffice another shot as a preservation normalization tool to convert older Microsoft Office, ClarisWorks, and WordPerfect files into preservation formats (Open Office and Office Open XML, depending on the source format). I've tested against a few files using a variation of the bash script described above with success, but I'm wary of the many accounts online of libreoffice hanging, given the volume of files we'll be ingesting.

I like the approach described here, which is essentially running libreoffice headless as a subprocess from a Python script with a timeout value specified. I wrote a similar Python script and tried using it as a normalization command, but it seems that Archivematica is using Python 2.7 to run normalization commands specified as Python scripts and timeout was only added as an option to subprocess.call in Python 3.

Before I go off recreating timeout through other means, is there a way to tell Archivematica to use Python 3 for a normalization command?
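
One way to recreate the timeout without Python 3 is to start the process with subprocess.Popen and poll it until a deadline passes. The sketch below is only an illustration, not the script referred to above: the function name run_with_deadline, the 120-second default, and the polling interval are all assumptions. It uses only calls that exist in both Python 2.7 and Python 3.

```python
import subprocess
import time


def run_with_deadline(cmd, timeout_seconds=120, poll_interval=0.5):
    """Run cmd (a list of arguments) and kill it if it outlives the deadline.

    Returns the exit code, or None if the process had to be killed.
    Hypothetical helper: the name and the defaults are illustrative only.
    """
    process = subprocess.Popen(cmd)
    deadline = time.time() + timeout_seconds
    while process.poll() is None:      # None means the process is still running
        if time.time() >= deadline:
            process.kill()             # stop the hung converter
            process.wait()             # reap it so no zombie is left behind
            return None
        time.sleep(poll_interval)
    return process.returncode
```

A call might look like run_with_deadline(["libreoffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, path]), where out_dir and path are placeholders. This only terminates the process it started; whether that is enough to stop every LibreOffice helper process would need testing.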

Thanks!
Tim

Hutchinson, Tim

Jan 17, 2018, 5:41:51 PM
to Archivematica Tech

Hi Tim,

For what it’s worth, in my testing of an AIP for which about 1800 files were normalized, I didn’t run into any timeout issues running headless libreoffice, albeit on a local VM with no other processes running. And of course your volumes may be higher still. With unoconv I did get some random timeouts, as discussed in this thread.

In terms of Python 3, it looks like you could use script type = no shebang needed, but add the path to the python3 interpreter, i.e. the shebang (I must confess that’s a new term for me), as the first line. As you note, the Python script type has the python2 path hard-coded. From the FPR docs: “No shebang” allows you to write a script in any language as long as the shebang is included as the first line.

https://www.archivematica.org/en/docs/fpr/
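
A command saved with the "no shebang needed" script type could then start with a python3 shebang and use the timeout option directly. This is a minimal sketch under several assumptions: the interpreter path on the shebang line, the function name convert_with_timeout, the converter parameter, and the 120-second default are all hypothetical. How Archivematica hands file paths to such a script is not covered in this thread, so the function simply takes them as arguments.

```python
#!/usr/bin/env python3
import subprocess


def convert_with_timeout(input_path, output_dir, target="pdf",
                         timeout_seconds=120, converter=("libreoffice",)):
    """Run a headless LibreOffice conversion, giving up after timeout_seconds.

    Returns True only if the converter exited with status 0 in time.
    converter is the program (plus any leading arguments) to launch;
    it is a parameter here purely so the sketch can be tried without LibreOffice.
    """
    cmd = list(converter) + ["--headless", "--invisible",
                             "--convert-to", target,
                             "--outdir", output_dir, input_path]
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0
```

A zero exit status alone does not prove that an output file was written, so a real command would probably also check for the converted file.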

 

Tim

 

Tim Hutchinson
Archivist, University Archives & Special Collections
University Library, University of Saskatchewan

Tel: (306) 966-1643

Email: tim.hut...@usask.ca

On sabbatical leave, July 2017-June 2018

--
You received this message because you are subscribed to a topic in the Google Groups "Archivematica Tech" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/archivematica-tech/onaG67k3ADY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to archivematica-t...@googlegroups.com.
To post to this group, send email to archivema...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/archivematica-tech/6ca47be5-67a1-4a44-810c-7994fa393909%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Timothy Walsh

Jan 18, 2018, 8:40:08 AM
to Archivematica Tech
Hi Tim,

Thanks for the feedback, and the link to the FPR documentation. I should know better than to ask a question without reading the docs first haha.

And good to know about your testing - 1800 files is certainly a larger test set than anything we've done yet. One of our digital processing archivists, Stefana Breitwieser, has been taking the lead on testing with unoconv vs. libreoffice outside of Archivematica, and I'm just starting to set it all up in a Vagrant VM for testing.

Out of curiosity, have you moved beyond testing into production use with libreoffice, or do you plan to?

Thanks,
Tim

Hutchinson, Tim

Jan 18, 2018, 12:00:23 PM
to Archivematica Tech

Hi Tim,

At this point LibreOffice isn’t part of our production instance, but I think that’s the ultimate plan, in the spirit of “good enough” preservation. One of the tradeoffs will be deciding to what extent to automate such normalization. For example, we have a bunch of older Word files which convert quite nicely through the Windows command line, but the headers get mangled via Archivematica (i.e. Linux). So ideally there would be some manual normalization mixed in (that is, normalization independent of Archivematica). But normalizing as many files as possible, even with some less-than-desirable results, is preferable to leaving collections in backlog, and of course we have the option of re-ingesting and more customized normalization for collections that need it.


Timothy Walsh

Feb 6, 2018, 1:28:38 PM
to Archivematica Tech
Thanks for the information, Tim. We're very much embracing the "good enough" model for preservation formats as well, particularly in cases like Microsoft Word where current versions of the software are already unable to open older formats generated by the same software.

In that light, we've implemented a few normalization commands in one of our production Archivematica pipelines, namely:

Transcoding to docx with libreoffice
Transcoding to odt with libreoffice
Transcoding to xlsx with libreoffice
Transcoding to ods with libreoffice
Transcoding to pptx with libreoffice

I've amended the commands a bit to account for the occasional failures you get when libreoffice takes too long to spin up in headless mode: the script runs the conversion up to ten times and stops as soon as the output file exists (this is very similar to the approach in Misty's Python script for normalizing with unoconv). Here, for example, is the bash script for "Transcoding to docx with libreoffice":

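# Try up to 10 times, since libreoffice sometimes fails to start in headless mode
for i in $(seq 1 10); do
    libreoffice --headless --invisible --convert-to docx --outdir "%outputDirectory%" "%fileFullName%"
    # Stop retrying once the output exists; rename it to add the prefix and postfix
    if [ -f "%outputDirectory%%fileName%.docx" ]; then
        mv "%outputDirectory%%fileName%.docx" "%outputDirectory%%prefix%%fileName%%postfix%.docx"
        break
    fi
done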
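
For comparison, the same retry-then-rename logic can be sketched in Python. This is an illustration rather than the command running in the pipeline: the function name convert_with_retries and all of its parameters are hypothetical, prefix and postfix stand in for the %prefix% and %postfix% values, and the sketch assumes that LibreOffice writes its output as the input file's base name with the new extension.

```python
import os
import subprocess


def convert_with_retries(input_path, output_dir, target="docx", prefix="",
                         postfix="", attempts=10, converter=("libreoffice",)):
    """Retry a headless conversion until the output file appears, then rename it.

    Returns the final path, or None if no attempt produced an output file.
    """
    stem = os.path.splitext(os.path.basename(input_path))[0]
    # Where the converter is assumed to write its result.
    produced = os.path.join(output_dir, "{}.{}".format(stem, target))
    # Where the renamed file should end up.
    final = os.path.join(output_dir,
                         "{}{}{}.{}".format(prefix, stem, postfix, target))
    cmd = list(converter) + ["--headless", "--invisible",
                             "--convert-to", target,
                             "--outdir", output_dir, input_path]
    for _ in range(attempts):
        subprocess.call(cmd)
        if os.path.isfile(produced):   # the same test as [ -f ... ] in the bash script
            os.rename(produced, final)
            return final
    return None
```

Like the bash version, this decides success by checking for the output file rather than by the converter's exit status.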


The only catch at the moment is that we've had the highest-quality results in testing (for Microsoft formats, especially) with LibreOffice 5, but didn't have any luck implementing that in Archivematica (it kept hanging), so for the moment we are using 4.2.8.2 420m0(Build:2), the version that is in the default Ubuntu apt repository.

Tim
