Here is a modification of a program that worked for me. It uses Word 2007 with the Save As PDF add-in installed. It searches a directory for .doc files, opens them in Word and then saves them as a PDF. Note that you'll need to add a reference to Microsoft.Office.Interop.Word to the solution.
I went through the Word-to-PDF pain when someone dumped 10,000 Word files on me to convert to PDF. I did it in C# using Word interop, but it was slow and crashed if I tried to use the PC at all. Very frustrating.
This led me to discover that I could drop the interop assemblies and their slowness. For Excel I use EPPlus, and then I discovered a free tool called Spire that can convert to PDF, with limitations!
Also, since Office 2007 has publish-to-PDF functionality, I guess you could use Office automation to open the *.DOC file in Word 2007 and save it as PDF. I'm not too keen on Office automation as it's slow and prone to hanging, but just throwing that out there...
The Microsoft PDF add-in for Word seems to be the best solution for now, but you should take into consideration that it does not convert all Word documents correctly to PDF, and in some cases you will see a huge difference between the Word document and the output PDF. Unfortunately I couldn't find any API that would convert all Word documents correctly. The only solution I found to ensure the conversion was 100% correct was converting the documents through a printer driver. The downside is that documents are queued and converted one by one, but you can be sure the resulting PDF has exactly the same layout as the Word document.
I personally preferred using UDC (Universal Document Converter) and installed Foxit Reader (free version) on the server too, then printed the documents by starting a "Process" and setting its Verb property to "print". You can also use FileSystemWatcher to set a signal when the conversion has completed.
I know this can be done using Microsoft.Office.Interop.Word, but my application is .NET Core and does not have access to Office interop. It could be running on Azure, but it could also be running in a Docker container on anything else.
Bad news: at the moment there isn't a lot of choice for PDF generation libraries on .NET Core. Since it doesn't look like you want to pay for one, and you can't legally use a third-party service, we have little choice except to roll our own.
The main problem is getting the Word document content transformed to PDF. One popular way is converting the Docx to HTML and exporting that to PDF. It was hard to find, but there is a .NET Core version of OpenXMLSDK-PowerTools that supports transforming Docx to HTML. The pull request is "about to be accepted"; you can get it from here:
Now that we can extract the document content to HTML, we need to convert it to PDF. There are a few libraries that convert HTML to PDF; for example, DinkToPdf is a cross-platform wrapper around the WebKit HTML-to-PDF library libwkhtmltox.
If you only want to show Word .docx files in a web browser, it's better not to convert the HTML to PDF, as that will significantly increase bandwidth. You could store the HTML in a file system, in the cloud, or in a database using a VPP technology.
The next thing we need to do is pass the HTML to DinkToPdf. Download the DinkToPdf (90 MB) solution. Build the solution; it will take a while for all the packages to be restored and for the solution to compile.
The DinkToPdf library requires the libwkhtmltox.so and libwkhtmltox.dll files in the root of your project if you want to run on Linux and Windows. There's also a libwkhtmltox.dylib file for Mac if you need it.
P.S. I realise you wanted to convert both .doc and .docx to PDF. I'd suggest making a service yourself to convert .doc to .docx using a specific non-server Windows/Microsoft technology. The .doc format is binary and is not intended for server-side automation of Office.
The LibreOffice project is an open-source, cross-platform alternative to MS Office. We can use its capabilities to export doc and docx files to PDF. Currently, LibreOffice has no official API for .NET, so we will talk directly to the soffice binary.
It is a somewhat "hacky" solution, but I think it is the one with the fewest bugs and the lowest maintenance cost. Another advantage of this method is that you are not restricted to converting from doc and docx: you can convert from every format LibreOffice supports (e.g. odt, html, spreadsheets, and more).
I wrote a simple C# program that uses the soffice binary. This is just a proof of concept (and my first program in C#). It supports Windows out of the box, and Linux only if the LibreOffice package has been installed.
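The call to the soffice binary is the same whichever language launches it, so here is a minimal, language-neutral sketch in Python rather than C#. It assumes LibreOffice's `--headless`, `--convert-to` and `--outdir` command-line options; the function names `build_convert_command` and `convert_to_pdf`, and the 120-second timeout, are illustrative and not taken from the original program.

```python
import subprocess
from pathlib import Path


def build_convert_command(soffice, source, out_dir):
    """Return the argument list for a headless LibreOffice PDF export."""
    return [
        str(soffice),
        "--headless",            # run without opening any window
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(soffice, source, out_dir, timeout=120):
    """Run soffice and return the path where the PDF is expected to appear.

    The timeout is an assumed safety net so a hung conversion does not
    block the caller forever.
    """
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_convert_command(soffice, source, out_dir),
        check=True,       # raise if soffice exits with a non-zero code
        timeout=timeout,
    )
    # LibreOffice names the output after the input file's stem.
    return out_dir / (source.stem + ".pdf")
```

Passing the arguments as a list rather than one shell string avoids quoting problems with file names that contain spaces. The same argument list can be handed to `Process.Start` in .NET.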
I don't know if this suits your use case, as you haven't specified the size of the documents you're trying to write, but if they're < 3 pages or you can manipulate them to be less than 3 pages, it will allow you to convert them into PDFs.
After struggling for some hours, I found that the test.docx copied to the bin folder was only 1 KB. To solve this, right-click test.docx > Properties and set Copy to Output Directory to Copy always.
For converting DOCX to PDF, even with placeholders, I have created a free "Report-From-DocX-HTML-To-PDF-Converter" library for .NET Core under the MIT license, because I was so frustrated that no simple solution existed and all the commercial solutions were super expensive. You can find it here with an extensive description and an example project:
You only need the free LibreOffice. I recommend using the LibreOffice portable edition, so that it does not change anything in your server settings. Check where the file "soffice.exe" is located (on Linux it is called differently), because you need its path to fill the variable "locationOfLibreOfficeSoffice".
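If you would rather not hard-code that path, the lookup can be sketched as follows. This is a language-neutral illustration in Python, not part of the library; `find_soffice` is a hypothetical helper, and the executable names it searches for on the PATH are assumptions that may differ on your system.

```python
import shutil
from pathlib import Path


def find_soffice(candidates=(), names=("soffice", "soffice.exe")):
    """Return the first usable soffice path, or None if nothing is found.

    candidates: explicit paths to try first, e.g. a portable install folder.
    names: executable names to look up on the PATH (assumed defaults).
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return str(path)
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    return None
```

The result, when not `None`, is the value you would assign to "locationOfLibreOfficeSoffice". Trying the explicit candidates first keeps a portable edition from being shadowed by a system-wide install.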
As you see, you can also convert from DOCX to HTML. You can also put placeholders into the Word document, which you can then "fill" with values. That is outside the scope of your question, but you can read about it on GitHub (README).
This is adding to Jeremy Thompson's very helpful answer. In addition to the word document body, I wanted the header (and footer) of the word document converted to HTML. I didn't want to modify the Open-Xml-PowerTools so I modified Main() and ParseDOCX() from Jeremy's example, and added two new functions. ParseDOCX now accepts a byte array so the original Word Docx isn't modified.
In my case, I then convert the HTML files to images (using Net-Core-Html-To-Image, also based on wkHtmlToX). I combine the header and body images together (using Magick.NET-Q16-AnyCpu), placing the header image at the top of the body image.
Here is my implementation of Shmuel H.'s method using the LibreOffice binary on Windows; maybe this could help someone out. It works pretty well, just ensure you install LibreOffice. I used the portable version ( -versions/) and copied it to my C drive. Performance-wise it is not too bad; most of the time is spent loading LibreOffice into memory. Apparently you can have it running as a service somehow, which should speed things up, but I haven't been able to do so yet.
Hi @zaeendesouza ,
Works fine for me (also with xlsx files).
Thanks for sharing and welcome to the ODK community forum
When you get a chance, don't hesitate to take some time to introduce yourself here.
I think the issue is that your form uses a column called 'label::english', while the tool requires just 'label'. Rename 'label::english' to 'label' and it should work fine. I will be pushing an update sometime next week, so I will try to fix this, or at least mention that you need to rename the column.
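Until that update lands, the rename can be done on the header row before running the export. A minimal sketch in Python, assuming the header row has already been read into a list of strings; `normalise_label_header` is a hypothetical helper and not part of the tool.

```python
def normalise_label_header(headers, language_column="label::english"):
    """Rename a language-specific label column to plain 'label'.

    The header row is returned unchanged if a plain 'label' column already
    exists, so that the rename never produces two columns with the same name.
    """
    cleaned = [h.strip().lower() for h in headers]
    if "label" in cleaned:
        return list(headers)
    return ["label" if c == language_column else h
            for h, c in zip(headers, cleaned)]
```

For example, a header row of `type`, `name`, `label::english` becomes `type`, `name`, `label`, and all other columns are left untouched.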
Enketo allows PDF exports but they aren't always the best for these purposes. If I was to give any feedback it would be to decide on what the purpose of the word exports is. If it is for someone who will review the content of the form then maybe it works well as it is.
If the purpose is for someone who needs to fill in the form but realises they cannot do it digitally, then maybe the choices should have check boxes and the text questions should have spaces ______. In that scenario the question "name" might not be needed.
In this context, I'd also like to echo @Stephen_K_ojwang: one "formatting" thing I typically do when preparing similar printouts for reviewers (usually much more manually, which is what's so nice about your tool) is to use "groups" to organise the printout, so I highlight the group header row in a different colour, make it bold, that kind of thing, and leave a blank space.
Even if there is no indication of which begin_group and end_group lines are pairs, simply formatting them differently from the other questions is already very useful, especially if they are "field-list" groups that should appear on one mobile device screen.
Hi Janna, thanks for flagging these too. Will look into it as well. Would you mind sharing an anonymized sample questionnaire IF possible via dm? I might need to sit and check which questions are getting dropped and why!
Conversions from adjectives to nouns and vice versa are both very common and unnotable in English; much more remarked upon is the creation of a verb by converting a noun or other word (for example, the adjective clean becomes the verb to clean).
In English, verbification typically involves simple conversion of a non-verb to a verb. The verbs to verbify and to verb, the first by derivation with an affix and the second by zero derivation, are themselves products of verbification (see autological word), and, as might be guessed, the term to verb is often used more specifically, to refer only to verbification that does not involve a change in form. (Verbing in that specific sense is therefore a kind of anthimeria.)
Verbification may have a bad reputation with some English users because it is such a potent source of neologisms. Although some neologisms that are products of verbification may meet considerable opposition from prescriptivist authorities (the verb sense of impact is a well-known example), most such derivations have become so central to the language after several centuries of use that they no longer draw notice.