For a fairly large batch of PDF files with mixed and somewhat complex content, I've tried several of the tools out there that can convert PDF to text. Unfortunately, the command-line capable ones, such as PDFtoTEXT.EXE, are (1) slow and (2) produce TXT files whose content isn't correct.
The best results so far came from Foxit Reader: it is fast and reliably produces correct content in the resulting TXT files. However, Foxit Reader has no command-line interface that would let me use it for batch processing.
So I wrote this script (ugly: it uses a lot of send() commands, because ControlClick() on the appropriate control IDs doesn't seem to work). I'd like to ask whether someone has already built a better automation for Foxit Reader, or whether there is a good alternative approach for converting PDF to TXT files.
What about converting the PDF to a simple image (png/bmp/jpg), then using a more specific OCR program to read the images? It gives you the flexibility of not looking for a PDF specific OCR program, and you can try something like
or any other CLI OCR. An added benefit of converting to an image is that you can also modify the image to improve the clarity of the text if the PDF isn't computer generated (for example, if it was created from a scanned document).
If you want to use the stand-alone executables (pdftopng for example) with your application, you're free to do so. (To comply with the GPL, you'll need to distribute the Xpdf documentation along with the pdftopng executable - see the Xpdf README file for details.)
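To make the image route a bit more concrete, here is a minimal sketch in Python of the rendering step only. It assumes Xpdf's pdftopng is on the PATH and follows its documented usage, `pdftopng [options] PDF-file PNG-root`. The function names and the 300 dpi default are my own choices for illustration, and the OCR step is deliberately left open, since any CLI OCR could be run on the returned PNG files.

```python
import glob
import subprocess
from pathlib import Path


def pdftopng_command(pdf, png_root, dpi=300):
    """Build the argument list for Xpdf's pdftopng.

    Usage per the Xpdf docs: pdftopng [options] PDF-file PNG-root
    The 300 dpi default is an assumption; tune it for your OCR tool.
    """
    return ["pdftopng", "-r", str(dpi), str(pdf), str(png_root)]


def render_pages(pdf, out_dir, dpi=300):
    """Render every page of `pdf` to PNG and return the files in page order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf).stem
    subprocess.run(pdftopng_command(pdf, out_dir / stem, dpi), check=True)
    # pdftopng names its output <PNG-root>-000001.png, <PNG-root>-000002.png, ...
    return sorted(out_dir.glob(glob.escape(stem) + "-*.png"))
```

Each path returned by `render_pages("report.pdf", "pages")` can then be cleaned up (contrast, deskew, and so on) and handed to whichever OCR program you settle on.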
With your Foxit Reader method, after loading the desired PDF, have you tried simply sending a Ctrl-a to select all text, then a Ctrl-c to copy it to the clipboard? Then your script can retrieve the copied text with ClipGet and do whatever you want with it (e.g. save it to a file, display it, manipulate it, etc.). This method doesn't require as much faffing about with the menus and controls in Foxit, and it seems to produce cleaner text, since it doesn't include all of the extraneous blank lines that the 'Save As' method generates.
For a very lengthy document, it might take a few seconds to select all of the text, and then another few seconds to copy it to the clipboard. So you'd have to figure out how to know when each step is done. On the old version of Foxit Reader I use (the last one that lets you choose the classic toolbar instead of that horrible Ribbon Mode), a progress window appears while the text is being selected after pressing Ctrl-a. When that window disappears, you can go on to copy the text to the clipboard using Ctrl-c, and another progress window appears while that takes place. By watching for these progress windows you can tell when to take the next step.
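The wait-for-the-progress-window logic could look roughly like the sketch below. It is written in Python only to show the control flow. The three callables `send_keys`, `progress_visible` and `read_clipboard` are hypothetical stand-ins for whatever your script uses (in AutoIt that would be something like Send, WinExists and ClipGet), and all timeouts are guesses.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.25):
    """Poll condition() until it is true; return False if `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def wait_for_progress(progress_visible, appear_timeout=1.0, timeout=60.0, interval=0.25):
    """Give the progress window a moment to show up, then wait for it to close."""
    # On short documents the window may never appear, so this result is ignored.
    wait_until(progress_visible, appear_timeout, interval)
    return wait_until(lambda: not progress_visible(), timeout, interval)


def copy_all_text(send_keys, progress_visible, read_clipboard, **waits):
    """Select all, wait, copy, wait again, then return the clipboard text."""
    for keys, step in (("^a", "select"), ("^c", "copy")):
        send_keys(keys)  # "^a" / "^c" is AutoIt's notation for Ctrl-a / Ctrl-c
        if not wait_for_progress(progress_visible, **waits):
            raise TimeoutError(f"{step} step did not finish in time")
    return read_clipboard()
```

Waiting briefly for the window to appear before waiting for it to close avoids the race where the script checks too early, sees no window, and carries on before Foxit has even started.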
I have not tested this (I don't have Foxit), but I was curious what ChatGPT (GPT-4) would say when asked about possible ways to automate Foxit Reader. I asked it specifically about using a COM API (Foxit does have an API). It barfed up the following tester, which may or may not work. Either way, the API reference docs can be found here. I would think the API would be more reliable than sending commands to the GUI.
@rudi - I was digging around a bit, and I think it may depend on the paid Foxit SDK. It appears you can get a free trial on their site, but you need to pay once the trial ends. So I'm not sure if this is a work thing ... if so, it may still be worth investigating.
But after upgrading to the currently latest release, v10.01.1, the results look quite promising. The remaining constraint is that quite a lot of lines that Foxit saves as two lines (separate table rows in the original PDF file) are now saved as one line by gs. But that can be handled by the data processing done later on.
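For anyone who wants to try the gs route in a batch, here is a minimal sketch. It assumes that gs means Ghostscript and that text is extracted with its txtwrite output device; the post above doesn't show the exact command line used, so the switches are my assumption. The executable name `gswin64c` (the Windows console build) and the function names are also assumptions.

```python
import subprocess
from pathlib import Path


def gs_text_command(pdf, txt, exe="gswin64c"):
    """Build a Ghostscript call that writes the text of `pdf` to `txt`."""
    return [
        exe,
        "-q",                    # suppress the startup banner
        "-dNOPAUSE", "-dBATCH",  # no prompts, exit when finished
        "-sDEVICE=txtwrite",     # text extraction device
        f"-sOutputFile={txt}",
        str(pdf),
    ]


def convert_folder(folder, exe="gswin64c"):
    """Convert every *.pdf in `folder` to a .txt file next to it."""
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        subprocess.run(gs_text_command(pdf, txt, exe), check=True)
        written.append(txt)
    return written
```

On Linux or macOS you would pass `exe="gs"` instead. The merged table rows mentioned above would still need to be split in a later processing step.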
Your initial script was perfect. I had to make some minor changes, probably because I'm using a newer version of Foxit Reader, but it did exactly what I needed it to do. I was able to cut most of the sleeps down to 500 ms to make it faster.