Hi Mike,
Not wanting to suggest that you take the Python route, but just sharing my experience.
I've tried Acrobat Reader's "Save as Text" functionality, and also one or two Python libraries to extract text from PDFs (PyPDF2 is the one I've settled on).
But what I learnt - without really digging into the issue - is that PDF is a pretty weird format where text from the same sentence/paragraph "floats around" as separate objects.
Bottom line is - no matter what tool you use - you may find it really tricky to get polished text from what seems like a simple PDF.
That said ... please, please prove me wrong!
If anyone has a good PDF extraction tool in an easy-to-use form, I'm interested.
My own use case is to extract text from some partner training materials which I regularly deliver, so that I can do a diff to see what changed between releases (obviously if they actually summarized the changes that would be ideal, but they point me to pdfdiff ... yuk).
I have some scripting that - just about - works, but two things make it an absolute pain:
- sentences/paragraphs are broken up into multiple lines (and not in the same places across releases)
- embedded code (in boxes) is indistinguishable from other text.
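
For the first of those two problems, a minimal sketch of the kind of scripting involved (an illustration under assumptions, not the actual script): extract the text, throw away the original line breaks, put one sentence on each line, and only then diff. The function names (extract_text, to_sentences, diff_releases) are made up for this example. The extraction step assumes PyPDF2 2.0 or later, where PdfReader and page.extract_text() are available; the rest is standard library only.

```python
import difflib
import re


def extract_text(pdf_path):
    """Return the text of every page, joined by blank lines."""
    # Imported lazily so the helpers below also work without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)


def to_sentences(text):
    """Collapse all whitespace, then return one sentence per list item."""
    flat = " ".join(text.split())
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def diff_releases(old_text, new_text):
    """Unified diff of two texts that ignores where lines were broken."""
    return list(difflib.unified_diff(
        to_sentences(old_text), to_sentences(new_text),
        fromfile="old", tofile="new", lineterm=""))
```

Usage would be something like diff_releases(extract_text("v1.pdf"), extract_text("v2.pdf")), where the file names are placeholders. The sentence split is deliberately crude (it will also break after abbreviations such as "e.g."), and it does nothing for the second problem: code in boxes still comes out mixed in with the surrounding text.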
My 2cts.py,
another Mike.