dale wrote:
> Hi,
>
> I have about 100 PDF files I would like to search for common words.
>
> "word" for instance
>
> prefer not to write a script, but can if the only option
>
> using Windows 10, so the scripting language would be preferably powershell
>
> assistance appreciated and will be recognized in any public use of
> assistance if okay
>
Acrobat Exchange (probably just called "Acrobat" today)
had an inverted search indexer, which could accept hundreds of
documents and index them into a common database file. I have
at least one CD, distributed as advertising, that carries one
of those indexes covering every PDF on the CD. Very nice.
The search could then be carried out instantly, and when you
clicked a line in the search results, the document
in question would open. My copy of Acrobat Exchange
(version 4 or so) had that inverted search indexer.
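To illustrate the idea only (this is a sketch, not how Acrobat's indexer is actually implemented): an inverted index maps each word to the documents that contain it, so a lookup does not have to rescan the files. The names build_index and lookup, and the sample texts, are made up for the example, and the text is assumed to have been extracted from the PDFs already.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document names containing it.

    `docs` is a dict of {document name: already-extracted text}.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index

def lookup(index, word):
    """Return the sorted list of documents that contain `word`."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical sample data standing in for text extracted from PDFs.
docs = {
    "manual.pdf": "Insert the word here",
    "notes.pdf": "No match in this one",
    "faq.pdf": "A word about indexing",
}
index = build_index(docs)
print(lookup(index, "word"))  # ['faq.pdf', 'manual.pdf']
```

Building the index costs one pass over every document, but each later lookup is a single dictionary access, which is why the search on that CD felt instant.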
Windows' own indexed search could also work... as long as
Windows 10 has a search provider for PDF. Since Windows 10
doesn't have a "thumbnailer" capability for PDF unless you
install some Acrobat software, what are the odds that PDF
files will get indexed by the built-in Federated Search in
Windows 10? The problem is that Windows cannot "add a file
type" for content search unless a search provider knows how
to open the file and extract words from it.
There are open-source tools, such as pdf2text or pdftotext,
that might work (text extraction driven from a script). Note
that the tools can differ in quality.
For example, any bozo can extract a single text string:
    bozo
However, tools like LibreOffice have on occasion resorted to
micro-positioning (overriding the font metrics and
pretending they're smarter than font people) when they
save out in PDF format. What happens when pdftotext sees
    b o z o
Does that get converted to four one-letter words?
Or is the tool clever enough to realize that it is "bozo"?
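If a tool does hand back letter-spaced output, one possible cleanup is a heuristic pass over the extracted text. The sketch below is an assumption about what such a pass might look like, not something pdftotext does itself; the function name rejoin_spaced_letters is invented for the example.

```python
import re

# A run of three or more single characters, each separated by exactly one space.
_SPACED = re.compile(r"\b\w(?: \w){2,}\b")

def rejoin_spaced_letters(text):
    """Collapse runs such as "b o z o" back into "bozo" (heuristic only)."""
    return _SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)

print(rejoin_spaced_letters("any b o z o can"))  # any bozo can
```

This will misfire on genuine runs of one-letter words, and it cannot tell where one letter-spaced word ends and the next begins if only single spaces separate them, so treat it as a rough repair rather than a fix.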
If the letters are arranged like this, no open-source
software will do a good job. The baseline has to be smooth.
    b   z
      o   o
Note that PDF documents can contain strings positioned
on spline curves. Do not expect an open-source tool to
extract those. Adobe probably knows how to extract
such a thing, but other tools will be hit and miss.
Text which has purely horizontal or vertical orientation
(as a string), with a smooth baseline, might well be
extracted as you would expect.
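For the well-behaved case, a script-level search over a folder could look roughly like the sketch below. It is written in Python rather than PowerShell, and it assumes the pdftotext command-line tool (from Xpdf or Poppler) is installed and on the PATH; passing "-" as the output file makes pdftotext write the text to stdout. The function names are invented for the example.

```python
import re
import subprocess
from pathlib import Path

def extract_text(pdf_path):
    """Run pdftotext on one file and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" = write text to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def search_folder(folder, word, extract=extract_text):
    """Return {file name: hit count} for each PDF in `folder` containing `word`."""
    hits = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        count = count_word(extract(pdf), word)
        if count:
            hits[pdf.name] = count
    return hits
```

With a hypothetical folder, search_folder(r"C:\docs", "word") would return a dictionary of file names and hit counts. The extract parameter is there so a different extraction tool can be swapped in if pdftotext handles a particular set of documents poorly.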
So, yes, you might be able to buy software to do this.
I still, on occasion (for experiments), try to get
that old inverted search indexer to do stuff for me.
You could also try mechanically concatenating all 100
documents into one document, and then using the
sequential text search (Ctrl+F) that exists in MSEdge. It
takes MSEdge roughly 7.5 minutes to search a 36,300-page
document, which works out to about 80 pages per second.
Paul