search multiple PDFs for common text

dale

May 9, 2018, 5:14:35 PM
Hi,

I have about 100 PDF files I would like to search for common words.

"word" for instance

prefer not to write a script, but can if the only option

using Windows 10, so the scripting language would be preferably powershell

assistance appreciated and will be recognized in any public use of
assistance if okay

--
dale - http://www.dalekelly.org/
Not a professional opinion unless specified.

Mayayana

May 9, 2018, 5:32:58 PM
"dale" <da...@dalekelly.org> wrote

| I have about 100 PDF files I would like to search for common words.
|

It's easy with something like Agent Ransack. Just put
the PDFs in a folder and search. No script needed. But
there is one, big caveat: Not all text in PDFs is in there
as text. A book, for instance, might be comprised of
the text of the book. Or it could be composed of scans of
the book's pages. The latter will have no actual text.
You'd have to extract the pages and run them through
OCR to get the text.


Keith Nuttle

May 9, 2018, 5:47:38 PM
Please remember that there are two basic kinds of PDF documents. One is
a text-based document in PDF format. The other is an image in
PDF format. If the image is of a text document, it will look like a
text-based document but cannot be searched. If the image
document has been OCR'ed and the OCR text included in the PDF,
the PDF document will be searchable.

So depending on the source of the documents you wish to search, your
goal may be attainable, but if the PDFs are image documents from a
scanner, it will not be.
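
One rough way to spot image-only PDFs in a batch, sketched in Bash. This assumes each PDF has already been converted to a .txt file by a text extractor such as pdftotext (mentioned further down this thread). The function name list_textless is made up for illustration. A converted file with no letters or digits at all most likely came from a scan with no OCR text layer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of text files that contain no
# letters or digits. Assumption: the files were produced from PDFs by a
# text extractor, so a "textless" file suggests an image-only PDF.
list_textless() {
    local f
    for f in "$@"; do
        # grep -q exits non-zero when no alphanumeric character is found
        grep -q '[[:alnum:]]' "$f" || printf '%s\n' "$f"
    done
}
```

Calling `list_textless *.txt` in the folder of converted files would print the candidates that probably need OCR before they can be searched.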

--
2018: The year we learn to play the great game of Euchre

dale

May 9, 2018, 6:00:40 PM
On 5/9/2018 5:32 PM, Mayayana wrote:
These aren't books, though might be encoded the same way, I am able to
copy/paste the text on one I tried, using MS Edge browser as viewer

looked over the Agent Ransack specs, seems like it can work

isn't available in Microsoft store so I don't know yet if I want to
install it, "about us" looks impressive

Thanks much

dale

May 9, 2018, 6:05:05 PM
I can copy/paste from one of the PDF files using MS Edge browser as
viewer ..., and I can search one with Edge's "find on page" function

Paul

May 9, 2018, 6:27:18 PM
dale wrote:
> Hi,
>
> I have about 100 PDF files I would like to search for common words.
>
> "word" for instance
>
> prefer not to write a script, but can if the only option
>
> using Windows 10, so the scripting language would be preferably powershell
>
> assistance appreciated and will be recognized in any public use of
> assistance if okay
>

Acrobat Exchange (could be called just "Acrobat" today),
had an inverted search indexer, which could accept hundreds of
documents and index them into a common database file. I have
at least one CD distributed as advertising, which had one
of those indexes which covered every PDF on the CD. Very nice.

The search could then be carried out instantly, and when you
clicked a single line in the search result, the document
in question would open.

My copy of Acrobat Exchange 4 or so, had a feature like that.
The inverted search indexer.

This could also work... as long as Windows 10 had a search
provider for PDF. Since Windows 10 doesn't have a "thumbnailer"
capability for PDF, unless you install some Acrobat software,
what are the odds that PDF files will get indexed by the built-in
Federated Search in Windows 10 ?

The problem is, Windows cannot "add a file type" for content
search, unless a search provider knows how to open the file
and extract words from it.

There are tools such as open source pdf2text or pdftotext
that might work (script level detection of text). Note that,
there can be differences in the quality of the tools.
For example, any bozo can extract a single text string.

bozo

However, tools like LibreOffice, have on occasion resorted to
micro-positioning (overriding the font metrics and
pretending they're smarter than font people) when they
save out in PDF format. What happens when pdftotext sees

b o z o

Does that get converted to four one-letter words?
Or is the tool clever enough to realize that it is "bozo"?

If the letters are arranged like this, no open-source
software will do a good job. The baseline has to be smooth.

b z
o o

Note that, PDF documents can contain strings positioned
on spline curves. Do not expect an open source tool to
extract those. Probably Adobe knows how to extract
such a thing, but other tools will be hit and miss.
Text which has purely horizontal or vertical orientation
(as a string), with a smooth baseline, might well be
extracted as you would expect.
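
For the spaced-out "b o z o" case, a minimal Bash sketch of a search that tolerates spaces between letters. It assumes the PDFs have already been converted to plain text files and that the search word is made of ordinary letters only (no regex metacharacters). The function names spaced_pattern and grep_spaced are hypothetical. It does nothing for the scattered-baseline or text-on-a-curve cases.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a word into a grep pattern that allows any
# number of spaces between letters, e.g. bozo becomes "b *o *z *o".
# Assumption: the word contains plain letters only.
spaced_pattern() {
    # Append " *" after every character, then drop the trailing " *"
    printf '%s' "$1" | sed 's/./& */g; s/ \*$//'
}

# Hypothetical helper: list the files that contain the word, whether it
# was extracted intact or with its letters spaced apart.
grep_spaced() {
    local word=$1
    shift
    # -i: ignore case, -l: print only the names of matching files
    grep -il -- "$(spaced_pattern "$word")" "$@"
}
```

Something like `grep_spaced bozo *.txt` would then list files containing either "bozo" or "b o z o".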

So, yes, you might be able to buy software to do this.
I still, on occasion (for experiments), try to get
that old inverted search indexer to do stuff for me.

You could also try mechanically concatenating all 100
documents into one document, and then using the
sequential text search that exists in MSEdge. It
takes MSEdge roughly 7.5 minutes to search a 36,300
page document.

Paul

dale

May 9, 2018, 7:28:54 PM
Thanks much, will research

found these

https://helpx.adobe.com/acrobat/using/searching-pdfs.html

https://helpx.adobe.com/acrobat/using/creating-pdf-indexes.html#creating_pdf_indexes

don't think I need an index although it would be nice, this means I
don't need "Pro"

this shows pricing ... "Pro" is only $2 more on a monthly basis

https://acrobat.adobe.com/us/en/acrobat/pricing.html

Mayayana

May 9, 2018, 8:01:27 PM
"dale" <da...@dalekelly.org> wrote

| looked over the Agent Ransack specs, seems like it can work
|
| isn't available in Microsoft store so I don't know yet if I want to
| install it, "about us" looks impressive
|

I've used it for years to replace the very limited
Windows search functionality. Some people like
a program called Everything, but I've never tried
that.

Another option would be to use something like
Sumatra, or any basic PDF reader, to export the
content as a text file. Then you'd have better
access. Though in my experience, nothing exports
text perfectly. There's usually some "noise", like
an "h" that ends up as 1n. Things like that. Mistakes
based on character shape.


Paul

May 9, 2018, 11:01:35 PM
dale wrote:
> Hi,
>
> I have about 100 PDF files I would like to search for common words.
>
> "word" for instance
>
> prefer not to write a script, but can if the only option
>
> using Windows 10, so the scripting language would be preferably powershell
>
> assistance appreciated and will be recognized in any public use of
> assistance if okay
>

There are some more breadcrumbs in this thread.

https://answers.microsoft.com/en-us/windows/forum/windows_10-win_cortana-winpc/cannot-search-contents-of-pdf-files-using-file/0d15e80d-8dc6-4879-8356-3247614e202b

The "reader search handler" is apparently a means of
getting Windows Search to include the content of PDF files.

https://filestore.community.support.microsoft.com/api/images/96379b8f-c966-433a-bf9c-184b8e852900

Paul

mike

May 10, 2018, 12:23:05 AM
Total Commander is a great substitute for windows explorer.
Has a lot of plugins including one to search pdf files.

Paul

May 10, 2018, 12:49:25 AM
Weird. I just checked my PDF entry and I have a
"reader search handler". Looks like I had Acrobat Reader (as
part of an experiment to get PDF thumbnails working), and
removed it, and the "reader search handler" seems to have
stuck around.

When I found a reasonably unique keyword and tried a search
in File Explorer, the only file that popped up in the
search result, was the PDF file in question. So it looks
like mine have been indexed in Win10, purely by accident/side effect.

*******

For the Total Commander plugin, is that an indexer or
just a "real time search" ?

I think a fun test, would be to find a file prepared in
Illustrator, where the text is on a path, and see if it
detects the text string properly.

Paul

mike

May 10, 2018, 4:58:58 AM
I have no experience.
Standard install of TC won't search inside a PDF, but the plugin does.
I just installed the plugin to see what it would do.
Put in a keyword to search and the PDF files containing the keyword popped
up in the list.
I searched a directory of PDF's to keep it simple.
Seems to be able to restrict search to pdf, but I didn't try it.
It's a keeper.

dale

May 10, 2018, 1:59:22 PM
thanks much Paul

dale

May 10, 2018, 5:43:28 PM
On 5/9/2018 5:14 PM, dale wrote:
> Hi,
>
> I have about 100 PDF files I would like to search for common words.
>
> "word" for instance
>
> prefer not to write a script, but can if the only option
>
> using Windows 10, so the scripting language would be preferably powershell
>
> assistance appreciated and will be recognized in any public use of
> assistance if okay
>

was able to do this in Windows File Explorer, enter a folder that has
the PDF files, then use the search field in the upper right hand corner
... duh ... thanks to my Accountant Sister

don't know all types it will do, it will do a Mozilla Thunderbird Email
file (.eml) ... just happened to be in the same folder

Peter Flynn

May 17, 2018, 5:26:26 PM
On 09/05/18 22:14, dale wrote:
> Hi,
>
> I have about 100 PDF files I would like to search for common words.
>
> "word" for instance
>
> prefer not to write a script, but can if the only option
>
> using Windows 10, so the scripting language would be preferably powershell

That sounds like gross overkill for something this simple.

pdftotext is a command-line utility in the Poppler libraries, which
extracts all text from a PDF, so installing Poppler would be the first
thing to try. It appears to be available for Linux, Mac, and Windows.

Then you can type as follows (this is Linux Bash syntax; I assume
Windows has something similar in the Command shell):

for f in *.pdf; do pdftotext "$f"; done

which will create a .txt file for every .pdf, so you can use them in
whatever search program you have.
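
Once the .txt files exist, the search itself could look something like the following Bash sketch. find_word is a made-up helper name. It assumes grep is available, that the grep in use supports the -w option (GNU and BSD grep do), and that the .txt files all sit in one folder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the .txt files in a folder that contain a
# given word. Assumption: the .txt files were already generated from
# the PDFs, one per PDF.
find_word() {
    local word=$1 dir=${2:-.}
    # -i: ignore case, -l: print only the names of matching files,
    # -w: match whole words only
    grep -ilw -- "$word" "$dir"/*.txt
}
```

For example, `find_word word` searches the current folder, and `find_word word /path/to/folder` searches another one. Each name printed corresponds to a PDF of the same base name.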

///Peter