
How to search "greater than" numeric values in a pdf file?


quouo

Dec 11, 2011, 2:01:09 PM
Hello guys,

I have a PDF file (3500 pages) and I need to search it for values greater
than or equal to 25.000. How can I do that? (I already tried importing it into
Excel, but the columns come out wrong and some cells hold more than one value.)

Thank you ;)

Axel Berger

Dec 11, 2011, 2:34:02 PM
quouo wrote:
> i need to search inside of it values greater
> or equal than 25.000, how to?

Run pdftotext over it and then you can open it in a real editor with
regular expressions and other powerful search tools.
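
As a rough sketch of what such a regular-expression search could look like once the text has been extracted: the function below assumes that the "." in the figures is a thousands separator (so 25.000 means twenty-five thousand) and that every value of interest is written with those separators. The name find_large and the file name yourfile.txt are made up for illustration.

```shell
# Hypothetical helper: print, with line numbers, every line of a text file
# that contains a dot-grouped number of at least 25.000.
# Assumption: "." is a thousands separator (25.000, 125.000, 1.250.000).
# Numbers written without separators (e.g. 30000) are NOT matched.
find_large() {
    # First alternative: 25.000 to 999.999; second: 1.000.000 and above.
    # The outer groups stop a match from starting or ending inside a longer number.
    grep -nE '(^|[^0-9.])((2[5-9]|[3-9][0-9]|[1-9][0-9]{2})\.[0-9]{3}|[1-9][0-9]{0,2}(\.[0-9]{3}){2,})([^0-9]|$)' "$1"
}
```

After something like "pdftotext yourfile.pdf yourfile.txt", calling "find_large yourfile.txt" lists the matching lines. If the "." is really a decimal point, this pattern does not apply and a numeric comparison is the better route.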

> Already tried to import it in excel

Pointee-clickee won't get you far here, I'm afraid.

Axel

quouo

Dec 11, 2011, 3:30:48 PM


"Axel Berger" wrote in message news:4EE505AA...@Gmx.De...

>Run pdftotext over it and then you can open it in a real editor with
>regular expressions and other powerful search tools.

Thank you for your reply, Axel.

Here are two more questions... thank you for your patience :)

Is it sufficient to save the PDF in txt format from Adobe Reader?

Can you suggest one of the real editors with
regular expressions and other powerful search tools?



Thank you :)

Axel Berger

Dec 11, 2011, 4:20:25 PM
quouo wrote:
> Is it sufficient to save the pdf in txt format from the adobe reader?

Possibly. My reader version doesn't offer that menu item, which is why I
am asking for help with converting to PDF 1.4 in another thread.

> Can you suggest me the name of one of the real editors with
> regular expressions and other powerful search tools?

The canonical answer here has to be emacs, which I don't know. My
personal choice in Windows is NoteTab, <http://www.notetab.com/>. It has
very powerful and easy-to-use scripting capabilities. Jumping in at the
deep end with a specific problem like yours makes for a really steep
learning curve, but there is also a very active and helpful user group,
run as a Yahoo mailing list.

Axel

Peter Flynn

Dec 22, 2011, 7:28:51 PM
On 11/12/11 21:20, Axel Berger wrote:
> quouo wrote:
>> Is it sufficient to save the pdf in txt format from the adobe reader?
>
> Possibly. My reader version doesn't offer that menu item, which is why I
> am asking help in converting to PDF 1.4 in another thread.
>
>> Can you suggest me the name of one of the real editors with
>> regular expressions and other powerful search tools?
>
> The canonical answer here has to be emacs, which I don't know.

If you need to do this kind of thing on a regular basis, Emacs would
certainly be worth learning, but using any editor is going to be a
tedious way to do it.

Instead, use pdftotext to extract the text. The output uses ^L (form feed)
characters as page breaks. A simple script can then split the text into
individual words on spaces and punctuation (except for the decimal point),
keep the words that consist only of digits, and test those for values
over 25, giving the page on which each one occurs, e.g.

pdftotext yourfile.pdf - |\
tr ' ,():;"-' '\012' |\
awk '/^L/ {++p} /^[0-9][0-9]*$/ {if($0>25)print $0 " on p." p+1}'

Make sure that ^L is an actual Ctrl-L character, not the two characters
^ and L.

Using the awk manual at http://www.cs.unibo.it/~renzo/doc/awk/nawkA4.pdf
as a test document, I get the following output:

1995 on p.1
1993 on p.2
1985 on p.3
1991 on p.5
1991 on p.5
675 on p.5
675 on p.10
1989 on p.10
45 on p.13
119 on p.13
...etc
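
If the "." in the original question is a thousands separator rather than a decimal point (so that 25.000 means twenty-five thousand), a variant along the following lines might be closer to what is wanted. It is only a sketch under that assumption: the function name big_values and the default threshold are invented for illustration. It uses awk's \f escape for the form feed, so no literal Ctrl-L has to be typed, and it compares with "greater than or equal".

```shell
# Hypothetical helper: read pdftotext output on stdin and list every plain
# or dot-grouped integer >= the threshold, with the page it appears on.
# Assumption: "." is a thousands separator (25.000 = twenty-five thousand).
big_values() {
    # Put each word on its own line, splitting on spaces and punctuation
    tr ' ,():;"-' '\n\n\n\n\n\n\n\n' |
    awk -v min="${1:-25000}" '
        {
            page += gsub(/\f/, "")   # each form feed starts a new page
            sub(/\.$/, "")           # drop a sentence-ending full stop
        }
        /^([0-9][0-9]?[0-9]?(\.[0-9][0-9][0-9])+|[0-9]+)$/ {
            value = $0
            gsub(/\./, "", value)    # remove the thousands separators
            if (value + 0 >= min + 0)
                print $0 " on p." (page + 1)
        }'
}
```

It would be run as "pdftotext yourfile.pdf - | big_values 25000". Because the comma is treated as a word separator, anything after a decimal comma is tested as a separate number, which does no harm with a threshold this large.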

///Peter