On 11/12/11 21:20, Axel Berger wrote:
> quouo wrote:
>> Is it sufficient to save the pdf in txt format from the adobe reader?
>
> Possibly. My reader version doesn't offer that menu item, which is why I
> am asking help in converting to PDF 1.4 in another thread.
>
>> Can you suggest me the name of one of the real editors with
>> regular expressions and other powerful search tools?
>
> The canonical answer here has to be emacs, which I don't know.
If you need to do this kind of thing on a regular basis, Emacs would
certainly be worth learning, but using any editor is going to be a
tedious way to do it.
Instead, use pdftotext to extract the text. Its output marks page breaks
with ^L (form feed) characters. A simple script can then split the text
into individual words on spaces and punctuation (leaving the decimal
point alone), pick out the words that consist only of digits, and report
those with values over 25 together with the page on which they occur,
e.g.
    pdftotext yourfile.pdf - |\
    tr ' ,():;"-' '\012' |\
    awk '/^L/ {++p} /^[0-9][0-9]*$/ {if($0>25)print $0 " on p." p+1}'
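Make sure that the ^L in the awk pattern is an actual Ctrl-L (form feed)
character, not the two characters ^ and L. In most shells you can enter
it by typing Ctrl-V followed by Ctrl-L. Note that the pattern as written
matches whole numbers only, so a value such as 25.5 is not reported.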
Using the awk manual at
http://www.cs.unibo.it/~renzo/doc/awk/nawkA4.pdf
as a test document, I get the following output:
1995 on p.1
1993 on p.2
1985 on p.3
1991 on p.5
1991 on p.5
675 on p.5
675 on p.10
1989 on p.10
45 on p.13
119 on p.13
...etc
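
If typing a literal Ctrl-L is awkward, the same idea can be sketched
entirely in awk, using the \f escape for the form feed. This is only a
rough variant under a few assumptions: over_limit is a made-up helper
name, the limit defaults to 25, decimals such as 25.5 are also accepted,
and a full stop at the end of a number is dropped. Because of those
changes its output will not necessarily match the list above.

```sh
#!/bin/sh
# over_limit is a hypothetical helper, not part of pdftotext or awk.
# It reads pdftotext-style text on stdin and prints every number above
# the limit given as $1 (default 25), with the page it was found on.
over_limit() {
    awk -v limit="${1:-25}" '
    BEGIN { page = 1 }
    {
        # Every form feed starts a new page; count them and blank them out.
        page += gsub(/\f/, " ")
        # Split the line into words on spaces and the same punctuation
        # characters that the tr command above uses.
        n = split($0, word, /[ ,():;"-]+/)
        for (i = 1; i <= n; i++) {
            w = word[i]
            sub(/\.$/, "", w)   # drop a sentence-ending full stop
            if (w ~ /^[0-9]+(\.[0-9]+)?$/ && w + 0 > limit + 0)
                print w " on p." page
        }
    }'
}
```

It would be used in the same way as the pipeline above, for example
pdftotext yourfile.pdf - | over_limit 25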
///Peter