Copy short description of each chapter to txt

get rid of you

Feb 7, 2012, 11:12:51 PM

Hello Folks !

I have around 40 PDF documents, all of them indexed (left menu
chapters pointing to the exact page where the text starts). Each file
has around 10 chapters.

I need to copy the first 50 words (for example) of each chapter and
move them to an Excel, Word or text file, so I can create a "preview"
of each chapter.

Does anyone have any idea how I can do this with as little effort as
possible?

Anything that would copy a certain amount of text and move it to
another place automatically would be OK!

Thanks !!

Peter Flynn

Feb 8, 2012, 4:04:52 PM

On 08/02/12 04:12, get rid of you wrote:
> Hello Folks !
>
> I have around 40 PDF documents, all of them indexed (left menu
> chapters pointing to the exact page where the text starts). Each file
> has around 10 chapters.

What is a "left menu chapter"? Is this some kind of table of contents?

> I need to copy the first 50 words (for example) of each chapter and
> move them to an Excel, Word or text file, so I can create a "preview"
> of each chapter.
>
> Does anyone have any idea how I can do this with as little effort as
> possible?

You mean you want to extract the first ~50 words of each chapter
(excluding the chapter title)?

pdftotext can give you the text, with ^L (form feed) characters at the
page breaks, so if a script could pick up the chapter start page numbers
from the ToC, it would be possible to scan through the extracted text
with (e.g.) awk or perl and snip off the first 50 words of the correct
pages.
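
To make that concrete, here is a rough sketch rather than a finished tool. It assumes the extracted text arrives on standard input with one form feed per page break, and that the page number is already known. The function name first_words_of_page is made up for this example.

```shell
# first_words_of_page PAGE COUNT
# Reads pdftotext output on stdin and prints the first COUNT words of
# page PAGE. Assumption: pages are separated by form feed (^L) characters.
first_words_of_page() {
    awk -v page="$1" -v count="$2" '
        BEGIN { RS = "\f" }                 # one awk record per page
        NR == page {
            n = (NF < count) ? NF : count   # the page may hold fewer words
            out = ""
            for (i = 1; i <= n; i++)
                out = out (i > 1 ? " " : "") $i
            print out
            exit
        }'
}
```

Typical use would be something like `pdftotext in.pdf - | first_words_of_page 3 50`. Headers, footers and the chapter title are not skipped here, so the result still needs checking against a real document.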

But without knowing how much other stuff there is (running headers,
running footers, etc), it's not possible to be more accurate. pdftk may
also be able to help.

If you have the master documents from which the PDFs were created, that
would probably be much easier to work with.

///Peter

get rid of you

Feb 8, 2012, 9:17:57 PM

Peter, thank you very much. I didn't know the pdftotext tool.

After a few tests, I'm exactly at the point you mentioned, with the
command below.

pdftotext -f 3 -l 3 in.pdf out.txt

But it returns the entire page. I'm looking for a way to extract only
the first words. Any suggestion on this is really appreciated, since I
don't have much knowledge of Perl.

Just an FYI: the PDF documents have footer text and a footer page number.
The left menu I mentioned is the PDF bookmarks (TOC) pointing to
the corresponding page.

Thanks !

rpresser

Feb 9, 2012, 1:23:44 AM

Bring in some other unix utilities. For example:

pdftotext -f 3 -l 3 in.pdf - | fmt -w 1 | head -n 50 | fmt -w 2500 > page3.txt

Or, you could write a short perl script. But you'd have to ask a perl monk for help with that; I'm not one.
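
A variant of the same idea that avoids fmt, offered as a sketch only: fmt may keep blank lines between paragraphs (which head would then count as words), and some implementations limit the width you can ask for. The function name first_words is invented for this example; tr, sed, head and paste are standard utilities.

```shell
# first_words COUNT
# Reads text on stdin and prints its first COUNT words on one line.
first_words() {
    tr -s '[:space:]' '\n' |   # one word per line; form feeds count as space
        sed '/^$/d' |          # drop the empty line left by leading whitespace
        head -n "$1" |         # keep the first COUNT words
        paste -s -d ' ' -      # join them back with single spaces
}
```

For example: `pdftotext -f 3 -l 3 in.pdf - | first_words 50 > page3.txt`.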

BAlheit

Feb 10, 2012, 6:09:48 AM

This may be possible with Acrobat and JavaScript.

Peter Flynn

Feb 11, 2012, 10:50:30 AM

On 09/02/12 02:17, get rid of you wrote:
> Peter, thank you very much. I didn't know the pdftotext tool.
>
> After a few tests, I'm exactly at the point you mentioned, with the
> command below.
>
> pdftotext -f 3 -l 3 in.pdf out.txt
>
> But it returns the entire page. I'm looking for a way to extract only
> the first words. Any suggestion on this is really appreciated, since I
> don't have much knowledge of Perl.

Me neither; I use awk mostly for this kind of thing. And the other Unix
utilities.

Without seeing one of your actual documents, I don't know what needs
doing to skip any non-paragraph text...

> Just an FYI: the PDF documents have footer text and a footer page number.

...but it sounds like one line immediately before each pagebreak needs
to be skipped.

> The left menu I mentioned is the PDF bookmarks (TOC) pointing to
> the corresponding page.

Those aren't part of the PDF document *text* -- they are generated from
marks inside the PDF markup, and they are not accessible to pdftotext:
you would need to buy the Adobe PDF API and write a program to access them.
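
One free route that may be worth testing before buying anything is pdftk, whose dump_data operation reports bookmarks as well as other metadata. The sketch below assumes the report contains BookmarkTitle, BookmarkLevel and BookmarkPageNumber lines, in that order for each bookmark. Check that against your pdftk version. The function name list_chapters is invented for this example.

```shell
# list_chapters
# Reads a pdftk dump_data report on stdin and prints "page<TAB>title"
# for every top-level bookmark.
# Assumption: each bookmark lists its title and level before its page number.
list_chapters() {
    awk '
        /^BookmarkTitle: /      { title = substr($0, 16) }   # text after the label
        /^BookmarkLevel: /      { level = $2 }
        /^BookmarkPageNumber: / { if (level == 1) printf "%s\t%s\n", $2, title }
    '
}
```

It would be used as `pdftk in.pdf dump_data | list_chapters`, and the page numbers it prints could then be passed to pdftotext with -f and -l.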

So assuming your pdftotext command produces a file full of lines of
text, each line may be a whole paragraph (or heading, or list item,
etc.), or the file may have been created with linebreaks preserved, in
which case you have to keep concatenating lines until you reach one
ending in a period, and assume that is the end of the paragraph. The
following awk script does that and prints the first 50 words of each
paragraph.

pdftotext -f 3 -l 3 in.pdf - | awk '
  { line = (line == "" ? $0 : line " " $0) }   # join wrapped lines
  /\.$/ {                                       # a period ends the paragraph
    n = split(line, w); if (n > 50) n = 50      # at most 50 words
    for (i = 1; i <= n; i++) printf "%s%s", w[i], (i < n ? " " : "\n")
    line = ""
  }'

You will need to do more to skip over end-of-page material, and you will
need some way to detect which lines are headings and which are actual
paragraphs.
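
For the end-of-page material, something along these lines might do as a starting point. It is a sketch that assumes the footer is simply the last one or more non-blank lines before each form feed, which I cannot confirm without seeing the files. The function name strip_footer is made up for this example.

```shell
# strip_footer [LINES]
# Reads pdftotext output on stdin and removes the last LINES non-blank
# lines (default 1) of every page, keeping the form feed page separators.
strip_footer() {
    awk -v drop="${1:-1}" '
        BEGIN { RS = "\f" }                  # one awk record per page
        {
            n = split($0, lines, "\n")
            left = drop
            # Walk back from the end of the page: skip blank lines and
            # discard non-blank ones until the footer quota is used up.
            while (n > 0 && (lines[n] ~ /^[ \t]*$/ || left > 0)) {
                if (lines[n] !~ /^[ \t]*$/) left--
                n--
            }
            page = ""
            for (i = 1; i <= n; i++) page = page lines[i] "\n"
            printf "%s\f", page
        }'
}
```

With footer text plus a page number on each page, `pdftotext in.pdf - | strip_footer 2 > body.txt` would be the thing to try first.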

///Peter


Thomas Kaiser

Feb 12, 2012, 2:49:43 PM

Peter Flynn wrote in <news:9pnh26...@mid.individual.net>

> On 09/02/12 02:17, get rid of you wrote:
>> The left menu I mentioned is the PDF bookmarks (TOC) pointing to
>> the corresponding page.
>
> Those aren't part of the PDF document *text* -- they are generated
> from marks inside the PDF markup, and they are not accessible to
> pdftotext: you would need to buy the Adobe PDF API and write a program
> to access them.

You could also use Acrobat Pro and Scripting. On a Mac it would look
like

tell application "Adobe Acrobat Pro"
get every bookmark of front document
end tell

in AppleScript. The same thing is possible with ECMAScript on Windows.

Regards,

Thomas