On 09/02/12 02:17, get rid of you wrote:
> Peter, thank you very much. I didn't know the pdftotext tool.
>
> Few tests and I'm exactly at the point you mentioned with the command
> below.
>
> pdftotext -f 3 -l 3 in.pdf out.txt
>
> But it returns the entire page. I'm looking for a way to strip first
> words. Any suggestion on this is really appreciated since i have no
> much knowledge on perl.
Me neither; I mostly use awk and the other Unix utilities for this kind
of thing.
Without seeing one of your actual documents, I don't know what needs
doing to skip any non-paragraph text...
> Just an f.y.i. - The PDF documents has footer text and footer page.
...but it sounds like one line immediately before each pagebreak needs
to be skipped.
> The left menu i have mentioned are the pdf bookmarks (TOC) pointing to
> the corresponding page.
Those aren't part of the PDF document *text* -- they are generated from
marks inside the PDF markup, and they are not accessible to pdftotext:
you would need to buy the Adobe PDF API and write a program to access them.
So assuming your pdftotext command produces a file full of lines of
text, each line may be a whole paragraph (or heading, or list item, etc.),
or the file may have been created with linebreaks preserved, in which
case you have to keep concatenating lines until you reach one ending in
a period, and assume that is the end of the paragraph. The following awk
script does that and prints the first 50 words of each paragraph.
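pdftotext -f 3 -l 3 in.pdf - | awk '
# Join wrapped lines, separated by a space, until one ends in a period.
{ line = (line == "" ? $0 : line " " $0) }
/\.$/ { n = split(line, w); if (n > 50) n = 50   # keep at most 50 words
        for (i = 1; i <= n; ++i) printf "%s%s", w[i], (i < n ? " " : "\n")
        line = "" }'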
You will need to do more to skip over end-of-page material, and you will
need some way to detect which lines are headings and which are actual
paragraphs.
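
As a starting point for the end-of-page material, here is a rough sketch
that drops the last non-blank line of every page. It assumes the footer is
exactly one line and that pages are separated by a form feed character,
which is what pdftotext normally writes at each page break. The names
strip_footers and emit_page are just ones I made up for illustration, and
I have not run this against your documents, so adjust it to what you
actually see in the output.

```shell
# Hypothetical helper: drop the last non-blank line of each page.
# Assumes pages are separated by form feeds (\f) and the footer is one line.
strip_footers() {
  awk '
    function emit_page(   i, last) {
      last = n
      while (last > 0 && page[last] ~ /^[ \t]*$/) last--   # skip trailing blanks
      for (i = 1; i < last; i++) print page[i]             # omit line "last"
      n = 0
    }
    /^\f/ { emit_page(); sub(/^\f/, "") }   # a form feed starts a new page
    { page[++n] = $0 }                      # buffer the current page
    END { emit_page() }
  '
}
```

You would put it between the two commands, as in
pdftotext in.pdf - | strip_footers | awk '...', so the paragraph script
never sees the footers. A page with only one non-blank line is dropped
entirely, which may or may not be what you want.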
///Peter