insert blank line (or any other mark) between paragraphs

Enrico Segre

Apr 25, 2011, 4:47:17 PM

to tesseract-ocr

I'm trying to use tesseract to provide content for Project
Gutenberg. There, the proofing workflow requires that one blank line be
inserted between recognized paragraphs, paragraphs being defined
by a change in the indentation of their first line w.r.t. the body text.

I found this old post:

http://groups.google.com/group/tesseract-ocr/browse_thread/thread/34ab77d8dd1636e3/35e59c6a67661ee3?lnk=gst&q=paragraph#35e59c6a67661ee3

Am I understanding correctly that the situation hasn't changed since
then, or is there a way?

Enrico

Dmitri Silaev

Apr 25, 2011, 11:12:07 PM

to tesser...@googlegroups.com, Enrico Segre

From what I could find, Tesseract does paragraph breaking for hOCR output.
As far as I know, there are hOCR-based tools that can be used for Project Gutenberg.

Warm regards,
Dmitri Silaev
www.CustomOCR.com


Robert Komar

Apr 26, 2011, 12:46:42 AM

to tesseract-ocr

I ended up hacking TessBaseAPI::GetUTF8Text() in api/baseapi.cpp
to add a linefeed before indented text. It is a very simple hack,
and will probably fail with poetry and other ragged-left layouts,
but it gets most of the simple prose paragraphs right. It also
has the problem of not working when applied to revisions of tesseract
where the block detection code has changed behaviour (like it has under
the current revision 581). I know that it works under revision 549,
so if you check that out and apply the attached patch, you should
get a blank line appearing before each indented line.

Cheers,
Rob Komar

baseapi.cpp.diff
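
The indentation heuristic described above can be illustrated outside of tesseract. What follows is only a minimal Python sketch, not the attached patch: it assumes each recognized line is already available as a (left_x, text) pair, and the function name insert_paragraph_breaks, the min_indent threshold, and the use of the median left edge as the body margin are all assumptions made for illustration.

```python
from statistics import median


def insert_paragraph_breaks(lines, min_indent=15):
    """Join OCR lines, putting a blank line before each indented line.

    lines      -- list of (left_x, text) tuples in reading order; left_x is
                  the left edge of the line's bounding box in pixels
                  (hypothetical input format, not a tesseract structure)
    min_indent -- assumed threshold: how far right of the body margin a
                  line must start to count as a paragraph opener
    """
    if not lines:
        return ""
    # Assumption: in simple prose most lines are unindented, so the median
    # left edge is a reasonable estimate of the body margin.
    margin = median(x for x, _ in lines)
    out = []
    for i, (x, text) in enumerate(lines):
        # Never emit a blank line before the very first line.
        if i > 0 and x - margin >= min_indent:
            out.append("")
        out.append(text)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    page = [
        (130, "First paragraph starts"),
        (100, "and continues"),
        (100, "to its end."),
        (131, "Second paragraph"),
        (101, "continues."),
    ]
    print(insert_paragraph_breaks(page), end="")
```

Like the hack it illustrates, this will misfire on poetry and other ragged-left layouts, and on any page where the left edges are unreliable.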

Enrico Segre

Apr 30, 2011, 6:56:51 PM

to tesseract-ocr

I have tried tesseract .... hocr on a couple of test pages. I
understand the idea, but have the impression that the marked-up output
contains far more paragraphs (almost every line is a paragraph) than
I would expect. Are you perhaps aware of some config variable that
sets a tolerance threshold?

Also, I don't know what to do to filter the hocr output to plain text
+ additional line break. I've looked in hocr-tools; hocr-as-no-html is
listed as "possible", not even "planned".

Do you have refs for the "hOCR-based tools can be used for Project
Gutenberg" you mentioned?

Enrico
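
For reference, filtering hOCR down to plain text with a blank line between paragraphs can be sketched with the Python standard library alone. This is a minimal sketch, not an existing hocr-tools script: it assumes well-formed hOCR in which paragraphs carry class ocr_par, lines carry class ocr_line, and words inside a line are separated by whitespace. The names HocrToText and hocr_to_text are made up for the example.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag; they are skipped so that
# the open-element stack stays balanced.
VOID = {"br", "meta", "link", "img", "hr", "input"}


class HocrToText(HTMLParser):
    """Collect the text of ocr_line elements, grouped by ocr_par."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # list of paragraphs, each a list of lines
        self._stack = []      # "par", "line" or "" for every open element
        self._line = None     # text fragments of the line being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        kind = ""
        if "ocr_par" in classes:
            kind = "par"
            self.paragraphs.append([])
        elif "ocr_line" in classes:
            kind = "line"
            if not self.paragraphs:  # line outside of any paragraph
                self.paragraphs.append([])
            self._line = []
        self._stack.append(kind)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        if self._stack.pop() == "line":
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._line).split())
            self.paragraphs[-1].append(text)
            self._line = None

    def handle_data(self, data):
        if self._line is not None:
            self._line.append(data)


def hocr_to_text(hocr):
    """Return plain text with one blank line between hOCR paragraphs."""
    parser = HocrToText()
    parser.feed(hocr)
    parser.close()
    return "\n\n".join("\n".join(p) for p in parser.paragraphs if p) + "\n"
```

A call such as hocr_to_text(open("page.html", encoding="utf-8").read()) would then give the text, with page.html standing in for whatever file the hocr run produced. The result is only as good as the paragraph markup in the input, so it does nothing about the over-segmentation described above.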


Robert Komar

Apr 30, 2011, 10:28:02 PM

to tesseract-ocr

On Sat, 30 Apr 2011, Enrico Segre wrote:

> Thx. I tried your patch in rev 581, and on my test page it worked like
> I expected only if I cropped the image very close to the left text
> margin. With a larger margin, a line break is inserted at every line.
> It may well have to do with the changed behavior you mention, and
> might be reflected in the (unpatched) hocr output I've seen. I used a
> scanned book image as test, so it may also be that image dirt in the
> left margin fools the layout detection, but I think it is less likely.

The code for detecting blocks seems to be broken again in rev 581.
That's probably why the hocr output is wrong, as well. If you
check out rev 549, the patch should work properly there (use
"svn update -r 549"). Or you can wait a bit and the block detection
code will probably be fixed again sometime soon.

>
> Also, I realize that for PG I would like a blank line as well between
> unindented paragraphs, if there is white space between them (thought
> breaks) - but that is not what I asked in the first place.

Then you should probably wait for the hocr output to work, or hack
the GetUTF8Text() method in baseapi.cpp yourself to use the
IsParagraphBreak() method. My simple patch definitely won't handle
that correctly.

Rob
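
To make the "thought break" case concrete, here is a rough Python sketch of a rule that flags a paragraph break either on indentation or on extra vertical white space. It does not call IsParagraphBreak(), whose signature is not shown in this thread. The input format (left, top, bottom, text), the function name break_before, and both thresholds are assumptions for illustration only.

```python
from statistics import median


def break_before(lines, min_indent=15, gap_factor=1.5):
    """Return one boolean per line: True if a blank line should precede it.

    lines      -- list of (left, top, bottom, text) tuples in reading order,
                  coordinates in pixels with y growing downwards
                  (hypothetical input format)
    min_indent -- assumed indentation threshold relative to the body margin
    gap_factor -- assumed ratio: a vertical gap larger than this multiple
                  of the typical gap counts as a thought break
    """
    if not lines:
        return []
    # Body margin estimated as the median left edge (assumption).
    margin = median(line[0] for line in lines)
    # Vertical white space between each pair of consecutive lines.
    gaps = [cur[1] - prev[2] for prev, cur in zip(lines, lines[1:])]
    typical_gap = median(gaps) if gaps else 0
    flags = [False]  # never break before the first line
    for cur, gap in zip(lines[1:], gaps):
        indented = cur[0] - margin >= min_indent
        wide_gap = gap > gap_factor * max(typical_gap, 1)
        flags.append(indented or wide_gap)
    return flags


if __name__ == "__main__":
    page = [
        (100, 0, 20, "a"),
        (100, 30, 50, "b"),
        (100, 90, 110, "c"),   # unindented, but preceded by a wide gap
        (130, 120, 140, "d"),  # indented paragraph opener
        (100, 150, 170, "e"),
    ]
    print(break_before(page))
```

Whether the wide-gap rule is reliable depends on how consistent the line spacing of the scan is; the 1.5 factor is a guess, not a tuned value.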

Enrico Segre

Apr 30, 2011, 7:32:28 PM

to tesseract-ocr

Thx. I tried your patch in rev 581, and on my test page it worked like
I expected only if I cropped the image very close to the left text
margin. With a larger margin, a line break is inserted at every line.
It may well have to do with the changed behavior you mention, and
might be reflected in the (unpatched) hocr output I've seen. I used a
scanned book image as test, so it may also be that image dirt in the
left margin fools the layout detection, but I think it is less likely.

Also, I realize that for PG I would like a blank line as well between
unindented paragraphs, if there is white space between them (thought
breaks) - but that is not what I asked in the first place.

Enrico

