Warm regards,
Dmitri Silaev
www.CustomOCR.com
I ended up hacking TessBaseAPI::GetUTF8Text() in api/baseapi.cpp
to add a linefeed before indented text. It is a very simple hack
and will probably fail on poetry and other ragged-left layouts,
but it gets most simple prose paragraphs right. It also breaks on
revisions of Tesseract where the block detection code behaves
differently, as it does in the current revision 581. I know that it
works under revision 549, so if you check that out and apply the
attached patch, you should get a blank line before each indented
line.
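
For readers without the attachment, here is a rough, standalone sketch of
the idea. It is not the patch itself and it does not touch Tesseract:
the TextLine struct, the JoinWithParagraphBreaks() name, and the
min_indent threshold are all made up for illustration. It assumes you
already have each line's left edge and text from somewhere.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one recognized text line; not a Tesseract type.
struct TextLine {
  int left;          // x coordinate of the line's left edge, in pixels
  std::string text;  // recognized text, without a trailing newline
};

// Joins the lines, inserting an extra linefeed before every line whose left
// edge lies more than min_indent pixels to the right of the leftmost line.
// min_indent is an assumed tuning knob, not something taken from the patch.
std::string JoinWithParagraphBreaks(const std::vector<TextLine>& lines,
                                    int min_indent) {
  if (lines.empty()) return "";

  // Treat the leftmost line edge as the block's left margin.
  int margin = lines.front().left;
  for (const TextLine& line : lines) margin = std::min(margin, line.left);

  std::string out;
  for (std::size_t i = 0; i < lines.size(); ++i) {
    const bool indented = lines[i].left - margin > min_indent;
    if (indented && i > 0) out += "\n";  // blank line before the indented line
    out += lines[i].text;
    out += "\n";
  }
  return out;
}
```

A heuristic like this depends entirely on the margin estimate: if the
margin is placed too far left, every line looks indented and gets a
blank line in front of it.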
Cheers,
Rob Komar
> Thx. I tried your patch in rev 581, and on my test page it worked like
> I expected only if I cropped the image very close to the left text
> margin. With a larger margin, a line break is inserted at every line.
> It may well have to do with the changed behavior you mention, and
> might be reflected in the (unpatched) hocr output I've seen. I used a
> scanned book image as test, so it may also be that image dirt in the
> left margin fools the layout detection, but I think it is less likely.
The code for detecting blocks seems to be broken again in rev 581.
That's probably why the hocr output is wrong, as well. If you
check out rev 549, the patch should work properly there (use
"svn update -r 549"). Or you can wait a bit and the block detection
code will probably be fixed again sometime soon.
>
> Also, I realize that for PG I would like a blank line as well between
> unindented paragraphs, if there is white space between them (thought
> breaks) - but that is not what I asked in the first place.
Then you should probably wait for the hocr output to work, or hack
the GetUTF8Text() method in baseapi.cpp yourself to use the
IsParagraphBreak() method. My simple patch definitely won't handle
that correctly.
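
To show the kind of check that would be needed, here is a rough sketch
that breaks paragraphs on vertical white space instead of indentation.
It does not call IsParagraphBreak() and makes no claim about how that
method decides; the LineBox struct, JoinWithGapBreaks(), and the
gap_factor threshold are all hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical line bounding box; not a Tesseract type.
struct LineBox {
  int top;           // y coordinate of the top of the line, in pixels
  int bottom;        // y coordinate of the bottom of the line, in pixels
  std::string text;  // recognized text, without a trailing newline
};

// Joins the lines, inserting an extra linefeed wherever the vertical gap to
// the previous line exceeds gap_factor times the previous line's height.
// gap_factor is an assumed threshold that would need tuning per document.
std::string JoinWithGapBreaks(const std::vector<LineBox>& lines,
                              double gap_factor) {
  std::string out;
  for (std::size_t i = 0; i < lines.size(); ++i) {
    if (i > 0) {
      const int gap = lines[i].top - lines[i - 1].bottom;
      const int height = lines[i - 1].bottom - lines[i - 1].top;
      if (gap > gap_factor * height) out += "\n";  // white space between paragraphs
    }
    out += lines[i].text;
    out += "\n";
  }
  return out;
}
```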
Rob