Re: preserving spaces

Nick White

Jun 10, 2013, 8:37:28 AM
to tesser...@googlegroups.com
Hi Eric,

Thanks for this posting. Out of curiosity, why do you need to
preserve multiple spaces?

Do you think you could update the code to allow a new configuration
variable? If you did, and posted the patch to the issues page, I
expect it would be accepted, as this sounds like the sort of thing
that is useful to be able to do.

Nick

On Sat, Jun 08, 2013 at 01:50:11PM -0700, emw...@gmail.com wrote:
> I found the code Ray referred to back in '09. It is now in GetUTF8Text(). In
> baseapi.cpp in TessBaseAPI::GetUTF8Text I changed:
>
> *ptr++ = ' ';
>
> to
>
> {
> int i ;
> for ( i = 0 ; i < word->word->space() ; i++ )
> *ptr++ = ' ';
> }
>
> This added back in the multiple spaces as advertised. The results are a bit
> unpredictable (as Ray warned back in '09).
>
> I'll keep poking at it.
>
> Eric
>
>
> On Saturday, June 8, 2013 10:37:20 AM UTC-4, emw...@gmail.com wrote:
>
> I need to maintain the (multiple) spaces in my output document. About 5
> years ago someone asked how to do this and Ray posted a suggestion. That
> suggestion does not appear to correspond to the current source code.
>
> Can anyone suggest how I can maintain word spacing both before the first
> word on a line (indentation) as well as between words within a line?
>
> I can force the text in the input image to have fixed spacing.
>
> Ideally, there is a command line switch or a config item that will do what
> I need, but I am not averse to modifying the code if necessary.
>
> Thanks,
> Eric
>
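For readers following along, here is a minimal, self-contained sketch of the idea in the quoted patch, gated behind a flag as suggested above. It is not Tesseract code: `Word`, `JoinLine`, and `preserve_spaces` are hypothetical names, and `leading_spaces` stands in for whatever `word->word->space()` returned in the patch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a recognized word: its text plus the number of
// blanks counted in front of it (the role played by word->word->space()).
struct Word {
  std::string text;
  int leading_spaces;
};

// Join the words of one line.
// preserve_spaces == false: one blank between words, no indentation.
// preserve_spaces == true:  write out the full blank count before each word.
std::string JoinLine(const std::vector<Word>& words, bool preserve_spaces) {
  std::string out;
  for (size_t i = 0; i < words.size(); ++i) {
    int n = preserve_spaces ? std::max(words[i].leading_spaces, 0)
                            : (i > 0 ? 1 : 0);
    out.append(static_cast<size_t>(n), ' ');
    out += words[i].text;
  }
  return out;
}
```

Note that with the flag on, a count of zero lets two words run together, which may be one source of the unpredictable results mentioned in the quoted message.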

Shree Devi Kumar

Jun 12, 2013, 11:18:24 PM
to tesser...@googlegroups.com
This is something that I would like to use too.

In my testing so far, I found that the inter-word space is sometimes dropped altogether for Hindi (at random, as far as I can tell).

And if a paragraph starts with indentation, the segmentation goes wrong and that line is not recognized correctly.

Maybe there are some config variables that I need to tweak to fix this.

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

raju.shar...@gmail.com

Aug 7, 2013, 1:13:10 PM
to tesser...@googlegroups.com
Hi Eric,
I am unable to compile the project in VS 2005 and I do not have VS 2008, so I need your help. Could you please share the compiled binary (or DLL) that includes your multiple-spaces change to the GetUTF8Text() method in baseapi.cpp?
Thanks.

Shik

Feb 23, 2014, 6:26:36 AM
to tesser...@googlegroups.com
This is exactly the issue I have. Words that are separated by spaces in the image are run together in the output, although line breaks are reproduced correctly.
Also, I couldn't locate the code that needs to be changed, as suggested above, in TessBaseAPI::GetUTF8Text in baseapi.cpp.
I'm using Tesseract 3.02.02.
Is there a way to fix this?

Ian Carroll

Jan 19, 2015, 11:10:04 PM
to tesser...@googlegroups.com
I do not know how its capabilities compare to Tesseract in general, but gocr preserved the whitespace I needed in a very simple table layout. The project's webpage is at http://jocr.sourceforge.net/.

hsnn...@gmail.com

Oct 26, 2018, 10:17:43 AM
to tesseract-ocr
Hi, thanks for your answer, but where can I find the baseapi.cpp file?

Aditya Singh

Sep 5, 2019, 7:49:41 AM
to tesseract-ocr
The GetUTF8Text function in baseapi.cpp has since been changed to the following:
/** Make a text string from the internal data structures. */
char* TessBaseAPI::GetUTF8Text() {
  if (tesseract_ == NULL ||
      (!recognition_done_ && Recognize(NULL) < 0))
    return NULL;
  STRING text("");
  ResultIterator *it = GetIterator();
  do {
    if (it->Empty(RIL_PARA)) continue;
    const std::unique_ptr<const char[]> para_text(it->GetUTF8Text(RIL_PARA));
    text += para_text.get();
  } while (it->Next(RIL_PARA));
  char* result = new char[text.length() + 1];
  strncpy(result, text.string(), text.length() + 1);
  delete it;
  return result;
}

So there is no *ptr++ = ' ' left to replace. It would be great if anyone could tell me how to approach this problem.
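
One possible approach, offered only as a sketch: since the text is now assembled through a ResultIterator, the spacing could be rebuilt outside GetUTF8Text from word positions. The code below is standalone and does not call Tesseract; `WordBox`, `LayoutLine`, `line_left`, and `char_width` are hypothetical names. It assumes fixed-pitch ASCII text (the original poster said the input could be forced to fixed spacing) and that each word's left edge has already been obtained, for example from the iterator's BoundingBox at RIL_WORD level.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record for one recognized word on a line: its text and the
// left pixel edge of its bounding box.
struct WordBox {
  std::string text;
  int left;
};

// Rebuild one line of fixed-pitch text.
// line_left:  x position (pixels) that corresponds to column 0.
// char_width: assumed width of one character cell in pixels, must be > 0.
// Column counts use byte length, so this is only correct for ASCII text.
std::string LayoutLine(const std::vector<WordBox>& words, int line_left,
                       int char_width) {
  std::string out;
  for (const WordBox& w : words) {
    // Column where this word should start, rounded to the nearest cell.
    int col = (w.left - line_left + char_width / 2) / char_width;
    int have = static_cast<int>(out.size());
    // Pad up to that column, keeping at least one blank between words.
    int pad = std::max(col - have, out.empty() ? 0 : 1);
    out.append(static_cast<size_t>(pad), ' ');
    out += w.text;
  }
  return out;
}
```

This keeps both indentation and inter-word gaps, at the cost of having to estimate `char_width` yourself. Recent Tesseract releases also list a `preserve_interword_spaces` variable among their parameters, so it may be worth trying that first before writing any code.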