Is there a control param for tesseract to disable line breaks within a paragraph?


Bruce

Aug 6, 2014, 11:51:34 AM
to tesser...@googlegroups.com
For example with the image attached, I get the output:
  • Chapter One

  • A royal-red Ford F—150 Super-
  • Crew rolled through the streets
  • of Albany, Georgia. The pickup’s
  • driver brimmed with optimism, so
  • much that he couldn’t possibly
  • foresee the battles about to hit
  • his hometown.

  • Life here is going to be good,
  • thirty—seven—year—old Nathan
  • Hayes told himself. After eight
  • years in Atlanta, Nathan had
  • come home to Albany, three
  • hours south, with his wife and
Is there a way to get output like the following, without the line breaks inside each paragraph?
  • Chapter One

  • A royal-red Ford F—150 Super-Crew rolled through the streets of Albany, Georgia. The pickup’s driver brimmed with optimism, so much that he couldn’t possibly foresee the battles about to hit his hometown.

  • Life here is going to be good, thirty—seven—year—old Nathan Hayes told himself. After eight years in Atlanta, Nathan had come home to Albany, three hours south, with his wife and
Thanks in advance!
upload.jpg

Quan Nguyen

Aug 6, 2014, 7:23:27 PM
to tesser...@googlegroups.com
I'm afraid not. You can use any programming editor that supports Regex find/replace to do it for you, or use a tool such as VietOCR to remove line breaks from the output text.

Bruce

Aug 8, 2014, 11:49:27 AM
to tesser...@googlegroups.com
Aww. I tried removing the '\n' line breaks manually, but in some articles the paragraph break is itself a single '\n', so a plain find/replace removes the paragraph breaks too. How did VietOCR solve this issue?

Quan Nguyen

Aug 8, 2014, 9:32:00 PM
to tesser...@googlegroups.com
It uses a suitable regex. Here is the Java function it uses:

/**
 * Removes line breaks within paragraphs.
 *
 * @param text the text to process
 * @return the text with each single line break replaced by a space
 */
public static String removeLineBreaks(String text) {
    return text.replaceAll("(?<=\n|^)[\t ]+|[\t ]+(?=$|\n)", "")
            .replaceAll("(?<=.)\n(?=.)", " ");
}

Bruce

Aug 10, 2014, 6:26:14 AM
to tesser...@googlegroups.com
It works wonderfully. Could you explain the regex in more detail? I can't work out what the first regex is matching.

Thanks again for sharing your wonderful solution!!

Satya Swaroop

Sep 18, 2014, 7:10:54 AM
to tesser...@googlegroups.com

I'm looking for the same solution on iOS. Could you please share one?

Quan Nguyen

Sep 18, 2014, 9:24:11 PM
to tesser...@googlegroups.com
The first replaceAll trims all tab and space characters from the beginning and end of each line, and of the text as a whole.

The second replaceAll replaces each single line-feed (newline) character, meaning one with a non-newline character on both sides, with a space, effectively joining the lines. Runs of two or more LF characters are left untouched, so if you want to retain a paragraph break, mark it with two or more LFs.
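
To see the two passes at work, here is a small self-contained sketch. The class name LineBreakDemo and the sample string are made up for illustration and are not part of VietOCR; the regexes are the same ones as in the function quoted earlier in this thread.

```java
public class LineBreakDemo {

    // Same two-pass approach as the removeLineBreaks function quoted in the thread.
    public static String removeLineBreaks(String text) {
        return text
                // Pass 1: strip tabs/spaces at the start and end of every line.
                .replaceAll("(?<=\n|^)[\t ]+|[\t ]+(?=$|\n)", "")
                // Pass 2: turn an LF into a space only when both neighbours are
                // non-newline characters, so blank lines between paragraphs survive.
                .replaceAll("(?<=.)\n(?=.)", " ");
    }

    public static void main(String[] args) {
        // Hypothetical OCR-like input: stray indentation, trailing whitespace,
        // and paragraphs separated by a blank line (two LFs).
        String ocr = "Chapter One\n\n"
                + "  A royal-red Ford  \n"
                + "rolled through the streets\t\n"
                + "of Albany, Georgia.\n\n"
                + "Life here is going\n"
                + "to be good,";
        System.out.println(removeLineBreaks(ocr));
    }
}
```

Running it should print three paragraphs separated by blank lines, each on a single line. Two caveats: a paragraph break that is only a single LF gets joined like any other line break, and a word hyphenated across lines (such as "Super-" followed by "Crew") comes out as "Super- Crew", because the newline is replaced by a space rather than removed.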