Improve formatting of Bullet Points / Lists


Rarity

Jun 2, 2015, 8:39:28 AM
to tesser...@googlegroups.com
Hello Tesseract-OCR community,

I am very happy with the quality of the conversions. However, when I OCR numbered or bulleted lists, the formatting of the output text file is wrong: the file first lists the item numbers on their own lines, and only then the text of the items.
This is not really an OCR issue, since all the text is recognized correctly, but I would like to know whether I can fix the formatting as well.


I cannot share snippets of the documents where this went wrong, but here is a representative example:


Input file:

Lorem ipsum dolor sit am.
  1. Ex vero phaedrum ius. 
  2. appareat patrioque mea. Has at alienum scaevola indoctum
  3.  No his modo quaerendum,
  4.  consul eruditi ex vim.

Output file:

Lorem ipsum dolor sit am.

1. Ex vero phaedrum ius. 
2.
3.
4.

appareat patrioque mea. Has at alienum scaevola indoctum
 No his modo quaerendum,
 consul eruditi ex vim.
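
Since the recognized text itself is correct, output like the above could in principle be repaired in a post-processing step. The following is only a minimal sketch of that idea, not a Tesseract feature: `rejoin_list_markers` is a hypothetical helper, and it assumes that every orphaned marker (a line containing only something like `2.` or `2)`) belongs to the next free text line, in order. That assumption holds for the example above but may not hold for real documents, for instance when a list item wraps over several lines.

```python
import re
from collections import deque

# A line that holds nothing but a list marker such as "2." or "2)".
MARKER_ONLY = re.compile(r"^\s*(\d+[.)])\s*$")


def rejoin_list_markers(text):
    """Pair orphaned list markers with the text lines that follow them.

    Hypothetical helper: assumes one text line per orphaned marker,
    in the same order as the markers appeared.
    """
    pending = deque()  # markers still waiting for their text
    out = []
    for line in text.splitlines():
        match = MARKER_ONLY.match(line)
        if match:
            pending.append(match.group(1))
            continue
        stripped = line.strip()
        if pending:
            if not stripped:
                # Skip blank lines between the markers and their text.
                continue
            out.append(f"{pending.popleft()} {stripped}")
        else:
            out.append(line.rstrip())
    # Markers that never found a text line are kept rather than dropped.
    out.extend(pending)
    return "\n".join(out)
```

Running the helper on the text of the output file above, for example `print(rejoin_list_markers(open("output.txt").read()))` with `output.txt` as a placeholder file name, would put each number back in front of its line.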





Bonus question:
Assuming Google Docs uses Tesseract-OCR, what setup and languages does it use?
The formatting of its output PDFs is gorgeous.