wrong recognition of small lines or points


Filippo Riccio

Dec 14, 2015, 8:42:29 AM
to tesseract-ocr
Hello everybody,

I am testing Tesseract to recognize the characters in the attached picture.

I created a traineddata file with a small number of characters.

My problem is that Tesseract also recognizes the small lines to the left of the first 0 and
under the J as characters. Specifically, the recognized text is F0002HNJH2UF

How can I avoid this? Is it possible to set a minimum size for characters?

Thank you in advance.


text0_F0002HNJH2UF.png

Meh Hem

Mar 4, 2016, 1:17:03 AM
to tesseract-ocr, filr...@gmail.com
There are definitely Tesseract API config variables for that:
textord_heavy_nr = 0 (0 default, 1 is very aggressive)
textord_max_noise_size
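
As a rough sketch of how these two variables could be tried from the command line: the tesseract CLI accepts `-c VAR=VALUE` pairs after the output base. The helper name `tesseract_command`, the output base `result`, and the values used below are assumptions for illustration, not recommended settings. With a custom traineddata you would also pass your own `-l` language name.

```python
import shlex


def tesseract_command(image_path, output_base, variables):
    """Build a tesseract CLI call that sets config variables via -c (hypothetical helper)."""
    command = ["tesseract", image_path, output_base]
    for name, value in variables.items():
        # Each variable becomes its own "-c name=value" pair.
        command += ["-c", f"{name}={value}"]
    return command


# Illustrative values only; tune them against your own images.
command = tesseract_command(
    "text0_F0002HNJH2UF.png",
    "result",
    {"textord_heavy_nr": 1, "textord_max_noise_size": 10},
)
print(shlex.join(command))
```

The list can be handed to `subprocess.run(command, check=True)` once tesseract is installed; the sketch only builds the command and does not run it.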

However, I would simply use OpenCV to remove any blob whose height is below a chosen threshold.

Stephen Lambie

Mar 4, 2016, 2:25:06 PM
to tesseract-ocr, filr...@gmail.com
Which OpenCV function would you use to do that?

Meh Hem

Mar 5, 2016, 10:40:34 AM
to tesseract-ocr, filr...@gmail.com
Some links:

I wasn't entirely impressed by the bounding box method of contour removal, but I did find success with findContours:

Just pick out the contours you want to get rid of (in your case by height, I would say) and replace their black pixels with white.


Looking at that text, I would also consider doing some morphology to make the characters a bit stronger.

I hope this helps