Remove certain characters while fine tuning (training) tesseract

Murtuza Dahodwala

Mar 9, 2021, 2:30:17 AM

to tesseract-ocr

Hello,

Currently, my OCR model detects certain characters like ₹ & |.

Is it possible to remove these characters by correcting my LSTM bounding-box dataset and then fine-tuning the model, so that it no longer detects these symbols in my test images?

Greg Dunkel

Mar 10, 2021, 12:50:31 PM

to tesser...@googlegroups.com

Would it be easier to remove these characters from the output using editing tools?
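
For example, if the OCR output is already available as a Python string, a small post-processing step could drop the symbols. This is only a minimal sketch: it assumes the unwanted characters are ₹ and |, and the names UNWANTED and strip_unwanted are made up for illustration.

```python
# Characters to drop from the OCR output (assumed from the question: ₹ and |).
UNWANTED = "₹|"

# With a third argument, str.maketrans maps each listed character to None,
# so str.translate deletes it from the text.
_DELETE_TABLE = str.maketrans("", "", UNWANTED)


def strip_unwanted(text: str) -> str:
    """Return text with every character listed in UNWANTED removed."""
    return text.translate(_DELETE_TABLE)


if __name__ == "__main__":
    print(strip_unwanted("Total | ₹ 1,250"))
```

This only deletes the symbols themselves; any whitespace around them is left as it was.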

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ecd726d5-8ab0-4986-87b0-7ff344d3271cn%40googlegroups.com.

Murtuza Dahodwala

Mar 10, 2021, 12:52:29 PM

to tesser...@googlegroups.com

I guess that would be manual work. I want the model not to detect them during inference.
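
One inference-time option that needs no retraining might be Tesseract's tessedit_char_blacklist config variable, passed with the -c option. The sketch below is not a tested recipe. It assumes the tesseract binary is on PATH, that the installed version honors the blacklist with the LSTM engine (reportedly 4.1 and later, so please verify), and that page.png is a placeholder file name. The helper names are made up for illustration.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, blacklist: str, lang: str = "eng") -> List[str]:
    """Build a tesseract CLI call that writes text to stdout with some characters blacklisted."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "-c", f"tessedit_char_blacklist={blacklist}",
    ]


def ocr_without(image_path: str, blacklist: str = "₹|", lang: str = "eng") -> str:
    """Run tesseract and return the recognized text (requires tesseract to be installed)."""
    cmd = build_tesseract_command(image_path, blacklist, lang)
    # Passing a list (no shell) avoids having to quote the '|' character.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # "page.png" is a placeholder; point this at a real image to try it.
    print(build_tesseract_command("page.png", "₹|"))
```

If the blacklist turns out not to be respected by a given model or version, fine-tuning with a corrected dataset or post-processing the output would still be the fallback.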
