Train Tesseract to ignore music?


Sara Palmer

Jun 28, 2019, 1:15:59 AM
to tesseract-ocr
I'd like to produce high-quality OCR of books that contain text interspersed with music. Is it possible to train Tesseract to ignore musical notation instead of turning it into junk OCR? How would one go about doing this?

Lorenzo Bolzani

Jun 28, 2019, 3:16:10 AM
to tesser...@googlegroups.com
Hi Sara, can you please post a sample picture? You could probably detect the staff (Hough lines with very tight parameters, or custom horizontal-line detection) and just replace it with a white rectangle.
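
To make the "custom horizontal-line detection" idea concrete, here is a minimal sketch. It is not a Hough transform: it uses a simpler row-projection test (a row counts as a staff line if most of its pixels are dark), groups five or more nearby lines into a staff, and paints a white rectangle over it. The page is assumed to be a deskewed grayscale image held as a plain list of rows; the function names and the thresholds (`min_fill`, `max_gap`, `margin`) are made up for illustration and would need tuning on real scans.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def find_line_rows(img: Image, dark_below: int = 128,
                   min_fill: float = 0.6) -> List[int]:
    """Return indices of rows where at least min_fill of the pixels are dark."""
    rows = []
    for y, row in enumerate(img):
        dark = sum(1 for p in row if p < dark_below)
        if row and dark / len(row) >= min_fill:
            rows.append(y)
    return rows


def group_into_staves(rows: List[int], max_gap: int = 12,
                      min_lines: int = 5) -> List[Tuple[int, int]]:
    """Cluster line rows into (top, bottom) staves with >= min_lines lines."""
    staves: List[Tuple[int, int]] = []
    if not rows:
        return staves
    start = prev = rows[0]
    lines = 1
    for y in rows[1:]:
        if y - prev > max_gap:
            # Gap too large: close the current cluster and start a new one.
            if lines >= min_lines:
                staves.append((start, prev))
            start, lines = y, 1
        elif y - prev > 1:
            # A separate line, not just the next row of a thick line.
            lines += 1
        prev = y
    if lines >= min_lines:
        staves.append((start, prev))
    return staves


def blank_staves(img: Image, margin: int = 4, min_fill: float = 0.6,
                 max_gap: int = 12) -> Image:
    """Return a copy of img with each detected staff painted white."""
    out = [row[:] for row in img]
    rows = find_line_rows(img, min_fill=min_fill)
    for top, bottom in group_into_staves(rows, max_gap=max_gap):
        first = max(0, top - margin)
        last = min(len(out) - 1, bottom + margin)
        for y in range(first, last + 1):
            out[y] = [255] * len(out[y])
    return out
```

Notes and stems that stick out beyond `margin` rows above or below the staff would survive this, so the margin probably has to be generous, and lyrics printed between staves may need separate handling.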


Lorenzo


Timothy Snyder

Jun 28, 2019, 9:37:23 AM
to tesser...@googlegroups.com
A picture would be helpful. In my experience, though, writing a separate program that segments the text out of "noisy" images with a lot of non-text print gives the best results. Depending on how much the layout of those books varies from page to page, this could be a simple or a complicated task.
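
As a rough illustration of what such a separate segmentation step might look like, the sketch below splits a page into horizontal bands separated by blank rows, then keeps only the bands that contain no long horizontal rule. The kept bands could be cropped and passed to the OCR engine one at a time. It assumes a deskewed grayscale page stored as a plain list of rows and a single-column layout; all names and thresholds here are hypothetical, and the "long dark row means music" test is only one possible heuristic.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white
Band = Tuple[int, int]   # inclusive (top, bottom) row range


def split_bands(img: Image, dark_below: int = 128,
                min_gap: int = 3) -> List[Band]:
    """Split the page into inked bands separated by >= min_gap blank rows."""
    bands: List[Band] = []
    top = None
    last_ink = -1
    for y, row in enumerate(img):
        if any(p < dark_below for p in row):
            if top is None:
                top = y
            elif y - last_ink > min_gap:
                # Enough blank rows since the last inked row: close the band.
                bands.append((top, last_ink))
                top = y
            last_ink = y
    if top is not None:
        bands.append((top, last_ink))
    return bands


def has_long_rule(img: Image, band: Band, dark_below: int = 128,
                  min_fill: float = 0.6) -> bool:
    """True if any row in the band is mostly dark, e.g. a staff line."""
    top, bottom = band
    for y in range(top, bottom + 1):
        dark = sum(1 for p in img[y] if p < dark_below)
        if dark >= min_fill * len(img[y]):
            return True
    return False


def text_bands(img: Image,
               is_music: Callable[[Image, Band], bool] = has_long_rule
               ) -> List[Band]:
    """Return the bands that the is_music predicate does not flag."""
    return [band for band in split_bands(img) if not is_music(img, band)]
```

Passing the classifier in as `is_music` keeps the band splitting separate from the decision about what counts as music, so a stricter test can be swapped in if the page layout varies a lot.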

Sara Palmer

Jun 28, 2019, 11:48:29 AM
to tesseract-ocr
Thank you both for your suggestions! I've attached a couple of sample images.
00000047.jpg
00000012.jpg