Recognizing censorship blocks

Patrick Durusau

Dec 17, 2014, 4:41:20 PM
to tesser...@googlegroups.com
Greetings!

I recently had wonderful success with tesseract-ocr on grand jury transcripts but now have a harder problem.

Can tesseract be trained to recognize censoring blocks in text? For example:

Assume this sentence has XXXXXXXXXXXXX a censoring block that obscures all the text it covers. (The block is represented here by the X's; in the actual text it is a solid black line.)

What I want to do, in addition to recognizing the surrounding text, is to train tesseract to substitute "(redaction - N)" for the black mark, where N is the length of the redaction.

There aren't that many different sizes of redaction, probably from about one character space up to an entire line, so producing examples of all the blackouts would be tedious but not difficult.

Is that pushing tesseract in a direction it is not meant to go? 

If so, any suggestions on software that might be better suited to the task?

Thanks!

Patrick
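
To make the requested "(redaction - N)" output concrete, here is a minimal sketch of one way to measure a redaction's length outside of tesseract itself. It is not a tesseract feature or training recipe. It assumes the page has already been binarized and cropped to a single text line (1 = dark pixel, 0 = light), that N is counted in characters, and that an average character width in pixels is known. The names find_redactions and marker, and the min_fill threshold, are hypothetical.

```python
from typing import List, Tuple


def dark_fraction_per_column(line_img: List[List[int]]) -> List[float]:
    """Fraction of dark pixels in each column of a binarized text-line crop."""
    height = len(line_img)
    width = len(line_img[0])
    return [sum(row[x] for row in line_img) / height for x in range(width)]


def find_redactions(
    line_img: List[List[int]],
    char_width: float,
    min_fill: float = 0.9,  # assumed threshold: a column counts as "solid" at 90% dark
) -> List[Tuple[int, int]]:
    """Return (start_column, N) for each run of nearly solid dark columns.

    N is the run width divided by the assumed average character width,
    rounded to the nearest integer. Runs that round to zero characters
    (for example a single vertical stroke) are ignored.
    """
    fractions = dark_fraction_per_column(line_img)
    runs: List[Tuple[int, int]] = []
    start = None
    # The trailing 0.0 is a sentinel that closes a run touching the right edge.
    for x, frac in enumerate(fractions + [0.0]):
        if frac >= min_fill and start is None:
            start = x
        elif frac < min_fill and start is not None:
            n_chars = round((x - start) / char_width)
            if n_chars >= 1:
                runs.append((start, n_chars))
            start = None
    return runs


def marker(n_chars: int) -> str:
    """Format the placeholder text requested in the post."""
    return f"(redaction - {n_chars})"
```

The column positions returned by find_redactions could then be used to splice each marker into the OCR output for the surrounding text. How well a fixed character width works on proportional fonts or skewed scans is an open question that this sketch does not address.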

Alfredo Jr. Go

Jun 10, 2021, 3:38:57 PM
to tesseract-ocr
Hello,

I am currently working on the same problem using OCR. Has anyone had any experience detecting redactions with OCR?

Regards,
Fred.