training using a page at a time?


Jeremy C. Reed

12:31 AM (23 hours ago)
to tesser...@googlegroups.com
I used the main branch as of Oct 21 (last commit Oct 13) on a scan of a book
from 1830. I created 32 cropped PNG files with corresponding transcriptions
in .gt.txt files and ran the "make training" target with START_MODEL=eng to
create a new model.
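Roughly, the training invocation looked like this (a sketch of the
tesstrain-style workflow; the model name, paths and iteration count below
are placeholders, not my exact values):

    # one line image plus one transcription per training sample:
    #   data/book1830-ground-truth/line_0001.png
    #   data/book1830-ground-truth/line_0001.gt.txt
    make training MODEL_NAME=book1830 START_MODEL=eng \
         TESSDATA=/usr/share/tessdata MAX_ITERATIONS=10000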

The book has over 200 unique misspellings or archaic spellings and
hundreds of uncommon proper names. I used -c load_system_dawg=F
-c load_freq_dawg=F, but I noticed no difference with or without
those flags. I didn't want any dictionary-based decisions, as I want to
keep all original spellings.
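For reference, the recognition runs were along these lines (a sketch; the
file names and model name are placeholders):

    tesseract page-001.png page-001 -l book1830 \
        -c load_system_dawg=F -c load_freq_dawg=F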

I ended up creating a word diff between the output of the eng model and the
output of my model, then I manually and visually reviewed the entire book.
There were over 6000 differences, which included much noise, so on average
over 10 changes to compare for every page I manually reviewed and edited.
Such as:

cause of the greatness of the [-multitude; therefore,-] {+mutltitude; thercfore,+} he [-caused-] {+cansed+}
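(The diff itself was an ordinary word-level diff of the two plain-text
outputs; a sketch, assuming wdiff and placeholder file names:)

    wdiff page-001.eng.txt page-001.book1830.txt > page-001.wdiff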

As I manually reviewed those, I found other problems that were not
detected by tesseract using either model, which I fixed in my
transcription. Then I compared with a third-party transcription of the
same book -- that was difficult because I found over 100 mistakes in
that transcription. In addition, books from the 1800s may have in-press
changes, where things are changed during the printings, so copies of the
same book may have various corrections (I found over 60) and damage (like
type that fell out, or ink(?) spots; I found many). I spent maybe a hundred
hours on this.

I had zero pages that were recognized perfectly, but I did have maybe 1% of
pages where the eng model and my custom model produced the same output
(and even those I found to be wrong by manual review or by comparison with
another transcription).

Is there a simple way to do training page by page instead of
cropping out line by line? Now that I have transcriptions with images for
nearly 600 pages, I'd like to train on all of that. (Then I may attempt to
transcribe some other 1800s books.)

I also used a tessedit_char_whitelist that only has ASCII characters
plus an em dash. Still, I had many outputs like "Detected 523 diacritics"
along with much noise that took me hours to clean out manually. How can I
get tesseract to not output the content related to the "Detected ...
diacritics" messages, which looks like the following?

i la p AV EU

E o a r a -
t

S rmi P V CF jim E

ar

DD pi-ay pa . f

If it thinks something is a diacritic, is there a way to tell tesseract
not to output it?

Another odd behaviour I saw is that it repeated many characters like:

"m" output as "mn" or "nm"

"h" as "hu"

"had" as "bhad"

"wn" as "whn"

I think tesseract decides a single character looks like two of the
choices, which makes sense, but then it outputs both.

Anyway, thank you very much for your software. It has been quite interesting
to learn and use.

Jeremy

Ger Hobbelt

7:59 AM (15 hours ago)
to tesseract-ocr
In answer to your question: AFAIK there is no 'simple' solution/answer. Reading and OCRing (old) texts is, ah, "an area of active research".

Maybe a few notes that could be of use to you:

1: the 'looks like two of the choices' bit you mention can occur and is called diplopia. If you Google that term with tesseract, previous discussions will show up, I'm sure. Not solved and, from my perspective, not solvable {for everybody}: the output can be *improved* perhaps, but sometimes you just need to err on the side of caution, and this is (mainly) the CTC stage being cautious. (I'm anthropomorphizing the engine and technically flaky in my explanation there but, as a simplification, it works for me).

2: try to think of tesseract as one process stage of many. There's a lot of improvement to be had by tuning your preprocessing (of the page scan images). Posting one or two page images here may elicit more suggestions from others.

2.B: much less discussed, but there's also *postprocessing*: you yourself did a lot of that, manually, but what if one 'accepts' (mindset!) tesseract/OCR output as an intermediate-stage product? That kind of thinking suggests: can we detect patterns in the mistakes and improve, i.e. (auto)correct, the tesseract text output?
Yes, you can, particularly when your output content is human sentences. Like OCR itself, it's a statistics game, so don't expect perfect output, but how about constructing a custom spell checker for your output? Creating custom hunspell dictionaries is way cheaper in time and effort than training a customized-for-your-purposes neural net that takes image pixels as input. Personally, this would be the next thing I'd be looking into...
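As a very rough sketch of what that could look like on the command line (assuming hunspell is installed; the dictionary name and contents here are made up for illustration):

    # mybook.dic: first line is an approximate word count, then one word
    # per line -- include the archaic spellings and proper names you
    # want to keep. mybook.aff can start out empty.
    # List the words hunspell does *not* recognize in the OCR output,
    # most frequent first:
    hunspell -d ./mybook -l ocr-output.txt | sort | uniq -c | sort -rn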

If you're fine with reading academic papers, here's "Survey of Automatic Spelling Correction" from 2020 attached as a PDF: this one is useful to learn what you're up against when you tackle this part of postprocessing. Start with that, plus hunspell, and possibly look into the usefulness of funspell (Python), which advertises itself as a context-aware improved hunspell: where hunspell works per-word, and so would okay both 'their' and 'there', a context-aware checker can guess which of these two is better suited in this particular spot in the sentence.

3: your challenges sound a lot like the ones the German universities working on pre-war and WW2 Fraktur newspapers must have had: that would be Stefan Weil and co. Ditto for a project from several years ago where a dedicated tesseract model was created to decipher old Greek. It might be useful if you can talk to them and hear their experiences.

4: given (2), run tesseract unrestricted, without a whitelist, and run it with HOCR and/or TSV output: those formats include statistics per letter & word, and you may be able to use those rankings to steer the decisions in your (automatic) postprocess (spell correction).
I am in favor of not using white/blacklists in tesseract, as they basically drop information that I can possibly deal with better outside. They are very useful for other scenarios, where you don't want any 'recovery after the fact'. Yours is, in my view, not one of those.
I'm thinking: custom hunspell dictionary, applied judiciously, e.g. only pick a spell correction when the input rankings are within a certain range. Diacritics also show up (a lot!) when processing pictures or diagrams and charts that are part of the book page: if you cannot detect and erase them already in your PREprocess stage, it might be useful to add a post stage detecting streams of 'gibberish': that's probably tesseract looking at an ornamented drop cap, an ink stain that could not be removed/filtered out, or some other specialty bit that happens in your project.

Yet *another* reason to always produce HOCR or TSV: these also spit out pixel coordinates, so you can post-analyze the text output, e.g.: is it really a line of text there, or is the spacing 'too weird' for that?
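As a rough sketch of what that looks like in practice (the file names are placeholders):

    # produce plain text, hOCR and TSV in one run:
    tesseract page-001.png page-001 txt hocr tsv

    # the TSV has one row per word with its bounding box (left, top,
    # width, height in columns 7-10) and a confidence value (column 11);
    # e.g. list the low-confidence words with their coordinates:
    awk -F'\t' 'NR > 1 && $11 >= 0 && $11 < 60 {print $11, $7, $8, $9, $10, $12}' page-001.tsv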

5: in the end, for a perfect score, you will always have a manual vetting stage at the end. Like you did.
All the above is with this in mind: what to do, and how to do it, so you have the best possible human vetting stage? I don't know. It depends on a lot of factors. But one thing I'd be doing is carrying as much info through the chain as I can, so I can quickly and completely (if possible) evaluate what I see and doubt = wish to correct. And how to do that fast. (So carrying word pixel coordinates forward can help, as you can then clip a part of the page image for human-brain-based 'OCR', for example. Carrying rankings forward helps when you wonder: "yeah, but how sure were you about this, machine?")
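To illustrate that last bit: a sketch, assuming ImageMagick and the TSV bounding box columns from above (the geometry values are made up):

    # clip the doubtful word out of the page scan so a human can glance
    # at the original pixels; width x height + left + top come from the
    # word's TSV row:
    convert page-001.png -crop 180x42+640+910 +repage doubtful-word.png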

6: I talk about preprocessing and postprocessing here. See also the text and links/references in this recent conversation: 

While that one may seem unrelated at first glance, please do read those referenced documents and articles in there, if you haven't already, as they have general applicability.



Ultimately:
YMMV. It is not simple stuff.




Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

Attachment: electronics-09-01670.pdf