New release for tessdata_{fast,best}?


Merlijn B.W. Wajer

Jan 27, 2021, 5:28:27 AM
to tesser...@googlegroups.com
Hi,

With Tesseract now switching to regular (alpha) releases of 5.0.0; does
it make sense to consider some versioning for language files as well?

The Internet Archive has switched to using Tesseract for all our OCR,
and I'm hoping that we can record exactly what version of language files
was used for a specific OCR job. Currently, the answer is simple, since
we're using the default packages from Ubuntu focal, but I am working on
switching to Tesseract release/tag 5.0.0-20201231.

But the tessdata_fast (or tessdata_best, for that matter) do not seem to
have any recent 5.x releases:
https://github.com/tesseract-ocr/tessdata_fast/releases

Are there plans to create a release/tag for the tessdata_* repositories?

Cheers,
Merlijn

Shree Devi Kumar

Jan 27, 2021, 6:42:23 AM
to tesseract-ocr
>The Internet Archive has switched to using Tesseract for all our OCR,

I am so happy to hear this. It will be great to have the Indic languages that were marked as non-OCRable so far be converted to text correctly on the Internet Archive.

Is there any page with instructions to do this? Can a language be specified while OCRing? E.g., better results are often obtained using script/Devanagari instead of san for Sanskrit.

Regarding your question about tessdata, there have only been minor changes to tessdata files but adding a tag is a good idea. I suggest you post this as a feature request in the repo.







Merlijn B.W. Wajer

Jan 27, 2021, 7:03:13 AM
to tesser...@googlegroups.com
Hi,

On 27/01/2021 12:42, Shree Devi Kumar wrote:
>> The Internet Archive has switched to using Tesseract for all our OCR,
>
> I am so happy to hear this. It will be great to have the Indic languages
> that were marked as non-ocrable so far be converted to text correctly on
> Internet Archive.

Right, that should now just work -- and we now also support the Fraktur
script and some more languages (all made possible by the great work on
Tesseract!).

> Is there any page with instructions to do this? Can a language be specified
> while OCRing? eg. Better results are many times received using
> script/Devanagari instead of san for Sanskrit.

We switched over completely in mid-December 2020, and I'm still working
through a feature and documentation backlog, including document
discovery. But in general, if you set the right ISO 639 language (code
or name) in the "language" metadata field, that language should be used
exclusively - you can also set multiple languages. Potentially you could
also set a script in the language field; I must admit I have not tried
that yet.

If you omit the language field altogether, the module will figure out
which scripts are being used, perform OCR with the detected scripts as
data packs, run language analysis on the resulting corpus, and finally,
through some heuristics, pick the (potentially multiple) languages it
believes the piece is written in, performing a final OCR pass using
those languages and their associated scripts. (A repo with this Python
instrumentation code will follow soon.)
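To make the heuristic step concrete, here is a minimal illustrative
sketch (not our actual instrumentation code) where the hypothetical
`ocr` and `lang_of_word` callables stand in for Tesseract and a
language-identification library:

```python
from collections import Counter

def pick_languages(detected_scripts, ocr, lang_of_word, top_n=2):
    """Heuristic language selection: OCR once per detected script
    pack, guess each word's language, and keep the most common
    languages for the final OCR pass."""
    words = []
    for script in detected_scripts:
        words.extend(ocr(script).split())
    counts = Counter(lang_of_word(w) for w in words if w)
    return [lang for lang, _ in counts.most_common(top_n)]

# Toy stand-ins: ocr("Latin") would really call Tesseract with the
# script/Latin data pack, and lang_of_word a language-ID library.
fake_ocr = {"Latin": "the cat sat on le tapis rouge"}
fake_lang = {"the": "eng", "cat": "eng", "sat": "eng", "on": "eng",
             "le": "fra", "tapis": "fra", "rouge": "fra"}
print(pick_languages(["Latin"], fake_ocr.get, fake_lang.get))
# ['eng', 'fra']
```

The real pipeline does more (script detection, multiple passes), but
the shape of the decision is roughly this.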

I wouldn't mind chatting about this some more, but perhaps (?) off-list
would be a better way to do that - either way is fine by me.

> Regarding your question about tessdata, there have only been minor changes
> to tessdata files but adding a tag is a good idea. I suggest you post this
> as a feature request in the repo.

I've created one: https://github.com/tesseract-ocr/tessdata_fast/issues/26

Cheers,
Merlijn

Greg Jay

Jan 27, 2021, 11:24:32 PM
to tesser...@googlegroups.com


On Jan 27, 2021, at 1:42 AM, Shree Devi Kumar <shree...@gmail.com> wrote:

>> The Internet Archive has switched to using Tesseract for all our OCR,
>
> I am so happy to hear this. It will be great to have the Indic languages that were marked as non-OCRable so far be converted to text correctly on the Internet Archive.
>
> Is there any page with instructions to do this? Can a language be specified while OCRing? E.g., better results are often obtained using script/Devanagari instead of san for Sanskrit.
>
> Regarding your question about tessdata, there have only been minor changes to tessdata files but adding a tag is a good idea. I suggest you post this as a feature request in the repo.

I hope someone adds Grantha script as there are many texts on Archive.org in this script.

Greg

Tom Morris

Jan 30, 2021, 3:25:12 PM
to tesseract-ocr
On Wednesday, January 27, 2021 at 5:28:27 AM UTC-5 Merlijn Wajer wrote:

> The Internet Archive has switched to using Tesseract for all our OCR,

That's great to hear! It's certainly been a long time coming. Nick White & I tried to get this to happen 7 years ago and even volunteered to help, but were ignored.
 
> and I'm hoping that we can record exactly what version of language files
> was used for a specific OCR job.

Yes, provenance of the OCR'd text and the software used to derive it would be very valuable.

Did you do any type of quality / performance comparative study as part of the switch or evaluation leading up to it? Can you share the results?

Will you be reprocessing the backlog of books which were originally done with ABBYY? As I mention in that thread from 7 years ago, there's a subset which, anecdotally, looks like it might have been processed using ABBYY "fast" mode, accounting for extra low quality output. These would be especially useful for reprocessing.

Are you looking at any higher level processing (e.g. voting / merging results from multiple scans/editions) to improve the raw quality further?

Tom

Merlijn B.W. Wajer

Feb 1, 2021, 8:50:57 PM
to tesser...@googlegroups.com
Hi Tom,

On 30/01/2021 21:25, Tom Morris wrote:
> On Wednesday, January 27, 2021 at 5:28:27 AM UTC-5 Merlijn Wajer wrote:
>
>
> The Internet Archive has switched to using Tesseract for all our OCR,
>
>
> That's great to hear! It's certainly been a long time coming. Nick White
> & I tried to get this to happen 7 years ago and even volunteered to
> help, but were ignored.
> https://archive.org/post/1010389/using-tesseract-to-improve-ocr-for-some-languages

I've been working with the Internet Archive for only a couple of years,
and mostly worked on other parts of the digitisation efforts - I wasn't
aware of that thread. Sorry to see that it wasn't picked up.

>
> and I'm hoping that we can record exactly what version of language
> files
> was used for a specific OCR job.
>
>
> Yes, provenance of the OCR'd text and the software used to derive it
> would be very valuable.

Agreed.

> Did you do any type of quality / performance comparative study as part
> of the switch or evaluation leading up to it? Can you share the results?

We did internally compare Abbyy and Tesseract results on some books and
microfilm. We found the results to be mostly similar: some parts a
little better, others a little worse. In particular, I believe that for
newspaper segmentation there are some areas that can be improved in
Tesseract (even though the current state is quite good already), but the
recognition engine came out quite strong.

I am happy to share more details of our evaluation off list - please
drop me an email if you're interested.

> Will you be reprocessing the backlog of books which were originally done
> with ABBYY? As I mention in that thread from 7 years ago, there's a
> subset which, anecdotally, looks like it might have been processed using
> ABBYY "fast" mode, accounting for extra low quality output. These would
> be especially useful for reprocessing.

Do you have some specific collections in mind? I believe that we
currently do not have a lot of computational capacity to spare, but
could definitely target specific collections.

> Are you looking at any higher level processing (e.g. voting / merging
> results from multiple scans/editions) to improve the raw quality further?

That is an interesting idea. I do know that we usually do not digitise
duplicates, as digitisation is a relatively costly process. That said,
there are likely still plenty of duplicates to be found, which could
make this technique worth trying. Did you have any particular technique
in mind?

Cheers,
Merlijn

PS: my invitation to share more details applies to others on this list too.
We also have a blog post up here detailing some of the work
(https://blog.archive.org/2020/11/23/foss-wins-again-free-and-open-source-communities-comes-through-on-19th-century-newspapers-and-books-and-periodicals/),
thanking the open source community.

We also have a (Slack) channel (not a mailing list, sorry) for OCR
discussion, in case some of you are interested in helping out one way or
another (drop me an email and I can try to get you set up).

Tom Morris

Feb 19, 2021, 12:27:55 PM
to tesseract-ocr
Hi Merlijn,

Apologies for the delayed reply. I'll definitely be in touch about the results of your OCR comparison study, but I'd encourage you to release it publicly. One good way to give back to the open source community that the Internet Archive takes advantage of is to share knowledge and code openly. I know that can be a challenge given the historical culture there, but it'd be nice to see that change.

As for not digitizing duplicates, I think there might be a lot more than you suspect. I did a study 7 years ago for a library that wanted to link public domain "classics" to their library catalog and found a large number of duplicate scans for these works, with a wide range of OCR quality scores. At the extreme, there's "The Pilgrim's Progress" with 121 scans that have OCR average character confidence scores ranging from the mid-50s to 94.68. Some of these are unique editions (children's versions, etc.), but others are duplicate scans of the same, or very closely related, editions (e.g. subsequent printings of an edition in later years). The Odyssey, The Iliad, Faust, and a few of Shakespeare's works all have over 50 scans (in English). For the 243 works that I analyzed, there were a total of 2576 scans in English (at least that I was able to locate at the time).

The full list of scans that I analyzed is here: https://github.com/tfmorris/openlibrary-utils/tree/master/data

For leveraging multiple scans to improve quality, my initial thought was to extend Ismet Yalniz's work, but it's been years since I looked at the literature, so there may have been more advances in the intervening time. 

Hmm, actually, they did a small-scale study of exactly this use case with positive results, combining three editions of Wuthering Heights and three of Sense and Sensibility to improve the single best OCR scores from .885 to .924 and from .911 to .954, respectively.
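As a toy illustration of why having multiple scans helps (this is just
a stdlib medoid vote, not Yalniz's alignment-based method): given
several OCR outputs of the same passage, pick the candidate that is
most similar, on average, to the others:

```python
from difflib import SequenceMatcher

def medoid_vote(candidates):
    """Return the OCR candidate most similar, on average, to all the
    others -- a crude stand-in for alignment-based voting schemes."""
    def avg_similarity(s):
        return sum(SequenceMatcher(None, s, t).ratio()
                   for t in candidates if t is not s) / (len(candidates) - 1)
    return max(candidates, key=avg_similarity)

scans = ["The quick brovvn fox",  # 'w' misread as 'vv'
         "The qu1ck brown fox",   # 'i' misread as '1'
         "The quick brown fox"]
print(medoid_vote(scans))  # The quick brown fox
```

Real systems vote at the character or word level after aligning the
texts, which is where the gains in the study above come from.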

The papers citing that work would be a good starting point for investigation of the current state of the art:

The ICDAR 2019 Post-OCR Text Correction competition might also be worth reviewing: 

Good luck! It's great to see someone working on improving the quality of Internet Archive OCR.

Best,
Tom

shree

Feb 24, 2021, 12:10:46 AM
to tesseract-ocr
> There is now a 4.1.0 release available for tessdata_fast, tessdata and tessdata_best.

@Merlijn Wajer 

archive.org has many books which use English with diacritics for Sanskrit (IAST). You could try the models in https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST for those.
