Hi Tom,
On 30/01/2021 21:25, Tom Morris wrote:
> On Wednesday, January 27, 2021 at 5:28:27 AM UTC-5 Merlijn Wajer wrote:
>
>> The Internet Archive has switched to using Tesseract for all our OCR,
>
> That's great to hear! It's certainly been a long time coming. Nick White
> & I tried to get this to happen 7 years ago and even volunteered to
> help, but were ignored.
>
> https://archive.org/post/1010389/using-tesseract-to-improve-ocr-for-some-languages
I've been working with the Internet Archive for only a couple of years,
and have mostly worked on other parts of the digitisation effort, so I
wasn't aware of that thread. Sorry to see that it wasn't picked up.
>
>> and I'm hoping that we can record exactly what version of language
>> files was used for a specific OCR job.
>
> Yes, provenance of the OCR'd text and the software used to derive it
> would be very valuable.
Agreed.
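For concreteness, here is a rough sketch of the kind of record I have
in mind (the field names are entirely hypothetical; it assumes the
Tesseract CLI and only the Python standard library): stamp every job
with the engine version string plus a checksum per traineddata file.

    import hashlib
    import json
    import subprocess
    from pathlib import Path

    def ocr_job_provenance(tessdata_dir, langs):
        """Describe the engine and models used for one OCR job.

        Sketch only: the idea is to store the Tesseract version
        string plus a SHA-1 per traineddata file alongside every
        OCR result, so text can be tied back to the exact models.
        """
        proc = subprocess.run(["tesseract", "--version"],
                              capture_output=True, text=True, check=True)
        # older Tesseract builds print the version banner on stderr
        banner = (proc.stdout or proc.stderr).splitlines()[0]

        models = {}
        for lang in langs:
            data = (Path(tessdata_dir) / f"{lang}.traineddata").read_bytes()
            models[lang] = hashlib.sha1(data).hexdigest()

        return {"engine": banner, "models": models}

    # tessdata path is an assumption; it varies per installation
    print(json.dumps(
        ocr_job_provenance("/usr/share/tessdata", ["eng", "deu"]),
        indent=2))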
> Did you do any type of quality / performance comparative study as part
> of the switch or evaluation leading up to it? Can you share the results?
We internally compared Abbyy and Tesseract results on some books and
microfilm. We found the results to be mostly similar: some parts a
little better, others a little worse. In particular, I believe
Tesseract's newspaper segmentation still has some areas that can be
improved (even though the current state is already quite good), but
the recognition engine came out quite strong.
I am happy to share more details of our evaluation off list - please
drop me an email if you're interested.
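For those wondering what such a comparison involves: a common approach
is to score each engine's output against a hand-corrected ground truth
using the character error rate (CER). A minimal sketch, plain Python
with no OCR-specific tooling assumed:

    def levenshtein(a, b):
        """Edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def cer(hypothesis, reference):
        """Character error rate: edit distance over reference length."""
        return levenshtein(hypothesis, reference) / max(len(reference), 1)

    # Score two engines' output of the same page; lower is better.
    truth = "The quick brown fox"
    print(cer("The quiok brown fox", truth))   # engine A, ~0.053
    print(cer("The quick brovvn fox", truth))  # engine B, ~0.105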
> Will you be reprocessing the backlog of books which were originally done
> with ABBYY? As I mention in that thread from 7 years ago, there's a
> subset which, anecdotally, looks like it might have been processed using
> ABBYY "fast" mode, accounting for extra low quality output. These would
> be especially useful for reprocessing.
Do you have some specific collections in mind? I believe that we
currently do not have a lot of computational capacity to spare, but
could definitely target specific collections.
> Are you looking at any higher level processing (e.g. voting / merging
> results from multiple scans/editions) to improve the raw quality further?
That is an interesting idea. We usually do not digitise duplicates, as
digitisation is a relatively costly process, but there are likely still
plenty of duplicates to be found, which could make this technique worth
trying. Did you have a particular technique in mind?
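One naive version I could imagine (just a sketch using difflib from the
Python standard library; nothing like this is running on our side):
align every OCR reading of a page against a pivot reading and take a
per-character majority vote.

    from collections import Counter
    from difflib import SequenceMatcher

    def merge_ocr(readings):
        """Majority-vote merge of several OCR readings of one text.

        Sketch: align each reading against the first (the pivot),
        tally the character each reading proposes at each pivot
        position, and keep the most common proposal. Insertions
        and deletions relative to the pivot are skipped here.
        """
        pivot = readings[0]
        votes = [Counter() for _ in pivot]
        for reading in readings:
            sm = SequenceMatcher(None, pivot, reading, autojunk=False)
            for op, i1, i2, j1, j2 in sm.get_opcodes():
                if op == "equal" or (op == "replace"
                                     and (i2 - i1) == (j2 - j1)):
                    # count this reading's character at each position
                    for k in range(i2 - i1):
                        votes[i1 + k][reading[j1 + k]] += 1
        return "".join(c.most_common(1)[0][0] for c in votes)

    # Three noisy readings of the same line:
    readings = ["The quick brown fox",
                "The quiok brown fox",
                "The quick brovn fox"]
    print(merge_ocr(readings))  # -> "The quick brown fox"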
Cheers,
Merlijn
PS: my invitation to share more details applies to others on this list too.
We also have a blog post up detailing some of this work and thanking
the open source community:
https://blog.archive.org/2020/11/23/foss-wins-again-free-and-open-source-communities-comes-through-on-19th-century-newspapers-and-books-and-periodicals/
We also have a (Slack) channel (not a mailing list, sorry) for OCR
discussion, in case some of you are interested in helping out one way or
another (drop me an email and I can try to get you set up).