Massive dataset-driven learning


Brock

Oct 24, 2007, 3:11:27 PM
to ocropus
Hi,

Fair warning: I am not a programmer, nor am I familiar with how
OCRopus works. I just have a question.

I just tried out OCRopus on some sample .png files, and the results
were all over the map. Obviously there's a lot of room for improvement,
particularly for certain kinds of fonts.

I was just wondering if Google is working on a massive, dataset-driven
learning approach to improve this software. They have access to
millions of "printer friendly" documents in Google News and the index
generally. Are they comparing the HTML documents (where the contents
are 100% machine readable) with image files generated by a Print
Preview-like program?

I would think that if you showed a neural-net learning system enough
documents in image form, with hundreds of different fonts for each
document, the program could "figure it out" fairly quickly.

Obviously this is a very data- and CPU-intensive project, one more
suited to Google's server farms than personal computers; so I just
wanted to know if this would work and if Google was working on this
approach.

Thanks,

Brock

Thomas Breuel

Oct 24, 2007, 5:01:25 PM
to ocr...@googlegroups.com

> I was just wondering if Google is working on a massive, dataset-driven
> learning approach to improve this software. They have access to
> millions of "printer friendly" documents in Google News and the index
> generally. Are they comparing the HTML documents (where the contents
> are 100% machine readable) with image files generated by a Print
> Preview-like program?

In fact, Google has released 120 GB of training data, and we also have large amounts of training data ourselves.  For the alpha release, there has been very little training; we have focused on getting the code in place, so the performance of the character recognizer isn't all that good.  For the beta release, we'll work on training and improving the character recognition.

> I would think that if you showed a neural-net learning system enough
> documents in image form, with hundreds of different fonts for each
> document, the program could "figure it out" fairly quickly.

Unfortunately, it's a lot more complicated than that: the character recognizers need to be fast in addition to accurate, the characters are not presented in isolation, there are font and style variations, and there are ambiguities in the input.

> Obviously this is a very data- and CPU-intensive project, one more
> suited to Google's server farms than personal computers; so I just
> wanted to know if this would work and if Google was working on this
> approach.

Well, of course, training on big datasets is something that is being done for OCR systems in general, and will be done for OCRopus in particular.  But additional techniques are needed to make OCR systems perform well in practice.

Again, let me emphasize: the alpha release of OCRopus has not been trained or tuned much, so don't expect great performance from it yet; that's what we are going to be working on over the next 6-12 months.

Cheers,
Thomas.

cma...@googlemail.com

Oct 24, 2007, 5:30:45 PM
to ocropus
Hi Thomas,

> In fact, Google has released 120 GB of training data, and we also have
> large amounts of training data ourselves. For the alpha release, there has
> been very little training; we have focused on getting the code in place, so
> performance of the character recognizer isn't all that good. For the beta
> release, we'll work on training and improving the character recognition.
That's interesting news. Can you say (or: are you allowed to tell us)
some more about this data? Is it from the Book Search project? Which
languages and centuries are covered? Do they provide hand-printed and
Fraktur data? Do they provide scholarly journals as well?
Are those only images, or do they give you the full text as well?

Cheers,
Christian

Thomas Breuel

Oct 25, 2007, 9:11:51 PM
to ocr...@googlegroups.com

> That's interesting news. Can you say (or: are you allowed to tell us)
> some more about this data? Is it from the Book Search project? Which
> languages and centuries are covered? Do they provide hand-printed and
> Fraktur data? Do they provide scholarly journals as well?
> Are those only images, or do they give you the full text as well?

The data consists of 1000 out-of-copyright books, together with OCR results using a commercial OCR engine in hOCR format.  A subset of the data was handed out on DVD at ICDAR 2007.  The complete dataset is available from Google on a disk drive (I'm not sure how the distribution is being handled, but I can find out).
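
To give a rough idea of the format (this is a made-up fragment and a quick sketch, not taken from the actual data): hOCR is ordinary HTML with OCR-specific class names, and the geometry goes into the title attribute as "bbox x0 y0 x1 y1". Something along these lines pulls the word boxes back out:

import re

# Made-up hOCR fragment; the class names and the "bbox" title property
# follow the hOCR conventions, but the text and coordinates are invented.
hocr = '''
<div class="ocr_page" title="bbox 0 0 2480 3508">
  <span class="ocr_line" title="bbox 400 500 900 560">
    <span class="ocrx_word" title="bbox 400 500 640 560">sample</span>
    <span class="ocrx_word" title="bbox 655 500 900 560">text</span>
  </span>
</div>
'''

# Extract (word, bounding box) pairs from the fragment.
pattern = r'class="ocrx_word" title="bbox (\d+) (\d+) (\d+) (\d+)">([^<]*)<'
for x0, y0, x1, y1, word in re.findall(pattern, hocr):
    print(word, (int(x0), int(y0), int(x1), int(y1)))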

We're trying to put together a dataset of more recent books as well; it looks like some publishers are willing to help out.  I'm not sure yet when that will be done though.

For scholarly journals, the UW3 dataset is not too bad.

Cheers,
Thomas.

cma...@googlemail.com

Oct 29, 2007, 10:19:31 AM
to ocropus
Hello,

> The data consists of 1000 out-of-copyright books, together with OCR results
> using a commercial OCR engine in hOCR format. A subset of the data was
> handed out on DVD at ICDAR 2007. The complete dataset is available from
> Google on a disk drive (I'm not sure how the distribution is being handled,
> but I can find out).
Thanks for the update. I'm asking in order to get an overview of the
data you plan to train OCRopus with, and thus of the type of material
that will later be processable.

For German books you should be able to use the data from the Google
BSB project.

Cheers,
Christian

cma...@googlemail.com

Oct 29, 2007, 2:30:03 PM
to ocropus
Hi,
Another question:
Since you want to train OCRopus with the hOCR data from Google, I
assume that word or even character coordinates are encoded in the
hOCR.
As a follow-up to our discussion of output formats, I think we can
agree on two additional profiles:
- hOCR with character coordinates
- hOCR with word coordinates

We should start working on the formalisation of the different profiles
proposed so far. It's rather awkward that we can't really do this in an
established way (DTD, XML Schema) because of the HTML nature of hOCR.
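
Until we have something better, a small script might serve as an informal conformance check, e.g. for the "hOCR with word coordinates" profile: walk the HTML and verify that every element with class ocrx_word carries a bbox in its title attribute. A rough sketch, not meant as an official mechanism (the file name is just a placeholder):

from html.parser import HTMLParser

class WordBoxCheck(HTMLParser):
    """Counts ocrx_word elements and how many of them lack a bbox."""
    def __init__(self):
        super().__init__()
        self.words = 0
        self.missing_bbox = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self.words += 1
            if "bbox" not in (attrs.get("title") or ""):
                self.missing_bbox += 1

checker = WordBoxCheck()
checker.feed(open("page.hocr.html", encoding="utf-8").read())  # placeholder file name
print("ocrx_word elements:", checker.words, "without bbox:", checker.missing_bbox)

Something similar could be done for the character-coordinate profile.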

Cheers,
Christian
