Fair warning: I am not a programmer, nor am I familiar with how
OCRopus works. I just have a question.
I just tried out OCRopus on some sample .png files, and the results
were all over the map. Obviously there's a lot of room for improvement,
particularly for certain kinds of fonts.
I was just wondering if Google is working on a massive dataset-driven
learning approach to improve this software. They have access to
millions of "printer friendly" documents in Google News and the index
generally. Are they comparing the HTML documents (where the contents
are 100% machine readable) with image files generated by a Print
Preview-like program?
I would think that if you showed a neural-net learning system enough
documents in format, with hundreds of different fonts for each
document, that the program could "figure it out" fairly quickly.
Obviously this is a very data- and CPU-intensive project, one better
suited to Google's server farms than to personal computers, so I just
wanted to know if this would work and if Google was working on this
approach.
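
To make the comparison idea concrete, here is a rough Python sketch of
the scoring half only: it pulls the visible text out of an HTML page as
ground truth and computes a character error rate against whatever an
OCR engine returned for the rendered image. Rendering the page and
running the OCR are not shown. The names TextExtractor, ground_truth
and char_error_rate are made up for illustration; nothing here is part
of OCRopus or a description of what Google actually does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def ground_truth(html):
    """Return the page text with whitespace collapsed to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Assumption: joining chunks with a space is good enough; it can split
    # a word that is broken up by inline tags such as <b>.
    return " ".join(" ".join(parser.chunks).split())


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def char_error_rate(truth, ocr_text):
    """Edit distance divided by the length of the ground truth."""
    if not truth:
        return 0.0 if not ocr_text else 1.0
    return edit_distance(truth, ocr_text) / len(truth)
```

With pairs like this, the same page rendered in many fonts could be
scored automatically, and the (image, text) pairs could in principle
serve as training data as well as test data.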
Thanks,
Brock
Cheers,
Christian
That's interesting news. Can you (or: are you allowed to) tell us some
more about this data? Is it from the book search project? Which
languages and centuries are covered? Do they provide handprinted and
Fraktur data? Do they provide scholarly journals as well?
Are those only images, or do they give you the full text as well?
For German books you should be able to use the data from the Google
BSB project.
Cheers,
Christian
We should start to work on the formalisation of the different profiles
proposed so far. It's very odd that we can't really do this in an
established way (DTD, XML Schema) because of the HTML nature of hOCR.
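
In the absence of a schema language that fits, one fallback is to
write a profile down as a small checking program. The Python sketch
below is only an illustration of that idea: the rule set
(REQUIRED_BBOX, "at least one ocr_page") is a made-up example profile,
not one of the profiles proposed so far, and has_bbox and
check_profile are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical example profile: these hOCR classes must carry a bbox
# property in their title attribute.
REQUIRED_BBOX = {"ocr_page", "ocr_line"}


def has_bbox(title):
    """True if the title holds a property of the form 'bbox x0 y0 x1 y1'."""
    for prop in title.split(";"):
        parts = prop.split()
        if (len(parts) == 5 and parts[0] == "bbox"
                and all(p.isdigit() for p in parts[1:])):
            return True
    return False


class ProfileChecker(HTMLParser):
    """Record violations of the example profile while parsing."""

    def __init__(self):
        super().__init__()
        self.errors = []
        self.seen = set()  # hOCR class names encountered so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        title = attrs.get("title") or ""
        for cls in classes:
            if not cls.startswith("ocr"):
                continue  # not an hOCR class, ignore it
            self.seen.add(cls)
            if cls in REQUIRED_BBOX and not has_bbox(title):
                line, _ = self.getpos()
                self.errors.append(f"line {line}: {cls} lacks a valid bbox")


def check_profile(hocr):
    """Return a list of error strings; an empty list means the file conforms."""
    checker = ProfileChecker()
    checker.feed(hocr)
    checker.close()
    if "ocr_page" not in checker.seen:
        checker.errors.append("no ocr_page element found")
    return checker.errors
```

This is obviously weaker than a real DTD or XML Schema, since it is
code rather than a declarative description, but it would at least
make each profile testable.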
Cheers,
Christian