OCR challenge with prizes for open source project - starting in two weeks

Ian Ozsvald (A.I. Cookbook)

Jul 5, 2010, 12:06:22 PM
to tesser...@googlegroups.com
I'd like to announce an OCR challenge that will start soon; it is for
an open source project and will include prizes.

I'm part of the http://OpenPlaques.org/ project. The site collects
Flickr images of commemorative plaques, which are manually transcribed
and added to the site. Plaques generally carry 20-100 words describing
a historic person or event - the text is clear, often white on a blue
background. These plaques exist all over the UK (and in many other
countries), and the goal of the project is to make these historic
locations easily searchable.

Here's an example entry for Sir Winston Churchill near me:
http://www.openplaques.org/plaques/990

The project founders *manually* transcribe the plaque photos at
present - this is a crazy situation as they have several thousand
plaques outstanding and more are added every day. The project is now
international (it started in the UK less than a year ago) and an
automatic transcription system is sorely needed.

As part of my play-time projects I've set up an Artificial
Intelligence Cookbook site where I'm building a community of
like-minded folk who enjoy solving interesting challenges. I've already
documented a work-in-progress report on a manual solution to this
problem using tesseract 2:
http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
and I've just posted a software outline in Python for (bad!) automatic
recognition:
http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
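
For anyone who wants to try this straight away, here is a minimal
sketch of driving the tesseract command line from Python - this is not
the code from the posts above, and the filenames are only illustrative:

import subprocess

def ocr_image(image_path, output_base="plaque_out"):
    # tesseract writes its result to <output_base>.txt
    subprocess.check_call(["tesseract", image_path, output_base])
    with open(output_base + ".txt") as f:
        return f.read()

if __name__ == "__main__":
    # hypothetical pre-processed plaque image
    print(ocr_image("plaque_thresholded.tif"))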

The OpenPlaques project are building a corpus of images with
transcriptions for me; once we have a good set of images I'll begin
the challenge. This should be in the next two weeks.

You can see my demo code and a suggested solution here:
http://aicookbook.com/wiki/Automatic_plaque_transcription
and I'm *very* open to feedback in our Google Group:
http://groups.google.com/group/aicookbook

I'll run the competition for several months with a prize for the best
solution each month. Solutions get open sourced and sooner or later a
good automatic solution will be created which can start automatically
transcribing the OpenPlaques corpus of images. Winners will also get
their name listed on the OpenPlaques site.

If you'd like to test your skills with OCR then you'll find a good
range of images to work on - from simple clean shots to angled, dark,
smudged images of weather-beaten plaques taken at a distance.

Cheers,
Ian.

--
Ian Ozsvald (A.I. researcher, screencaster)
i...@IanOzsvald.com

http://IanOzsvald.com
http://MorConsulting.com/
http://blog.AICookbook.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald

Jimmy O'Regan

Jul 5, 2010, 5:24:45 PM
to tesser...@googlegroups.com
On 5 July 2010 17:06, Ian Ozsvald (A.I. Cookbook) <i...@aicookbook.com> wrote:
> I'd like to announce an OCR challenge that will start soon, it is for
> an open source project and will include prizes.
>
> I'm a part of the http://OpenPlaques.org/ project, the site collects
> flickr images of commemorative plaques which are manually transcribed
> and added to the site. Plaques generally have 20-100 words which
> describe a historic situation - the words are clear, often in white on
> a blue background. These plaques exist all over the UK (and in many
> other countries). The goal of the project is to make these historic
> locations easily searchable.
>

We have them in Ireland too. One aspect of them that you can use is
that they tend to adhere to a set of 'template' phrases - 'Person was
born here', 'site of the battle of X', etc.

> Here's an example entry for Sir Winston Churchill near me:
> http://www.openplaques.org/plaques/990
>
> The project founders *manually* transcribe the plaque photos at
> present - this is a crazy situation as they have several thousand
> plaques outstanding and more are added every day. The project is now
> international (it started in the UK less than a year ago) and an
> automatic transcription system is sorely needed.
>

Not so crazy: if you already have a corpus of existing transcriptions,
that puts you in a position to use statistical post-editing
techniques.

There is a tool for statistical post-editing here:
http://www.cs.toronto.edu/~mreimer/tesseract.html
If you have the option, I'd recommend changing it to output a word
lattice and feeding that into an n-gram language model: IRSTLM is a
good open source toolkit for n-gram language modelling.
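
To illustrate just the word-level idea (the real work would be done
with a lattice plus IRSTLM), here is a toy bigram scorer in Python;
the training lines and candidate strings are made up:

import math
from collections import defaultdict

unigrams, bigrams = defaultdict(int), defaultdict(int)

def train(lines):
    # count word bigrams from existing transcriptions
    for line in lines:
        words = ["<s>"] + line.lower().split()
        for prev, w in zip(words, words[1:]):
            unigrams[prev] += 1
            bigrams[(prev, w)] += 1
        unigrams[words[-1]] += 1

def log_prob(sentence):
    # add-one smoothed bigram log-probability
    words = ["<s>"] + sentence.lower().split()
    vocab = len(unigrams) + 1
    return sum(math.log((bigrams[(p, w)] + 1.0) / (unigrams[p] + vocab))
               for p, w in zip(words, words[1:]))

train(["winston churchill lived here",
       "the battle of hastings was fought near here"])
candidates = ["winston ehurchill lived here",
              "winston churchill lived here"]
print(max(candidates, key=log_prob))   # prefers the 'churchill' reading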

> As a part of my play-time projects I've setup an Artificial
> Intelligence Cookbook site where I'm building a community of
> like-minded folk who like solving interesting challenges. I've already
> documented a work-in-progress report on a manual solution to this
> problem using tesseract 2:
> http://blog.aicookbook.com/2010/06/optical-character-recognition-webservice-work-in-progress/
> and I've just posted a software outline in Python for (bad!) automatic
> recognition:
> http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/
>
> The OpenPlaques project are building a corpus of images with
> transcriptions for me, once we have a good set of images I'll begin
> the challenge. This should be in the next two weeks.
>
> You can see my demo code and a suggested solution here:
> http://aicookbook.com/wiki/Automatic_plaque_transcription
> and I'm *very* open to feedback in our Google Group:
> http://groups.google.com/group/aicookbook
>

It very much looks like you're still brainstorming. If absolute
accuracy is your goal (and processing time, etc., are not so much of
an issue), one thing that would work for plaques is this: many newer
cameras add geotags to the EXIF data. You can use the geotags to query
DBpedia for a list of Wikipedia articles that pertain to a particular
place, and extract a custom dictionary (and/or language model) for
that place -- if there's a plaque commemorating something, then it's
quite likely that Wikipedia mentions it. Names in particular tend to
be quite problematic for OCR, and this way you can generate custom
wordlists that have a higher likelihood of containing those names. You
should get good enough results from Tesseract (or, indeed, any OCR
system) by passing such a list as the user dictionary, but if you
decide to use statistical language models you would also need those
words to be part of the model, to avoid out-of-vocabulary errors.
(IRSTLM supports interpolating numerous individual models, so that's
not a problem.)
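
As a rough sketch of the EXIF step, something like the following pulls
the coordinates out with a recent Pillow; the filename is hypothetical,
and older PIL versions expose the GPS rationals differently:

from PIL import Image
from PIL.ExifTags import GPSTAGS

def to_degrees(values):
    # EXIF stores degrees/minutes/seconds as rationals
    d, m, s = (float(v) for v in values)
    return d + m / 60.0 + s / 3600.0

def gps_from_exif(path):
    exif = Image.open(path)._getexif() or {}
    # tag 34853 is the GPSInfo sub-dictionary
    gps = {GPSTAGS.get(k, k): v for k, v in exif.get(34853, {}).items()}
    if not gps:
        return None
    lat = to_degrees(gps["GPSLatitude"])
    lon = to_degrees(gps["GPSLongitude"])
    if gps.get("GPSLatitudeRef") == "S":
        lat = -lat
    if gps.get("GPSLongitudeRef") == "W":
        lon = -lon
    return lat, lon

print(gps_from_exif("plaque_photo.jpg"))   # hypothetical geotagged photo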

> I'll run the competition for several months with a prize for the best
> solution each month. Solutions get open sourced and sooner or later a
> good automatic solution will be created which can start automatically
> transcribing the OpenPlaques corpus of images. Winners will also get
> their name listed on the OpenPlaques site.
>
> If you'd like to test your skills with OCR then you'll find a good
> range of images to work on - from simple clean shots to angled, dark,
> smudged images of weather-beaten plaques taken at a distance.
>
> Cheers,
> Ian.
>
> --
> Ian Ozsvald (A.I. researcher, screencaster)
> i...@IanOzsvald.com
>
> http://IanOzsvald.com
> http://MorConsulting.com/
> http://blog.AICookbook.com/
> http://TheScreencastingHandbook.com
> http://FivePoundApp.com/
> http://twitter.com/IanOzsvald
>


--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

Ian Ozsvald (A.I. Cookbook)

Jul 6, 2010, 5:09:41 AM
to tesser...@googlegroups.com
Hi Jimmy. Thanks for the ideas.

The plaques do crop up everywhere - I got some (square stone ones) in
Barcelona last year. I want to focus on the blue English Heritage ones
first as they represent a large part of the corpus and are most common
in the UK.

Re. the template phrases - absolutely. A lot of them appear to have
sensible phrase types along with good named entities. I like the idea
of adding geotags into the mix; I had been mulling the idea of
searching for the named entities in Freebase and perhaps voting on the
results.

The Freebase/Wikipedia links are useful too as they'd provide extra
annotations for OpenPlaques in the final result.

Re. 'crazy' - only crazy as in "the sysadmin's time is spent mostly
typing in plaques rather than working on ways of getting other
people/tools to enter the plaques in a more scalable fashion" :-)

Re. statistical techniques - good idea, but that's outside my
experience. OCR is also new to me; I only started playing with it
this year. Part of the reason for putting up prizes behind this
challenge is to see how people crack this particular nut. I'll see if
I can do some reading via the link, cheers.

Re. dictionary - I'd overlooked that and was thinking of ways of using
the badly recognised text to vote on 'words that are more likely to be
good replacements'. I'll experiment with the dictionary using some of
the marked-up plaques.

Much obliged,
Ian.

Ian Ozsvald (A.I. Cookbook)

Jul 6, 2010, 5:41:46 PM
to tesser...@googlegroups.com
Thanks for the geo-location idea; I've updated the wiki having tried
wikilocation.org
http://aicookbook.com/wiki/Automatic_plaque_transcription

I've noticed that some of the relevant pages in Wikipedia aren't
geo-tagged but they're linked from geo-tagged pages, or direct
searches (perhaps with two passes on tesseract) sometimes reveal
useful pages. There's certainly good source material to use here.

I've tried your suggestion of adding some dictionary words but that
didn't change the quality of recognition. I edited:
/usr/local/share/tessdata/eng.user-words
which had 925 lines of data already (I'm using tesseract 3 via svn
built last night with 'sudo make install' on my MacBook).

I confirmed that TESSDATA_PREFIX points at this location (and made it
go elsewhere just to check that tesseract reported an error).

Having added:
1866
Gold
Albert
Medal
posthumously
1881
when recognising a black and white, thresholded version of:
http://www.flickr.com/photos/54145418@N00/4701399020/
it still fails to recognise the above words (I added these after the
first run of tesseract; they were the poorest recognised words). There
is no difference in the output file before/after adding these lines.

Am I doing something silly?

Cheers,
Ian.

Jimmy O'Regan

Jul 7, 2010, 2:22:35 PM
to tesser...@googlegroups.com
On 6 July 2010 22:41, Ian Ozsvald (A.I. Cookbook) <i...@aicookbook.com> wrote:
> Thanks for the geo-location idea, I've updated the wiki having tried
> wikilocation.org
> http://aicookbook.com/wiki/Automatic_plaque_transcription
>
> I've noticed that some of the relevant pages in WikiPedia aren't
> geo-tagged but they're linked from geo-tagged pages, or direct
> searches (perhaps with two passes on tesseract) sometimes reveal
> useful pages. There's certainly good source material to use here.
>

Yes; that's why I mentioned DBpedia. DBpedia uses Wikipedia's
structure to extract semantic data (RDF) and provides a public SPARQL
endpoint to query - I'll assume that, as you've mentioned AI, you
either know what I'm talking about, or these will be easy concepts for
you to pick up :)

There's also the GeoNames dataset, to allow reverse geolookup. I think
Freebase contains both GeoNames and DBpedia, so you're covered.
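
For what it's worth, here is a rough sketch of querying the public
endpoint for labels near a point; the query shape, the bounding-box
size and the coordinates are illustrative rather than a tested recipe:

import json
import urllib.parse
import urllib.request

def nearby_labels(lat, lon, box=0.02):
    # ask DBpedia's SPARQL endpoint for English labels of resources
    # whose coordinates fall inside a small bounding box
    query = """
    SELECT DISTINCT ?label WHERE {
      ?s geo:lat ?lat ; geo:long ?long ; rdfs:label ?label .
      FILTER(?lat > %f && ?lat < %f && ?long > %f && ?long < %f)
      FILTER(lang(?label) = "en")
    } LIMIT 100
    """ % (lat - box, lat + box, lon - box, lon + box)
    url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [b["label"]["value"] for b in data["results"]["bindings"]]

for label in nearby_labels(51.5072, -0.1276):   # illustrative coordinates
    print(label)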


> I've tried your suggestion of adding some dictionary words but that
> didn't change the quality of recognition. I edited:
> /usr/local/share/tessdata/eng.user-words
> which had 925 lines of data already (I'm using tesseract 3 via svn
> built last night with 'sudo make install' on my MacBook).
>
> I confirmed that TESSDATA_PREFIX points at this location (and made it
> go elsewhere just to check that tesseract reported an error).
>
> Having added:
> 1866
> Gold
> Albert
> Medal
> posthumously
> 1881
> when recognising a black and white, thresholded version of:
> http://www.flickr.com/photos/54145418@N00/4701399020/
> it still fails to recognise the above words (I added these after the
> first run of tesseract, they were the poorest recognised words). There
> is no difference in the output file before/after adding these lines.
>
> Am I doing something silly?
>

Well, adding /those/ words shouldn't make much of a difference - they
should be in the normal dictionaries - the suggestion was intended
more for less common proper names.

Also, Tesseract uses an adaptive classifier, so concatenating multiple
images together (say, a multi-page TIFF) should give much better
results than running on each page individually, but it occurs to me
now that persisting the classifier's state would be useful in a
variety of other areas, such as business card scanning.
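
A minimal sketch of that, assuming a recent Pillow and made-up
filenames:

import subprocess
from PIL import Image

pages = ["plaque1.tif", "plaque2.tif", "plaque3.tif"]
images = [Image.open(p).convert("L") for p in pages]   # greyscale pages

# write all pages into one multi-page TIFF
images[0].save("plaques_multi.tif", save_all=True, append_images=images[1:])

# a single tesseract run over every page; output lands in plaques_multi.txt
subprocess.check_call(["tesseract", "plaques_multi.tif", "plaques_multi"])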

Ian Ozsvald (A.I. Cookbook)

Aug 25, 2010, 5:31:24 PM
to tesser...@googlegroups.com
This is a quick update on the challenge - one of my collaborators has
posted an update which brings our average error down to 33 characters
per plaque (and in so doing he wins this month's prize). The error is
still too high but he's brought in a nice blue-region detector which
lets us isolate the right region of the image:
http://blog.aicookbook.com/2010/08/automatic-plaque-transcription-pytesseract-average-error-down-to-33-4/
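
As a very rough sketch of the general idea (my own illustration with
OpenCV rather than the winning code; the filename is hypothetical and
the HSV range is a guess that would need tuning):

import cv2
import numpy as np

img = cv2.imread("plaque_photo.jpg")                    # hypothetical input
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# rough HSV range for the blue of English Heritage plaques
mask = cv2.inRange(hsv, np.array([95, 80, 40]), np.array([130, 255, 255]))

# keep the largest blue blob and crop to its bounding box (OpenCV 4 signature)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    cv2.imwrite("plaque_region.png", img[y:y + h, x:x + w])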

I'm planning to present our results at an Open Day for the OpenPlaques
project at the end of September, and I'm hoping to put in some time on
the project in the next few weeks. Cleaning up the recognised
result (with e.g. Jimmy's n-gram suggestion) will soon be on the
agenda.

If anyone here is interested in contributing there's an on-going £25
monthly prize for the best open-source solution to the problem.

Cheers,
Ian.

On 7 July 2010 20:26, Jimmy O'Regan <jor...@gmail.com> wrote:


> On 6 July 2010 10:09, Ian Ozsvald (A.I. Cookbook) <i...@aicookbook.com> wrote:
>> Hi Jimmy. Thanks for the ideas.
>>
>> The plaques do crop up everywhere - I got some (square stone ones) in
>> Barcelona last year. I want to focus on the blue English Heritage ones
>> first as they represent a large part of the corpus and are most common
>> in the UK.
>>
>> Re. the template phrases - absolutely. A lot of them appear to have
>> sensible phrase types along with good named entities. I like the idea
>> of adding geotags into the mix, I had been mulling the idea of
>> searching for the named entites in Freebase and perhaps voting on the
>> results.
>>
>> The Freebase/WikiPedia links are useful too as they'd provide extra
>> annotations for OpenPlaques in the final result.
>>
>> Re. 'crazy' - only crazy as in "the sysadmins time is spent mostly
>> typing in plaques rather than working on ways of getting other
>> people/tools to enter the plaques in a more scaleable fashion" :-)
>>
>> Re. statistical techniques - good idea but that's out of my world of
>> experience. OCR is also recent for me, I only started playing with it
>> this year. A part of the reason for putting up prizes behind this
>> challenge is to see how people crack this particular nut. I'll see if
>> I can do some reading via the link, cheers.
>>
>

> The statistical stuff is actually quite easy, in this case: very
> similar to the task of statistical spell checking. This article:
> http://norvig.com/spell-correct.html is a good introduction.
>
> Statistical post-editing for OCR usually only works with character
> n-grams: you can correct 'ehurch' to 'church' (as long as the
> corrector is set to look for 'e' where there should be 'c') simply
> because <START_OF_WORD> 'e' 'h' is an extremely unlikely combination,
> whereas <START_OF_WORD> 'c' 'h' will have a probability approaching
> 1.0; this only helps with ambiguous characters in unambiguous words,
> though; an n-gram language model does the same on a word level: 'are'
> and 'arc' are both valid words, so character statistics are no real
> help, but the statistics from the preceding words are: 'the are' is
> extremely unlikely, while 'the arc' is quite likely.
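
As a toy illustration of the character-level point above, with a
made-up miniature corpus and the 'e'/'c' confusion pair hard-coded:

from collections import defaultdict

def char_bigram_counts(words):
    counts = defaultdict(int)
    for w in words:
        padded = "^" + w.lower()            # '^' marks the start of a word
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += 1
    return counts

# tiny corpus; in practice, use the existing transcriptions
counts = char_bigram_counts(["church", "chapel", "charles", "here", "the", "each"])

def score(word):
    padded = "^" + word.lower()
    return sum(counts[(a, b)] for a, b in zip(padded, padded[1:]))

# 'e' and 'c' are a common OCR confusion, so consider both readings
print(max(["ehurch", "church"], key=score))   # -> church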
