OCR update - Oct 31st for next Progress Prize
From: Ian Ozsvald <i...@ianozsvald.com>
Date: Sat, 2 Oct 2010 23:32:04 +0100
Local: Sat, Oct 2 2010 6:32 pm
Subject: OCR update - Oct 31st for next Progress Prize
You'll remember that I wanted to end the last update to the OCR
challenge just over a week back so I could present the results at the
OpenPlaques Open Day. The demo day went well and I got a nice write-up
from another attendee:
http://claireyross.wordpress.com/2010/09/28/open-plaques-more-than-pl...
"Ian Ozsvald pretty much blow my mind. He talked about Applying
machine vision to plaques.  Exactly, it was hardcore. In essence
figuring out how to make a computer do auto transcription of photos!
Within two months of collaborative effort Ian and a clever team have
created some very clever algorithms which crop a blue plaque image,
convert it to black and white, remove noise, remove the pesky English
heritage logo, and run a spell check to get a pretty convincing
transcription."

I've also sent Jonathan his second £25 prize for advancing the
challenge (albeit it was a small improvement...but it was an
improvement nonetheless).

The challenge continues: the next deadline is 31st October (end of
this month) for the next £25 progress prize. We've almost got to
0 errors for some plaques, and I reckon that removing the curved text
will give a big improvement to the scores (I think the curved text
confuses tesseract - it tries to learn curved text, which hurts its
performance on the later straight text). If someone can figure out how
to blank the curved text then they should easily win the next progress
prize...
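
One way to picture blanking the curved text is to paint over everything
outside a central disc of the plaque. The sketch below is only an
illustration under assumptions that are not from this thread: the plaque is
assumed to be centred and to fill the image, the image is a plain list of
rows of greyscale values (0 = black, 255 = white), and the name
blank_outer_ring and the keep_frac parameter are made up for the example.
Loading and saving the image is left out.

```python
def blank_outer_ring(pixels, keep_frac=0.75, fill=255):
    """Return a copy of a greyscale image with everything outside a
    central disc painted with `fill`.

    Hypothetical helper: assumes the plaque is centred in the image and
    that the curved text lives in the outer ring of the plaque.
    """
    height = len(pixels)
    width = len(pixels[0])
    # Centre of the image in pixel coordinates.
    cy = (height - 1) / 2.0
    cx = (width - 1) / 2.0
    # Radius of the largest circle that fits inside the image.
    radius = min(height, width) / 2.0
    limit_sq = (keep_frac * radius) ** 2

    out = []
    for y, row in enumerate(pixels):
        new_row = []
        for x, value in enumerate(row):
            dist_sq = (y - cy) ** 2 + (x - cx) ** 2
            # Keep pixels inside the disc, overwrite the rest.
            new_row.append(value if dist_sq <= limit_sq else fill)
        out.append(new_row)
    return out
```

The value of keep_frac would need tuning per plaque, and a plain disc mask
also wipes any straight text that sits close to the rim, so this is a
starting point rather than a solution.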

Cheers,
Ian.

--
Ian Ozsvald (A.I. researcher, screencaster)
i...@IanOzsvald.com

http://IanOzsvald.com
http://MorConsulting.com/
http://blog.AICookbook.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald


 
From: Jonathan Street <streetjonat...@gmail.com>
Date: Mon, 4 Oct 2010 13:17:16 +0100
Local: Mon, Oct 4 2010 8:17 am
Subject: Re: OCR update - Oct 31st for next Progress Prize

That's a really nice write-up.  I'm glad it went well.

Hopefully we can continue to improve in the month ahead.



 
From: "Ian Ozsvald (A.I. Cookbook)" <i...@aicookbook.com>
Date: Mon, 4 Oct 2010 14:01:24 +0100
Local: Mon, Oct 4 2010 9:01 am
Subject: Re: OCR update - Oct 31st for next Progress Prize
I hope so :-) Ideas, pseudocode, etc. for solving the
curved-text-detection-and-removal problem would be super appreciated;
I really think that's the next big step to cut down the larger errors
from tesseract.
i.




 
From: Jonathan Street <streetjonat...@gmail.com>
Date: Sun, 31 Oct 2010 00:06:03 +0100
Local: Sat, Oct 30 2010 7:06 pm
Subject: Re: OCR update - Oct 31st for next Progress Prize

My submission for the month includes one function for cleaning up the image
and then a lot of text clean-up.  I finally got the score down to 10.867.
Most of the improvement is in the text clean-up.

The results file is at
http://jonathanstreet.com/downloads/aicookbook-comp/comp3_results.csv
The python file is at
http://jonathanstreet.com/downloads/aicookbook-comp/transcribe_plaque...

From the file in the github repository I have:
- Reduced the size of the images to speed up processing
- Removed the black circle around the plaque and the curved text at the top
  of the image (this is far from perfect though it does reduce the score to
  13.67)
- Made various improvements to the regexes for cleaning up the years
- Converted any instances of 'vv' (two v's) to 'w' (one w)
- Switched 0 (zero) to o (letter o) in words
- Removed any one/two character tokens from the end of the string
- Improved the selection of suggestions from the spell checker
- Broken up long words to see if a valid word can be found in the two halves
- Changed "s to 's
- Improved correction for endings where the ending is lived|worked|died here
  and the spelling checker returns bad results
- Removed any words containing three of lowercase, uppercase, digits and
  punctuation.  The regex for this is something of a monstrosity and probably
  deeply flawed.  Check it out starting on line 163.
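
A few of the text clean-up rules listed above could be sketched roughly as
follows. This is not the code from the submitted python file: the function
names, the order of the steps and the exact character classes are
assumptions made for illustration. In particular, apostrophes and hyphens
are not counted as punctuation here, and punctuation at the ends of a token
is ignored, so that ordinary words such as "London," survive.

```python
import re
import string

# Assumption: apostrophes and hyphens are treated as part of normal words.
PUNCT = set(string.punctuation) - {"'", "-"}


def fix_vv(text):
    # OCR often reads 'w' as two 'v' characters.
    return text.replace("vv", "w")


def zero_to_o(text):
    # Swap 0 for o only in tokens that contain letters and no other
    # digits, so years such as 1905 or 1900s are left alone.
    def fix(token):
        has_letter = any(c.isalpha() for c in token)
        other_digits = any(c.isdigit() and c != "0" for c in token)
        if has_letter and not other_digits:
            return token.replace("0", "o")
        return token
    return " ".join(fix(t) for t in text.split(" "))


def fix_possessive(text):
    # Change "s to 's, e.g. Dickens"s becomes Dickens's.
    return re.sub(r'"s\b', "'s", text)


def is_noise(token):
    # A token is noise if it mixes three or more of: lowercase,
    # uppercase, digits, punctuation (after trimming edge punctuation).
    core = token.strip(string.punctuation)
    classes = [
        any(c.islower() for c in core),
        any(c.isupper() for c in core),
        any(c.isdigit() for c in core),
        any(c in PUNCT for c in core),
    ]
    return sum(classes) >= 3


def drop_noise_tokens(text):
    return " ".join(t for t in text.split() if not is_noise(t))


def strip_short_tail(text):
    # Remove one/two character tokens from the end of the string.
    tokens = text.split()
    while tokens and len(tokens[-1]) <= 2:
        tokens.pop()
    return " ".join(tokens)


def clean(text):
    steps = (fix_vv, zero_to_o, fix_possessive,
             drop_noise_tokens, strip_short_tail)
    for step in steps:
        text = step(text)
    return text
```

Splitting the rules into small functions keeps each one easy to test and
reorder. The blanket 'vv' replacement will also damage genuine words that
contain a double v, which a spell checker would have to catch afterwards.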

Following our discussion last weekend I also played around with attaching
known good images onto the top of 'unknown' plaques, but it didn't seem to
help.
This is my current code exploring this:
http://jonathanstreet.com/downloads/aicookbook-comp/transcribe_plaque...
http://jonathanstreet.com/downloads/aicookbook-comp/1.tif
Currently this gives a score of 13.93
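
The idea of attaching a known good image above an unknown plaque amounts to
stacking two images vertically before OCR. A minimal sketch, assuming
greyscale images held as plain lists of rows and a made-up helper name
stack_images (this is not the linked code, and reading or writing the .tif
files is left out):

```python
def stack_images(top, bottom, fill=255):
    """Place `top` above `bottom`, padding the narrower image on the
    right with `fill` (white) so that every row has the same width.

    Hypothetical helper for illustration only.
    """
    width = max(len(top[0]), len(bottom[0]))

    def pad(img):
        # Copy each row and extend it to the common width.
        return [row + [fill] * (width - len(row)) for row in img]

    return pad(top) + pad(bottom)
```

The known good image goes first so that its text is the first thing the OCR
engine reads.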



 
From: "Ian Ozsvald (A.I. Cookbook)" <i...@aicookbook.com>
Date: Tue, 2 Nov 2010 09:39:34 +0000
Local: Tues, Nov 2 2010 5:39 am
Subject: Re: OCR update - Oct 31st for next Progress Prize
Most excellent! You win the progress prize again, you're up to £75
towards electronics now :-) I'll sort out the transfer in a couple of
days.

I'm up to my eyeballs at present, I'll get your code down by Friday
(sorry, so much going on right now) and get it merged in to the github
repo. My thanks for your continued efforts :-)

I'll also get a blog post up over the wknd, are you going to do
another blog entry? If so I'll link through to that when I post out
(which'll go to Planet Python), I'll also post into the tesseract
group and maybe a few others.

The November challenge is now open, the closing date for the next £25
progress prize is end-of-Nov.

Cheers!
Ian.




 
From: Jonathan Street <streetjonat...@gmail.com>
Date: Tue, 2 Nov 2010 16:21:16 +0000
Local: Tues, Nov 2 2010 12:21 pm
Subject: Re: OCR update - Oct 31st for next Progress Prize

Yeah, I'll try and put a blog post together over the next day or two.  I
quite like the idea of including an xkcd comic in a blog post and this might
be the best chance I get: http://xkcd.com/809/

I wonder whether it might be a good time to refactor some of the code.
Parts are looking a little messy, mainly due to me I suspect.  I'll probably
look at cleaning some of it up before I put a post together.



 
From: Jonathan Street <streetjonat...@gmail.com>
Date: Thu, 4 Nov 2010 20:59:01 +0000
Local: Thurs, Nov 4 2010 4:59 pm
Subject: Re: OCR update - Oct 31st for next Progress Prize

The write-up is now up on my blog.

http://jonathanstreet.com/blog/third-aicookbook-challenge



 