You'll remember that I wanted to end the last update to the OCR challenge just over a week back so I could present the results at the OpenPlaques Open Day. The demo day went well and I got a nice write-up from another attendee:
http://claireyross.wordpress.com/2010/09/28/open-plaques-more-than-pl...
"Ian Ozsvald pretty much blow my mind. He talked about Applying machine vision to plaques. Exactly, it was hardcore. In essence figuring out how to make a computer do auto transcription of photos! Within two months of collaborative effort Ian and a clever team have created some very clever algorithms which crop a blue plaque image, convert it to black and white, remove noise, remove the pesky English heritage logo, and run a spell check to get a pretty convincing transcription."
I've also sent Jonathan his second £25 prize for advancing the challenge (albeit it was a small improvement...but it was an improvement nonetheless).
The challenge continues: the next deadline is 31st October (end of this month) for the next £25 progress prize. We've almost got to 0 errors for some plaques and I reckon that removing the curved text will give a big improvement to the scores (I think the curved text confuses tesseract: it tries to learn the curved text, which hurts its performance on the later straight text). If someone can figure out how to blank the curved text then they should easily win the next progress prize...
Cheers, Ian.
-- Ian Ozsvald (A.I. researcher, screencaster) i...@IanOzsvald.com
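One very rough way to picture "blanking the curved text" is to keep only a central disc of the cropped plaque and paint everything outside it white. This is only a sketch under stated assumptions: the plaque is assumed to be already cropped so that it fills the image, the curved text is assumed to sit in the outer ring, and the image is represented as a plain list of rows of 0-255 grey values (a hypothetical representation chosen to keep the example dependency-free). The function name and the `inner_frac` value are made up for illustration and are not taken from the challenge code.

```python
import math


def blank_outer_ring(pixels, inner_frac=0.7, fill=255):
    """Return a copy of a greyscale image with everything outside a
    central disc set to `fill`.

    pixels     -- list of rows, each a list of 0-255 ints (assumed format)
    inner_frac -- fraction of the half-width to keep (untuned guess)
    """
    height = len(pixels)
    width = len(pixels[0])
    # Centre of the image, assumed to coincide with the plaque centre.
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    keep_radius = inner_frac * min(height, width) / 2.0

    blanked = []
    for y, row in enumerate(pixels):
        new_row = []
        for x, value in enumerate(row):
            # Pixels further from the centre than keep_radius are blanked.
            if math.hypot(x - cx, y - cy) > keep_radius:
                new_row.append(fill)
            else:
                new_row.append(value)
        blanked.append(new_row)
    return blanked


if __name__ == "__main__":
    dark = [[0] * 9 for _ in range(9)]
    for row in blank_outer_ring(dark):
        print(" ".join("%3d" % v for v in row))
```

A fixed ring like this would also wipe any straight text that reaches the edge of the plaque, so restricting the mask to the top of the image, or detecting the text baseline curvature, would probably be needed in practice.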
I hope so :-) Ideas, pseudo code etc for solving the curved-text-detection-and-removal problem would be super appreciated, I really think that's the next big step to cut down the larger errors from tesseract. i.
On 4 October 2010 13:17, Jonathan Street <streetjonat...@gmail.com> wrote:
> That's a really nice write-up. I'm glad it went well.
> Hopefully we can continue to improve in the month ahead.
> On 2 October 2010 23:32, Ian Ozsvald <i...@ianozsvald.com> wrote:
My submission for the month includes one function for cleaning up the image and then a lot of text clean-up. I finally got the score down to 10.867. Most of the improvement is in the text clean-up.
From the file in the github repository I have:
Reduced the size of the images to speed up processing
Removed the black circle around the plaque and the curved text at the top of the image (this is far from perfect though it does reduce the score to 13.67)
Made various improvements to the regexes for cleaning up the years
Converted any instances of 'vv' (two v's) to 'w' (one w)
Switched 0 (zero) to o (letter o) in words
Removed any one/two character tokens from the end of the string
Improved the selection of suggestions from the spell checker
Broken up long words to see if a valid word can be found in the two halves
Changed "s to 's
Improved correction for endings where the ending is lived|worked|died here and the spelling checker returns bad results
Removed any words containing three of lowercase, uppercase, digits and punctuation. The regex for this is something of a monstrosity and probably deeply flawed. Check it out starting on line 163.
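Several of the text clean-up steps listed above can be sketched in a few lines. This is not the code from the repository: every function name here is hypothetical, the mixed-character check counts character classes rather than using a regex, and the choice to ignore apostrophes and hyphens when counting punctuation is an assumption made so that ordinary tokens such as "Dickens's" survive. The spell-checker and year-regex steps are left out because they depend on details not given in the message.

```python
import re
import string

# Punctuation that counts as "noise". Apostrophes and hyphens are left out
# so that "Dickens's" or "1812-1870" are not thrown away (an assumption).
NOISE_PUNCT = set(string.punctuation) - {"'", "-"}


def zeros_to_o(text):
    """Turn 0 into o in tokens that contain letters and no other digits."""
    def fix(match):
        word = match.group(0)
        has_letter = any(c.isalpha() for c in word)
        other_digits = any(c.isdigit() and c != "0" for c in word)
        return word.replace("0", "o") if has_letter and not other_digits else word
    return re.sub(r"\w+", fix, text)


def is_noisy(token):
    """True if the token mixes three or more of: lowercase, uppercase,
    digits, punctuation (leading/trailing punctuation is ignored)."""
    core = token.strip(string.punctuation)
    classes = [
        any(c.islower() for c in core),
        any(c.isupper() for c in core),
        any(c.isdigit() for c in core),
        any(c in NOISE_PUNCT for c in core),
    ]
    return sum(classes) >= 3


def strip_short_trailing_tokens(tokens):
    """Drop one- and two-character tokens from the end of the token list."""
    tokens = list(tokens)
    while tokens and len(tokens[-1]) <= 2:
        tokens.pop()
    return tokens


def split_long_word(word, vocabulary):
    """Return (left, right) if both halves are known words, else None."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left.lower() in vocabulary and right.lower() in vocabulary:
            return left, right
    return None


def clean_transcription(text):
    text = text.replace("vv", "w")              # 'vv' misread for 'w'
    text = zeros_to_o(text)                     # 'w0rked' -> 'worked'
    text = re.sub(r'(?<=\w)"s\b', "'s", text)   # Dickens"s -> Dickens's
    tokens = [t for t in text.split() if not is_noisy(t)]
    return " ".join(strip_short_trailing_tokens(tokens))


if __name__ == "__main__":
    print(clean_transcription('Charles Dickens"s vvife w0rked here x9'))
    print(split_long_word("livedhere", {"lived", "here"}))
```

Note that `clean_transcription` joins tokens with single spaces, so any line breaks in the OCR output are lost; whether that matters depends on how the score is computed. The blanket 'vv' replacement will also damage the rare real word containing "vv".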
Most excellent! You win the progress prize again, you're up to £75 towards electronics now :-) I'll sort out the transfer in a couple of days.
I'm up to my eyeballs at present, I'll get your code down by Friday (sorry, so much going on right now) and get it merged in to the github repo. My thanks for your continued efforts :-)
I'll also get a blog post up over the weekend. Are you going to do another blog entry? If so I'll link through to that when I post (which'll go to Planet Python); I'll also post to the tesseract group and maybe a few others.
The November challenge is now open, the closing date for the next £25 progress prize is end-of-Nov.
Cheers! Ian.
On 31 October 2010 00:06, Jonathan Street <streetjonat...@gmail.com> wrote:
> On 4 October 2010 14:01, Ian Ozsvald (A.I. Cookbook) <i...@aicookbook.com> wrote:
Yeah, I'll try and put a blog post together over the next day or two. I quite like the idea of including an xkcd comic in a blog post and this might be the best chance I get: http://xkcd.com/809/
I wonder whether it might be a good time to refactor some of the code. Parts are looking a little messy, mainly due to me I suspect. I'll probably look at cleaning some of it up before I put a post together.
On 2 November 2010 09:39, Ian Ozsvald (A.I. Cookbook) <i...@aicookbook.com> wrote:
> On 31 October 2010 00:06, Jonathan Street <streetjonat...@gmail.com> wrote:
> > Following our discussion last weekend I also played around with attaching known good images onto the top of 'unknown' plaques but it didn't seem to help.
> > This is my current code exploring this: