page level ground truth alignment in 0.6, old char model and collection of receipts?

Nathan K

Mar 23, 2013, 5:34:06 PM

to ocr...@googlegroups.com

Hey OCRopus Group,

It's been a while since I was in here, but I've just begun updating some old hacky scripts from 0.4.4 to 0.6. I'm very pleased to see the work that's been going on. Nice to see things more Pythonic! I can't figure out how to align the page-level ground truth to a page. My memory may be failing me, but I remember a very neat process where ocropus would automagically align page lines with a text transcription of the page. My goal is to regenerate my character training model, and also a language model. I would greatly appreciate any tips to that effect.

Also, have there been changes to the character models since 0.4.4? I tried to use an old one, which I remember doing quite a bit of work on, and it fails to unpickle.

Lastly, does anyone have or know of a collection/database of receipts that could be used for training? I've asked friends and family and have so far only received 50 documents, some of quite poor quality. Perhaps a couple of people keep digital records for tax purposes and would be happy to share. Happy to keep them confidential if required.

Cheers,

Nathan

Nathan K

Mar 23, 2013, 9:50:53 PM

to ocr...@googlegroups.com

Just to clarify - looking over the examples

fraktur-boxes says:

"The next training step consists of retraining the model by aligning text lines with ground truth (see the example in uw3-500)"

And in the uw3-500 example, the data is downloaded with ground truth already placed at the line level. Thus it is not clear how to automatically generate line-level ground truth from page-level ground truth text files. I remember there was a tool for this in the past; it worked on the principle of finding a line match that was 'close enough' according to a cost function. This enabled bootstrapping of a character model.

Is this approach still valid? I could generate a character model using clustering and then manually review the results and then iterate. This however would still not yield ground truth for determining the error, or generating a language model.
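For illustration, the 'close enough' line matching I remember could be sketched roughly as below. This is not the old ocropus tool: the function names, the greedy matching and the 0.6 threshold are all my own assumptions, and the cost function is simply the similarity ratio from Python's standard difflib module.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def align_lines(ocr_lines, gt_lines, threshold=0.6):
    """Greedily pair each OCR line with the most similar unused transcript line.

    Returns a list of (ocr_index, gt_index, score) tuples. OCR lines whose
    best match scores below `threshold` (an assumed cut-off) stay unpaired.
    """
    pairs = []
    used = set()
    for i, ocr in enumerate(ocr_lines):
        best_j, best_score = None, 0.0
        for j, gt in enumerate(gt_lines):
            if j in used:
                continue
            score = similarity(ocr, gt)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            used.add(best_j)
            pairs.append((i, best_j, best_score))
    return pairs


# Made-up example: noisy recognizer output versus a page-level transcript.
ocr_output = ["T0TAL  $12.5O", "Th4nk you for shoppinq", "~~~~"]
transcript = ["Thank you for shopping", "TOTAL $12.50"]
for i, j, score in align_lines(ocr_output, transcript):
    print(i, j, round(score, 2))
```

Lines that stay unpaired would still need manual review, and a greedy match like this can go wrong on pages with many near-identical lines (receipts have plenty of those), so treat it as a starting point only.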

Thanks for your assistance if you're in the know! Been pulling my hair out all day!

Cheers,

Nathan

--
You received this message because you are subscribed to the Google Groups "ocropus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ocropus+u...@googlegroups.com.
To post to this group, send email to ocr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msg/ocropus/-/I8eeJdqGLCoJ.
For more options, visit https://groups.google.com/groups/opt_out.

--

Nathan Keilar
Hunted Hive Web Studio ~ Innovative Solutions For Real World Problems
Technical Director and Business Manager

EMAIL:     i...@madteckhead.com
PHONE:   +61 (0) 7 3040 3065
SKYPE/TWITTER: https://twitter.com/#!/madteckhead
FACEBOOK:    http://www.facebook.com/nathan.keilar
WEB: http://madteckhead.com

Nathan K

Mar 25, 2013, 3:26:30 AM

to Tom Morris, ocr...@googlegroups.com

Thanks Tom. I like the ocropus codebase much more; there is some really interesting stuff in there. My language of choice is Python, so there is that too. Having said that, one must use the right tool for the job. Tesseract does seem to be giving much better results than when I last tried it. However, this is probably because I've implemented awesome preprocessing now :)

Over the last day I've dug through the code and thought I'd report my findings, as documentation is pretty light at the moment. IMHO this is a major hurdle for the project, as it makes it very difficult for potential contributors to get to the point where they can submit pull requests/patches. I'd be happy to add some documentation on my workflow when I figure out exactly what it is :)

Documentation that was helpful:

-------

- Training examples

- Source code comments for all commands in ocropy folder

- Notebook folder, which makes use of IPython's notebook tool (new to me). Trust me, it is much better than reading the JSON files. Check which branch/tag you are looking at. I think Tom added some more notebooks back in Dec 12.

How I started to build a character model

------

I gave up on creating ground truth at the line level in the absence of a tool that would help me. I was hardly going to create a text file for each line and manually populate it with data from my page-level ground truth. I'm sure I'm missing something here, but I think most people on the list must be enjoying the weekend.

Instead I took Tom's advice and turned to Tesseract to generate box files. I didn't bother editing these, as you can do that in the veeerrry nice ocropus-cedit tool. All that was required was passing the 'tess2h5' argument to the ocropus-db command. (Note: this does not show up in the help, so dig into the source; it also required specifying an -o file, which was not documented in the examples.) Then, running ocropus-cedit, I could correct the errors Tesseract made.

And that's pretty much where I'm up to.

Other thoughts

------

- I'd love to get my head around generating line-level gt from the page-level gt. I believe this relied on OpenFST, which I tried to get working today, but it doesn't seem to be used by ocropus any more.

- What is the recommended way to submit changes/fixes? I've got several images that cause various components in the pipeline to fail. I've gone in and added some try/excepts to make them fail gracefully. I'm more familiar with GitHub.
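To give an idea of the kind of try/except change I mean, here is a generic sketch. It is not my actual patch and not ocropus code: `process_all`, the logger name and the `fake_binarize` step are hypothetical names made up for this example.

```python
import logging

logger = logging.getLogger("pipeline")


def process_all(paths, step):
    """Apply `step` to each path, logging failures instead of aborting.

    Returns (results, failures): results maps path -> output, and failures
    maps path -> the exception raised for that path.
    """
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = step(path)
        except Exception as exc:  # broad on purpose: one bad image should not stop the batch
            logger.warning("skipping %s: %s", path, exc)
            failures[path] = exc
    return results, failures


def fake_binarize(path):
    """Hypothetical stand-in for a pipeline step that fails on some inputs."""
    if "corrupt" in path:
        raise ValueError("cannot read image")
    return path + ".bin"


results, failures = process_all(["a.png", "corrupt.png", "b.png"], fake_binarize)
print(sorted(results), sorted(failures))
```

Keeping the failures in a dict makes it easy to attach the offending images to a bug report afterwards.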

Okay, time for some rest.

Thanks for all your efforts, developers! It's great to see how the project is coming along 2 years on.

Cheers,

Nathan

-

On 24 March 2013 07:11, Tom Morris <tfmo...@gmail.com> wrote:

I can't help with your ground truth question, but unless you're absolutely committed to Ocropus, I'd suggest checking out Tesseract. My impression is that it's not only more mature, but it's got a much more active community supporting it.

Tom

Tom

Apr 10, 2013, 1:30:18 AM

to ocr...@googlegroups.com, Tom Morris

Hi,

Sorry for the long silence. OCRopus 0.7 is out now; it has a much higher-performance recognizer and comes with a trained Fraktur model that works quite well.

The new recognizer does not require any kind of box files for training, just text lines.
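As a rough sketch of what "just text lines" means for preparing data: each line image is paired with a plain-text transcription of that line. The helper below is not part of OCRopus, and the `.bin.png` / `.gt.txt` suffixes and the directory layout are assumptions here, so check the examples shipped with the release for the exact conventions.

```python
from pathlib import Path


def collect_training_pairs(root, image_suffix=".bin.png", text_suffix=".gt.txt"):
    """Pair each line image under `root` with its transcription file.

    Returns (pairs, missing): pairs is a list of (image_path, text) tuples,
    and missing lists the images that have no transcription next to them.
    The suffixes are assumed naming conventions, not confirmed ones.
    """
    root = Path(root)
    pairs, missing = [], []
    for image in sorted(root.rglob("*" + image_suffix)):
        stem = image.name[: -len(image_suffix)]
        gt = image.with_name(stem + text_suffix)
        if gt.is_file():
            pairs.append((image, gt.read_text(encoding="utf-8").strip()))
        else:
            missing.append(image)
    return pairs, missing
```

Running this over a directory of segmented lines before training shows quickly which lines still lack ground truth.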

Tom

Nathan Keilar

Jul 10, 2013, 2:57:36 AM

to ocr...@googlegroups.com

Ohhh... I'm excited about that! Nice work - thanks - look forward to trying it.

--

Nathan Keilar
Hunted Hive Web Studio ~ Innovative Solutions For Real World Problems
Technical Director and Business Manager

EMAIL: nat...@huntedhive.com
PHONE: +1 604 352 9067 (CAD) / +61 (0) 7 3040 3065 (AUS)
SKYPE/TWITTER: HuntedHive / https://twitter.com/HuntedHive

LINKEDIN: http://au.linkedin.com/in/nathankeilar
FACEBOOK: http://www.facebook.com/HuntedHive
WEB: http://huntedhive.com
