Thanks Tom, I like the codebase of ocropus much more - there is some really interesting stuff in there. My language of choice is Python, so there is that too. Having said that, one must use the right tool for the job. It does seem that tesseract is giving much better results than when I last tried it. However, this is probably because I've implemented awesome preprocessing now :)
Over the last day I've dug through the code and thought I'd report my findings, as documentation is pretty light at the moment. IMHO this is a major hurdle for the project, as it makes it very difficult for potential contributors to get to the point where they can submit pull requests/patches. I'd be happy to add some documentation on my workflow when I figure out exactly what it is :)
Documentation that was helpful:
-------
- Training examples
- Source code comments for all commands in ocropy folder
- Notebook folder - which makes use of IPython's notebook tool, which was new to me. But trust me - opening the notebooks in the tool is much better than reading the raw JSON files. Check which branch/tag you are looking at. I think Tom added some more notebooks back in Dec 12.
How I started to build a character model
------
I gave up on creating ground truth at the line level in the absence of a tool that would help me. I was hardly going to create a text file for each line and manually populate it with data from my page level ground truth. I'm sure I'm missing something here, but I think most people on the list must be enjoying the weekend.
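For what it's worth, the manual step could probably be scripted. Below is a minimal Python 3 sketch, not anything that ships with ocropus. It assumes (my assumptions, not confirmed by the code) that the line images for a page sort into reading order by file name, that the page level ground truth has exactly one text line per line image, and that the per-line files should sit next to the images with a `.gt.txt` suffix. The function names are made up for illustration.

```python
import os


def split_page_gt(page_gt_text, line_image_paths, gt_suffix=".gt.txt"):
    """Pair each non-empty line of page level ground truth with a line image.

    Returns a list of (gt_path, text) tuples. Raises ValueError when the
    counts differ, because a silent mismatch would mislabel every later line.
    The ".gt.txt" suffix is an assumed naming convention.
    """
    texts = [t.strip() for t in page_gt_text.splitlines() if t.strip()]
    images = sorted(line_image_paths)  # assumes file names sort in reading order
    if len(texts) != len(images):
        raise ValueError(
            "%d text lines but %d line images" % (len(texts), len(images)))
    pairs = []
    for image_path, text in zip(images, texts):
        directory, name = os.path.split(image_path)
        stem = name.split(".", 1)[0]  # drop every extension: 010001.bin.png -> 010001
        pairs.append((os.path.join(directory, stem + gt_suffix), text))
    return pairs


def write_line_gt(pairs):
    """Write one small UTF-8 text file per line."""
    for gt_path, text in pairs:
        with open(gt_path, "w", encoding="utf-8") as handle:
            handle.write(text + "\n")
```

The count check is the important part: if the line segmenter splits or merges a line, the script stops instead of writing shifted labels.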
Instead I took Tom's advice and turned to tesseract to generate box files. I didn't bother editing these, as you can do that in the veeerrry nice ocropus-cedit tool. All that was required was passing the 'tess2h5' argument to the ocropus-db command. (Note: this does not show up in the help, so dig into the source. It also required specifying an -o output file, which was not documented in the examples.) Then, running ocropus-cedit, I could correct the errors tesseract made.
And that's pretty much where I'm up to.
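In case it helps anyone who wants to inspect the box files before importing them, here is a small Python sketch of a parser. It is not part of ocropus or tesseract. It assumes the usual box file layout of one glyph per line followed by left, bottom, right and top pixel coordinates (origin at the bottom-left of the image) and a page number; lines without a page column are treated as page 0, which is my assumption about older files. The names `Box`, `parse_box_line`, `read_boxes` and `to_top_left` are made up for illustration.

```python
from collections import namedtuple

# One entry per glyph; coordinates use a bottom-left origin.
Box = namedtuple("Box", "glyph left bottom right top page")


def parse_box_line(line):
    """Parse a single box file line into a Box."""
    parts = line.split()
    if len(parts) == 6:
        page = int(parts[5])
    elif len(parts) == 5:
        page = 0  # assumption: no page column means a single-page file
    else:
        raise ValueError("unexpected box line: %r" % line)
    left, bottom, right, top = (int(value) for value in parts[1:5])
    return Box(parts[0], left, bottom, right, top, page)


def read_boxes(text):
    """Parse every non-blank line of a box file."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


def to_top_left(box, image_height):
    """Return (x0, y0, x1, y1) with the origin moved to the top-left corner."""
    return (box.left, image_height - box.top,
            box.right, image_height - box.bottom)
```

The `to_top_left` helper is there because most Python imaging code counts rows from the top of the image, while the box files count from the bottom.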
Other thoughts
------
- I'd love to get my head around generating the page level gt. I believe this relied on OpenFST which I tried to get working today, but it doesn't seem to be used any more by ocropus.
- What is the recommended way to submit changes and fixes? I've got several images that cause various components in the pipeline to fail. I've gone in and added some try/excepts to make it fail gracefully. I'm more familiar with GitHub.
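To give an idea of the kind of try/except change I mean, here is a generic Python sketch. It is not the actual patch and it does not use any ocropus function; `process_all` and the `process_one` callable are placeholders for whichever pipeline step is failing.

```python
import logging

logger = logging.getLogger("pipeline")


def process_all(paths, process_one):
    """Run process_one on every path, recording failures instead of aborting.

    Returns (results, failures): results maps path -> output, and failures
    is a list of (path, exception) for the images that could not be processed.
    """
    results, failures = {}, []
    for path in paths:
        try:
            results[path] = process_one(path)
        except Exception as error:  # broad on purpose: one bad image should not stop the batch
            logger.warning("skipping %s: %s", path, error)
            failures.append((path, error))
    return results, failures
```

Keeping the list of failures means the problem images can still be collected afterwards and attached to a bug report.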
Okay, time for some rest.
Thanks for all your efforts, developers! It's great to see how the project is coming along 2 years on.
Cheers,
Nathan