Due to the discussion about indexing the records being published by Reclaim the Records, I have some questions about the current state of automated indexing of handwritten records using machine learning, computer vision, or OCR (I don't know which term is best used here).

1) What is the current status of machine learning, computer vision, and OCR with regard to handwritten genealogy records? I know that BYU, FamilySearch, and other organizations have worked on solving this problem.
2) What are the current limitations of using machine learning or OCR on handwritten records?

3) Is the lack of large data sets part of what limits automated indexing of handwritten records? Would there be value in generating a large public data set for this?
4) What value could there be in sponsoring a contest on Kaggle using some of the records being published by Reclaim the Records?
There's been a lot of progress in the field over the last 3-4 years by a European academic consortium called Transcriptorium. More recently, they've launched Transkribus, a publicly accessible version of their software for reading historic documents, and (I see now) have focused on a few new projects like READ. The Transkribus folks did a really nice webinar for iDigBio a few months ago, which is still online.
- They've pretty much completely punted on layout analysis and line finding for the time being as being "too hard"
- handwriting search is a completely separate task which is only loosely related to the transcription task (it's pixel-based and doesn't "know" the text of what it's searching for)
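To illustrate what "pixel-based" search means: a query-by-example word spotter slides a template image of the query word across the page and looks for visually similar patches, without ever recognizing any text. This is just my own toy sketch of that idea using normalized cross-correlation, not Transkribus's actual method; all the names are mine, and real systems use far more robust features than raw pixels.

```python
import numpy as np

def spot_word(page, template):
    """Query-by-example word spotting: slide a template image over a
    page image and return the top-left corner of the best match,
    scored by normalized cross-correlation. No text is recognized."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -1.0, (0, 0)
    for y in range(page.shape[0] - th + 1):
        for x in range(page.shape[1] - tw + 1):
            w = page[y:y + th, x:x + tw]
            wc = w - w.mean()
            denom = np.sqrt((wc ** 2).sum()) * t_norm
            if denom == 0:
                continue
            score = (wc * t).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score

# Synthetic demo: plant a distinctive "word" blob on a noisy page.
rng = np.random.default_rng(0)
page = rng.random((40, 60)) * 0.1
word = np.zeros((5, 8))
word[1:4, 1:7] = 1.0          # toy glyph standing in for a handwritten word
page[12:17, 30:38] += word    # plant it at row 12, column 30
pos, score = spot_word(page, word)
print(pos)  # (12, 30): the planted location
```

Note that the matcher finds the planted location purely from pixel similarity, which is also why it degrades badly across different hands and inks.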
On Saturday, April 30, 2016 at 4:32:56 PM UTC-5, Tom Morris wrote:

- They've pretty much completely punted on layout analysis and line finding for the time being as being "too hard"
That's interesting, and it's frustrating for our efforts with the NYC Marriage Index.
I'm aware of an effort to do this called TILT, run by (among others) Desmond Schmidt. He did some really nice work on the William Brewster Field Books taking full-page plaintext transcripts and linking the individual words and lines to the relevant parts of the page facsimiles. (Imagine deriving an OCR-like set of bounding boxes from a .txt and .jpg file -- that's what we're talking about.) TILT may be open source, and I'd be astonished if it didn't have some pretty good line- and word-recognition algorithms baked into it.
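TILT's actual algorithms are surely more sophisticated than this, but as a toy illustration of the .txt + .jpg idea: segment a clean binary page image into word boxes using projection profiles (split on empty rows, then empty columns), and pair the boxes with the transcript's words in reading order. Everything here is my own sketch, and it assumes deskewed, well-separated writing with a one-to-one word/box correspondence, which real handwriting rarely gives you.

```python
import numpy as np

def word_boxes(binary):
    """Find word bounding boxes in a binary image (ink = 1) via
    projection profiles: split into lines at empty rows, then split
    each line into words at empty columns."""
    boxes = []
    rows = binary.any(axis=1)
    y = 0
    while y < len(rows):
        if not rows[y]:
            y += 1
            continue
        y0 = y
        while y < len(rows) and rows[y]:
            y += 1
        line = binary[y0:y]
        cols = line.any(axis=0)
        x = 0
        while x < len(cols):
            if not cols[x]:
                x += 1
                continue
            x0 = x
            while x < len(cols) and cols[x]:
                x += 1
            boxes.append((y0, x0, y, x))  # (top, left, bottom, right)
    return boxes

def align(transcript, binary):
    """Pair transcript words with image boxes in reading order."""
    words = transcript.split()
    boxes = word_boxes(binary)
    assert len(words) == len(boxes), "toy method needs a perfect segmentation"
    return list(zip(words, boxes))

# Synthetic page: two "words" on one line, one on a second line.
img = np.zeros((20, 30), dtype=int)
img[2:6, 2:8] = 1     # word 1
img[2:6, 12:20] = 1   # word 2
img[10:14, 5:15] = 1  # word 3
print(align("mayflower compact 1620", img))
```

The hard part, of course, is everywhere this sketch cheats: skewed lines, touching words, and transcripts that don't segment one-to-one, which is exactly where TILT's real line- and word-finding work comes in.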