Due to the discussion about indexing the records being published by Reclaim the Records, I have some questions about the current state of automated indexing of handwritten records using machine learning, computer vision, or OCR (I don't know which term is best used here).

1) What is the current status of machine learning, computer vision, and OCR with regard to handwritten genealogy records? I know that BYU, FamilySearch, and other organizations have worked on solving this problem.
2) What are the current limitations of using machine learning or OCR on handwritten records?

3) Is the lack of large data sets part of what limits automated indexing of handwritten records? Would there be value in generating a large public data set for this?
4) What value could there be in sponsoring a contest on Kaggle using some of the records being published by Reclaim the Records?
There's been a lot of progress in the field over the last 3-4 years by a European academic consortium called Transcriptorium. More recently, they've launched Transkribus, a publicly accessible version of their software for reading historic documents, and (I see now) have focused on a few new projects like READ. The Transkribus folks did a really nice webinar for iDigBio a few months ago, which is still online.
- They've pretty much completely punted on layout analysis and line finding for the time being as being "too hard"
- handwriting search is a completely separate task which is only loosely related to the transcription task (it's pixel-based and doesn't "know" the text of what it's searching for)
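To illustrate what "pixel-based" search means: a query-by-example word spotter slides a template image of the query word across the page and looks for visually similar patches, without ever recognizing any text. This is just my own toy sketch of that idea using normalized cross-correlation, not Transkribus's actual method; all the names are mine, and real systems use far more robust features than raw pixels.

```python
import numpy as np

def spot_word(page, template):
    """Query-by-example word spotting: slide a template image over a
    page image and return the top-left corner of the best match,
    scored by normalized cross-correlation. No text is recognized."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -1.0, (0, 0)
    for y in range(page.shape[0] - th + 1):
        for x in range(page.shape[1] - tw + 1):
            w = page[y:y + th, x:x + tw]
            wc = w - w.mean()
            denom = np.sqrt((wc ** 2).sum()) * t_norm
            if denom == 0:
                continue
            score = (wc * t).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score

# Synthetic demo: plant a distinctive "word" blob on a noisy page.
rng = np.random.default_rng(0)
page = rng.random((40, 60)) * 0.1
word = np.zeros((5, 8))
word[1:4, 1:7] = 1.0          # toy glyph standing in for a handwritten word
page[12:17, 30:38] += word    # plant it at row 12, column 30
pos, score = spot_word(page, word)
print(pos)  # (12, 30): the planted location
```

Note that the matcher finds the planted location purely from pixel similarity, which is also why it degrades badly across different hands and inks.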
On Saturday, April 30, 2016 at 4:32:56 PM UTC-5, Tom Morris wrote:

- They've pretty much completely punted on layout analysis and line finding for the time being as being "too hard"
That's interesting, and it's frustrating for our efforts with the NYC Marriage Index.
I'm aware of an effort to do this called TILT, run by (among others) Desmond Schmidt. He did some really nice work on the William Brewster Field Books taking full-page plaintext transcripts and linking the individual words and lines to the relevant parts of the page facsimiles. (Imagine deriving an OCR-like set of bounding boxes from a .txt and .jpg file -- that's what we're talking about.) TILT may be open source, and I'd be astonished if it didn't have some pretty good line- and word-recognition algorithms baked into it.
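TILT's actual algorithms are surely more sophisticated than this, but as a toy illustration of the .txt + .jpg idea: segment a clean binary page image into word boxes using projection profiles (split on empty rows, then empty columns), and pair the boxes with the transcript's words in reading order. Everything here is my own sketch, and it assumes deskewed, well-separated writing with a one-to-one word/box correspondence, which real handwriting rarely gives you.

```python
import numpy as np

def word_boxes(binary):
    """Find word bounding boxes in a binary image (ink = 1) via
    projection profiles: split into lines at empty rows, then split
    each line into words at empty columns."""
    boxes = []
    rows = binary.any(axis=1)
    y = 0
    while y < len(rows):
        if not rows[y]:
            y += 1
            continue
        y0 = y
        while y < len(rows) and rows[y]:
            y += 1
        line = binary[y0:y]
        cols = line.any(axis=0)
        x = 0
        while x < len(cols):
            if not cols[x]:
                x += 1
                continue
            x0 = x
            while x < len(cols) and cols[x]:
                x += 1
            boxes.append((y0, x0, y, x))  # (top, left, bottom, right)
    return boxes

def align(transcript, binary):
    """Pair transcript words with image boxes in reading order."""
    words = transcript.split()
    boxes = word_boxes(binary)
    assert len(words) == len(boxes), "toy method needs a perfect segmentation"
    return list(zip(words, boxes))

# Synthetic page: two "words" on one line, one on a second line.
img = np.zeros((20, 30), dtype=int)
img[2:6, 2:8] = 1     # word 1
img[2:6, 12:20] = 1   # word 2
img[10:14, 5:15] = 1  # word 3
print(align("mayflower compact 1620", img))
```

The hard part, of course, is everywhere this sketch cheats: skewed lines, touching words, and transcripts that don't segment one-to-one, which is exactly where TILT's real line- and word-finding work comes in.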