--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
We are only using Aletheia as a tool to identify each glyph and create tiff/box file pairs for each page processed. We are not using the PAGE format or anything like that.
In fact, I think getting Franken+ to work with Tesseract/jTessBoxEditor input should be a simple matter of adjusting the coordinate system that Franken+ is expecting in the incoming box files (since Tesseract and Aletheia box files have 0,0 in different corners).
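The coordinate adjustment described above is a one-line flip. A minimal sketch, assuming Tesseract's box format ("char left bottom right top page", with 0,0 at the bottom-left of the page) and a top-left origin on the Aletheia side; the helper name and argument order are illustrative, not from Franken+:

```python
def flip_box(left, top, right, bottom, page_h):
    """Convert a top-left-origin rectangle (as in Aletheia/PAGE output)
    to Tesseract's bottom-left-origin (left, bottom, right, top) order.
    page_h is the pixel height of the page image."""
    return left, page_h - bottom, right, page_h - top
```

For example, a glyph at top-left (10, 20) to bottom-right (30, 40) on a 100-pixel-tall page becomes (10, 60, 30, 80) in Tesseract's coordinates.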
Hi Janusz,

There are a couple of things I'd like to point out. First of all, you've mentioned 19th-century typefaces in the past, so I'm assuming that's what you're used to working with. We're dealing with 15th-18th-century documents. Like Bryan, I'm not a font history expert, but from what I've learned over the last year, I'm willing to bet that printing practices and standards in those early centuries of printing were a bit different from what they ended up being as everything became more established. Most of the typefaces we are looking at (if not all) were made by hand and so can have quite individual peculiarities. As Nick pointed out, it was not uncommon to create print blocks that contained two or three common letter combinations on one punch (I don't think that's the technically correct word, but I'll use it anyway). They were like ligatures in a way, even though the letters weren't actually connected. I'm going to call these unconnected ligatures just for ease of reference throughout this post.

If you look closely at this specimen sheet from the type caster Francois Guyot (http://collation.folger.edu/2011/09/guyots-speciman-sheet/) you'll see a number of such unconnected ligatures, and we've seen others, as Bryan noted. You'll also see a number of upper-case letters which overhang or run under their adjacent letters; the upper-case Q is a common example. Most of these are in the italics set, but not all.

Owing to the individualistic nature of these typefaces, we are faced with the possibility of having to train Tesseract on every possible typeface--something that is prohibitively expensive, if even possible. We have used Aletheia to train several different typefaces so far, but if we tried to create training for every hand-made typeface created over the course of 250 years, we would never finish. Thankfully, certain type casters were quite influential, and some typefaces in certain places became "fashionable", so typefaces from different casters can often be quite similar to each other. But just because a type caster made his 'e' look like Guyot's 'e' doesn't mean that he didn't also decide to create a bunch of unconnected ligatures in his type set, or didn't create the same ones that Guyot thought were important, and so on. In fact, due to the inconsistent output of printing presses from this time, I've found that two lower-case e's from specimen sheets produced 200 years apart can look more like each other than two lower-case e's printed on the same page of a single document using one of those typefaces. Therefore we are pursuing the possibility of training Tesseract to recognize "families" of typefaces which are similar enough to each other that they won't require training Tesseract for each typeface (not to mention the problem of then identifying the documents in our collections which use each typeface).

Doing this, however, means that the idea of training Tesseract (using only square boxes) to recognize every possible unconnected ligature in our corpus would again be prohibitively expensive (both in terms of time and the expertise required), and probably not possible. If we only used boxes in training Tesseract, we'd have to closely examine every document we would be OCR'ing with that training in order to make sure we identified (and collected multiple samples of) each unconnected ligature to add to the training. Otherwise Tesseract won't recognize them. That would seem to defeat the purpose of using a computer to optically recognize the characters. It makes much more sense to pull these unconnected ligatures apart and train Tesseract to recognize each character separately, so as to increase Tesseract's ability to recognize these characters on multiple documents, whether they were printed as unconnected ligatures or not.

As Bryan noted, for connected ligatures, like 'sh', 'st', 'ff', etc., we are of course training Tesseract to recognize them as one glyph. And in that work we are using MUFI's Unicode values, and even some privately assigned ones (which we have documented by adding them to the list created by PRImA for IMPACT at http://tools.primaresearch.org/Special%20Characters%20in%20Aletheia.pdf). Besides, creating space between character glyphs during training is exactly what's described in Tesseract's own training procedures. That's why we created Franken+: so that we could identify each glyph in a document and create a Franken-document of tiffs that matches what Tesseract's training documentation says it needs to be trained with.

Another thing: it is quite common in the documents we are OCR'ing for standard and italic type to appear on the same page, and even on the same line. It's not at all uncommon for documents to be printed with both roman and blackletter fonts throughout, again on the same lines. So we need to be able to train Tesseract to recognize both standard and italics. For the italic typefaces the letters overlap quite often, so square blocks wouldn't work here. I'm sure there are other techniques available to train for italics, but creating a training system that was consistent and easy to use for all the typefaces we are dealing with was a primary goal, as we would not be able to complete our work in the time allowed without the help of unskilled labor.

I'd also like to point out that none of the examples we've provided in any of these discussions represent unusual or special situations. They are VERY TYPICAL of the documents we are dealing with. We also recognize that there are going to be other cases in the 45 million page images we have that none of our team has ever seen before. So we feel it is essential to create training that is "generic", in order to get Tesseract to recognize as many glyphs as possible without requiring us to identify every special case beforehand. There will of course be special cases that Tesseract fails to recognize during the OCR'ing of 45 million pages, which is why we are currently working so hard to create a robust, machine-learning-based, post-processing triage system to help us identify these failures.

I do understand what you're saying, Janusz, and I think that if we were dealing with a much smaller and more specific set of documents from a much shorter time period, we could probably afford to be more specific in our training. But we're not, and so some of the things you're talking about doing just won't work for this project.

Also, just so you know, we started by trying to train Tesseract using high-quality page images of documents printed in typefaces we knew we were interested in. These page images were of much better quality than the ones we'll actually be OCR'ing. The results were terrible. We were lucky if we could get Tesseract to recognize 80% of the words on the exact same page we'd used to train it--and that was with dictionaries and a unicharambigs file created specifically to address the errors Tesseract was making on that very page. That's why we created Franken+.

Thanks again,
Matt Christy
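(For readers unfamiliar with the unicharambigs file mentioned above: it is a plain-text substitution table consumed during recognition. A minimal sketch in the v1 format used by Tesseract 3.x follows; the entries are illustrative, not eMOP's actual rules. Each line gives the length and characters of an OCR'd string, the length and characters of the string it may be replaced with, and a flag for whether the substitution is mandatory.)

```text
v1
2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0
```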
I would have thought the best approach for your situation, where, as
you rightly point out, there are more ligatures than you have time
to find and train, is to train the common ligatures (as you're
doing) and trust that less common ligatures will be identified as
separate characters close enough to their non-ligatured versions
that they'll be recognised as such.
I haven't had to train an italic font yet. Would the printing sorts
have been slanted for some italic fonts? I suspect so (but don't
know; someone should look it up), which would result in the slight
overlap you see. If that is the case, I wonder if Tesseract takes it
into account? Arguably it should, but as far as I know it just deals
with regular rectangles. There is certainly some extra cleverness it
does to deal with italics... I suspect small overlaps of the kind
that you'll see with italic fonts are essentially just ignored. I
don't know whether that's also true in the training process. It will
be interesting to see how the new training tools to be released deal
with italics.
It would be interesting to retrain Tesseract using your approach on
the data described above and to compare the results, but I'm afraid
nobody has the time and motivation for it.
Best regards and good luck with your project
> The tool allows you to "cut" images based on glyph data from a PAGE file
> and then create a Tesseract training page with a corresponding box file,
> which can be used for Tesseract training. I was testing this using the
> script https://github.com/psnc-dl/page-generator/blob/master/src/etc/train.sh
> and it seems that it can produce a valid Tesseract profile.
That sounds a lot like the tool that Matthew announced a few days
ago (in this very thread). Can you explain the differences a little,
please?
> Page-generator also supports output from our tool, Cutouts
> (http://wlt.synat.pcss.pl/cutouts,
> https://confluence.man.poznan.pl/community/display/WLT/Cutouts+application),
> which allows you to work on the preparation of training material.
That's interesting. Am I correct in thinking that this replaces
Aletheia as a tool to extract glyph images in your workflow? Is the
code available? Is it freely licenced?
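The step quoted above--re-laying glyphs out on a fresh page and emitting a matching box file--can be sketched with plain coordinate arithmetic. This is an illustrative reconstruction, not code from page-generator or Franken+; the layout constants and function name are assumptions. It takes glyph records (character plus width and height of the crop) and produces Tesseract box lines ("char left bottom right top page", y measured from the bottom of the page):

```python
PAGE_H = 600          # height of the synthetic training page (assumed)
MARGIN, GAP = 20, 10  # page margin and inter-glyph spacing (assumed)

def make_box_lines(glyphs, page_h=PAGE_H):
    """glyphs: list of (char, width, height) for the glyph crops.
    Lays the crops out left to right along the top margin and returns
    one Tesseract box line per glyph."""
    lines = []
    x = MARGIN
    for ch, w, h in glyphs:
        left, right = x, x + w
        top_img = MARGIN           # top edge in image (top-left) coords
        bottom_img = MARGIN + h
        # flip to Tesseract's bottom-left origin
        bottom = page_h - bottom_img
        top = page_h - top_img
        lines.append(f"{ch} {left} {bottom} {right} {top} 0")
        x = right + GAP
    return lines
```

A real implementation would also paste the corresponding image crops onto the page at the same coordinates; only the bookkeeping is shown here.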
Hi All,

The Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, as part of its Early Modern OCR Project (eMOP), has created a new tool, called Franken+, that provides a way to create font training for the Tesseract OCR engine using page images. This is in contrast to Tesseract's documented method of font training, which involves using a word processing program with a modern font. Franken+ has now been released for beta testing, and we invite anyone who's interested to give it a try and to please provide feedback.

Franken+ works in conjunction with PRImA's open-source Aletheia tool and allows users to easily and quickly identify one or more idealized forms of each glyph found on a set of page images. These identified forms are then used to generate a set of Franken-page images matching the page characteristics documented in Tesseract's training instructions, but with a font used in an actual early modern printed document. Franken+ allows you to create Tesseract box files, but it will also guide you through the entire Tesseract training process, producing a .traineddata file, and even let you identify and OCR documents using that training. In addition, Franken+ makes it easy to combine training from multiple fonts into one training set.

For eMOP we are using Franken+ to create training for Tesseract from page images of early modern printed works, but we also think it can be used just as effectively to train Tesseract using images of any kind of font that's not readily available via a word processor. For example, I've seen posts in this group about wanting to train Tesseract to read the signs on the front of buses.

You can find out more about Franken+ at http://emop.tamu.edu/node/54 and http://dh-emopweb.tamu.edu/Franken+/. The code is also available open source at https://github.com/idhmc-tamu/eMOP/tree/master/Franken%2B.

Thanks,
Matt Christy