Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

matthew christy

unread,
Dec 6, 2013, 3:10:56 PM12/6/13
to tesser...@googlegroups.com
Hi All,

The Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, as part of its Early Modern OCR Project (eMOP), has created a new tool, called Franken+, that provides a way to create font training for the Tesseract OCR engine using page images. This is in contrast to Tesseract's documented method of font training, which involves using a word processing program with a modern font. Franken+ has now been released for beta testing and we invite anyone who's interested to give it a try and provide feedback.

Franken+ works in conjunction with PRImA's open source Aletheia tool and allows users to easily and quickly identify one or more idealized forms of each glyph found on a set of page images. These identified forms are then used to generate a set of Franken-page images matching the page characteristics documented in Tesseract's training instructions, but with a font used in an actual early modern printed document. Franken+ allows you to create Tesseract box files, but it will also guide you through the entire Tesseract training process, producing a .traineddata file, and even lets you identify and OCR documents using that training. In addition, Franken+ makes it easy to combine training from multiple fonts into one training set.
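
For anyone unfamiliar with Tesseract's documented procedure, the steps Franken+ walks you through correspond roughly to the standard Tesseract 3.x command-line training sequence. As a rough, minimal sketch of what that sequence looks like when scripted (the language code, font name, file names and font_properties flags below are placeholders, not eMOP's actual settings):

    import os
    import subprocess

    lang, font = "enm", "myfont"      # placeholder language code and font name
    base = f"{lang}.{font}.exp0"      # assumes {base}.tif and {base}.box already exist

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.check_call(cmd)

    # font_properties line: "<fontname> <italic> <bold> <fixed> <serif> <fraktur>"
    with open("font_properties", "w") as fp:
        fp.write(f"{font} 0 0 0 1 0\n")

    # 1. Generate .tr feature data from the tiff/box pair
    run(["tesseract", f"{base}.tif", base, "nobatch", "box.train"])
    # 2. Extract the character set from the box file(s)
    run(["unicharset_extractor", f"{base}.box"])
    # 3. Cluster the features
    run(["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"])
    run(["cntraining", f"{base}.tr"])
    # 4. Prefix the outputs with the language code and combine them
    for name in ("inttemp", "normproto", "pffmtable", "shapetable"):
        os.rename(name, f"{lang}.{name}")
    run(["combine_tessdata", f"{lang}."])  # produces enm.traineddata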

For eMOP we are using Franken+ to create training for Tesseract from page images of early modern printed works, but we also think it can be used just as effectively to train Tesseract using images of any kind of font that's not readily available via a word processor. For example, I've seen posts in this group about wanting to train Tesseract to read the signs on the front of buses.

You can find out more about Franken+ at http://emop.tamu.edu/node/54 and http://dh-emopweb.tamu.edu/Franken+/. The code is also available open source at https://github.com/idhmc-tamu/eMOP/tree/master/Franken%2B.

Thanks,
Matt Christy

Janusz S. Bien

unread,
Dec 6, 2013, 3:25:27 PM12/6/13
to tesser...@googlegroups.com, matthew christy
Quote/Cytat - matthew christy <matt.c...@gmail.com> (Fri 06 Dec
2013 09:10:56 PM CET):

> Hi All,
>
> The Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas
> A&M University, as part of its Early Modern OCR Project
> (eMOP<http://emop.tamu.edu/>)
> has created a new tool, called Franken+, that provides a way to create font
> training for the Tesseract OCR engine using page images. This is in
> contrast to Tesseract's documented
> method<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>of
> font training which involves using a word processing program with a
> modern font. Franken+ has now been released for beta testing and we invite
> anyone who's interested to give it a try and to please provide feedback.
>
> Franken+ works in conjunction with PRImA's open source Aletheia
> tool<http://www.primaresearch.org/tools.php>

Aletheia is not an open source tool. Not only is the source
unavailable, but you can download it only for "personal research"
after registration.

It's a pity your very interesting tool has non-free prerequisites.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

matthew christy

unread,
Dec 6, 2013, 4:27:59 PM12/6/13
to tesser...@googlegroups.com
Hi Janusz,

You're right, Aletheia is not open-source. That was a poor choice of words on my part. However, it is free to use after registering, which is also free. The only restriction on its use that I'm sure about is for commercial products. I'll see if I can get a comment on that from someone at PRImA.

Thanks,
Matt

Shree Devi Kumar

unread,
Dec 6, 2013, 8:42:11 PM12/6/13
to tesser...@googlegroups.com
Matthew,
I had tried registering for Aletheia a few months ago. No response so far. 
Shree 

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Janusz S. Bien

unread,
Dec 7, 2013, 12:45:21 AM12/7/13
to tesser...@googlegroups.com
Quote/Cytat - Shree Devi Kumar <shree...@gmail.com> (Sat 07 Dec 2013
02:42:11 AM CET):

> Matthew,
> I had tried registering for Aletheia a few months ago. No response so far.
> Shree

Somehow I'm not surprised.

I'm familiar with the program, as I had to work with it as a partner
in the IMPACT project. It is the only program which supports the PAGE
format, so if that format is suitable for you, then of course it is
the only choice.

We prefer to work with hOCR, so the data created with Aletheia were
immediately converted to hOCR, cf. pageparser at
https://bitbucket.org/jwilk/marasca-wbl.

BTW, our ultimate goal is to create so-called DjVu corpora, cf.

http://poliqarp.wbl.klf.uw.edu.pl

We intend to replace the dirty OCR created with FineReader with the
output of a trained Tesseract, so we are looking for a good training
tool.

zdenko podobny

unread,
Dec 7, 2013, 2:00:51 AM12/7/13
to tesser...@googlegroups.com
I have the same experience.

Zdenko

gtess...@gmail.com

unread,
Dec 7, 2013, 4:12:50 AM12/7/13
to tesser...@googlegroups.com
I also wasn't able to register!

Nick White

unread,
Dec 7, 2013, 9:28:33 AM12/7/13
to tesser...@googlegroups.com
Hi Matt,

Firstly, I share the general feeling that depending so strongly on a
proprietary tool sucks a great deal (and would do even if they weren't
so bad at processing registrations).

Aside from that though, my main question is what this tool does that
one of the box editors like jTessBoxEditor doesn't do? Is the
workflow nicer? Are there other useful features it brings? Are the
existing tools not well set up for training from historical documents?

Am I right in thinking that the main feature you bring is
essentially the ability to remove parts of a scanned page you don't
want (whether because the character samples aren't very
representative or some other reason)? That's what I got from reading
the webpage you linked to. But I don't see why that's preferable to
just not including boxes around the parts you don't care for. Am I
missing something?

Thanks, I look forward to learning more.

Nick

Tom Morris

unread,
Dec 7, 2013, 1:40:23 PM12/7/13
to tesser...@googlegroups.com
It's great to have another open source tool in the toolbag.  (It's GPL v3, BTW, for those who haven't yet appreciated the irony of distributing a GPL license in proprietary Microsoft Word format.)

I'll echo what the others have said about openness and freedom, or lack thereof.  Not only is the Aletheia tool closed source and tightly controlled, the same is true of the libraries to read the PAGE XML file format.  That's ludicrous!  If Aletheia were open source, you could have directly fixed the bug with random glyph detection there instead of working around it after the fact.  The fact that Franken+ is open source is cool, but making it Windows-only (.NET) is pretty uncool and limits how it can be reused.  There's a reason that other similar tools chose Java, C++/Qt or other portable technologies for their implementations.

It seems to me that the community is already small enough that programs like eMOP and IMPACT would want not to fragment it any further, and would instead focus on creating an end-to-end open source tool chain that could be continuously improved by all parties.

Like Nick, I'd also like to see performance figures.  Bryan's video presentation (recommended for those who haven't viewed it) says that performance was improved "considerably" but doesn't give any figures and I don't see any on the web site.

Tom

Tom Morris

unread,
Dec 7, 2013, 2:31:42 PM12/7/13
to tesser...@googlegroups.com
p.s. You probably want to add a .gitignore file so that you aren't committing binaries to the repository.  Also, it seems like Franken+, as a standalone tool, really could use its own repo.  They're lightweight and free and that would give you a separate bug tracker, wiki, etc, as well as the ability for others to fork the repo for the tool without having to pull in the entire eMOP repo.

Nick White

unread,
Dec 9, 2013, 7:39:06 AM12/9/13
to tesser...@googlegroups.com
I just watched the presentation, which I missed before, so I'll
ask a slightly better question as a result.

From the video, the 3 reasons given for creating Franken+ were:
- To find and pick the best exemplars of a character
- To easily find places that glyphs were misidentified
- To ignore small scanning artifacts

The last of these could be done with any box editor by removing the
box, but I am not aware of a particularly helpful interface for the
first two. Am I correct in thinking that an interface that compared
the different examples of each character, grouped by character, is
the main reason for Franken+ being built?

Presuming that is correct, why didn't you add an extra tab to
jTessBoxEditor with such a view? I would have thought that would be
easier, but more importantly it wouldn't rely on a proprietary
workflow, and more generally (unless I'm missing something) should
be more straightforward and faster to work with, as it doesn't
require passing the output from one tool into another.

I look forward to hearing more,

Nick

matthew christy

unread,
Dec 9, 2013, 10:20:39 AM12/9/13
to tesser...@googlegroups.com
Hi all,

Thanks for the comments. I was not aware that there were concerns with Aletheia's availability or trouble getting access to the tool. We have not had any problems with that ourselves. 

We are only using Aletheia as a tool to identify each glyph and create tiff/box file pairs for each page processed. We are not using the PAGE format or anything like that. Early in our process of trying to create training for Tesseract with early modern printed documents we had begun to use Aletheia as a tool to create tiff/box pairs for Tesseract (with some translation, of course). At the point that we decided that we had to create some new mechanism for training Tesseract, we already had quite a lot of page images processed with Aletheia which had all been corrected and checked by hand. So we started from that when creating Franken+. 

We have discussed internally some of the suggestions that you are all making above. Being able to use Tesseract's built-in box file generator and jTessBoxEditor as input instead of Aletheia is one. Creating a version that runs on other platforms is another that would follow the above (as Aletheia also only runs on Windows). Making our own repository for Franken+ is also a good idea. It will take some time to get to all of that though. In the meantime we wanted to share what we had created, and this is a beta release. As open-source code we also hope that others will feel free to make some of these changes themselves and share with the rest of us. In fact, I think getting Franken+ to work with Tesseract/jTessBoxEditor input should be a simple matter of adjusting the coordinate system that Franken+ is expecting in the incoming box files (since Tesseract and Aletheia box files have 0,0 in different corners).
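
For instance, a minimal sketch of the coordinate flip itself (assuming the incoming file already uses Tesseract's field order, but with y measured down from the top edge of the page rather than up from the bottom; the real Aletheia export may well differ) could look like:

    def flip_box_line(line, page_height):
        """Convert one 'char left top right bottom page' line with a
        top-left origin into Tesseract's bottom-left-origin format."""
        ch, left, top, right, bottom, page = line.split()
        top, bottom = int(top), int(bottom)
        # Tesseract expects: char left bottom right top page, with bottom < top
        return f"{ch} {left} {page_height - bottom} {right} {page_height - top} {page}"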

-

Franken+ was created not only to allow us to identify the best possible exemplars of each glyph in our training documents, but to generate tiffs with accompanying box files that could be used to train Tesseract as well. The early modern printed documents that we are trying to use to train Tesseract are far from ideal. They suffer from many problems introduced at every stage of the process from the original printing, to 250-550 years of use and storage, to the digitization of the document. We found in early testing that the more examples of glyphs we tried to train Tesseract with--often with highly variable example images for each glyph--the worse Tesseract did. 

In a nutshell, Franken+ allows us to process several pages (with Aletheia), see every exemplar of each glyph discovered, pick some small number of ideal samples for each (we are in the process of testing whether Tesseract does better when Franken+ is used to pick one example of each glyph, 5 examples, or more), and generate a set of tiff images and box files by creating a Franken-doc using only the set of exemplars identified in Franken+. However, we have found that using Franken+ has other advantages: being able to easily identify mislabeled glyphs, doing typeface comparison of glyphs in a document, quickly identifying and removing "junk" glyphs, being able to quickly identify the different point sizes used in a doc, etc. And we also added some extra features that allow users to complete the Tesseract training process with Franken+ rather than having to go back to the command line.
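
To give a rough idea of what that generation step involves, here is a much-simplified, PIL-based sketch (the exemplars mapping, page dimensions and spacing are hypothetical; the real tool does considerably more):

    from PIL import Image

    def build_franken_page(text, exemplars, page_w=2000, page_h=2600,
                           margin=100, x_gap=14, line_gap=40):
        """Paste chosen glyph exemplars onto a blank page and emit the
        matching Tesseract box lines (origin at the bottom-left corner)."""
        page = Image.new("L", (page_w, page_h), 255)   # white canvas
        boxes = []
        x, y, row_h = margin, margin, 0                # PIL origin is top-left
        for ch in text:
            if ch == " ":
                x += 3 * x_gap
                continue
            glyph = exemplars[ch]                      # a small PIL image of the chosen glyph
            w, h = glyph.size
            if x + w > page_w - margin:                # wrap to the next line
                x, y, row_h = margin, y + row_h + line_gap, 0
            page.paste(glyph, (x, y))
            # Tesseract box format: char left bottom right top page
            boxes.append(f"{ch} {x} {page_h - (y + h)} {x + w} {page_h - y} 0")
            x += w + x_gap
            row_h = max(row_h, h)
        return page, "\n".join(boxes) + "\n"

The resulting tiff/box pair is then what feeds the normal Tesseract training run.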

We have seen a variable amount of improvement in our OCR results with Tesseract using training generated with Franken+. Some of that improvement has been quite good; upwards of 15-20%, without adding dictionaries. We are continuing to test variables in Franken+ training to see what generates the best results. We'll add all of that information to the eMOP and Franken+ pages when we have it.

We do appreciate the comments and suggestions and if anyone is interested in getting Franken+ to work with Tesseract's tiff/box pairs before we can get to it, please do. We are always happy to get help and we do want Franken+ to be an active open-source project.
Thanks,
Matt Christy

Bryan Tarpley

unread,
Dec 9, 2013, 1:40:43 PM12/9/13
to tesser...@googlegroups.com
Folks,

I'm the developer of Franken+.  I appreciate you taking your time to provide some off-the-cuff remarks.  I must agree with everyone here so far:  it would be ideal to have a tool like Franken+ that ingests tiff/box pairs (not tif/PAGE XML) that also runs on something other than Windows.  So why did we go this route?  Because we're a grant funded project with limited funds and hard deadlines.  To put this in perspective a little better, the eMOP project is tasked with OCR'ing over 45 million tiff images--all poor scans of documents that were printed using early printing presses--within two years.  This is an example of the cringe-worthy documents we're dealing with:  http://sarahwerner.net/blog/wp-content/uploads/2012/04/eebo-hamlet-1024x709.jpg 

We began using Aletheia because it was the only tool we were aware of at the time which allows us to binarize an image, clean up artifacts, and bound not only characters but words, lines, paragraphs, columns, pages, etc. for font-training purposes.  The student workers who we pay to do much of this work have varying levels of comfort/expertise with computers, so Aletheia also proved to be the most GUI-driven, user-friendly tool out there.  Before we found that we needed to develop Franken+, we'd already spent hundreds of hours amassing training data using Aletheia.  We continue to use Aletheia because of its ability to draw polygons around characters (not boxes).  If you look at the example image I linked above, you can see that there are many instances of characters where it would be impossible to draw a box around one and isolate only a single character.  Unfortunately, we were not aware of any open-source tools at the time that allowed us to block off characters using polygons rather than boxes.  If anyone is willing to develop such a tool (or find one that already exists), I'd be happy (in my free time) to modify Franken+ such that it ingests that kind of thing.

We never expected to have to develop Franken+, so we used the .NET platform because that is the environment for which I feel most comfortable developing, and my time was very limited (I developed this as a graduate research assistant, squeezing in hours between my PhD studies in English and my other duties at the IDHMC).  Since it relies on Aletheia (which also must run on Windows), this was not a problem for us.

Respectfully,
Bryan

matthew christy

unread,
Dec 9, 2013, 5:05:25 PM12/9/13
to tesser...@googlegroups.com
Hi all, 

Some more corrections after talking further with the developer (I'm just a Project Manager these days).



> We are only using Aletheia as a tool to identify each glyph and create tiff/box file pairs for each page processed. We are not using the PAGE format or anything like that.
Yeah, we are actually using the PAGE format. We were also using it previously, when trying to develop Tesseract training directly from Aletheia, and had developed a relatively simple XSLT to convert from the PAGE XML format to box files. Using the PAGE format has not been an issue for us, since we could easily transform it.


> In fact, I think getting Franken+ to work with Tesseract/jTessBoxEditor input should be a simple matter of adjusting the coordinate system that Franken+ is expecting in the incoming box files (since Tesseract and Aletheia box files have 0,0 in different corners).
I realized after talking to Bryan that someone would also have to develop code to cut the images of the boxes from the page image tiff, based on the boxes identified in the box file. However, since Tesseract and the jTessBoxEditor are based on squares instead of polygons, these glyph images will end up with a lot of noise due to character overlap. So that will also have to be edited out.
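
For what it's worth, a minimal sketch of that cropping step (using PIL, and assuming standard Tesseract box coordinates with the origin at the bottom-left of the page) would be something like:

    from PIL import Image

    def crop_glyphs(tiff_path, box_path):
        """Cut each boxed glyph out of the page image, given a Tesseract
        box file ('char left bottom right top page', bottom-left origin)."""
        page = Image.open(tiff_path)
        _, H = page.size
        crops = []
        with open(box_path, encoding="utf-8") as f:
            for line in f:
                ch, l, b, r, t, _page = line.split()
                l, b, r, t = map(int, (l, b, r, t))
                # PIL crops with a top-left origin, so flip the y values
                crops.append((ch, page.crop((l, H - t, r, H - b))))
        return crops

Rectangular crops like these will, of course, still drag in fragments of any overlapping neighbours, which is exactly the noise that would then have to be edited out.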

Thanks,
Matt 

Janusz S. Bien

unread,
Dec 10, 2013, 1:02:37 AM12/10/13
to tesser...@googlegroups.com, matthew christy
Quote/Cytat - matthew christy <matt.c...@gmail.com> (Mon 09 Dec
2013 11:05:25 PM CET):

> I realized after talking to Bryan that someone would also have to develop
> code to cut the images of the boxes from the page image tiff based on the
> boxes identified in the box file. However, since Tesseract and the
> jTessBoxEditor are based on squares instead of polygons these glyph images
> will end up with a lot of noise due to character overlap. So that will also
> have to be edited out.

Where do the polygons come from? Hot-metal printing technology
doesn't allow for overlapping characters; the body of a "sort" was
always rectangular, cf. e.g.

http://en.wikipedia.org/wiki/Sort_%28typesetting%29

You probably mean characters belonging to ligatures. Ligatures, in my
opinion, should be treated as single Unicode characters and assigned
Private Use Area code points if not available in the standard.

Bryan Tarpley

unread,
Dec 10, 2013, 12:13:37 PM12/10/13
to tesser...@googlegroups.com, matthew christy
jsbien,

I've attached an example from one of our documents.  Consider the capital 'T' which overhangs the 'u', and the 'k' which underlies the 'e'.  We've also found instances where, on certain fonts, almost all of the italics characters overlap.  These are not ligatures.

Thanks,
Bryan
example.png

Nick White

unread,
Dec 10, 2013, 12:31:19 PM12/10/13
to tesser...@googlegroups.com
Hi Bryan,

On Tue, Dec 10, 2013 at 09:13:37AM -0800, Bryan Tarpley wrote:
> I've attached an example from one of our documents. Consider the capital 'T'
> which overhangs the 'u', and the 'k' which underlies the 'e'. We've also found
> instances where, on certain fonts, almost all of the italics characters
> overlap. These are not ligatures.

Curious... Is this a title? If so, maybe they used fancier methods
(e.g. custom cutting the squares)? The T only overhangs the u a tiny
bit, and as it's an italic font anyway I suspect that could be the
ink spreading a touch. But the K certainly looks a lot like a
ligature (whether custom designed for the title or not).

I recently read the book "A View of Early Typography" by Harry
Carter, who mentions that Aldus used at least 65 different ligatures
for all sorts of letter joins. Granted he was exceptional, but also
prolific. I thoroughly recommend that book, incidentally - it's
heavy going, but awesome.

IIRC there's nothing stopping you from treating things like that as
a character that outputs multiple letters when training, if it
doesn't make sense to preserve the ligature (which for cases like
this it probably wouldn't).

If your university has an old printing press, go visit it and find
someone to show you around - it's great fun!

Nick

Nick White

unread,
Dec 10, 2013, 1:45:53 PM12/10/13
to tesser...@googlegroups.com
Hi Brian, nice to hear from you.

> We began using Aletheia because it was the only tool we were aware of at the
> time which allows us to binarize an image, clean up artifacts, and bound not
> only characters but words, lines, paragraphs, columns, pages, etc for
> font-training purposes. The student workers who we pay to do much of this work
> have varying levels of comfort/expertise with computers, so Aletheia also
> proved to be the most GUI driven, user-friendly tool out there.

When you say it bounds words, lines, paragraphs for font training
purposes, can you explain what you mean? I haven't used Aletheia, so
it isn't obvious to me.

Do you mean that the interface is separated by words, so people
correcting the box files can (for example) see that "babe" is
misrecognised as "bard" and then just click near the word and type
"babe"? I can see that this could be a faster approach to correcting
things, potentially. I don't think the current box editors we have
are very focused towards this sort of "proofreading" model, and
perhaps they should be more so.

Looking forward to hearing more from you,

Nick

Bryan Tarpley

unread,
Dec 10, 2013, 2:35:00 PM12/10/13
to tesser...@googlegroups.com
Nick,

No--the example I provided was from a footnote.  I'm sure you're right that the original printer used "ligatures" in the sense that two or more characters were present on the same plate (Forgive me for not knowing book history terminology!  We work with folks at the Cushing library here who are book history scholars, and they fill that knowledge gap for people like me :) ).  The problem is that these custom "ligatures" are not available as single characters in unicode.  We originally tried to place multiple characters in single "boxes" to train Tesseract.  The results for us were poor.  While you may put more than one character per line in a Tesseract box file, you cannot use more than one character at a time in the unicharambigs file, for instance (Google claims you can but you can't--it's a bug).  We made a decision to treat most "ligatures" as separate characters, and while we're still amassing testing data, the results are better.  Granted, for certain ligatures like the "fl" or "sl," they have unicode values, so we use those.

With Franken+, using polygons to bound those characters that normally overlap with others has allowed us to snip them out of context and reproduce synthetic tiff images where they do not overlap.  These synthetic images (where each of the characters are pristine and none overlap) are what we're using to train Tesseract.

In terms of your question about Aletheia, while bounding lines, paragraphs, etc. is not necessary for training Tesseract, we're using several post-processing algorithms to detect whether problems with initial OCR are due to poor line segmentation, reading order, column detection, etc., and in order to "train" these algorithms we need sample data, hence the meticulous bounding of characters, words, lines, paragraphs, columns, etc. in our training tiff images using Aletheia.

The eMOP project will be releasing its entire workflow, including the source code for these post-processing algorithms, all of our Aletheia training data, and all of the tiff/box pairs we used to train Tesseract.  With the right hardware, in theory, anyone could replicate it.  We're hoping that the game changer for us will be our meticulous, font specific training on the front-end, the power of our Brazos supercomputing cluster to do enormous, parallelized OCR'ing at large scale, and our post-processing "triage" methods which will tell us whether poor results are due to the use of the wrong font, bad segmentation, the presence of images on the page, etc.  We'll also have several web-based tools for crowd-sourcing corrections (like Typewright and Aletheia Layout Editor) on some of the data that OCR just can't crack.

I hope that answers some of your questions--thanks for the feedback!
Bryan

Janusz S. Bien

unread,
Dec 10, 2013, 2:49:31 PM12/10/13
to tesser...@googlegroups.com, Bryan Tarpley
Quote/Cytat - Bryan Tarpley <bpta...@gmail.com> (Tue 10 Dec 2013
08:35:00 PM CET):

> Nick,
>
> No--the example I provided was from a footnote. I'm sure you're right that
> the original printer used "ligatures" in the sense that two or more
> characters were present on the same plate (Forgive me for not knowing book
> history terminology! We work with folks at the Cushing library here who
> are book history scholars, and they fill that knowledge gap for people like
> me :) ).

The terminology is strange and confusing: "type" or "sort".

> The problem is that these custom "ligatures" are not available as
> single characters in unicode.

So what? As I've already mentioned, you can assign code points from
the Unicode Private Use Area. This is actually what the Medieval
Unicode Font Initiative is doing.

> We originally tried to place multiple
> characters in single "boxes" to train Tesseract. The results for us were
> poor. While you may put more than one character per line in a Tesseract
> box file, you cannot use more than one character at a time in the
> unicharambigs file, for instance (Google claims you can but you can't--it's
> a bug).

I don't think you would have this problem with PUA characters.

> We made a decision to treat most "ligatures" as separate
> characters, and while we're still amassing testing data, the results are
> better. Granted, for certain ligatures like the "fl" or "sl," they have
> unicode values, so we use those.
>
> With Franken+, using polygons to bound those characters that normally
> overlap with others has allowed us to snip them out of context and
> reproduce synthetic tiff images where they do not overlap. These synthetic
> images (where each of the characters are pristine and none overlap) are
> what we're using to train Tesseract.

In other words, you train Tesseract on different character shapes than
those actually occurring in the texts.

Bryan Tarpley

unread,
Dec 10, 2013, 3:28:41 PM12/10/13
to tesser...@googlegroups.com
Janusz,

I'm going to try to interpret your comments as constructive criticism :)

We tried using MUFI.  There simply does not exist in MUFI a unicode value for "ke," for example (we looked:  http://www.ub.uib.no/elpub/2003/r/000001/MUFI-standard-1.0.pdf).  I strongly disagree that we're training on different character shapes than those occurring in the texts.  We're actually cutting out images of the characters themselves and training on those.  What you are saying is that we should not treat them as separate entities, that we should value typographical faithfulness over readability in our OCR.  You seem to be advocating a kind of purity or exact consistency with the original typesetting that is not the immediate goal of the eMOP project.  Our ultimate concern is to make these texts searchable for early modern scholars--not to produce 100% typographically faithful textual simulacra.  We believe this caliber of work (the production of scholarly digital editions) is best left to textual scholars, not machines.  How is a scholar supposed to search for instances of the word "turkey" if there are no unicode values they could enter using the keyboard (or even copy and paste from the character map) for "ke?"  There exist great initiatives like the TCP which are more interested in the kind of digitization you seem to be advocating.

Best,
Bryan
--
Bryan Tarpley
Graduate Research Assistant
Texas A&M | IDHMC
LAAH 439
bpta...@tamu.edu

Janusz S. Bien

unread,
Dec 10, 2013, 3:48:02 PM12/10/13
to tesser...@googlegroups.com, Bryan Tarpley, tesser...@googlegroups.com
Quote/Cytat - Bryan Tarpley <bpta...@gmail.com> (Tue 10 Dec 2013
09:28:41 PM CET):

> Janusz,
>
> I'm going to try to interpret your comments as constructive criticism :)

That is definitely my intention.

>
> We tried using MUFI. There simply does not exist in MUFI a unicode value
> for "ke," for example (we looked:
> http://www.ub.uib.no/elpub/2003/r/000001/MUFI-standard-1.0.pdf).

You can make your own assignments. You can get an idea of how it was
done in the IMPACT project, e.g. from my note

http://bc.klf.uw.edu.pl/288/

The problem is that you need also the font compatible with your
assignments. In the IMPACT project the font used by Aletheia was
changed as often as it was needed. I understand this can be a problem
for you if you are not familiar with font development software.

> I
> strongly disagree that we're training on different character shapes than
> those occurring in the texts. We're actually cutting out images of the
> characters themselves and training on those. What you are saying is that
> we should not treat them as separate entities, that we should value
> typographical faithfulness over readability in our OCR. You seem to be
> advocating a kind of purity or exact consistency with the original
> typesetting that is not the immediate goal of the eMOP project.

This is not a question of ideology but of Tesseract accuracy and
efficiency. I'm not a Tesseract expert, so it is just a hypothesis that
better results can be achieved by training on the original data.

> Our
> ultimate concern is to make these texts searchable for early modern
> scholars--not to produce 100% typographically faithful textual simulacra.
> We believe this caliber of work (the production of scholarly digital
> editions) is best left to textual scholars, not machines. How is a scholar
> supposed to search for instances of the word "turkey" if there are no
> unicode values they could enter using the keyboard (or even copy and paste
> from the character map) for "ke?"

You just have to normalize the text before using it in the search
engine. If your search engine is sufficiently sophisticated, you can
offer several versions of your texts. In our search engine the user by
default searches the normalized text, but can also search the original
spelling with ligatures. More information is available in my note

http://bc.klf.uw.edu.pl/289/

and the search engine is available at

http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_1/
http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_2/
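
To make the normalization step concrete, here is a tiny sketch (the
PUA assignments below are invented for illustration, not MUFI or
IMPACT values):

    # Hypothetical Private Use Area assignments for unencoded sorts;
    # a real project would document its own code points.
    PUA_EXPANSIONS = {
        0xE700: "ke",   # the 'ke' sort discussed above
        0xE701: "ct",
        0xE702: "us",
    }

    def normalize(text):
        """Expand PUA 'ligature' code points for the search index,
        while the stored text keeps the original typographic form."""
        return text.translate(PUA_EXPANSIONS)

    print(normalize("tur\ue700y"))   # -> "turkey"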

> There exist great initiatives like the
> TCP which are more interested in the kind of digitization you seem to be
> advocating.

I'm not familiar with this project. I would appreciate a link.

Bryan Tarpley

unread,
Dec 10, 2013, 4:15:31 PM12/10/13
to Janusz S. Bien, tesser...@googlegroups.com
Janusz,

The TCP (Text Creation Partnership) is interested in creating ground truth for historic texts by hand-keying them:  http://www.textcreationpartnership.org/ 

We use thousands of their documents for ground truth comparisons, and have generated our word frequency lists using them.  I just realized that they only use a limited set of ligatures in their transcriptions, however.  I apologize for reading your suggestions as though you were advocating typographical accuracy above searchability.  Our initial findings are that trying to train Tesseract to recognize these ligatures is less effective than training it to treat them as separate characters.  In other words, we're having better results normalizing on the front end, both in terms of accuracy and efficiency re:Tesseract.

Having a sophisticated search engine that offers different versions of text would be interesting--we'll have to look into that.  Clemens Neudecker from IMPACT is one of our collaborators.

Thanks,
b

Nick White

unread,
Dec 10, 2013, 5:49:26 PM12/10/13
to tesser...@googlegroups.com
Hi Bryan, I'm responding to parts of several of your messages below.

On Tue, Dec 10, 2013 at 03:15:31PM -0600, Bryan Tarpley wrote:
> Our initial findings are that trying to train
> Tesseract to recognize these ligatures is less effective than training it to
> treat them as separate characters. In other words, we're having better results
> normalizing on the front end, both in terms of accuracy and efficiency
> re:Tesseract.

That is surprising, because Tesseract segments characters in boxes
(just like its makebox mode) when it does OCR, so I'd expect
overlapping ligatures to be better detected when trained for than as
separate characters. I suppose ligatures may well on average vary
more, which might explain it. But still, it's surprising.

> While you may put more than one
> character per line in a Tesseract box file, you cannot use more than
> one character at a time in the unicharambigs file, for instance (Google
> claims you can but you can't--it's a bug)

Have you reported this on the issue tracker? Please do if you
haven't; that is certainly a bug that should be fixed (and shouldn't
be too difficult to fix).
http://code.google.com/p/tesseract-ocr/issues/list

> The eMOP project will be releasing its entire workflow, including
> the source code for these post-processing algorithms, all of our
> Aletheia training data, and all of the tiff/box pairs we used to
> train Tesseract. With the right hardware, in theory, anyone could
> replicate it. We're hoping that the game changer for us will be our
> meticulous, font specific training on the front-end, the power of
> our Brazos supercomputing cluster to do enormous, parallelized
> OCR'ing at large scale, and our post-processing "triage" methods
> which will tell us whether poor results are due to the use of the
> wrong font, bad segmentation, the presence of images on the page,
> etc. We'll also have several web-based tools for crowd-sourcing
> corrections (like Typewright and Aletheia Layout Editor) on some
> of the data that OCR just can't crack.

All good and laudable, certainly, and I am very happy to hear it.
Though the reliance on the proprietary Aletheia throws a cog in the
works; nobody can replicate it without the permission and support
of the Aletheia people, nor can anybody but them really dissect how
that part of the system works. I know we keep harping on about it,
but it is really important to a lot of us. Particularly, for me, for
a publicly funded academic project.

I look forward very much to the scripts that try to predict the
reasons for poor results - the ways they figure that out can surely
be fed back into Tesseract to improve it further.

Thanks for all the details about what you're up to, it's very
interesting indeed.

Nick

P.S. Apologies for mis-spelling your name earlier.

Bryan Tarpley

unread,
Dec 10, 2013, 8:18:57 PM12/10/13
to tesser...@googlegroups.com
Nick,

Some of the ligatures for which we have unicode equivalents (like "sl" and "fl"), and which clearly form a single, contiguous shape, are without a doubt best treated as a single character.  But others, such as the "ke" "ligature" I provided in my attachment earlier in this thread, are not composed of two letters that form a contiguous shape--they are clearly separate letters that only "overlap" when you draw boxes around them.  We've found that when two letters aren't touching, Tesseract has trouble identifying them together as a single ligature, /especially/ given that the character "e" by itself looks exactly the same as the one in "ke."  In those cases, even though the printer may have combined the "k" and the "e" onto the same plate to form the "ligature" "ke" (what's the better word for plate here?), it is better to train Tesseract to recognize them as separate characters, from what we've found.  I feel like I'm talking in circles, so if this is making no sense, I can try to give example images of what I'm talking about tomorrow.

No worries about misspelling my name--I once got a card addressed to "Brain Tarpley" ;)

Thanks,
b

matthew christy

unread,
Dec 11, 2013, 9:17:00 AM12/11/13
to tesser...@googlegroups.com
Hi Janusz,

There are a couple of things I'd like to point out. First of all, you've mentioned 19th Century typefaces in the past, so I'm assuming that that's what you're used to working with. We're dealing with 15th-18th Century documents. Like Bryan, I'm not a font history expert, but from what I've learned over the last year, I'm willing to bet that printing practices and standards in those early centuries of printing were probably a bit different from what they ended up being as everything became more established. Most of the typefaces we are looking at (if not all) were made by hand and so can have quite individual peculiarities. As Nick pointed out, it was not uncommon to create print blocks that contained two or three common letter combinations on one punch (I don't think that's the technically correct word, but I'll use it anyway). They were like ligatures in a way even though the letters weren't actually connected. I'm going to call these unconnected ligatures just for ease of reference throughout this post. 

If you look closely at this Specimen Sheet from the type caster Francois Guyot (http://collation.folger.edu/2011/09/guyots-speciman-sheet/) you'll see a number of such unconnected ligatures, and we've seen others as Bryan noted. You'll also see a number of upper-case letters which overhang or run under their adjacent letters. The upper-case Q is a common example of this. Most of these are in the italics set, but not all. 

Owing to the individualistic nature of these typefaces, we are faced with the possibility of having to train Tesseract on every possible typeface--something that is prohibitively expensive, if even possible. We have used Aletheia to train several different typefaces so far, but if we tried to create training for every hand-made typeface created over the course of 250 years, we would never finish. Thankfully it is the case that certain type casters were quite influential and that some typefaces in certain places would become "fashionable". So often typefaces from different casters can be quite similar to each other. But just because a type caster made his 'e' look like Guyot's 'e' doesn't mean that he didn't also decide to create a bunch of unconnected ligatures in his type set, or not create the same ones that Guyot thought were important, etc. In fact, due to the inconsistent output of printing presses from this time, I've found that two lower-case e characters from specimen sheets produced 200 years apart can look more like each other than two lower-case e characters printed on the same page of just one document using one of those typefaces. Therefore we are pursuing the possibility that we can train Tesseract to recognize "families" of typefaces which are similar enough to each other that they won't require training Tesseract for each typeface (not to mention the problem of then identifying the documents in our collections which use each typeface).

Doing this however, means that the idea of training Tesseract (using only square boxes) to recognize every possible unconnected ligature in our corpus would again be prohibitively expensive (both in terms of time and the expertise required), and probably not possible. If we only used boxes in training Tesseract, we'd have to closely examine every document which we would be OCR'ing with that training in order to make sure that we identified (and collected multiple samples of) each unconnected ligature to add to the training. Otherwise Tesseract won't recognize them. That would seem to defeat the purpose of using a computer to try to optically recognize the characters. It makes much more sense to pull these unconnected ligatures apart and train Tesseract to recognize each character separately so as to increase Tesseract's ability to recognize these characters on multiple documents whether they were printed as unconnected ligatures or not. As Bryan noted, for connected ligatures, like 'sh' 'st' 'ff', etc. we are of course training Tesseract to recognize them as one glyph. And in that work we are using MUFI's unicode values, and even some privately assigned ones (which we have documented by adding them to the list created by PRImA for IMPACT at http://tools.primaresearch.org/Special%20Characters%20in%20Aletheia.pdf).

Besides, creating space between character glyphs during training is exactly what's described in Tesseract's own training procedures. That's why we created Franken+: so that we could identify each glyph in a document, and create a Franken-document of tiffs, that match what Tesseract's training document says it needs to be trained with.

Another thing is that it is quite common in the documents that we are OCR'ing for standard and italics type to be present on the same page and even the same line. It's not at all uncommon for documents to be printed with both roman and blackletter fonts throughout the document, again on the same lines. So we need to be able to train Tesseract to recognize both standard and italics. For the italic typefaces, the letters overlap quite often, so here using square blocks wouldn't work. I'm sure that there are some other techniques available to train for italics, but creating a training system that was consistent and easy to use for all the typefaces we are dealing with was a primary goal, as we would not be able to complete our work in the time allowed without the help of unskilled labor.

I'd also like to point out that none of the examples that we've provided in any of these discussions represent unusual or special situations. They are VERY TYPICAL for the documents we are dealing with. We also recognize that there are going to be other cases in the 45 million page images we have that none of our team has ever seen before. So we feel that it is essential for us to create training that is "generic" in order to get Tesseract to recognize as many glyphs as possible without requiring us to identify every special case beforehand. There will of course be special cases that Tesseract will fail to recognize during the OCR'ing of 45 million pages, which is why we are currently working so hard to create a robust, machine learning-based, post-processing triage system to help us identify these failures.

I do understand what you're saying Janusz, and I think that if we were dealing with a much smaller and more specific set of documents from a much shorter time period, we could probably afford to be more specific in our training. But we're not, and so some of the things you're talking about doing just won't work for this project.

Also, just so you know, we started by trying to train Tesseract using high-quality page images of documents printed in typefaces we knew we were interested in. These page images were of much better quality than the ones we'll actually be OCR'ing. The results were terrible. We were lucky if we could get Tesseract to recognize 80% of the words on the exact same page we'd used to train Tesseract. And that was including using dictionaries and a unicharambigs file that was created to address the errors Tesseract was making on OCR'ing that page. That's why we created Franken+.

Thanks again,
Matt Christy

matthew christy

unread,
Dec 11, 2013, 9:19:03 AM12/11/13
to tesser...@googlegroups.com
Hi Nick,

Yes, we have found a lot of things about what Tesseract is doing to be surprising.


Thanks again for your feedback,
Matt

Nick White

unread,
Dec 11, 2013, 9:45:12 AM12/11/13
to tesser...@googlegroups.com
On Wed, Dec 11, 2013 at 06:19:03AM -0800, matthew christy wrote:
> I did report that bug: https://code.google.com/p/tesseract-ocr/issues/detail?id
> =906&q=christy.

Oh great, thanks for that. Sorry it's been left unanswered for so
long; I should really have been helping out on the issue tracker
more, but haven't had time. I'll take a look at it soon!

Nick

Nick White

unread,
Dec 11, 2013, 10:01:15 AM12/11/13
to tesser...@googlegroups.com
On Wed, Dec 11, 2013 at 06:17:00AM -0800, matthew christy wrote:
> it was not uncommon to create print blocks that contained two or three
> common letter combinations on one punch (I don't think that's the
> technically correct word, but I'll use it anyway). They were like ligatures
> in a way even though the letters weren't actually connected. I'm going to
> call these unconnected ligatures just for ease of reference throughout this
> post.

Just to clarify the terminology, ligatures don't have to join the
letters they're a part of, they basically just refer to letters next
to each other that were designed together and whose 'boxes' overlap.
So there's no need to distinguish between "connected" and
"unconnected" ligatures; they're all just ligatures.

Nick White

unread,
Dec 11, 2013, 10:42:45 AM12/11/13
to tesser...@googlegroups.com
Hi Matt,

On Wed, Dec 11, 2013 at 06:17:00AM -0800, matthew christy wrote:
> If we only used boxes
> in training Tesseract, we'd have to closely examine every document which we
> would be OCR'ing with that training in order to make sure that we
> identified (and collected multiple samples of) each unconnected ligature to
> add to the training. Otherwise Tesseract won't recognize them. That would
> seem to defeat the purpose of using a computer to try to optically
> recognize the characters. It makes much more sense to pull these
> unconnected ligatures apart and train Tesseract to recognize each character
> separately so as to increase Tesseract's ability to recognize these
> characters on multiple documents whether they were printed as unconnected
> ligatures or not.

You may be right, but I'm not entirely convinced. I'm not sure that
"pulling apart" the k and e in your recent example makes sense,
because you're unlikely to see a k with that long a tail that isn't
part of a ligature anyway. So if Tesseract saw a ligature like the ke
(and it hadn't been trained for it as one character), it would
probably break it down into a k and e such that much of the
tail of the k was not part of the k box anyway. Unless Tesseract
worked by splitting glyphs into arbitrary shapes (which it doesn't,
and won't), I don't think it makes sense for you to train it for
ligatures using them.

I would have thought the best approach for your situation, where as
you rightly point out there are more ligatures than you have the
time to find and train all of, is to train the common ligatures (as
you're doing), and just trust that less common ligatures will be
identified as separate characters that are close enough to their
non-ligatured versions that they'll be recognised as such.

> Besides, creating space between character glyphs during training is exactly
> what's described in Tesseract's own training procedures. That's why we
> created Franken+: so that we could identify each glyph in a document, and
> create a Franken-document of tiffs, that match what Tesseract's training
> document says it needs to be trained with.

I think that the issue of needing plenty of space between letters
when training is less acute than it used to be, so this may not be
a big issue anymore. It was a big problem with Tesseract 2.x,
certainly, but should be less so now. There are advantages to using
"realistic" spacing, in that it more accurately estimates
characters' positions on the line and closeness to their neighbours.
It may be that the characters in your source documents are still
too close for comfort, but I wouldn't bet on it.

> For the italic typefaces, the letters
> overlap quite often, so here using square blocks wouldn't work. I'm sure
> that there are some other techniques available to train for italics, but
> creating a training system that was consistent and easy to use for all the
> typefaces we are dealing with was a primary goal, as we would not be able
> to complete our work in the time allowed without the help of unskilled
> labor.

I haven't had to train an italic font yet. Would the printing sorts
have been slanted for some italic fonts? I suspect so (but don't
know; someone should look it up), which would result in the slight
overlap you see. If that is the case, I wonder if Tesseract takes it
into account? Arguably it should, but as far as I know it just deals
with regular rectangles. There is certainly some extra cleverness it
does to deal with italics... I suspect small overlaps of the kind
that you'll see with italic fonts are essentially just ignored. I
don't know whether that's also true in the training process. It will
be interesting to see how the new training tools to be released deal
with italics.

Janusz S. Bien

unread,
Dec 12, 2013, 12:02:08 AM12/12/13
to tesser...@googlegroups.com, matthew christy
Dear Matthew, thank you for your long letter.

To make a long story short, I'm familiar with the old typography
problems, but I have no experience with Tesseract training.

I may, however, point you to a report on an experiment that
consisted of training Tesseract on old Polish texts with the same
problems you describe:

http://lib.psnc.pl/publication/428

The texts, as both images and PAGE files, are publicly available at

http://dl.psnc.pl/activities/projekty/impact/results/

Please note that the trained dataset is also available at

http://dl.psnc.pl/download/tesseract_traineddata.zip

The training used the "classical" rectangular method.

To tell the truth, I don't know how effective the training was, as I'm
not aware of any large-scale application of the trained dataset. Using
it is one of the user options at the Virtual Transcription Laboratory
(http://wlt.synat.pcss.pl/wlt-web/index.xhtml), but I have no idea who
uses it and for what.

It would be interesting to retrain Tesseract using your approach on
the data described above and to compare the results, but I'm afraid
nobody has the time and motivation for it.

Best regards and good luck with your project

Clemens Neudecker

unread,
Dec 12, 2013, 7:04:08 AM12/12/13
to tesser...@googlegroups.com, matthew christy
Dear all,

Thanks to Matt and Bryan for making me aware of this interesting discussion!

My name is Clemens Neudecker and I have been the Technical Manager of the IMPACT project (www.impact-project.eu). Without going into greater detail about the points that have already been discussed at length, I would nevertheless like to elaborate a bit on the background of IMPACT and what decisions led to the use of Aletheia and PAGE in particular. Hopefully this will shed a bit more light on the situation.

At the time when the IMPACT project was conceived (2007), unfortunately neither Tesseract, Ocropus nor any other open source OCR system was close to delivering competitive results, in particular for historical documents. On the other hand, national and research libraries all over the world were mainly using Abbyy's commercial OCR for their digitization programmes. Thus the decision was made to cooperate with a commercial supplier, as this would guarantee that improvements made in the project would also end up in the real-life production workflows of the museums, libraries and archives sector. Other commercial partners were added to the consortium of research groups and libraries, always based on the assumption that a problem as big as OCR for the wide array of historical documents would be easier to tackle when combining research and commercial sector partners and experience. Nevertheless we were also closely watching the developments in the community, and I believe we have not failed to mention on many occasions the fact that Tesseract (and especially release 3.x) has become a serious competitor for commercial solutions, with the benefit of being developed in the open.

So within IMPACT we had a situation where there was a great diversity of software tools for OCR that were being worked on in the project, some open source, some commercial, some only available for research, etc. For all these different modules, proper evaluation needed to be done, which in turn meant that very specific ground truth had to be produced in large quantities and with a very high granularity. This was why Aletheia was built: it aims to address the two main issues with the ground truth production in IMPACT:

1) The (various) ground truth data had to be extremely specific - the PAGE format was the only format at the time providing a set of elements out of the box that was rich enough to express all these specific use cases.

2) A system had to be built that would be usable by lay people on a production scale - around 50k pages of ground truth had to be produced in a cost-efficient way using offshore service providers.

More than 50k pages of ground truth have since been produced using Aletheia, and feedback from the production was always integrated into the tool in a timely fashion by the colleagues from PRImA. I am very sorry to hear that so many people obviously have had problems obtaining access - I have informed the people over at PRImA and am confident they will be able to provide you all with the software asap. 

Which leads me to the second point I want to make - about open source. In my personal opinion, open source tools and community building are the best strategy for addressing the challenges OCR still has (and there are many) in a sustainable way. In my role as the Technical Manager of IMPACT I was also responsible for the technical integration of the software that was built. And we deliberately decided against an integrated software product that would have had to be commercial due to the integration of intellectual property from commercial companies, and instead built an interoperable framework based on established open source technologies such as Apache components and Taverna, a well-established open source workflow management tool from the bio sciences. This allowed a loose coupling of commercial and open source tools and a transparent evaluation between them. The system follows standard practices for interoperability and has been implemented using Java and web services for the greatest possible interoperability. It has been developed in the open under an Apache 2.0 license (thus allowing even commercial exploitation). You can find the sources here if you're interested: https://github.com/impactcentre/interoperability-framework.

Next to that, we have also been advocating open source to the various partners that developed software in the IMPACT project. But we also have to respect their institutional policies and intellectual property. However, since the end of the IMPACT project, more and more software tools have continuously been made available with source code. The main aim of the IMPACT Centre (www.digitisation.eu), a not-for-profit organisation founded to sustain IMPACT outcomes and foster community building, has since been to combine existing and new developments into the exact open-source, transparent and fully interoperable OCR tool chain that was mentioned earlier. In this function we are also a collaborator in the eMOP effort. However, as everyone who has been building open source software in an international collaborative setup will know, this is a long and tedious process and we are still very much busy with making tools ready for release. To give you a quick account of what is currently available and in the pipeline:

- the interoperable framework mentioned above: https://github.com/impactcentre/interoperability-framework
- the inventory extraction (clustering) tool from IMPACT: https://github.com/impactcentre/inventory-extraction
- one of the post-correction tools from IMPACT: https://github.com/impactcentre/PoCoTo
- an OCR evaluation tool with support for PAGE, but also hOCR and other formats: https://github.com/impactcentre/ocrevalUAtion
- a retrieval system that can leverage dictionaries and language technologies built in IMPACT: https://github.com/INL/BlackLab

Further modules from IMPACT that will be released as open source within Q1-2/2014:

- a Java tool for training Tesseract based on PAGE XML instances, including some basic classes for operating on PAGE
(I am currently beta testing this; it will remove some of the dependencies on Aletheia in the eMOP process)
- more tools for building OCR lexica and historical dictionaries

Watch https://github.com/impactcentre and http://www.digitisation.eu/tools/ to hear about all these tools as they become available! Also, from what I understand, one of the outcomes of eMOP will be a (albeit feature-reduced) web-based, open-source version of Aletheia. Besides, the XSD for PAGE is available, and basic Java classes for working with PAGE can in principle be generated automatically from it; see the sketch below.
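
To illustrate that last point, here is a minimal sketch of how such classes could be generated with the JAXB xjc binding compiler (a standard JDK/JAXB tool); the local schema file name pagecontent.xsd and the target Java package are assumptions made for this example, not taken from any official PAGE distribution:

    # fetch the PAGE schema to a local file first (file name is an assumption)
    mkdir -p src
    xjc -d src -p org.example.page pagecontent.xsd

xjc then emits one annotated Java class per complex type in the schema, and the standard JAXB runtime can marshal those classes to, and unmarshal them from, PAGE XML.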

Regarding ground truth: a couple of IMPACT datasets with ground truth have already been released (see www.digitisation.eu/data/), and here too more can be expected to follow in the course of 2014. I believe that, once released in its entirety, the availability of 50k pages of ground truth for historical documents in PAGE format will be one of the biggest assets of PAGE and Aletheia. With more of that ground truth being released and produced, and more tools becoming available that can operate on the PAGE format, I hope this will create some momentum in the OCR community beyond the former IMPACT consortium partners.

Within the IMPACT Centre we are busy making all these resources available, and we would very much encourage everyone here to get involved with the Centre (you can register for free as a user) and be in touch about the needs, concerns and expectations of the community towards IMPACT. I personally have been involved with OCR technology for more than a decade, and it is close to my heart. As Nick mentioned, the community is not very large, and having been at ICDAR and other events, there is still room for improvement with regard to sharing results, methods and implementations. The IMPACT Centre aims to provide a sustainable infrastructure where such community collaboration can develop - but for that we also need YOUR help and input.

Very much looking forward to being in touch, here or over at www.digitisation.eu,

Best regards,
Clemens

Nick White

unread,
Dec 12, 2013, 8:18:23 AM12/12/13
to tesser...@googlegroups.com
Hi Bryan,

On Tue, Dec 10, 2013 at 07:18:57PM -0600, Bryan Tarpley wrote:
> We've found that when two letters
> aren't touching, Tesseract has trouble identifying them together as a single
> ligature, /especially/ given that the character "e" by itself looks exactly the
> same as the one in "ke."

Oh, I see. That's something Tesseract ought to do better, really, if
it knows there are some 'characters' trained which are big enough
that a combined box may make sense. I'll have a look into the code
which does the boxing at some point to see if I can find a way to
improve it, but probably that won't be for some time.

> In those cases, even though the printer may have
> combined the "k" and the "e" onto the same plate to form the "ligature" "ke,"
> (what's the better word for plate here?), it is better to train Tesseract to
> recognize them as separate characters, from what we've found.

That sounds sensible. However, as I mentioned in an earlier email, I
question the wisdom of training these characters using
non-rectangular polygons, as Tesseract will be breaking the ligature
into rectangles anyway, so e.g. it'll never see the flourish of the
tail of 'k' in the same box as the main part of 'k', so training for
a 'k' with the full flourish can't help it.

> I feel like I'm
> talking in circles, so if this is making no sense, I can try to give example
> images of what I'm talking about tomorrow.

We both have a little I suspect, but hopefully we understand each
other completely now. I at least believe I do...

Nick

Nick White

unread,
Dec 12, 2013, 8:55:55 AM12/12/13
to tesser...@googlegroups.com, clemens....@gmail.com
Dear Clemens,

There's lots of great stuff in your email, thanks so much for
sending it! It'll take me a while to get through; I'm likely to
reply again later on.

I just took a look at the ocrevalUAtion tool, and read the pages at
https://sites.google.com/site/textdigitisation/ - it looks very
useful indeed! One comment I have regarding the above website is
that both of the engines listed under the "Free (online)" section
(i2ocr and free-ocr) actually use Tesseract for their OCR, so are
perhaps better located underneath the Tesseract section.

Nick

Clemens Neudecker

unread,
Dec 12, 2013, 9:39:19 AM12/12/13
to tesser...@googlegroups.com, clemens....@gmail.com
Hi Nick,

Thanks for the encouraging reply, looking forward to further feedback and comments!

I think the main distinction the https://sites.google.com/site/textdigitisation/ OCR list was trying to make is between online (web-based) services and stand-alone tools, but I will point out to the colleagues who maintain that site that they should mention that both of those online systems are based on Tesseract - thanks for pointing that out.

Regards,
Clemens

matthew christy

unread,
Dec 18, 2013, 4:21:14 PM12/18/13
to tesser...@googlegroups.com
> I would have thought the best approach for your situation, where as
> you rightly point out there are more ligatures than you have the
> time to find and train all of, is to train the common ligatures (as
> you're doing), and just trust that less common ligatures will be
> identified as separate characters that are close enough to their
> non-ligatured versions that they'll be recognised as such.

Nick, that's exactly what we're doing. And just as we aren't trying to identify all the possible ligatures in the documents we are OCRing, we're also not trying to identify all the possible ligatures in the documents from which we are creating the training. There are lots of characters, like the 'k' in the 'ke' that Bryan pointed out, that under-hang their neighbors, and I don't think all of them are ligatures. Look at that Guyot specimen sheet I linked to earlier. So I'm not sure how we would even identify ligatures when the characters are not connected. Regardless, given that Tesseract will most likely use rectangular boxes while trying to recognize characters, the best we can do is to create training for every contiguous glyph that we can identify and hope that Tesseract will have enough information to identify them when OCRing, whether the glyphs under- or over-hang each other or not. (See the box file example below.)
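
For readers who haven't looked inside a Tesseract box file: each line records a glyph's text followed by the left, bottom, right and top pixel coordinates of its bounding rectangle (origin at the bottom-left of the image) plus a page number, in the Tesseract 3.x format. A purely illustrative sample with invented coordinates shows the two options being discussed here, first a single combined 'ke' entry:

    ke 120 340 210 400 0

versus two separate entries covering the same region:

    k 120 340 168 400 0
    e 172 340 210 400 0

Because the classifier only ever sees the rectangular crop defined by those coordinates, any part of a flourish falling outside that rectangle is invisible to it, which is the point Nick makes above about non-rectangular polygon training.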


> I haven't had to train an italic font yet. Would the printing sorts
> have been slanted for some italic fonts? I suspect so (but don't
> know; someone should look it up), which would result in the slight
> overlap you see. If that is the case, I wonder if Tesseract takes it
> into account? Arguably it should, but as far as I know it just deals
> with regular rectangles. There is certainly some extra cleverness it
> does to deal with italics... I suspect small overlaps of the kind
> that you'll see with italic fonts are essentially just ignored. I
> don't know whether that's also true in the training process. It will
> be interesting to see how the new training tools to be released deal
> with italics.
I've always assumed that they are slanted, and I will ask our book history expert next time I see him. Some italic fonts are more slanted than others and can have a good deal of overlap. I've also wondered whether specifying that a font is italic during training is how to indicate that it should use slanted boxes while OCRing, or something like that.

matthew christy

unread,
Dec 18, 2013, 4:25:55 PM12/18/13
to tesser...@googlegroups.com, matthew christy
Hi Janusz,

Thanks for the links. I had read that IMPACT report before, but hadn't seen the other material. 



> It would be interesting to retrain tesseract using your approach on
> the data described above and to compare the results, but I'm afraid
> nobody has time and motivation for it.

Yes, that would be interesting. There are a number of things I'd like to try, but like you said, there's just no time for it all. 

> Best regards and good luck with your project
Thanks 

Janusz S. Bien

unread,
Dec 20, 2013, 10:52:41 AM12/20/13
to tesser...@googlegroups.com, matthew christy
Quote/Cytat - matthew christy <matt.c...@gmail.com> (Wed 18 Dec
2013 10:21:14 PM CET):

>> I haven't had to train an italic font yet. Would the printing sorts
>> have been slanted for some italic fonts? I suspect so (but don't
>> know; someone should look it up), which would result in the slight
>> overlap you see. If that is the case, I wonder if Tesseract takes it
>> into account? Arguably it should, but as far as I know it just deals
>> with regular rectangles. There is certainly some extra cleverness it
>> does to deal with italics... I suspect small overlaps of the kind
>> that you'll see with italic fonts are essentially just ignored. I
>> don't know whether that's also true in the training process. It will
>> be interesting to see how the new training tools to be released deal
>> with italics.
>>
> I've always assumed that they are slanted and I will ask our book history
> expert next time I see him. Some italics fonts are more slanted than others
> and can have a good deal of overlap. I've also kind of wondered if
> specifying that a font is italic during training is how to indicate that it
> should use slanted boxes while OCR'ing, or something like that.

I've never seen a mention of slanted printing sorts. Perhaps they were
just kerned, cf.

http://en.wikipedia.org/wiki/Kerning

Best regards

adudczak

unread,
Dec 20, 2013, 11:17:56 AM12/20/13
to tesser...@googlegroups.com, matthew christy
Dear all,

We've just open-sourced a tool which allows you to create Tesseract training material out of the PAGE XMLs from Aletheia. The source code (Java) is available here: https://github.com/psnc-dl/page-generator -- binaries can also be downloaded from GitHub. It is a command-line tool, so it should be easy to use as part of your scripts.

The tool allows you to "cut" images based on the glyph data from a PAGE file and afterwards create a Tesseract training page with the corresponding box file, which can then be used for Tesseract training. I was testing this using the script https://github.com/psnc-dl/page-generator/blob/master/src/etc/train.sh and it seems that it can produce a valid Tesseract profile; a sketch of the usual training sequence follows below.
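
For anyone who has not run the documented Tesseract 3.x training steps before, here is a minimal sketch of the command sequence a wrapper script like the one above typically drives; the lang.font.exp0 file names and the font_properties file are the placeholders used in Tesseract's own training instructions, not anything specific to page-generator:

    # run Tesseract in training mode on the synthetic page and its box file
    tesseract lang.font.exp0.tif lang.font.exp0 box.train
    # build the character set from the box file(s)
    unicharset_extractor lang.font.exp0.box
    # cluster features; font_properties declares style flags (italic, bold, ...)
    mftraining -F font_properties -U unicharset -O lang.unicharset lang.font.exp0.tr
    cntraining lang.font.exp0.tr
    # prefix the outputs with the language code and pack them into one file
    mv inttemp lang.inttemp && mv pffmtable lang.pffmtable
    mv normproto lang.normproto && mv shapetable lang.shapetable
    combine_tessdata lang.

The resulting lang.traineddata is then copied into Tesseract's tessdata directory and selected at run time with, for example, tesseract page.tif output -l lang.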

Page-generator also supports output from our own tool, Cutouts (http://wlt.synat.pcss.pl/cutouts, https://confluence.man.poznan.pl/community/display/WLT/Cutouts+application), which helps with the preparation of training material.

Kind regards,
Adam Dudczak

--
Digital libraries team, PSNC

Nick White

unread,
Dec 20, 2013, 3:58:17 PM12/20/13
to tesser...@googlegroups.com
Dear Adam,

> Tool allows to "cut" images on top of glyph data from PAGE file and afterwards
> create Tesseract training page with respective box file. This can be used for
> Tesseract training. I was testing this using script: https://github.com/psnc-dl
> /page-generator/blob/master/src/etc/train.sh and it seems that it can produce
> valid Tesseract profile.

That sounds a lot like the tool that Matthew announced a few days
ago (in this very thread). Can you explain the differences a little,
please?

> Page-generator supports also output from our tool -- Cutouts (http://
> wlt.synat.pcss.pl/cutouts, https://confluence.man.poznan.pl/community/display/
> WLT/Cutouts+application) which allows to work on preparation of training
> material.

That's interesting. Am I correct in thinking that this replaces
Aletheia as a tool to extract glyph images in your workflow? Is the
code available? Is it freely licenced?

Many thanks, I look forward to learning more.

Nick

adudczak

unread,
Dec 30, 2013, 4:36:55 AM12/30/13
to tesser...@googlegroups.com


On Friday, December 20, 2013 at 9:58:17 PM UTC+1, Nick White wrote:

>> Tool allows to "cut" images on top of glyph data from PAGE file and afterwards
>> create Tesseract training page with respective box file. This can be used for
>> Tesseract training. I was testing this using script: https://github.com/psnc-dl
>> /page-generator/blob/master/src/etc/train.sh and it seems that it can produce
>> valid Tesseract profile.

> That sounds a lot like the tool that Matthew announced a few days
> ago (in this very thread). Can you explain the differences a little,
> please?


You mean FRANKEN+? Yes, to some extent it is similar. Page-generator was originally developed for the purposes of the IMPACT project; it was used in the experiments described in this report: http://lib.psnc.pl/publication/428. From the beginning it was conceived as a command-line tool that can easily be integrated into a larger workflow. In the first step it takes a PAGE XML and a PNG file and prepares a "cut-out" font. You can manually review which glyphs should go into the training set (we don't have as nice a browser as FRANKEN+) and then launch the second step, in which page-generator assembles the training images and prepares the corresponding box file.

Page-generator was developed in 2011, but we were not able to release it as open source until now.
 
>> Page-generator supports also output from our tool -- Cutouts (http://
>> wlt.synat.pcss.pl/cutouts, https://confluence.man.poznan.pl/community/display/
>> WLT/Cutouts+application) which allows to work on preparation of training
>> material.

> That's interesting. Am I correct in thinking that this replaces
> Aletheia as a tool to extract glyph images in your workflow? Is the
> code available? Is it freely licenced?


To a large extent, yes. The biggest difference is that Aletheia can handle non-rectangular polygons for marking characters (the issue of overlapping characters mentioned in this discussion). In Cutouts you can only specify rectangular boxes for characters (they're initially loaded from a box file), but the web interface has a tool which allows you to manually remove parts of overlapping glyphs from a given box. IMHO the effort is similar to making a non-rectangular selection, but it fits very well into the Tesseract training model.

Aletheia is a desktop tool, whereas Cutouts can be used to crowdsource the preparation of training materials; apart from the main interface there is also an "audit"/moderation interface which allows you to validate the results of your crowd's work. Each glyph is represented as an XML file and three images (the original selection including overlapping parts of other characters, a binarized image of the glyph, and the final version after manual removal of the overlapping "noise").

As for the license and the source code, we would like to release this as open source, but that requires some additional work. I hope it will happen at some point, but I don't know when ;-|. I will keep you posted if you are interested in the further development of this tool.

Kind regards,
Adam


sushma ms

unread,
Feb 2, 2016, 2:24:54 AM2/2/16
to tesseract-ocr
Hi Bryan,

I badly need your help with Tesseract-OCR.

I saw your video below, and it prompted me to ask you a few questions.

I wanted to know how to add the newly trained data to tessdata so that it is used by default,
with no need to provide "-l lang".

ex: tesseract test.png test -l eng2

I used the link below and created the trained data;

now I want to add it to tessdata and have Tesseract use that trained data by default. Will you please let me know the steps for how I can do that?

Thank you,
Sushma
