Hi all,
I have finally got around to organising and releasing some (well, a
lot of) ground truth files for Ancient Greek, the language I have
been working on OCR training for ages now. By "ground truth" I mean
real page scans with the corresponding (hand-typed) correct text,
which is essential for testing the accuracy of OCR results.
I thought others might find it helpful or interesting to hear how I
went about it.
In my case the best source was an old (public domain) book for which
I had the hand-typed text, and of which several different scans
existed. I split the text into one file per page, and gave each file
the same name as the page scan for that page, but with a .txt file
extension.
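
For anyone setting up a similar directory, a quick check that every
page scan has its text file can save some confusion later. The
following is only a rough Python sketch, not part of any of the tools
mentioned here; the function names and the list of image extensions
are made up for illustration.

```python
from pathlib import Path

# Extensions treated as page scans. This list is an assumption;
# adjust it to whatever format your scans are in.
IMAGE_EXTS = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def ground_truth_path(scan):
    """Return the expected ground truth path: same name, .txt extension."""
    return Path(scan).with_suffix(".txt")


def find_unpaired(directory):
    """List page scans in `directory` that have no matching .txt file."""
    scans = sorted(
        p for p in Path(directory).iterdir()
        if p.suffix.lower() in IMAGE_EXTS
    )
    return [p for p in scans if not ground_truth_path(p).exists()]


if __name__ == "__main__":
    import sys

    # Usage: python checkpairs.py <directory>
    for scan in find_unpaired(sys.argv[1]):
        print(f"missing ground truth for {scan.name}")
```
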
This book also had translations of the text in Latin, which I didn't
want to preserve, so I selected only the Ancient Greek parts and
stored their locations using the .uzn format. I did this using a
little program I wrote a while ago that uses the Tesseract C-API to
analyse the page layout of this type of book, select the relevant
parts, and detect the language of each section, printing a .uzn file
describing them all. It is very specific to this type of book, but
in case you're curious you can find it in migneuzn.c in the
repository:
http://ancientgreekocr.org/mignetools.git
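
To give an idea of what such a file looks like: as far as I
understand the format, a .uzn file is plain text with one zone per
line, giving the left, top, width and height of a rectangle in pixels
followed by a label. The Python sketch below writes zones in that
layout. It is not how migneuzn.c works (that does the layout analysis
through the Tesseract C-API); the function name, coordinates and
labels are all hypothetical.

```python
def format_uzn(zones):
    """Format zones as .uzn text, one zone per line.

    Each zone is a (left, top, width, height, label) tuple, with the
    coordinates in pixels. The field order is an assumption about the
    .uzn layout, so check it against your own tools.
    """
    lines = []
    for left, top, width, height, label in zones:
        lines.append(f"{left} {top} {width} {height} {label}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical coordinates for two Greek blocks on one page;
    # the Latin translation is simply left out.
    zones = [
        (150, 220, 980, 1400, "Greek"),
        (150, 1700, 980, 1100, "Greek"),
    ]
    print(format_uzn(zones), end="")
```
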
A while ago I forked a repository of the ISRI OCR evaluation tools
to make them work easily with UTF-8, and included some helper
scripts:
http://ancientgreekocr.org/ocr-evaluation-tools.git
Of particular relevance here is the 'tessaccsummary' script, which,
given a directory of images with corresponding ground truth text and
a .traineddata file, will OCR each page and print its accuracy,
followed by an average summary at the end. It is all quite basic, but
very handy.
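
To make "accuracy" concrete, here is a simplified stand-in written in
Python. It is not the ISRI tools' implementation, nor what
tessaccsummary does internally; it just counts character errors as
the edit distance between the ground truth and the OCR output, and
reports the share of ground truth characters that came out right. The
function names are made up, and the summary here is weighted by
character count, which is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # character missing from b
                cur[j - 1] + 1,            # extra character in b
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def char_accuracy(truth, ocr):
    """Fraction of ground truth characters recognised correctly.

    Can go negative if the OCR output contains a lot of extra text.
    """
    if not truth:
        return 1.0 if not ocr else 0.0
    return (len(truth) - edit_distance(truth, ocr)) / len(truth)


def summary_accuracy(pairs):
    """Overall accuracy across (truth, ocr) pairs, weighted by length."""
    total = sum(len(truth) for truth, _ in pairs)
    errors = sum(edit_distance(truth, ocr) for truth, ocr in pairs)
    return (total - errors) / total if total else 1.0
```

A page where one accent out of five characters is lost, for example
char_accuracy("λόγος", "λογος"), counts as one substitution error.
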
I decided to store the ground truth files in a git repository. Git
isn't an ideal way to store lots of binary files (like page scans),
but these scans are never likely to change, so the size won't get
out of hand as it would if the binary files changed regularly, and I
think it's fine. That said, it is about 4.5GiB on disk. The ground
truth repository is at http://ancientgreekocr.org/grcground.git but
as I say it's pretty massive, so please don't clone it unless you
think you'll actually look at it, as the bandwidth will cost me :)
I think it would be really good if others interested in other
languages collected and shared some ground truth files. The more
rigorous testing we do of our OCR training files, the better our
results will end up being. I am working with Latin OCR now, so will
probably do something similar for that soon. Is anyone else
interested?
Nick