We need a data set to run tests on for accuracy of our OCR


Debayan Banerjee

Apr 19, 2011, 4:03:06 PM
to indi...@googlegroups.com
Hi,

I think it's time we started looking for a set of images, covering most
Indian scripts, that we can run tests on. In the long run this set
should serve as the standard benchmark for all Indic OCRs. I know some
institutes in India have such sets, but they are all restrictively
licensed.
There are people on this list who know what kind of images we should
be looking for. I request those list members to chip in with such
data.

--
Debayan Banerjee

satyaakam goswami

Apr 19, 2011, 10:34:06 PM
to indi...@googlegroups.com
No, I do not know what you are looking for; can you name those institutes? I have a scanner. Does putting a Hindi book under it and getting images count as a data set? Can you elaborate?

-Satya

M.N.S.Rao

Apr 19, 2011, 10:59:59 PM
to indi...@googlegroups.com
Hi,
Will scanned pages in Kannada be enough, or are you looking for custom
characters for testing? If scanned pages will do, I have a lot that I sent to the YAGPO project.
Please feel free to ask for scanning of Kannada pages for testing.
MNS Rao

Ashish Mahabal

Apr 19, 2011, 11:24:42 PM
to indi...@googlegroups.com

Hi,

I have PDF pages from the Marathi Vishvakosh. Some are scanned pages and
some are machine conversions. The latter should be easy if the OCR
is good, and the former will make a good test.

Let me know where I can deposit them.

Cheers,
ashish
http://www.astro.caltech.edu/~aam


Abhaya Agarwal

Apr 20, 2011, 12:07:45 AM
to indi...@googlegroups.com
We should collect the data of the different types separately.
  1. Recent laser-printed material on white paper makes for the easiest data set. This can be generated in as large a quantity as required, and pretty quickly. See my comment at the end of the mail.
  2. Scanned pages of recently offset-printed books. Hard to get properly licensed data.
  3. Scanned pages of old books: this will be the hardest. Apart from the age factor, the print quality of Indian-language books is usually very bad, with uneven ink spread. If you look closely at one of the attachments I sent earlier, you will also find very uneven inter-word spacing. Getting the images for various languages is not an issue here; we can find a sufficient number of out-of-copyright books in the DLI database and collect them.
Apart from the scanned pages, we will also need the ground truth for these images. The simplest ground truth is a text version of the scanned pages, although ideally we would want word-level mappings too. Character-level mappings would be the cherry on top. We would also need to decide on a data format for storing the word- and character-level mappings. Does Tesseract already have something?

So, primarily 2 tasks:
  • If someone already has scanned pages plus the digital version of the same, let us collect all of that. Then we can build a small interface to allow anyone to come and align it.
  • If someone only has the scanned pages, word-level alignments can be generated right at the time of typing: we automatically analyze the image, break it into words, and then show one word at a time for entry (see the segmentation sketch below).
An alternative way to generate data of type 1 is to take printouts of already existing digital text in different fonts and then scan those pages. This text can come from out-of-copyright books or anything else that is suitably licensed. If we insert a word-boundary marker in the text before printing, we should be able to generate reliable word-level mappings automatically.
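
For the "break into words" step, something along these lines could work. This is only a rough sketch, not the project's actual tool: it assumes OpenCV (version 4's findContours signature), a reasonably clean scan, a placeholder file name ("page.png"), and a dilation kernel size that would need tuning per script and resolution.

```python
import cv2

def extract_word_boxes(image_path):
    """Return approximate word bounding boxes, roughly in reading order."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu binarization; THRESH_BINARY_INV makes text white on black,
    # which is what the morphology below expects.
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Dilate horizontally so the characters of one word merge into a
    # single blob while inter-word gaps stay separate. The 9x3 kernel
    # is a guess and must be tuned per script and scan resolution.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3))
    blobs = cv2.dilate(binary, kernel, iterations=1)
    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]  # (x, y, w, h)
    # Crude reading order: bucket into 40 px row bands, then left to right.
    return sorted(boxes, key=lambda b: (b[1] // 40, b[0]))

for x, y, w, h in extract_word_boxes("page.png"):
    print(x, y, w, h)
```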

Regards,
Abhaya
--
-------------------------------------------------
blog: http://abhaga.blogspot.com
Twitter: http://twitter.com/abhaga
-------------------------------------------------

M.N.S.Rao

Apr 20, 2011, 6:18:15 AM
to indi...@googlegroups.com
This is the next mail
---------------------
There is a limit of 4 MB for attachments. Hence I am attaching only 4 PNG files (to keep the size down) and the Kannada text in this mail, and 4 more in the next mail.
suryanurudia.txt
surya5.PNG
surya6.PNG
surya7.PNG
surya8.PNG

M.N.S.Rao

Apr 20, 2011, 6:18:16 AM
to indi...@googlegroups.com
There is a limit of 4 MB for attachments. Hence I am attaching only 4 PNG files (to keep the size down) and the Kannada text in this mail, and 4 more in the next mail.
suryanurudia.txt
surya1.PNG
surya2.PNG
surya3.PNG
surya4.PNG

Sankarshan Mukhopadhyay

Apr 20, 2011, 10:37:29 AM
to indi...@googlegroups.com
On Wed, Apr 20, 2011 at 9:37 AM, Abhaya Agarwal
<abhaya....@gmail.com> wrote:

> So, primarily 2 tasks:
>
> If someone already has scanned pages plus the digital version of the same, let us collect all of that. Then we can build a small interface to allow anyone to come and align it.
> If someone only has the scanned pages, word-level alignments can be generated right at the time of typing: we automatically analyze the image, break it into words, and then show one word at a time for entry.

archive.org seems to have copies of books that are essentially scanned
pages. Does that help as one of the sources for obtaining data?

--
sankarshan mukhopadhyay
<http://sankarshan.randomink.org/blog/>

M.N.S.Rao

Apr 20, 2011, 10:50:35 AM
to indi...@googlegroups.com
Here is what I had drafted, but just then your mail arrived. Anyway, I want to say that:
Selecting a data file for preparing a language traineddata file for Indic
languages has become a challenging job, as it is nearly impossible to include
all possible combinations of characters, by the very nature of the
structures of these languages. It is necessary to take assistance
from experts in the field of linguistics in this context. Some thought may
be given to this aspect.
MNS Rao


Dr. Atul Negi

Apr 20, 2011, 10:34:47 AM
to indic-ocr
Here are my views. I chose to comment on this post since it makes several points that echo mine, so I am just reinforcing what has been said.

On Apr 20, 9:07 am, Abhaya Agarwal <abhaya.agar...@gmail.com> wrote:
> We should collect the data of the different types separately.
>
> 1. Recent laser-printed material on white paper makes for the easiest data set. This can be generated in as large a quantity as required, and pretty quickly.

I very much agree with this point; however, the laser-print data needs to be generated systematically, so that the entire character set, with all the variations of maatraas, combinations of consonants, sanyukta akshara, etc., is covered. This has to be done intelligently and not by brute force. We need to take the common sets of words, and therefore we need expert linguistic knowledge to give us this set of data.

Therefore, start by generating printouts of the aksharmala with the full barakhadi. Follow this up with the words used in the KG and 1st-class books that give a systematic introduction to the language. In Telugu there is a book called "Pedda Bala Shiksha" where the choice of words ensures exposure to mostly all the aksharas, as well as to simple, well-known words.

We can use the 1st-class or KG books themselves for the OCR training data, or at least books up to the middle-school level, for the simpler vocabulary.

Note that for a full-fledged test we need to take up text in various fonts and styles, like bold and italics.


> 2. Scanned pages of recently offset-printed books. Hard to get properly licensed data.
Here I think it means we need to get those books which don't have copyright.

Old books can satisfy the condition. They are out of copyright.

Religious texts are out of copyright, so they have been printed by many printers in various fonts.


> 3. Scanned pages of old books: this will be the hardest. Apart from the age factor, the print quality of Indian-language books is usually very bad, with uneven ink spread. If you look closely at one of the attachments I sent earlier, you will also find very uneven inter-word spacing. Getting the images for various languages is not an issue here; we can find a sufficient number of out-of-copyright books in the DLI database and collect them.

You said it! Ink spread and the like make it really tough.

> Apart from the scanned pages, we will also need the ground truth for these images. The simplest ground truth is a text version of the scanned pages.

What ground truth really requires is not just the text but a mapping stating, for a given rectangular area of the image, what the associated text is.

There are several approaches to creating ground-truth data. The simplest is to mark the text up with XML tags, giving the bounding box for each piece of text content. Tags are needed for things like <page>, <page no.>, <header>, <footer>, <line>, <word>, <picture>, <non-ocr symbol>, etc.
Punctuation is very important to capture and keep in the ground-truth data. For example, is there a dot at the end of the word, or is it just noise?
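
As an illustration only, ground truth along those lines could be produced like this in Python; the tag names, attribute names, coordinates, and file names are all assumptions for the example, not a fixed standard.

```python
import xml.etree.ElementTree as ET

# One hypothetical page with a single line containing a single word.
# bbox values are "left top right bottom" pixel coordinates, made up here.
page = ET.Element("page", {"no": "1", "image": "surya1.PNG"})
line = ET.SubElement(page, "line", {"bbox": "120 200 1480 260"})
word = ET.SubElement(line, "word", {"bbox": "120 200 310 260"})
word.text = "ಸೂರ್ಯ"  # the transcription associated with that rectangle

ET.ElementTree(page).write("surya1_gt.xml", encoding="utf-8",
                           xml_declaration=True)
```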

> Although ideally, we would want word level mappings also. Character level
> mappings would be cherry on the top. We would also need to decide on data
> format for storing the word and character level mappings. Does Tesseract
> already has something?


Character- or akshara-level mappings are quite difficult to create. Since I don't know anything about how Tesseract does things, more informed folks can give their view.

I would like to point out that IL OCR needs an almost parallel project that generates a critical mass of test sets and ground truths, without which things will not work. Ideally, somebody needs to donate the effort to scan an out-of-copyright book and then create the ground truth. It is a work of considerable magnitude.



Debayan Banerjee

Apr 21, 2011, 4:27:05 PM
to indi...@googlegroups.com
Hi,

I have created a separate repository for this purpose at https://github.com/debayan/Indic-OCR-Benchmark-Data . Here is how we are going to add data to it:

1) A person interested in contributing will first upload the relevant pages to a third-party image hosting site, like Picasa or Flickr, and pass a link to this list. If uploading to a third-party site is not possible, the contributor may adopt any other method of placing the image in a publicly accessible location that all list members can reach.

2) On verifying an image's quality and suitability, one of the people with commit access to the repository will commit the pertinent images to the right folder.

3) If a particular person sends in useful images on a regular basis, s/he will gain commit access to the repository.

All donated images should, of course, be under a Creative Commons or similar suitably open license.

Now the question is what kind of images we need, and what the folder structure of the repository should be.

The top level of the repository will have one folder per language. Each language-specific folder will have the following folders:

1) Laser Printed Images -> will contain scanned images of laser-printed pages

2) Typeset Images -> will contain scanned images of not-so-old books which have been typeset

3) Old Typeset Images -> will contain scanned images of old pages with ink spread

4) Parking Folder -> will contain images that are good but do not yet have associated ground truth


For each folder above we will have the following sub-folders:

1) Noisy

2) Clean

The Noisy folder will contain images with different kinds of noise.
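
Put together, the layout for one language would look like this ("Hindi" is purely a placeholder):

```
Hindi/
    Laser Printed Images/
        Noisy/
        Clean/
    Typeset Images/
        Noisy/
        Clean/
    Old Typeset Images/
        Noisy/
        Clean/
    Parking Folder/
        Noisy/
        Clean/
```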


Each image must be accompanied by corresponding metadata, or "ground truth" as it is called. What format should we follow? I recommend the simple yet effective "box file" format.

A box file is nothing but a plain-text file with one character per line, followed by a 4-tuple giving the coordinates at which that character appears in the corresponding image.

This is the format Tesseract uses, and there are several tools for automating the task of creating this data. Find out more at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Box_File_Editors
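
To make the format concrete, here is a minimal Python sketch that reads such a box file ("surya1.box" is a placeholder name). It assumes the "char left bottom right top" layout, with the origin at the bottom-left of the image, and ignores the trailing page number that newer Tesseract versions append.

```python
from typing import List, NamedTuple

class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int

def read_box_file(path: str) -> List[Box]:
    """Parse a Tesseract-style box file into a list of Box records."""
    boxes = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 5:
                continue  # skip blank or malformed lines
            left, bottom, right, top = map(int, parts[1:5])
            boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes

for box in read_box_file("surya1.box"):
    print(box.char, (box.left, box.bottom), (box.right, box.top))
```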

I invite comments and suggestions.


--
Debayan Banerjee

M.N.S.Rao

Apr 21, 2011, 5:19:02 PM
to indi...@googlegroups.com
I have uploaded some Kannada pages at https://github.com/mnsrao/Indic-OCR

Debayan Banerjee

Apr 21, 2011, 5:22:25 PM
to indi...@googlegroups.com
On 22 April 2011 02:49, M.N.S.Rao <mns...@gmail.com> wrote:
> I have uploaded some Kannada pages at https://github.com/mnsrao/Indic-OCR

I don't see anything there. It says "This repository's default branch is empty!" on the home page.


--
Debayan Banerjee

M.N.S.Rao

Apr 21, 2011, 5:31:20 PM
to indi...@googlegroups.com



Debayan Banerjee

Apr 21, 2011, 5:34:29 PM
to indi...@googlegroups.com
On 22 April 2011 03:01, M.N.S.Rao <mns...@gmail.com> wrote:

I have to create a Windows Live ID to access these files. Please upload them to something like Picasa or Flickr instead.

--
Debayan Banerjee

M.N.S.Rao

Apr 21, 2011, 5:48:54 PM
to indi...@googlegroups.com
Text attached, as Picasa does not accept .txt files.
MNS Rao



kelayya.txt

Debayan Banerjee

Apr 21, 2011, 5:52:38 PM
to indi...@googlegroups.com


2011/4/22 M.N.S.Rao <mns...@gmail.com>

> Text attached, as Picasa does not accept .txt files.

Great! Thanks for your patience.

--
Debayan Banerjee

Abhaya Agarwal

Apr 25, 2011, 11:43:50 PM
to indi...@googlegroups.com
> archive.org seems to have copies of books that are essentially scanned pages. Does that help as one of the sources for obtaining data?


They are a good source, but sometimes only processed, binarized images are available, so we will need to check and retrieve manually. We can also fetch data from the Digital Library of India collection, although their "Out of Copyright" classification is mostly useless for Indian purposes. So again, we need to be a little careful when picking up material from there.

Regards,
Abhaya

Abhaya Agarwal

Apr 25, 2011, 11:51:34 PM
to indi...@googlegroups.com
One suggestion: for the laser-printed images, can we use Wikipedia data? We could choose a fixed set of articles that are available in most Indic languages, then print and scan them. I can take this up for a few languages as soon as my printer is fixed.

Also, for the laser-printed images we should record the font information somewhere, since it will be available. We can have the same document printed in 2-3 fonts and keep all the versions. Should we do that by creating further directories or by keeping a metadata file (a sketch of the latter is below)?
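
If we go the metadata-file route, one possibility is a small JSON record per image; every field name and value here is made up for illustration, nothing is decided.

```python
import json

# Hypothetical record for one laser-printed, scanned page.
metadata = {
    "image": "hi_wiki_article1_lohit.png",
    "language": "hi",
    "source": "Wikipedia article, printed and scanned",
    "font": "Lohit Devanagari",
    "font_size_pt": 12,
    "scan_dpi": 300,
}

# Store it next to the image, same base name with a .json extension.
with open("hi_wiki_article1_lohit.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)
```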

Also, I think it is OK if all combinations of characters are not covered to start with. If our data is drawn from natural sources, the combinations will automatically be present according to their natural distribution. If some particular gap bothers us later on, we can fill it with special data.

Regards,
Abhaya