I think it's time we started looking for a set of images covering most
Indian scripts that we can run tests on. In the long run this set
should serve as the standard benchmark for all Indic OCRs. I know some
institutes in India have such sets, but they are all restrictively
licensed.
There are people on this list who know what kind of images we should
be looking for. I request those list members to chip in with such
data.
--
Debayan Banerjee
I have PDF pages from the Marathi Vishvakosh. Some are scanned pages and
some are from machine conversions. The latter should be easy if the OCR
is good, and the former will be a good test.
Let me know where I can deposit them.
Cheers,
ashish
Ashish Mahabal http://www.astro.caltech.edu/~aam
----- Original Message -----
From: Abhaya Agarwal
Sent: Wednesday, April 20, 2011 9:37 AM
> So primarily 2 tasks:
>
> - If someone already has scanned pages + the digital version of the same:
> let us collect all that. Then we can build a small interface to allow
> anyone to come and align it.
> - If someone only has the scanned pages, word-level alignments can be
> generated right at the time of typing. We automatically analyze the image,
> break it into words, and then show one word at a time for entry.
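The "analyze the image and break it into words" step above could be done with a simple vertical projection profile: count ink pixels per column and treat runs of empty columns as inter-word gaps. A minimal sketch, assuming a binarized page image; the function name and `min_gap` threshold are illustrative, not from any existing tool:

```python
def segment_words(image, min_gap=2):
    """image: 2D list of 0/1 pixels (1 = ink), one inner list per row.
    Returns (start_col, end_col) column spans, one per detected word."""
    width = len(image[0])
    # Vertical projection profile: ink pixels in each column.
    profile = [sum(row[x] for row in image) for x in range(width)]
    words, start, last_ink = [], None, None
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x  # a new word begins
            last_ink = x
        elif start is not None and x - last_ink >= min_gap:
            # Gap is wide enough: close off the current word.
            words.append((start, last_ink))
            start = None
    if start is not None:
        words.append((start, last_ink))
    return words
```

Real pages would of course need binarization, deskewing, and line segmentation first; `min_gap` would also need tuning, since (as noted later in this thread) inter-word spacing in old Indian-language books is very uneven.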
archive.org seems to have copies of books that are essentially scanned
pages. Does that help as one of the sources for obtaining data?
--
sankarshan mukhopadhyay
<http://sankarshan.randomink.org/blog/>
----- Original Message -----
From: "Dr. Atul Negi" <atul...@gmail.com>
To: "indic-ocr" <indi...@googlegroups.com>
Sent: Wednesday, April 20, 2011 8:04 PM
Subject: [Indic-OCR] Re: We need a data set to run tests on for accuracy of
our OCR
> Here are my views. I chose to comment on this post since it has
> several points that echo mine, so I am just reinforcing what has
> been said.
>
> On Apr 20, 9:07 am, Abhaya Agarwal <abhaya.agar...@gmail.com> wrote:
>> We should collect the data of different types separately.
>>
>> 1. Recent laser-printed material on white paper makes for the easiest
>> data set. This can be generated in as large a quantity as required and
>> pretty
>
> I very much agree with this point; however, the laser-print data needs
> to be generated systematically so that the entire character set, with
> all the variations of maatraas, combinations of consonants, sanyukt
> akshar, etc., is covered. This has to be done intelligently and not by
> brute force. We need to take the common sets of words, so we need
> expert linguistic knowledge to produce this data set.
>
> Therefore, start by generating printouts of the aksharmala with all
> barakhadi. Follow this up with words used in the KG and 1st-class books
> that give a systematic introduction to the language. In Telugu there is
> a book called "Pedda Bala Shiksha" where the choice of words ensures
> exposure to almost all the aksharas, as well as simple, well-known words.
>
> We can use the KG and 1st-class books themselves for the OCR training
> data, or at least books up to middle school, for the simpler vocabulary.
>
> Note that for a full-fledged test we need to cover text in various
> fonts and styles, like bold and italics.
>
>
>> 2. Scanned pages of recently offset-printed books: hard to get
>> properly licensed data
> Here I think it means we need to get books which are not under
> copyright.
>
> Old books can satisfy this condition: they are out of copyright.
>
> Religious texts are also out of copyright, so they have been printed
> by many, many printers in various fonts.
>
>
>> 3. Scanned pages of old books: this will be the hardest. Apart from
>> the age factor, the quality of printing in Indian-language books is
>> usually very bad, with uneven ink spread. If you look closely at one
>> of the attachments I sent earlier, you will also find very uneven
>> inter-word spacing. Getting the images for various languages is not an
>> issue here; we can find a sufficient number of out-of-copyright books
>> in the DLI database and collect them.
>
> You said it! Ink spread etc. makes it really tough.
>
>> Apart from scanned pages, we will also need the ground truth for these
>> images. The simplest ground truth is the text version of the scanned
>> pages.
>
> What ground truth really requires is not just the text but a mapping
> stating, for a given rectangular area of the image, what the
> associated text is.
>
> There are several approaches to creating ground-truth data. The
> simplest is XML tags for the text, giving the bounding box for each
> piece of text content. Tags are needed for things like <page>,
> <page no.>, <header>, <footer>, <line>, <word>, <picture>,
> <non-ocr symbol>, etc.
> Punctuation is very important to capture and keep in the ground-truth
> data: for example, is there a dot at the end of the word, or is it
> noise?
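To make the XML-with-bounding-boxes idea concrete, here is a minimal sketch that builds such a file with Python's standard `xml.etree.ElementTree`. The tag names (`page`, `line`, `word`) follow the post; the `bbox="left,top,right,bottom"` attribute format and the helper's name are my own assumptions, not an established schema:

```python
import xml.etree.ElementTree as ET

def make_ground_truth(page_no, lines):
    """lines: list of (line_bbox, [(word_bbox, word_text), ...]),
    where each bbox is a (left, top, right, bottom) pixel tuple."""
    page = ET.Element("page", no=str(page_no))
    for line_bbox, words in lines:
        # One <line> element per text line, carrying its bounding box.
        line = ET.SubElement(page, "line", bbox=",".join(map(str, line_bbox)))
        for word_bbox, text in words:
            # One <word> element per word, with its own box and the text.
            word = ET.SubElement(line, "word", bbox=",".join(map(str, word_bbox)))
            word.text = text
    return page

# Example: one page, one line, one Devanagari word (coordinates made up).
page = make_ground_truth(1, [((10, 10, 300, 40),
                              [((10, 10, 90, 40), "नमस्ते")])])
xml = ET.tostring(page, encoding="unicode")
```

A real schema would also need the `<header>`, `<footer>`, `<picture>`, and `<non-ocr symbol>` tags mentioned above, plus a language or script attribute per page.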
>
>> Although ideally, we would want word-level mappings also. Character-level
>> mappings would be the cherry on top. We would also need to decide on a data
>> format for storing the word- and character-level mappings. Does Tesseract
>> already have something?
>
>
> Character- or akshara-level mappings are quite difficult to create.
> Since I don't know how Tesseract does things, more informed folks can
> give their view.
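On the Tesseract question above: its training pipeline does have a character-level ground-truth format, the plain-text "box" file. Each line gives one glyph followed by its left, bottom, right, and top pixel coordinates (origin at the bottom-left of the page image) and a page number. The glyphs and coordinates below are made up purely for illustration:

```
क 52 714 88 751 0
म 90 714 131 751 0
. 133 714 140 722 0
```

Whether box files cope well with conjuncts and maatraas in Indic scripts is exactly the kind of thing the more informed folks on this list could speak to.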
>
> I would like to point out that IL OCR needs an almost parallel project
> that generates a critical mass of test sets and ground truths, without
> which things will not work. Ideally somebody needs to donate the effort
> to scan an out-of-copyright book and then create the ground truth. It
> is a work of considerable magnitude.
>
>
>> So primarily 2 tasks:
>>
>> - If someone already has scanned pages + the digital version of the
>> same
>> : let us collect all that. Then we can build a small interface to
>> allow
>> anyone to come and align it.
>> - If someone only has the scanned pages, word level alignments can be
>> generated right at the time of typing.
----- Original Message -----
From: Debayan Banerjee
I have uploaded some Kannada pages at https://github.com/mnsrao/Indic-OCR
----- Original Message -----
From: Debayan Banerjee
Sent: Friday, April 22, 2011 2:52 AM
Subject: Re: [Indic-OCR] Re: We need a data set to run tests on for accuracy of our OCR
Sorry, some mistake. Material available at
----- Original Message -----
From: Debayan Banerjee
Sent: Friday, April 22, 2011 3:04 AM
Subject: Re: [Indic-OCR] Re: We need a data set to run tests on for accuracy of our OCR
Text attached, as Picasa does not accept txt files.