Form recognizer using OCR

Rahul Dochak


Oct 18, 2019, 1:33:09 AM

to tesseract-ocr

Hi All,

I have a task and can see a rough way to approach it, but I do not know how to implement it. This is what I am trying to do:

I want to build a form recognizer and then extract the text from the fields inside each form. The forms are scanned PDFs, and I do not know the forms or their fields beforehand; I only know the form name.

I want to scan the PDF, convert it to text, search for the form name, and check whether I have a predefined template for that form type. If not, I have to somehow get the locations of all the fields (since I do not know which fields that form type requires), make a template for future use with the same form type, and extract the field data to JSON. I could not find a way to make a template on the fly for a new form type. Guidance in the right direction would be helpful.
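To make the lookup step concrete, here is a minimal sketch of the "find the form name, then check for a saved template" part of the workflow. Everything in it is an assumption for illustration: the function names, the one-JSON-file-per-form layout, and the idea of storing each field as a `[left, top, width, height]` box are hypothetical, and it does not solve the harder problem of detecting field locations on an unseen form.

```python
import json
import re
from pathlib import Path
from typing import Dict, List, Optional


def normalise(name: str) -> str:
    """Lower-case a string and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", name).strip().lower()


def find_form_name(ocr_text: str, known_names: List[str]) -> Optional[str]:
    """Return the first known form name found in the OCR text, else None."""
    haystack = normalise(ocr_text)
    for name in known_names:
        if normalise(name) in haystack:
            return name
    return None


def template_path(template_dir: Path, form_name: str) -> Path:
    """Hypothetical layout: one JSON file per form type."""
    return template_dir / (normalise(form_name).replace(" ", "_") + ".json")


def load_template(template_dir: Path, form_name: str) -> Optional[dict]:
    """Load a saved template, or return None if this form type is new."""
    path = template_path(template_dir, form_name)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return None


def save_template(template_dir: Path, form_name: str,
                  fields: Dict[str, List[int]]) -> Path:
    """Store field boxes as {label: [left, top, width, height]} for reuse."""
    template_dir.mkdir(parents=True, exist_ok=True)
    path = template_path(template_dir, form_name)
    payload = {"form_name": form_name, "fields": fields}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

With this layout, a `None` result from `load_template` is the signal that a new template has to be built (by hand or by some field-detection step) and saved before extraction.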

Thanks in advance.

Rahul.

Shree Devi Kumar


Oct 18, 2019, 1:46:54 AM

to tesseract-ocr

You can try uzn files. See https://jsoma.github.io/kull/#/

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6edb4f1a-c44c-4f9c-b929-f3079b223eb6%40googlegroups.com.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Rahul Dochak


Oct 18, 2019, 1:59:56 AM

to tesseract-ocr

Can you elaborate on the process, if that is not too much to ask?

Rahul


Shree Devi Kumar


Oct 18, 2019, 2:07:00 AM

to tesseract-ocr

See https://github.com/jsoma/tesseract-uzn

Basically, uzn files predefine zones on the page, and each of those zones is then recognized.
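As a rough illustration of the zone idea, the sketch below writes a uzn file next to an image and builds a Tesseract command line. It rests on assumptions you should verify against the tesseract-uzn repository: that each uzn line has the form `left top width height label`, that the file must share the image's base name with a `.uzn` extension, and that zones are only honoured when Tesseract runs with `--psm 4`. The helper names (`write_uzn`, `build_command`, `run_ocr`) are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple

# One zone: (left, top, width, height, label), in image pixels.
Zone = Tuple[int, int, int, int, str]


def write_uzn(image_path: Path, zones: List[Zone]) -> Path:
    """Write zones to <image base name>.uzn, one zone per line."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    lines = []
    for left, top, width, height, label in zones:
        # Assumption: labels are single tokens, so spaces become underscores.
        safe_label = label.strip().replace(" ", "_")
        lines.append(f"{left} {top} {width} {height} {safe_label}")
    uzn_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return uzn_path


def build_command(image_path: Path, output_base: Path) -> List[str]:
    """Build the Tesseract call; --psm 4 is assumed to enable uzn zones."""
    return ["tesseract", str(image_path), str(output_base), "--psm", "4"]


def run_ocr(image_path: Path, output_base: Path) -> None:
    """Run Tesseract; requires the tesseract binary to be on PATH."""
    subprocess.run(build_command(image_path, output_base), check=True)
```

For a form template, the zones would come from the stored field boxes for that form type, so `write_uzn` would be called once per scanned page before `run_ocr`.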

Search the forum for past posts.

