Training the LSTM language model explicitly in an unsupervised manner


Rahul Tyagi

Oct 15, 2018, 7:08:16 AM
to tesseract-ocr
Hi,

I am trying to run tesseract-ocr on invoices to detect user IDs, invoice numbers, tax codes, etc. I don't think Tesseract has been trained on this kind of data, so I need to fine-tune the network on my own data. However, it will be difficult for me to obtain the labelled data needed to fine-tune Tesseract as described on the training-tesseract wiki page. So I wanted to know whether it is possible to tune only the language model of tesseract-ocr in an unsupervised way, just like the language models trained for English language understanding, i.e. showing the language model only the PINs and IDs by feeding the output generated at the previous timestep (t-1) as input to the current timestep (t).

Soumik Ranjan Dasgupta

Oct 15, 2018, 8:08:03 AM
to tesser...@googlegroups.com
No, Tesseract cannot be trained in an unsupervised manner; it needs ground-truth labels both to train from scratch and to fine-tune. Please provide a sample image to test, if possible.
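For context on what "ground-truth labels" means here: the common tesstrain fine-tuning workflow pairs each line image (`<base>.tif` or `<base>.png`) with a plain-text transcription file `<base>.gt.txt` containing exactly the text in that image. A minimal sketch of preparing such transcriptions for synthetic ID-like strings follows; the `INV-…` pattern and file names are hypothetical placeholders, not anything from this thread, and the matching images would still have to be rendered from the same strings (e.g. with Tesseract's text2image tool) so that image and label agree.

```python
import pathlib
import random

# HYPOTHETICAL ID format for illustration only -- substitute the real
# invoice-number / tax-code patterns from your own documents.
def fake_invoice_number():
    return "INV-%d-%05d" % (random.randint(2000, 2018), random.randint(0, 99999))

random.seed(0)
outdir = pathlib.Path("ground-truth")
outdir.mkdir(exist_ok=True)

for i in range(100):
    # tesstrain expects <base>.gt.txt next to the line image <base>.tif/.png;
    # here we only write the transcriptions.
    path = outdir / f"line_{i:04d}.gt.txt"
    path.write_text(fake_invoice_number() + "\n", encoding="utf-8")
```

Each generated `.gt.txt` file would then serve as the supervised label for its rendered line image during lstmtraining-based fine-tuning.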



--
Regards,
Soumik Ranjan Dasgupta

Rahul Tyagi

Oct 15, 2018, 9:16:38 AM
to tesseract-ocr

[Attached image: 1_7wBhusJmIwkiwV-J3LJ7lw.png]

I am not trying to train the whole model in an unsupervised way. I only want to train the language model that acts as the final layer of Tesseract and generates the variable-length output sequence; this would serve as a pre-training step. Just like other language models, as can be seen in the attached image, the output of the previous timestep is fed as input to the next timestep. In the same way, I could feed in my own sequences so that the network gains some additional information about their structure, and afterwards it could be tuned in a supervised manner on labelled data.
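To make the proposed objective concrete: what is described above is ordinary autoregressive next-character prediction, where the target at each timestep is the character that follows the current input. The sketch below is NOT Tesseract's API or its internal LSTM (Tesseract exposes no such unsupervised entry point, per the reply above); it is a toy numpy illustration of the idea, using a vanilla RNN instead of an LSTM for brevity and a made-up corpus of ID-like strings.

```python
import numpy as np

# Made-up ID-like corpus; in practice this would be your real invoice numbers.
corpus = "INV-2018-00123\nTAX-GB-990\nINV-2018-00456\n" * 50
chars = sorted(set(corpus))
V = len(chars)                          # vocabulary size
ix = {c: i for i, c in enumerate(chars)}
data = [ix[c] for c in corpus]

H, lr, seq_len = 32, 0.1, 16            # hidden size, learning rate, BPTT length
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (H, V))       # input -> hidden
Whh = rng.normal(0, 0.01, (H, H))       # hidden -> hidden (recurrence)
Why = rng.normal(0, 0.01, (V, H))       # hidden -> output logits
bh, by = np.zeros(H), np.zeros(V)
params = [Wxh, Whh, Why, bh, by]
mem = [np.zeros_like(p) for p in params]  # Adagrad accumulators

def step(inputs, targets, hprev):
    """Forward + backward (BPTT) over one chunk; loss, grads, last hidden."""
    xs, hs, ps, loss = {}, {-1: np.copy(hprev)}, {}, 0.0
    for t, (i, tgt) in enumerate(zip(inputs, targets)):
        xs[t] = np.zeros(V); xs[t][i] = 1.0           # one-hot input char
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        y = Why @ hs[t] + by
        e = np.exp(y - y.max())
        ps[t] = e / e.sum()                           # softmax over next char
        loss -= np.log(ps[t][tgt])                    # cross-entropy
    grads = [np.zeros_like(p) for p in params]
    dWxh, dWhh, dWhy, dbh, dby = grads
    dhnext = np.zeros(H)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1.0      # dL/dlogits
        dWhy += np.outer(dy, hs[t]); dby += dy
        draw = (1 - hs[t] ** 2) * (Why.T @ dy + dhnext)
        dWxh += np.outer(draw, xs[t]); dWhh += np.outer(draw, hs[t - 1])
        dbh += draw; dhnext = Whh.T @ draw
    for g in grads:
        np.clip(g, -5, 5, out=g)                      # mitigate exploding grads
    return loss, grads, hs[len(inputs) - 1]

losses, hprev, p = [], np.zeros(H), 0
for it in range(300):
    if p + seq_len + 1 > len(data):                   # wrap around the corpus
        p, hprev = 0, np.zeros(H)
    loss, grads, hprev = step(data[p:p + seq_len],
                              data[p + 1:p + seq_len + 1], hprev)
    for param, g, m in zip(params, grads, mem):       # Adagrad update
        m += g * g
        param -= lr * g / np.sqrt(m + 1e-8)
    losses.append(loss)
    p += seq_len
```

On this repetitive toy corpus the per-chunk loss drops quickly, which is the sense in which the model "learns the structure" of the ID sequences without per-image labels; the open question in this thread is that Tesseract's training tools provide no hook to pre-train only this component in isolation.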