Training LSTM model on Colab

Ruchika Tyagi

Oct 5, 2021, 5:36:43 AM

to tesseract-ocr

Hello,

I am new to Tesseract and am trying to use it for one of my use cases.

Is there a way to use the already trained models in Colab, and to train them further if required?

I am actually looking for the outputs of the intermediate layers, and maybe to remove the top layer for further processing. However, so far I have not found anything relevant on this.

Can anyone please help?

Thanks

Zdenko Podobny

Oct 5, 2021, 5:42:02 AM

to tesser...@googlegroups.com

Generally: new user + "i want to train tesseract" = fail

If you are asking for help/support, provide information about what you have already tried, some examples of input images, tools you are able/plan to use...

Zdenko

On Tue, Oct 5, 2021 at 11:36, Ruchika Tyagi <ruv...@gmail.com> wrote:

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3620138c-fb82-42ff-8080-1cb85c5119d3n%40googlegroups.com.

Ruchika Tyagi

Oct 5, 2021, 10:51:35 AM

to tesseract-ocr

Hi Zdenko,

Thanks for your feedback!

I have implemented the following things in Colab:

1/ Installed Tesseract OCR and pytesseract.

2/ Used pytesseract.image_to_string to convert an image of a scanned document to text.

The output text looks like this:

sae S\Pewnowet refer Yo We Uniovetha, Bops don't a where MWAH ple Commvadityer gre. Avediarie tee wode Onden OMe wol ' and On Wigs kcale. of Oferakin, nee. es: [rer Bat Chain in Prd Vegelanie “roger | SP in Pst Vegelasie “Wieder | ; AD Me ]8 inc ug Maer Contumneg hom Nes “I —> ty Uae | . Mere ed Serigh Soma)

This does not make sense.

So I was asking whether there are ways to dig deeper into Tesseract's built-in model and understand the output of each layer, and then try some enhancements to decode this better.

But for that, I need to know the model in detail and be able to use it in Colab, and I have not been able to find any relevant material on this. All I could find is model tuning from the command line, and only on Linux machines.

So if there is any such reference, I would request you to share it.

Ruchika

Zdenko Podobny

Oct 5, 2021, 12:27:00 PM

to tesser...@googlegroups.com

First of all:

Unless you share the input image, it does not make sense to share the output.

Next, read the docs. You can start here: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md

If you fail with image preprocessing and document analysis/text detection, training will not help you.
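As a rough illustration of one preprocessing step covered on that page (binarisation), here is a minimal pure-Python sketch of Otsu thresholding on 8-bit grayscale values. The function names `otsu_threshold` and `binarize` are hypothetical helpers for this example, not part of Tesseract or pytesseract, and real scans usually also need rescaling, deskewing and noise removal.

```python
def otsu_threshold(pixels):
    """Return a threshold t (0-255) that maximises between-class variance.

    `pixels` is any iterable of 8-bit grayscale values (ints in 0..255).
    Values <= t form the dark class, values > t form the light class.
    """
    hist = [0] * 256
    total = 0
    for p in pixels:
        hist[p] += 1
        total += 1

    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of intensities in the dark class so far

    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break             # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark) or 255 (light) around the threshold."""
    return [255 if p > threshold else 0 for p in pixels]
```

With Pillow installed, something like `Image.open(path).convert("L").tobytes()` should yield values in the expected form; the binarised result can then be turned back into an image before it is passed to pytesseract.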

If you need to know the model in detail, you will need to read the source code (I am afraid).
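For orientation before diving into the source, the sketch below assembles the two command lines that the Tesseract training tools use for extracting the LSTM component from a traineddata file and for continuing training from it. It only builds the argument lists; it assumes the `combine_tessdata` and `lstmtraining` binaries are installed in the Colab VM, and every file name shown (`eng.traineddata`, `eng.lstm`, `eng.training_files.txt`, `finetuned`) is a placeholder, not something taken from this thread.

```python
import shlex


def extract_lstm_cmd(traineddata, lstm_out):
    """Command that extracts the LSTM model component from a .traineddata file."""
    # `combine_tessdata -e <traineddata> <component file>` extracts one component;
    # the .lstm suffix of the output name selects the LSTM model.
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def finetune_cmd(lstm_in, traineddata, list_file, model_output, max_iterations=400):
    """Command that continues training from an extracted LSTM model."""
    return [
        "lstmtraining",
        "--continue_from", lstm_in,        # extracted .lstm file to start from
        "--traineddata", traineddata,      # traineddata with unicharset etc.
        "--train_listfile", list_file,     # text file listing .lstmf training files
        "--model_output", model_output,    # prefix for checkpoint files
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    for cmd in (
        extract_lstm_cmd("eng.traineddata", "eng.lstm"),
        finetune_cmd("eng.lstm", "eng.traineddata",
                     "eng.training_files.txt", "finetuned"),
    ):
        print(shlex.join(cmd))
```

In a Colab notebook the printed lines can be run in a cell with a leading `!`, or passed to `subprocess.run`. The training documentation also describes `--append_index` together with `--net_spec` for replacing the top layers of an existing network, which is the closest documented route to "removing the top layer"; per-layer outputs are not exposed through these tools as far as I know.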

Zdenko

On Tue, Oct 5, 2021 at 16:51, Ruchika Tyagi <ruv...@gmail.com> wrote:
