Questions regarding fine tuning of Tesseract 4.00alpha LSTM

Wang Zhimin

Nov 13, 2017, 4:04:27 AM

to tesseract-ocr

Hi all,

Thank you in advance.

I have some questions about improving accuracy by fine-tuning the LSTM model.

Background:

I want to use Tesseract to recognise DNA/RNA sequences in PDF/TIFF files. However, the accuracy is not great because the images use different fonts and font sizes.

Method:

I understand that I probably have two options:

1. Run Tesseract on the source images to generate box files, correct them manually with jTessBoxEditor, and train a new eng_dna.traineddata file.
2. Start from the current eng "best" LSTM traineddata file and fine-tune the network with a set of sequence texts.
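
To make the "sequence texts" in the second option concrete, here is a minimal Python sketch of how such training text could be generated. Everything in it is an assumption made for illustration: the function names, the 10-letter grouping, the leading position number, and the alternation between DNA and RNA lines are hypothetical and not a format that Tesseract requires. It only produces plain text lines and does not call any Tesseract tool.

```python
import random

# Assumed alphabets for illustration: plain DNA and RNA bases only.
DNA_ALPHABET = "ACGT"
RNA_ALPHABET = "ACGU"


def random_sequence(rng, alphabet, length):
    """Return a random string of `length` characters drawn from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def make_training_lines(n_lines, seed=0, group=10, groups_per_line=5):
    """Build text lines that mix a position number with grouped bases.

    Even-numbered lines use the DNA alphabet, odd-numbered lines the RNA one.
    The layout (number first, then space-separated groups) is a hypothetical
    choice, not something taken from the original post.
    """
    rng = random.Random(seed)  # fixed seed so the text is reproducible
    lines = []
    for i in range(n_lines):
        alphabet = DNA_ALPHABET if i % 2 == 0 else RNA_ALPHABET
        groups = [random_sequence(rng, alphabet, group)
                  for _ in range(groups_per_line)]
        position = 1 + i * group * groups_per_line  # 1, 51, 101, ...
        lines.append(f"{position} " + " ".join(groups))
    return lines


if __name__ == "__main__":
    for line in make_training_lines(4):
        print(line)
```

Ordinary words could be appended to each line in the same way if the real images also contain them. The generated text would then still have to be rendered in the relevant fonts before it could be used for training.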

Questions and concerns:

1. Can I mix different fonts in the training images?
2. Do I need to start from an existing traineddata file? I also want to recognise some normal words and numbers in the DNA/RNA sequence images.
3. I understand that LSTM recognition is line-based. Will it accept training images with mixed fonts together with their box files?
4. Which of the two options is the right one for my problem? I have no experience with training my own model.