Questions regarding fine tuning of Tesseract 4.00alpha LSTM
Wang Zhimin
Nov 13, 2017, 4:04:27 AM
to tesseract-ocr
Hi all,
Thank you in advance.
I have some questions about improving accuracy by fine-tuning the LSTM model.
Background:
I want to use Tesseract to recognise DNA/RNA sequences in PDF/TIFF files. However, the accuracy is poor because the images contain different font types and sizes.
Method:
As I understand it, I probably have two options:
1. Run Tesseract on the source images to generate box files, correct them manually in jTessBoxEditor, and train a new eng_dna.traineddata file.
2. Take the current eng "best" LSTM traineddata file and fine-tune the network with a set of sequence texts.
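For the second option, I imagine the "sequence texts" could be generated synthetically rather than typed by hand. Below is a rough Python sketch of what I mean. The function name, the line length, the 10-letter grouping, and the alphabets are my own assumptions for illustration, not anything taken from the Tesseract documentation, and real sequence listings may be laid out differently.

```python
import random

# Assumed alphabets; real data may also contain ambiguity codes such as N.
DNA_ALPHABET = "ACGT"
RNA_ALPHABET = "ACGU"


def make_sequence_lines(n_lines, line_len=60, group=10,
                        alphabet=DNA_ALPHABET, seed=0):
    """Return n_lines of random sequence text.

    Each line holds line_len letters drawn from alphabet, split into
    space-separated blocks of `group` letters (a common listing layout).
    A fixed seed keeps the output reproducible.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(n_lines):
        letters = [rng.choice(alphabet) for _ in range(line_len)]
        blocks = ["".join(letters[i:i + group])
                  for i in range(0, line_len, group)]
        lines.append(" ".join(blocks))
    return lines


if __name__ == "__main__":
    # Print a small sample of DNA and RNA lines for inspection.
    for line in make_sequence_lines(2):
        print(line)
    for line in make_sequence_lines(2, alphabet=RNA_ALPHABET, seed=1):
        print(line)
```

The resulting lines could then be saved as a training text and rendered in several fonts, with ordinary words and numbers mixed in so the model does not forget them.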
Questions and concerns:
Can I mix different font types in the training images?
Do I need to start from an existing traineddata file? I also want to recognise normal words and numbers in the DNA/RNA sequence images.
I understand that LSTM recognition is line-based. Will it accept mixed-font training images with boxes?
Which of the two options is the right one for my problem? I have no experience with training my own model.