I was struggling just like you, until I found this github repository: https://github.com/OCR-D/ocrd-train
It will make your life super easy. All you need to do is to put your images in tif format and your text should have the same image name with extension .gt.txt. It will take care of all the rest for you. (you might need to update the Makefile according to your local machine)
Whether to train from scratch or fine-tune depends on your own language, data and the problem you are trying to solve. For me the fine tunining is what I need cause I am happy with the current performance but need to add upon it.
All the useful details you might need can be found in this answer
Thanks to @ShreeShrii for providing support on every matter.
Regards