The docs are pretty bad so I'm not surprised you didn't find an answer.
We also needed to train against an unusual font, so here's our experience. Your situation might be different.
1. The training data needs to be much, much bigger than 100 lines. We took the ".wordlist" file from the language data directory, added our own words to the top, and used that to generate the ground truth. It's about 50,000 lines.
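A minimal sketch of that first step. The paths, filenames, and example words here are assumptions for illustration, not the exact ones we used:

```python
# Hypothetical sketch: build a training text by prepending our own
# domain-specific words to the language's ".wordlist" file.

def build_training_text(wordlist_path, extra_words, out_path):
    with open(wordlist_path, encoding="utf-8") as f:
        base = [line.strip() for line in f if line.strip()]
    # Our own words go to the top so they are guaranteed to be covered.
    lines = list(extra_words) + base
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)

# Usage (assumed paths):
# n = build_training_text("eng.wordlist", ["ACME", "Widget-3000"], "training_text.txt")
```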
2. Each line should be rendered separately into three files in the ground-truth directory: an image, a .gt.txt file containing the text in question, and a .box file. That's three files for each of the 50,000 lines, 150,000 files in total.
Unfortunately, text2image would not work for our specific font, so we ended up writing our own code to generate the image and box files. It reads the wordlist file line by line, renders an image of the text line, and uses the font info to extract the character boundaries. (Unlike text2image, our program figures out the overall bounding rectangle of the text, adds a margin, and creates the image at exactly the right size. text2image, at least in my experience, often creates a huge image with mostly whitespace around the text.)
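To make the .box side of this concrete, here's a simplified sketch. Our real tool read per-glyph boundaries from the font; to keep this example self-contained I fake them with fixed-width metrics (an assumption -- a proportional font needs the real widths). The one genuinely Tesseract-specific detail is the box format and its bottom-left origin:

```python
# Assumed pixel metrics for a pretend monospace font.
CHAR_W, CHAR_H, MARGIN = 10, 16, 4

def make_box_lines(text, img_height):
    """Emit Tesseract .box lines: 'char left bottom right top page'.
    Box coordinates use a bottom-left origin, so y values must be
    flipped relative to the usual top-left image coordinates."""
    lines = []
    x = MARGIN
    for ch in text:
        if ch != " ":  # spaces get no box entry
            left, right = x, x + CHAR_W
            bottom = img_height - (MARGIN + CHAR_H)  # flip y-axis
            top = img_height - MARGIN
            lines.append(f"{ch} {left} {bottom} {right} {top} 0")
        x += CHAR_W
    return lines

# The image itself is sized tight to the text plus a margin, mirroring
# the approach described above:
#   width  = 2 * MARGIN + CHAR_W * len(text)
#   height = 2 * MARGIN + CHAR_H
```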
3. Use the 50,000 sets of ground-truth files to train the model.
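One way to run that last step is the tesstrain Makefile (github.com/tesseract-ocr/tesstrain), which picks up the image/.gt.txt pairs from the ground-truth directory. The model name, tessdata path, and iteration count below are placeholders, not a recommendation:

```shell
# Sketch, assuming the tesstrain repo is checked out and ground truth
# sits in data/myfont-ground-truth (its default layout).
make training \
  MODEL_NAME=myfont \
  START_MODEL=eng \
  TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
  MAX_ITERATIONS=10000
```

Fine-tuning from an existing model (START_MODEL) converged much faster for us than training from scratch.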
Hope that helps.
Danny