Hello,
I'm working on a historical newspaper from the interwar period written in three languages: Corsican, French and Italian.
After many trials, Tesseract seems to be the best OCR engine for my needs, but the layout analysis of a newspaper page is complex.
However, through the API of Gallica (French national library), I can access an existing OCR output (poor quality) together with usable ALTO files.
My question is: can I use those ALTO files to make Tesseract follow the same segmentation as the existing Gallica OCR?
I don't know if my question makes sense.
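To make it more concrete, here is a rough sketch of what I have in mind, assuming the ALTO files describe each text region as a TextBlock element with HPOS, VPOS, WIDTH and HEIGHT attributes. The function name alto_text_blocks and the sample XML are made up for illustration, and the coordinates may need rescaling depending on the measurement unit declared in the ALTO file. The idea would be to crop the page image to each box and run Tesseract on each crop as a single block of text (for example with --psm 6), instead of letting it analyse the whole page.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_text_blocks(alto_xml):
    """Return (left, top, right, bottom) boxes for every TextBlock in an ALTO string.

    Hypothetical helper: assumes HPOS/VPOS/WIDTH/HEIGHT are already in pixels.
    """
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextBlock":
            continue
        x = int(float(elem.get("HPOS", 0)))
        y = int(float(elem.get("VPOS", 0)))
        w = int(float(elem.get("WIDTH", 0)))
        h = int(float(elem.get("HEIGHT", 0)))
        boxes.append((x, y, x + w, y + h))
    return boxes


# Made-up sample, not taken from a real Gallica file.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="400"/>
    <TextBlock ID="b2" HPOS="450.0" VPOS="200" WIDTH="300" HEIGHT="150"/>
  </PrintSpace></Page></Layout>
</alto>"""

boxes = alto_text_blocks(SAMPLE)
for box in boxes:
    # Each box could be used to crop the page image before calling Tesseract.
    print(box)
```

Is something along these lines a reasonable approach, or is there a more direct way to give Tesseract an external segmentation?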
Thanks a lot,
Vincent Sarbach-Pulicani