I'm trying to OCR more than 2,000 PDF copies of Bordeaux's 18th-century newspaper. My goal is to reconstruct the Bordeaux theater repertoire from 1784 to 1790. This should be straightforward if I can locate the word "Spectacles" and then pick out any italicized words that follow it. Those words are either the title of a theatrical work or the name of an artist.
I've set up a workflow in a Jupyter Notebook that makes a start on this. I've attached a sample PDF (bp2.pdf) and a copy of my code and output (bord_prj.html).
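To make the search step concrete, here is a minimal sketch of the part that works on plain OCR text. It only finds the heading and returns what follows; it cannot see italics, which is question 2 below. The function name and the sample string are made up for illustration, and I'm assuming the heading comes out of the OCR as "Spectacles" in some capitalization.

```python
import re

# Assumption: the heading is recognized as "Spectacles", in any capitalization.
HEADING = re.compile(r"spectacles", re.IGNORECASE)

def text_after_spectacles(page_text):
    """Return the text that follows the first "Spectacles" heading, or None."""
    match = HEADING.search(page_text)
    if match is None:
        return None
    # Drop the punctuation and line break that trail the heading itself.
    return page_text[match.end():].lstrip(" .:\n")

# Invented sample text, not taken from the newspaper.
sample = "AVIS DIVERS.\nSPECTACLES.\nLe Barbier de Séville, comédie."
print(text_after_spectacles(sample))
```

Where the section ends is still an open point: this version simply returns everything up to the end of the page.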
Here are my trouble spots. I would appreciate any suggestions on the following questions.
1. 18th-century French spelling and type
Are there better training sets for 18th-century French that handle the long s (ſ) and 18th-century spellings (e.g. the -ois, -oit, -oient verb endings)?
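As a fallback while I look for a better model, I could normalize the text after OCR. This is only a sketch under the assumption that the OCR output actually contains the long s character; the function names are my own. I only flag the old verb endings rather than rewriting them, because a blind rewrite would also hit ordinary words such as "trois" and "fois".

```python
import re

# Map the long s and the long-s-t ligature to modern letters.
LONG_S_TABLE = str.maketrans({"ſ": "s", "ﬅ": "st"})

# Words ending in -ois, -oit or -oient: candidates only, not certain verb forms.
OLD_ENDING = re.compile(r"\b\w+(?:ois|oit|oient)\b")

def normalize_long_s(text):
    """Replace long-s characters with a modern s."""
    return text.translate(LONG_S_TABLE)

def find_old_endings(text):
    """List words that might carry an 18th-century verb ending."""
    return OLD_ENDING.findall(text)

line = normalize_long_s("il paroiſſoit trois fois")
print(line)
print(find_old_endings(line))
```

The flagged list would still need a dictionary lookup or a manual pass to separate real verb forms from false hits.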
2. Retaining formatting
I'm using pytesseract to OCR a JPEG of the PDF. I haven't found a way to retain style formatting (italics in particular) in the OCR text.
3. Missing steps in my workflow
I'm currently using a binarization function to improve the OCR results. To improve them further, I'll also need to put the columns of text into bounding boxes.
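One approach I'm considering for the columns is a vertical projection profile: count the dark pixels in each pixel column of the binarized page and treat wide empty runs as the gutters between text columns. The sketch below is not what my notebook does today. The function name and the `min_gap` parameter are hypothetical, and it assumes the counts have already been computed (with a NumPy array where text is 0, something like `(binary == 0).sum(axis=0)`).

```python
def find_column_spans(ink_per_x, min_gap=3):
    """Return (start, end) x-ranges of text columns, with end exclusive.

    ink_per_x: dark-pixel count for each pixel column of the binarized page.
    min_gap: hypothetical minimum gutter width in pixels.
    """
    spans = []
    start = None  # x where the current text run began
    gap = 0       # length of the current run of empty pixel columns
    for x, ink in enumerate(ink_per_x):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The run ended where the gap began.
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        # Close a run that reaches the right edge, minus any trailing gap.
        spans.append((start, len(ink_per_x) - gap))
    return spans

# Toy profile: two text columns separated by a four-pixel gutter.
print(find_column_spans([0, 0, 5, 6, 7, 0, 0, 0, 0, 4, 4, 0]))
```

Each span could then be used to crop the page before OCR. On real scans, specks and rules between columns would probably require a threshold higher than zero.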
4. Processing multiple files
Once I've figured out the first steps, I'll need to set up a workflow that lets me process multiple PDFs.
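For the batch step, this is roughly the loop I have in mind, as a sketch only. `process_folder` and the `ocr_pdf` argument are placeholder names: `ocr_pdf` stands for whatever function ends up turning one PDF path into a string of OCR text. With over 2,000 files I'd want the run to be resumable and to keep going when one file fails.

```python
from pathlib import Path

def process_folder(pdf_dir, out_dir, ocr_pdf):
    """Run ocr_pdf on every PDF in pdf_dir and write one .txt per PDF.

    Returns a list of (filename, error message) for the files that failed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / (pdf_path.stem + ".txt")
        if target.exists():
            continue  # already processed, so an interrupted run can resume
        try:
            target.write_text(ocr_pdf(pdf_path), encoding="utf-8")
        except Exception as err:
            # Record the failure and move on to the next file.
            failed.append((pdf_path.name, str(err)))
    return failed
```

Calling it would look like `failed = process_folder("pdfs", "ocr_text", my_ocr_function)`, where the folder names and `my_ocr_function` are again placeholders.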