18th-century French


Scott M. Sanders

Dec 26, 2019, 2:17:46 PM
to tesseract-ocr

I'm trying to OCR over 2000 PDF copies of Bordeaux's 18th-century newspaper. My goal is to recreate the Bordeaux theater repertoire from 1784 to 1790. This should be easy if I can identify the word "Spectacles" and then find any italicized words that follow it. These words are either the name of a theatrical work or the name of an artist.
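
The first half of that plan (locating "Spectacles" and collecting what follows) can be sketched on plain OCR text. This is a minimal sketch, not the workflow from the attached notebook: the function name `text_after_spectacles` and the `max_chars` window are hypothetical, and italic detection is not covered, so the result only narrows down where titles and artist names are likely to be.

```python
import re

# Matches "Spectacle" or "Spectacles" in any letter case.
SPECTACLES = re.compile(r"spectacles?", re.IGNORECASE)


def text_after_spectacles(page_text, max_chars=300):
    """Return the text that follows each occurrence of "Spectacles"."""
    # Replace the long s so later searches see modern spellings.
    normalized = page_text.replace("ſ", "s")
    snippets = []
    for match in SPECTACLES.finditer(normalized):
        following = normalized[match.end():match.end() + max_chars]
        # Collapse line breaks, then drop punctuation left over from the heading.
        snippets.append(" ".join(following.split()).lstrip(".: "))
    return snippets
```

The window size is a guess; a real page may need the snippet to stop at the next heading instead of after a fixed number of characters.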

I've set up a workflow in Jupyter Notebook that has begun the process. I've attached a copy of the PDF (bp2.pdf) and a copy of my code and output (brd_rep.html).

Here are my trouble spots. I would appreciate any suggestions to the following questions.

1. 18th-century French spelling and type
Are there any better training sets for 18th-century French that handle the long s (ſ) and 18th-century spellings (e.g. the -ois, -oit, -oient verb endings)?

2. Retaining formatting
I'm using pytesseract to OCR a JPEG of the PDF. I haven't found a way to retain style formatting (such as italics) in the OCR text.

3. Missing steps in my workflow
I'm currently using a binarization function to improve the OCR results. To improve them further, I'll also need to put the columns of text into bounding boxes.

4. Processing multiple files
Once I've figured out the first steps, I'll need to set up a workflow that allows me to process multiple PDFs.
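
For point 4, the batch step could look like the sketch below. It is a minimal sketch under stated assumptions: `ocr_folder` and its `ocr_pdf` argument are hypothetical names, and `ocr_pdf` stands for whatever single-file routine comes out of the first three steps (for instance, one that rasterizes the PDF and passes each page image to pytesseract). Only the file handling is shown here.

```python
from pathlib import Path


def ocr_folder(pdf_dir, out_dir, ocr_pdf):
    """Run ocr_pdf on every PDF in pdf_dir and save one .txt file per PDF.

    ocr_pdf is a hypothetical callable: it takes a Path to a PDF and
    returns the recognized text as a string.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_path / (pdf.stem + ".txt")
        if target.exists():
            continue  # already processed, so an interrupted run can resume
        target.write_text(ocr_pdf(pdf), encoding="utf-8")
        written.append(target)
    return written
```

Skipping files that already have output is a design choice for a 2000-file job: a crash partway through does not force a restart from the first PDF.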
bp2.pdf
brd_rep.html

Scott M. Sanders

Dec 26, 2019, 2:21:26 PM
to tesseract-ocr
If you can't see brd_rep.html, here is a PDF version.
brd_rep.pdf

Shree Devi Kumar

Dec 26, 2019, 10:57:21 PM
to tesseract-ocr
Please see the repo tesseract-ocr/tesstrain, specifically the wiki pages on training for Fraktur.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b6cfaad4-44d8-4893-b7d7-eb1847cfacfc%40googlegroups.com.

Scott M. Sanders

Dec 27, 2019, 11:59:12 AM
to tesseract-ocr
I added the following code, which has improved the results. I thought that adding 'alto' would create an xml file with formatting information, but it didn't work. Is there another way to retain formatting information in Tesseract?

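import pytesseract
from PIL import Image
config = "-l fra --oem 1 --psm 1 alto"  # French model, LSTM engine, automatic page segmentation with OSD
text = pytesseract.image_to_string(Image.open('readonly/greyscale_noise.jpg'), config=config)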
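
For context on what an ALTO file holds: it mainly records layout, with each word stored as a String element carrying its text and position. The sketch below assumes an ALTO file has already been produced separately (for example by the tesseract command-line tool, version 4.1 or later, with the alto config). The function name `alto_words` and the sample XML are made up for illustration. STYLE is an optional attribute in the ALTO schema, and it may well be missing from Tesseract's output.

```python
import xml.etree.ElementTree as ET


def alto_words(alto_xml):
    """Return one dict per ALTO String element: text, position, optional style."""
    root = ET.fromstring(alto_xml)
    words = []
    for elem in root.iter():
        # Compare local tag names so any ALTO namespace version is accepted.
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append({
            "text": elem.get("CONTENT", ""),
            "x": int(float(elem.get("HPOS", 0))),
            "y": int(float(elem.get("VPOS", 0))),
            "style": elem.get("STYLE"),  # None when no style is recorded
        })
    return words


# Hand-written sample, not real Tesseract output.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Spectacles" HPOS="120" VPOS="840" WIDTH="210" HEIGHT="38"/>
    <String CONTENT="Tartuffe" HPOS="350" VPOS="842" WIDTH="160" HEIGHT="36" STYLE="italics"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""
```

Running `alto_words` on a real output file and checking whether any entry has a non-empty "style" is a quick way to see if style information survived.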

Shree Devi Kumar

Dec 27, 2019, 12:10:34 PM
to tesseract-ocr
Formatting info is not retained in Tesseract 4. It was available in 3.0x.

Scott M. Sanders

Dec 27, 2019, 12:13:54 PM
to tesseract-ocr
Would you recommend that I use another OCR distribution to retain formatting information? I've been considering Kraken, Calamari, Google Vision API, Amazon Rekognition or OCR4all (which was developed for early print).

