Poor results of Tesseract performing a play card evaluation


Paulus Present

Oct 29, 2023, 4:22:47 PM
to tesseract-ocr
Dear forum members,
I used Tesseract to extract 10 Regions of Interest from a Lorcana play card, but it didn't succeed very well: it could not make out the numbers or the name of the character. I presume this is down to the image preprocessing, as the fonts themselves are nothing special. Could you help me figure out how to get Tesseract to perform better on the PNG? I am attaching one sample card, the Python code used to run Tesseract, the resulting Excel table, and the extracted Region of Interest TIFFs.
I will be happy with any help anyone can provide. Thanks in advance!
Paulus
processed_ariel-whoseit_collector-large.png_(32, 1147, 192, 1172).png
processed_ariel-whoseit_collector-large.png_(70, 75, 111, 122).png
extracted_data.xlsx
processed_ariel-whoseit_collector-large.png_(68, 1117, 264, 1144).png
processed_ariel-whoseit_collector-large.png_(757, 812, 817, 1103).png
processed_ariel-whoseit_collector-large.png_(47, 806, 744, 1101).png
ariel-whoseit_collector-large.png
processed_ariel-whoseit_collector-large.png_(45, 705, 319, 745).png
processed_ariel-whoseit_collector-large.png_(44, 652, 266, 705).png
processed_ariel-whoseit_collector-large.png_(738, 670, 790, 727).png
processed_ariel-whoseit_collector-large.png_(245, 760, 607, 796).png
processed_ariel-whoseit_collector-large.png_(638, 673, 692, 726).png
GetCardDataFromImages.py

Art Rhyno

Oct 29, 2023, 9:58:56 PM
to tesser...@googlegroups.com

Maybe use a different segmentation mode? Try changing the line:

 

text = pytesseract.image_to_string(cropped_image, lang='eng').strip()

 

to:

 

text = pytesseract.image_to_string(cropped_image, lang='eng', config='--psm 6').strip()

 

That should help.

 

art

 


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9c2e162e-dce2-4a81-8138-5268b4e16423n%40googlegroups.com.

Des Bw

Oct 30, 2023, 4:18:39 AM
to tesseract-ocr
How about processing the images using ScanTailor or some other tool before feeding them to Tesseract?

Paulus Present

Oct 30, 2023, 5:58:50 AM
to tesseract-ocr
Hi Art,
Your suggestion already yields better results. Thanks very much! The numbers are properly recognized now. The script, however, still struggles on the body text of the card.
It yields:

Roe) Nt SSILU Ta Whenever you play an
item, you may ready this character.
“You want thingamabobs? | got twenty.”

It doesn't seem to deal well with the different background behind the opening keywords in ALL CAPS. However, I cannot easily separate the KEYWORD zone to process it on its own, because it can sit anywhere vertically depending on the total space and layout needed by the text itself. Some cards even have two KEYWORD zones.
It also doesn't seem to recognize the quite elongated 'I' character in the quote at the bottom.
Thanks for any help you or someone else can provide! Much obliged.
Paulus

processed_ariel-whoseit_collector-large.png_(47, 806, 744, 1101).png

Art Rhyno

Oct 30, 2023, 10:14:33 AM
to tesser...@googlegroups.com

Hi Paulus,

Yes, I am not sure why Tesseract struggles with the first all-caps region in that section. The colors are so clean in that image that you might be able to use something like OpenCV to extract regions based on color in addition to location. One other idea is to leverage Tesseract's confidence metrics, which are available in the API and also in the hOCR output. For example, the first word "LOOK" is rendered as:

<span class='ocrx_word' id='word_1_1' title='bbox 9 70 85 101; x_wconf 11'>010]</span>

Tesseract doesn't fare well here, but it does give a low confidence value ("11") and the coordinates of the word ("9 70 85 101"). You could use those to extract the region for the word(s) and run Tesseract on that region on its own.
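As a sketch of that idea: the `bbox` and `x_wconf` values can be pulled out of the hOCR output with a regular expression, and low-confidence words collected for a second, cropped OCR pass. The regex, the `low_confidence_words` helper, and the threshold of 50 are my own assumptions, not part of Tesseract itself:

```python
import re

# Matches one hOCR word span: bounding box, confidence, and recognized text.
WORD_RE = re.compile(
    r"<span class='ocrx_word'[^>]*"
    r"title='bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)'>([^<]*)</span>"
)

def low_confidence_words(hocr, threshold=50):
    """Return (bbox, confidence, text) for words below the confidence threshold."""
    hits = []
    for m in WORD_RE.finditer(hocr):
        x0, y0, x1, y1, conf = (int(g) for g in m.groups()[:5])
        if conf < threshold:
            hits.append(((x0, y0, x1, y1), conf, m.group(6)))
    return hits

sample = ("<span class='ocrx_word' id='word_1_1' "
          "title='bbox 9 70 85 101; x_wconf 11'>010]</span>")
print(low_confidence_words(sample))  # [((9, 70, 85, 101), 11, '010]')]
```

Each returned bbox could then be cropped from the original image and re-OCRed with a tighter `--psm` setting.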


Paulus Present

Nov 20, 2023, 5:03:15 AM
to tesseract-ocr
Hi Art

Thanks so much for the help, and also for the mail with code you sent! My reply is late, but that is because I do this after hours as a hobby and had some things to figure out. I tested the hOCR method you suggested, and it indeed produced notably better results than the out-of-the-box multiline OCR analysis. The results were still far from perfect, though, and I wanted to do as little post-editing as possible. Since the card instructions follow a strict syntax, it is important that really every character is transcribed correctly, down to the last accent. I imagine it is no different at your end, where you do this professionally on historical scans. How cool that you wanted to help me and that there is technical overlap between such diverse topics (historical documents vs. play cards :D)!

Because hOCR still left me lots of post-editing that I wanted to avoid as much as possible (Ravensburger launches 204 cards every 3 months - and this is just a hobby :)), I looked further for other solutions and wanted to share what I found, as you did for me. Maybe you will have some use for it, and I would be more than happy to hear about it. For the last year I have been looking quite heavily into the capabilities of GPT-4, and as of last Monday GPT-4 Vision (which is still in preview) offers image input capabilities. Under the hood I suppose GPT-4 was linked with DALL-E. However that may be, I decided to test GPT-4 Vision on the body text, and it yielded incredibly good results. With its out-of-the-box image recognition and language reading capabilities, it is able to transcribe the content of the card's body text really accurately. Because I can only call vanilla GPT-4 Vision from the API, I had to provide a very extensive prompt telling it what to do with the special symbols and layout elements encountered in the image. See my custom prompt below:

"I present you an image with the body text of a Lorcana play card. "
"Please transcribe the text from the image following the below instructions: "
"- You should ignore all non-textual information apart from numbers, punctuation marks and the special graphical symbols mentioned by me below. "
"- Punctuation marks like commas, hyphens or dashes can be positioned close to graphical symbols. Punctuation marks are never part of a graphical symbol and must always be transcribed. Make sure not to miss a single punctuation mark. "
"- You should transcribe all numbers you find and keep them in their number format. "
"- Each special graphical symbol (hexagon, diamond, sunburst, ...) that you encounter has to be transcribed as '{s}'. "
"- If there is a black rectangular background of a text, this can never be considered as a symbol. Ignore it. Only keep the white text you find therein. "
"- Symbols can never be at the start of a text line. If you think to see a symbol there, ignore it."
"- Symbols are to be used singularly and not in sequence. "
"- It could be there is an artistic horizontal divider line in the image. Don't consider it as a symbol even if it has intricate linework. "
"- All text under the artistic horizontal divider line is a flavor text and has to be prefixed by '/FlavorText: ' in the transcription. "
"- '/FlavorText: ' will only be written once. Don't repeat '/FlavorText: ' even if there is a new textline in the flavor text. "
"- In the FlavorText every punctuation mark must be transcribed as well. Don't forget any comma, point or other punctuation mark."
"- Provide the transcription clearly, with no repetitions, formatting or explanation. "
"- Don't use your inbuilt Python functionality. "
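For reference, a minimal sketch of how a prompt like the one above can be packaged into a Chat Completions request body for GPT-4 Vision. The payload shape follows the OpenAI vision documentation at the time of writing, but the helper name and the `max_tokens` value are illustrative choices of mine:

```python
import base64

def build_vision_payload(image_bytes, prompt, model="gpt-4-vision-preview"):
    """Build a Chat Completions request body with one text part and one
    base64-encoded image part (data-URL form)."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": 500,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + image_b64}},
            ],
        }],
    }
```

The resulting dict can then be POSTed to https://api.openai.com/v1/chat/completions with an `Authorization: Bearer <your key>` header.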

Now, I suspect that in future development of the OpenAI API and their models it will become possible to query a custom GPT which could be pretrained manually, simply by conversing with it about a sample text you provide and telling it what it transcribed wrong and how to correct it. I already tested this in the browser interface, and my finding is that if you present GPT-4 with just one image and ask it to transcribe it, it does a fine job. You then proceed by pointing out its errors and asking it to correct them, thereby helping GPT-4 extend its knowledge base within that conversation. You then feed it a second text and repeat the process. After a certain number of images it will transcribe perfectly on the first attempt, simply because it has acquired the skill to read the material based on your specific steering and instructions. Once you can query such a model through the API, your prompt would only need to be an image, and it would know what to do and return you the transcribed text perfectly. For now I still need the long prompt ;).

I thought this info might be useful to you, but I can imagine you are well aware of it already, seeing as you are in this field.

I have attached my code and the files you need to run it. All paths in the code need to be changed, of course, to the locations where you put the source files.
You also need your own OpenAI API key, which you can get here: https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key (it is a paid service, so you need to add a minimum of $5 to your account).

Some additional info:
- The script uses a combination of OCR, classical image recognition and GPT-4 Vision to obtain the different data points. Where OCR or image recognition sufficed I used those, because a deterministic procedure seems preferable when it is sufficient.
- The special symbols in the text are found with classical image recognition and put in a dictionary keyed by their location in the body text. I then use this dict to replace all symbol placeholders in the GPT-4 transcription with the actual symbols from the image-recognition dict. This is overly complex, but it was the only way to get the accuracy I wanted while prompting only vanilla GPT-4 Vision. Once you can query your own custom-trained GPT-4 Vision, the replacement step will no longer be necessary, since with training it can learn to recognize the symbols itself. I have tested this in the browser interface and that is the case: when I correct a symbol transcription once, it remembers it for future transcriptions in the same chat.
- The script part for GPT-4 Vision already implements a batching method, but as batching is not yet allowed on the OpenAI API side, the batch size is set to '1'. However, in the browser interface you can upload up to 10 images in one prompt, so I suspect this will become available via the API sometime in the future. It then suffices to increase BATCH_size on line 489 to start using this option.
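The placeholder-replacement step described in the second bullet can be sketched like this; the '{s}' marker matches the prompt above, but the coordinate-keyed dictionary, the example symbol names, and the reading-order sort are my assumptions about one way to line the two result sets up:

```python
def replace_placeholders(transcription, symbols_by_position):
    """Replace each '{s}' placeholder, in reading order, with the symbol
    found at the corresponding position by classical image recognition.

    symbols_by_position maps (y, x) body-text coordinates to a symbol name.
    """
    # Sort detections top-to-bottom, then left-to-right, to match text order.
    ordered = [symbols_by_position[k] for k in sorted(symbols_by_position)]
    parts = transcription.split("{s}")
    if len(parts) - 1 != len(ordered):
        raise ValueError("placeholder count does not match detected symbols")
    out = parts[0]
    for sym, tail in zip(ordered, parts[1:]):
        out += sym + tail
    return out

text = "Exert {s} to sing: pay {s} less."
detected = {(10, 120): "{exert}", (34, 80): "{ink}"}
print(replace_placeholders(text, detected))  # Exert {exert} to sing: pay {ink} less.
```

Sorting detections by (y, x) approximates reading order for left-to-right text; a multi-column layout would need a smarter sort.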

So, this was a very long explanation and I am sorry for that. I am, however, very enthusiastic about the results, and you surely helped me along. Thanks again! I am also curious to see how you would use this in your field :)

Let me know what you think! :)

Kind regards
Paulus
GetCardDataFromImagesBatch_Anonimized.py
ImageRecognitionSources.zip
TestImages.zip

Art Rhyno

Nov 20, 2023, 8:21:39 AM
to tesser...@googlegroups.com

Wow, thanks, it will take me a while to parse this but it sounds very promising.

 

art

 

