image_to_data recognize 'conf' as string

357 views
Skip to first unread message

Brunitus Nishimura

unread,
Sep 17, 2021, 3:27:34 AM9/17/21
to tesseract-ocr
Could anyone explain why tesseract recognizes "conf" as a string ?
  1. import pytesseract
  2. import cv2
  3. from pytesseract import Output
  4.  
  5. pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
  6. img = cv2.imread(r"C:\Users\Documents\Python\OCR\Programa\teste_manuscrito_01.jpg")
  7. rgb = cv2.cvtColor (img, cv2.COLOR_BGR2RGB)
  8.  
  9. resultado = pytesseract.image_to_data(rgb, lang='por', output_type=Output.DICT)
  10.  
  11. print(resultado)
  12.  
  13. {'level': [1, 2, 3, 4, 5, 4, 5, 5],
  14. 'page_num': [1, 1, 1, 1, 1, 1, 1, 1],
  15. 'block_num': [0, 1, 1, 1, 1, 1, 1, 1],
  16. 'par_num': [0, 0, 1, 1, 1, 1, 1, 1],
  17. 'line_num': [0, 0, 0, 1, 1, 2, 2, 2],
  18. 'word_num': [0, 0, 0, 0, 1, 0, 1, 2],
  19. 'left': [0, 38, 38, 38, 38, 102, 102, 307],
  20. 'top': [0, 79, 79, 79, 79, 228, 233, 228],
  21. 'width': [700, 607, 607, 607, 607, 532, 77, 327],
  22. 'height': [400, 236, 236, 92, 92, 87, 76, 87],
  23. 'conf': ['-1', '-1', '-1', '-1', '90.214363', '-1', '77.749153', '61.677670'],
  24. 'text': ['', '', '', '', 'TESTANDO', '', 'O', 'OCR...']}

///////////////////////////////////////////

confianca = int(resultado['conf'] [i])

ValueError: invalid literal for int() with base 10: '90.214363'

Zdenko Podobny

unread,
Sep 17, 2021, 3:32:58 AM9/17/21
to tesser...@googlegroups.com
It is not the tesseract that produces conf as string but pytesseract.

You can not convert float string to int directly in python. Check your python tutorial how to do it correctly.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6339b18a-95a6-4085-abff-7a0e25af4cd4n%40googlegroups.com.

Quang Linh

unread,
Dec 1, 2021, 3:31:35 AM12/1/21
to tesseract-ocr
you can change int() to float() 

Vào lúc 14:32:58 UTC+7 ngày Thứ Sáu, 17 tháng 9, 2021, zdenop đã viết:

Mohammad Waqas Shoukat Ali

unread,
Dec 2, 2021, 12:14:33 AM12/2/21
to tesser...@googlegroups.com
Can someone guide me how to enable Regix inside tesseract output to capture certain keywords ? What is best approach to catch certain keywords and its corresponding values, Any advise or help really appreciated. 


Regards,

Reply all
Reply to author
Forward
0 new messages