Tesseract output not correct for Hindi text

lalit joshi

Jun 26, 2024, 2:42:54 AM

to tesseract-ocr

I am trying to build an app that extracts data from PDFs containing electoral roll data for Indian constituencies. I have attached a sample PDF. Below is the code I am running:

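import cv2
import numpy as np
import pdf2image
import pytesseract

# Assumed sharpening kernel; its definition is not shown in the original post.
kernel_sharpening = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])

data = []
# Render page 3 of the PDF at 300 dpi (pdf2image returns RGB PIL images).
current_page = np.array(pdf2image.convert_from_path(
    '/home/spxlpt087/Downloads/New folder/2024-FC-EROLLGEN-S07-49-FinalRoll-Revision2-HIN-61-WI.pdf',
    first_page=3,
    last_page=3,
    dpi=300)[0])
sharpened_image = cv2.filter2D(current_page, -1, kernel_sharpening)
kernel = np.ones((1, 1), np.uint8)  # note: a 1x1 kernel makes the dilation a no-op
img_dilation = cv2.dilate(sharpened_image, kernel, iterations=5)
# The page array is RGB, so convert with COLOR_RGB2GRAY rather than COLOR_BGR2GRAY.
gray_img = cv2.cvtColor(img_dilation, cv2.COLOR_RGB2GRAY)
thr = cv2.threshold(gray_img, 128, 255, cv2.THRESH_BINARY_INV)[1]
# findContours returns 2 values in OpenCV 4.x and 3 values in OpenCV 3.x.
cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
# Keep only large contours (the table cells), sorted top-to-bottom, then left-to-right.
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 50000]
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables], key=lambda r: (r[1], r[0]))
for i_r, (x, y, w, h) in enumerate(rects, start=1):
    # Crop one pixel inside the bounding box to drop the cell border.
    cell = current_page[y+1:y+h-1, x+1:x+w-1]
    text = pytesseract.image_to_string(
        cell,
        config='--oem 3 --psm 11',  # --oem 1 --psm 3
        lang='Devanagari+eng',
        nice=1)
    text = text.replace('\f', '')
    text = text.replace('\n\n', '')
    print(text)
    data.append(text)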

The output I am getting:
SOI0798389iनाम: उघापिता का नामः जगदीश चंद्र~मकान नं. : 001आयु : 33 लिंग : महिला $011000827[|नाम : सुरेश कुमारपिता का नाम: राजबीर~मकान नं, : 01आयु : 21 लिंग : पुरुष MXMI5749203नाम : अशोक कुमारपिता का नाम: रुपचन्दफोटो उपलब्धमकान नं, : 79आयु : 39 लिंग : पुरुष $011142009नाम : प्रकाश कौरपति का नामः राम लुभायामकान नं. : 10~आयु : 49 लिंग : महिला [ 3]$011145184नाम : मोनिकापति का नामः धर्मवीरमकान नं. : 11फोटो उपलब्धआयु : 27 लिंग : महिला $011146356नाम : अरसियापिता का नाम: राजकुमारमकान नं. : 11फोटो उपलब्ध हैआयु : 18 लिंग : महिल SOI07983637नाम : सुनीलपिता का नामः रामदिया~मकान न॑. : 17आयु : 24 लिंग : पुरुष $011146208[9नाम : सुमनपति का नामः सुभाषफोटो उपलब्धमकान न॑. : 18आयु : 31 लिंग : महिला $011146133|नाम : सुभाषपिता का नाम: राज कुमार~मकान न॑. : 18आयु : 33 लिंग : पुरुष | 10] 0$011141548नाम : वीरेंद्रपिता का नाम: रामफलमकान नं. : 19~आयु : 28 लिंग : पुरुष [dl 1$011146257नाम : रोहितपिता का नाम: दीपकमकान नं. : 19फोटो उपलब्धंआयु : 20 लिंग : पुरुष | 12$010958629नाम : सुमनपति का नाम: रामदियामकान नं. : 34फोटो उपलब्धआयु : 33 लिंग : महिला SO1092028013 |नाम : वंशीकापिता का नामः जितेंद्र~मकान न॑. : 35आयु : 22 लिंग : महिला $011145994नाम : वीरेन्द्रपिता का नाम: लक्ष्मण दास~मकान न॑. : 37आयु : 29 लिंग : पुरुष | 15$011141563नाम : सोनी कुमारीपिता का नाम: लक्ष्मणमकान न॑. : 37फोटो उपलब्धआयु : 21 लिंग : महिला $011143296[ 1a]नाम : सवितापति का नामः जयबीरमकान नं. : 45~आयु : 39 लिंग : महिला | 7]$011152537नाम : ज्योतिपति का नामः प्रवीनमकान नं. : 45~आयु : 27 लिंग : महिला $011164177a)नाम : संजयपिता का नाम: रामदियामकान नं. : 50फोटो उपलब्धआयु : 22 लिंग : पुरुष 19 |$011152560नाम: मीनूपति का नामः सोनूमकान नं. : 52~आयु : 24 लिंग : महिला $010562264नाम : सूरजमलपिता का नाम: राम सरूपमकान नं. : 54~“आयु : 69 लिंग : पुरुष MXM1751585[ 21नाम : कवितापिता का नाम: प्रवीन कुमारमकान नं. : 54/1फोटो उपलब्धआयु : 37 लिंग : महिला MXM1010024[ 2]नाम: रामापिता का नामः मेहर चन्दमकान नं. : 55~आयु : 67 लिंग : पुरुष MXM1009422| 23]नाम : मुनीशपिता का नामः रामामकान नं. 
: 55~आयु : 40 लिंग : पुरुष Ll$010959718नाम : परवीनपति का नामः सतीश कुमारमकान नं, : 55फोटो उपलब्धआयु : 32 लिंग : महिला 25]S010998013नाम : सतीश कुमारपिता का नाम: राममकान नं. : 55~आयु : 31 लिंग : पुरुष SOI0998054[ 2२]नाम : कुसुम देवीपति का नामः मुनीशमकान नं. : 55फोटो उपलब्धआयु : 24 लिंग : महिला ——ˆˆ#227$011240217नाम : राजकुमारपिता का नाम: च॑दगीमकान नं. : 56फोटो उपलब्धआयु : 73 लिंग : पुरुष HR/09/71/004262528]नाम : जीवनीपति का नामः राज कुमारमकान नं. : 56~आयु : 69 लिंग : महिला HR/09/71/0042599[ 2] 29 |नाम: ईश्वरपिता का नामः मांगे राम~मकान नं. : 56आयु : 65 लिंग : पुरुष | 30HR/09/71/0042600नाम : किताबोपति का नामः ईश्वरमकान नं. : 56फोटो उपलब्धआयु : 64 लिंग : महिला
So, I am able to extract the following data from this text:
SOI0798389 नाम: उघा आयु : 33 लिंग : महिला SO000827 नाम : सुरेश कुमार आयु : 21 लिंग : पुरुष MXM1574920 नाम : अशोक कुमार आयु : 39 लिंग : पुरुष OoN142009 नाम : प्रकाश कौर आयु : 49 लिंग : महिला SOo45184 नाम : मोनिका आयु : 27 लिंग : महिला OIN46356 नाम : अरस्रिया आयु : 18 लिंग : महिला SOI0798363 नाम : सुनील आयु : 24 लिंग : पुरुष OIN46208 नाम : सुमन आयु : 31 लिंग : महिला SO46133 नाम: सुभाष आयु : 33 लिंग : पुरुष SON41548 नाम : वीरेंद्र आयु : 28 लिंग : पुरुष SO46257 नाम : रोहित आयु : 20 लिंग : पुरुष SOI0958629 नाम : सुमन आयु : 33 लिंग : महिला SOI0 नाम : वंशीका आयु : 22 लिंग : महिला SON41563 नाम : सोनी कुमारी आयु : 21 लिंग : महिला SO143296 नाम : सविता आयु : 39 लिंग : महिला SOI1152537 नाम : ज्योति आयु : 27 लिंग : महिला SON64177 नाम : संजय आयु : 22 लिंग : पुरुष INS2560 नाम: मीनू आयु : 24 लिंग : महिला SO0562264 नाम : सूरजमल आयु : 69 लिंग : पुरुष MXM1751585 नाम : कविता आयु : 37 लिंग : महिला MXM1010024 नाम : रामा आयु : 67 लिंग : पुरुष MXM1009422 नाम : मुनीश आयु : 40 लिंग : पुरुष SOI0959718 नाम : परवीन आयु : 32 लिंग : महिला SOI0998013 नाम : सतीश कुमार आयु : 31 लिंग : पुरुष SOI0998054 नाम : कुसुम देवी आयु : 24 लिंग : महिला SOI240217 नाम: राजकुमार आयु : 73 लिंग : पुरुष HR/ नाम : जीवनी आयु : 69 लिंग : महिला HR/ नाम: ईश्वर आयु : 65 लिंग : पुरुष HR/ नाम : किताबो आयु : 64 लिंग : महिला

However, I am having a problem with the voter number, as Tesseract sometimes reads a character as %. Also, I have 10,000+ PDFs in the same format. Can anyone suggest how I can speed up this process? It currently takes approximately 10 minutes per PDF.

Thanks!

2024-FC-EROLLGEN-S07-49-FinalRoll-Revision2-HIN-61-WI.pdf

Ger Hobbelt

Jun 26, 2024, 4:21:48 PM

to tesser...@googlegroups.com

If you want more speed, give tesseract less to work on. Your scenario sounds like you have a large number of PDFs, all containing the same (scanned) form. From the look of this sample, page alignment etc. has already been taken care of, so we can assume that all those forms (scans) would look the same if we stacked them on top of one another, i.e. the data you are looking for is found at predetermined, fixed rectangle coordinates within the page.
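If the layout really is fixed, the field rectangles can be computed once instead of being detected with contours on every page. A minimal sketch, assuming a regular grid of voter boxes; the function name `grid_rects` and all the numbers below (origin, cell size, 3 columns by 10 rows) are hypothetical and would have to be measured on a real page:

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates."""
    x: int
    y: int
    w: int
    h: int


def grid_rects(origin_x: int, origin_y: int, cell_w: int, cell_h: int,
               cols: int, rows: int, gap_x: int = 0, gap_y: int = 0) -> List[Rect]:
    """Return the cell rectangles of a regular grid, row by row, left to right."""
    rects = []
    for row in range(rows):
        for col in range(cols):
            x = origin_x + col * (cell_w + gap_x)
            y = origin_y + row * (cell_h + gap_y)
            rects.append(Rect(x, y, cell_w, cell_h))
    return rects


# Hypothetical layout values, for illustration only.
rects = grid_rects(origin_x=100, origin_y=200, cell_w=750, cell_h=300, cols=3, rows=10)
```

The same list of rectangles can then be reused for every page of every PDF, as long as the scans stay aligned.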

Create a mask that erases everything else to white, so only the fields of interest remain, and feed that to tesseract. Output TSV or hOCR to get coordinates alongside the OCRed text, and you can easily reconstruct the fields' content. At least that's the assumption here and now.
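Reconstructing the fields from TSV output could look roughly like this. It is only a sketch: `words_by_field` and the field names are made up for illustration, and it assumes the standard tesseract TSV columns (`level`, `left`, `top`, `width`, `height`, `text`, ...), where level 5 rows are individual words. Each word is assigned to the field rectangle that contains its center.

```python
import csv
import io
from collections import defaultdict


def words_by_field(tsv_text, fields):
    """Group OCRed words by the field rectangle that contains their center.

    tsv_text: tesseract TSV output as one string (header line included).
    fields:   dict mapping a field name to an (x, y, w, h) rectangle.
    Returns a dict mapping field name to the words joined in TSV order.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    hits = defaultdict(list)
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; other levels describe pages, blocks, lines.
        if row.get("level") != "5" or not text:
            continue
        cx = int(row["left"]) + int(row["width"]) / 2
        cy = int(row["top"]) + int(row["height"]) / 2
        for name, (x, y, w, h) in fields.items():
            if x <= cx < x + w and y <= cy < y + h:
                hits[name].append(text)
                break
    return {name: " ".join(words) for name, words in hits.items()}
```

With pytesseract, `pytesseract.image_to_data(image, lang=..., config=...)` returns this TSV text by default, so its result can be passed straight in as `tsv_text`.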

The key is: image preprocessing

In your case, there's a lot that can be done in that preprocessing stage so that tesseract has only a few text areas to process in an otherwise white page.
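The masking step could be sketched as follows. This is an illustration, not a tuned pipeline: `keep_only_fields` is a hypothetical helper, the page is assumed to be a `uint8` NumPy array (grayscale or color), and the rectangles are assumed to be `(x, y, w, h)` tuples that you have measured yourself.

```python
import numpy as np


def keep_only_fields(page, rects, pad=0):
    """Return a copy of `page` that is white everywhere except inside `rects`.

    page:  uint8 image array of shape (H, W) or (H, W, C).
    rects: iterable of (x, y, w, h) rectangles in pixel coordinates.
    pad:   optional margin in pixels kept around each rectangle.
    """
    masked = np.full_like(page, 255)  # start from an all-white page
    height, width = page.shape[:2]
    for x, y, w, h in rects:
        # Clamp the padded rectangle to the image bounds.
        x0, y0 = max(x - pad, 0), max(y - pad, 0)
        x1, y1 = min(x + w + pad, width), min(y + h + pad, height)
        # Copy the original pixels back for the field of interest.
        masked[y0:y1, x0:x1] = page[y0:y1, x0:x1]
    return masked
```

The masked page can then go to tesseract in a single call instead of one call per cell, which is where most of the saving would be expected to come from.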

Reference material: read it all, as a lot depends on context, and you are the one who can determine whether each item is applicable or may have an effect in your particular scenario.

- https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html

- image scaling can have a significant impact; see https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ

- https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ (process flow)

Best regards,

Ger Hobbelt
