Text Lines Split Incorrectly


Matt Johnson

Oct 23, 2024, 1:13:05 AM
to tesseract-ocr
I am having an issue with Tesseract splitting text lines incorrectly for the attached file of a metes and bounds legal description.  It returns this:

Sa 2-44-2637

a)

THENCE North 30 deg. 38 min. 53 sec. East, 68.33 feet;

THENCE North 66 deg.
THENCE South 69 deg.

THENCE South 83 deg.
THENCE South 57 deg.
THENCE South 55 deg.
THENCE South 52 deg.

THENCE North 85 deg.
THENCE North 73 deg.
THENCE North 53 deg.
THENCE North 15 deg.
THENCE North 39 deg.
THENCE North 22 deg.

53 min. 07 sec. East, 42.01 feet;
35 min. 09 sec. East, 51.93 feet;
58 min. 20 sec. East, 33.40 feet;
01 min. 02 sec. East, 20.46 feet;
56 min. 01 sec. East, 45.88 feet;
24 min. 02 sec. East, 35.24 feet;
55 min. 25 sec. East, 35.59 feet;
28 min. 31 sec. East, 45.24 feet;
54 min. 24 sec. East, 41.78 feet;
19 min. 11 sec. West, 36.63 feet;
16 min. 38 sec. West, 61.69 feet;
42 min. 12 sec. West, 26.40 feet;

THENCE North 71 deg. 43 min. 49 sec. East, 98.69 feet;
THENCE South 83 deg. 02 min. 24 sec. East, at 171.41 feet past the westerly right-
of-way line of the said C.R.1&P.&F.W.&D.R.R., in all a total distance of 174.72 feet;

THENCE South 45 deg.
THENCE South 50 deg.

THENCE South 03 deg.
THENCE South 20 deg.

THENCE North 88 deg.
THENCE North 85 deg.
THENCE North 65 deg.

THENCE South 44 deg.

THENCE North 81 deg.
THENCE North 84 deg.

THENCE South 36 deg.
THENCE South 78 deg.

03 min. 10 sec. East, 89.19 feet;
09 min. 30 sec. East, 215.24 feet;

30 min. 21 sec, East, 54.82 feet;
02 min. 16 sec. East, 37.70 feet;
19 min. 17 sec. East, 44.39 feet;
38 min. 35 sec. East, 27.30 feet;
08 min. 14 sec. East, 24.55 feet;

23 min. 51 sec East, 36.03 feet;
44 min. 46 sec. East, 20.02 feet;
26 min. 11 sec. East, 29.33 feet;
39 min. 46 sec. East, 24.03 feet;
58 min. 07 sec. East 39.20 feet;

la)

534-10-0715

THENCE North 86 deg. 05 min. 01 sec. East, at 5.73 feet pass the easterly right-of-way
line of the said C.R.L&P. & F.W. & D.R.R., in all a total distance of 22.72 feet;

THENCE North 65 deg.
THENCE North 28 deg.
THENCE North 38 deg.
THENCE North 55 deg.
THENCE North 78 deg.
THENCE North 50 deg.
THENCE North 26 deg.
THENCE North 19 deg.
THENCE North 14 deg.
THENCE North 42 deg.
THENCE North 12 deg.
THENCE North 39 deg.

48272427.3

03 min. 06 sec. East, 35.22 feet;
51 min. 31 sec. East, 24.61 feet;
48 min. 42 sec. East, 22.80 feet;
36 min. 13 sec. East, 32.36 feet;
50 min. 25 sec. East, 74.30 feet;
08 min. 54 sec. East, 28.09 feet;
41 min. 13 sec. East, 22.71 feet;
28 min. 17 sec. East, 26.11 feet;
06 min. 25 sec. West, 20.13 feet;
38 min. 46 sec. West, 43.42 feet;
00 min. 05 sec. West, 113.03 feet;
02 min. 17 sec. East, 34.22 feet;

Page 7 of 28


Any ideas on how to fix this?
page4.png

Tom Morris

Oct 23, 2024, 10:42:25 AM
to tesseract-ocr
On Wednesday, October 23, 2024 at 1:13:05 AM UTC-4 mattjo...@gmail.com wrote:
I am having an issue with Tesseract splitting text lines incorrectly for the attached file of a metes and bounds legal description.  It returns this:
[...]
Any ideas on how to fix this?

It would be helpful if you included the version you are using, language model, the command line, etc.

The most likely fix is to use a different page segmentation mode on the command line.
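
As a rough sketch of that suggestion (not taken from the thread), the snippet below runs the `tesseract` command line with several page segmentation modes so the outputs can be compared. It assumes the `tesseract` binary is on the PATH and that the attachment is saved locally as `page4.png`. The helper names `build_command` and `run_with_psm` are hypothetical.

```python
import subprocess


def build_command(image_path, psm):
    """Return the argument list for one tesseract run with the given PSM.

    "stdout" is tesseract's special output base that prints the text
    instead of writing it to a file.
    """
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def run_with_psm(image_path, psm):
    """Run tesseract and return the recognized text.

    Requires the tesseract binary to be installed and on the PATH.
    """
    result = subprocess.run(
        build_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # 3 is the default (fully automatic segmentation); 4 and 6 treat the
    # page as a single column or a single uniform block of text.
    for psm in (3, 4, 6):
        print(f"--- PSM {psm} ---")
        print(run_with_psm("page4.png", psm))
```

Comparing the outputs side by side should show which mode keeps each "THENCE ..." line together for this particular scan.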

Tom 

Matt Johnson

Oct 23, 2024, 3:11:18 PM
to tesseract-ocr
You are correct. I was able to resolve this by using either of these two page segmentation modes:
  • PSM 6 (single uniform block of text)
  • PSM 4 (single column of text of variable sizes)

I use Tesseract with Python and ran into this issue with both the pytesseract.image_to_data and pytesseract.image_to_string functions, with version 5.2 of Tesseract.
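
For anyone finding this later, here is a minimal sketch of how the page segmentation mode can be passed through pytesseract's `config` argument. It assumes `pytesseract` and Pillow are installed and that the image is saved as `page4.png`. The helper names (`psm_config`, `ocr_lines`, `ocr_word_data`) are made up for illustration and are not part of pytesseract.

```python
def psm_config(psm):
    """Build the config string pytesseract forwards to the tesseract CLI."""
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def ocr_lines(image_path, psm=6):
    """Return the non-empty text lines recognized with the given PSM."""
    # Imported here so psm_config can be used without the OCR dependencies.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(
        Image.open(image_path), config=psm_config(psm)
    )
    return [line for line in text.splitlines() if line.strip()]


def ocr_word_data(image_path, psm=6):
    """Return word-level results (text, boxes, line numbers) as a dict."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_data(
        Image.open(image_path),
        config=psm_config(psm),
        output_type=pytesseract.Output.DICT,
    )


if __name__ == "__main__":
    for line in ocr_lines("page4.png", psm=6):
        print(line)
```

The same `config` string works for both functions, so switching between PSM 6 and PSM 4 is a one-argument change.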

Thanks

محمود محمد

Dec 10, 2024, 2:53:40 AM
to tesseract-ocr
Hello, this is Mahmoud Abdel Aleem. I have seen your contributions to Tesseract on GitHub and have benefited from them; thank you. I would like guidance on using jTessBoxEditor to train a Tesseract model on 10 images in Arabic and then run that model.

My situation is as follows:

1. I have a set of 10 digital images of book covers in Arabic that I want to convert to text using Tesseract.
2. The existing ara.traineddata model in Tesseract's tessdata folder is inaccurate and does not recognize most of the words.
3. I created a model, ara1.traineddata, using jTessBoxEditor: I created boxes for each image, corrected them on a sample image, generated ara1.traineddata, and put it in the tessdata folder. I then repeated the experiment on the same image the model was trained on, but it did not succeed.

I think there is an error in the steps I am following in jTessBoxEditor. If possible, please let me know the correct steps for training and creating a .traineddata file with jTessBoxEditor, even if only for a custom model covering these 10 images, so that Tesseract can recognize them and convert them to text. An illustrative image of the steps would also help. I would be grateful for your cooperation.