Page segmentation finding wrong columns


Daniel Bonniot de Ruisselet

Aug 27, 2014, 2:50:05 PM
to tesser...@googlegroups.com
Hi,

First of all, thanks for this very useful piece of software!

Here's an issue I'm seeing on 3.03 and git HEAD. On the attached image, page segmentation (-psm 3, which is also the default) finds several valid columns but also one invalid one. Going through the output:

10
15
20
25
30
35

EP 2 377 850 A1

This is a good detection of the narrow column on the left and of the top line.

1-(2-(dimethylamino)-4-(trifluoromethyl)benzyl)-3-(2,3-dihydro-2—oxo-1H-benzo[d]imidazo|—4-yl)ure
1-(4-(trif|uoromethyl)-2—(pyrrolidin-1-y|)benzy|)-3-(2,3-dihydro-2—oxo—1H-benzo[d]imidazol-4-yl)urea
[...]

Also good.

1-( -(trif|uoromethyl)-2—(pyrrolidin-1-y|)benzy|)-3-(2,3-dihydro—2—oxobenzo[d]oxazo|—4-y|)urea
1-( -(trif|uoromethyl)-2—(piperidin-1-y|)benzyl)-3-(2,3-dihydro-2—oxobenzo[d]oxazol-4-yl)urea
[...]

Here one character (the 4) is missing from each line.

4
4
[...]

The 4s seem to have been detected as a separate column, which is not desired. It seems to me that no column should be detected here, both because the 4s are close to other characters (there is no column separation) and because this column largely overlaps with the main (widest) one.
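To make the overlap argument concrete, here is a minimal sketch of the kind of check I have in mind. It is not how Tesseract's column finder works; the `Column` type, the function names, and the 0.5 threshold are all hypothetical and only illustrate the idea that a narrow candidate lying mostly inside a wider column is suspicious.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Horizontal extent of a detected column in pixels (hypothetical type)."""
    left: int
    right: int

    @property
    def width(self) -> int:
        return self.right - self.left


def overlap_fraction(candidate: Column, main: Column) -> float:
    """Fraction of the candidate's width that lies inside the main column."""
    if candidate.width <= 0:
        return 0.0
    shared = min(candidate.right, main.right) - max(candidate.left, main.left)
    return max(shared, 0) / candidate.width


def looks_spurious(candidate: Column, main: Column, threshold: float = 0.5) -> bool:
    """Flag a narrower candidate that mostly sits inside the main column.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    return (candidate.width < main.width
            and overlap_fraction(candidate, main) >= threshold)
```

With made-up coordinates, a 20-pixel-wide candidate at x = 140..160 inside a main column spanning x = 100..900 would be flagged, while a candidate entirely to the left of the main column would not.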

Would someone familiar with the code be able to check why this is happening? If pointed in the right direction, I could have a try as well :)
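For anyone who wants to reproduce this, the run described above can be sketched as follows. It assumes the `tesseract` binary is on the PATH and uses the 3.x-style `-psm` flag mentioned in this message (newer releases may expect a different spelling). The helper names and the output base `chem` are only illustrative.

```python
import subprocess


def build_tesseract_command(image, output_base, psm=3):
    """Build the argument list for a 3.x-style tesseract call."""
    # -psm 3 is fully automatic page segmentation, which is also the default.
    return ["tesseract", image, output_base, "-psm", str(psm)]


def run_tesseract(image, output_base, psm=3):
    """Run tesseract and return the recognized text.

    Assumes tesseract writes its result to <output_base>.txt.
    """
    subprocess.run(build_tesseract_command(image, output_base, psm), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Example (requires tesseract to be installed):
# text = run_tesseract("chem.png", "chem")
# print(text)
```

Running this on chem.png and comparing the text against the excerpts above should show whether the 4s end up in a separate block.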

Cheers,

Daniel

chem.png