Well, it won't do him any good because he's using tessnet2, so he
won't get the fix if/when I find it.
Actually, my current thought is that setting segmentation to line mode
might be enough to solve this problem, but I haven't gotten around to
checking. I'm a little too wrapped up in internationalising Tesseract
(which is an issue a little closer to my own interests).
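
For anyone who wants to try the line-mode idea from the command line, here is a minimal Python sketch. It assumes the Tesseract 3.x binary is installed as `tesseract` and that the page segmentation flag is spelled `-psm` (later releases spell it `--psm`), with mode 7 meaning "treat the image as a single text line". The file names and helper names are hypothetical, none of this applies to tessnet2, and it is untested as a fix for the problem discussed here.

```python
import shutil
import subprocess


def line_mode_command(image_path, output_base, psm=7):
    """Build the argument list for a single-text-line OCR run.

    Assumption: Tesseract 3.x syntax, `tesseract IMAGE OUTPUTBASE -psm N`.
    """
    return ["tesseract", image_path, output_base, "-psm", str(psm)]


def run_line_mode(image_path, output_base):
    """Run Tesseract in line mode if the binary is on PATH.

    Returns True if the command was run, False if Tesseract was not found.
    """
    if shutil.which("tesseract") is None:
        return False
    subprocess.run(line_mode_command(image_path, output_base), check=True)
    return True
```

Calling `run_line_mode("card.tif", "card")` would write the recognised text to `card.txt` if Tesseract is installed.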
--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
My apologies; I assumed 'he', which was quite a sexist assumption to make.
This is really the same issue I posted about earlier, where handwriting
above or below the text was grouped in with that text when read, which
caused bad reads.
It is helpful to have a slightly better understanding of what is happening
under the hood to cause this problem.
I suppose I don’t understand why the space before/after the word is not
"enough" for it to see those as different objects?
Do you think tosp_table_xht_sp_ratio could have any impact on this if I
tweak it?
I am not really sure I understand the significance of the values passed for
this option though.
Thanks
Austin
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en.
Nah. Most of the open source OCR guis use unpaper for this, though.
> This is the same issue in reality that I posted earlier about handwriting
> above or below the text being grouped in with the same text when read that
> caused bad reads.
> It is helpful to have a bit better understanding of what is happening under
> the hood that is causing this problem.
>
> I suppose I don’t understand why the space before/after the word is not
> "enough" for it to see those as different objects?
> Do you think tosp_table_xht_sp_ratio could have any impact on this if I
> tweak it?
No; that's the ratio used to determine the space between words (1/3
of the height of 'x').
You would set that ratio to something else if you get too many words
being output without spaces between them (needs to be lower), or if
you get spaces between letters (needs to be higher).
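
To make the effect of the ratio concrete, here is a toy Python sketch. It is not Tesseract's actual spacing code, just the simple rule described above: a gap wider than ratio * x-height counts as a word space. The function name and the (left, right) box format are made up for illustration.

```python
def split_words(boxes, x_height, ratio=1 / 3):
    """Group character boxes into words using a gap threshold.

    boxes: list of (left, right) pixel extents, sorted left to right.
    A gap wider than ratio * x_height starts a new word.
    """
    threshold = ratio * x_height
    words = []
    current = []
    prev_right = None
    for left, right in boxes:
        if prev_right is not None and left - prev_right > threshold:
            words.append(current)
            current = []
        current.append((left, right))
        prev_right = right
    if current:
        words.append(current)
    return words


# Four characters with gaps of 2, 8 and 2 pixels; x-height of 15 pixels.
boxes = [(0, 10), (12, 22), (30, 40), (42, 52)]
print(len(split_words(boxes, x_height=15)))             # default ratio
print(len(split_words(boxes, x_height=15, ratio=0.6)))  # higher ratio
print(len(split_words(boxes, x_height=15, ratio=0.1)))  # lower ratio
```

With the default ratio the 8-pixel gap is the only word space. Raising the ratio merges everything into one word, and lowering it puts a space between every letter, which matches the tuning advice above.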
Ok so safe to say for now my options are..

1- Live with it
2- Figure out how to get the lines off the page before I read them...
Right?
Thanks
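
A rough Python sketch of what option 2 could look like, assuming "lines" means ruled horizontal lines and that the page is already binarised into rows of 0 (background) and 1 (ink). The function name and the `min_run` parameter are hypothetical; a real tool such as unpaper does considerably more than this.

```python
def remove_horizontal_lines(image, min_run=20):
    """Erase long horizontal ink runs from a binary image.

    image: list of rows, each a list of 0 (background) or 1 (ink).
    Any unbroken horizontal run of ink at least min_run pixels long
    is treated as a ruled line and cleared. Returns a new image.
    """
    cleaned = [row[:] for row in image]
    for row in cleaned:
        x = 0
        width = len(row)
        while x < width:
            if row[x] == 1:
                start = x
                while x < width and row[x] == 1:
                    x += 1
                if x - start >= min_run:
                    row[start:x] = [0] * (x - start)
            else:
                x += 1
    return cleaned
```

Note that this also clears any text pixels that sit on the ruled line itself, so characters touching a line will come out slightly damaged.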
-----Original Message-----
From: Jimmy O'Regan
Sent: Monday, July 19, 2010 9:56 AM
To: tesser...@googlegroups.com
Subject: Re: Tesseract Reading Issue
On 19 July 2010 15:34, Austin Henderson <henderso...@gmail.com> wrote:
> Thank you for your feedback.
> I am working with some automated image pre-processing to try to remove the
> lines before reading and having better results.
> I just wanted to make sure I didn’t miss an optional setting that would
> allow it to differentiate better between these blocks.
>
Nah. Most of the open source OCR guis use unpaper for this, though.
As a developer I am cautious about estimating the amount of time a code change will take. I am thrilled to have the code and look forward to enhancements as they are ported to .net environments. For now I am cleaning up the image in pre-processing steps to remove blobs that are inconsistent with the others. This is not a problem in my use case and gets around this Tesseract issue just fine.
Thanks to the group for clarifying what the issue was. It helped me solve my problem.
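
The clean-up step is not described in detail in this thread. One possible sketch in Python, assuming the blobs are already available as (x, y, width, height) bounding boxes and that "inconsistent" means a height far from the median; the function name and the `tolerance` parameter are made up for illustration.

```python
from statistics import median


def filter_inconsistent_blobs(blobs, tolerance=0.5):
    """Keep only blobs whose height is close to the median height.

    blobs: list of (x, y, width, height) bounding boxes.
    tolerance: allowed deviation as a fraction of the median height.
    """
    if not blobs:
        return []
    median_height = median(blob[3] for blob in blobs)
    limit = tolerance * median_height
    return [blob for blob in blobs if abs(blob[3] - median_height) <= limit]
```

This only suits images where dropping the odd-sized blobs is acceptable. On mixed-size text it would throw away real content.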
On Jul 19, 2010 1:01 PM, "patrickq" <patrick.q...@gmail.com> wrote:
Wrong ... option 2 won't really work unless you want to cut-out
individual words. This image, where everything is on one line, still
fails with the same insane forcing of the letters in "John" to be
interpreted as tall letters:
http://www.scanbizcards.com/johndoeoneline.jpg
I think option 2 should be for all of us together now to beg Jimmy to
spend the 3-4 hours required to just tell Tesseract to quit this
persistent folly of pretending that all blocks are of the same
height. This issue is arguably the most damaging Tesseract flaw
for mixed text material (which is almost everything except books).
On Jul 19, 1:34 pm, "Austin Henderson" <henderson.aus...@gmail.com>
wrote:
> Ok so safe to say for now my options are..
>
> 1- Live with it
> 2- Figure out how to get the line...
> On 19 July 2010 15:34, Austin Henderson <henderson.aus...@gmail.com> wrote:
> > Thank you for your...
> > I just wanted to make sure I didn’t miss an optional setting that would
> > allow it to differentiate better between these blocks.
>
> Nah. Most of the open source OCR guis...
> > I suppose I don’t understand why the space before/after the word is not
> > "enough" for it to see those as different objects?
> > Do you think tosp_table_xht_sp_ratio coul...
> > "j...@widgets.com":http://www.scanbizcards.com/johndoe.jpg
> > Just because the email address uses a smaller font, Tesseract 3.0
> > stubbornly insists on inte...
:D I like you a lot right now.
> I am thrilled to have the code and look forward to enhancements
> as they are ported to .net environments.
Nobody has mentioned any plans to write a .net wrapper for Tesseract
3, and the developer of tessnet2 has mentioned that he would rather
pay for someone to reimplement Tesseract than touch it again, so I
wouldn't hold my breath, if I were you.
(On a related note, I spent a little while yesterday looking at some
truly horrifically written spaghetti code[1], so I'm a little less
unsympathetic than before, but I think he's seriously underestimating
the magnitude of such a reimplementation).
[1] Reminded me of this: http://www.ioccc.org/
Eh? I'm not aware of anyone's commit rights having been taken away. If
you had commit rights before, you should still have them.
You don't automatically get commit rights just by joining a mailing
list, not on any open source project. If you want to commit, you have
to ask the project owner to add you. In this case, that's Ray Smith.
His email address should be easy to find.
> I just got the latest alpha 3 version.
> Can you explain this syntax in imgtiff.cpp to me?
> tprintf (_("Resolution=%d\n"), *res);
> What does this underscore mean?
> Is this not ISO C++?
>
It's a gettext convenience macro, for localisation. It's a small first
step towards making Tesseract translatable, mostly made to see what
broke (thanks to Zdenko, btw, for finding the breakage).
>> (On a related note, I spent a little while yesterday looking at some
>> truly horrifically written spaghetti code[1], so I'm a little less
>> unsympathetic than before, but I think he's seriously underestimating
>> the magnitude of such a reimplementation).
>>
>
> I'm not underestimating it. That's why a university or engineering
> student will help us for 3 months.
> I have been writing C/C++ since I was 18 (I'm 41 now), I have been
> self-employed for 10 years, and I have handled so many projects that
> I know exactly how much work it needs.
Hey, by all means, prove me wrong.
> I also know we can't get a good solution by tweaking small pieces of
> tesseract code; we need to get the big picture and rewrite it.
That's your opinion; it's also your time and your money, so use them as
you see fit.
The usual convention on mailing lists is, when you want to comment on
a particular statement, you respond to *that* email, instead of
finding something completely unrelated and inserting your two cents
there.
Clearly, you didn't understand. Maybe you should read it again instead
of trying to reconstruct it from memory.
> And if somebody need HD Photo support you add it also? And camera RAW
> also, can you add it?
>
It looks like you're trying to take your misunderstanding to the level
of absurdity.
> So I confirm, I'll never write the tessnet3 wrapper.
Yeah, that was a really long winded way of eventually meandering to
the point, wasn't it?