--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
I've put the code of 3.01 on GitHub -
http://github.com/jimregan/tesseract-ocr. I'd intended to push the
merge into SVN today, but stupidly used the http address when building
the git repository instead of the https address, so I can't push back
directly without rewriting the references. That might be for the best
though, as I think it might be worth leaving 3.00 as is for a week or
two, before pushing out the update.
There might still be a few glitches in the build system.
gettimeofday is a simple enough function - if there isn't an existing
Windows implementation around, it'll be easy enough to implement.
Makes me wonder /why/ anything in Tesseract would need to know the
time though - it's probably something that gives a slightly nicer idea
of how long processing took, but that isn't strictly necessary.
http://doxygen.postgresql.org/gettimeofday_8c-source.html
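
To give an idea of how little is involved, here is a rough sketch of a
portable stand-in, assuming a C++11 compiler is acceptable. It is not what
Tesseract or the PostgreSQL file above does: the names TimeVal,
portable_gettimeofday and elapsed_seconds are made up for illustration, and
it relies on std::chrono::system_clock counting from the Unix epoch, which
the common implementations do but which the standard only guarantees from
C++20 onwards.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for struct timeval, so that the sketch does not
// depend on <sys/time.h> (POSIX) or <winsock2.h> (Windows).
struct TimeVal {
  std::int64_t tv_sec;   // whole seconds since the clock's epoch
  std::int64_t tv_usec;  // remaining microseconds, 0..999999
};

// Hypothetical replacement for gettimeofday(tv, NULL).
// Returns 0 on success and -1 if tv is null.
// Assumption: system_clock's epoch is the Unix epoch (true in practice,
// guaranteed by the standard only since C++20).
inline int portable_gettimeofday(TimeVal* tv) {
  if (tv == nullptr) return -1;
  using namespace std::chrono;
  const std::int64_t us =
      duration_cast<microseconds>(system_clock::now().time_since_epoch())
          .count();
  tv->tv_sec = us / 1000000;
  tv->tv_usec = us % 1000000;
  return 0;
}

// Seconds elapsed between two readings, e.g. to report how long
// processing a page took.
inline double elapsed_seconds(const TimeVal& start, const TimeVal& end) {
  return static_cast<double>(end.tv_sec - start.tv_sec) +
         static_cast<double>(end.tv_usec - start.tv_usec) / 1e6;
}
```

If the value really is only used to report how long processing took, taking
one reading before and one after and passing both to elapsed_seconds would
cover it. std::chrono::steady_clock would arguably be the better choice for
that, since it is not affected by changes to the wall clock.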
On Sat, 04 Dec 2010 09:09:33 +0100
I am using the latest SVN version, 3.01, on a Linux 2.6.29 kernel with an
AMD-64 processor. Here are the issues I have noticed, the most important
one first:
1) When OCRing in mode 3 (with block recognition) I get rubbish in the
text at seemingly random places, but mostly when switching columns.
Lines appear like:
<><><><><><><>
or
-,9-,9-,9-,9-,9
or
ííííííííííííííí
The original is perfectly clean, as it was generated from a PDF, at 300
dpi (also tried at 450 dpi).
These problems do not appear when using Pageseg_mode 6 (no block
detection). There everything is clean.
2) It seems that the -psm <n> command line option doesn't work. I
could only get different modes to work using a configuration file.
3) Though I specified -l spa (and have the data installed), I still get
certain non-Spanish characters in the output, such as the cent sign
(c + |). I believe this has been reported before.
Greetings,
John