New version of tesseractdotnetwrapper

Cong Nguyen

Jul 5, 2011, 9:16:06 PM
to tesser...@googlegroups.com
Dear all,

A new version of tesseractdotnetwrapper was released on July 4, 2011.

It is based on tesseract-ocr v3.01 r590.

Here is the link: 


Change log and notes:
- libleptxxx.dll has been replaced by linking its static library directly into the tesseract project,
- use the ROI and UseROI properties to recognize within a region of interest (see Simple3.cs in the IPoVnOCRer project),
- added CreateBinaryPix() and CreateGreyPix() in PixConverter (PixFromImage may be deprecated),
- created a new document layout structure (adapted to tesseract engine v3.01 r590): DocumentLayout >> Block >> Paragraph >> TextLine >> Word >> Character/Symbol,
- you can set the tessdata path,
- you can recognize with other OcrEngineMode values,
- you can analyze the layout only,
- you can recognize in parallel (see Simple3.cs in the IPoVnOCRer project),
- use IPoVn.IPCore to load/save/crop/invert/binarize images in a generic image format,
- to test tesseract.dll in .NET 4/VS2010, please look at Simple1.cs only; the other *.cs files may need the IPoVn.IPCore project to be compiled,
- you can build your own processing flow:
----- 0. Do some pre-processing first (in my case, adaptive thresholding),
----- 1. AnalyseLayout() -> get blobs (block/paragraph/textline/word...),
----- 2. Do some image processing on each blob,
----- 3. Recognize each ROI/blob with the corresponding OcrEngineMode/PageSegmentMode,
----- 4. Do some post-processing (VietOCR is an example).

IPoVnOCRer project:
- Simple1.cs: uses tesseract.dll only
----- example of recognition and layout analysis.
- Simple2.cs: tesseract.dll + IPoVn.IPCore + IPoVnSystem
----- example of recognition and layout analysis after adaptive thresholding.
- Simple3.cs: tesseract.dll + IPoVn.IPCore + IPoVnSystem
----- example of recognition in ROIs.


I hope it is helpful.
Cong.

Sarel van der Merwe

Jul 6, 2011, 4:42:56 AM
to tesser...@googlegroups.com
Hi Cong,

I'm using VS2010.

I'm not sure how to compile the IPoVn.IPCore project. Must I compile it
in C# or C++?

I opened "IPoVn - Image Processing of Vietnamese.sln" in Visual
Studio 2010.
I tried to build it and got missing assembly reference errors.

Please help...

> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-oc...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>

cong nguyenba

Jul 6, 2011, 5:59:30 AM
to tesser...@googlegroups.com
The solution was developed in VS2008.

The IPoVn.IPCore project is a CLR project; it mixes .NET and C++ code.

Simple1.cs uses tesseract.dll only. So you can exclude Simple2.cs and
Simple3.cs, and remove IPoVn.IPCore.dll and IPoVnSystem from the
IPoVn.OCRer project.

Andreas Reiff

Jul 6, 2011, 6:01:46 AM
to tesseract-ocr
I get an AccessViolationException, trying to adapt your code for my
needs: Attempted to read or write protected memory. This is often an
indication that other memory is corrupt.

The code is more or less copied from your simple1 - my bitmap does not
come out of a file but from a screenshot (part of the screen).

public static void Recognize(Bitmap bmp)
{
    string language = "eng";
    int oem = (int)eOcrEngineMode.OEM_DEFAULT;

    using (TesseractProcessor processor = new TesseractProcessor())
    {
        DateTime started = DateTime.Now;
        DateTime ended = DateTime.Now;

        string tessdataFolder = @"D:\Temp\IPoVnOCRer\IPoVn\Test\Tessdata";

        processor.Init(tessdataFolder, language, oem);

        string text = "";
        unsafe
        {
            started = DateTime.Now;
            text = processor.Recognize(bmp);
            ended = DateTime.Now;

            Console.WriteLine("Duration recognition: {0} ms\n\n",
                (ended - started).TotalMilliseconds);
        }

        Console.WriteLine(
            string.Format("RecognizeMode: {1}\nRecognized Text:\n{0}\n++++++++++++++++++++++++++++++++\n",
                text, ((eOcrEngineMode)oem).ToString()));
    }
}

BTW, thx for writing a wrapper - if it works, it solves just about all
my problems. :)

Quan Nguyen

Jul 6, 2011, 8:05:22 AM
to tesseract-ocr
Andreas,

Try adding a trailing backslash to the data path, such as:

string tessdataFolder = @"D:\Temp\IPoVnOCRer\IPoVn\Test\Tessdata\";

I'm curious as to why you use an unsafe block in your code.

Quan
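
To avoid depending on how the path was typed, the folder string could be normalized before it is passed to Init. This is only a minimal sketch: EnsureTrailingSeparator is a hypothetical helper, not part of the wrapper, and it uses nothing beyond System.IO.

```csharp
using System;
using System.IO;

static class TessdataPath
{
    // Hypothetical helper (not part of tesseractdotnet): returns the folder
    // path, appending a directory separator if it does not already end with one.
    public static string EnsureTrailingSeparator(string folder)
    {
        if (string.IsNullOrEmpty(folder))
            throw new ArgumentException("folder must not be empty", "folder");

        char last = folder[folder.Length - 1];
        if (last == Path.DirectorySeparatorChar || last == Path.AltDirectorySeparatorChar)
            return folder;

        return folder + Path.DirectorySeparatorChar;
    }
}
```

On Windows, calling it on the path from the example above would yield the same string with a final backslash; a path that already ends in a separator is returned unchanged.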

Andreas Reiff

Jul 6, 2011, 10:02:40 AM
to tesseract-ocr
Hello Quan!

That did the trick, many thanks!

By the way, I am using unsafe, because it is in the example
Simple1.cs.

Apart from that, I would rather not use it, since it propagates up..
and it doesn't prevent an application from crashing anyway.

If you find the time, could you answer one more related question: I
want to do screen text recognition, like text on menus, in Notepad,
and the like. Your tessdata seems to work rather poorly for this (now
that it is running, I could test). How best to handle this? Create or
get new tessdata? Is it possible to run without tessdata at all?

I would have expected screen recognition to be especially easy, since
there is no noise. But then again, I have spent too little time to
look into this yet.

Best wishes,
Andreas

Sven Pedersen

Jul 6, 2011, 11:57:01 AM
to tesser...@googlegroups.com
For screen captures it is necessary to increase the resolution. Since
it is usually 72-90 dpi, you must rescale them to 200-300 dpi; then
you'll see a drastic improvement in accuracy. I don't know anything
about the C# stuff, though...
--Sven

--
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”
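
The rescaling advice above comes down to a simple ratio of target to source resolution. A minimal sketch of that arithmetic follows; the class and method names are hypothetical, and the 96 dpi figure in the usage note is an assumption about a typical Windows screen, not something stated in this thread.

```csharp
using System;

static class ScreenshotScaling
{
    // Scale factor needed to bring an image from sourceDpi up to targetDpi.
    public static double ScaleFactor(double sourceDpi, double targetDpi)
    {
        if (sourceDpi <= 0) throw new ArgumentOutOfRangeException("sourceDpi");
        if (targetDpi <= 0) throw new ArgumentOutOfRangeException("targetDpi");
        return targetDpi / sourceDpi;
    }

    // Pixel dimension after scaling, rounded to the nearest whole pixel.
    public static int ScaledSize(int pixels, double factor)
    {
        return (int)Math.Round(pixels * factor);
    }
}
```

For a 96 dpi capture and a 300 dpi target the factor is about 3.1. The scaled width and height could then be handed to System.Drawing, for example via the Bitmap(Image, int, int) constructor, to produce the enlarged image.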

Andreas Reiff

Jul 6, 2011, 1:13:17 PM
to tesseract-ocr
Thanks, Sven!

Actually, that is what I did, and it is working.. between better and
great (of course, my expectations go up with me seeing what is
possible).

I have also written my own convert-to-grayscale and max-contrast
steps, which even stretch slightly past the extremes (values close to
the max and min are pushed to white and black as well).

It is working a lot better.

I still get occasional bad results (though way less frequently).

So, for anyone wanting to do this as well: scaling up the image by a
factor of 3 and increasing the contrast improves recognition quality a
lot.

I wonder: is there any tessdata for Windows with standard fonts? Or
how should I approach this?

Best wishes
Andreas
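
The grayscale and contrast steps described above were not posted, so the following is only a sketch of one way they might look, working on plain byte arrays rather than on a Bitmap. The names are hypothetical, the grayscale weights are the common BT.601 luma weights, and the low/high cut-offs are assumptions to be tuned per image.

```csharp
using System;

static class ContrastStretch
{
    // Luma-weighted gray value (ITU-R BT.601 weights) for one RGB pixel.
    public static byte ToGray(byte r, byte g, byte b)
    {
        return (byte)Math.Round(0.299 * r + 0.587 * g + 0.114 * b);
    }

    // Linearly stretches gray values so that [low, high] maps to [0, 255].
    // Values at or below low become black; values at or above high become white.
    public static byte[] Stretch(byte[] gray, byte low, byte high)
    {
        if (gray == null) throw new ArgumentNullException("gray");
        if (low >= high) throw new ArgumentException("low must be less than high");

        byte[] result = new byte[gray.Length];
        double scale = 255.0 / (high - low);
        for (int i = 0; i < gray.Length; i++)
        {
            double v = (gray[i] - low) * scale;
            if (v < 0) v = 0;
            if (v > 255) v = 255;
            result[i] = (byte)Math.Round(v);
        }
        return result;
    }
}
```

Choosing low slightly above the darkest value and high slightly below the brightest one gives the "push near-extremes to black and white" effect mentioned above.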

Quan Nguyen

Jul 6, 2011, 6:18:06 PM
to tesseract-ocr
Andreas,

By scaling the screenshots to a higher resolution, about 300 DPI,
you'd likely get better results. VietOCR.NET has a Screenshot mode
that performs this rescaling. You may want to check it out.

I believe the language packs included in tesseractdotnet are
Tesseract's standard ones. The eng pack seems to work very well for
Windows' standard fonts. Check the site http://code.google.com/p/tesseract-ocr/
for more info.

Quan

cong nguyenba

Jul 6, 2011, 9:17:57 PM
to tesser...@googlegroups.com
The tessdata files were downloaded from tesseract-ocr.

The unsafe statement is not needed in Simple1.cs.

tesseract-ocr recommends 300 DPI, but you can also pre-process the image
first by other means.

You should probably check that the tesseract processor initialized
successfully before doing anything else:
if (processor.Init(...))
{
    .............
}

Here is the link to download the assemblies only:
http://tesseractdotnet.googlecode.com/files/IPoVn_Release_x86.zip

Sarel van der Merwe

Jul 6, 2011, 6:24:43 PM
to tesser...@googlegroups.com
Did you manage to compile this under Visual Studio 2010?

cong nguyenba

Jul 6, 2011, 10:24:35 PM
to tesser...@googlegroups.com
Not yet; I have not installed VS2010 on my machine.

Sarel van der Merwe

Jul 7, 2011, 4:04:39 AM
to tesser...@googlegroups.com
Please consider installing VS2010; I think a lot of us would benefit from
having the DLL built against the latest .NET Framework.

The program compiles cleanly, without any warnings, using the 2008 DLLs.
At run time, however, it fails with:
"Unhandled Exception: System.IO.FileLoadException: Mixed mode
assembly is built against v2.0.50727 of the runtime and cannot be
loaded in the 4.0 runtime"

I even tried changing the target framework to 3.5, without any success.

Thanks for your support,
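
A commonly cited workaround for this FileLoadException, offered here as an assumption and not confirmed anywhere in this thread, is to tell the .NET 4 runtime to load mixed-mode assemblies built against CLR 2.0. A sketch of the app.config for the executable project:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Allows the 4.0 runtime to activate mixed-mode assemblies built for v2.0.50727 -->
  <startup useLegacyV2RuntimeActivationPolicy="true">
    <supportedRuntime version="v4.0"/>
  </startup>
</configuration>
```

Whether this is enough for tesseract.dll specifically has not been verified; rebuilding the wrapper with VS2010 would be the cleaner fix.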


Sarel van der Merwe

Jul 9, 2011, 6:12:52 AM
to tesser...@googlegroups.com
Error when I tried to run the Simple program - tesseractdotnetwrapper
-----------------------------------------------------------------------------------------------------------
Could not load file or assembly 'tesseract, Version=0.0.0.0,
Culture=neutral, PublicKeyToken=null' or one of its dependencies. This
application has failed to start because the application configuration
is incorrect. Reinstalling the application may fix this problem.
(Exception from HRESULT: 0x800736B1)

New environment
------------------------
OS: windows xp with SP3
Visual Studio 2008 c# express

Steps followed
------------------------
1. Create new project under vs2008
2. Added the following to project
~ Program.cs
~ Rendre.cs
~ Simple1.cs
~ Simple2.cs
~ Simple3.cs
~ Simple4.cs
~ Workspace.cs
3. Added the 3 assemblies as references, from:
http://tesseractdotnet.googlecode.com/files/IPoVn_Release_x86.zip

Building Program
---------------------
1. Had to "allow unsafe code" in properties
2. The program compiles but will not run.

cong nguyenba

Jul 9, 2011, 7:34:59 AM
to tesser...@googlegroups.com
Try downloading and installing vcredist_x86 from here:
http://www.microsoft.com/download/en/details.aspx?id=5582

The Microsoft Visual C++ 2008 SP1 Redistributable Package (x86)
installs runtime components of Visual C++ Libraries required to run
applications developed with Visual C++ SP1 on a computer that does not
have Visual C++ 2008 SP1 installed.

Sarel van der Merwe

Jul 9, 2011, 8:00:14 AM
to tesser...@googlegroups.com
I installed the redistributable package.

1. Rebooted and recompiled.
2. Still having the same problem.

Could not load file or assembly 'tesseract, Version=0.0.0.0,
Culture=neutral, PublicKeyToken=null' or one of its dependencies. This
application has failed to start because the application configuration
is incorrect. Reinstalling the application may fix this problem.
(Exception from HRESULT: 0x800736B1)


Please take a look to see whether I have done something wrong.
Attached please find the project with the Simple program.

Thanks

Sarel

samp.zip

Quan Nguyen

Jul 9, 2011, 6:13:50 PM
to tesseract-ocr
tesseract.dll is x86, so make sure your project's Properties > Build >
Platform target is also set to x86.
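
In the project file itself, that setting corresponds to the PlatformTarget MSBuild property. A minimal sketch of the relevant .csproj fragment follows; the surrounding PropertyGroup and any Condition attribute on it will differ from project to project.

```xml
<PropertyGroup>
  <!-- Build the managed executable as 32-bit so it can load the x86 tesseract.dll -->
  <PlatformTarget>x86</PlatformTarget>
</PropertyGroup>
```

Editing the file by hand can be useful if the IDE in use does not show the Platform target option.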


Sarel van der Merwe

Jul 11, 2011, 2:23:46 AM
to tesser...@googlegroups.com
Still having this problem.

Could not load file or assembly 'tesseract, Version=0.0.0.0,
Culture=neutral, PublicKeyToken=null' or one of its dependencies. This
application has failed to start because the application configuration
is incorrect. Reinstalling the application may fix this problem.
(Exception from HRESULT: 0x800736B1)

I think the Microsoft.VC80.CRT version is the root cause.

In Add/Remove Programs I found 3 versions of the MS Visual C++
2008 Redistributable:
x86 9.0.30729
x86 9.0.30729.17
x86 9.0.30729.6161

PS. This was a clean windows xp pro with Visual Studio 2008 express.

Any suggestions?

Sarel van der Merwe

Jul 11, 2011, 4:23:06 AM
to tesser...@googlegroups.com
Got it working.
For this to work you must also have MS Visual Studio C++ Express installed.
I only had MS Visual Studio C# Express installed.