[Plugin] Tesseract OCR integration with ResourceSpace

Robert Damrau

Sep 29, 2014, 3:05:14 PM
to resour...@googlegroups.com
Hi,

Some time ago I figured it would be nice to have the option to run OCR on scanned documents. Since I wasn't able to find anyone willing to implement that in RS, I recently started doing it myself. I have no programming experience, so please keep that in mind when/if you look at the code and comment. Right now I am reading "JavaScript & jQuery: The Missing Manual" and hope it enlightens me a bit.

You can find the code for the plugin here: https://github.com/andirotter/ocrstream
Tesseract OCR is required.
This is just the beginning; right now you can 'scan' an image via the single resource edit page.
There is a lot to do, but it will take time because I have to learn everything from scratch. If you think you can help by commenting on the code or, even better, contributing to it, you are invited!


Allison Stec

Sep 29, 2014, 3:16:43 PM
to ResourceSpace
Did the text extraction functionality within RS not meet your needs?

Allison Stec
Asset Management Specialist
Colorhythm
http://www.colorhythm.com

Main Office: +1 415-399-9921
Fax: +1 415-399-9928

as...@colorhythm.com

--
ResourceSpace: Open Source Digital Asset Management
http://www.resourcespace.org
---
You received this message because you are subscribed to the Google Groups "ResourceSpace" group.
To unsubscribe from this group and stop receiving emails from it, send an email to resourcespac...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Robert Damrau

Sep 29, 2014, 4:15:25 PM
to resour...@googlegroups.com
Text extraction within RS works great, especially with unoconv. With tesseract I am able to extract text from scanned images, which is quite useful for me.

Robert Damrau

Oct 6, 2014, 4:19:36 PM
to resour...@googlegroups.com
If anyone is willing to test the plugin, I'd be happy.


Thanks!

Brian Irwin

Oct 8, 2014, 1:45:47 AM
to resour...@googlegroups.com
Hello Robert,

Thanks for sharing the plugin!  This is an awesome idea with great potential!

I am looking forward to testing it out. I'll write back with any questions/comments.

Best Regards

Brian S

Oct 8, 2014, 10:52:04 AM
to resour...@googlegroups.com
Hey Robert,

This is a great idea! I just installed it, but I'm getting the following error after clicking activate:

Parse error: syntax error, unexpected '[' in F:\xampp\htdocs\resourcespacebackup\plugins\ocrstream\config\config.php on line 23

Robert Damrau

Oct 8, 2014, 11:28:40 AM
to resour...@googlegroups.com
Sorry about that. Try using the latest master branch.

Brian Irwin

Oct 8, 2014, 9:59:45 PM
to resour...@googlegroups.com
Hello Robert,

I have just tried using the plugin.  It activates successfully, but when I try to run an OCR, all I get is the spinning progress image.   The image I was trying to OCR was a 300dpi scan of an old newspaper.  No error messages appeared in the error log.  

Is there anything I could be missing?

I installed Tesseract 3.03 in Ubuntu 14.04 using apt-get.
I am using ImageMagick 6.8.9-4 Q16

tesseract 3.03 
leptonica-1.70

Best Regards

Robert Damrau

Oct 9, 2014, 3:20:06 AM
to resour...@googlegroups.com
Hi,
Sorry to hear that. Can you try again with the latest code from git? You should at least see something in the debug log now. What file type was the scan? Did you check the filestore/tmp directory to see whether any files were created? Once you find the tesseract command line in the debug log, you could try executing it directly from the shell/terminal. I remember I had some problems with permissions the first time I installed tesseract.

Brian S

Oct 9, 2014, 1:38:40 PM
to resour...@googlegroups.com
I'm having the same problem as Brian. When I try to run the scan, the progress bar just keeps spinning. Nothing shows up in the RS debug log and no temporary files are created. I did notice this in the Apache error_log though:

PHP Parse error:  syntax error, unexpected '[' in /var/www/html/plugins/ocrstream/pages/scan.php on line 31

Robert Damrau

Oct 9, 2014, 4:09:10 PM
to resour...@googlegroups.com
Sorry about that. Like I said, this is my first programming experience. The error was caused by my use of PHP syntax that is only available in PHP >= 5.4. I have fixed that now (hopefully). Maybe you can try again.

Thanks.

Brian S

Oct 10, 2014, 8:21:21 AM
to resour...@googlegroups.com
No worries! I realize you're only at the early stages. That seemed to do the trick though. Everything seems to be running well. Good start!

Robert Damrau

Dec 27, 2014, 10:04:29 AM
to resour...@googlegroups.com
I've added OCR processing for file uploads, among other things.
I guess there are many possible errors, so I'm still happy to get any feedback.

ocrstream_scr_02.png
ocrstream_scr_01.png

Opicron

Dec 28, 2014, 6:41:31 AM
to resour...@googlegroups.com
No programming experience and you created this? Well done ;) Thanks for the effort! I am sure this will be useful for our ResourceSpace!

Opicron

Dec 28, 2014, 6:58:26 AM
to resour...@googlegroups.com
I'm receiving the following error in the console when trying to OCR from the edit resource page:

SyntaxError: JSON.parse: unexpected character at line 1 column 2 of the JSON data

P.S.: I have no way to select part of the picture. If there is no selection, or if the selector wasn't loaded due to an error, could the whole image be processed instead?
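
An error like this generally means the response body was not valid JSON, for example because the server printed a warning or some HTML before the payload. Below is a minimal sketch of a defensive parse that reports what actually came back. The function name parseOcrResponse and the sample bodies are made up for illustration and are not part of the plugin:

```javascript
// Hypothetical helper, not part of ocrstream: parse a response body and
// report the start of the raw text when it is not valid JSON.
function parseOcrResponse(body) {
    try {
        return { ok: true, value: JSON.parse(body) };
    } catch (e) {
        // Keep the first characters of the raw body; they usually show
        // the notice or markup that broke the JSON.
        return { ok: false, error: e.message, preview: String(body).slice(0, 80) };
    }
}

// Valid JSON parses normally.
console.log(parseOcrResponse('"ocr_error_4"'));

// Assumed example of a polluted response: markup ahead of the JSON.
console.log(parseOcrResponse('<br />Notice: Undefined variable"ocr_error_4"'));
```

Logging the preview in the browser console is often enough to see which server-side message got mixed into the response.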

Opicron

Dec 28, 2014, 7:24:59 AM
to resour...@googlegroups.com
The error probably has nothing to do with jCrop not being loaded.

// Only show jCrop when image is going to be processed
//if (param_1 === 'pre_1') {
    ocr_crop();
//}

I forced jCrop to load but still get the same console error.

Why would you restrict OCR after upload/preview? I would leave the option open, or limit it to certain roles.

Opicron

Dec 28, 2014, 7:39:21 AM
to resour...@googlegroups.com

Some of my pictures contain text. It might be a good idea to make this a configuration option ;).

// Check if resourcetype is document
if (sql_value("select resource_type value from resource where ref ='$ref'", '') != 2){
    exit(json_encode('ocr_error_4'));
}

P.S.: I traced the previous error to $ocr_allowed_extensions on line 45 of stage_1.php, which is not defined there. The configuration is somehow not being set/read. I fixed it by including the config.php file in the header.

Opicron

Dec 28, 2014, 7:44:06 AM
to resour...@googlegroups.com
And after fixing those, I get this:

/opt/bitnami/apps/resourcespace/htdocs/plugins/ocrstream/include/stage_2.php line 35: Undefined index: ocr_stage_97

1. Most testers/programmers use E_ALL error reporting in PHP. Expecting an array index to be present without checking for it will break on a lot of setups. Of course, in this case I reached this point by hacking your code ;).

Robert Damrau

Dec 28, 2014, 8:31:58 AM
to resour...@googlegroups.com
Thanks for having a look at the plugin!
Not showing jCrop just means the resource doesn't need image processing before OCR. You can still choose image processing via the selector if you want.

> Some of my pictures contain text. It might be a good idea to make this a configuration option ;).
> // Check if resourcetype is document
> if (sql_value("select resource_type value from resource where ref ='$ref'", '') != 2){
>     exit(json_encode('ocr_error_4'));
> }

I thought about that too, but I'm not sure what the best solution is. Right now I assume all 'Document' resource types qualify for OCR processing. If I have images/pictures I want OCR'ed, I set their resource type to Document, and if they have an allowed file extension they get processed. I did this to prevent regular photos from being processed accidentally, as running tesseract on these can cause serious performance issues (at least it did for me).

> P.S.: I traced the previous error to $ocr_allowed_extensions on line 45 of stage_1.php, which is not defined there. The configuration is somehow not being set/read. I fixed it by including the config.php file in the header.

Hm, that's strange... Did you save the plugin configuration on the plugin setup page beforehand?

> And after fixing those, I get this:
> /opt/bitnami/apps/resourcespace/htdocs/plugins/ocrstream/include/stage_2.php line 35: Undefined index: ocr_stage_97
> 1. Most testers/programmers use E_ALL error reporting in PHP. Expecting an array index to be present without checking for it will break on a lot of setups. Of course, in this case I reached this point by hacking your code ;).

Huh, I can't see why that happens, but I'm still learning :)
If stage_1 completed successfully, that index should exist...

I would be happy if you opened issues for the fixes/changes you make. That way I can learn and easily merge them into the code.

Opicron

Dec 28, 2014, 9:25:47 AM
to resour...@googlegroups.com
I think it is a session problem. I never use sessions, to be honest; instead of jQuery.get I use jQuery.ajax and pass along the variables I need.

Opicron

Dec 28, 2014, 10:08:44 AM
to resour...@googlegroups.com
Never mind my remark about jQuery.get ;).

1)

The sessions do not work because you are not including authenticate.php (see below):

    require_once "../../../include/db.php";
    include_once "../../../include/authenticate.php";   
    require_once "../../../include/general.php";

Also, there is no need for two include blocks in an if/else; just use the top one:

//if (!isset($_SESSION["ocr_start"])) {
    session_start();
    require_once "../../../include/db.php";
    include_once "../../../include/authenticate.php";
    require_once "../../../include/general.php";
    require_once "../../../include/resource_functions.php";
    require_once "../include/ocrstream_functions.php";
    //require_once "../config/config.php";
    $ref = filter_input(INPUT_GET, 'ref', FILTER_VALIDATE_INT);
//}

2)

I see some room for improvement in how you use JSON data. For example, in stage_1.php you could set $result['error'] = ...

Then, at the end of stage_1.php, the data can be returned like so:

header('Content-Type: application/json');
echo json_encode($result);
//return($debug); // no return value is needed: the JSON is echoed and jQuery will pick it up

This also removes the need to use JSON.parse in your ocr_start event (OCR submit), e.g.:

console.log('stage 1 started'); // debug
jQuery.get('<?php echo $baseurl ?>/plugins/ocrstream/include/stage_1.php', {ref: '<?php echo $ref ?>', ocr_lang: (ocr_lang)}, function (data)
{
    alert(data);
});

This works because the returned data is already a parsed object.

After these changes I will give it another go ;).
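
To illustrate the point about JSON.parse: when the server sends Content-Type: application/json, jQuery hands the callback an already-parsed value, and calling JSON.parse on an object fails because the object is first converted to the string "[object Object]". Here is a small sketch of a helper that accepts either form. The name toObject is hypothetical, not something from the plugin or jQuery, and the payload is invented:

```javascript
// Hypothetical helper: return parsed data whether the callback received
// raw JSON text or an already-parsed value.
function toObject(data) {
    return typeof data === 'string' ? JSON.parse(data) : data;
}

const parsed = { error: 'ocr_error_4' };  // what jQuery passes for a JSON response
const raw = JSON.stringify(parsed);       // what arrives if the type is not detected

console.log(toObject(parsed).error);
console.log(toObject(raw).error);

// Parsing an object a second time throws a SyntaxError.
let threw = false;
try {
    JSON.parse(parsed);
} catch (e) {
    threw = e instanceof SyntaxError;
}
console.log('double parse throws:', threw);
```

This assumes the payload is always an object or array; a response that is a bare JSON string would be ambiguous for such a helper.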

Opicron

Dec 28, 2014, 10:15:20 AM
to resour...@googlegroups.com
In jQuery you can read the value of $result['error'] from the response object. With json_encode($result) as above, that is data.error:

if (data.error != ..)
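
Which property path reaches the value depends on what the PHP side encodes. As a sketch, assuming stage_1.php echoes json_encode($result) with $result['error'] set, the value sits at the top level of the object; a nested path such as data.result.error only exists if the array is wrapped under a "result" key. The payloads and the helper name getOcrError below are invented for illustration:

```javascript
// Assumed payload for: echo json_encode($result);
const flat = JSON.parse('{"error":"ocr_error_4"}');

// Assumed payload for: echo json_encode(array('result' => $result));
const wrapped = JSON.parse('{"result":{"error":"ocr_error_4"}}');

// Hypothetical helper that reads the error code from either shape,
// returning null when no error field is present.
function getOcrError(data) {
    if (data && data.result && data.result.error !== undefined) {
        return data.result.error;
    }
    return data && data.error !== undefined ? data.error : null;
}

console.log(getOcrError(flat));
console.log(getOcrError(wrapped));
console.log(getOcrError({}));
```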

Opicron

Dec 28, 2014, 10:55:54 AM
to resour...@googlegroups.com
I quickly forked your code and fixed the sessions. It turned out to be pretty easy: https://github.com/opicron/ocrstream

The session_start() call needed to be outside the if statement that checks $_SESSION, and I added the authenticate.php include.

If you agree with the changes, please merge them from the fork. Thanks!

Opicron

Dec 28, 2014, 12:55:28 PM
to resour...@googlegroups.com

Robert Damrau

Dec 28, 2014, 1:11:41 PM
to resour...@googlegroups.com
Yes, I found that too (https://github.com/andirotter/ocrstream/wiki/Links) ;)
I plan to implement this as an additional preset for the 'harder' cases where there is noise or similar in the scan. I did some tests and it works well, but it also needs more processing time and should be optional.

Dan Huby

Sep 14, 2015, 12:53:28 PM
to ResourceSpace
Hi all,

I've just tried this out - I'm very impressed. What is the licensing for this code? It would be great to include it in the base.

There are a couple of glitches, probably the result of a slight incompatibility with the latest RS version. We'd be happy to fix those once it is part of the base, and I'm sure that would also make it easier for others to contribute to this great plugin.

Many thanks,

Dan

Robert Damrau

Oct 26, 2015, 6:47:47 AM
to ResourceSpace
Hi Dan,

I would be happy to see it in the base!

I added a LICENSE.txt. The code is located here: https://github.com/winkelement/ocrstream

As I mentioned earlier, I am not a programmer at all and am just starting to learn, so any help and advice is much appreciated. I also wouldn't consider the plugin ready for production yet.

So my question is: if you include it in the base code, can it still live on GitHub? I really like the concept of collaboration there. If not, how can I still contribute to the code personally once it is part of the base?

Regards,

Robert  

Robert Damrau

Nov 3, 2015, 10:44:54 AM
to ResourceSpace
Any news on that? I made some fixes which, at least for me, made "OCR on upload" work again.

Robert Damrau

Nov 5, 2015, 7:11:26 AM
to ResourceSpace
For anyone already using tesseract-ocr, I can recommend compiling from the latest source (3.05.00dev).
I noticed significantly better results: the word error rate dropped from 5% to <2%, which seems pretty good to me considering the document uses Calibri, an OCR-unfriendly font...
In /ocrstream/doc/sample_data I included a sample document (scanned PDF, 300 dpi) and a ground truth file for it, so you can check and compare results for different settings/tesseract versions...
To evaluate the error rate I used this tool: https://github.com/impactcentre/ocrevalUAtion
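
For readers unfamiliar with the metric: word error rate is commonly defined as the word-level edit distance (substitutions, deletions, insertions) between the OCR output and the ground truth, divided by the number of words in the ground truth. A minimal sketch of that definition follows. It is an assumption that this matches what ocrevalUAtion reports, since the tool may tokenize or normalize text differently, and the sample strings are made up:

```javascript
// Word-level Levenshtein distance between two token arrays.
function wordEditDistance(ref, hyp) {
    // prev[j] = distance between the first i-1 ref words and the first j hyp words
    let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
    for (let i = 1; i <= ref.length; i++) {
        const curr = [i];
        for (let j = 1; j <= hyp.length; j++) {
            const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
            curr[j] = Math.min(
                prev[j] + 1,        // deletion
                curr[j - 1] + 1,    // insertion
                prev[j - 1] + cost  // substitution or match
            );
        }
        prev = curr;
    }
    return prev[hyp.length];
}

// WER = edit distance / number of words in the ground truth.
function wordErrorRate(groundTruth, ocrText) {
    const tokenize = (s) => s.trim().split(/\s+/).filter(Boolean);
    const ref = tokenize(groundTruth);
    const hyp = tokenize(ocrText);
    if (ref.length === 0) {
        throw new Error('ground truth must contain at least one word');
    }
    return wordEditDistance(ref, hyp) / ref.length;
}

// Made-up sample: one substituted word out of five gives 0.2.
console.log(wordErrorRate('the quick brown fox jumps', 'the quick brovvn fox jumps'));
```

Under this definition, a drop from 5% to under 2% means fewer than 2 wrong, missing, or extra words per 100 words of ground truth.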

Dan Huby

Dec 1, 2015, 7:10:51 PM
to ResourceSpace
Sorry, I stalled. I hit a glitch where the "Loading" display stuck around, but I suspect it was related to something else I was doing at the time. We're in a bug fix period at the moment until the 7.5 release in a couple of weeks. I'll get right on it after that. It will be a very nice addition to the software!

Dan