Weather rescue

7tonin

Dec 17, 2017, 8:42:27 AM
to tesseract-ocr

Hi,
I need to digitize a large batch of weather data, mainly tables of handwritten numbers. Do you think it is worth using OCR for this?

I'm fairly comfortable with PHP programming, but it's hard to tell whether training an engine, for instance tesseract-ocr v3, is a good idea.

Sample images are available here:
https://www.zooniverse.org/projects/edh/weather-rescue


John Muccigrosso

Dec 17, 2017, 12:16:25 PM
to tesseract-ocr
I suspect you might make some progress with this, especially if the areas in each image are consistent. Then you know where there will be only numbers, and so on. 
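
As a rough sketch of that idea, assuming the table on each scanned form is a regular grid at a known position: the function below computes one crop box per cell, so that each cell can be recognized on its own as a numbers-only region. The function name and the pixel values are made up for illustration and are not measured from the Weather Rescue images.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels, origin at the top-left of the image
Box = Tuple[int, int, int, int]


def grid_cells(left: int, top: int, cell_w: int, cell_h: int,
               rows: int, cols: int) -> List[List[Box]]:
    """Return crop boxes for a regular table grid, row by row."""
    cells = []
    for r in range(rows):
        row = []
        for c in range(cols):
            x0 = left + c * cell_w
            y0 = top + r * cell_h
            row.append((x0, y0, x0 + cell_w, y0 + cell_h))
        cells.append(row)
    return cells


# Hypothetical layout: the table starts at (120, 300), cells are 80x40 px.
boxes = grid_cells(left=120, top=300, cell_w=80, cell_h=40, rows=2, cols=3)
print(boxes[0][0])  # (120, 300, 200, 340)
print(boxes[1][2])  # (280, 340, 360, 380)
```

Each box uses the (left, upper, right, lower) order that image libraries such as Pillow accept for cropping, so the cells could be cut out and passed to the OCR engine one at a time. Real scans are rarely aligned this well, so some deskewing or per-page offset would probably be needed first.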

7tonin

Dec 17, 2017, 5:01:40 PM
to tesseract-ocr
I managed to train Tesseract a little, but the results are poor. Let me explain what I did; something must be wrong.

  • I used 3 images like the ones linked above, cropped to the handwritten numbers.
  • I ran QT Box Editor with the default eng language to get the initial boxes, then edited them to correct the guessed value of each character.
  • I then trained following this recipe (http://www.tuxrincon.com/blog/training-tesseract-ocr/), using its scripts train.sh and train2.sh, customized for my language "num" (for numbers) and my font "hw1" (for handwritten).
  • This produced a num.traineddata, which I copied to /usr/share/tessdata/ (it is small, 113 KB, compared with 21 MB for eng.traineddata).
  • When I run the command
        tesseract test-image.jpeg test1 -l num
    nothing is recognized.
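
One thing that can be checked before training is the edited box file itself. The sketch below assumes the standard Tesseract 3 box format, one glyph per line as `<glyph> <left> <bottom> <right> <top> <page>` with the origin at the bottom-left of the image, and flags lines whose glyph is not a digit or whose coordinates are inverted. The helper names and the sample data are hypothetical; this is not part of the recipe linked above.

```python
from typing import List, Tuple

BoxEntry = Tuple[str, int, int, int, int, int]


def parse_box_line(line: str) -> BoxEntry:
    """Parse one box line: <glyph> <left> <bottom> <right> <top> <page>."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    glyph = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return glyph, left, bottom, right, top, page


def find_problems(box_text: str, allowed: str = "0123456789") -> List[str]:
    """Return a description of every suspicious line in a box file."""
    allowed_glyphs = set(allowed)
    problems = []
    for n, line in enumerate(box_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            glyph, left, bottom, right, top, _page = parse_box_line(line)
        except ValueError as exc:
            problems.append(f"line {n}: {exc}")
            continue
        if glyph not in allowed_glyphs:
            problems.append(f"line {n}: unexpected glyph {glyph!r}")
        # Bottom-left origin: right must exceed left, top must exceed bottom.
        if right <= left or top <= bottom:
            problems.append(f"line {n}: empty or inverted box")
    return problems


# Made-up sample: line 2 has a letter, line 3 has top below bottom.
sample = "7 10 12 30 50 0\nl 35 12 50 50 0\n3 60 40 80 12 0\n"
for problem in find_problems(sample):
    print(problem)
```

If the data includes decimal points or minus signs, they can be added through the `allowed` parameter, for example `find_problems(text, allowed="0123456789.-")`.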

7tonin

Dec 17, 2017, 6:07:50 PM
to tesseract-ocr
I got it. The problem came from script errors on my system (Linux Mageia 6);
see the training scripts attached.

The results are better. Some difficulties remain, such as italic handwriting, but that's another subject.

train.sh
train2.sh