Warning. Invalid resolution 0 dpi. Using 70 instead.


Naomi

Jan 18, 2019, 2:51:49 PM
to tesseract-ocr
I understand this question is asked a lot, but I'm getting 

Warning. Invalid resolution 0 dpi. Using 70 instead.

when I set --psm to 0.

I used ImageMagick to convert the PDF to a tif, and as required, I did set the units and density:

convert -density 300 -units PixelsPerCentimeter InputPdf.pdf -depth 8 -strip -background white -alpha off file.tiff

From there, I try to run tesseract as follows:

tesseract file.tiff searchable-pdf -l eng --psm 0 pdf tesseract_parsley_config.txt 


This produces the warning above. Running `magick identify` produces the following, which confirms that the metadata is set.

  Format: TIFF (Tagged Image File Format)
  Mime type: image/tiff
  Class: DirectClass
  Geometry: 2550x3300+0+0
  Resolution: 300x300
  Print size: 8.5x11
  Units: PixelsPerCentimeter


How can I remove the error?
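
A note on the units, offered as a hedged sketch and not as a confirmed diagnosis: the `convert` command above tags the density as 300 pixels per centimeter, yet a print size of 8.5x11 for 2550x3300 pixels only works out if 300 is read as pixels per inch. Whether this mismatch has anything to do with the warning is not established in this thread. The conversion between the two units is a factor of 2.54; the function names below are made up for illustration.

```python
CM_PER_INCH = 2.54  # exact by definition of the inch


def ppcm_to_dpi(pixels_per_cm: float) -> float:
    """Convert a density in pixels per centimeter to dots (pixels) per inch."""
    return pixels_per_cm * CM_PER_INCH


def print_size_inches(width_px: int, height_px: int, dpi: float) -> tuple:
    """Return the (width, height) in inches of an image at the given dpi."""
    return (width_px / dpi, height_px / dpi)


if __name__ == "__main__":
    # Density of 300 interpreted as pixels per centimeter, expressed in dpi.
    dpi_if_cm = ppcm_to_dpi(300)
    print("300 px/cm in dpi:", dpi_if_cm)

    # Page size if 300 means pixels per inch (matches the reported 8.5x11).
    print("size at 300 dpi:", print_size_inches(2550, 3300, 300))

    # Page size if 300 really means pixels per centimeter.
    print("size at 300 px/cm:", print_size_inches(2550, 3300, dpi_if_cm))
```

If the units do turn out to matter, ImageMagick also accepts `-units PixelsPerInch`, which would make the tag agree with the 8.5x11 print size; that is an assumption worth testing, not a verified fix.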

Zdenko Podobny

Jan 18, 2019, 2:54:54 PM
to tesser...@googlegroups.com
Please provide a test file and info about your tesseract version.

Zdenko



Naomi

Jan 18, 2019, 3:39:01 PM
to tesseract-ocr

tesseract -v

tesseract 4.0.0
 leptonica-1.77.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE


Unfortunately I can't provide the file publicly, but I can look into any specific metadata needed.

Naomi

Jan 18, 2019, 4:05:33 PM
to tesseract-ocr
Additionally, I'm getting somewhat poor output on the following scan. You can see that Tesseract is binarizing it but leaving a lot of black pixels. Is there a way to remove those stray pixels from the background?

Before:

original.png

After Tesseract tries to pre-process:


preprocessedbytesseract.png

Naomi

Jan 18, 2019, 4:21:58 PM
to tesseract-ocr
Looking again at the image I posted above, I realize the issue isn't the stray pixels but the white-on-black text. The whole document is black-on-white text except for this table header, which is white on black. Tesseract is picking up the black in that region as characters and turning it into gibberish. Does anyone know how I would pre-process the image to invert only the white-on-black text?
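
One possible approach, sketched here under simplifying assumptions and not tested on the scan in question: treat the page as a grid of grayscale values, find the rows that are mostly dark, and invert only those. The sketch assumes the header band spans the full page width and that the pixels have already been loaded into nested lists by some imaging library; `invert_dark_rows` and `dark_threshold` are hypothetical names chosen for illustration.

```python
from typing import List

# Grayscale image as rows of pixels, each pixel 0 (black) to 255 (white).
Image = List[List[int]]


def invert_dark_rows(pixels: Image, dark_threshold: float = 128.0) -> Image:
    """Return a copy of the image with mostly-dark rows inverted.

    A row counts as "mostly dark" when its mean intensity is below
    dark_threshold, which is what a white-on-black header band looks like.
    Rows at or above the threshold are copied unchanged.
    """
    result = []
    for row in pixels:
        if row and sum(row) / len(row) < dark_threshold:
            # Dark background: flip it so the text becomes black on white.
            result.append([255 - value for value in row])
        else:
            result.append(list(row))
    return result


if __name__ == "__main__":
    page = [
        [255, 255, 0, 255, 255],  # body row: black text on white
        [0, 0, 255, 0, 0],        # header row: white text on black
        [255, 0, 255, 255, 255],  # body row: black text on white
    ]
    for row in invert_dark_rows(page):
        print(row)
```

A real table header rarely spans the whole row, so in practice the same test would more likely be applied per detected region or per tile; the row-based version only shows the idea of inverting selectively based on local mean intensity.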