OCRs produced by Tesseract differ wildly in size


ArtmanDC

Mar 21, 2022, 3:24:34 PM
to tesseract-ocr
I am working on a project that involves turning text pages from scanned microfilm into searchable PDFs.

My workflow is like this:

(1) Import raw scan images (*.tif) into Abbyy FineReader v. 12 Professional for some basic image editing including split, deskew, rough crop, and some visual cleanup e.g. microfilm dust. Export as multipage .tif. (Most documents are 2 or 3 pages; a small percentage are 7-8 pages.)
(2) Import edited images to Irfanview 4.58 for further editing, normally as follows:
   (a) auto crop borders (ctrl-shift-Y)
   (b) change canvas size (shift-V) using Method 1 to set top and left margins and then Method 2 to pad the right and bottom margins to achieve standard starting corner and page size.
   (c) light editing to clean up any stray marks (copy/paste white background color to mask marks).
   (d) repeat as necessary for subsequent pages. NOTE: As far as I can tell, changes in multipage tif files have to be saved individually in IrfanView or changes will be lost when moving to another page.
(3) Run edited tif file through Tesseract v5.0.1.20220118 using this format on the Windows 10 command line:   tesseract input.tif input pdf --psm 4
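
For anyone who wants to repeat step (3) over a folder of files, here is a minimal sketch in Python. It only wraps the same command line shown above; the function names `tesseract_command` and `ocr_folder` are made up for illustration, and it assumes `tesseract` is on the PATH.

```python
import subprocess
from pathlib import Path


def tesseract_command(tif_path, psm=4):
    """Build the argument list for: tesseract input.tif input pdf --psm 4"""
    tif = Path(tif_path)
    # Tesseract appends ".pdf" to the output base name itself.
    output_base = tif.with_suffix("")
    return ["tesseract", str(tif), str(output_base), "pdf", "--psm", str(psm)]


def ocr_folder(folder):
    """Run the command once for every .tif file in the folder (assumed layout)."""
    for tif in sorted(Path(folder).glob("*.tif")):
        subprocess.run(tesseract_command(tif), check=True)
```

Calling `ocr_folder("scans")` would then write one PDF next to each tif in a hypothetical `scans` folder.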

The resulting PDF files were as expected, except for the size relative to the input tif files.

The input files were both two pages and approximately the same size: 3,296 characters for 56143 and 3,194 for 56145.

56143.pdf   998k (2.7 times the size of the tif file)
56143.tif   369k
56145.pdf    94k (half the size of the tif file)
56145.tif   206k

I'm not terribly concerned about reducing the PDF file sizes, but I'm just baffled by why the PDF file size seems to have no relation to the input file size.

I don't know if this is really a Tesseract issue, but since that is the software that actually generated the PDF, I thought this would be a good place to start.

Thanks,
Art in Northern Virginia



Zdenko Podobny

Mar 22, 2022, 1:39:27 AM
to tesser...@googlegroups.com
Can you provide an example tif file? 

Zdenko



Art Chimes

Mar 22, 2022, 5:23:30 PM
to tesser...@googlegroups.com, Zdenko Podobny
I have uploaded the relevant files to the Internet Archive, where my
project is housed.

My previous post shortened the file names, as you will see.
In the shaded "DOWNLOAD OPTIONS" box, scroll down to "SHOW ALL" and
click to find the pdf and tif versions.

https://archive.org/details/issues-at-the-u.n.-general-assembly-voa-radio-script
https://archive.org/details/berlin-warnings-voa-radio-script

Thanks for any help you can provide,
Art in Northern Virginia (USA)

Zdenko Podobny

Mar 28, 2022, 1:18:12 PM
to Art Chimes, tesser...@googlegroups.com
>tiffinfo 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif
TIFF Directory at offset 0x264004 (40744)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 8
  Compression Scheme: LZW
  Photometric Interpretation: RGB color
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 3
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-1
  Predictor: horizontal differencing 2 (0x2)
TIFF Directory at offset 0x378168 (5c538)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-2

>tiffinfo 19780919-backgrounder56145-berlin_warnings-bill_marsh.tif
TIFF Directory at offset 0x108282 (1a6fa)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-1
TIFF Directory at offset 0x211720 (33b08)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-2

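As you can see, the problem is the image format of the first page in 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif: it is stored as RGB color (Bits/Sample: 8, Samples/Pixel: 3), while its second page and both pages of the other file are 1-bit (min-is-white). If you convert the first page to Bits/Sample: 1 (2-color mode), you get output similar to that of the second file (the converted copy has a trailing underscore in its name):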
>ls -l 1978*
-rw-r--r-- 1 user 197121  378410 Mar 28 18:57 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif
-rw-r--r-- 1 user 197121 1021177 Mar 28 19:00 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif.pdf
-rw-r--r-- 1 user 197121  218066 Mar 28 19:10 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini_.tif
-rw-r--r-- 1 user 197121   99990 Mar 28 19:11 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini_.tif.pdf
-rw-r--r-- 1 user 197121  211962 Mar 28 18:57 19780919-backgrounder56145-berlin_warnings-bill_marsh.tif
-rw-r--r-- 1 user 197121   95886 Mar 28 19:00 19780919-backgrounder56145-berlin_warnings-bill_marsh.tif.pdf
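
A rough sketch of the arithmetic behind that difference, in Python, using only the page dimensions and bit depths that tiffinfo reports. The function name `raw_page_bytes` is made up for illustration, and the numbers are uncompressed pixel data; what actually ends up in a TIFF or PDF depends on the compression used.

```python
def raw_page_bytes(width, height, bits_per_sample, samples_per_pixel):
    """Uncompressed size of one page, with each row padded to a whole byte."""
    bits_per_row = width * bits_per_sample * samples_per_pixel
    bytes_per_row = (bits_per_row + 7) // 8  # round up to full bytes
    return bytes_per_row * height


WIDTH, HEIGHT = 2805, 3630  # from the tiffinfo output

rgb_page = raw_page_bytes(WIDTH, HEIGHT, bits_per_sample=8, samples_per_pixel=3)
bilevel_page = raw_page_bytes(WIDTH, HEIGHT, bits_per_sample=1, samples_per_pixel=1)

print("RGB page:  ", rgb_page, "bytes")
print("1-bit page:", bilevel_page, "bytes")
print("ratio:     ", round(rgb_page / bilevel_page, 1))
```

So before any compression, the RGB page carries roughly 24 times as much pixel data as a 1-bit page of the same dimensions.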


> tiffinfo 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini_.tif
TIFF Directory at offset 0x103678 (194fe)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-1
TIFF Directory at offset 0x217824 (352e0)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-2
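
If tiffinfo is not at hand, a stray color page could probably be spotted with a few lines of Python before running Tesseract. This is a minimal sketch using only the standard library, not a full TIFF parser: `page_depths` is a hypothetical helper, it handles classic TIFF only (not BigTIFF), and it reads just the first BitsPerSample value of each page.

```python
import struct

TAG_BITS_PER_SAMPLE = 258    # baseline TIFF tag numbers
TAG_SAMPLES_PER_PIXEL = 277
TYPE_SHORT = 3               # 16-bit unsigned field type


def page_depths(data):
    """Return [(bits_per_sample, samples_per_pixel), ...], one tuple per page."""
    byte_order = {b"II": "<", b"MM": ">"}.get(bytes(data[:2]))
    if byte_order is None:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack_from(byte_order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file (BigTIFF is not handled)")

    pages = []
    while offset:  # an offset of 0 marks the end of the directory chain
        (entry_count,) = struct.unpack_from(byte_order + "H", data, offset)
        bits, samples = 1, 1  # TIFF defaults when a tag is absent
        for i in range(entry_count):
            entry = offset + 2 + 12 * i  # each directory entry is 12 bytes
            tag, field_type, count = struct.unpack_from(
                byte_order + "HHI", data, entry
            )
            if field_type != TYPE_SHORT:
                continue
            if tag not in (TAG_BITS_PER_SAMPLE, TAG_SAMPLES_PER_PIXEL):
                continue
            # Up to two SHORT values are stored inline in the entry;
            # longer arrays are stored elsewhere and the entry holds an offset.
            value_pos = entry + 8
            if count > 2:
                (value_pos,) = struct.unpack_from(byte_order + "I", data, value_pos)
            (value,) = struct.unpack_from(byte_order + "H", data, value_pos)
            if tag == TAG_BITS_PER_SAMPLE:
                bits = value
            else:
                samples = value
        pages.append((bits, samples))
        (offset,) = struct.unpack_from(
            byte_order + "I", data, offset + 2 + 12 * entry_count
        )
    return pages
```

It takes the file contents as bytes, for example `page_depths(open(path, "rb").read())`. If it works as intended, any page that does not come back as `(1, 1)` is a candidate for conversion to 2-color mode.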



Zdenko

