Improving scan quality


Toolforger

Feb 18, 2018, 2:48:10 AM2/18/18
to hugin and other free panoramic software
Hi all,

I am having trouble finding the right commands for my use case, and I found the man pages and the project pages so unspecific that I have to ask.

Use case: I want to scan book pages (lots of them), and eliminate speckles and noise by scanning each page multiple times and "applying the right tools".
Question: What are the right commands to do that?

More details:
It's something like 100,000 pages, so I need something that I can run in batch mode. I am pretty adept at shell scripting and similar, so a combination of commands will work for me, too.

Gunter Königsmann

Feb 18, 2018, 3:21:31 AM2/18/18
to hugi...@googlegroups.com
If you are using Linux you might either use something like xsane to scan all pages as separate images and then scantaylor to automatically postprocess them (and to optionally convert them to PDF) ... Or you can use gscan2pdf. But scantaylor is more powerful.
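For the capture side of such a batch, a minimal sketch using SANE's scanimage; the resolution, colour mode and filename pattern are assumptions to adjust for your scanner and backend:

    # scan one page per RETURN keypress into numbered TIFF files
    scanimage --batch=page-%04d.tif --batch-prompt --format=tiff --mode Color --resolution 300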


Jean-Luc Coulon (f5ibh)

Feb 18, 2018, 5:44:52 AM2/18/18
to hugi...@googlegroups.com
Hi,

BTW, it seems to be scantailor (with an "i") and not scantaylor (with a "y"). At least, that is the case in Debian.

Regards

Jean-Luc

Toolforger

Feb 18, 2018, 6:29:11 AM2/18/18
to hugin and other free panoramic software
Scantailor assumes a single image of each page.
I need a way to have multiple images for each page to average out speckles and noise.

Toolforger

Feb 18, 2018, 6:41:40 AM2/18/18
to hugin and other free panoramic software
My actual problem is that the hugin suite alone offers at least a dozen commands (align_image_stack, cpclean, deghosting_mask, geocpset, hugin_hdrm, pto_merge, verdandi, and whatnot), plus there's panotools (with a big overlap, because hugin uses panotools).
And I don't know which of these are relevant. Of those that are relevant, I don't know which are redundant and which complement each other. (E.g. align_image_stack would be pretty near to what I need, but it assumes all images are the same size, which I don't know how to ensure in an automated manner, and which is also an indicator that align_image_stack isn't actually what I want.)
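As a side note on the same-size requirement, a minimal sketch of one way to work around it, assuming ImageMagick; the 2600x3600 pixel canvas is a placeholder that merely has to be larger than every 300 dpi scan of a page:

    # pad every scan onto the same white canvas so align_image_stack accepts them
    for f in scan-*.tif; do
        convert "$f" -background white -gravity center -extent 2600x3600 "padded-$f"
    done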

Gunter Königsmann

Feb 18, 2018, 1:03:14 PM2/18/18
to hugi...@googlegroups.com, Toolforger
git://g...@github.com:mpetroff/stitch-scanned-images.git combines many
small scanned images to a large one. But it does blend from one image to
the other, not average over them.

I have no experience with whether that would be possible, either...

Toolforger

Feb 18, 2018, 5:32:59 PM2/18/18
to hugin and other free panoramic software
Heh. Blending is a form of averaging, so Matthew's approach should probably work.
What I'd like to know is whether his approach is the best one for my use case. There might be superfluous steps, or steps that would work better with a different tool or different parameters. Not knowing the full tool suite means I have too many options to explore, and I'd like to do something that's at least mildly competent within, say, the next four weeks or so - I could spend the next six months trying out tool combinations, and would probably get results, but I hate being clueless about the overall sanity of an approach.
Besides, it's entirely possible that there's a ready-made, competently optimized tool out there that combines cpfind, autooptimiser, etc. for me.

BTW, Matthew's work again showed me that I don't even know all the relevant tools. Nona sounds highly interesting, but I wasn't aware that it even existed.

Gunter Königsmann

Feb 19, 2018, 12:32:09 AM2/19/18
to hugi...@googlegroups.com
The problem is that hugin defaults to blending between images only where necessary, and to trying to blend in places where this doesn't tend to produce visual artefacts, for example at sharp edges. There are many options, though. Perhaps trying to make an HDR image is what you want to do, kind of...


bugbear

Feb 19, 2018, 4:08:11 AM2/19/18
to hugi...@googlegroups.com
Toolforger wrote:
> Scantailor assumes a single image of each page.
> I need a way to have multiple images for each page to average out speckles and noise.

If there are speckles on the page, averaging won't get rid of them.

If there is noise in the scan, using a longer exposure will most likely
eliminate it at source.

BugBear

Toolforger

Feb 19, 2018, 4:51:41 AM2/19/18
to hugin and other free panoramic software
The number of noisy & speckled pixels won't get smaller by averaging, but the amplitude will average out a bit, and I suspect that Huffman encoding will work better with that. Admittedly, I haven't tested how much of an effect I'll get with that; OTOH I see people recommending that approach.
The other option would be taking the median. Whatever works best *shrug*.
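For experimenting with average versus median, a minimal sketch assuming ImageMagick and scans that are already aligned; swap Median for Mean to compare the two:

    # collapse a stack of aligned scans into one image, pixel by pixel
    convert scan-1.tif scan-2.tif scan-3.tif -evaluate-sequence Median combined.tif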

Longer exposure isn't something I can control with a scanner...

... anyway: I'd really prefer if I got some guidance about the existing tools, instead of attempts at talking me out of the approach I'm currently experimenting with.
Most tools come with a single page of man text that tells me fairly little, and nothing about what they actually do; did I just miss the pages with details about what they do and what each option does, or are image processing tools generally intended to be used in an experiment-and-see fashion? Because that multitude of tools and options is overwhelming, and I don't want to spend months trying to find out what the strengths and limits of each tool and each option are...

T. Modes

Feb 19, 2018, 1:41:29 PM2/19/18
to hugin and other free panoramic software


On Monday, 19 February 2018 at 10:51:41 UTC+1, Toolforger wrote:
... anyway: I'd really prefer if I got some guidance about the existing tools, instead of attempts at talking me out of the approach I'm currently experimenting with.
You need to give more detail. You mention only average, average, average…
But what is the input? A bigger page scanned in several runs that you want to stitch together?
Or did you scan the same page again and again? Are the images already aligned in that case, or do you want to align them?
And what do you want as output? An average image of all the overlapping images?
Otherwise nobody can help you besides yourself.

Toolforger

Feb 19, 2018, 3:38:26 PM2/19/18
to hugin and other free panoramic software
Correct, I scan each page multiple times.
The idea is to align the scans, combine them in a way that makes use of the added redundancy to reduce noise and speckles.
I am pretty open about the combining algorithm. Average, median, something else, actually I do not care *that* much as long as the results are better. My current (pretty incomplete) knowledge indicates that median might give more accurate results, but I'm willing to experiment here.

The background is that I'm scanning my books, for going paperless. Well, paper-frugal, some books will stay :-)
The scanning will be destructive. I want/need to shed the weight and volume of all that paper.
I cannot go crazy with storage, the NAS size is somewhat limited. 300 dpi TIFF, compressed with the right PNG settings, will fit. 600 dpi will fit only if I do lossy compression, and I suspect it's not going to have any more information than the 300dpi image so I didn't pursue that option further.

I also want to keep enough information that if the future comes with improved OCR software, I can take the scans and redo the OCR.
Since lossy compression might throw away exactly those bits of information that an improved OCR would exploit, I am somewhat inclined towards lossless compressors. Good thing that if I scan with 300 dpi and run PNG over the scan, I'll be able to fit everything on the NAS.
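A minimal sketch of that lossless conversion step, assuming ImageMagick; the compression level only trades CPU time for file size, the PNG stays lossless either way:

    # convert a scan to PNG with the strongest zlib setting and without metadata
    convert page.tif -strip -define png:compression-level=9 page.png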

(Sorry for being vague before - if I start with the full specs, nobody reads that wall of text and I get zero answers... no idea how to do that better.)

T. Modes

Feb 20, 2018, 11:35:46 AM2/20/18
to hugin and other free panoramic software


On Monday, 19 February 2018 at 21:38:26 UTC+1, Toolforger wrote:
Correct, I scan each page multiple times.
But do you do it without moving the page? Then the pages should already be aligned. Or do you move the page between consecutive scans? Then you probably need align_image_stack.
 
The idea is to align the scans, combine them in a way that makes use of the added redundancy to reduce noise and speckles.
When the images are aligned, you can use hugin_stacker to get the average image.
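A minimal sketch of that two-step pipeline, assuming one directory per page that holds its repeated scans as scan-*.tif; the exact flags are worth checking against align_image_stack -h and hugin_stacker -h for your Hugin version:

    for page in pages/*/; do
        # write aligned copies as aligned_0000.tif, aligned_0001.tif, ...
        align_image_stack -a "${page}aligned_" "${page}"scan-*.tif
        # stack them; --mode also accepts avg and a few robust variants
        hugin_stacker --mode=median --output="${page}stacked" "${page}"aligned_*.tif
    done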
 

The background is that I'm scanning my books, for going paperless. Well, paper-frugal, some books will stay :-)
The scanning will be destructive. I want/need to shed the weight and volume of all that paper.
Now you are wandering off the subject. Stay on track.
I cannot go crazy with storage, the NAS size is somewhat limited. 300 dpi TIFF, compressed with the right PNG settings, will fit.
A TIFF file with PNG compression? How should this work?
 

Toolforger

Feb 20, 2018, 11:57:08 AM2/20/18
to hugin and other free panoramic software
 
On Monday, 19 February 2018 at 21:38:26 UTC+1, Toolforger wrote:
Correct, I scan each page multiple times.
But do you do it without moving the page? Then the pages should already be aligned. Or do you move the page between consecutive scans? Then you probably need align_image_stack.

I need *something* to align the images, yes. The scanner isn't replicating the exact position.
I tried align_image_stack, but it would refuse to work with images of slightly different sizes. Which I don't really understand, because it's identifying and moving control points, which includes moving some pixels beyond the image boundary. So I'm wondering what it's doing - clipping them?
It's not a big deal because the page margins are white space anyway, but I'd like to understand what it's actually doing. Is it considering the image border to be all control points?

The idea is to align the scans, combine them in a way that makes use of the added redundancy to reduce noise and speckles.
When the images are aligned, you can use hugin_stacker to get the average image.

Good pointer, thanks.
 
The background is that I'm scanning my books, for going paperless. Well, paper-frugal, some books will stay :-)
The scanning will be destructive. I want/need to shed the weight and volume of all that paper.
Now you are wandering off the subject. Stay on track.

Gimme a break!
First you complain it's too vague, now I'm providing background and I'm too off-topic for your taste...
 
I cannot go crazy with storage, the NAS size is somewhat limited. 300 dpi TIFF, compressed with the right PNG settings, will fit.
A TIFF file with PNG compression? How should this work?

Ah, the joys of too much editing.
I tested with PNG and found that 300 dpi with the right settings are small enough. TIFF with one of its compression modes may work, too, but I know how to convert TIFF to PNG so it's pretty much a solved problem - but there's the constraint that I probably cannot go above 300 dpi, storage-wise, which narrows the options a bit.

T. Modes

Feb 20, 2018, 12:32:52 PM2/20/18
to hugin and other free panoramic software


On Tuesday, 20 February 2018 at 17:57:08 UTC+1, Toolforger wrote:
I need *something* to align the images, yes. The scanner isn't replicating the exact position.
That was the missing information.
I tried align_image_stack, but it would refuse to work with images of slightly different sizes. Which I don't really understand, because it's identifying and moving control points, which includes moving some pixels beyond the image boundary. So I'm wondering what it's doing - clipping them?
Sorry, but control points are not moved. Run align_image_stack with the -p parameter and open the pto file in Hugin to see the control points.
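A minimal sketch of that inspection step; the file names are placeholders:

    # -p only writes the project file, no remapped images
    align_image_stack -p stack.pto scan-1.tif scan-2.tif
    # open it in the GUI and look at the control points
    hugin stack.pto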
 
It's not a big deal because the page margins are white space anyway, but I'd like to understand what it's actually doing. Is it considering the image border to be all control points?
A point is a point and not a border.

A TIFF file with PNG compression? How should this work?

Ah, the joys of too much editing.
I tested with PNG and found that 300 dpi with the right settings are small enough. TIFF with one of its compression modes may work, too,
Sorry, but PNG and 300 dpi resolution have nothing in common. A JPEG and a TIFF or a BMP can also have 300 dpi resolution. It seems you mix a lot up.
 

Toolforger

Feb 20, 2018, 2:57:43 PM2/20/18
to hugin and other free panoramic software

I need *something* to align the images, yes. The scanner isn't replicating the exact position.
That was the missing information.
I tried align_image_stack, but it would refuse to work with images of slightly different sizes. Which I don't really understand, because it's identifying and moving control points, which includes moving some pixels beyond the image boundary. So I'm wondering what it's doing - clipping them?
Sorry, but control points are not moved. Run align_image_stack with the -p parameter and open the pto file in Hugin to see the control points.

What does align_image_stack do, then?
 
It's not a big deal because the page margins are white space anyway, but I'd like to understand what it's actually doing. Is it considering the image border to be all control points?
A point is a point and not a border

I said a border might be a set of points.
Not that a point is a border.
 
A TIFF file with PNG compression? How should this work?

Ah, the joys of too much editing.
I tested with PNG and found that 300 dpi with the right settings are small enough. TIFF with one of its compression modes may work, too,
Sorry, but PNG and 300 dpi resolution have nothing in common. A JPEG and a TIFF or a BMP can also have 300 dpi resolution.
 
Sorry, but I was talking about the dpi of the scanner.
Plus I wasn't even thinking that dpi is an exclusive domain of PNG, that would be outright silly.

It seems you mix a lot up.
 
It seems you assume a lot about my level of knowledge.
If something is unclear to you, please ask; don't just assume. Assumptions tend to be off the mark, and having to deflect them tends to create an antagonistic mood. A cooperative mood is much better at clearing up misunderstandings, and much more fun for both sides.

Rogier Wolff

Feb 21, 2018, 9:45:59 AM2/21/18
to hugi...@googlegroups.com
Well, I do understand that after a bunch of non-information and
wrong writing, you start reading precisely what is written and
nitpicking everything.

But in this case, it is pretty clear what is meant: when scanning
the book at 300 DPI, the resulting PNG is acceptable in (on-disk) size.

PNG claims "lossless" compression. The question is: Is that relevant?
If, say, you scan at 600 DPI and use a high-enough-quality JPG
compression, I would expect that you can get better quality at fewer
bits on disk...

Roger.

--
+-- Rogier Wolff -- www.harddisk-recovery.nl -- 0800 220 20 20 --

Gunter Königsmann

Feb 21, 2018, 10:39:25 AM2/21/18
to hugi...@googlegroups.com
Normally, if you set the white level low enough that the paper is "white" and the black level high enough that the letters are completely black, and if you set the scan to 1-bit color depth and tell scantailor to suppress all speckles that are less than 4 pixels wide, then PNG compression should result in smaller files than a JPEG compression that is lossy enough to produce ringing at every change from black to white and vice versa.
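A minimal sketch of that thresholding step, assuming ImageMagick rather than scantailor; the 60% threshold is a placeholder to tune until the paper goes white while the glyphs stay solid:

    # grayscale, despeckle, then force to 1-bit black and white
    convert page.tif -colorspace Gray -despeckle -threshold 60% -type bilevel page-bw.png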

JPEG isn't very well suited for black-and-white anyway: it first splits the image into colour and brightness, hoping that only one of them will contain much data, then optionally leaves out 50 or 75% of the colour information samples - in the end only 1 of 4 camera pixels sees red or blue. Then it does a Fourier transform and leaves out the high frequencies: they are only needed for sharp edges, and 99% of a photo isn't sharp. And then it uses Huffman coding, which makes small numbers shorter than long ones. Most of these steps aren't ideal for text-only pages.
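To make those JPEG steps concrete, a minimal sketch with ImageMagick; 4:2:0 keeps only one in four colour samples, and the quality setting controls how coarsely the high-frequency coefficients are quantised (both values are arbitrary examples):

    convert page.tif -sampling-factor 4:2:0 -quality 75 page.jpg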

Kind regards,

  Gunter.


John Muccigrosso

Feb 21, 2018, 9:53:18 PM2/21/18
to hugin and other free panoramic software
Of course it depends on what kinds of texts you're looking at, but my experience with text-only academic articles/books and 240-300 dpi scans is that with a little clean-up, OCR is very good already. I'd recommend playing with your scanner settings up front to minimize background noise and make sure your text is mostly black or nearly black pixels. Save to tiff or png, if you can. You're right that jpeg will hurt.

(I even have some scripts that use ImageMagick to clean the scans, and I'm not the only one. :-) : https://github.com/Jmuccigr/scripts )
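A minimal clean-up sketch in the same spirit, assuming ImageMagick; the level endpoints and the deskew threshold are assumptions to tune per scanner, not values taken from those scripts:

    # push the background towards white, darken the text, straighten the page
    convert scan.tif -colorspace Gray -level 10%,90% -deskew 40% cleaned.png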

David W. Jones

Feb 21, 2018, 11:05:11 PM2/21/18
to hugi...@googlegroups.com
One situation I encountered: I got terrible OCR results when I scanned at 600dpi, great results on the same pages scanned at 300dpi. They were clean pages, though, so I didn't have to do any cleaning.




David W. Jones
gnome...@gmail.com
wandering the landscape of god
http://dancingtreefrog.com

Sent from my Android device with F/LOSS K-9 Mail.

Gunter Königsmann

Feb 21, 2018, 11:39:11 PM2/21/18
to David W. Jones, hugi...@googlegroups.com


On 22.02.2018 at 05:04, David W. Jones wrote:
> One situation I encountered: I got terrible OCR results when I scanned
> at 600dpi, great results on the same pages scanned at 300dpi. They were
> clean pages, though, so I didn't have to do any cleaning.
>

Once I had the impression that some OCRs don't believe letters can be
this big [measured in pixels] if the scan is at a high enough resolution.

David W. Jones

Feb 22, 2018, 3:58:25 AM2/22/18
to hugi...@googlegroups.com
I think that's true. Or maybe too much detail in the letterform confuses
them.


Joachim Durchholz

Feb 23, 2018, 3:44:01 AM2/23/18
to hugi...@googlegroups.com
>> One situation I encountered: I got terrible OCR results when I scanned
>> at 600dpi, great results on the same pages scanned at 300dpi. They were
>> clean pages, though, so I didn't have to do any cleaning.
>
> Once I had the impression that some OCRs don't believe letters can be
> this big [measured in pixels] if the scan is at a high enough resolution.

Tesseract explicitly says that it is geared towards 300 dpi scans.
For other OCR programs, the effect seems similar.

I suspect they're matching raster patterns, not outlines.
Probably because they want to cover 1-bit scans, since that's what
you get from a fax, but I'm just guessing.

Joachim Durchholz

Feb 23, 2018, 3:44:01 AM2/23/18
to hugi...@googlegroups.com
> PNG claims "lossless" compression. The question is: Is that relevant?
> If, say, you scan at 600 DPI and use a high-enough-quality JPG
> compression, I would expect that you can get better quality at fewer
> bits on disk...

I did a quick smoke test of that kind of hypothesis: how well does
tesseract fare with JPG-compressed images?
It turned out to make MANY more errors, even though the human eye
wouldn't see any difference. I guess the artifacts are throwing off
tesseract's algorithms.
I did a bit of experimentation to see at what level of compression
serious OCR quality starts to get lost, and found that I'd need a
setting that gives me not much better compression than PNG, so I
thought "screw it, stick with the original bits, at least I don't
lose info that way".

This was with JPGs from 300-dpi scans.
I haven't tried with 600 dpi because tesseract docs tell me that it's
geared towards 300 dpi scans, and ebook docs tell me that everything is
preconfigured for 300 dpi scans as well.
I might still try and check what I can get out of a 600-dpi scan OCR-wise.
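A minimal sketch of that comparison, assuming tesseract and ImageMagick are installed; quality 85 is an arbitrary starting point:

    convert page.png -quality 85 page.jpg
    tesseract page.png out-png      # writes out-png.txt
    tesseract page.jpg out-jpg      # writes out-jpg.txt
    diff out-png.txt out-jpg.txt    # eyeball the OCR differences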

Joachim Durchholz

Feb 23, 2018, 3:44:01 AM2/23/18
to hugi...@googlegroups.com
On 21.02.2018 at 16:39, Gunter Königsmann wrote:
> Normally, if you set the white level low enough that the paper is
> "white" and the black level high enough that the letters are completely
> black, and if you set the scan to 1-bit color depth and tell scantailor
> to suppress all speckles that are less than 4 pixels wide, then PNG
> compression should result in smaller files than a JPEG compression that
> is lossy enough to produce ringing at every change from black to white
> and vice versa.

I have considered such an approach.
The issue is that not everything in the books is text; there's the
occasional image. But even those few pages would force me to check
each page manually to decide whether it needs to be stored at full
resolution or with different compression options. Since I have a
six-digit number of pages to look at, that would be a *lot* of work
for a sub-percent fraction of the pages.

The other consideration is that I want to keep my OCR options. Maybe
some future OCR suite is more accurate than Tesseract, but exploits
exactly the kind of redundancy that JPG kills with its artifacts.
So I do have a preference for lossless compression. It's already giving
me a 50-60% compression ratio, and squeezing out more with JPG starts
generating visible artifacts, so it's fine that way.
I think :-)