
Bug#1003089: man-db has become prohibitively slow


Steinar H. Gunderson, Jan 3, 2022, 3:00:04 PM
Package: man-db
Version: 2.9.4-2
Severity: normal
Tags: upstream

Hi Colin,

I've noticed that during upgrades to bullseye, the man-db trigger now seems to
take a very long time; in fact, it frequently accounts for a significant
fraction of the total time of the full upgrade (depending, of course, on
network speeds).

After digging a bit, I think I've figured out why I haven't seen this before;
it's become a _lot_ slower recently, so it wasn't on the radar before. It seems
that due to a combination of the architecture chosen for these operations
(a pipeline system, seemingly forking off subprocesses and running lots of
syscalls in the process) and a new sandbox model (based on seccomp, which needs
a lot of setup for each new subprocess and adds considerable overhead to each
syscall), most of the total time is now spent in pure setup overhead.

As an example, a complete mandb run (mandb -c) on my laptop takes 3 minutes and
39 seconds. (Of course, the man-db trigger is incremental, but on a full-upgrade,
many man pages are likely to be updated.) If I set MAN_DISABLE_SECCOMP=1,
it drops to 1:41. If I recompile without HAVE_LIBSECCOMP, it's down to 51 seconds!
I still honestly think this is slower than it should be for decompressing and
indexing 144 MB of data, but it means that the sandboxing has 329% overhead
on a modern kernel (5.15.0-2-amd64).

I don't honestly know what this database is for, but my guess is that it is
for the apropos command, which seems rarely used. (At least nothing obviously bad
happens to my man usage if I “rm -r /var/cache/man/*”, but apropos stops working.)
Is it possible to either speed up man-db so that it takes less time to build
its database, or otherwise perhaps split apropos out into a separate
not-installed-by-default package, so that normal installations do not need to
take this installation hit? Or maybe move man-db updating to cron?


-- System Information:
Debian Release: 11.2
APT prefers stable-security
APT policy: (500, 'stable-security'), (500, 'stable-debug'), (500, 'proposed-updates'), (500, 'oldoldstable'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.14.0 (SMP w/40 CPU threads)
Locale: LANG=en_DK.UTF-8, LC_CTYPE=en_DK.UTF-8 (charmap=UTF-8), LANGUAGE=en_NO:en_US:en_GB:en
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages man-db depends on:
ii bsdextrautils 2.36.1-8
ii bsdmainutils 12.1.7+nmu3
ii debconf [debconf-2.0] 1.5.77
ii dpkg 1.20.9
ii groff-base 1.22.4-6
ii libc6 2.31-13+deb11u2
ii libgdbm6 1.19-2
ii libpipeline1 1.5.3-1
ii libseccomp2 2.5.1-1+deb11u1
ii zlib1g 1:1.2.11.dfsg-2

man-db recommends no packages.

Versions of packages man-db suggests:
ii apparmor 2.13.6-10
ii chromium [www-browser] 90.0.4430.212-1
ii firefox-esr [www-browser] 91.4.1esr-1~deb11u1
pn groff <none>
ii less 551-2
ii lynx [www-browser] 2.9.0dev.6-3~deb11u1
ii w3m [www-browser] 0.5.3+git20210102-6

-- debconf-show failed

Colin Watson, Jan 3, 2022, 5:40:03 PM
On Mon, Jan 03, 2022 at 08:34:46PM +0100, Steinar H. Gunderson wrote:
> I've noticed that during upgrades to bullseye, the man-db trigger now seems to
> take a very long time; in fact, it frequently seems to be a significant time of
> the total time of the full-upgrade (depending, of course, on network speeds).
>
> After digging a bit, I think I've figured out why I haven't seen this before;
> it's become a _lot_ slower recently, so it wasn't on the radar before. It seems
> that due to a combination of the architecture chosen for these operations
> (a pipeline system, seemingly forking off subprocesses and running lots of
> syscalls in the process)

This part is already covered in https://bugs.debian.org/696503. I admit
that this was last updated ten years ago; I really need to get back to
that and get it to work one way or another. :-/

> and a new sandbox model (based on seccomp, which needs a lot of setup
> for each new subprocess and adds considerable overhead to each
> syscall),

This is an interesting observation. I do think that seccomp is a valuable
defence in general given the amount of rather ad-hoc parsing that's
going on, but I hadn't noticed that it made such a big difference to
mandb performance. (About 3ms per subprocess here purely for setup, I
think.)

The seccomp sandbox is mainly so that man-db can more safely operate on
manual pages shipped by packaging systems that expect to run code under
confinement and thus can ship relatively untrusted code (such as snapd).
If we limited the mandb invocation in the postinst to only
/usr/share/man, we could reasonably assume that all pages there were
installed by dpkg, and those don't need the same level of care since any
package installed by dpkg can run code directly as root anyway. It
would then be reasonable to run it under MAN_DISABLE_SECCOMP=1.

This limitation would make apropos a bit less useful for third-party
packages installed by dpkg that ship manual pages under something like
/opt/*/man, but those are relatively rare, and the existing cron job /
systemd timer will catch up later so it wouldn't be fatal to make such
cases a little less ergonomic.

> As an example, a complete mandb run (mandb -c) on my laptop takes 3 minutes and
> 39 seconds. (Of course, the man-db trigger is incremental, but on a full-upgrade,
> many man pages are likely to be updated.) If I set MAN_DISABLE_SECCOMP=1,
> it drops to 1:41. If I recompile without HAVE_LIBSECCOMP, it's down to 51 seconds!

The difference between MAN_DISABLE_SECCOMP=1 and building without
libseccomp looks like a relatively simple fix, at least:

https://gitlab.com/cjwatson/man-db/-/commit/50200d151dfedb9d5064ec7008c09f6cf0f5ee24

Results on my system for "mandb -c /usr/share/man", roughly confirming
your findings:

status quo: 5m08s
MAN_DISABLE_SECCOMP=1: 2m12s
MAN_DISABLE_SECCOMP=1 plus 50200d151d: 1m17s

> I don't honestly know what this database is for, but my guess is that it is
> for the apropos command, which seems rare. (At least nothing obviously bad
> happens to my man usage if I “rm -r /var/cache/man/*”, but apropos stops working.)
> Is it possible to either speed up man-db so that it takes less time to build
> its database, or otherwise perhaps split apropos out into a separate
> not-installed-by-default package, so that normal installations do not need to
> take this installation hit? Or maybe move man-db updating to cron?

I really don't want to make apropos less useful. I'll have another go
at tackling the core problem here; but the fixes described above should
at least get us back to merely the situation we've had for years, rather
than the (as you say) much worse behaviour recently.

Given #696503, would you mind if I repurposed this bug to be just about
fixing the seccomp-related performance regression? I think the plan
above will let me deal with just that part of it relatively quickly, and
the fixes would be small enough to be viable candidates for stable.

--
Colin Watson (he/him) [cjwa...@debian.org]

Steinar H. Gunderson, Jan 4, 2022, 5:00:03 AM
On Mon, Jan 03, 2022 at 10:34:08PM +0000, Colin Watson wrote:
> This part is already covered in https://bugs.debian.org/696503. I admit
> that this was last updated ten years ago; I really need to get back to
> that and get it to work one way or another. :-/

Ah, well, that wasn't so easy to understand from man-db's bug list. :-)

> This is an interesting observation. I do think that seccomp is valuable
> defence in general given the amount of rather ad-hoc parsing that's
> going on, but I hadn't noticed that it made such a big difference to
> mandb performance. (About 3ms per subprocess here purely for setup, I
> think.)
>
> The seccomp sandbox is mainly so that man-db can more safely operate on
> manual pages shipped by packaging systems that expect to run code under
> confinement and thus can ship relatively untrusted code (such as snapd).
> If we limited the mandb invocation in the postinst to only
> /usr/share/man, we could reasonably assume that all pages there were
> installed by dpkg, and those don't need the same level of care since any
> package installed by dpkg can run code directly as root anyway. It
> would then be reasonable to run it under MAN_DISABLE_SECCOMP=1.

Let me see if I understand the threat model here, to see where you're coming
from. The idea is that the user installs a snap (whose installation and
running is nominally sandboxed), which contains a malicious man page
containing an arbitrary-code-execution exploit for man-db's apropos parser
or similar? That sounds like something that is possible for the postinst to
ignore, yes.

OTOH, if you assume packages can carry malicious man files, is it also part
of the threat model that they could override other packages' documentation
with spam and/or really bad advice (like “run this vulnerable script as
root”)?

TBH man-db sounds fairly safe compared to many other things, though. I've
only skimmed the code, but as I understand it, a typical mandb run sends the
file through gzip and a flex parser? If you find a bug in zlib or in flex,
I would assume there are _much_ juicier targets out there than dropping a
malicious man page into a snap -- so we're really left with the surrounding
code that takes the output of the flex parser. Which isn't really all that
much code. I understand these are the famous last words, of course... :-)

I understand that man(1) itself pushes things through groff, which is a
different story -- but only for the case where you run man setuid.

Of course, rewriting these into a safe language would be the best solution,
but it's probably out-of-scope.

> The difference between MAN_DISABLE_SECCOMP=1 and building without
> libseccomp looks like a relatively simple fix, at least:
>
> https://gitlab.com/cjwatson/man-db/-/commit/50200d151dfedb9d5064ec7008c09f6cf0f5ee24

OK, that's a good thing.

> I really don't want to make apropos less useful. I'll have another go
> at tackling the core problem here; but the fixes described above should
> at least get us back to merely the situation we've had for years, rather
> than the (as you say) much worse behaviour recently.

Yes, it would be really nice to have speedups and not just no regressions. :-)
We keep installing more packages and man pages, so software probably needs to
be made for larger data sets eventually. It's a bit like dpkg -- a lot of the
design decisions made sense with 100 packages and 10k files, but when you get
to 10x or 100x that, it starts becoming a real pain point even if you just
freeze it and don't make it any slower.

I took a look at mandb's profile in perf, and even after turning off
libseccomp, it appears that perhaps 10–11% of its time is spent doing real
work (decompression, lexer, malloc, character set conversion). The rest is
kernel overhead from launching and exiting pipelines, which is a fairly steep
price.

On the surface of it, it's not immediately obvious to me what the pipeline
abstraction is all about. Is there a reason why you can't just open the man
page, decompress it in one shot and then use fmemopen()? (Or similarly,
open_memstream().) If it's bigger than, say, 16 MB uncompressed, you can
probably just exclude it from apropos.

> Given #696503, would you mind if I repurposed this bug to be just about
> fixing the seccomp-related performance regression? I think the plan
> above will let me deal with just that part of it relatively quickly, and
> the fixes would be small enough to be viable candidates for stable.

Sure, that would be fine. It would perhaps be useful to make 696503 a bit
clearer to the casual observer, though; i.e., link it to man-db somehow and
make the title more about performance.

/* Steinar */
--
Homepage: https://www.sesse.net/

Steinar H. Gunderson, Jan 4, 2022, 4:40:03 PM
On Tue, Jan 04, 2022 at 10:28:51PM +0100, Steinar H. Gunderson wrote:
> I made a straw man to test whether this was really true, and it turns out it is.
> See the attached patch,

Now actually attached.
[attachment: horrible-hack.diff]

Steinar H. Gunderson, Jan 4, 2022, 4:40:04 PM
On Tue, Jan 04, 2022 at 10:21:58AM +0100, Steinar H. Gunderson wrote:
> I took a look at mandb's profile in perf, and even after turning off
> libseccomp, it appears that perhaps 10–11% of its time is spent doing real
> work (decompression, lexer, malloc, character set conversion). The rest is
> kernel overhead from launching and exiting pipelines, which is a fairly steep
> price.

I made a straw man to test whether this was really true, and it turns out it is.
See the attached patch, which rips out the pipelines from mandb and replaces
them by simple one-shot decompression buffers; it's by no means something
that should be applied in its current state, but it reliably finds and parses
all 24216 man pages on my system, giving identical behavior (as determined by
“apropos ''”) to the current man-db. But it finishes in 4.7 seconds instead
of 51, so indeed, about 9% of the time, and at a performance point where it
is unlikely to be too painful in a full-upgrade.

Of course, this is an unfair comparison because it does not do encoding
conversion, but most pages do not need that, and in any case, adding it would
be unlikely to get up to more than 10–11%. It also does not do error handling,
arbitrary-size man pages (it uses fixed buffers instead of streaming; again
completely possible to fix, or one could set upper limits), anything
resembling good coding style, or shelling out to col(1). (I believe the
latter is only for parsing cat pages, though, which would seem less relevant
today.)

But it does set a bar that I think should be approachable by mandb, without
any special microoptimization, multithreading, fancy parsing algorithms
or the likes. :-)

Colin Watson, Jan 16, 2022, 11:20:03 PM
Significant progress! See the end of this email.

On Tue, Jan 04, 2022 at 10:21:57AM +0100, Steinar H. Gunderson wrote:
> On Mon, Jan 03, 2022 at 10:34:08PM +0000, Colin Watson wrote:
> > This part is already covered in https://bugs.debian.org/696503. I admit
> > that this was last updated ten years ago; I really need to get back to
> > that and get it to work one way or another. :-/
>
> Ah, well, that wasn't so easy to understand from man-db's bug list. :-)

In fact also https://bugs.debian.org/630799.

> > The seccomp sandbox is mainly so that man-db can more safely operate on
> > manual pages shipped by packaging systems that expect to run code under
> > confinement and thus can ship relatively untrusted code (such as snapd).
> > If we limited the mandb invocation in the postinst to only
> > /usr/share/man, we could reasonably assume that all pages there were
> > installed by dpkg, and those don't need the same level of care since any
> > package installed by dpkg can run code directly as root anyway. It
> > would then be reasonable to run it under MAN_DISABLE_SECCOMP=1.
>
> Let me see if I understand the threat model here, to see where you're coming
> from. The idea is that the user installs a snap (whose installation and
> running is nominally sandboxed), which contains a malicious man page
> containing an arbitrary execution exploit for man-db's apropos parser
> or similar? That sounds like something that is possible for the postinst to
> ignore, yes.
>
> OTOH, if you assume packages can carry malicious man files, is it also part
> of the threat model that they could override other packages' documentation
> with spam and/or really bad advice (like “run this vulnerable script as
> root”)?

No, the threat model is just dealing with arbitrary-code-execution
attacks in the various tools man-db spawns that might be exploitable by
way of bad input. I agree that on the whole this *seems* low-priority,
but parsers are often viable targets.

(I agree that bad-advice type attacks are *possible*, but "man -a"
exists, and other things about at least the snap ecosystem will
generally make it hard to override names shipped by .deb packages, so
I'm not worrying about this rather thornier problem for now.)

> TBH man-db sounds fairly safe compared to many other things, though. I've
> only skimmed the code, but as I understand it, a typical mandb run sends the
> file through gzip and a flex parser? If you find a bug in zlib or in flex,
> I would assume there are _much_ juicier targets out there than dropping a
> malicious man page into a snap -- so we're really left with the surrounding
> code that takes the output of the flex parser. Which isn't really all that
> much code. I understand these are the famous last words, of course... :-)
>
> I understand that man(1) itself pushes things through groff, which is a
> different story -- but only for the case where you run man setuid.

It's still worth defending against arbitrary-code-execution attacks by a
malicious snap in groff even if it doesn't permit set-id escalation -
after all, snaps do promise a degree of sandboxing. (But as you say,
groff parsing isn't relevant to mandb(8).)

> > I really don't want to make apropos less useful. I'll have another go
> > at tackling the core problem here; but the fixes described above should
> > at least get us back to merely the situation we've had for years, rather
> > than the (as you say) much worse behaviour recently.
>
> Yes, it would be really nice to have speedups and not just no regressions. :-)
> We keep installing more packages and man pages, so software probably needs to
> be made for larger data sets eventually. It's a bit like dpkg -- a lot of the
> design decisions made sense with 100 packages and 10k files, but when you get
> to 10x or 100x that, it starts getting a real pain point even if you just
> freeze it and don't make it any slower.
>
> I took a look at mandb's profile in perf, and even after turning off
> libseccomp, it appears that perhaps 10–11% of its time is spent doing real
> work (decompression, lexer, malloc, character set conversion). The rest is
> kernel overhead from launching and exiting pipelines, which is a fairly steep
> price.
>
> On the surface of it, it's not immediately obvious to me what the pipeline
> abstraction is all about. Is there a reason why you can't just open the man
> page, decompress it in one shot and then use fmemopen()? (Or similarly,
> open_memstream().) If it's bigger than, say, 16 MB uncompressed, you can
> probably just kick it from apropos.

The real issue is flexibility. For instance:

* man-db intentionally supports various compression formats without
  needing to link with all the possible decompression libraries out
  there.
* Although mandb(8) currently scans the entire man page to determine
  which filters are applicable, that might be something I can change
  one day (it essentially involves making sure that pages that do weird
  stuff with tbl/eqn/etc. all have an appropriate '\" line at the top -
  that's been documented for long enough so it may in fact be true, but
  it needs checking). A streaming approach is nice for that kind of
  thing.
* The abstraction pays off much more in man(1), where the practical
  effect of a few more subprocesses is fairly minimal (except for a few
  pathological cases in tools such as Lintian), and the number of
  external tools we need to glue together is often quite high. Being
  able to reuse the same code in mandb(8) has kept the bug count down.

But I agree that the current performance cost in mandb(8) is a high
price to pay for that flexibility. Read to the end of this email before
you spend any time arguing with the above. :-)

> > Given #696503, would you mind if I repurposed this bug to be just about
> > fixing the seccomp-related performance regression? I think the plan
> > above will let me deal with just that part of it relatively quickly, and
> > the fixes would be small enough to be viable candidates for stable.
>
> Sure, that would be fine. It would perhaps be useful to make 696503 a bit
> clearer to the casual observer, though; i.e., link it to man-db somehow and
> make the title more about performance.

I think I'll leave that bug alone since #630799 exists and has a clear
enough title.

On Tue, Jan 04, 2022 at 10:28:50PM +0100, Steinar H. Gunderson wrote:
> On Tue, Jan 04, 2022 at 10:21:58AM +0100, Steinar H. Gunderson wrote:
> > I took a look at mandb's profile in perf, and even after turning off
> > libseccomp, it appears that perhaps 10–11% of its time is spent doing real
> > work (decompression, lexer, malloc, character set conversion). The rest is
> > kernel overhead from launching and exiting pipelines, which is a fairly steep
> > price.
>
> I made a straw man to test whether this was really true, and it turns out it is.
> See the attached patch, which rips out the pipelines from mandb and replaces
> them by simple one-shot decompression buffers; it's by no means something
> that should be applied in its current state, but it reliably finds and parses
> all 24216 man pages on my system, giving identical behavior (as determined by
> “apropos ''”) to the current man-db. But it finishes in 4.7 seconds instead
> of 51, so indeed, about 9% of the time, and at a performance point where it
> is unlikely to be too painful in a full-upgrade.
>
> Of course, this is an unfair comparison because it does not do encoding
> conversion, but most pages do not need that, and in any case, adding it would
> be unlikely to get up to more than 10–11%. It also does not do error handling,
> arbitrary-size man pages (it uses fixed buffers instead of streaming; again
> completely possible to fix, or one could set upper limits), anything
> resembling good coding style, or shelling out to col(1). (I believe the
> latter is only for parsing cat pages, though, which would seem less relevant
> today.)
>
> But it does set a bar that I think should be approachable by mandb, without
> any special microoptimization, multithreading, fancy parsing algorithms
> or the likes. :-)

Thanks for your patch.

In the long term, I think the best approach with the neatest
abstractions may be to retrofit multithreading to libpipeline: threading
allows retaining most of the existing logic for pushing data through
subprocesses, and retains all the flexibility. Of course we'd need to
benchmark any such approach to ensure that it actually makes enough of a
difference; also, some of the critical bits of libpipeline need to be
made async-signal-safe first. I think I know roughly how to do this,
but I don't necessarily want to promise to get it done in any particular
amount of time.

So, I'm basically OK with installing an optimization until such time as
I get more blue-sky work done, since I don't want to get in the way
here. Many of the obvious limits of this kind of approach could be
dealt with by treating it as a fast path that we use if conditions
permit: large files, cat pages, etc. could fall back to the pipeline
approach.

We definitely do need to sort out encoding conversion, though. Although
UTF-8 has been recommended for many years, policy still allows "the
usual legacy encoding" and we've never got round to mandating UTF-8:

$ w3m -dump https://lintian.debian.org/tags/national-encoding | grep --count usr/share/man
502

Furthermore, I don't want to do heavily Debian-specific optimizations in
man-db, so even if we were prepared to fix all these cases in Debian as
a prerequisite, it wouldn't really help much. As a result I do think we
need to figure out a way to get manconv to work with this optimized
approach without breaking existing code or (ideally) duplicating the
whole rather fiddly thing.


I've been experimenting with a few things, and I think the most elegant
way to do this is in fact to add another layer of abstraction! This is
a much cheaper one, though. If instead of returning a pipeline we have
decompress_open return a new tagged union type, we can clone the simple
pipeline_{read,peek}* functions on top of that, and convert everything
over to use those rather than talking to libpipeline directly; the
functions that specifically need pipeline support (mainly in man(1), but
also things like the cat page case in lexgrog.l) can ask to drop down to
a lower level.

It's then fairly straightforward to implement the same interface using
an alternative in-memory decompression method, and this is transparent
to calling code that doesn't explicitly ask for the underlying pipeline.
The only minor difficulty is arranging to do in-memory encoding
conversion, but that's just slightly ugly rather than fundamentally
hard.

This all turned out to be remarkably easy. As soon as I fixed all the
compiler errors, "mandb -c" worked first time and full accessdb dumps
showed no differences from the 2.9.4 baseline. I'd made a small mistake
that I had to fix to stop "man -K" crashing, but so far that seems to
have been it.

Would you care to have a look at this?

https://gitlab.com/cjwatson/man-db/-/merge_requests/2

There's probably still room for improvement, but unlikely to be much
more than a factor of two or so at this point, and I think this should
get us comfortably back to the point where it's no longer annoying
people during upgrades.

Steinar H. Gunderson, Jan 17, 2022, 3:20:04 PM
On Mon, Jan 17, 2022 at 04:10:02AM +0000, Colin Watson wrote:
> Significant progress! See the end of this email.

Thanks for dealing with this. I'm deleting most of your text, but rest
assured that I've read it :-) And I understand most of it (although I of
course disagree on some points, I think those are beside the point of this
discussion).

> We definitely do need to sort out encoding conversion, though. Although
> UTF-8 has been recommended for many years, policy still allows "the
> usual legacy encoding" and we've never got round to mandating UTF-8:
>
> $ w3m -dump https://lintian.debian.org/tags/national-encoding | grep --count usr/share/man
> 502

I tried looking at this, but TBH I don't think the man-db path _ever_
inserts a conversion. The parameter to the lexer path is simply always NULL
in all calls, except in a test. Am I missing something?

> I've been experimenting with a few things, and I think the most elegant
> way to do this is in fact to add another layer of abstraction! This is
> a much cheaper one, though. If instead of returning a pipeline we have
> decompress_open return a new tagged union type, we can clone the simple
> pipeline_{read,peek}* functions on top of that, and convert everything
> over to use those rather than talking to libpipeline directly; the
> functions that specifically need pipeline support (mainly in man(1), but
> also things like the cat page case in lexgrog.l) can ask to drop down to
> a lower level.

This sounds fine to me. It's not something I would pick if I maintained the
code myself, but I'm obviously not doing that, and I don't see anything in it
that would preclude optimization. :-)

> Would you care to have a look at this?
>
> https://gitlab.com/cjwatson/man-db/-/merge_requests/2

Thanks! I'll have a go at a review, but it might need to wait until the end
of the week. I'll try to get it done earlier, though. How would you like any
review comments? Email or somehow in Gitlab?

> There's probably still room for improvement, but unlikely to be much
> more than a factor of two or so at this point, and I think this should
> get us comfortably back to the point where it's no longer annoying
> people during upgrades.

Someone else suggested an idea that I thought was worth floating: in addition
to optimizing the code, perhaps the postinst trigger should simply launch the
man-db timer in the background? At least for systemd users, this should be
pretty straightforward, and it would hand control back to the installation
process. (Given that dpkg is very much single-threaded, we're not generally
throughput-bound, so using up a core shouldn't be a big problem.)

Colin Watson, Jan 17, 2022, 6:00:07 PM
On Mon, Jan 17, 2022 at 09:15:14PM +0100, Steinar H. Gunderson wrote:
> On Mon, Jan 17, 2022 at 04:10:02AM +0000, Colin Watson wrote:
> > We definitely do need to sort out encoding conversion, though. Although
> > UTF-8 has been recommended for many years, policy still allows "the
> > usual legacy encoding" and we've never got round to mandating UTF-8:
> >
> > $ w3m -dump https://lintian.debian.org/tags/national-encoding | grep --count usr/share/man
> > 502
>
> I tried looking at this, but TBH I don't think the man-db path _ever_
> inserts a conversion. The parameter to the lexer path is simply always NULL
> in all calls, except for from a test. Am I missing something?

test_manfile (which despite the name is not a test function) calls
find_name with file!="-" and encoding=NULL; that causes find_name to
call get_page_encoding, which always returns something non-NULL
("ISO-8859-1" for English pages), and then call add_manconv from that to
UTF-8.

> > Would you care to have a look at this?
> >
> > https://gitlab.com/cjwatson/man-db/-/merge_requests/2
>
> Thanks! I'll have a go at a review, but it might need to wait until the end
> of the week. I'll try to get it done earlier, though. How would you like any
> review comments? Email or somehow in Gitlab?

Mild preference for GitLab MR comments, but I'm not that fussy.

> > There's probably still room for improvement, but unlikely to be much
> > more than a factor of two or so at this point, and I think this should
> > get us comfortably back to the point where it's no longer annoying
> > people during upgrades.
>
> Someone else suggested an idea I thought of throwing around: In addition
> to optimizing the code, perhaps the postinst trigger should simply launch the
> man-db timer in the background? At least for systemd users, this should be
> pretty straightforward, and give the control back to the installation
> process. (Given that dpkg is very much single-threaded, we're not generally
> throughput-bound, so using up a core shouldn't be a big problem.)

I suppose that's an option; apt-xapian-index.postinst seems to be
precedent for doing things in the background, albeit in a non-systemd
way. Unlike the postinst, man-db.service omits -p/--no-purge, but maybe
that would be OK, especially after this round of optimization. I'll
consider that once I get to packaging these changes.

Steinar H. Gunderson, Jan 18, 2022, 6:10:04 PM
On Mon, Jan 17, 2022 at 10:46:29PM +0000, Colin Watson wrote:
> Mild preference for GitLab MR comments, but I'm not that fussy.

I dropped some comments (I've never used the interface before, but it seemed
to work OK). What I haven't done is to download the patch and test
performance myself; I'll try to get that done soon.

I haven't gone deeply into the buffer management code, and of course I
haven't tried to figure out any style nits.

Steinar H. Gunderson, Jan 20, 2022, 2:30:04 PM
On Mon, Jan 17, 2022 at 10:46:29PM +0000, Colin Watson wrote:
> test_manfile (which despite the name is not a test function) calls
> find_name with file!="-" and encoding=NULL; that causes find_name to
> call get_page_encoding, which always returns something non-NULL
> ("ISO-8859-1" for English pages), and then call add_manconv from that to
> UTF-8.

I think there's a potential bug here; from attaching gdb and breaking on
iconv_open, it seems there's a lot of encoding from UTF-8 to UTF-8, which
should be no-ops (except that it might do some additional well-formedness
checking). Is that intentional?

Apart from that, I've given your patch a quick run, and it seems to cut out
nearly all of the unneeded overhead. So what we're potentially left with
without doing strange things like multithreading or simplifying the lexer,
is:

- Get rid of the unneeded conversions (~15–20% overhead, it seems).
- Launch in the background, potentially.

Does that make sense? In any case, what we have here is already a huge
improvement.

Colin Watson
Jan 21, 2022, 5:00:03 PM

On Thu, Jan 20, 2022 at 08:26:40PM +0100, Steinar H. Gunderson wrote:
> On Mon, Jan 17, 2022 at 10:46:29PM +0000, Colin Watson wrote:
> > test_manfile (which despite the name is not a test function) calls
> > find_name with file!="-" and encoding=NULL; that causes find_name to
> > call get_page_encoding, which always returns something non-NULL
> > ("ISO-8859-1" for English pages), and then call add_manconv from that to
> > UTF-8.
>
> I think there's a potential bug here; from attaching gdb and breaking on
> iconv_open, it seems there are a lot of conversions from UTF-8 to UTF-8,
> which should be no-ops (except that they might do some additional
> well-formedness checking). Is that intentional?

One annoying thing about man pages is that they didn't originally have
an especially well-defined encoding. I've put work in to try to impose
one, but the best I was ever able to do was to say that pages for a
given language must be either in UTF-8 or whatever the typical legacy
encoding for that language is. manconv implements this by doing trial
decoding from UTF-8 to the target encoding (which is always UTF-8 in
mandb(8), and typically but not always UTF-8 in man(1)); if that
succeeds then it's done, but if it fails then it has to go back and try
again with a different source encoding. This is not necessarily
reliable in general, but it's unreliable in the correct direction (any
errors can be fixed by converting the page to UTF-8), and in practice
detection errors seem rare on the sorts of text that show up in man
pages.
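
The trial-decoding strategy described here can be sketched roughly as
follows. This is a toy illustration of the idea only, not man-db's actual
manconv code; the function names (`try_convert`, `guess_encoding`) are
invented for the example:

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Attempt a full conversion of the buffer from 'from' to 'to'.
 * Returns 0 if every byte converts cleanly, -1 otherwise.
 * (Illustrative sketch, not man-db's code.) */
static int try_convert(const char *from, const char *to,
                       const char *in, size_t inlen)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return -1;

    char *inbuf = (char *) in;  /* iconv's API wants a non-const pointer */
    size_t inleft = inlen;
    int ok = 0;

    while (inleft > 0) {
        char out[4096];
        char *outbuf = out;
        size_t outleft = sizeof out;
        size_t r = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
        if (r == (size_t) -1) {
            if (errno == E2BIG)
                continue;       /* output buffer full; discard and go on */
            ok = -1;            /* EILSEQ/EINVAL: input not valid in 'from' */
            break;
        }
    }
    iconv_close(cd);
    return ok;
}

/* First assume the page is UTF-8; if the trial conversion fails,
 * fall back to the legacy encoding for this language. */
static const char *guess_encoding(const char *page, size_t len,
                                  const char *legacy)
{
    if (try_convert("UTF-8", "UTF-8", page, len) == 0)
        return "UTF-8";
    return legacy;
}
```

Note that this is "unreliable in the correct direction" in exactly the
sense above: a page that happens to be valid UTF-8 is always accepted as
UTF-8, so any misdetection is fixed by converting the page to UTF-8.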

So the current behaviour isn't a bug as such, but there's definitely
room for optimization here: when operating in-process, and in the common
case where the target encoding is UTF-8, the UTF-8 to UTF-8 trial
decoding path could be changed to just do a read-only "is this UTF-8"
test rather than effectively copying everything to a new buffer via
iconv. I don't know how much faster that would be, though it seems
likely to be an improvement. It would be some more code, so I didn't
implement it in the first pass at this, but it would be worth a second
pass to see how much it helps.

I'll see if I can make time for this, though I think a reasonable
priority for me is to finish working on your existing MR comments first
and get this ready to land.

> Apart from that, I've given your patch a quick run, and it seems to cut out
> nearly all of the unneeded overhead. So what we're potentially left with
> without doing strange things like multithreading or simplifying the lexer,
> is:
>
> - Get rid of the unneeded conversions (~15–20% overhead, it seems).
> - Launch in the background, potentially.
>
> Does that make sense? In any case, what we have here is already a huge
> improvement.

I think so, and I should be able to get the first pass done this
weekend. Thanks for helping with this!

Steinar H. Gunderson
Jan 21, 2022, 5:50:50 PM

On Fri, Jan 21, 2022 at 09:48:06PM +0000, Colin Watson wrote:
> So the current behaviour isn't a bug as such, but there's definitely
> room for optimization here: when operating in-process, and in the common
> case where the target encoding is UTF-8, the UTF-8 to UTF-8 trial
> decoding path could be changed to just do a read-only "is this UTF-8"
> test rather than effectively copying everything to a new buffer via
> iconv. I don't know how much faster that would be, though it seems
> likely to be an improvement.

Technically, UTF-8 validation can be done at a few gigabytes per second
per core:

https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/

but that is probably overkill. :-)

> I'll see if I can make time for this, though I think a reasonable
> priority for me is to finish working on your existing MR comments first
> and get this ready to land.

Sure, I agree this is a good prioritization. I saw that a new patch set
landed, but I'm not sure whether you wanted me to look at it again yet?
(Fundamentally, though, almost everything I have is style nits; if the
patch went into man-db as-is, I would still be happy about it.)

Colin Watson
Jan 21, 2022, 7:50:05 PM

On Fri, Jan 21, 2022 at 11:38:56PM +0100, Steinar H. Gunderson wrote:
> On Fri, Jan 21, 2022 at 09:48:06PM +0000, Colin Watson wrote:
> > So the current behaviour isn't a bug as such, but there's definitely
> > room for optimization here: when operating in-process, and in the common
> > case where the target encoding is UTF-8, the UTF-8 to UTF-8 trial
> > decoding path could be changed to just do a read-only "is this UTF-8"
> > test rather than effectively copying everything to a new buffer via
> > iconv. I don't know how much faster that would be, though it seems
> > likely to be an improvement.
>
> Technically, UTF-8 validation can be done at a few gigabytes per second
> per core:
>
> https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
>
> but that is probably overkill. :-)

Quite :-)

> > I'll see if I can make time for this, though I think a reasonable
> > priority for me is to finish working on your existing MR comments first
> > and get this ready to land.
>
> Sure, I agree this is a good prioritization. I saw a new patch set landed,
> but I'm not sure if you wanted me to look at it again yet? (Fundamentally,
> though, almost everything I have is style nits; if the patch went into man-db
> as-is, I would still be happy about it.)

Not yet - that was just trivial rebasing after I found and fixed a few
unrelated things I'd broken on main and wanted to get them into this
tree to simplify my own testing. I have a larger pile of rearrangements
in progress, but I'll post replies on the MR when they're ready.

Steinar H. Gunderson
Jan 23, 2022, 12:20:06 PM

On Sat, Jan 22, 2022 at 12:41:56AM +0000, Colin Watson wrote:
>> Technically, UTF-8 validation can be done at a few gigabytes per second
>> per core:
>>
>> https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
>>
>> but that is probably overkill. :-)
> Quite :-)

It struck me that it can probably be folded for free into the lexer.
If you add symbols for all invalid UTF-8 sequences, I believe it should just
go into the state machine. But I'm fine with those 20%; the perfect need not
be the enemy of the good here.

In general, I don't think I need to look at it again now, unless there are
any special questions. Thanks for taking care of this! Looking forward to
bookworm being faster (and of course sid before that), and then I'll happily
live with this on bullseye, knowing that it's transient.

Colin Watson
Jan 23, 2022, 10:10:04 PM

Control: tag -1 fixed-upstream

On Sun, Jan 23, 2022 at 06:11:19PM +0100, Steinar H. Gunderson wrote:
> On Sat, Jan 22, 2022 at 12:41:56AM +0000, Colin Watson wrote:
> >> Technically, UTF-8 validation can be done at a few gigabytes per second
> >> per core:
> >>
> >> https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
> >>
> >> but that is probably overkill. :-)
> > Quite :-)
>
> It struck me that it can probably be folded for free into the lexer.
> If you add symbols for all invalid UTF-8 sequences, I believe it should just
> go into the state machine. But I'm fine with those 20%; the perfect need not
> be the enemy of the good here.

Mm. I'd somewhat prefer not to put it in the lexer though, because in
general the next stage after encoding conversion can be something other
than the lexer, and I don't want to store up too much confusion for my
future self.

I grabbed glib's UTF-8 validator on the basis that it was a simple,
portable, and compatibly-licensed one that I could verify by eye and
dropped it in, replacing the trial conversion pass if source_encoding !=
UTF-8 and target_encoding == UTF-8. This saves about 8% on my test
system on top of the previous optimizations (10.589s → 9.791s, median of
nine runs), so it might be possible to do better with a faster
validator, but this seems likely to be good enough and we're probably
approaching diminishing returns.
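
For reference, a read-only check of this kind can be quite small. The
following is a minimal sketch in the spirit of the glib validator (not
the actual g_utf8_validate), rejecting overlong encodings, UTF-16
surrogates, truncated sequences, and out-of-range code points:

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal read-only "is this valid UTF-8?" check.  Sketch only; the
 * real glib g_utf8_validate has a different interface and more detail. */
static bool utf8_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;            /* continuation bytes expected */
        unsigned int cp, min;   /* decoded code point and its minimum */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return false;      /* stray continuation or invalid lead byte */

        if (len - i - 1 < need)
            return false;       /* truncated sequence at end of buffer */
        for (size_t j = 1; j <= need; j++) {
            unsigned char cc = s[i + j];
            if ((cc & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;       /* overlong, out of range, or surrogate */
        i += need + 1;
    }
    return true;
}
```

Unlike the trial iconv conversion, this touches each input byte once and
writes nothing, which is where the savings come from.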

> In general, I don't think I need to look at it again now, unless there are
> any special questions. Thanks for taking care of this! Looking forward to
> bookworm being faster (and of course sid before that), and then I'll happily
> live with this on bullseye, knowing that it's transient.

Thanks a lot for the initial prod and the review comments - they've
definitely improved things. I've gone ahead and merged all this. I'll
need to do a call for translations before releasing since there are some
other changes that will necessitate that, but I expect to produce a new
upstream release in a couple of weeks.