Buildworld Taking Very Long Time

Tim Daneliuk

Jun 30, 2024, 11:12:13 AM
to FreeBSD Mailing List
We do a nightly pull of -STABLE and then a buildworld/buildkernel.

The world and kernel build typically has been taking about 45-60min on one of
our quad core i5 machines.

For no obvious reason, it's now taking dozens of hours. Any insight on why this
might be happening would be appreciated.

Edward Sanford Sutton, III

Jun 30, 2024, 2:48:58 PM
to ques...@freebsd.org
On 6/30/24 08:11, Tim Daneliuk wrote:
> We do a nightly pull of -STABLE and then a buildworld/buildkernel

stable/14, stable/13, and/or no longer supported stable versions?

> The world and kernel build typically has been taking about 45-60min on
> one of
> our quad core i5 machines.

i5 narrows it down to 19(?) generations of CPUs. 4 core cuts it down to
about 9. CPU performance/features can vary a lot across those
generations+models.

> For no obvious reason, it's now taking dozens of hours.  Any insight on
> why this
> might be happening would be appreciated.

On my system (an i7-3820 using only 6 CPU cores, 32GB RAM, and a single
magnetic hard drive), building stable/14 with meta mode and ccache,
rerunning a build right after one completes (while the filesystem data is
still cached in RAM) finishes within minutes. Running an update when clang
has been updated takes hours (not tens of hours), and I recall a decent
amount of time going to OpenSSL too. A full build after cleanup of the
work directory should still be below 10 hours; my last timing, on the
otherwise same hardware restricted to 4 cores and given -j16, took less
than 6 hours, but it was long enough ago that I don't remember whether it
was timed during /13 or /12 (I delayed the 14 update for a while, but it
may have been recent enough to fall in that window without being of that
build). It has been a number of days since the last clang update in /14,
but OpenSSL did just get updated; still, that doesn't likely explain a
1 hour to 1 day+ build time change.

More detail on the build hardware and software setup is likely needed:
specific CPU, preferably RAM total+speed, what storage media
(magnetic/SSD, models, array configuration if RAID), what filesystem is
on the drives, any build customizations (ccache, WITH_META_MODE, altered
compiler flags, number of make jobs), and what version of the OS. If
PORTS_MODULES is defined it can add additional complete compilers to the
build process, among other things from the ports tree, depending on its
state and the state of currently installed packages.
Have you observed any unusual stats, like lower CPU or higher disk I/O
and % busy, compared to a typical run? If you don't have specific stats
you could glance at how things appear with top, systat, etc. to start
getting an idea.
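A minimal sketch of pulling those details together in one go (standard
FreeBSD tools; the make.conf/src.conf paths are the usual defaults, and
each command is guarded so the snippet degrades gracefully elsewhere):

```shell
sysctl hw.model hw.ncpu hw.physmem 2>/dev/null || true   # CPU model, cores, RAM
freebsd-version -ku 2>/dev/null || true                  # kernel + userland version
mount | head -n 5                                        # filesystems in use
# Build customizations, if any, usually live in these files:
grep -E 'WITH_META_MODE|CCACHE|MAKE_JOBS|PORTS_MODULES' \
    /etc/make.conf /etc/src.conf 2>/dev/null || true
```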
Do you know which steps in the world/kernel builds are taking long? You
can separate buildworld and buildkernel into separate commands and time
them separately. `make -s buildworld` will suppress a lot of output, which
helps make the stage messages visible, and the entire build can be logged.
I don't know the exact mechanism offhand, but I imagine there is a way to
log with timestamps throughout.
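The split-and-time suggestion above can be sketched roughly like this
(the -j4 value and log paths are placeholders; the while-read loop adds
per-line timestamps without needing anything from ports):

```shell
# Prefix every line of output with the current time.
stamp() { while IFS= read -r line; do printf '%s %s\n' "$(date +%T)" "$line"; done; }

cd /usr/src 2>/dev/null || { echo "no /usr/src on this machine"; exit 0; }
# Time buildworld and buildkernel separately; stderr (including time's
# report) goes into the same timestamped log.
( time make -s -j4 buildworld )  2>&1 | stamp | tee /tmp/buildworld.log
( time make -s -j4 buildkernel ) 2>&1 | stamp | tee /tmp/buildkernel.log
```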
Using magnetic media, ZFS with compression, ccache, and leaving
atime=on can lead to horrendous disk performance. I 'think' atime causes
fragmentation of file metadata (even listing large directory contents
takes forever), but even if not, you still have one write for every file
read. Disabling atime likely causes ccache to clear its cache in
first-in first-out order instead of removing what hasn't been used for
the longest time. devel/ccache on a compressed dataset doesn't track
sizes properly: it sounded like ZFS reports new cache entries as 0 bytes
instead of returning the uncompressed size (the compressed size can't be
returned until the compression algorithm has completed). This causes the
`ccache -s` cache size to exceed the max cache size without triggering
automatic cache cleanups; manually running `ccache -c` gets the cache
back within limits, which can leave a much smaller cache and can give
massive performance improvements if the file count was getting out of
control. A very poorly performing ccache store even exposes questionable
calls to ccache from ports tree operations, as basic non-compiling
operations become very slow with the extra ccache disk I/O.
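The ccache checks described above, as a guarded sketch (the snippet is a
no-op on machines without ccache, and the dataset name zroot/ccache is
an assumption; substitute wherever your cache actually lives):

```shell
command -v ccache >/dev/null 2>&1 || { echo "ccache not installed"; exit 0; }
ccache -s     # compare the reported cache size against the max cache size
ccache -c     # force a cleanup if the size has run past the limit
# Check the cache dataset's atime/compression settings (ZFS only):
zfs get atime,compression zroot/ccache 2>/dev/null || true
```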
I haven't had WITH_META_MODE cause a noticeable detriment to build
times, but I have had it break builds until I ran `chflags -R noschg
/usr/obj/usr; rm -rf /usr/obj/usr; cd /usr/src && make cleandir && make
cleandir`. If you are trying to diagnose this for yourself and others,
though, it would be helpful to move/back up the build directory contents
instead of removing them, so they can be further analyzed.
Does this machine have any other uses during the build that could be
hogging CPU/RAM/disk with other operations?
Are CPU temperatures staying in the proper range, or could thermal
throttling be ruining CPU performance? Disk I/O taking longer than
expected on a filesystem with plenty of free space and a reasonable
file/directory count could indicate a drive issue; running SMART tests,
reseating all drive cable connections (helps with dirt/minor corrosion;
disconnect and reconnect several times), and making sure drive
temperatures are within adequate ranges are all good steps.
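Quick checks for the thermal and drive-health questions above (the
temperature sysctl needs coretemp(4) or amdtemp(4) loaded, and smartctl
comes from sysutils/smartmontools, so both are guarded; /dev/ada0 is a
placeholder device name):

```shell
# CPU temperature, if the thermal driver is loaded:
sysctl dev.cpu.0.temperature 2>/dev/null || echo "no CPU temperature sysctl"
# Drive health summary and SMART attributes:
command -v smartctl >/dev/null 2>&1 \
    && smartctl -H -A /dev/ada0 \
    || echo "smartctl not installed (sysutils/smartmontools)"
```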

Ralf Mardorf

Jun 30, 2024, 6:26:06 PM
to ques...@freebsd.org
On Sun, 2024-06-30 at 11:48 -0700, Edward Sanford Sutton, III wrote:
> Disk I/O taking longer than expected

Shingled Magnetic Recording vs Conventional Magnetic Recording

Horse Radish

Jun 30, 2024, 7:04:39 PM
to ralf-m...@riseup.net, ques...@freebsd.org

The drives in question are ssd

infoomatic

Jun 30, 2024, 7:09:46 PM
to ques...@freebsd.org
On 01.07.24 01:03, Horse Radish wrote:
> The drives in question are ssd

Are these QLC SSDs or TLC SSDs? QLCs can be very slow; e.g., my 3 Patriot
Burst Elite 1920GB drives in a raidz1 are barely faster than HDDs once
throughput drops significantly after the first few hundred MB have been
written, especially with small files.

Horse Radish

Jun 30, 2024, 7:23:40 PM
to infoo...@gmx.at, ques...@freebsd.org

Not sure. But they're drives that have been on this server for over a year. The slow compilation just showed up in the past few months. LLVM seems to be part of the issue but I don't know why it's taking so long.

Kevin P. Neal

Jun 30, 2024, 9:06:59 PM
to Horse Radish, infoo...@gmx.at, ques...@freebsd.org
On Sun, Jun 30, 2024 at 06:23:06PM -0500, Horse Radish wrote:
> Not sure. But they're drives that have been on this server for over a
> year. The slow compilation just showed up in the past few
> months.  LLVM seems to be part of the issue but I don't know why
> it's taking so long..

Are you running out of memory? Links of LLVM programs can use fantastic
amounts of memory. For debug builds I frequently see 16GB of memory for a
single link. I've brought down login servers multiple times by accident
when running links in parallel.

A non-debug build will take a fraction of the memory, but if enough of
them are done in parallel it still might be a problem.
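Whether memory is actually the bottleneck can be confirmed from another
terminal while the build runs; a minimal sketch (swapinfo is
FreeBSD-specific, hence the guard; for vanilla LLVM builds like the one
mentioned below, the LLVM_PARALLEL_LINK_JOBS CMake cache variable also
caps how many links run at once):

```shell
# Watch for memory pressure: a shrinking "fre" column plus sustained
# paging activity means the links are starving the machine.
vmstat 1 3 2>/dev/null || true
# Swap usage summary (FreeBSD only):
command -v swapinfo >/dev/null 2>&1 && swapinfo -h \
    || echo "swapinfo not available (FreeBSD-only)"
```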

I admit I don't know of any change that might have created a new problem,
but I never build from the FreeBSD tree -- just vanilla LLVM from git.
--
Kevin P. Neal http://www.pobox.com/~kpn/
"Oh, I've heard that paradox a couple of times, but there's something
about a cat dying and I hate to think of such things."
- Dr. Donald Knuth speaking of Schrodinger's cat, December 8, 1999, MIT

Tim Daneliuk

Jul 1, 2024, 7:56:19 PM
to FreeBSD Mailing List
As a followup -

- The machine in question is a 4 core i5 - a rather older one
- The build disk is a fairly recent (about 1 year old) SSD
- It has 8G of memory, but htop shows no memory starvation or use of swap
- HOWEVER, htop is reporting a load average of almost 14 (!) even though I am specifying -j4 on the make line

Have buildworld and buildkernel suddenly decided to ignore the -j4 and launch tons of parallel processes?

Ideas anyone?

Tim Daneliuk

Jul 2, 2024, 5:34:17 PM
to FreeBSD Mailing List
So, we've discovered the apparent cause and I thought I'd share with the class.

This is an older 4 core i5 server and it was starving for resources because ...

We appear to be increasingly under bot scraping attacks. This was made
worse because our apache config didn't configure the mpm module to limit
apache resource consumption. In effect, apache would try to take on as
much work as the public network could throw at it ... which gave us load
averages in the 17s and up.

The fix involved several things:

- getting the apache mpm module in place
- tuning its settings to severely limit the number of processes and threads it could use
- ipfw blocking some obvious scanning abusers (looking at you, degenerates at Facebook...)

So far, much, much better. We'll see.
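For anyone hitting the same thing, the shape of that fix looks roughly
like the sketch below. The worker counts are illustrative placeholders,
not the actual settings used, and the ipfw rule uses a documentation
address range rather than any real abuser's network:

```shell
# Print an example mpm_prefork limits stanza; append the output to your
# Apache Includes directory (e.g. /usr/local/etc/apache24/Includes/).
cat <<'EOF'
<IfModule mpm_prefork_module>
    StartServers           2
    MinSpareServers        2
    MaxSpareServers        4
    MaxRequestWorkers     16
    ServerLimit           16
</IfModule>
EOF
# The corresponding ipfw block rule would look roughly like:
#   ipfw add 500 deny tcp from 203.0.113.0/24 to me 80,443 in
```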

Kevin P. Neal

Jul 4, 2024, 10:28:14 AM
to Tim Daneliuk, FreeBSD Mailing List
On Mon, Jul 01, 2024 at 06:55:57PM -0500, Tim Daneliuk wrote:
> Have buildworld and buildkernel suddenly decided to ignore the -j4 and launch tons of parallel processes?
>
> Ideas anyone?

/usr/bin/top is your friend. Especially if you tell it to order by CPU usage.
It would have told you that your top CPU users were four compilations plus
however many apache processes were killing your server.
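Concretely, on FreeBSD that looks like this (flags per FreeBSD's top(1);
the 10 is just how many processes to show, and Linux top uses different
syntax, hence the guard):

```shell
# Top CPU consumers in batch mode, sorted by CPU usage:
top -b -o cpu 10 2>/dev/null || true
# Compare the load average against the -j value you passed to make:
uptime
```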

--
Kevin P. Neal http://www.pobox.com/~kpn/

"What is mathematics? The age-old answer is, of course, that mathematics
is what mathematicians do." - Donald Knuth
