On 6/30/24 08:11, Tim Daneliuk wrote:
> We do a nightly pull of -STABLE and then a buildworld/buildkernel
Is that stable/14, stable/13, and/or a stable branch that is no longer
supported?
> The world and kernel build typically has been taking about 45-60min on
> one of our quad core i5 machines.
i5 narrows it down to 19(?) generations of CPUs. 4 core cuts it down to
about 9. CPU performance/features can vary a lot across those
generations+models.
> For no obvious reason, it's now taking dozens of hours. Any insight on
> why this might be happening would be appreciated.
On my system, using meta mode and ccache for stable/14, rerunning a
build right after one completes (with filesystem data still cached in
RAM) finishes within minutes on my i7-3820 using only 6 cpu cores, 32GB
RAM, and a single magnetic hard drive. Running an update when clang has
been updated takes hours (not tens of hours), and I seem to recall a
decent amount of time goes to openssl too. A full build after cleaning
the work directory should still be below 10 hours; my last timing was
with 4 cores on otherwise the same hardware, given -j16, and took less
than 6 hours. That was long enough ago that I don't remember if it was
timed during /13 or /12 (I delayed the 14 update for a while, so it may
be recent enough to fall in the /14 window without being a /14 build).
It has been a number of days since the last clang update in /14, but
openssl did just get updated; that still doesn't likely explain a build
time change from 1 hour to 1 day+.
More detail on the build hardware+software setup is likely needed:
Specific CPU, preferably RAM total+speed, what storage media
(magnetic/ssd, models, array configuration if RAID).
What filesystem is on the drives. Any build customizations (ccache,
WITH_META_MODE, altered compiler flags, number of make jobs). What
version of OS. If PORTS_MODULES is defined it can add additional
complete compilers to the build process among other things from the
ports tree depending on its state and the state of currently installed
packages.
Have you observed any unusual stats like lower CPU usage or higher disk
I/O & % busy compared to a typical run? If you don't have specific stats you
could glance at how things appear with top, systat, etc. to start
getting an idea.
Do you know what steps in the world/kernel build are taking long? You
can split buildworld and buildkernel into separate commands and time
them separately. `make -s buildworld` will suppress a lot of output,
which makes the stage messages easier to see, and the entire build can
be logged. I don't know how but I imagine there is a way to do it with
timestamps throughout.
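
A rough sketch of one way to time each phase and get timestamps in the
log, using only plain sh. The function names `stamp_lines` and
`timed_phase` are made up for illustration and are not part of the
FreeBSD build system; the log path is whatever you pass in.

```sh
# Prefix every line read from stdin with the current wall-clock time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Run one command, append its timestamped output to a log file, then
# append the elapsed wall-clock seconds.
# Usage: timed_phase LOGFILE COMMAND [ARGS...]
# Note: the command's exit status is lost because it runs in a pipeline.
timed_phase() {
    log=$1; shift
    start=$(date +%s)
    "$@" 2>&1 | stamp_lines >> "$log"
    end=$(date +%s)
    printf 'elapsed %ss: %s\n' "$((end - start))" "$*" >> "$log"
}
```

For example, `cd /usr/src && timed_phase /tmp/buildworld.log make -s buildworld`,
then the same with `buildkernel` and a second log file. Gaps between
consecutive timestamps should show which stage is eating the time.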
Using magnetic media, ZFS with compression, ccache, and leaving
atime=on can lead to horrendous disk performance. I 'think' atime causes
fragmentation of file metadata (even listing large directory contents
takes forever), but even if not you still have 1 write for every file
read. Disabling atime, however, likely causes ccache to clear the cache
in first-in first-out order instead of removing what has gone unused the
longest. devel/ccache on a compressed dataset doesn't track sizes
properly; it sounded like zfs reports new cache entries as 0 bytes
instead of returning their uncompressed size (the compressed size can't
be returned until the compression algorithm has completed). This lets
the cache size shown by `ccache -s` exceed the max cache size without
triggering automatic cache cleanups. Manually running `ccache -c` gets
the cache back within limits, which can make for a much smaller cache
and massive performance improvements if the file count was getting out
of control. A very poorly performing ccache storage even reveals
questionable calls to ccache from ports tree operations, as basic
non-compiling operations become very slow with ccache disk I/O.
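
To cross-check what ccache believes about its own size, something like
the following could be used to count files and measure allocated space
directly. This is only a sketch: `cache_usage` is a hypothetical helper
name, and the directory you pass should be wherever your ccache
directory actually lives.

```sh
# Print the number of regular files and the allocated size in KiB of a
# directory tree, independent of ccache's own bookkeeping.
# Usage: cache_usage DIR
cache_usage() {
    dir=$1
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    kib=$(du -sk "$dir" | awk '{print $1}')
    printf 'files=%s kib=%s\n' "$files" "$kib"
}
```

Compare the result with the cache size and max cache size that
`ccache -s` reports; on a compressed dataset `du` shows the space
actually allocated, so the two need not agree. `zfs get atime,compression`
on the dataset holding the cache shows the current settings.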
I haven't had WITH_META_MODE cause a noticeable detriment to build
times but have had it break builds until I ran
`chflags -R noschg /usr/obj/usr;rm -rf /usr/obj/usr;cd /usr/src&&make cleandir&&make cleandir`.
If you are trying to diagnose this for yourself and others, though, it
would be helpful to move or back up the build directory contents instead
of removing them so they can be analyzed further.
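
A minimal sketch of moving the build directory aside rather than
deleting it. `stash_dir` is a hypothetical helper name, and this assumes
the new name is on the same filesystem so that `mv` is a quick rename
rather than a full copy.

```sh
# Rename a directory to DIR.saved-<timestamp> and print the new name.
# Usage: stash_dir DIR
stash_dir() {
    src=$1
    dest="$src.saved-$(date +%Y%m%d-%H%M%S)"
    mv "$src" "$dest" && printf '%s\n' "$dest"
}
```

Running `stash_dir /usr/obj/usr` in place of the `rm -rf` step keeps the
old contents around for later inspection, at the cost of the disk space
they occupy.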
Is this machine used for anything else during the build that could be
hogging CPU/RAM/disk?
Are CPU temperatures staying in proper range or could thermal
throttling be ruining CPU performance? Disk I/O taking longer than
expected on a filesystem with plenty of free space and reasonable
file/directory count could indicate a drive issue. Running SMART tests,
reseating all drive cable connections (helps with dirt/minor corrosion;
disconnect+connect several times), and making sure drive temperatures
are within adequate ranges are all worth doing.