[PATCH] deb-dl-dir: remove excessive calls to dpkg-deb in debsrc_download


Cedric Hombourger

Mar 5, 2025, 8:12:05 AM
to isar-...@googlegroups.com, ub...@ilbers.de, Cedric Hombourger
Several calls to dpkg-deb are made for each single .deb file found in
downloads to parse individual fields. This approach is terribly slow
when a large amount of .deb files are found. Use apt-ftparchive to
produce an index of packages that were found and a simple awk script
to produce a (sorted) list of source package names and their versions.
Also avoid using sed to remove Epoch from the version when we are
trying to determine the name of the .dsc file: we instead use a simple
POSIX parameter expansion to remove everything up to the first colon

Signed-off-by: Cedric Hombourger <cedric.h...@siemens.com>
---
meta/classes/deb-dl-dir.bbclass | 62 +++++++++++++++++++--------------
1 file changed, 35 insertions(+), 27 deletions(-)

diff --git a/meta/classes/deb-dl-dir.bbclass b/meta/classes/deb-dl-dir.bbclass
index 7ebd057e..53ce4538 100644
--- a/meta/classes/deb-dl-dir.bbclass
+++ b/meta/classes/deb-dl-dir.bbclass
@@ -5,23 +5,6 @@

inherit repository

-is_not_part_of_current_build() {
- local package="$( dpkg-deb --show --showformat '${Package}' "${1}" )"
- local arch="$( dpkg-deb --show --showformat '${Architecture}' "${1}" )"
- local version="$( dpkg-deb --show --showformat '${Version}' "${1}" )"
- # Since we are parsing all the debs in DEBDIR, we can to some extend
- # try to eliminate some debs that are not part of the current multiconfig
- # build using the below method.
- local output="$( grep -xhs ".* status installed ${package}:${arch} ${version}" \
- "${IMAGE_ROOTFS}"/var/log/dpkg.log \
- "${SCHROOT_HOST_DIR}"/var/log/dpkg.log \
- "${SCHROOT_TARGET_DIR}"/var/log/dpkg.log \
- "${SCHROOT_HOST_DIR}"/tmp/dpkg_common.log \
- "${SCHROOT_TARGET_DIR}"/tmp/dpkg_common.log | head -1 )"
-
- [ -z "${output}" ]
-}
-
debsrc_do_mounts() {
sudo -s <<EOSUDO
set -e
@@ -54,16 +37,41 @@ debsrc_download() {
( flock 9
set -e
printenv | grep -q BB_VERBOSE_LOGS && set -x
- find "${rootfs}/var/cache/apt/archives/" -maxdepth 1 -type f -iname '*\.deb' | while read package; do
- is_not_part_of_current_build "${package}" && continue
- local src="$( dpkg-deb --show --showformat '${source:Package}' "${package}" )"
- local version="$( dpkg-deb --show --showformat '${source:Version}' "${package}" )"
- local dscname="$(echo ${src}_${version} | sed -e 's/_[0-9]\+:/_/')"
- local dscfile=$(find "${DEBSRCDIR}"/"${rootfs_distro}" -name "${dscname}.dsc")
- [ -n "$dscfile" ] && continue
-
- sudo -E chroot --userspec=$( id -u ):$( id -g ) ${rootfs} \
- sh -c ' mkdir -p "/deb-src/${1}/${2}" && cd "/deb-src/${1}/${2}" && apt-get -y --download-only --only-source source "$2"="$3" ' download-src "${rootfs_distro}" "${src}" "${version}"
+
+ # Use apt-ftparchive to scan all .deb files found in the download directory
+ # and produce an index that we can "parse" with awk. This is much faster
+ # than parsing each .deb file individually using dpkg-deb. Lines from the
+ # index we need are:
+ #
+ # Package: <binary-name>
+ # Version: <binary-version>
+ # Source: <source-name> (<source-version>)
+ #
+ # If Source is omitted, then <source-name>=<binary-name> and
+ # if <source-version> is not specified then it is <binary-version>.
+ # The awk script handles these optional fields. It looks for Size: as a
+ # trigger to print the source,version tupple
+
+ apt-ftparchive --md5=no --sha1=no --sha256=no --sha512=no \
+ -a "${DISTRO_ARCH}" packages \
+ "${rootfs}/var/cache/apt/archives" \
+ | awk '/^Package:/ { s=$2; }
+ /^Version:/ { v=$2; next }
+ /^Source:/ { s=$2; if ($3 ~ /^\(/) v=substr($3, 2, length($3)-2) }
+ /^Size:/ { print s, v}' \
+ | sort -u \
+ | while read src version; do
+ # Name of the .dsc file does not include Epoch, remove it before checking
+ # if sources were already downloaded. Avoid using sed here to reduce the
+ # number of processes being spawned by this function: we assume that the
+ # version is correctly formatted and simply strip everything up to the
+ # first colon
+ dscname="${src}_${version#*:}.dsc"
+ [ -f "${DEBSRCDIR}"/"${rootfs_distro}"/"${src}"/"${dscname}" ] || {
+ # use apt-get source to download sources in DEBSRCDIR
+ sudo -E chroot --userspec=$( id -u ):$( id -g ) ${rootfs} \
+ sh -c ' mkdir -p "/deb-src/${1}/${2}" && cd "/deb-src/${1}/${2}" && apt-get -y --download-only --only-source source "$2"="$3" ' download-src "${rootfs_distro}" "${src}" "${version}"
+ }
done
) 9>"${DEBSRCDIR}/${rootfs_distro}.lock"

--
2.39.5
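
The awk filter in the patch above can be tried in isolation. The following is a minimal sketch, not part of the patch: the function names `source_version_pairs` and `sample_index` and the three sample stanzas are made up for illustration, and the fields are ordered as in the patch comment (Package, Version, Source).

```shell
#!/bin/sh
# Hypothetical wrapper around the awk script from the patch: reads a
# Packages-style index on stdin and prints sorted, unique
# "<source> <source-version>" pairs.
source_version_pairs() {
    awk '/^Package:/ { s=$2 }
         /^Version:/ { v=$2; next }
         /^Source:/  { s=$2; if ($3 ~ /^\(/) v=substr($3, 2, length($3)-2) }
         /^Size:/    { print s, v }' | sort -u
}

# Made-up sample index (an assumption, not real apt-ftparchive output).
sample_index() {
    cat <<'EOF'
Package: libfoo1
Version: 1.0-1
Source: foo
Size: 100

Package: bar-utils
Version: 2:3.4-5+b1
Source: bar (2:3.4-5)
Size: 200

Package: baz
Version: 0.9
Size: 300
EOF
}

# Prints one "<source> <version>" pair per line.
sample_index | source_version_pairs
```

Note that the script keeps whichever version it saw last before `Size:`. With the field order used here, the version given in parentheses on the `Source:` line wins. If an index lists `Source:` before `Version:`, the binary version would overwrite it, so the field order is worth checking against real `apt-ftparchive` output.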

Jan Kiszka

Mar 5, 2025, 8:57:11 AM
to Cedric Hombourger, isar-...@googlegroups.com, ub...@ilbers.de
On 05.03.25 14:11, 'Cedric Hombourger' via isar-users wrote:
> Several calls to dpkg-deb are made for each single .deb file found in
> downloads to parse individual fields. This approach is terribly slow
> when a large amount of .deb files are found. Use apt-ftparchive to

Out of curiosity: roughly how many packages does it take for this
inefficiency to become visible?

Can we take the chance to rewrap this horribly long line?

> + }
> done
> ) 9>"${DEBSRCDIR}/${rootfs_distro}.lock"
>

Did you also consider using a python function for the content
processing? I'm not predicting that this will be faster or nicer or
whatever, just wondering if it might be while reading the above.

Jan

--
Siemens AG, Foundational Technologies
Linux Expert Center

cedric.h...@siemens.com

Mar 5, 2025, 10:08:35 AM
to isar-...@googlegroups.com, Kiszka, Jan, ub...@ilbers.de
On Wed, 2025-03-05 at 14:57 +0100, Jan Kiszka wrote:
> On 05.03.25 14:11, 'Cedric Hombourger' via isar-users wrote:
> > Several calls to dpkg-deb are made for each single .deb file found
> > in
> > downloads to parse individual fields. This approach is terribly
> > slow
> > when a large amount of .deb files are found. Use apt-ftparchive to
>
> Out of curiosity: What is roughly the amount of packages where this
> inefficiency becomes visible?

That's a great question, though I may not have a great answer. I plead
guilty to (1) sharing my downloads folder between builds and (2) doing
multiconfig builds. I therefore have ~4.9k .deb packages in my
downloads folder.

I would expect people to start feeling the cost of deb-src caching with
only a few hundred packages. For each package, the current
implementation calls dpkg-deb 5 times, sed once, and find once (to
traverse the deb-src tree and check whether the .dsc we want to
download is already there).
Cedric Hombourger
Siemens AG
www.siemens.com
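
For reference, the epoch removal that replaces the sed call can be sketched on its own. The helper name `dsc_name` and the sample versions are hypothetical; only the `${version#*:}` expansion comes from the patch.

```shell
#!/bin/sh
# Hypothetical helper: build the .dsc file name for a source package.
# $1 = source package name, $2 = version, optionally prefixed "epoch:".
# "${2#*:}" removes the shortest prefix ending in a colon, i.e. the
# epoch; a version without a colon is left unchanged. No extra process
# is spawned, unlike the earlier echo | sed pipeline.
dsc_name() {
    printf '%s_%s.dsc\n' "$1" "${2#*:}"
}

dsc_name bar 2:3.4-5   # version with an epoch
dsc_name foo 1.0-1     # version without an epoch
```

As the patch comment says, this assumes the version string is well formed.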

Niedermayr, BENEDIKT

Mar 5, 2025, 12:22:05 PM
to cedric.h...@siemens.com, isar-...@googlegroups.com, ub...@ilbers.de
On 05.03.25 14:11, 'Cedric Hombourger' via isar-users wrote:
Maybe a pointer to my previous patch [1] which addresses this as well
but with a different motivation. Your patch would also fix a regression
that has been introduced with mmdebstrap.

At least my patch is causing problems.

[1] https://groups.google.com/g/isar-users/c/IeORW6eiTxI

Niedermayr, BENEDIKT

Mar 5, 2025, 12:24:15 PM
to cedric.h...@siemens.com, isar-...@googlegroups.com, ub...@ilbers.de
Ok just saw this [1], so you might be already aware of it.

[1] https://groups.google.com/g/isar-users/c/8QstIaudyts

Regards,
Benedikt

Srinuvasan Arjunan

Mar 10, 2025, 7:06:56 AM
to isar-users
Hi Cedric,

I applied this patch for my deb-src caching issue [1], and I am now able to download deb-src for the bootstrap and image-related packages.
The only missing part is the imager_install-related packages; I am going to send patches based on yours.

However, I found one issue with armhf base-apt builds in ISAR: the help2man and texinfo deb-src packages are missing.
We build the index with apt-ftparchive --md5=no --sha1=no --sha256=no --sha512=no -a "${DISTRO_ARCH}",
and DISTRO_ARCH is armhf in this case. The help2man and texinfo packages are only available for amd64 (possibly because of the
ISAR_CROSS_COMPILE configuration), not for armhf, so the index does not contain them and their source packages are
never downloaded.

I suggest removing the -a "${DISTRO_ARCH}" option; sort -u deduplicates the final list anyway.
I validated the build without -a and it works as expected.

What are your thoughts?

Cedric Hombourger

Mar 22, 2025, 2:15:39 AM
to isar-...@googlegroups.com, ub...@ilbers.de, Cedric Hombourger
Changes since v1:
* v1 had is_not_part_of_current_build() removed. It turns out that
we had better check which .deb files get imported into the apt cache
of our rootfs, to make sure that we only try to fetch sources for
packages we know. This is now achieved with apt-cache dumpavail
to obtain a list of known source packages.

Cedric Hombourger (1):
deb-dl-dir: remove excessive calls to dpkg-deb in debsrc_download

meta/classes/deb-dl-dir.bbclass | 78 +++++++++++++++++++++------------
1 file changed, 51 insertions(+), 27 deletions(-)

--
2.39.5
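
The idea of only fetching sources that are both wanted and known to apt amounts to intersecting two sorted lists, which `comm -12` does. The following is a minimal sketch under stated assumptions: the helper name `intersect_sorted` and the package pairs are made up, and the lists are written with `printf` instead of coming from `apt-ftparchive` and `apt-cache dumpavail`.

```shell
#!/bin/sh
# Hypothetical helper: print the lines common to two sorted files.
# comm -12 suppresses the lines unique to the first file (column 1)
# and unique to the second file (column 2), leaving the intersection.
intersect_sorted() {
    comm -12 "$1" "$2"
}

# Temporary files for the two lists, removed when the shell exits.
wanted=$(mktemp)
avail=$(mktemp)
trap 'rm -f "$wanted" "$avail"' EXIT

# Made-up "<source> <version>" pairs; both lists must be sorted the
# same way for comm to work, hence sort -u.
printf '%s\n' 'bar 2:3.4-5' 'foo 1.0-1' 'local-pkg 0.1' | sort -u > "$wanted"
printf '%s\n' 'bar 2:3.4-5' 'foo 1.0-1' 'zlib 1.3' | sort -u > "$avail"

# Prints only the pairs present in both lists.
intersect_sorted "$wanted" "$avail"
```

Here `local-pkg 0.1` is dropped because it is absent from the "available" list, and `zlib 1.3` is dropped because it is not wanted.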

Cedric Hombourger

Mar 22, 2025, 2:15:40 AM
to isar-...@googlegroups.com, ub...@ilbers.de, Cedric Hombourger
Several calls to dpkg-deb are made for each single .deb file found in
downloads to parse individual fields. This approach is terribly slow
when a large amount of .deb files are found. Use apt-ftparchive to
produce an index of packages that were found and a simple awk script
to produce a (sorted) list of source package names and their versions.
"apt-cache dumpavail" is used so we only attempt to download source
packages for versions that are known to apt.

Also avoid using sed to remove Epoch from the version when we are
trying to determine the name of the .dsc file: we instead use a simple
POSIX parameter expansion to remove everything up to the first colon

Signed-off-by: Cedric Hombourger <cedric.h...@siemens.com>
---
meta/classes/deb-dl-dir.bbclass | 78 +++++++++++++++++++++------------
1 file changed, 51 insertions(+), 27 deletions(-)

diff --git a/meta/classes/deb-dl-dir.bbclass b/meta/classes/deb-dl-dir.bbclass
index 7ebd057e..75877750 100644
--- a/meta/classes/deb-dl-dir.bbclass
+++ b/meta/classes/deb-dl-dir.bbclass
@@ -5,23 +5,6 @@

inherit repository

-is_not_part_of_current_build() {
- local package="$( dpkg-deb --show --showformat '${Package}' "${1}" )"
- local arch="$( dpkg-deb --show --showformat '${Architecture}' "${1}" )"
- local version="$( dpkg-deb --show --showformat '${Version}' "${1}" )"
- # Since we are parsing all the debs in DEBDIR, we can to some extend
- # try to eliminate some debs that are not part of the current multiconfig
- # build using the below method.
- local output="$( grep -xhs ".* status installed ${package}:${arch} ${version}" \
- "${IMAGE_ROOTFS}"/var/log/dpkg.log \
- "${SCHROOT_HOST_DIR}"/var/log/dpkg.log \
- "${SCHROOT_TARGET_DIR}"/var/log/dpkg.log \
- "${SCHROOT_HOST_DIR}"/tmp/dpkg_common.log \
- "${SCHROOT_TARGET_DIR}"/tmp/dpkg_common.log | head -1 )"
-
- [ -z "${output}" ]
-}
-
debsrc_do_mounts() {
sudo -s <<EOSUDO
set -e
@@ -41,6 +24,24 @@ debsrc_undo_mounts() {
EOSUDO
}

+debsrc_source_version_filter() {
+ # Filter the input to only consider Package, Version and Source lines
+ #
+ # Package: <binary-name>
+ # Version: <binary-version>
+ # Source: <source-name> (<source-version>)
+ #
+ # If Source is omitted, then <source-name>=<binary-name> and
+ # if <source-version> is not specified then it is <binary-version>.
+ # The awk script handles these optional fields. It looks for Size: as a
+ # trigger to print the source,version tupple
+ awk '/^Package:/ { s=$2; }
+ /^Version:/ { v=$2; next }
+ /^Source:/ { s=$2; if ($3 ~ /^\(/) v=substr($3, 2, length($3)-2) }
+ /^Size:/ { print s, v}' \
+ | sort -u
+}
+
debsrc_download() {
export rootfs="$1"
export rootfs_distro="$2"
@@ -54,16 +55,39 @@ debsrc_download() {
( flock 9
set -e
printenv | grep -q BB_VERBOSE_LOGS && set -x
- find "${rootfs}/var/cache/apt/archives/" -maxdepth 1 -type f -iname '*\.deb' | while read package; do
- is_not_part_of_current_build "${package}" && continue
- local src="$( dpkg-deb --show --showformat '${source:Package}' "${package}" )"
- local version="$( dpkg-deb --show --showformat '${source:Version}' "${package}" )"
- local dscname="$(echo ${src}_${version} | sed -e 's/_[0-9]\+:/_/')"
- local dscfile=$(find "${DEBSRCDIR}"/"${rootfs_distro}" -name "${dscname}.dsc")
- [ -n "$dscfile" ] && continue
-
- sudo -E chroot --userspec=$( id -u ):$( id -g ) ${rootfs} \
- sh -c ' mkdir -p "/deb-src/${1}/${2}" && cd "/deb-src/${1}/${2}" && apt-get -y --download-only --only-source source "$2"="$3" ' download-src "${rootfs_distro}" "${src}" "${version}"
+
+ # We need temporary files for our lists of source packages
+ # trap exit of this sub-shell to remove them (this script may exit abruptly
+ # since "set -e" is used)
+ avail=$(mktemp)
+ wanted=$(mktemp)
+ trap "rm -f ${avail} ${wanted}" EXIT
+
+ # List all packages known to apt
+ apt-cache -o Dir=${rootfs} dumpavail | debsrc_source_version_filter > ${avail}
+
+ # Use apt-ftparchive to scan all .deb files found in the download directory
+ # and get the <source> <version> pairs that we wish to download
+ apt-ftparchive --md5=no --sha1=no --sha256=no --sha512=no \
+ -a "${DISTRO_ARCH}" packages \
+ "${rootfs}/var/cache/apt/archives" \
+ | debsrc_source_version_filter > ${wanted}
+
+ # We now have two sorted lists: source packages we want and those known to
+ # apt. We will only consider source packages that may be found in both.
+ comm -12 ${wanted} ${avail} \

Uladzimir Bely

Mar 27, 2025, 6:34:05 AM
to Cedric Hombourger, isar-...@googlegroups.com
Applied to next, thanks.

--
Best regards,
Uladzimir.