Source repos?


Ben Laurie

Oct 28, 2015, 8:37:21 AM
to binary-tr...@googlegroups.com, Holger Levsen, Ed Maste, opensec
I've noticed that neither FreeBSD nor Debian (AFAIK) includes in its package metadata where the original source repo is, or which version from that repo was actually released.

I realise currently both systems operate off released versions (i.e. tarballs) but if the ultimate plan is to increase confidence that the binary is indeed the "right" binary, linking all the way back to the original repo seems like a good idea.

I know this can't be done for everything, but where possible it seems like a good idea.

Also, it's a useful piece of information for metrics (e.g. measuring patch times, contribution levels, etc.).

What do people think?

Daniel Kahn Gillmor

Oct 28, 2015, 11:34:49 AM
to Ben Laurie, binary-tr...@googlegroups.com, Holger Levsen, Ed Maste, opensec
Hi Ben--

On Wed 2015-10-28 08:37:11 -0400, 'Ben Laurie' via binary-transparency <binary-tr...@googlegroups.com> wrote:
> I've noticed that neither FreeBSD nor Debian (AFAIK) include in their
> package metadata where the original source repo is, nor what actually
> version from that repo is released.

You're using a few terms here that have multiple/ambiguous meanings so
i'm going to try to sort them out.

I think you might be talking about one of these things:

a) the "upstream" source code per individual package
b) the "upstream" revision control system per individual package
c) the distribution's repository of source code
d) the distribution's repository itself

I think you don't mean (c) or (d), but in case you do:

metadata/debian/main/binary-all/Release in the binary-transparency-notes
repo contains this:

-----
Archive: stable
Origin: Debian
Label: Debian
Version: 8.2
Component: main
Architecture: all
-----

If you're talking about the source code and not the binaries, there is a
corresponding main/source/Release -- i can include an example of that in
the binary-transparency-notes repo as well if people are interested. It
contains digests of the "upstream source" tarballs, which debian also
publishes and distributes (more on upstream source tarballs below).

You identify two different pieces of information: "where the source repo
is" -- do you mean by URL? debian has traditionally used mirrors for
distribution so there is no one canonical source URL (other than the
source for the mirrors, which we don't want everyone pulling from
directly).

> I realise currently both systems operate off released versions (i.e.
> tarballs) but if the ultimate plan is to increase confidence that the
> binary is indeed the "right" binary, linking all the way back to the
> original repo seems like a good idea.
>
> I know this can't be done for everything, but where possible it seems like
> a good idea.

Looking at (a) and (b) --

not every upstream uses revision control, and those that do sometimes
don't have a simple linkage between their revision control system and
their released tarballs. (for example, there are many projects that
produce tarballs via "make dist", which strips out some
developer-specific files and embeds a bunch of autotools cruft^H^H^H^H^H
useful portability scripts)

Even those upstreams that do have a stable repository using a sane
revision control system, and generate clean tarballs directly from it
without any additional modification, don't necessarily publish tags in a
standard way that corresponds to their releases.

debian packages contain a file debian/copyright which should indicate
the URL of the upstream project, and they often also contain a file
debian/watch, which indicates the URL where upstream's released tarballs
can be found. In any package, debian/control also has an optional
Homepage: field which points to the URL of the upstream project (and is
replicated in the Packages.xz files), and a Vcs-Git: or Vcs-Browser: field
which points to revision control for the debian packaging itself. so
all of this info is extractable from the archive itself, even if some of
it is not in the specific metadata you find in the Packages.xz file.
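
As a rough illustration of pulling those fields out of a package, here is a minimal Python sketch. The control file contents, the package name libfoo, and all URLs in it are hypothetical, and the parser is deliberately simplified (it ignores continuation lines and reads only the first, source stanza).

```python
import re

# Hypothetical debian/control contents; names and URLs are made up.
CONTROL = """\
Source: libfoo
Maintainer: Example Maintainer <maint@example.org>
Homepage: https://libfoo.example/
Vcs-Git: https://salsa.example/debian/libfoo.git
Vcs-Browser: https://salsa.example/debian/libfoo

Package: libfoo1
Architecture: any
Description: hypothetical library
"""


def source_stanza(control_text):
    """Return the fields of the first (source) stanza as a dict."""
    first = re.split(r"\n\s*\n", control_text.strip(), maxsplit=1)[0]
    fields = {}
    for line in first.splitlines():
        if line[:1] in (" ", "\t") or line.startswith("#"):
            continue  # skip continuation lines and comments
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def provenance_hints(control_text):
    """Collect the fields that point at upstream and at the packaging VCS."""
    fields = source_stanza(control_text)
    return {
        "upstream_homepage": fields.get("Homepage"),          # upstream project
        "packaging_vcs": fields.get("Vcs-Git"),               # Debian packaging repo
        "packaging_vcs_browser": fields.get("Vcs-Browser"),
    }


hints = provenance_hints(CONTROL)
print(hints)
```

Note that the Vcs-* values describe the packaging repository, not upstream's, so none of these fields alone answers "which upstream commit was released".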

fwiw, the version numbers of debian packages do correspond to upstream
version numbers, up to the final hyphen ("-"), which separates the
debian revision from the upstream version. That is, if you see
libfoo_0.3-2, that's the second revision of debian packaging for
libfoo's upstream version 0.3. So if (say) OpenSSH fixes a bug in
version 9.8, it's probably fixed in debian's openssh_9.8-1. But it
could also be fixed (by backport, etc) in debian's openssh_9.6-2.
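
The split at the final hyphen can be sketched in a few lines of Python. The function name is made up for illustration. The optional leading "epoch:" is not mentioned above but is part of Debian's version format, so the sketch handles it too.

```python
def split_debian_version(version):
    """Split a Debian version into (epoch, upstream_version, debian_revision).

    The Debian revision is whatever follows the final hyphen; a version
    with no hyphen has no separate Debian revision (a "native" package).
    """
    epoch, sep, rest = version.partition(":")
    if not sep:
        # No epoch given, which is equivalent to epoch 0.
        epoch, rest = "0", version
    upstream, hyphen, revision = rest.rpartition("-")
    if not hyphen:
        upstream, revision = rest, None
    return epoch, upstream, revision


# "libfoo_0.3-2" is package name and version joined by an underscore.
package, _, version = "libfoo_0.3-2".partition("_")
print(package, split_debian_version(version))
```

Splitting at the last hyphen rather than the first matters because upstream versions may themselves contain hyphens.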

Note also that upstream's standards of what goes in the tarball might
not fit the standards of the downstream distribution. Debian is known
to "clean" tarballs before redistributing them if they contain any
material that is not "dfsg-free" (free according to the Debian Free
Software Guidelines). This often results in removal of proprietary
image or audio files, documentation that has non-modification
constraints, or outright non-redistributable binary blobs that upstream
is ignoring the copyright issues on. So even the "upstream source"
tarballs that debian has might not be byte-for-byte aligned with the
actual upstream tarballs. Debian usually notes that by appending
+dfsg1 or something like that to the upstream version number.
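
A tool comparing Debian's tarball against upstream's could use that suffix as a hint that a byte-for-byte match should not be expected. The sketch below is an assumption-laden heuristic: "+dfsg1" comes from the text above, while "+ds" and "~dfsg" are other spellings assumed here, and the suffix is a convention rather than an enforced rule.

```python
import re

# Assumed spellings of the "repacked tarball" marker; not an exhaustive list.
REPACK_SUFFIX = re.compile(r"[+~](?:dfsg|ds)\.?\d*$")


def strip_repack_suffix(upstream_version):
    """Return (version_without_suffix, was_repacked)."""
    match = REPACK_SUFFIX.search(upstream_version)
    if match is None:
        return upstream_version, False
    return upstream_version[:match.start()], True


for v in ("0.3+dfsg1", "2.0~dfsg", "0.3"):
    print(v, strip_repack_suffix(v))
```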

The debian project has seen some internal discussion about the
relationship between upstream revision control systems and debian
packaging revision control. upstream revision control systems are often
even "dirtier" in terms of things debian doesn't want to (or can't)
redistribute than the released tarballs, and revision control systems
are even harder to "clean" than a released tarball is. So it's unlikely
that we'll be able

But even given all of the constraints above, i agree that having a
systematic way to get back to the source code for those projects where
it's possible would be nice. Maybe we have a chance to carve out a
"best practices" space here, at least to be able to highlight which
packages/upstreams meet these practices, and which could use
improvement?

However, is binary-transparency the right place to do that work? If we
do, how will that work interact with proprietary vendors who want to
provide some level of binary transparency themselves?

--dkg

Ben Laurie

Oct 28, 2015, 2:02:48 PM
to Daniel Kahn Gillmor, binary-tr...@googlegroups.com, Holger Levsen, Ed Maste, opensec
On Wed, 28 Oct 2015 at 15:34 Daniel Kahn Gillmor <d...@fifthhorseman.net> wrote:
> Hi Ben--
>
> On Wed 2015-10-28 08:37:11 -0400, 'Ben Laurie' via binary-transparency <binary-tr...@googlegroups.com> wrote:
> > I've noticed that neither FreeBSD nor Debian (AFAIK) include in their
> > package metadata where the original source repo is, nor what actually
> > version from that repo is released.
>
> You're using a few terms here that have multiple/ambiguous meanings so
> i'm going to try to sort them out.
>
> I think you might be talking about one of these things:
>
>  a) the "upstream" source code per individual package
>  b) the "upstream" revision control system per individual package
>  c) the distribution's repository of source code
>  d) the distribution's repository itself
>
> I think you don't mean (c) or (d), but in case you do:

I meant b.
 

> metadata/debian/main/binary-all/Release in the binary-transparency-notes
> repo contains this:
>
> -----
> Archive: stable
> Origin: Debian
> Label: Debian
> Version: 8.2
> Component: main
> Architecture: all
> -----
>
> If you're talking about the source code and not the binaries, there is a
> corresponding main/source/Release -- i can include an example of that in
> the binary-transparency-notes repo as well if people are interested.  It
> contains digests of the "upstream source" tarballs, which debian also
> publishes and distributes (more on upstream source tarballs below).
>
> You identify two different pieces of information: "where the source repo
> is" -- do you mean by URL?

Yes. Or equivalent (e.g. I think CVS repos' locations are not strictly URLs). Basically, "what do I tell what tool to get a copy of the actual repo developers work on?".

To be clear, the "debian packaging itself" is essentially metadata and not a mirror of b?

So, it seems to me there are two possible aims for this kind of work in the context of a packaging system.

1. Try to tightly integrate the upstream repo into the packaging process.

2. Allow audit of the packaging process.

It seems to me that 2 is probably a more interesting immediate target than 1, particularly given the many technical barriers you raise.

I confess, though, that my interest was more driven by a desire to gather metrics than directly to do with packaging. That said...
 
> However, is binary-transparency the right place to do that work?

...it seems to me that binary transparency is ultimately about provenance. In that context, the ability to go all the way back to the original work of the developers seems like a key component to provenance, even if it's only ultimately useful to humans rather than machines.

But I guess we're not the only interested parties - should we start encouraging some kind of self-published metadata in projects, which includes this kind of thing? And if it's missing, Debian might (or might not) choose to provide it itself...


 
> If we do, how will that work interact with proprietary vendors who want to
> provide some level of binary transparency themselves?

It seems to me that the option "upstream repo: private" is entirely fair.

Daniel Kahn Gillmor

Oct 29, 2015, 12:33:32 PM
to Ben Laurie, binary-tr...@googlegroups.com, Holger Levsen, Ed Maste, opensec
On Wed 2015-10-28 14:02:35 -0400, Ben Laurie <be...@google.com> wrote:
> Yes. Or equivalent (e.g. I think CVS repos' locations are not strictly
> URLs). Basically, "what do I tell what tool to get a copy of the actual
> repo developers work on?".

sure, having this be automated would be great. there are versioning
dependencies here too (e.g. what if version X of VCS A doesn't support
URLs of type foo://, but version Y does?)

>> debian packages contain a file debian/copyright which should indicate
>> the URL of the upstream project, and they often also contain a file
>> debian/watch, which indicates the URL where upstream's released tarballs
>> can be found. In any package, debian/control also has an optional
>> Homepage: field which points to the URL of the upstream project (and is
>> replicated in the Packages.xz files), and a Vcs-Git: or Vcs-Browser: field
>> which points to revision control for the debian packaging itself. so
>> all of this info is extractable from the archive itself, even if some of
>> it is not in the specific metadata you find in the Packages.xz file.
>
> To be clear, the "debian packaging itself" is essentially metadata and not
> a mirror of b?

Debian packaging can contain metadata, patches, and code, depending on
the state of the upstream project and how well it fits into debian
default packaging techniques.

The VCS that tracks the debian packaging can either contain only the
debian packaging or it can contain both the debian packaging and the
upstream source. For my own debian packaging work i always include the
upstream source in the VCS, but this isn't a requirement within debian
(these practices have grown up organically, and aren't mandated by the
distro).

> So, it seems to me there are two possible aims for this kind of work
> in the context of a packaging system.
>
> 1. Try to tightly integrate the upstream repo into the packaging process.
>
> 2. Allow audit of the packaging process.
>
> It seems to me that 2 is probably a more interesting immediate target than
> 1, particularly given the many technical barriers you raise.
>
> I confess, though, that my interest was more driven by a desire to gather
> metrics than directly to do with packaging. That said...

If your goal is just to gather metrics, then most of the data you're
asking about is already available, albeit in different forms for
different distros. There are lots of metrics that would be interesting
to review, but clearly we can't overload b-t with all of them.

And there are risks of confusion for some kinds of metrics as well. For
example, if we report something like this:

{ 'package': 'foo',
  'version': '1.2-3',
  'distro': { 'name': 'lunix',
              'version': '9.2',
              'platform': 'x86_64-linux-gnu'
            },
  'sha256': 'b42e48b3a6bf6e407688be6f11908cb0ef8b9a58',
  'upstream_repo': { 'type': 'git',
                     'url': 'https://git.example/projects/foo',
                     'commit': '18911075732d64426bafccdde6a6b3656727da0d' }
}

and the foo project is known to have a serious bug that is not fixed
in the referenced upstream_repo.commit, it would be easy to mistakenly
claim that lunix 9.2 hasn't fixed that issue in their foo package. But
it's possible that they have added a patch to fix it already.
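
One way a metrics tool could avoid that mistake is to never report "vulnerable" from the upstream commit alone. The Python sketch below reuses the record shape from the example; the function name, the status strings, and the idea of passing in a set of upstream commits known to contain the fix are all hypothetical.

```python
# Record shaped like the example above; only the fields used here are kept.
record = {
    "package": "foo",
    "version": "1.2-3",
    "upstream_repo": {
        "type": "git",
        "url": "https://git.example/projects/foo",
        "commit": "18911075732d64426bafccdde6a6b3656727da0d",
    },
}


def fix_status(record, commits_with_fix):
    """Classify a package against upstream commits known to contain a fix.

    The upstream commit can only prove that a fix is present. If the fix
    is absent upstream, the distro may still carry its own patch, so the
    answer is "unknown" rather than "vulnerable".
    """
    commit = record["upstream_repo"]["commit"]
    if commit in commits_with_fix:
        return "fixed-upstream"
    return "unknown"


print(fix_status(record, commits_with_fix=set()))
```

Resolving "unknown" would need distro-side information, such as the patches shipped in the packaging.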

Note also that some upstream repositories produce multiple binary
artifacts (e.g. the "pinentry" upstream source produces a half-dozen
binaries in debian, all producing password prompt mechanisms in
different environments).

Similarly, the relationship between the source code and the generated
binary artifact isn't a clean mapping. As the reproducible-builds
project has shown, the toolchain used to convert the source to a binary
artifact is also relevant. in r-b, we've been capturing the manifest of
all packages used during build as a "buildinfo" file. Is b-t the right
place to record that info as well?

>> However, is binary-transparency the right place to do that work?
>
> ...it seems to me that binary transparency is ultimately about provenance.
> In that context, the ability to go all the way back to the original work of
> the developers seems like a key component to provenance, even if it's only
> ultimately useful to humans rather than machines.

I think i'm going to push back on this a bit. Is b-t ultimately about
provenance? Currently i see b-t as offering users some assurance that
the package they're seeing is seen by everyone who relies on the same
cryptographic authority. This allows the user to be sure that if
they're getting malware, at least everyone else is getting malware
too. This provides a deterrent to vendors who are being pressured to
ship customized malware to some of their users.
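
As a toy model of that last-hop check (not how any actual b-t log is implemented), a client could hash what it downloaded and confirm that the same name, version, and digest appear in a log that everyone else can also read. The function names and the in-memory set standing in for the log are assumptions. A real design would more likely rely on signed, append-only logs with inclusion proofs.

```python
import hashlib


def package_digest(data: bytes) -> str:
    """SHA-256 of the package bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_public_log(name, version, data, public_log):
    """True if exactly this artifact is recorded in the shared log.

    public_log is a set of (name, version, sha256_hex) tuples standing in
    for a log that every user of the same vendor can consult.
    """
    return (name, version, package_digest(data)) in public_log


published = b"bytes of the package the vendor shipped to everyone"
tampered = b"bytes of a package customized for one target"
log = {("foo", "1.2-3", package_digest(published))}

print(matches_public_log("foo", "1.2-3", published, log))
print(matches_public_log("foo", "1.2-3", tampered, log))
```

The check says nothing about whether the logged package is benign, only that it is the same one everyone else gets.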

Is this provenance? i see it more as oversight or accountability for
the last-hop leg that software travels from vendor to end-user.

If a malicious binary artifact is discovered, and its associated b-t log
somehow does point all the way back to upstream source code, and that
source code has clear evidence of malice, great, we found the attack.
But if the upstream source code does *not* have evidence of malice, we
don't know where the problem was introduced.

b-t will be simpler if it just focuses on the last hop (the link between
vendor and user), and leaves it to the individual vendors to justify any
failures further back in the chain.

That said, i understand the desire to link binary artifacts all the way
back to source code, both from the FLOSS-advocacy perspective and from
the reproducible-builds perspective. And if b-t could provide that
linkage, i'd be happy. I'm just wary about trying to make b-t do all
those things.

> But I guess we're not the only interested parties - should we start
> encouraging some kind of self-published metadata in projects, which
> includes this kind of thing? And if it's missing, Debian might (or might
> not) choose to provide it itself...

I listed several ways that this kind of data is already provisionally
published in Debian. If there are specific things that you think are
missing, i'd be happy to look into ways to improve debian infrastructure
to allow it. let me know!

>> If we do, how will that work interact with proprietary vendors who
>> want to provide some level of binary transparency themselves?
>
> It seems to me that the option "upstream repo: private" is entirely fair.

sure, sounds fine to me :)

--dkg

Edward Tomasz Napierała

Nov 6, 2015, 11:12:28 AM
to Ben Laurie, binary-tr...@googlegroups.com, Holger Levsen, Ed Maste, opensec
On 1028T1237, 'Ben Laurie' via binary-transparency wrote:
> I've noticed that neither FreeBSD nor Debian (AFAIK) include in their
> package metadata where the original source repo is, nor what actually
> version from that repo is released.

True. It should be relatively easy to do for FreeBSD - just note
the ports tree repo path (e.g. svn+ssh://svn.freebsd.org/ports/head)
and revision (e.g. 395080). The ports metadata includes all the
information on how to fetch the source code and what the SHA256
of the tarballs should be. For reproducibility we'd also need
the repo path and revision for the base system running on the build
host - that defines the system headers and the toolchain.
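
A minimal Python sketch of what recording and checking that might look like follows. It assumes the usual distinfo line shapes, "SHA256 (file) = hex" and "SIZE (file) = bytes". The distfile name, its contents, and the build_record layout are hypothetical, and the base system repo path and revision are left out because no values for them are given above.

```python
import hashlib
import re

# Matches lines such as "SHA256 (foo-1.2.tar.gz) = <hex>".
DISTINFO_LINE = re.compile(
    r"^(?P<kind>SHA256|SIZE) \((?P<name>[^)]+)\) = (?P<value>\S+)$"
)


def parse_distinfo(text):
    """Map each distfile name to its recorded SHA256 and SIZE."""
    entries = {}
    for line in text.splitlines():
        m = DISTINFO_LINE.match(line.strip())
        if m:
            entries.setdefault(m["name"], {})[m["kind"]] = m["value"]
    return entries


def verify_distfile(name, data, entries):
    """Check the fetched bytes against the recorded size and digest."""
    expected = entries[name]
    return (len(data) == int(expected["SIZE"])
            and hashlib.sha256(data).hexdigest() == expected["SHA256"])


# Stand-in tarball bytes and a distinfo generated from them for the demo.
data = b"stand-in for the bytes of foo-1.2.tar.gz"
distinfo = (
    f"SHA256 (foo-1.2.tar.gz) = {hashlib.sha256(data).hexdigest()}\n"
    f"SIZE (foo-1.2.tar.gz) = {len(data)}\n"
)
entries = parse_distinfo(distinfo)

# Hypothetical record tying a built package back to the ports tree.
build_record = {
    "ports_repo": "svn+ssh://svn.freebsd.org/ports/head",
    "ports_revision": 395080,
    "distfiles_ok": verify_distfile("foo-1.2.tar.gz", data, entries),
}
print(build_record)
```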
