Adding the GrypeDB data into GSD


Josh Bressers

Apr 5, 2022, 4:48:08 PM
to GSD
Hi all,

I'm working on a script to start adding the GrypeDB data into GSD. You can see the start of my script here

It's going to output the GrypeDB data into OSV format.

And I've hit an interesting challenge. At the moment the script will spit out just the advisory types because I needed to figure something out.

There are CVE IDs, and GHSA, but also a lot of ELSA (Oracle Linux) and ALSA (Amazon Linux) advisories.

How do we want to handle these types of overlapping identifiers? If there is a CVE or GHSA already, should we just add metadata to the existing ID? Should we let them have their own GSD that also has one or more related tags?

I see value in both approaches.

The old way would be to overload one ID as the "primary" ID as much as possible. This would probably be CVE, given it is both the most widely used and the least flexible.

Given we have a large number of available integers (nearly infinite), and we are targeting machines as the intended audience, it's also easy to say just give every possible identifier its own GSD ID.

Thoughts?

-- 
     Josh

Oliver Chang

Apr 6, 2022, 1:17:11 AM
to Josh Bressers, GSD
That's awesome!! I see you've already filed https://github.com/ossf/osv-schema/issues/40, but let us know if you run into any other difficulties. 

Side question: Is there a license for GrypeDB data?


I think we should pull advisories that have a well defined source (e.g. GHSA, ELSA, ALSA) from the original source only, as it's the most authoritative. This isn't convenient or easy for every source today, and GrypeDB will be a big help in filling in the gaps.

For any given ID, if there is in fact a direct source where this can be pulled from (e.g. GitHub's GHSA repo in OSV format), then this should be trusted over other sources that provide metadata for the same ID. If some other database wants to provide metadata for the same vulnerability, then they should get/use their own ID. This will make updates and keeping things in sync a lot easier. 
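As a rough illustration of that precedence rule (not something GSD or OSV is known to ship), a minimal Python sketch might look like the following. The `Record` class, the `SOURCE_RANK` table, and the source labels "upstream" and "aggregator" are all assumptions made up for the example, and the ID is a placeholder.

```python
from dataclasses import dataclass

# Hypothetical ranking: a lower number means a more authoritative source.
SOURCE_RANK = {"upstream": 0, "aggregator": 1}


@dataclass
class Record:
    id: str       # advisory identifier, e.g. a GHSA or ELSA ID
    source: str   # illustrative label: "upstream" or "aggregator"
    summary: str


def pick_authoritative(records):
    """Keep one record per ID, preferring the most authoritative source."""
    best = {}
    for rec in records:
        current = best.get(rec.id)
        if current is None or SOURCE_RANK[rec.source] < SOURCE_RANK[current.source]:
            best[rec.id] = rec
    return best


records = [
    Record("GHSA-aaaa-bbbb-cccc", "aggregator", "copy from an aggregator"),
    Record("GHSA-aaaa-bbbb-cccc", "upstream", "copy from the original source"),
]
chosen = pick_authoritative(records)
```

Here `chosen` holds the upstream copy for the shared ID; an aggregator that wants to publish its own metadata would need a record under its own identifier instead.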

CVEs are an exception, because they're rather overloaded -- if those are the primary key for a source, then perhaps it makes sense to assign them their own GSD to disambiguate them? 


Kurt Seifried

Apr 6, 2022, 1:59:39 AM
to Oliver Chang, Josh Bressers, GSD
On Tue, Apr 5, 2022 at 11:17 PM 'Oliver Chang' via GSD Discussion Group <g...@groups.cloudsecurityalliance.org> wrote:

I think we should pull advisories that have a well defined source (e.g. GHSA, ELSA, ALSA) from the original source only, as it's the most authoritative. This isn't convenient or easy for every source today, and GrypeDB will be a big help in filling in the gaps.

Yes
 

For any given ID, if there is in fact a direct source where this can be pulled from (e.g. GitHub's GHSA repo in OSV format), then this should be trusted over other sources that provide metadata for the same ID. If some other database wants to provide metadata for the same vulnerability, then they should get/use their own ID. This will make updates and keeping things in sync a lot easier. 

I think here a simple solution is to look at "what's the most upstream advisory", e.g. if you have a RedHat thing then the most authoritative sources are:


And for Debian:


And for mediawiki:


and so on. Having said that, if we find some random web page that mentions an ID we understand (e.g. "DSA-foo"), especially if we already have a record of it (e.g. some file with an "alias": "DSA-foo"), I say we add it in, or at least flag it for human review if we don't know how to process the site automatically (file an issue against the file in our GitHub?). Basically, we try automation first and fall back to "human, please help me!". That also creates a work queue of data that is relatively easy for a human to parse, which provides a gentle on-ramp for doing this kind of infosec work.
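The "try automation, then ask a human" flow could be sketched roughly as below. This is only an illustration under assumptions: the `ingest` function, the shape of the `handlers` mapping (hostname to parser), the review-queue entries, and the `tracker.example` / `blog.example` hosts are all hypothetical, not part of any existing GSD tooling.

```python
from urllib.parse import urlparse


def ingest(url, mentioned_id, handlers, review_queue):
    """Try an automated handler for the page; fall back to human review."""
    host = urlparse(url).hostname
    handler = handlers.get(host)
    if handler is not None:
        try:
            return handler(url, mentioned_id)
        except Exception as exc:  # any parser failure is handed to a human
            reason = f"handler failed: {exc}"
    else:
        reason = "no automated handler for this site"
    # The queue entry keeps just enough context for a person to pick it up.
    review_queue.append({"url": url, "id": mentioned_id, "reason": reason})
    return None


# Hypothetical handler table: one parser per site we know how to process.
handlers = {
    "tracker.example": lambda url, ident: {"id": ident, "references": [url]},
}
queue = []
parsed = ingest("https://tracker.example/advisory", "DSA-foo", handlers, queue)
skipped = ingest("https://blog.example/post", "DSA-foo", handlers, queue)
```

In this sketch the first call is handled automatically, while the second lands in `queue` with a reason attached, which is the kind of entry that could be turned into a GitHub issue for review.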
 
CVEs are an exception, because they're rather overloaded -- if those are the primary key for a source, then perhaps it makes sense to assign them their own GSD to disambiguate them? 

I think here we have some flexibility because we can update and link our data, e.g. the parent-of/sibling-of/etc. relationship ideas. My preference would be to have both. An extreme example is CVE-2021-44228: we could have one big advisory with 5000 affected entries (https://github.com/cisagov/log4j-affected-db/), or a huge pile of GSDs (one per vendor?) with an optional "primary" GSD. Either way it's a lot of data and a bit messy, but I think splitting it into files (e.g. one per vendor) would be easier to process in most situations.

Josh Bressers

Apr 6, 2022, 8:54:37 AM
to Oliver Chang, GSD
On Wed, Apr 6, 2022 at 12:17 AM Oliver Chang <och...@google.com> wrote:


That's awesome!! I see you've already filed https://github.com/ossf/osv-schema/issues/40, but let us know if you run into any other difficulties.

I expect to keep filing issues as things pop up, thanks!
 

Side question: Is there a license for GrypeDB data?

This is a huge question, actually. The license on a great deal of this data is unknowable. You can't copyright facts, for example, and much of the data is facts, but you can copyright the prose descriptions. Descriptions in every vulnerability aggregation system come from a variety of sources, some of which have no license at all.

The intention of the GrypeDB data is to let the GSD treat the data however it wishes, but the GSD is aware this is not so simple.
 


I think we should pull advisories that have a well defined source (e.g. GHSA, ELSA, ALSA) from the original source only, as it's the most authoritative. This isn't convenient or easy for every source today, and GrypeDB will be a big help in filling in the gaps.

For any given ID, if there is in fact a direct source where this can be pulled from (e.g. GitHub's GHSA repo in OSV format), then this should be trusted over other sources that provide metadata for the same ID. If some other database wants to provide metadata for the same vulnerability, then they should get/use their own ID. This will make updates and keeping things in sync a lot easier. 

CVEs are an exception, because they're rather overloaded -- if those are the primary key for a source, then perhaps it makes sense to assign them their own GSD to disambiguate them?

I think there's a disconnect in this answer. I'm not suggesting the GrypeDB data be treated as authoritative. It will exist in a namespace in GSD. This data is supplementary to the authoritative sources. The real value this data adds is the affected versions; the rest is mostly irrelevant, to be honest, because we will want the authoritative copy.

That said, I think this answers my question. Overloading existing identifiers is wrong. I think we should treat every source as its own authority and capture that data whenever possible.

For example: rather than trying to untangle which CVEs a GHSA might reference, we should log the CVEs and the GHSA separately, and then rely on aliases to connect things.
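One way to picture "log them separately and let aliases connect things" is a small union-find pass over the records. This is a minimal sketch, assuming each record is a dict with an "id" and an optional "aliases" list (as in OSV-style data); the `group_by_alias` helper and the IDs are placeholders invented for the example.

```python
def group_by_alias(records):
    """Group identifiers that are connected through alias links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for rec in records:
        find(rec["id"])  # register the record even if it has no aliases
        for alias in rec.get("aliases", []):
            union(rec["id"], alias)

    groups = {}
    for ident in list(parent):
        groups.setdefault(find(ident), set()).add(ident)
    return sorted(sorted(group) for group in groups.values())


# Placeholder IDs: each source keeps its own record, aliases do the linking.
records = [
    {"id": "GHSA-aaaa-bbbb-cccc", "aliases": ["CVE-0000-0001"]},
    {"id": "CVE-0000-0001"},
    {"id": "ELSA-0000-0002"},
]
groups = group_by_alias(records)
```

With this input the GHSA and CVE placeholders end up in one group and the ELSA placeholder in its own, without any record having to absorb another source's data.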

-- 
     Josh

Weston Steimel

Apr 6, 2022, 11:51:39 AM
to GSD Discussion Group, Josh Bressers, GSD, Oliver Chang
OSV uses the source identifier as the primary id for the record when available, and then uses the aliases and related fields to reference other identifiers. The currently used prefixes and original sources of data are listed at https://ossf.github.io/osv-schema/#id-modified-fields. I have thought it'd be really cool to get all of the various Linux distro feeds also exporting OSV format, to osv.dev at least, but haven't yet had time to work on that. Perhaps that is something Oliver has already been considering?
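For readers unfamiliar with the layout, here is a minimal sketch of a record shaped that way. The field names `id`, `modified`, `aliases` and `related` come from the OSV schema; the `make_record` and `id_prefix` helpers, the timestamp, and the IDs are illustrative assumptions only, and a real record would carry more fields.

```python
import json


def make_record(source_id, modified, aliases=(), related=()):
    """Build a bare-bones OSV-style record keyed by the source's own ID."""
    return {
        "id": source_id,              # identifier assigned by the source
        "modified": modified,         # RFC 3339 timestamp
        "aliases": sorted(aliases),   # other IDs for the same vulnerability
        "related": sorted(related),   # IDs that are related but not the same
    }


def id_prefix(osv_id):
    """Return the prefix that identifies the originating database."""
    return osv_id.split("-", 1)[0]


# Placeholder values, not a real advisory.
record = make_record(
    "GHSA-aaaa-bbbb-cccc",
    "2022-04-06T00:00:00Z",
    aliases=["CVE-0000-0001"],
)
print(json.dumps(record, indent=2))
```

The prefix (here "GHSA") is what tells a consumer which database the record came from, while the CVE placeholder is reachable only through `aliases`.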

Thanks,
--Weston Steimel