Data format discussion - OSV especially


Kurt Seifried

Jul 18, 2021, 5:22:51 PM
to u...@groups.cloudsecurityalliance.org
So I like the OSV format (https://osv.dev/docs/#tag/vulnerability_schema), I think it fits well with the OpenSource/source code management (e.g. git) view of the world (this package, this commit introduces the vuln, this commit fixes it). It should work really well for distros (are we vuln? Do we include the code from commit X? etc.). 

But where I'm running into trouble is e.g. CVE-2021-28692 in Debian stable:

https://www.debian.org/security/2021/dsa-4931
https://security-tracker.debian.org/tracker/CVE-2021-28692
Debian stable (buster)
xen prior to 4.11.4+107-gef32c7afa2-1.

This data won't "fit" in the OSV schema, e.g. do we assign an OSV for debian and use the:

"package": {
  "name": "string",
  "ecosystem": "string",
  "purl": "string"
}

to specify "xen" in the ecosystem Debian? Or do I overload the 

"versions": [
"string"
]

entirely with some CPE like string:

Debian:buster:xen:<4.11.4+107-gef32c7afa2-1

Or (as I suspect) is OSV not aimed at solving this part of the problem, instead focusing on the upstream/commit-centric view of the world that then flows downstream?

Part of me thinks it would be better to have data formats that actually specialize in different views of the world (e.g. OSV for the upstream-centric view, something else for the distro view, and something else entirely for, say, malware). The advantage of UVI is that we can hold multiple types of data (by simple virtue of using JSON and assigning namespaces), so e.g. OSV can put data in, we could simply mass-import the Alpine Linux format, etc.

I'm also tired of "this will be the one true JSON (or XML) to describe all aspects of security" (like the terrible CVE JSON format I invented, CVRF, CSAF, etc.). Thoughts/comments on how best to support multiple formats, give hints, etc. are welcome.


Kurt Seifried
Chief Blockchain Officer and Director of Special Projects
Cloud Security Alliance

Oliver Chang

Jul 18, 2021, 9:58:11 PM
to Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hey Kurt,

Thank you very much for the comments!

On Mon, 19 Jul 2021 at 07:22, Kurt Seifried <ksei...@cloudsecurityalliance.org> wrote:
So I like the OSV format (https://osv.dev/docs/#tag/vulnerability_schema), I think it fits well with the OpenSource/source code management (e.g. git) view of the world (this package, this commit introduces the vuln, this commit fixes it). It should work really well for distros (are we vuln? Do we include the code from commit X? etc.). 

But where I'm running into trouble is e.g. CVE-2021-28692 in Debian stable:


Debian stable (buster)
xen prior to 4.11.4+107-gef32c7afa2-1.

This data won't "fit" in the OSV schema, e.g. do we assign an OSV for debian and use the:

"package": {
"name": "string",
"ecosystem": "string",
"purl": "string"
}

The intention for this case is to use something like:

"package": {
  "name": "xen",
  "ecosystem": "Debian",
  "purl": "..."
}

Our format doesn't require git commit info either, and you can instead specify version ranges by e.g.

"ranges": [{
  "type": "ECOSYSTEM",
  "introduced": "4.10.1-4",
  "fixed": "4.11.4+107-gef32c7afa2-1"
}]

(and of course the explicit "versions" list as well).
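A consumer-side sketch of how such a range could be evaluated. This is not official OSV tooling: real Debian versions must be compared with dpkg ordering semantics, so the comparator is injected, and the toy comparator below is explicitly NOT dpkg-correct.

```python
# Sketch: is `version` inside any OSV-style "ECOSYSTEM" range?
# The comparator is pluggable because ecosystem version ordering
# (e.g. dpkg rules for Debian) is not plain string comparison.
def affected(version, ranges, cmp):
    """cmp(a, b) returns negative/zero/positive, like C's strcmp."""
    for r in ranges:
        if r.get("type") != "ECOSYSTEM":
            continue
        introduced = r.get("introduced")
        fixed = r.get("fixed")
        if introduced is not None and cmp(version, introduced) < 0:
            continue  # before the vulnerability was introduced
        if fixed is not None and cmp(version, fixed) >= 0:
            continue  # at or past the fix
        return True
    return False

# Toy comparator: numeric dotted prefix only (NOT dpkg-correct).
def toy_cmp(a, b):
    def key(v):
        return [int(p) for p in v.split("+")[0].split("-")[0].split(".")]
    ka, kb = key(a), key(b)
    return (ka > kb) - (ka < kb)

ranges = [{"type": "ECOSYSTEM",
           "introduced": "4.10.1-4",
           "fixed": "4.11.4+107-gef32c7afa2-1"}]
```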

The OSV format doesn't aim to be a format that holds affected package data for every ecosystem in a single entry. We want this to be a distributed format with separate entries for each package ecosystem, and tracking them separately enables more flexibility with text descriptions etc and allows a more distributed/federated model. They can cross reference each other by "aliases" or "related" fields.
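That distributed model can be sketched as follows. The entry IDs and alias values here are invented for illustration; only the "aliases" field name comes from the schema.

```python
# Sketch: cluster separate per-ecosystem entries that describe the same
# issue, by treating shared "aliases" values as links (union-find).
def group_by_alias(entries):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for e in entries:
        for alias in e.get("aliases", []):
            union(e["id"], alias)

    clusters = {}
    for e in entries:
        clusters.setdefault(find(e["id"]), set()).add(e["id"])
    return clusters

# Hypothetical entries: two distro databases both aliasing the same CVE.
entries = [
    {"id": "DEBIAN-XEN-1", "aliases": ["CVE-2021-28692"]},
    {"id": "ALPINE-XEN-1", "aliases": ["CVE-2021-28692"]},
    {"id": "OTHER-1", "aliases": []},
]
```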

Do you agree with this goal? Of course, we are always open to changes to the schema to make it more useful. 


to specify "xen" in the ecosystem Debian? Or do I overload the 

"versions": [
"string"
]

entirely with some CPE like string:

Debian:buster:xen:<4.11.4+107-gef32c7afa2-1

Or (as I suspect) is OSV not aimed at solving this part of the problem instead focussing on the upstream/commit centric view of the world that then flows downstream?

Part of me thinks it would be better to have data formats that actually specialize on different views of the world (e.g. OSV for upstream centric view, something else for the distro view, and something entirely else for malware for example). The advantage of UVI being that we can hold multiple types of data (by simple virtue of using JSON and assigning namespaces so e.g. OSV can put data in, we could simply mass import the Alpine Linux format, etc.).

Alpine is also adopting the OSV format, using {"ecosystem": "Alpine", "package": "pkg"} etc, and also planning some JSON-LD extensions to enable better sharing of vulnerability data across different distros. I think it would be nice if we could converge on a format for open source to enable easier sharing and tooling across all databases. 

I'm also tired of "this will be the one true JSON (or XML) to describe all aspects of security" (like the terrible CVE JSON format I invented, CVRF, CSAF, etc...). Thoughts/comments on how to best support multiple formats, give hints/etc are welcome.

The CVE board also seems interested in better aligning their CVE 5.0 schema with the OSV schema. We'll see where that takes us as well.  


Kurt Seifried
Chief Blockchain Officer and Director of Special Projects
Cloud Security Alliance


Ariadne Conill

Jul 18, 2021, 11:06:58 PM
to Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hi,

On Sun, 18 Jul 2021, Kurt Seifried wrote:

> So I like the OSV format (https://osv.dev/docs/#tag/vulnerability_schema), I think it fits well with the OpenSource/source code management (e.g. git) view of the world (this package, this commit
> introduces the vuln, this commit fixes it). It should work really well for distros (are we vuln? Do we include the code from commit X? etc.).

I am planning to adopt this format for use in Alpine's security issue
tracking database, with a couple of extensions.

> But where I'm running into trouble is e.g. CVE-2021-28692 in Debian stable:
>
> https://www.debian.org/security/2021/dsa-4931
> https://security-tracker.debian.org/tracker/CVE-2021-28692
>
> Debian stable (buster)
> xen prior to 4.11.4+107-gef32c7afa2-1.
>
> This data won't "fit" in the OSV schema, e.g. do we assign an OSV for debian and use the:
>
> "package": {
> "name": "string",
> "ecosystem": "string",
> "purl": "string"
> }
>
> to specify "xen" in the ecosystem Debian? Or do I overload the
>
> "versions": [
> "string"
> ]
>
> entirely with some CPE like string:
>
> Debian:buster:xen:<4.11.4+107-gef32c7afa2-1
>
> Or (as I suspect) is OSV not aimed at solving this part of the problem instead focussing on the upstream/commit centric view of the world that then flows downstream?

In Alpine, we plan to do something like:

"versions": [
{"type": "RANGE", "earliest": ..., "latest": ...}
]

I forget what the actual fields are for the range type, but they are
documented in the OSV format spec. There is the question of how to signal
that those are *Alpine*-specific versions, but we will deal with it when
we get there.

> Part of me thinks it would be better to have data formats that actually specialize on different views of the world (e.g. OSV for upstream centric view, something else for the distro view, and
> something entirely else for malware for example). The advantage of UVI being that we can hold multiple types of data (by simple virtue of using JSON and assigning namespaces so e.g. OSV can put data
> in, we could simply mass import the Alpine Linux format, etc.).

I don't think this is the right way. We should want to have a unified,
standard JSON-based vocabulary for describing vulnerability data. There
are aspects about the OSV format I dislike, but I believe that it can be
worked into a truly universal format which makes the most of Linked Data.

I intend to write about my vision for that soon.

Ariadne

Oliver Chang

Jul 18, 2021, 11:27:32 PM
to Ariadne Conill, Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hmm, that seems similar enough to the actual OSV spec (type: "ECOSYSTEM" instead and "introduced", "fixed") -- could we use that one instead for now? If there are changes that need to be made we can do this as part of the actual spec. 


> Part of me thinks it would be better to have data formats that actually specialize on different views of the world (e.g. OSV for upstream centric view, something else for the distro view, and
> something entirely else for malware for example). The advantage of UVI being that we can hold multiple types of data (by simple virtue of using JSON and assigning namespaces so e.g. OSV can put data
> in, we could simply mass import the Alpine Linux format, etc.).

I don't think this is the right way.  We should want to have a unified,
standard JSON-based vocabulary for describing vulnerability data.  There
are aspects about the OSV format I dislike, but I believe that it can be
worked into a truly universal format which makes the most of Linked Data.

I intend to write about my vision for that soon.

Ariadne

Oliver Chang

Jul 19, 2021, 2:22:40 AM
to Ariadne Conill, Kurt Seifried, u...@groups.cloudsecurityalliance.org
On Mon, 19 Jul 2021 at 13:07, Ariadne Conill <ari...@dereferenced.org> wrote:
Would you be able to talk briefly about which parts you dislike and think could be improved on? It would be nice to align on this early :) 

I intend to write about my vision for that soon.

Ariadne

Ariadne Conill

Jul 19, 2021, 3:55:26 AM
to Oliver Chang, Ariadne Conill, Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hi,
Yes, that was what I was referring to. We will be using that once the
secfixes-tracker is outputting data in this format.

Ariadne

Ariadne Conill

Jul 19, 2021, 4:05:36 AM
to Oliver Chang, Ariadne Conill, Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hi,

On Mon, 19 Jul 2021, Oliver Chang wrote:

Basically, I want to be able to use URIs to identify things, such as
ecosystems, so that you can walk along the URIs in the JSON document to
discover more information. This is basically what I am getting at when I
talk about Linked Data and how it is very applicable here.

For example, a vulnerability could have data in many databases, and
allowing for a URI-first design makes it very easy to write tooling to
walk along the whole semantic graph for a given vulnerability:

"references": [
"https://cve.mitre.org/CVE-whatever",
"https://security.alpinelinux.org/vuln/ALSA-whatever",
...
]

The idea is that all of these URIs would point to additional data, in the
same format, that could be ingested automatically by tooling if desired,
and that tooling would know that all of these objects are all related to
each other (e.g. they're all about the same thing, but different views
of that thing).
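A minimal sketch of that graph walk, assuming every reference URI serves another document in the same format. fetch() here reads a fake in-memory store (with made-up example.* URIs); a real crawler would do HTTP GETs.

```python
from collections import deque

# Sketch: follow "references" URIs breadth-first, collecting every
# document reachable from a starting vulnerability entry.
def walk(start_uri, fetch):
    seen = set()
    queue = deque([start_uri])
    while queue:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        doc = fetch(uri)
        if doc is None:
            continue  # dead link; mirrors/archives would help here
        for ref in doc.get("references", []):
            queue.append(ref)
    return seen

# Hypothetical two-database graph that cross-references itself.
STORE = {
    "https://cve.example/CVE-X": {
        "references": ["https://distro.example/ALSA-X"]},
    "https://distro.example/ALSA-X": {
        "references": ["https://cve.example/CVE-X"]},
}
```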

For me, this is obvious, as I have been dealing with distributed graphs
for a long time, but also because I grew up having a librarian as a
mother, and so the whole Linked Data concept as advanced by W3C has been
something demonstrated to me basically my entire life :)

It is just hard to explain it to people who are not familiar with
web-scale datasets. Basically, I want to make sure that the format we
create makes the most of what the web has to offer, since we are exporting
all this data over it anyway. This isn't made easy by the fact that the
semantic web / RDF (and even JSON-LD) folks have not really made any
killer demos for the concept.

(Note I am not in favor of making JSON-LD a hard requirement for this, nor
do I think it is even required to do so, but building a system where URIs
are a first-class citizen makes it possible to build very powerful
tooling. ActivityPub is a good model here: you can use it with or
without JSON-LD, and still make the most of the Linked Data approach.)

Ariadne

Ariadne Conill

Jul 19, 2021, 4:11:15 AM
to Ariadne Conill, Oliver Chang, Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hi,
And where this becomes *really* awesome is when you're a researcher trying
to figure out what exploitation vectors a given malware attack is using.
You could have something like:

"exploitation_vectors": [
  "https://msrc.microsoft.com/CVE-2021-34481",
  "https://cve.mitre.org/CVE-2021-34481",
  "https://exploit-db.com/CVE-2021-34481"
]

Obviously these are not real objects (yet), but that's basically the world
I am trying to build. This is what UVI should try to push for, in my
opinion.

But I think we all agree that if you have a universal vulnerability data
format, built on linked data, where you can follow these URIs and get more
data for your research / enrichment needs, it will be a pretty awesome
world, which is why I am pushing as hard as I can for URIs and Linked Data
to be baked into the spec from day 1.

Ariadne

Kurt Seifried

Jul 19, 2021, 9:45:40 PM
to Oliver Chang, u...@groups.cloudsecurityalliance.org
On Sun, Jul 18, 2021 at 7:58 PM Oliver Chang <och...@google.com> wrote:
Hey Kurt,

Thank you very much for the comments!

On Mon, 19 Jul 2021 at 07:22, Kurt Seifried <ksei...@cloudsecurityalliance.org> wrote:
So I like the OSV format (https://osv.dev/docs/#tag/vulnerability_schema), I think it fits well with the OpenSource/source code management (e.g. git) view of the world (this package, this commit introduces the vuln, this commit fixes it). It should work really well for distros (are we vuln? Do we include the code from commit X? etc.). 

But where I'm running into trouble is e.g. CVE-2021-28692 in Debian stable:


Debian stable (buster)
xen prior to 4.11.4+107-gef32c7afa2-1.

This data won't "fit" in the OSV schema, e.g. do we assign an OSV for debian and use the:

"package": {
"name": "string",
"ecosystem": "string",
"purl": "string"
}

The intention for this case is to use something like:

"package": {
  "name": "xen",
  "ecosystem": "Debian",
  "purl": "..."
}

How does it deal with multiple versions of the same package, e.g. "example 1.2.3" in Debian stable and Debian unstable? Could package listings be made an array of objects instead of a single object? Or is the intent to have multiple OSV entries, e.g. one OSV for Debian:stable:example:1.2.3 and another for Debian:unstable:example:1.2.3? I think I'd actually prefer multiple files, especially since you support "related".
 

Our format doesn't require git commit info either, and you can instead specify version ranges by e.g.

"ranges": [{
  "type": "ECOSYSTEM",
  "introduced": "4.10.1-4",
  "fixed": "4.11.4+107-gef32c7afa2-1"
}]

(and of course the explicit "versions" list as well).

The OSV format doesn't aim to be a format that holds affected package data for every ecosystem in a single entry. We want this to be a distributed format with separate entries for each package ecosystem, and tracking them separately enables more flexibility with text descriptions etc and allows a more distributed/federated model. They can cross reference each other by "aliases" or "related" fields.

Perfect, that's also my plan with UVI data (more on this in a later email). 
 

Do you agree with this goal? Of course, we are always open to changes to the schema to make it more useful. 

I think it should be made explicit whether the goal is to jam a lot of data into a single OSV, or to have lots of OSVs for a given issue (e.g. one per vendor:product:package, basically) and use the related/aliases fields.
 


to specify "xen" in the ecosystem Debian? Or do I overload the 

"versions": [
"string"
]

entirely with some CPE like string:

Debian:buster:xen:<4.11.4+107-gef32c7afa2-1

Or (as I suspect) is OSV not aimed at solving this part of the problem instead focussing on the upstream/commit centric view of the world that then flows downstream?

Part of me thinks it would be better to have data formats that actually specialize on different views of the world (e.g. OSV for upstream centric view, something else for the distro view, and something entirely else for malware for example). The advantage of UVI being that we can hold multiple types of data (by simple virtue of using JSON and assigning namespaces so e.g. OSV can put data in, we could simply mass import the Alpine Linux format, etc.).

Alpine is also adopting the OSV format, using {"ecosystem": "Alpine", "package": "pkg"} etc, and also planning some JSON-LD extensions to enable better sharing of vulnerability data across different distros. I think it would be nice if we could converge on a format for open source to enable easier sharing and tooling across all databases. 

I would ask that you (and OSV) add information about what kind of file it is, e.g. one of the few things I got right was versioning in the CVE data format at the start:

"VERSION": "1.2",

and then in 4.0 we added data type/format:

"data_type": "CVE",
"data_format": "MITRE",
"data_version": "4.0",

so e.g. I'd ask that "pure" OSV use something like:

"data_type": "OSV",
"data_format": "OSV",
"data_version": "1.0"

And assuming AlpineLinux has their own flavour that is different:

"data_type": "OSV",
"data_format": "AlpineLinux",
"data_version": "1.0"

because then when someone grabs a JSON file they don't have to guess what it is/do tricks to try and identify it, it's clear and simple.

So to make sure it's obvious: I'd like to ask that the OSV add a data_type/data_format/data_version to the schema.
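A sketch of the consumer-side benefit of such fields. The field names follow the proposal above; the "unknown" fallback behaviour is an invented illustration, not part of any spec.

```python
import json

# Sketch: with explicit data_type/data_format/data_version fields,
# a reader can dispatch to the right parser without guessing.
def identify(raw):
    doc = json.loads(raw)
    if "data_type" in doc:
        return (doc["data_type"],
                doc.get("data_format", "unknown"),
                doc.get("data_version", "unknown"))
    return ("unknown", "unknown", "unknown")

# Hypothetical Alpine-flavoured OSV file, per the proposal above.
osv_doc = ('{"data_type": "OSV", "data_format": "AlpineLinux", '
           '"data_version": "1.0"}')
```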
 

I'm also tired of "this will be the one true JSON (or XML) to describe all aspects of security" (like the terrible CVE JSON format I invented, CVRF, CSAF, etc...). Thoughts/comments on how to best support multiple formats, give hints/etc are welcome.

The CVE board also seems interested in better aligning their CVE 5.0 schema with the OSV schema. We'll see where that takes us as well.  

As the author of that format: it supports a lot of this, but they were never willing to use it or enforce data cleanliness. Witness:


    "affects": {
        "vendor": {
            "vendor_data": [
                {
                    "vendor_name": "n/a",
                    "product": {
                        "product_data": [
                            {
                                "product_name": "gstreamer-plugins-good",
                                "version": {
                                    "version_data": [
                                        {
                                            "version_value": "gstreamer-plugins-good 1.18.4"

Sigh. You can explicitly nest the objects, so what they could do is:

"vendor": {
  "vendor_name": "Debian",
  "product": {
    "product_name": "Linux",
    "version_value": "stable",
    "product": {
      "product_name": "gstreamer-plugins-good",
      "version_value": "1.18.4"
    }
  }
}

In other words my vision was objects containing objects, objects all the way down and up. I doubt very much they'll go with that vision but I'd love to be wrong.
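A sketch of what consuming that "objects all the way down" shape could look like. The field names mirror the example above; the flattening approach is an illustration, not an existing tool.

```python
# Sketch: recursively flatten nested vendor/product objects into
# (name-chain, version) pairs, at arbitrary nesting depth.
def flatten(node, chain=()):
    chain = chain + (node.get("product_name") or node.get("vendor_name"),)
    out = []
    if "version_value" in node:
        out.append((chain, node["version_value"]))
    child = node.get("product")
    if child:
        out.extend(flatten(child, chain))
    return out

# The Debian/gstreamer example from above as a nested structure.
tree = {
    "vendor_name": "Debian",
    "product": {
        "product_name": "Linux",
        "version_value": "stable",
        "product": {
            "product_name": "gstreamer-plugins-good",
            "version_value": "1.18.4",
        },
    },
}
```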



 -Kurt

Kurt Seifried

Jul 19, 2021, 10:42:08 PM
to Ariadne Conill, Oliver Chang, u...@groups.cloudsecurityalliance.org
On Mon, Jul 19, 2021 at 2:11 AM Ariadne Conill <ari...@dereferenced.org> wrote:


And where this becomes *really* awesome is when you're a researcher trying
to figure out what exploitation vectors a given malware attack is using.
You could have something like:

"exploitation_vectors": [
  "https://msrc.microsoft.com/CVE-2021-34481",
  "https://cve.mitre.org/CVE-2021-34481",
  "https://exploit-db.com/CVE-2021-34481"
]

Obviously these are not real objects (yet), but that's basically the world
I am trying to build.  This is what UVI should try to push for, in my
opinion.

But I think we all agree that if you have a universal vulnerability data
format, built on linked data, where you can follow these URIs and get more
data for your research / enrichment needs, it will be a pretty awesome
world, which is why I am pushing as hard as I can for URIs and Linked Data
to be baked into the spec from day 1.

Ariadne

Speaking of which this leads to a philosophical issue:

"exploitation_vectors": [
  "https://msrc.microsoft.com/CVE-2021-34481"
]

vs

"references": [
  {"url": "https://msrc.microsoft.com/CVE-2021-34481", "exploitation_vector": "yes"}
]

vs having both?

My take on this is: I think a flat URL "references" section is a historical relic, in that there's a better way. You'll want to have URL references for sure, and you'll want to have specific data things that point at a URL. Based on the "farm to fork" model for data -> statements, I'm leaning towards (but not convinced of) a format something like:

"exploitation_vectors": [
  {
    "type": "url",
    "more tags": "more data context added as time goes on, e.g. someone reads the url and puts the workaround info or whatever in JSON formatted data with this entry"
  }
]

because it's clear that it is a URL, why it's there, what it is, etc. Parsing the file to get all the referenced URLs would be trivial, and it's a lot better than the mostly random list of URLs in each CVE with no real context (e.g. "FEDORA" or "GENTOO", but you already knew that from the URL).
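A sketch of that "trivial parsing" claim, assuming hypothetical "type"/"url" keys as in the fragment above plus a free-form context key.

```python
# Sketch: with typed reference objects, extracting every referenced URL
# together with its context is a one-liner-ish filter.
def extract_urls(entry):
    return [(r["url"],
             {k: v for k, v in r.items() if k not in ("type", "url")})
            for r in entry.get("exploitation_vectors", [])
            if r.get("type") == "url"]

# Hypothetical entry; the "notes" key stands in for later-added context.
entry = {
    "exploitation_vectors": [
        {"type": "url",
         "url": "https://msrc.microsoft.com/CVE-2021-34481",
         "notes": "context added later"},
    ]
}
```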






Oliver Chang

Jul 20, 2021, 12:47:57 AM
to Ariadne Conill, Kurt Seifried, u...@groups.cloudsecurityalliance.org
Everything you've said here sounds amazing and sounds like a great goal to work towards.

We did make a conscious decision to avoid URIs though, as domains/URIs can be unstable and change (or even disappear) over time, especially for community based databases that may not have dedicated staffing/resources. The idea we had with cross referencing data is that any database can store vuln data for any ID. So rather than explicitly linking to specific database entries, a user can check all the known databases for a vuln they're interested in (or an aggregator can do it for them). 

Perhaps we could be convinced otherwise, but we definitely do want to ensure it's compatible with the JSON-LD extensions though, via e.g. mappings of string values to JSON-LD URIs (as I believe we discussed before for ecosystems). Happy to hear others' thoughts on this too. 

Oliver Chang

Jul 20, 2021, 12:51:47 AM
to Kurt Seifried, u...@groups.cloudsecurityalliance.org
On Tue, 20 Jul 2021 at 11:45, Kurt Seifried <ksei...@cloudsecurityalliance.org> wrote:



On Sun, Jul 18, 2021 at 7:58 PM Oliver Chang <och...@google.com> wrote:
Hey Kurt,

Thank you very much for the comments!

On Mon, 19 Jul 2021 at 07:22, Kurt Seifried <ksei...@cloudsecurityalliance.org> wrote:
So I like the OSV format (https://osv.dev/docs/#tag/vulnerability_schema), I think it fits well with the OpenSource/source code management (e.g. git) view of the world (this package, this commit introduces the vuln, this commit fixes it). It should work really well for distros (are we vuln? Do we include the code from commit X? etc.). 

But where I'm running into trouble is e.g. CVE-2021-28692 in Debian stable:


Debian stable (buster)
xen prior to 4.11.4+107-gef32c7afa2-1.

This data won't "fit" in the OSV schema, e.g. do we assign an OSV for debian and use the:

"package": {
"name": "string",
"ecosystem": "string",
"purl": "string"
}

The intention for this case is to use something like:

"package": {
  "name": "xen",
  "ecosystem": "Debian",
  "purl": "..."
}

How does it deal with multiple versions of the same package e.g. "example 1.2.3" in Debian stable and Debian unstable? Could package listings be made an array of objects instead of a single object? Or is the intent to have multiple OSV entries, e.g. an OSV for Debian:stable:example:1.2.3 and another OSV for Debian:unstable:example:1.2.3. I think I'd prefer to have multiple files actually especially since you support "related"

If these are different namespaces for package names (Debian:stable vs Debian:unstable), then these would have to be separate entries. I think we both agree on tracking these as separate entries :) 
 

Our format doesn't require git commit info either, and you can instead specify version ranges by e.g.

"ranges": [{
  "type": "ECOSYSTEM",
  "introduced": "4.10.1-4",
  "fixed": "4.11.4+107-gef32c7afa2-1"
}]

(and of course the explicit "versions" list as well).

The OSV format doesn't aim to be a format that holds affected package data for every ecosystem in a single entry. We want this to be a distributed format with separate entries for each package ecosystem, and tracking them separately enables more flexibility with text descriptions etc and allows a more distributed/federated model. They can cross reference each other by "aliases" or "related" fields.

Perfect, that's also my plan with UVI data (more on this in a later email). 

Nice! Looking forward to hearing your plans.  
 

Do you agree with this goal? Of course, we are always open to changes to the schema to make it more useful. 

I think making it explicit whether the goal is to jam a lot of data into a single OSV, or have lots of OSV's for a given issue (e.g. one per vendor:product:package basically) and use the related/aliases fields.

Do you mean making this more explicit in our spec? 
 


to specify "xen" in the ecosystem Debian? Or do I overload the 

"versions": [
"string"
]

entirely with some CPE like string:

Debian:buster:xen:<4.11.4+107-gef32c7afa2-1

Or (as I suspect) is OSV not aimed at solving this part of the problem instead focussing on the upstream/commit centric view of the world that then flows downstream?

Part of me thinks it would be better to have data formats that actually specialize on different views of the world (e.g. OSV for upstream centric view, something else for the distro view, and something entirely else for malware for example). The advantage of UVI being that we can hold multiple types of data (by simple virtue of using JSON and assigning namespaces so e.g. OSV can put data in, we could simply mass import the Alpine Linux format, etc.).

Alpine is also adopting the OSV format, using {"ecosystem": "Alpine", "package": "pkg"} etc, and also planning some JSON-LD extensions to enable better sharing of vulnerability data across different distros. I think it would be nice if we could converge on a format for open source to enable easier sharing and tooling across all databases. 

I would ask that you (and OSV) add information about what kind of file it is, e.g. one of the few things I got right was versioning in the CVE data format at the start:

Yeah, this is something we've been thinking about. We avoided it in the earlier days because the schema was pretty unstable (and we didn't really have a proper name for it), but I think it's gotten to a stable enough point that we can slap on a version + official name soon. Will keep you posted.

Ariadne Conill

Jul 20, 2021, 2:24:11 AM
to Oliver Chang, Ariadne Conill, Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hello,
URIs are basically *the* thing needed to make this an open system where
everyone can contribute their own data. Outside of that, we do not really
get to a point where we have *universal* vulnerability information. There
are hundreds of thousands of projects, and a system which is not URI-first
basically discourages them from participating by publishing their own
vulnerability data -- instead sticking with centralized databases.

Yes, databases can disappear, but others can maintain archives of them,
and a well-designed spec will be friendly for archival, anyway.

A URI itself does not need to be tied to the DNS system either, there is
the possibility of using identifiers which point to shared databases like
blockchains, etc. The Datashards project (https://datashards.net) are
working on such a system, which would be a good fit here for ensuring
linked data resources are preserved beyond the lifecycle of the original
maintainer.

We need to stop thinking about individual databases and start thinking
about databases of databases. Having a unified vocabulary is great, but
if everyone builds tooling which only checks a few pre-existing databases,
we don't really solve the problem we are trying to solve with UVI, in my
opinion.

(Ironically, this last part is something MITRE really has understood well
when I've talked to them briefly.)

To build a universal vulnerability knowledge base, we have to be bold and
design for zero-friction inclusion of new data sources. A spec where
there is a registry of known databases introduces sufficient friction that
projects will just stick to using GitHub security advisories or the
CVE reporting form rather than managing their own vulnerability data.

If you still aren't convinced, feel free to express your concerns, but
URIs are really the key here, in my opinion, as they allow *anyone* to
participate out of the box.

Ariadne

Kurt Seifried

Jul 20, 2021, 2:43:27 AM
to Ariadne Conill, Oliver Chang, u...@groups.cloudsecurityalliance.org
So, long story short.

Downsides of URLs (and their solutions):

Data is deleted - mirror it/copy extracted data
Data is modified - mirror it/copy extracted data
Data is moved - mirror it/copy extracted data, or find the new location

Upsides of URLs:

URLs have a lot of built-in context for finding and identifying stuff, e.g. "https://www.debian.org/security/2021/dsa-4931", or commit IDs, or email message IDs.
URLs can have aliases within their domain/service (e.g. Bugzillas with CVEs aliased)
URLs can have aliases outside of their domain/service (e.g. URL shorteners)
URLs can be mirrored, though it's not always easy (e.g. https://psirt.global.sonicwall.com/vuln-detail/SNWLID-2021-0017 is

============
<!DOCTYPE html><html><head><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><script src=https://www.google.com/recaptcha/api.js async defer=defer></script><script async src="https://www.googletagmanager.com/gtag/js?id=UA-114705979-1"></script><script>window.dataLayer = window.dataLayer || [];
    function gtag() { dataLayer.push(arguments); }
    gtag('js', new Date());

    gtag('config', 'UA-114705979-1');</script><title>Security Advisory</title><link href=/static/css/vendors~main.dc36b71b7abc5f6241b2.css rel=stylesheet></head><body><div id=app></div><script src=/static/js/vendors~main.55c24b1145dd30568c93.js></script><script src=/static/js/main.1a680dd64bb5fcd249cf.js></script></body></html>
============

but chrome --headless --dump-dom works, or worst case scenario you send a human to the page, make them save it manually, and shove it into the mirror system. This is why one of my new rules is that I don't download and process web pages directly: I mirror web pages and then process my local copy (https://github.com/cloudsecurityalliance/uvi-url-downloads).

Basically, "mirror it, extract info, then feed that into your system" is just too good on many levels to do anything else, e.g. my initial experiments:


I can also now modify the www.debian.org-processing.py and rerun it on the 5100+ Debian entries in 2.5 minutes on my desktop and get better/new data as I improve the parser (speaking of which I'll be spitting out OSV/CVE formatted data shortly). And yes my code is awful, I'm a terrible programmer. 
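The rerun-the-parser-against-a-local-mirror loop described here could be sketched like this. The path-naming scheme is invented for illustration; the actual uvi-url-downloads layout may differ.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse

# Sketch: map a URL to a stable local mirror path, fetch it at most once,
# then let parsers re-run against the local copy as often as needed.
def mirror_path(url, root="mirror"):
    parsed = urlparse(url)
    digest = hashlib.sha256(url.encode()).hexdigest()[:16]
    return Path(root) / parsed.netloc / digest / "page.html"

def process(url, parse, fetch=None):
    path = mirror_path(url)
    if not path.exists():
        if fetch is None:
            raise FileNotFoundError(path)  # mirror-only mode: never re-download
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(url))
    return parse(path.read_text())
```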

As for the whole JSON-LD / allowing anyone to play: we have to use URLs and embrace the original vision of the web, with the addendum: let's make copies of stuff; storage is cheap and losing data is a pain (to say nothing of dealing with modifications/etc.).

Another addendum: the UVI plans to be *THE* aggregator, not just in the sense of we crawl/locate everything we can, but we also allow anyone to shove content to us. Discovery is a hard problem, but solvable (like that company, the name starts with a "G" I think, they let you find web pages easily, Giggle? Goose? something like that ;) especially if you can extract meaning from the data and let users search it easily.

 

Ariadne Conill

unread,
Jul 20, 2021, 3:26:56 AMJul 20
to Kurt Seifried, Ariadne Conill, Oliver Chang, u...@groups.cloudsecurityalliance.org
Hi,
Yep, exactly! That's the real value of both the CVE *and* the UVI
databases: curation of security data.

If you talk to the CVE team, you find out very quickly that really all
they do is collect and add URIs to things.

A system that isn't built around that realization will just be yet another
CVE replacement standard that doesn't go anywhere. The special sauce is
curation, and URIs are what enable that curation.

Ariadne

Kurt Seifried

unread,
Jul 20, 2021, 1:32:14 PMJul 20
to Ariadne Conill, Oliver Chang, u...@groups.cloudsecurityalliance.org
On Tue, Jul 20, 2021 at 1:26 AM Ariadne Conill <ari...@dereferenced.org> wrote:


Yep, exactly!  That's the real value of both the CVE *and* the UVI
databases: curation of security data.

If you talk to the CVE team, you find out very quickly that really all
they do is collect and add URIs to things.

A system that isn't built around that realization will just be yet another
CVE replacement standard that doesn't go anywhere.  The special sauce is
curation, and URIs are what enable that curation.

Ariadne

This is a major part of why I'm reframing the problem (at least for myself, and for collecting data for the UVI) as a web search problem and not a human infosec research problem (hint: if there's a webpage, some human has most likely already done the infosec research). So we need curation, but we need automated curation, for example:


17 URLs... but https://www.suse.com/security/cve/ contains links to 28608 CVEs. There are 22000 or so SUSE URLs in CVE, so at least a few thousand are missing. Which ones? Dunno, you'd have to crawl all the links and parse them. Honestly it'd be easier to just add the links. Even better, using automation caught a bunch of errors which Marcus was able to quickly fix:
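Finding the missing ones is mechanical once the page is mirrored. A sketch (the regex and function names are mine; real tooling would also follow the per-CVE links):

```python
import re

# CVE identifiers are 4-digit year plus a 4-to-7-digit sequence number.
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,7}")


def extract_cve_ids(html: str) -> set[str]:
    """Pull every CVE identifier out of a mirrored page."""
    return set(CVE_RE.findall(html))


def missing_from(known: set[str], page_html: str) -> set[str]:
    """Which CVEs on the vendor page are absent from our database?"""
    return extract_cve_ids(page_html) - known
```

Diff the result against the IDs already carrying SUSE URLs and you have the exact list of entries a ten-line script should be updating.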


Debian is an even better example, as each DSA links to package pages and security pages which often have links to the git commits (which means you could largely auto-generate OSV entries), e.g.



so there's your OSV data in a nice tidy bundle. As you can see, the automation for this ranges from trivial (well-formed Debian data at well-defined URLs, easily discovered, curated, etc.) to non-trivial (email postings and tweets from trusted people). 

The challenge is that we need curation, but automated curation. Even if we only spend 1 second per link (and it'll take a lot more than that), there are still hundreds of hours of work for just the well-defined stuff. Why not automate it? The UVI can do that (especially with our farm-to-fork data model), but CVE is still largely human-driven/not automated (e.g. a ten-line script could add all the SUSE URLs). In other words, not automating is basically the same as failure, especially at the scale we need this to work.


-Kurt




 

Josh Bressers

unread,
Jul 23, 2021, 9:36:04 PMJul 23
to Ariadne Conill, Kurt Seifried, Oliver Chang, u...@groups.cloudsecurityalliance.org
I think this makes sense, but I want to nitpick the language.

The word "curation" gives an impression of gatekeeping. I will agree CVE suffers from the problem of curation. I think in the modern world curation is a bug, not a feature. Open source can move at the speed it does because projects can work around bugs like this.

One goal we want to see around the UVI project is to allow open source style contributions. The long tail is powerful.

-- 
    Josh

Oliver Chang

unread,
Aug 6, 2021, 11:38:30 AMAug 6
to Josh Bressers, Ariadne Conill, Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hi all,

Thank you for the discussion, and sorry for the late reply!

We've moved our spec to https://github.com/ossf/osv-schema in the interests of fostering more discussions on this. Please let's continue any discussions there (feel free to create issues!)

Re automation: completely agreed that better automation is key to scaling triage efforts. That's why a big part of OSV is to develop infrastructure to improve data quality by performing bisections, mapping commits to versions, etc. If there's anything we can do to help extend our infrastructure or collaborate on, we'd be very happy to do so! 

Re URIs: another reason why most top-level fields aren't URIs goes beyond just domains expiring. E.g. our "aliases" field lists vulnerability IDs (not URIs), because we can't be sure which other databases/URLs hold information about a particular ID. This situation can also change at any time -- another database may add an entry for e.g. CVE-2021-1111 after our own entry is made. This would impose on everyone producing this data the burden of infrastructure to sync this information in the first place. That will hinder adoption, in my opinion. 

For consumers of this data (i.e. someone running a vulnerability scanning tool on their dependencies), it's likely they will only "trust" a few known databases anyway (e.g. UVI) that have done curation/improvements to the data. That said, we certainly don't want to gatekeep databases. Anybody can add their database to https://github.com/ossf/osv-schema so we have a known registry of databases. 

Is there anything in the schema today that blocks compatibility with JSON-LD with the previously discussed extensions? We also have a typed "references" field, which can be used/extended by JSON-LD users to facilitate structured linking of data:

"references": [ {
    "type": "ADVISORY",
}, {
    "type": "PACKAGE",
} ]

Of course, more types can always be added. 

Thanks!

Ariadne Conill

unread,
Aug 6, 2021, 12:39:08 PMAug 6
to Oliver Chang, Josh Bressers, Ariadne Conill, Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hi,

On Fri, 6 Aug 2021, 'Oliver Chang' via UVI Discussion Group wrote:

> Hi all,
>
> Thank you for the discussion, and sorry for the late reply!
>
> We've moved our spec to https://github.com/ossf/osv-schema in the interests of fostering more discussions on this. Please let's continue any discussions there (feel free to create issues!)
>
> Re automation: completely agreed that better automation is key to scaling triage efforts. That's why a big part of OSV is to develop this infrastructure to improve data quality by performing
> bisections, and mapping commits to versions etc. If there's anything we can do to help extend our infrastructure or collaborate on we'd be very happy to do this!

While the projects that the OSV team and Google are doing are certainly
amazing, we should think bigger and try to build systems where *anyone*
can contribute this kind of data enrichment.

> Re URIs: another reason why most top-level fields aren't URIs goes beyond just domains expiring. e.g. our "aliases" field lists vulnerability IDs (not URIs), because we can't be sure which other
> databases/URLs hold information about a particular ID. This situation can also change at any time -- another database may add an entry for e.g. CVE-2021-1111 after our own entry is made. This will
> impose on everyone producing this data the burden of infrastructure to sync this information in the first place. That will hinder adoption in my opinion.

I don't follow. You can just choose to not add the reference in your own
database. And this kind of notification can be handled automatically
in a secure way using a variety of approaches (signatures, capability
URLs, etc.)

Having a closed system that pretends to be open hinders adoption. It is
possible, and in fact, trivially easy to just do the right thing here:
allow URIs to be used anywhere in the schema as a pointer.

> For consumers of this data (i.e. someone running a vulnerability scanning tool on their dependencies), it's likely they will only "trust" a few known databases anyway (e.g. UVI) that have done
> curation/improvements to the data. That said, we certainly don't want to gatekeep databases. Anybody can add their database to https://github.com/ossf/osv-schema so we have a known registry of
> databases.

But the thing is: that's not true. Anyone can send the OSV team a pull
request. And the OSV team can reject that pull request. And if they do,
then that person cannot play in the ecosystem if we build it as presently
designed, as URIs are not supported in the schema.

If UVI is intended to be universal, then it must use a schema that allows
universal participation without anyone's explicit approval.

> Is there anything in the schema today blocks compatibility with JSON-LD with the previously discussed extensions? We also have a typed "references" field, which can be used/extended by JSON-LD users
> to facilitate structured linking of data:

It is compatible with JSON-LD, but not really in the spirit of it.
JSON-LD is meant to enable rich ecosystems based on linked data. The
applicability to security databases is hopefully obvious.

> "references": [ {
>     "type": "ADVISORY",
>     "url": "https://www.debian.org/security/2021/dsa-4937",
> }, {
>     "type": "PACKAGE",
>     "url": "https://pypi.org/project/foo"
> } ]

The key thing here is that a linked object should be self-describing. In
other words, `type` should be in the linked object. Doing it the other
way is less natural, and is one of the things that is bad about CVE4.
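To make the contrast concrete (the vocabulary terms below are illustrative, not from any real schema):

```python
# Type on the link (CVE4/OSV style): the reference itself says what the
# target is, so every publisher must classify every URL they link to.
link_typed = {
    "type": "ADVISORY",
    "url": "https://www.debian.org/security/2021/dsa-4937",
}

# Self-describing linked object (JSON-LD style): the link can stay a bare
# URI, because dereferencing it yields an object that carries its own type.
linked_object = {
    "@id": "https://www.debian.org/security/2021/dsa-4937",
    "@type": "Advisory",  # hypothetical vocabulary term
    "name": "DSA-4937-1 xen -- security update",
}
```

With the second shape, a consumer can follow any URI in the graph and learn what it points at, rather than trusting the publisher's classification.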

Please look at the success of ActivityStreams[1], and the application of
ActivityPub[2] to federate ActivityStreams data. The people who look
at those projects, and then apply the lessons learned from them
to distributing security data, are the people who will build the project
that actually gains traction in this space.

While OSV is an improvement over CVE4, there is still a long way to go
before it really solves the problem that we are building this ecosystem to
solve.

Distributions want a system based on URIs and push messaging. I can only
imagine that upstream projects also want this. Every person outside of
the OSV team I've walked through what JSON-LD can enable for us has
immediately gotten it.

If URIs cannot just be used in a document, then there's no point in even
worrying about JSON-LD, as the whole value proposition of using linked
data is lost, and the only thing JSON-LD gives you beyond that is a
typed schema.

It is so frustrating that OSV is like 90% of the right thing, and won't
just be the right thing. With just a little bit of effort, we can
overcome this and build something that is actually awesome, and we should
do it. I hope we do. I have certainly tried to make a case for the right
thing.

There is at least one data science person at Google who is familiar with
both ActivityStreams and ActivityPub. I can introduce you if it would
help.

Ariadne

[1]: https://www.w3.org/TR/activitystreams-core/
[2]: https://www.w3.org/TR/activitypub/


Kurt Seifried

unread,
Aug 6, 2021, 12:59:32 PMAug 6
to Josh Bressers, Ariadne Conill, Oliver Chang, u...@groups.cloudsecurityalliance.org
I think it's more subtle than that: curation is fine as long as people can participate. With CVE there is a single gatekeeper -- witness the failed DWF effort, or heck, witness MITRE's own CVEs not being in the database until the issuing CNA comes through. I have 500 or so public CVEs with data; why can't I contribute them to the database?

Part of my vision for the UVI is to build as near to a permissionless ecosystem as possible, or, if permission is needed, to have multiple entities that can grant permission, ideally with widely varying views and cultures, so that it's unlikely people get locked out if they have useful stuff. 

This also feeds into the JSON-LD and linked version of the world. 
 

Josh Bressers

unread,
Aug 6, 2021, 8:20:06 PMAug 6
to Kurt Seifried, Ariadne Conill, Oliver Chang, u...@groups.cloudsecurityalliance.org
On Fri, Aug 6, 2021 at 11:59 AM Kurt Seifried <ksei...@cloudsecurityalliance.org> wrote:

Part of my vision for the UVI is to build as near to a permissionless based ecosystem as possible, or if permission is needed to have multiple entities that can grant permission, ideally with widely varying views and cultures so that it's unlikely people get locked out if they have useful stuff. 

This also feeds into the JSON-LD and linked version of the world. 
 

I want to reel this back a bit. I was out for a few weeks which gave me time to think about this whole problem, and I think we don't yet understand the problem well enough.

When working on something like UVI I have a saying "talk about problems, build solutions". JSON-LD is a solution, we shouldn't be talking about it, we should be building it.

But I'm not sure we should stop talking about the problem yet.

I think we all agree there are magnitudes more things to track than we are tracking. We don't even know how many more: is it 4x, 10x, 100x? We have no idea. My current favorite example is this search, which turns up 3.6 million issues with the term "prototype pollution". How many are actual problems? It's not going to be a small number.

How many pieces do we want to carve this problem into? Here are some of my initial thoughts

Identify
The OSV project is doing some really cool automation around discovering and fixing bugs. Who else is doing something here?

Tracking
How many security identifiers exist? Discovery of these is currently impossible.

Solutions
We have an obsession with patches today. This is a very narrow focus. How serious is the bug? Are you using a dependency in a vulnerable way? Is the bug reachable by attackers? There's a lot of room here. We can't keep up with the current paltry number of vulnerabilities, what happens if there are 10x the identifiers per year?

Communication
There is a knowledge gap in this space you could drive a Death Star through. I see messages today that range from confusion to just lies. This is going to be a big hill to climb.


There are probably more, but this is just some quick thoughts. I think if we want to see JSON-LD used (which I also want to see), we should start building a small demo. We'll learn more with 100 lines of code than we will with 100 email messages :)

--
    Josh

Oliver Chang

unread,
Aug 9, 2021, 8:45:35 PMAug 9
to Josh Bressers, Kurt Seifried, Ariadne Conill, u...@groups.cloudsecurityalliance.org
My apologies for any frustrations caused -- I want to clarify that I'm 100% on board with the goals that we want to achieve of getting structured vulnerability data and a linked, structured world (and the benefits of that). Email is indeed hard to communicate with and I'd be happy to discuss this over a meeting too. Would it make sense to have some kind of regular meeting? 

Our motivation today for keeping this as generic plain JSON comes from the fact that there's a bit of a catch-22 -- it makes the OSV format harder to adopt if we ask everybody to use JSON-LD when not many others are using it yet (for vulnerabilities). To bring up some previous examples again (again, long technical discussions over email are hard... perhaps we should move to separate GitHub issues):

- "aliases" field -- in an ideal world these should be URIs. However, without everything using JSON-LD already and having infrastructure to synchronize this information today, we can't do this. It's still crucial metadata for a consumer of vulnerability entries to understand all the "names" a vulnerability is known as, so they can e.g. deduplicate across different providers.
- "typed" references -- today very few references would be valid JSON-LD, so we will need a "type" to describe them. Perhaps we could add a "JSON-LD" type here to encode such cases?

The highest priorities for the OSV team today are:
- To define a standardised format which describes, without any ambiguity, the set of affected package versions for a vulnerability (such that it can be used by automated tools with high reliability). Our experience with CVEs historically (as consumers of a large amount of open source) in this area has been extremely painful. 
- To encourage adoption as much as possible, such that anybody can easily output this format and store them anywhere without needing to set up infrastructure, have a stable URL to host them etc.

As part of this, we also need infrastructure to help ensure data quality: i.e. calculating affected versions accurately (e.g. by performing bisections and determining the list of released versions affected from the upstream repo).
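The version-matching half of that goal can be sketched against OSV's event-list style of ranges (simplified: real OSV needs ecosystem-specific version ordering, which plain comparable tuples stand in for here):

```python
def is_affected(version, events):
    """Walk a sorted OSV-style event list and decide whether `version`
    falls in an affected range.

    `events` is a list of {"introduced": v} / {"fixed": v} dicts in
    ascending version order, as in the OSV "ranges" field. Versions are
    comparable tuples for this sketch only.
    """
    affected = False
    for event in events:
        if "introduced" in event and event["introduced"] <= version:
            affected = True
        elif "fixed" in event and event["fixed"] <= version:
            affected = False
    return affected
```

The payoff of the unambiguous format is that this loop is the whole consumer-side check; compare that with parsing free-text CVE "affects" prose.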

The spec is of course a work in progress, and at this point I think we want to encourage as much participation as possible while keeping the door open for making everything more structured as we get more adopters. Getting to that world will take time, and I think the first step is to get everyone to agree on how to describe packages and package versions precisely first :) I'd love for us to be able to continue to collaborate and iterate on this here!


On Sat, 7 Aug 2021 at 10:20, Josh Bressers <jo...@bress.net> wrote:
On Fri, Aug 6, 2021 at 11:59 AM Kurt Seifried <ksei...@cloudsecurityalliance.org> wrote:

Part of my vision for the UVI is to build as near to a permissionless based ecosystem as possible, or if permission is needed to have multiple entities that can grant permission, ideally with widely varying views and cultures so that it's unlikely people get locked out if they have useful stuff. 

This also feeds into the JSON-LD and linked version of the world. 
 

I want to reel this back a bit. I was out for a few weeks which gave me time to think about this whole problem, and I think we don't yet understand the problem well enough.

When working on something like UVI I have a saying "talk about problems, build solutions". JSON-LD is a solution, we shouldn't be talking about it, we should be building it.

But I'm not sure we should stop talking about the problem yet.

I think we all agree there are magnitudes more things to track than we are tracking. We don't even know how many more, is it 4x, 10x, 100x? We have no idea. My current favorite example is this search
That turns up 3.6 million issues with the term "prototype pollution". How many are actual problems? It's not going to be a small number.

How many pieces do we want to carve this problem into? Here are some of my initial thoughts

Identify
The OSV project is doing some really cool automation around discovering and fixing bugs. Who else is doing something here?

Tracking
How many security identifiers exist? Discovery of these is currently impossible.

Solutions
We have an obsession with patches today. This is a very narrow focus. How serious is the bug? Are you using a dependency in a vulnerable way? Is the bug reachable by attackers? There's a lot of room here. We can't keep up with the current paltry number of vulnerabilities, what happens if there are 10x the identifiers per year?

Communication
There is a knowledge gap in this space you could drive a Death Star through. I see messages today that range from confusion to just lies. This is going to be a big hill to climb.


There are probably more, but this is just some quick thoughts. I think if we want to see JSON-LD used (which I also want to see), we should start building a small demo. We'll learn more with 100 lines of code than we will with 100 emails messages :)

+1. A demo will help clarify things for everyone and help future database/vulnerability entry providers see the benefits as well!
 


Josh Bressers

unread,
Aug 9, 2021, 8:51:58 PMAug 9
to Oliver Chang, Kurt Seifried, Ariadne Conill, u...@groups.cloudsecurityalliance.org
On Mon, Aug 9, 2021 at 7:45 PM Oliver Chang <och...@google.com> wrote:
My apologies for any frustrations caused -- I want to clarify that I'm 100% on board with the goals that we want to achieve of getting structured vulnerability data and a linked, structured world (and the benefits of that). Email is indeed hard to communicate with and I'd be happy to discuss this over a meeting too. Would it make sense to have some kind of regular meeting? 


I don't think you caused any frustration; it's email, we all know everyone is friends unless you use ALL CAPS :)

I'm OK with a few meetings, but I don't want a regular meeting schedule. I've found that when a working group has too many meetings, it loses the whole "working" portion of it. I also hate taking notes, which you need with a meeting.

There's also the benefit that while email is slower, it makes us sharpen our message to a razor point. Meetings let things flop around.

--
    Josh

Ariadne Conill

unread,
Aug 9, 2021, 8:57:57 PMAug 9
to Oliver Chang, Josh Bressers, Kurt Seifried, Ariadne Conill, u...@groups.cloudsecurityalliance.org
Hi,

On Tue, 10 Aug 2021, Oliver Chang wrote:

> My apologies for any frustrations caused -- I want to clarify that I'm 100% on board with the goals that we want to achieve of getting structured vulnerability data and a linked, structured world (and
> the benefits of that). Email is indeed hard to communicate with and I'd be happy to discuss this over a meeting too. Would it make sense to have some kind of regular meeting?
>
> Our motivation today for keeping this as a generic plain JSON comes from the fact that there's a bit of a catch-22 -- it makes the OSV format harder to adopt if we ask everybody to use JSON-LD when
> not many others are using it yet (for vulnerabilities). To bring up some previous examples again (again long technical discussions over email is hard... perhaps we should move to separate GitHub
> issues).
>
> - "aliases" field -- in an ideal world these should be URIs. However, without everything using JSON-LD already and having infrastructure to synchronize this information today, we can't do this. It's
> still crucial metadata for a consumer of vulnerability entries to understand all the "names" a vulnerability is known as, so they can e.g. deduplicate across different providers.
> - "typed" references -- today very few references would be valid JSON-LD, so we will need a "type" to describe them. Perhaps we could add a "JSON-LD" type here to encode such cases?
>
> The highest priorities for the OSV team today are:
> - To define a standardised format which describes without any ambiguity the set of affected package versions for a vulnerability (such that it can be used by automated tools with high reliability).
> Our experience with CVEs historically (as consumers of a large amount of open source) in this area have been extremely painful.
> - To encourage adoption as much as possible, such that anybody can easily output this format and store them anywhere without needing to set up infrastructure, have a stable URL to host them etc.
>
>  As part of this, we also need infrastructure to help ensure data quality: i.e. calculating affected versions in an accurate way (e.g. from performing bisections + determining list of released
> versions affected from the upstream repo).
>
> The spec is of course a work in progress, and at this point I think we want to encourage as much participation as possible while keeping the door open for making everything more structured as we get
> more adopters. Getting to that world will take time, and I think the first step is to get everyone to agree on how to describe packages and package versions precisely first :) I'd love for us to be
> able to continue to collaborate and iterate on this here!

I for one think OSV is great. But I also think that there is value in
keeping OSV simple. Trying to make OSV cover the use case for distributed
vulnerability tracking is just going to result in a format that frustrates
everyone.

A good way to think about it is: OSV should be like the RSS of the
vulnerability tracking world. And it is very good at being that already.

But to build real-time vulnerability sharing infrastructure, we need
something more fit to purpose.

So, my idea now, is to build something that makes the most of JSON-LD, but
can easily translate into OSV. This allows for vulnerability tracking
hubs to do their job very efficiently, while also providing vulnerability
databases an easy format to work with.

So the UVI format and the OSV format become complementary to each other.
As such, I think it makes sense to triage anything remaining in GitHub.

Ariadne

Oliver Chang

unread,
Aug 9, 2021, 9:08:16 PMAug 9
to Ariadne Conill, Josh Bressers, Kurt Seifried, u...@groups.cloudsecurityalliance.org
I think we should do this in a way that doesn't end up duplicating work. I think we'd all hate to see yet another completely different vulnerability format being introduced describing the same things. Tooling we (or others) build should ideally be able to help everybody.

How do you think we could have two schemas in a way that shares as much as possible? Two ways I can think of:

- Simply embedding an "OSV" structure (as UVI does today), and adding top-level fields that make the most of JSON-LD.
- Have some agreed way to extend the OSV structure with additional fields. 

WDYT? 

Ariadne Conill

unread,
Aug 9, 2021, 9:44:28 PMAug 9
to Oliver Chang, Ariadne Conill, Josh Bressers, Kurt Seifried, u...@groups.cloudsecurityalliance.org
Hi,
Well, fundamentally, I think the idea is to use the OSV format as a basis
and extend it with new fields that can be used in the way we need them to
work. This allows for UVI datasets to be easily represented in OSV
format.

The problem is that OSV is verbose in areas where we want to be able to
just include the remote references directly and have them feel like part
of the same graph. This is why I pushed so hard for URIs to be a
first-class feature of the OSV format. Fundamentally, we can't build
something like UVI without that and still deliver the user/developer
experience that makes the whole thing compelling.

So, I think of the UVI format being a JSON-LD oriented remix of OSV. We
change a couple of things, so that it feels right when working with it as
linked data, but provide guidance -- and reference code -- to allow
translation to the simplified OSV format, and encourage implementations of
software using the UVI tooling to provide their data in the simplified OSV
format as well.
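That translation could be as thin as stripping the linked-data keys (the key names below are guesses at what a UVI superset might add, not a spec):

```python
# Hypothetical JSON-LD additions a UVI entry might carry on top of OSV.
JSONLD_KEYS = {"@context", "@id", "@type"}


def uvi_to_osv(entry: dict) -> dict:
    """Down-convert a JSON-LD-flavoured UVI entry to plain OSV by dropping
    linked-data keys, assuming every remaining field is a valid OSV field.

    Real tooling would also have to flatten any remote references that OSV
    expects inline; this sketch only handles the top-level keys.
    """
    return {
        key: value for key, value in entry.items() if key not in JSONLD_KEYS
    }
```

Because the conversion is lossy only in the linked-data direction, OSV consumers get a normal entry while UVI consumers keep the full graph.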

In other words, the UVI format basically will be a superset of OSV, with
fields that flow right for JSON-LD.

And, hopefully, in the future, there will be an OSV 2.0 which brings all
of that back into OSV. Think of that as Atom vs RSS: they both do the
same thing, but one is informed by past experiences.

In other words, there's two use-cases:

- If you want to publish vulnerability data, and you just want people to
scrape it, and you don't care about building a metaverse of vulnerability
data, you use OSV, as it is the best schema out there today. This would
be primarily useful to projects like Go, Rust, PyPI, etc. And forges,
like GitHub.

- If you want to participate in building a metaverse of vulnerability
data, e.g. what UVI is about, you use the UVI format which adds features
to OSV to make it work for that. Then you start linking up with other
trackers and exchanging data using Linked Data Notifications. This design
is primarily useful to CERT and SR teams, which want to know what other
teams are up to. It is also useful for archivists, and researchers, since
they can build scrapers to dig up all that CVE and OSV data out there.

Trying to bake both use-cases into OSV at the moment, especially when we
don't *completely* know how to build the UVI thing yet, is kind of silly.

But there could always be an OSV 2.0 that brings it all together. I think
once we have the answers and there are *compelling* demos for you to see,
you will probably change your mind on this JSON-LD thing. But it is
important... remember, Google became a trillion-$ company from indexing a
universe of linked data. There will be similar opportunities in the UVI
ecosystem, too.

Ariadne

Oliver Chang

unread,
Aug 10, 2021, 2:04:53 AMAug 10
to Ariadne Conill, Josh Bressers, Kurt Seifried, u...@groups.cloudsecurityalliance.org
This sounds good to me. Please keep us in the loop once we need to figure out the details, so we don't diverge too much if possible (most fields in OSV are technically optional as well). One thing I'd really like for us all to have is tooling that can help in the other direction (OSV->UVI) as well, e.g. tooling that helps triage/automate parts of the OSV entry and has that flow to UVI easily. Having more compatible schemas will make this a lot easier.


In other words, the UVI format basically will be a superset of OSV, with
fields that flow right for JSON-LD.

And, hopefully, in the future, there will be an OSV 2.0 which brings all
of that back into OSV.  Think of that as Atom vs RSS: they both do the
same thing, but one is informed by past experiences.

In other words, there's two use-cases:

- If you want to publish vulnerability data, and you just want people to
scrape it, and you don't care about building a metaverse of vulnerability
data, you use OSV, as it is the best schema out there today.  This would
be primarily useful to projects like Go, Rust, PyPI, etc.  And forges,
like GitHub.

- If you want to participate in building a metaverse of vulnerability
data, e.g. what UVI is about, you use the UVI format which adds features
to OSV to make it work for that.  Then you start linking up with other
trackers and exchanging data using Linked Data Notifications.  This design
is primarily useful to CERT and SR teams, which want to know what other
teams are up to.  It is also useful for archivists, and researchers, since
they can build scrapers to dig up all that CVE and OSV data out there.

Trying to bake both use-cases into OSV at the moment, especially when we
don't *completely* know how to build the UVI thing yet, is kind of silly.

Yes! Completely agree with this direction. We need more time to understand these two different use cases and then see how we can converge as we get more adoption/usage.