Another JSON-LD example


Josh Bressers

Aug 15, 2021, 7:12:32 PM
to u...@groups.cloudsecurityalliance.org
Hi all,

I spent a bit more time today with JSON-LD. This is truly one of those things where the more I learn the less I know.

So anyway, here's what looks like 15 minutes of effort but took me longer than I want to admit :)
https://github.com/joshbressers/uvi-tools/tree/json-ld/json-ld

I used Node.js for this experiment, mostly because I can. I detest all languages equally, so be sure to tell me why your favorite is best :P
We'll probably want lots of language examples in the future.

I took a Linux Kernel CVE ID, CVE-2021-38208, and created some JSON-LD that links to the NVD JSON. I'm loading the NVD JSON via the jsonld library, which works (tm). I'm not sure if this is considered kosher in the JSON-LD world.
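
As a rough illustration of the linking idea (this is not the code in the repo, and it uses Python rather than Node.js), a JSON-LD node can point at the NVD record instead of copying its fields. The namespace URL, the `source` property, and the NVD URL below are placeholders made up for this sketch.

```python
import json

# Hypothetical namespace; the thread has not settled on a real one.
UVI_NAMESPACE = "https://uvi.example/ns/uvi#"


def make_linked_record(cve_id: str, nvd_url: str) -> dict:
    """Build a JSON-LD node that links to NVD data instead of duplicating it."""
    return {
        "@context": {
            "uvi": UVI_NAMESPACE,
            # "@type": "@id" marks the value as a link (IRI), not a plain string.
            "source": {"@id": "uvi:source", "@type": "@id"},
        },
        "@id": "uvi:" + cve_id,
        "@type": "uvi:Vulnerability",
        "source": nvd_url,
    }


record = make_linked_record(
    "CVE-2021-38208",
    "https://nvd.example/CVE-2021-38208.json",  # placeholder, not a real NVD URL
)
print(json.dumps(record, indent=2))
```

The record carries no description of its own; a consumer would follow `source` to fetch one.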

I think in order to load data from places like GitHub we will need to build some tooling that knows how to look up certain things, like a kernel git repo for example. I'm thinking we should build some sort of vulnerability library to help with this.

So anyway

I think I've realized a few things today about linking all of this together, feel free to fork this into multiple email threads if that makes more sense.

I don't want to duplicate any data that exists somewhere else in the graph. So for example a description already exists in the NVD data and a description exists in the Kernel git commit. Unless there's a new description for us to add, we can use one of those. I think it's very common for many vulnerability data sets to heavily duplicate today (this makes sense given what we have today). Duplication annoys me.

Is there an easy way to pick a "category" or some other identifier type? In my example it's pointing directly at a CVE ID. That's going to be a very different set of data than findings from a fuzzer, or a collection of Alpine security advisories. Or is it?

Do we want to track absolutely everything with a new identifier? If we look at how OSV is handling this, the other ecosystem data is imported and then re-exported with the ecosystem identifier. Maybe this is a question we ignore for now.

So tear this up, I know it sucks but it's meant to help drive discussion.

--
    Josh

Ariadne Conill

Aug 15, 2021, 7:33:37 PM
to Josh Bressers, u...@groups.cloudsecurityalliance.org
Hi,

On Sun, 15 Aug 2021, Josh Bressers wrote:

> Hi all,
>
> I spent a bit more time today with JSON-LD. This is truly one of those things where the more I learn the less I know.
>
> So anyway, here's what looks like 15 minutes of effort but took me longer than I want to admit :)
> https://github.com/joshbressers/uvi-tools/tree/json-ld/json-ld
>
> I used Node.js for this experiment, mostly because I can. I detest all languages equally, be sure to tell me why your favorite is best :P
> We'll probably want lots of language examples in the future.

Looks good to me. Ideally, we want to support as many languages as
possible, but Node (well, JavaScript in general) is basically the language
that people use to interact with the Web, so it makes sense to start there.

> I took a Linux Kernel CVE ID, CVE-2021-38208, and created some JSON-LD that links to the NVD json. I'm loading the NVD JSON via the jsonld library, which works (tm). I'm not sure if this is considered
> kosher in the JSON-LD world.

It works for now, but yes, we probably want to have a gateway service
which translates the CVE4/CVE5 stuff to something more palatable.

> I think in order to load data like from github we will need to build some tooling that knows how to look up certain things, like a kernel git repo for example. I'm thinking we should build some sort
> of vulnerability library to help with this.

Yes, and then we would implement a gateway for that as well, which handles
the translation, like with the CVEs.

> So anyway
>
> I think I've realized a few things today about linking all of this together, feel free to fork this into multiple email threads if that makes more sense.
>
> I don't want to duplicate any data that exists somewhere else in the graph. So for example a description already exists in the NVD data and a description exists in the Kernel git commit. Unless
> there's a new description for us to add, we can use one of those. I think it's very common for many vulnerability data sets to heavily duplicate today (this makes sense given what we have today).
> Duplication annoys me.
>
> Is there an easy way to pick a "category" or some other identifier type? In my example it's pointing directly at a CVE ID. That's going to be a very different set of data than findings from a fuzzer,
> or a collection of Alpine security advisories. Or is it?

We can use compounding for this, something like:

{
  "type": ["Vulnerability", "Kernel"],
  ...
}

Alternatively, we can have a separate subtype field if the compounding
approach seems unnatural:

{
  "@context": [
    "https://uvi.whatever/ns/uvi",
    {
      "type": "@type",
      "subtype": "uvi:subtype",
      "Kernel": "uvi:Kernel"
    }
  ],
  "type": "Vulnerability",
  "subtype": "Kernel"
}

That would allow tooling to prefer the kernel vulnerability data over the
NVD data, or whatever.
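
A minimal sketch of how tooling might use the subtype field to prefer one source over another. The records, the "NVD" subtype value, and the `pick_preferred` helper are hypothetical; only the "Kernel" subtype comes from the example above.

```python
# Hypothetical records for the same vulnerability from two sources.
records = [
    {"type": "Vulnerability", "subtype": "NVD", "description": "from NVD"},
    {"type": "Vulnerability", "subtype": "Kernel", "description": "from the kernel commit"},
]


def pick_preferred(records, preference):
    """Return the record whose subtype appears earliest in `preference`.

    Records with an unlisted subtype rank last; an empty input gives None.
    """
    def rank(rec):
        subtype = rec.get("subtype")
        return preference.index(subtype) if subtype in preference else len(preference)

    return min(records, key=rank) if records else None


best = pick_preferred(records, ["Kernel", "NVD"])
print(best["description"])
```

The preference order is just a list, so different consumers could rank sources differently without changing the data.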

> Do we want to track absolutely everything with a new identifier? If we look at how OSV is handling this the other ecosystem data is imported then re-exported with the ecosystem identifier. Maybe this
> is a question we ignore for now.
>
> So tear this up, I know it sucks but it's meant to help drive discussion.

It looks perfectly fine to me as a starting point.

Ariadne

Josh Bressers

Aug 15, 2021, 9:50:00 PM
to Ariadne Conill, u...@groups.cloudsecurityalliance.org


On Sun, Aug 15, 2021 at 6:33 PM Ariadne Conill <ari...@dereferenced.org> wrote:
> On Sun, 15 Aug 2021, Josh Bressers wrote:
>
> > I took a Linux Kernel CVE ID, CVE-2021-38208, and created some JSON-LD that links to the NVD json. I'm loading the NVD JSON via the jsonld library, which works (tm). I'm not sure if this is considered
> > kosher in the JSON-LD world.
>
> It works for now, but yes, we probably want to have a gateway service
> which translates the CVE4/CVE5 stuff to something more palatable.


This gives me an idea.

What if we start doing something like a gateway service? My current thought is to take the existing NVD data and massage it into OSV as best as we can. This gives us data that is easier to use, but more importantly it gives us a place to enrich and modify the existing data, as there is no reasonable way to update CVE details today.

The current UVI namespace leaves room below one million to avoid overlapping with the CVE namespace. I envision assigning the CVE IDs compatible UVI identifiers, OSV formatted, in one namespace, and then starting work on the JSON-LD format in a different namespace. Working with existing data will probably be easier than trying to find new data.

-- 
    Josh

Ariadne Conill

Aug 15, 2021, 10:29:25 PM
to Josh Bressers, Ariadne Conill, u...@groups.cloudsecurityalliance.org
Hi,

On Sun, 15 Aug 2021, Josh Bressers wrote:

>
>
> On Sun, Aug 15, 2021 at 6:33 PM Ariadne Conill <ari...@dereferenced.org> wrote
> On Sun, 15 Aug 2021, Josh Bressers wrote:
>
> > I took a Linux Kernel CVE ID, CVE-2021-38208, and created some JSON-LD that links to the NVD json. I'm loading the NVD JSON via the jsonld library, which works (tm). I'm not sure if this is considered
> > kosher in the JSON-LD world.
>
> It works for now, but yes, we probably want to have a gateway service
> which translates the CVE4/CVE5 stuff to something more palatable.
>
>
> This gives me an idea.
>
> What if we start doing something like a gateway service. My current thought is to take the existing NVD data and massage it into OSV as best as we can. This gives us data that is easier to use, but
> more importantly it gives us a place to enrich and modify the existing data as there is no reasonable way to update CVE details today.

Yes, that was my idea, except instead of massaging it into OSV, we could
massage it into the JSON-LD-ified version. But, really, we should do
both, I think.

Perhaps serve OSV formatted data when requested as JSON, and our WIP
format when requested as JSON-LD?
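
A sketch of what that content negotiation could look like in a gateway, assuming the client signals its preference through the HTTP Accept header. `application/ld+json` is the registered JSON-LD media type; the `choose_format` helper and its return values are made up for illustration, and it ignores q-values for brevity.

```python
def choose_format(accept_header: str) -> str:
    """Pick a response format from an Accept header.

    Returns "json-ld" when the client asks for application/ld+json,
    otherwise falls back to "osv" (plain JSON).
    """
    media_types = [
        part.split(";")[0].strip().lower()  # drop parameters such as ;q=0.9
        for part in accept_header.split(",")
    ]
    if "application/ld+json" in media_types:
        return "json-ld"
    return "osv"


print(choose_format("application/ld+json"))
print(choose_format("application/json"))
```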

> The current UVI namespace leaves room below one million to avoid overlapping with the CVE namespace, I envision assigning the IDs a compatible UVI identifier that are OSV formatted in one namespace.
> Then start working on the JSON-LD format in a different namespace. Working with existing data will probably be easier than trying to find new data.

Yes, probably best to remap the CVE identifier into a UVI one below
1000000, since MITRE seems quite willing to aggressively defend the CVE
trademark.
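
One possible shape for that remapping, as a sketch only. It assumes the UVI ID keeps the CVE year and sequence number and only swaps the prefix, and that natively assigned UVI IDs start at 1000000; neither the ID format nor the `cve_to_uvi` helper has been agreed on in this thread.

```python
import re

# Assumption: sequence numbers at or above this value are reserved for IDs
# assigned by UVI itself, so remapped CVEs must stay below it.
UVI_NATIVE_START = 1_000_000

CVE_PATTERN = re.compile(r"CVE-(\d{4})-(\d{4,})")


def cve_to_uvi(cve_id: str) -> str:
    """Remap a CVE ID to a hypothetical UVI ID with the same year and number."""
    match = CVE_PATTERN.fullmatch(cve_id)
    if match is None:
        raise ValueError(f"not a CVE ID: {cve_id!r}")
    year, number = match.group(1), match.group(2)
    if int(number) >= UVI_NATIVE_START:
        raise ValueError(f"{cve_id} would collide with the native UVI range")
    return f"UVI-{year}-{number}"


print(cve_to_uvi("CVE-2021-38208"))
```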

I can write a simple gateway later this week and put it up somewhere, like
on GitHub or whatever.

Ariadne

Josh Bressers

Aug 16, 2021, 9:28:39 AM
to Ariadne Conill, u...@groups.cloudsecurityalliance.org
On Sun, Aug 15, 2021 at 9:29 PM Ariadne Conill <ari...@dereferenced.org> wrote:
> Hi,
>
> On Sun, 15 Aug 2021, Josh Bressers wrote:
>
> > What if we start doing something like a gateway service. My current thought is to take the existing NVD data and massage it into OSV as best as we can. This gives us data that is easier to use, but
> > more importantly it gives us a place to enrich and modify the existing data as there is no reasonable way to update CVE details today.
>
> Yes, that was my idea, except instead of massaging it into OSV, we could
> massage it into the JSON-LD-ified version.  But, really, we should do
> both, I think.
>
> Perhaps serve OSV formatted data when requested as JSON, and our WIP
> format when requested as JSON-LD?

Yeah, this. It made sense in my brain :)

I think offering data in multiple formats will be important.
 

> > The current UVI namespace leaves room below one million to avoid overlapping with the CVE namespace, I envision assigning the IDs a compatible UVI identifier that are OSV formatted in one namespace.
> > Then start working on the JSON-LD format in a different namespace. Working with existing data will probably be easier than trying to find new data.
>
> Yes, probably best to remap the CVE identifier into a UVI one below
> 1000000, since MITRE seems quite willing to aggressively defend the CVE
> trademark.
>
> I can write a simple gateway later this week and put it up somewhere, like
> on GitHub or whatever.


Feel free to fork the tools repo if you don't want to create a new repo.

I haven't looked into whether the OSV API source is public anywhere; I always like borrowing from others.

Thanks!

--
    Josh

Josh Bressers

Aug 23, 2021, 10:44:38 AM
to u...@groups.cloudsecurityalliance.org
On Mon, Aug 16, 2021 at 8:28 AM Josh Bressers <jo...@bress.net> wrote:
> On Sun, Aug 15, 2021 at 9:29 PM Ariadne Conill <ari...@dereferenced.org> wrote:
> > Hi,
> >
> > On Sun, 15 Aug 2021, Josh Bressers wrote:
> >
> > > What if we start doing something like a gateway service. My current thought is to take the existing NVD data and massage it into OSV as best as we can. This gives us data that is easier to use, but
> > > more importantly it gives us a place to enrich and modify the existing data as there is no reasonable way to update CVE details today.
> >
> > Yes, that was my idea, except instead of massaging it into OSV, we could
> > massage it into the JSON-LD-ified version.  But, really, we should do
> > both, I think.
> >
> > Perhaps serve OSV formatted data when requested as JSON, and our WIP
> > format when requested as JSON-LD?
>
> Yeah, this. It made sense in my brain :)
>
> I think offering data in multiple formats will be important.

I wanted to send a follow-up to this group. I started looking at turning CVE data into UVI data this weekend (I haven't checked anything in yet),
specifically turning the NVD data into OSV. It's going to be rough. I would value input from others on this plan.

When possible, I will fill in all the OSV fields I can. Over time I can see building more intelligence into the system. Let's use this as an example.
We can see there is a CPE for the Linux Kernel and a reference that points at a commit. I would be confident adding the commit as the fix, along with the package ecosystem data.

But then if we look at this Node.js issue, it's going to be REALLY hard to parse that in an automated manner.

When we can't fill in an OSV required field with confidence, I want to use "MISSING" as the string value. That will make it very easy to see which data is incomplete. Hopefully if someone has the data, they can then submit a PR that adds some of the missing pieces. It could also be a source of low-hanging fruit for anyone looking to help.
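
A small sketch of the "MISSING" idea. The field names below are illustrative stand-ins, not the actual OSV schema, and the input dict stands for whatever the converter managed to extract from NVD with confidence.

```python
MISSING = "MISSING"

# Illustrative field names only; the real OSV schema differs.
REQUIRED_FIELDS = ("id", "summary", "package_name", "ecosystem", "fixed_commit")


def to_osv_like(known: dict) -> dict:
    """Copy the fields we are confident about and mark the rest as MISSING."""
    return {field: known.get(field) or MISSING for field in REQUIRED_FIELDS}


def missing_fields(record: dict) -> list:
    """List the fields a contributor could fill in with a pull request."""
    return [field for field, value in record.items() if value == MISSING]


# Placeholder input: only the ID and summary could be extracted.
record = to_osv_like({"id": "EXAMPLE-1", "summary": "example summary"})
print(missing_fields(record))
```

Because the placeholder is a fixed string, finding the low-hanging fruit is a plain text search over the data repo.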

Thanks in advance

--
    Josh

Kurt Seifried

Aug 23, 2021, 1:28:22 PM
to Josh Bressers, u...@groups.cloudsecurityalliance.org
On Mon, Aug 23, 2021 at 8:44 AM Josh Bressers <jo...@bress.net> wrote:

> I wanted to send a followup to this group. I started looking at turning CVE data into UVI data this weekend (I haven't checked anything in yet).
> Specifically turning the NVD data into OSV. It's going to be rough. I would value input from others on this plan.
>
> When possible, I will fill in all the OSV fields I can. Over time I can see building more intelligence into the system. Let's use this for example
> We can see there is a cpe for the Linux Kernel and a reference that points at a commit. I would be confident adding the commit as the fix and package ecosystem data.
>
> But then if we look at this Node.js issue
> It's going to be REALLY hard to parse that in an automated manner.
>
> When we can't add in an OSV required field with confidence, I want to use "MISSING" as the string value. That will make it very easy to see what data is obviously wrong. Hopefully if someone has the data, they can then submit a PR that adds some of the missing pieces. It could also be a source of low hanging fruit for anyone looking to help.

I think MISSING is a really important idea. I keep coming back to two central use cases for OSV/UVI/whatever:

1) People want data they can just quickly consume, ideally with tooling, e.g. given this UVI am I vulnerable Yes/No? For this we want things like machine-readable git commits, affected product/versions/etc.

2) People want to be able to research it further, e.g. given this UVI can I (for example) find out the affected code and search for that pattern in my repos (because people cut and paste), or see how badly affected we are because we use a nondefault setting or whatever. For this, we want URLs and source data (e.g. the commits that introduce and fix it are pretty ideal, plus the original issue, bug report, etc.). A really common example here is the "CVE foo says CVSS bar, but that doesn't apply to us because..." discussion that vendors OFTEN have with customers (thanks to scan reports). Ask Josh about this, he has some opinions here.

So for the second use case, having the URLs, git commits, hashes of stuff and not just version numbers, the program that processed it, when, etc. helps, but I think EXPLICITLY telling people what is MISSING would be a major step forwards. For example, when CVSS rating an issue, if NVD doesn't know whether the issue is local or remote they assume the worst-case scenario and go with remote (which... what else can they do? People demand CVSS ratings, even in the absence of data). Explicitly providing CVSS data with "MISSING" in fields would make life much easier when discussing a CVSS rating, and as Josh said, show people where they can do some research and add some real value to the data.

I think being as explicit as possible in all things (where did this data come from, we don't have this data, we don't know X) would be a major step forwards. Right now, for example, we live in a world where everyone assumes CVE data is "correct" and "complete", which as we all know is not the case, and that causes problems. This is a problem we can fix for UVI (and OSV/etc.).

Josh Bressers

Aug 27, 2021, 7:15:42 PM
to u...@groups.cloudsecurityalliance.org
OK, I created a very unintelligent NVD to OSV converter in this script.

Here's a gist showing the output

I'm using the OSV 0.8 format as it's going to be live very soon. There are a lot of important fields that are MISSING but that's OK. I see the MISSING as a placeholder for anyone looking to add comments.

Here is my current thinking of how to move this forward.

I want to take the existing NVD data and transform it into OSV data. I also want to include the demo JSON-LD format in this data. I don't know what the JSON-LD namespace will be yet.

The data will all live in this repo

For the moment edits to the data will be done via pull requests.

This has me thinking about a bigger picture question. The description in CVE is prose and often doesn't contain a lot of useful details that could be extracted in an automated manner. Something the DWF web interface did was ask for some details then auto-generate a description. I'm leaning in the direction that we should try to collect details about an application then generate a description rather than rely on a human to produce something. For example if we had a CWE, fixed commit, introduced commit, we already have more details than most CVE descriptions.
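
To make the idea concrete, here is a minimal sketch of generating a description from structured details. The wording template, the `generate_description` helper, and the example values are all made up for illustration; this is not how the DWF web interface did it.

```python
from typing import Optional


def generate_description(
    package: str,
    cwe: Optional[str] = None,
    introduced: Optional[str] = None,
    fixed: Optional[str] = None,
) -> str:
    """Build a description from whatever structured details are known."""
    parts = [f"A vulnerability was found in {package}."]
    if cwe:
        parts.append(f"Weakness class: {cwe}.")
    if introduced:
        parts.append(f"Introduced in commit {introduced}.")
    if fixed:
        parts.append(f"Fixed in commit {fixed}.")
    return " ".join(parts)


# Placeholder values, not taken from a real advisory.
print(generate_description("example-package", cwe="CWE-416",
                           introduced="abc123", fixed="def456"))
```

Details that are not known are simply left out of the text, so the description never claims more than the data supports.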

I imagine there are research projects happening that can take a diff and spit out a CWE, but that's a topic for another day :)

--
    Josh

Kurt Seifried

Aug 27, 2021, 7:57:05 PM
to Josh Bressers, u...@groups.cloudsecurityalliance.org
My hope was to have a generated text description (DWF also did that since day 1, I'm way too lazy not to automate it) with human prose in the NOTES field, which has existed since v1 of the CVE schema (https://github.com/CVEProject/cve-schema/blob/master/schema/v1.0/JSON-file-format-v1.md):

"NOTES": "string",
But it of course never caught on.