Accessing the ClearlyDefined dataset


James Siri

Sep 4, 2025, 7:28:28 PM
to Tomlinson, Qing, Paul Malmsten, clearly...@googlegroups.com

Howdy Qing and the rest of ClearlyDefined,

 

I am casting a broad net here because I don't know what the proper process is, so apologies if this isn't the correct path to follow. Please let me know if there is a more formal way to proceed.

 

Paul is my teammate and is working on a way to identify malicious changes to packages after they have been published, and we are interested in the data ClearlyDefined has been building over the years because we believe we could leverage it in this battle. In his own words:

 

Microsoft is running an internal project to identify currently unrecognized binaries checked in to internal source control repositories at scale, which we call 'foreign checked-in binaries'. The ClearlyDefined dataset interests us because it includes content digests for the files contained within distributed archives for many OSS application package formats.

To determine whether this dataset would be of use, we would like to query it by content digest (SHA-1 or SHA-256). However, I do not see such an API for that purpose today - so as a prototype, we would like to ingest the data into a temporary Azure Data Explorer database that would allow us to search the data by file hash. This is intended to be a temporary prototype to start, just to establish the usefulness of the dataset; if it proves to be useful, we would revisit what a more correct long-term approach would look like.

For now, I would like to create a service principal having read access to the specific data we need in Blob Storage, associate the service principal with a certificate we create in our Azure directory+subscription (such that my team pays for the Azure Data Explorer cluster), and use that service principal to copy the data into our temporary cluster.

 

Is this something that the project has done before, and would it be amenable to letting us start our experiment? We would be happy to hop on a call and discuss further if that would be helpful!

 

Thank you for your time and hope to hear from you soon.

 

James

Josh Berkus

Sep 4, 2025, 8:17:29 PM
to James Siri, Tomlinson, Qing, Paul Malmsten, clearly...@googlegroups.com
On 9/4/25 16:28, 'James Siri' via clearlydefined wrote:
> Howdy Qing and the rest of ClearlyDefined,
>
> I am casting a broad net here because I am ignorant of what the proper
> processes  are so apologies if this isn’t the correct path to follow so
> let me know if there is a more formal way to continue forward.

Because of some staff shuffling, I'm not actually sure right now who can
authorize this. I've pinged some folks, and will try to make sure you
get an answer from *someone*.

--
-- Josh Berkus
OSI Board

James Siri

Sep 4, 2025, 8:18:57 PM
to Josh Berkus, Tomlinson, Qing, Paul Malmsten, clearly...@googlegroups.com

Delightful and much thanks Josh!

 

James

 


Tomlinson, Qing

Sep 4, 2025, 9:35:21 PM
to James Siri, Josh Berkus, Paul Malmsten, clearly...@googlegroups.com, Jeff Mendoza, Roman Iakovlev

Hi James,

 

Thank you for reaching out!

 

The SHA-1 and SHA-256 hashes for a specific package can be found within its definition under the described.hashes property. If these hashes meet your requirements, truncated versions of package definitions (excluding files) are already publicly accessible via the change notification blob store. For example, the truncated definition for pypi/pylint/3.2.3 can be accessed at:

https://clearlydefinedprod.blob.core.windows.net/changes-notifications/pypi/pypi/-/pylint/3.2.3.json.

 

For further details, you can refer to the documentation here:

https://github.com/clearlydefined/service/blob/master/docs/change-notification-api.md.

 

To the best of my knowledge, there is no rate limit currently applied to querying this endpoint.
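As a rough illustration of the lookup described above, the sketch below reads the package-level hashes out of a truncated definition. The URL is the one quoted in this message; the `sha1` and `sha256` key names under `described.hashes` and both helper names are assumptions made for the example, and the network fetch is kept in its own function so the parsing can be tried on any dictionary.

```python
import json
from urllib.request import urlopen

# URL copied from the message above (truncated definition for pypi/pylint/3.2.3).
EXAMPLE_URL = (
    "https://clearlydefinedprod.blob.core.windows.net/"
    "changes-notifications/pypi/pypi/-/pylint/3.2.3.json"
)


def fetch_definition(url: str) -> dict:
    """Download and parse one truncated definition (performs a network call)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def package_hashes(definition: dict) -> dict:
    """Return the package-level hashes found under described.hashes.

    Assumes keys such as "sha1" and "sha256"; missing sections yield {}.
    """
    described = definition.get("described") or {}
    return dict(described.get("hashes") or {})
```

Calling `package_hashes(fetch_definition(EXAMPLE_URL))` would return whatever hashes the published definition carries. Note that these are digests of the package archive as a whole, not of the files inside it.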

 

I hope this information helps with your exploration. I've also included Roman, who is an expert on the topic of change notifications, in case you have any further questions.

 

Best regards,

Qing

Paul Malmsten

Sep 5, 2025, 11:37:34 AM
to Tomlinson, Qing, James Siri, Josh Berkus, clearly...@googlegroups.com, Jeff Mendoza, Roman Iakovlev
Hi all, Qing,

Thanks for that info - however, we do need the files array. The information we want to index by is the content digest of individual files contained within an OSS package version distributable; indexing by the content digest of an overall .tar.gz/.zip/.nupkg itself is not sufficient for our use case.

Is there any other way to retrieve files arrays at scale (so that we can index them)? Or if there is some existing index searchable by file array content digest, I'd of course love to know.
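To make the request concrete, here is a minimal sketch of the kind of index being asked for: a mapping from a file's content digest to every package version that ships that file. The document shape (a `files` array whose entries carry `path` and a `hashes` object) is an assumption about the harvest data, and `index_files_by_digest` is a hypothetical helper, not an existing ClearlyDefined API.

```python
from collections import defaultdict


def index_files_by_digest(harvests, algorithm="sha256"):
    """Map each file digest to the (coordinates, path) pairs that contain it.

    `harvests` is an iterable of (coordinates, harvest_document) pairs.
    Assumed document shape: {"files": [{"path": ..., "hashes": {"sha256": ...}}]}.
    """
    index = defaultdict(list)
    for coordinates, document in harvests:
        for entry in document.get("files", []):
            digest = (entry.get("hashes") or {}).get(algorithm)
            if digest:
                # Normalise case so lookups do not depend on hex casing.
                index[digest.lower()].append((coordinates, entry.get("path")))
    return dict(index)
```

A lookup such as `index.get(digest.lower(), [])` then answers "which known package versions contain this exact file?", which a package-level archive digest cannot do.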

Thanks,
~Paul Malmsten

 

Nick Vidal

Sep 5, 2025, 1:26:47 PM
to Paul Malmsten, Philippe Ombredanne, Tomlinson, Qing, James Siri, Josh Berkus, clearly...@googlegroups.com, Jeff Mendoza, Roman Iakovlev
Hi Paul,

This is Nick Vidal, community manager of ClearlyDefined.

I'm adding Philippe Ombredanne, one of the co-founders of ClearlyDefined, who might also help address your question.

Thanks,
Nick



Philippe Ombredanne

Sep 11, 2025, 7:33:48 AM
to Nick Vidal, Paul Malmsten, Tomlinson, Qing, James Siri, Josh Berkus, clearly...@googlegroups.com, Jeff Mendoza, Roman Iakovlev
Nick:
Thanks for pinging me here as I had missed that email!

James, Paul, and all:
Here are a couple thoughts:

James, you wrote:
> Paul is my teammate and is working on a way to identify malicious changes to packages after they have been published, and we are interested in the data ClearlyDefined has been building over the years because we believe we could leverage it in this battle. In his own words.

> Microsoft is running an internal project to identify currently unrecognized binaries checked in to internal source control repositories at scale, which we call 'foreign checked-in binaries'.

This is an awesome initiative. We need to chat, as we (at AboutCode) have
built some code to actually detect (potentially malicious) mismatches
between source and binaries through reverse engineering. This is
a feature set we call "Back2source" and is supported by NLnet [1] and
[2]. We are actually deploying that at a larger scale, starting with
the most popular packages across the top ecosystems (pypi, maven, npm
and rust), and we also have an upcoming joint project with Apache
log4j maintainers, another one with Rust folks, yet another one for
Nix packages, and a detailed plan for dealing with NuGet that has its
own challenges.

> The ClearlyDefined dataset interests us because it includes content digests for the files contained within distributed archives for many OSS application package formats.

I think this is only part of what's needed. The code we crafted for
AboutCode deals with the content proper (source and binaries), not
just the digests to be able to answer that question: do we have all
the source code that corresponds to the binaries in that package
(npm/pypi/nuget/maven/etc)?

> To determine whether this dataset would be of use, we would like to query it by content digest (SHA-1 or SHA-256). However, I do not see such an API for that purpose today - so as a prototype, we would like to ingest the data into a temporary Azure Data Explorer database that would allow us to search the data by file hash. This is intended to be a temporary prototype to start, just to establish the usefulness of the dataset; if it proves to be useful, we would revisit what a more correct long-term approach would look like.

This makes sense. I also have a good part of that already in PurlDB
(including an API and a decent subset of ClearlyDefined) and I can
share a Postgres dump too to help you get started with a local
deployment. You may like what you see, or you may just use it for a
proof of concept, or ignore it. This is open source :)
We have built code there to get all the data from ClearlyDefined FWIW,
but this is not efficient.

> For now, I would like to create a service principal having read access to the specific data we need in Blob Storage, associate the service principal with a certificate we create in our Azure directory+subscription (such that my team pays for the Azure Data Explorer cluster), and use that service principal to copy the data into our temporary cluster.

FYI, as part of some plans that I am putting together, we would like
to ensure easy synchronization and mirroring of the whole data set in
the open by everyone. But that's a lesser issue in your case since you
would be Azure-to-Azure.

> Is this something that the project has done before and would be amenable to let us starting our experiment? Would be happy to hop on a call and discuss further if it would be helpful!

As said above, this would be an awesome experiment and I would love to
support in any modest way I can.
Let's set up a quick call to discuss the specifics.

[1] https://nlnet.nl/project/Back2source/
[2] https://nlnet.nl/project/Back2source-next/

--
Philippe Ombredanne
AboutCode.org
Package URL (PURL), ScanCode, DejaCode, PurlDB and VulnerableCode

Paul Malmsten

Oct 21, 2025, 5:04:49 PM
to Philippe Ombredanne, Nick Vidal, Nana Wu, Tomlinson, Qing, James Siri (CELA), Josh Berkus, clearly...@googlegroups.com, Jeff Mendoza, Roman Iakovlev
Thanks for the replies Nick and Philippe. Apologies for the delay - I personally have been very distracted by some npm supply chain attacks in early September - but we would love to continue this conversation.

Adding my manager @Nana Wu who is going to bring a few more folks into this thread who can carry it forward.

Thanks,
~Paul Malmsten




Nana Wu

Oct 21, 2025, 5:20:04 PM
to Paul Malmsten, Philippe Ombredanne, Nick Vidal, Sebastian Gomez, Joe Schmitt, Gabriel Castro, Edgar Ruiz Silva, Jorge Fernandez Alfonso, Tomlinson, Qing, James Siri (CELA), Josh Berkus, clearly...@googlegroups.com, Jeff Mendoza, Roman Iakovlev

Thanks @Paul Malmsten. Adding @Sebastian Gomez and @Joe Schmitt to the thread to continue this effort, and @Jorge Fernandez Alfonso, @Gabriel Castro, @Edgar Ruiz Silva for visibility.

 

Hi @Philippe Ombredanne, @Nick Vidal,

 

To push forward the initiative described below, we are working on an SFI campaign that we aim to launch to prod by the end of this quarter. Can you help @Sebastian Gomez and @Joe Schmitt get started on evaluating this dataset, including how to get access to the data and how to read it? Our plan is to load this data into Kusto so that our joined queries can generate S360 action items.

 

Thanks,

Nana

Paul Malmsten

Nov 4, 2025, 1:50:51 PM
to clearly...@googlegroups.com, Philippe Ombredanne, Nick Vidal, Josh Berkus, Jeff Mendoza, Tomlinson, Qing, Nana Wu, Joe Schmitt, Edgar Ruiz Silva, Gabriel Castro, James Siri (CELA), Jorge Fernandez Alfonso, Roman Iakovlev, Mounika Rendedla, Sebastian Gomez, Jasmine Wang (1ES)
Hi again all,

I know a lot of time has elapsed on this conversation - I wish I had had more time to keep the momentum going in September.

In any case, I want to loop back to the original question: are there any strong objections to us running an experiment involving getting a read-only access key to the blob storage account (containing the content digests of files within components), and using that to hydrate an Azure Data Explorer cluster (in a separate subscription we pay for), such that we could efficiently query components by the content digests of files distributed within them?
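One way to picture the "hydrate" step is as a flattening pass that turns each harvest document into one row per file, written out as newline-delimited JSON for bulk loading. The sketch below is only that, under stated assumptions: the input shape (a `files` array with `path` and `hashes`) is assumed, the output column names are hypothetical, and nothing here talks to Blob Storage or Azure Data Explorer.

```python
import json


def harvest_to_rows(coordinates, document):
    """Yield one flat row per file in a harvest document.

    Assumed input shape:
        {"files": [{"path": ..., "hashes": {"sha1": ..., "sha256": ...}}]}
    The output column names are hypothetical.
    """
    for entry in document.get("files", []):
        hashes = entry.get("hashes") or {}
        yield {
            "coordinates": coordinates,
            "path": entry.get("path"),
            "sha1": hashes.get("sha1"),
            "sha256": hashes.get("sha256"),
        }


def rows_to_jsonl(rows):
    """Serialise rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)
```

Keeping one row per file means a query by `sha1` or `sha256` becomes a plain equality filter on a column, rather than a search inside nested arrays.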

If the experiment pans out, we would of course be happy to talk about how Microsoft could contribute to a longer-term, community-focused approach for enabling access to this data.

Is it reasonable to ask that any strong objections be raised this week if they exist?

Thanks for your time,
~Paul Malmsten

 

Philippe Ombredanne

Nov 4, 2025, 3:08:27 PM
to Paul Malmsten, clearly...@googlegroups.com, Nick Vidal, Josh Berkus, Jeff Mendoza, Tomlinson, Qing, Nana Wu, Joe Schmitt, Edgar Ruiz Silva, Gabriel Castro, James Siri (CELA), Jorge Fernandez Alfonso, Roman Iakovlev, Mounika Rendedla, Sebastian Gomez, Jasmine Wang (1ES)
Paul:
Sorry for the late reply!
I do not think anyone has any objections, otherwise they would have
been brought up! I'd be glad to help in any way I can.
Feel free to also join the weekly calls.

It would be awesome to get your feedback and contributions of course!

FWIW, for your checksum needs you will find:
- sha1 and sha256 for each file in the "clearlydefined" harvest
- sha1, md5, sha256 and sha1_git for each file in the latest scancode harvest

NB: the sha1_git is the Git SHA-1 blob hash, which is also now an ISO
standard as part of the Software Heritage identifier (SWHID) specification:
https://www.swhid.org/swhid-specification/v1.2/5.Core_identifiers/#52-contents
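For readers who want to reproduce these digests locally, the sketch below computes all four for a file's bytes. The `sha1_git` value follows Git's blob hashing (SHA-1 over the header `blob <size>\0` followed by the content); that the scancode field matches this exactly is assumed from the note above, and `file_digests` is a hypothetical helper name.

```python
import hashlib


def file_digests(content: bytes) -> dict:
    """Compute the four per-file digests named above for one file's bytes."""
    # Git hashes a blob as SHA-1 over b"blob <size>\0" followed by the content.
    git_header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return {
        "sha1": hashlib.sha1(content).hexdigest(),
        "md5": hashlib.md5(content).hexdigest(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + content).hexdigest(),
    }
```

For an empty file, `sha1_git` comes out as e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the familiar Git empty-blob ID, while the plain `sha1` is da39a3ee5e6b4b0d3255bfef95601890afd80709. The two differ because of the header, so they cannot be used interchangeably as index keys.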

You can get these in the harvest section of the blob store at paths
similar to that of the API:
https://dev-api.clearlydefined.io/harvest/nuget/nuget/-/NuGet.Protocol/6.7.1?raw
would be JSON combined from multiple blobs stored at:
nuget/nuget/-/NuGet.Protocol/revision/6.7.1/tool/scancode
and
nuget/nuget/-/NuGet.Protocol/revision/6.7.1/tool/clearlydefined
and more
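The relationship between the API URL and the blob paths above can be sketched as two small string builders. The host and the `?raw` suffix are copied from the example in this message, and the path layout is inferred from the two blob paths shown; the `Coordinates` type and both function names are hypothetical.

```python
from typing import NamedTuple

# Host taken from the example URL in the message above.
API_BASE = "https://dev-api.clearlydefined.io"


class Coordinates(NamedTuple):
    type: str
    provider: str
    namespace: str  # "-" when the package has no namespace
    name: str
    revision: str


def harvest_api_url(c: Coordinates) -> str:
    """Build the harvest API URL in the form quoted above."""
    return (
        f"{API_BASE}/harvest/{c.type}/{c.provider}/"
        f"{c.namespace}/{c.name}/{c.revision}?raw"
    )


def harvest_blob_prefix(c: Coordinates, tool: str) -> str:
    """Build the blob path prefix for one tool's harvest output."""
    return (
        f"{c.type}/{c.provider}/{c.namespace}/{c.name}/"
        f"revision/{c.revision}/tool/{tool}"
    )
```

With `Coordinates("nuget", "nuget", "-", "NuGet.Protocol", "6.7.1")`, these reproduce the URL and the two blob paths listed above for the `scancode` and `clearlydefined` tools.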

--
Cheers
Philippe Ombredanne
AboutCode.org
Package URL (PURL), ScanCode, DejaCode, PurlDB and VulnerableCode
Book a call at https://cal.com/pombreda

Paul Malmsten

Nov 6, 2025, 4:25:15 PM
to Philippe Ombredanne, clearly...@googlegroups.com, Nick Vidal, Josh Berkus, Jeff Mendoza, Tomlinson, Qing, Nana Wu, Joe Schmitt, Edgar Ruiz Silva, Gabriel Castro, James Siri (CELA), Jorge Fernandez Alfonso, Roman Iakovlev, Mounika Rendedla, Sebastian Gomez, Jasmine Wang (1ES)
OK great, thank you! That additional context about types of file hashes calculated via clearlydefined vs. via scancode is helpful - I had looked at the clearlydefined harvest data but not the scancode one yet.

We will give this a try and let folks know how it goes.

Thanks,
~Paul Malmsten




Philippe Ombredanne

Nov 6, 2025, 5:16:19 PM
to Paul Malmsten, clearly...@googlegroups.com, Nick Vidal, Josh Berkus, Jeff Mendoza, Tomlinson, Qing, Nana Wu, Joe Schmitt, Edgar Ruiz Silva, Gabriel Castro, James Siri (CELA), Jorge Fernandez Alfonso, Roman Iakovlev, Mounika Rendedla, Sebastian Gomez, Jasmine Wang (1ES)
Paul:
Also please join our weekly community calls!
Some folks voiced concerns there yesterday that heavy access to the
blob store could have side effects on the overall performance of
the system, so it would be great if you could join.
We agreed that the project will eventually need to develop a
project-level policy for this kind of read-only mass access (as
opposed to the more controlled access through the API), covering
auth and similar matters.
--
Cordially
Philippe Ombredanne
AboutCode.org
Package URL (PURL), ScanCode, DejaCode, PurlDB and VulnerableCode
Book a call at https://cal.com/pombreda

Paul Malmsten

Nov 6, 2025, 7:08:23 PM
to Philippe Ombredanne, clearly...@googlegroups.com, Nick Vidal, Josh Berkus, Jeff Mendoza, Tomlinson, Qing, Nana Wu, Joe Schmitt, Edgar Ruiz Silva, Gabriel Castro, James Siri (CELA), Jorge Fernandez Alfonso, Roman Iakovlev, Mounika Rendedla, Sebastian Gomez, Jasmine Wang (1ES)
Sure thing, I'll join the next call - we definitely don't want the reads to be disruptive.

It looks like the website says Wednesdays at 10am EST, but the minutes say 10:30am EST? Is the correct time 10:30am EST?

Thanks,
~Paul Malmsten




Nick Vidal

Nov 7, 2025, 6:59:12 AM
to Paul Malmsten, Philippe Ombredanne, clearly...@googlegroups.com, Josh Berkus, Jeff Mendoza, Tomlinson, Qing, Nana Wu, Joe Schmitt, Edgar Ruiz Silva, Gabriel Castro, James Siri (CELA), Jorge Fernandez Alfonso, Roman Iakovlev, Mounika Rendedla, Sebastian Gomez, Jasmine Wang (1ES)
Hi Paul,

Thanks for flagging this. We recently changed the time of the meetings. The website is up to date: the meetings are on Wednesdays at 10am EST. I have now updated the time in the minutes as well.

Kind regards,
Nick

Paul Malmsten

Nov 7, 2025, 1:46:12 PM
to Nick Vidal, Philippe Ombredanne, clearly...@googlegroups.com, Josh Berkus, Jeff Mendoza, Tomlinson, Qing, Nana Wu, Joe Schmitt, Edgar Ruiz Silva, Gabriel Castro, James Siri (CELA), Jorge Fernandez Alfonso, Roman Iakovlev, Mounika Rendedla, Sebastian Gomez, Jasmine Wang (1ES)
Got it, thanks! I'll join then.

~Paul Malmsten




Nana Wu

Nov 17, 2025, 7:02:45 PM
to Paul Malmsten, Nick Vidal, Philippe Ombredanne, Joe Schmitt, Edgar Ruiz Silva, Sebastian Gomez, clearly...@googlegroups.com, Josh Berkus, Jeff Mendoza, Tomlinson, Qing, Gabriel Castro, James Siri (CELA), Jorge Fernandez Alfonso, Roman Iakovlev, Mounika Rendedla, Jasmine Wang (1ES)

Hi @Nick Vidal, @Philippe Ombredanne, @Paul Malmsten,

 

FYI, @Edgar Ruiz Silva will own the ClearlyDefined licensing information scope. Last time, we added @Sebastian Gomez and @Joe Schmitt to investigate how to load the ClearlyDefined data into Kusto. What were the findings from the last meeting?

Paul Malmsten

Nov 17, 2025, 7:18:19 PM
to Nana Wu, Nick Vidal, Philippe Ombredanne, Joe Schmitt, Edgar Ruiz Silva, Sebastian Gomez, clearly...@googlegroups.com, Josh Berkus, Jeff Mendoza, Tomlinson, Qing, Gabriel Castro, James Siri (CELA), Jorge Fernandez Alfonso, Roman Iakovlev, Mounika Rendedla, Jasmine Wang (1ES)
@Nana Wu This email thread just concerns how one could access the 'files' array (containing file hashes) from ClearlyDefined harvests to allow teams like us to read file hashes in bulk. I attended the meeting last week; minutes for ClearlyDefined engineering meetings are in the "ClearlyDefined Developers Meetup" Google Doc.

If you have more specific questions about what we discussed, I suggest we take those to a separate thread with fewer people on it.

Thanks,
~Paul Malmsten
