Howdy Qing and the rest of ClearlyDefined,
I am casting a broad net here because I am unfamiliar with the proper process, so apologies if this isn't the correct path to follow; please let me know if there is a more formal way to proceed.
Paul is my teammate and is working on a way to identify malicious changes to packages after they have been published. We are interested in the data ClearlyDefined has been building over the years because we believe we could leverage it in this battle. In his own words:
Microsoft is running an internal project to identify currently unrecognized binaries checked in to internal source control repositories at scale, which
we call 'foreign checked-in binaries'. The ClearlyDefined dataset interests us because it includes content digests for the files contained within distributed archives for many OSS application package formats.
To determine whether this dataset would be of use, we would like to query it by content digest (SHA-1 or SHA-256). However, I do not see an API for that purpose today, so as a prototype we would like to ingest the data
into a temporary Azure Data Explorer database that would allow us to search the data by file hash. This is intended as a temporary prototype, just to establish the usefulness of the dataset; if it proves useful, we would revisit what a more
appropriate long-term approach would look like.
For now, I would like to create a service principal with read access to the specific data we need in Blob Storage, associate the service principal with a certificate we create in our Azure directory and subscription (so that my
team pays for the Azure Data Explorer cluster), and use that service principal to copy the data into our temporary cluster.
Is this something the project has done before, and would you be amenable to letting us start our experiment? We would be happy to hop on a call and discuss further if that would be helpful!
Thank you for your time and hope to hear from you soon.
James
Delightful and much thanks Josh!
James
From:
Josh Berkus <jbe...@redhat.com>
Date: Thursday, September 4, 2025 at 17:17
To: James Siri <james...@microsoft.com>, Tomlinson, Qing <qing.to...@sap.com>, Paul Malmsten <paulma...@microsoft.com>, clearly...@googlegroups.com <clearly...@googlegroups.com>
Subject: [EXTERNAL] Re: Accessing the ClearlyDefined dataset
Hi James,
Thank you for reaching out!
The SHA-1 and SHA-256 hashes for a specific package can be found within its definition under the described.hashes property. If these hashes meet your requirements, truncated versions of package definitions (excluding files) are already publicly accessible via the change notification blob store. For example, the truncated definition for pypi/pylint/3.2.3 can be accessed at:
https://clearlydefinedprod.blob.core.windows.net/changes-notifications/pypi/pypi/-/pylint/3.2.3.json.
For further details, you can refer to the documentation here:
https://github.com/clearlydefined/service/blob/master/docs/change-notification-api.md.
To the best of my knowledge, there is no rate limit currently applied to querying this endpoint.
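In case it helps the evaluation, here is a minimal sketch (standard-library Python only) of pulling one truncated definition from the change-notification blob store and reading its described.hashes property. The URL follows the pylint example above; the exact key names inside hashes (e.g. sha1, sha256) are an assumption to verify against a real definition.

```python
import json
import urllib.request

# Example URL from this thread; other coordinates follow the same
# {type}/{provider}/{namespace or '-'}/{name}/{version}.json pattern.
URL = ("https://clearlydefinedprod.blob.core.windows.net/"
       "changes-notifications/pypi/pypi/-/pylint/3.2.3.json")


def extract_hashes(definition):
    """Return the described.hashes dict from a (truncated) definition.

    The key names inside 'hashes' (e.g. sha1, sha256) are assumed here;
    check them against a real definition before relying on them.
    """
    return definition.get("described", {}).get("hashes", {})


def fetch_definition(url=URL):
    """Download a truncated definition from the change-notification store."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

For example, `extract_hashes(fetch_definition())` would return the digests for pypi/pylint/3.2.3, which could then be keyed by hash when loading into Azure Data Explorer.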
I hope this information helps with your exploration. I've also included Roman, who is an expert on the topic of change notifications, in case you have any further questions.
Best regards,
Qing
To view this discussion visit https://groups.google.com/d/msgid/clearlydefined/BY1PR21MB4512E293862EC6298C36C63CC803A%40BY1PR21MB4512.namprd21.prod.outlook.com.
Thanks @Paul Malmsten. Adding @Sebastian Gomez and @Joe Schmitt on the thread to continue this effort and @Jorge Fernandez Alfonso, @Gabriel Castro, @Edgar Ruiz Silva for visibility.
Hi @Philippe Ombredanne, @Nick Vidal,
Pushing forward the initiative listed below, we are working on an SFI campaign aiming to launch to prod by the end of this quarter. Can you help @Sebastian Gomez and @Joe Schmitt get started on evaluating this dataset, including how to get access to the data and how to read it? Our plan is to load this data into Kusto for our joined queries to generate S360 action items.
Thanks,
Nana
Hi @Nick Vidal, @Philippe Ombredanne, @Paul Malmsten,
FYI, @Edgar Ruiz Silva will be owning the ClearlyDefined licensing information scope. Last time, we added @Sebastian Gomez and @Joe Schmitt to investigate how to load the ClearlyDefined data into Kusto. What were the findings from the last meeting?