Usage of ClearlyDefined API

Yaroslav Russkih

Aug 8, 2023, 5:20:41 AM

to clearly...@googlegroups.com

Dear ClearlyDefined community,

We have a few ideas on how we could use data from your API in our (JetBrains) products, primarily in IDEs.

But with the current rate limit, we can hardly get sufficient data on package licenses.

Do you think we could get your permission to use the data, along with a more relaxed rate limit on your API?

If you could point me to where we could get the data as an artifact without querying your API, that would be even better, as it would save effort on both sides.

Thank you.

—

Yaroslav Russkih
Security and Data Protection Lead

Nick Vidal

Aug 10, 2023, 1:32:11 PM

to Yaroslav Russkih, Jeff Wilcox, clearly...@googlegroups.com

Hi Yaroslav,

Thank you for your interest in using ClearlyDefined with JetBrains products.

I don't think we have a rate limit, but in any case I'm copying Jeff who might be able to clarify.

Kind regards,

Nick

--
You received this message because you are subscribed to the Google Groups "clearlydefined" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clearlydefine...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/clearlydefined/02AEFC68-9606-4F65-A953-A8CAF97FF86E%40jetbrains.com.

Jeff McAffer

Aug 10, 2023, 6:31:28 PM

to Nick Vidal, Yaroslav Russkih, Jeff Wilcox, clearly...@googlegroups.com

There is (was) a rate limit, at least on the number of components you could list in one request. I think there was also a limit on request frequency.

It would be great for JetBrains to integrate ClearlyDefined data. However, I don't think the team can commit to standing up a production server for that need. (I'm not deeply involved any more so may be speaking out of turn). The idea was instead that organizations wanting to use the data in production would mirror the data into a store/service suitable for their needs and serve it to their customers from there. This gives them full control of the uptime. Curation workflows can still either go directly to clearlydefined.io or hit the APIs via your product workflows.

I don't know the current state of the mirroring technology. I know it was high on the list some time ago but am not sure if it got implemented (or how).

Jeff

Jeff McAffer

Aug 10, 2023, 6:50:35 PM

to Nick Vidal, Yaroslav Russkih, Jeff Wilcox, clearly...@googlegroups.com

Quick clarification. I erroneously saw the "Jeff" in that mail as me, whereas it was actually referencing Jeff Wilcox. Please consider my information anecdotal. Jeff (the other one) is much closer to the current state.

Jeff (McAffer)

Yaroslav Russkih

Aug 11, 2023, 7:14:38 AM

to Jeff Wilcox, Nick Vidal, clearly...@googlegroups.com, Jeff McAffer

Hello Jeff, Jeff and Nick,

Thank you for your answers!

That’s exactly the plan - to cache the data on our side.

The issue as it looks now is that we need data for 12 million packages and with 1 request every 5 seconds (?), it takes some time.
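
To put that estimate in numbers, here is a minimal sketch of the arithmetic. The function name and the `per_request` parameter are illustrative assumptions (the latter covers the hypothetical case where one request can carry several packages); the 12 million packages and 5 seconds per request are the figures quoted above.

```python
SECONDS_PER_DAY = 86_400

def crawl_days(packages: int, seconds_per_request: float, per_request: int = 1) -> float:
    """Days needed to fetch `packages` items at one request every `seconds_per_request` seconds."""
    requests_needed = -(-packages // per_request)  # ceiling division
    return requests_needed * seconds_per_request / SECONDS_PER_DAY

if __name__ == "__main__":
    # One package per request: 60,000,000 seconds in total.
    print(f"{crawl_days(12_000_000, 5):.1f} days")
```

At one package per request this comes to roughly 694 days, which is why a batch endpoint or a bulk artifact matters at this scale.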

The way I make requests is:

I make a call to e.g. `GET https://api.clearlydefined.io/definitions/npm/npmjs/-/lodash/4.17.21` with `accept: */*` within headers.
I get a response with `X-RateLimit-Remaining` header.
I wait this number of milliseconds before making a subsequent request for the next package.

Could you check the description above and advise if you see any issues with it?

If there is a way to make a batch request, I would also appreciate advice on it.

--

Yaroslav Russkih
Security and Data Protection Lead

JetBrains GmbH
Christoph-Rapparini-Bogen 23
80639 München

https://www.jetbrains.com
The Drive to Develop

Commercial register: Amtsgericht München, HRB 187151
Managing Director: Olga Dyka

Jeff McAffer

Aug 11, 2023, 11:40:19 AM

to Wayne Beaton, Yaroslav Russkih, Jeff Wilcox, Nick Vidal, clearly...@googlegroups.com

Interesting that it fails if one is malformed. Originally I believe malformed or missing coordinates would just result in a null for that entry in the result. That should be an easy fix.

A better approach would be for us to finally implement a mirroring solution.

Jeff

On Fri, Aug 11, 2023, 8:33 AM Wayne Beaton <wayne....@eclipse-foundation.org> wrote:

We use the /definitions endpoint for batch queries.

https://api.clearlydefined.io/api-docs/#/definitions/post_definitions

I had to play with batch sizes a bit, but settled on querying in batches of 500.

If any of your IDs are not well-formed, the whole lot gets rejected, so make sure that you check them before you send them.

It also sometimes rejects well-formed IDs (AFAICT it's a handful of pypi coordinates that cause the issue, but I have not been able to diagnose it further). I wrote some defensive code that recursively splits the content and re-runs the queries to work around the offending items. There's more on Issue 957.

HTH,

Wayne

--
Wayne Beaton
Director of Open Source Projects | Eclipse Foundation

My working day may not be your working day! Please don’t feel obliged to read or reply to this e-mail outside of your normal working hours.

Yaroslav Russkih

Aug 11, 2023, 3:08:09 PM

to Wayne Beaton, Jeff McAffer, Jeff Wilcox, Nick Vidal, clearly...@googlegroups.com

Thank you Wayne for sharing the insight on batched requests! We’ll try it and let you know in case of further questions.

P.S. On the idea of mirroring: it's not necessarily my business, and I believe you have thought of it already, but since at some point you may get a significant number of requests, it might be reasonable to cache the data for individual packages using a CDN-like solution and let folks query the CDN for static data without loading your application/infrastructure.

--
Yaroslav Russkih
Security and Data Protection Lead

Jeff Wilcox

Aug 15, 2023, 7:09:32 AM

to Nick Vidal, Yaroslav Russkih, clearly...@googlegroups.com

Hi all-

If there is rate limiting today, it's very rudimentary.

I think our recommendation for anyone looking to use this in an actual product or service would be to provide your own caching front-end, both to handle the rare case where the data is unavailable and to help reduce direct load on the community resources.

There has been discussion about trying to provide much more explicit rate limiting in the future so that would provide the most flexibility.

From: Nick Vidal <nick....@opensource.org>
Sent: Thursday, August 10, 2023 10:31 AM
To: Yaroslav Russkih <yaroslav...@jetbrains.com>; Jeff Wilcox <Jeff....@microsoft.com>
Cc: clearly...@googlegroups.com <clearly...@googlegroups.com>
Subject: Re: Usage of ClearlyDefined API

Wayne Beaton

Aug 15, 2023, 7:09:38 AM

to Yaroslav Russkih, Jeff Wilcox, Nick Vidal, clearly...@googlegroups.com, Jeff McAffer

We use the /definitions endpoint for batch queries.

https://api.clearlydefined.io/api-docs/#/definitions/post_definitions

I had to play with batch sizes a bit, but settled on querying in batches of 500.

If any of your IDs are not well-formed, the whole lot gets rejected, so make sure that you check them before you send them.

It also sometimes rejects well-formed IDs (AFAICT it's a handful of pypi coordinates that cause the issue, but I have not been able to diagnose it further). I wrote some defensive code that recursively splits the content and re-runs the queries to work around the offending items. There's more on Issue 957.
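
The batching and recursive-split approach described above could be sketched roughly as follows. This is not the actual defensive code mentioned in this message. All function names are hypothetical, the five-segment `type/provider/namespace/name/revision` check is an assumed notion of "well-formed", and the request body (a JSON array of coordinate strings) and response shape (an object keyed by coordinate) are assumptions to verify against the API docs linked above.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional

Definitions = Dict[str, Optional[dict]]

def is_well_formed(coordinate: str) -> bool:
    """Assumed shape: type/provider/namespace/name/revision, every segment non-empty."""
    parts = coordinate.split("/")
    return len(parts) == 5 and all(parts)

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def post_definitions(coords: List[str], timeout: float = 60.0) -> dict:
    """POST one batch; body and response shapes are assumptions, not confirmed here."""
    request = urllib.request.Request(
        "https://api.clearlydefined.io/definitions",
        data=json.dumps(coords).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

def fetch_with_split(coords: List[str], post_batch: Callable[[List[str]], dict]) -> Definitions:
    """Try the whole batch; on any failure, halve it and retry each half."""
    if not coords:
        return {}
    try:
        return dict(post_batch(coords))
    except Exception:  # deliberately broad for a sketch; narrow this in real code
        if len(coords) == 1:
            return {coords[0]: None}  # isolated an offending coordinate
        middle = len(coords) // 2
        result = fetch_with_split(coords[:middle], post_batch)
        result.update(fetch_with_split(coords[middle:], post_batch))
        return result

def fetch_all(coords: List[str], post_batch: Callable[[List[str]], dict],
              batch_size: int = 500) -> Definitions:
    """Drop malformed coordinates, then query the rest in batches."""
    valid = [c for c in coords if is_well_formed(c)]
    result: Definitions = {}
    for batch in chunked(valid, batch_size):
        result.update(fetch_with_split(batch, post_batch))
    return result
```

A call such as `fetch_all(coordinates, post_definitions)` would return `None` for each coordinate that is rejected even on its own. Because the sketch catches every exception, a rate-limit or timeout failure would also trigger splitting, so real code would want to tell those cases apart.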

HTH,

Wayne

Wayne Beaton

Aug 15, 2023, 9:56:35 AM

to Jeff Wilcox, Nick Vidal, Yaroslav Russkih, clearly...@googlegroups.com

api.clearlydefined.io definitely implements rate limiting.

Here's what the response headers show after I've poked at the API a few times.

$ curl -v -X POST "https://api.clearlydefined.io/definitions" -H "accept: application/json" -H "Content-Type: application/json"

...

< x-ratelimit-limit: 250
< x-ratelimit-remaining: 235
< x-ratelimit-reset: 1692101784

...

$ _

When we hit the limit, we have to wait until the specified UNIX "reset" time to try again.

Because of how we use the API, we don't often hit the rate limit. But it does happen.
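
A minimal sketch of the waiting logic described above, assuming the three headers shown in the curl output mean what their names suggest (`x-ratelimit-remaining` as a request count and `x-ratelimit-reset` as a UNIX timestamp in seconds). The function name and the `now` parameter are illustrative, not part of any ClearlyDefined client.

```python
import time
from typing import Mapping, Optional

def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to pause before the next request; 0.0 while quota remains."""
    lowered = {key.lower(): value for key, value in headers.items()}
    # If the header is missing, assume quota is available rather than stalling.
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(lowered.get("x-ratelimit-reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

In a request loop this would be used as `time.sleep(seconds_until_reset(response.headers))` after each call, so the client only pauses once the remaining count reaches zero.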

Wayne

Yaroslav Russkih

Sep 1, 2023, 6:12:02 AM

to Wayne Beaton, Jeff Wilcox, Nick Vidal, clearly...@googlegroups.com

Hello Wayne,

Sorry to bother you. I’m trying to use batched requests, but it fails with a SocketTimeoutException in 70% of cases, with a 15-second socket timeout, when I send 500 packages per request.

Is this a known issue? Should I reduce the number of packages I send in a single request?

Thank you.

--

Yaroslav Russkih
Security and Data Protection Lead
