NuGet 3.0 API


Maarten Balliauw

Jul 16, 2014, 2:52:00 AM
to nuget-e...@googlegroups.com
First of all, congratulations on the NuGet v3.0 progress! For those new to the game, read up on http://blog.nuget.org/20140715/nuget-3.0-ctp1.html.

I've been exploring some of the new API endpoints and here's my understanding of how it all works, for future reference and getting brains aligned.

Do correct me if I'm wrong on the above, but it is the baseline for discussing how the new API seems to work.

Again, congratulations on this; it's all pretty straightforward to consume and generate. I do have some remarks up for discussion, though. I'll break them up into several bullets; if I should split them into separate topics, let me know and I'll happily do that.

  • Number of requests having to be done by the client on average
    It seems to me that for every action that will be performed by the client, at least two HTTP requests are to be made: the intercept.json and one of the endpoints listed in there. Search being the most widely used feature probably, I guess two requests for search + one (or two if you count redirects) to download will be the average in the client (3-4 in total). Package restore will be quicker: intercept, packageregistration, package download.

    Looping packages becomes a bit more funky: intercept, segment index, segment, count * package registration -> that's a lot of traffic if I want to list some packages somewhere without searching.

  • Authentication
    Different URLs and lots of opportunity to split requests over domains and all. I love that from a scalability point of view!

    But... what about authentication? Basic authentication will be a mess here if these endpoints are indeed split across multiple domains, at least basic authentication where the RFC is followed (https://www.ietf.org/rfc/rfc2617.txt; in short, the canonical URL + auth realm are used to identify a protected resource). A deviation could be to use only the realm and not the canonical URL, but then what about servers/clients that do follow the standard?

  • Dynamic generation of segment files could be hard
    Imagine having 100,000 packages. That's quite a few segment index and segment files to write out, but feasible to do every now and then. An open question from me: imagine a 100,001st package is added: AAAAA-1.0. Will this have to be the first entry in the segment index? Or can it be appended to the segment index, saying lowest = AAAAA-1.0 and highest = whatever comes next?

    If the answer is: it has to be in the first, then it's pretty crazy stuff to have to generate the segment index and relevant segments over and over again for added packages. In other words: is a full database dump really required on database mutations? Or can it somehow be generated incrementally?

    In addition to that, MyGet and I'm also guessing ProGet and Artifactory have the notion of upstream package sources. Imagine having a feed that aggregates three other feeds, of which all three can be a mixture of v2 and v3 endpoints. Generating the segment index and segments themselves would mean these three package sources will have to be queried for ALL the packages they have. Relatively easy to do on a v3 endpoint, but on v2 that means fetching all pages of the OData feed. Just for fun I tried this on NuGet.org's v2 endpoint and I can tell you it's quite the background job to run and will bring eventual consistency of such an aggregate feed to a whole new level :-)
    If appending is feasible, then this "fetch from upstream" job is more feasible as it means only one full crawl and then incremental crawls (for example based on LastModified date of packages). So again: is appending feasible or does AAAAA-1.0 have to be in the first few segments?

Maarten Balliauw

Jul 18, 2014, 2:57:38 AM
to nuget-e...@googlegroups.com
One more to add :-)

  • Segments for individual packages
    Looking through some of our feeds, I found a couple using their MyGet feed for CI purposes big time. The package, which I will call Steak as I'm in a BBQ mood, has over 6,000 versions. Most are prerelease, some are stable. What happens if someone tries to query http://preview.nuget.org/ver3-ctp1/packageregistrations/1/steak.json? Does this mean a JSON document with over 6,000 entries will be served? Or will this be segmented as well?

    To make this even more complicated: what if I have a client that wants to fetch the 10 latest versions (stable and unstable) of this package so it can show them on their product homepage or company portal? (That's a use case supported by the OData API, and I know a couple of people doing this.) Yes, generating such a file could be done, but tomorrow someone will ask for the 15 latest versions. You get the picture :-) Or is this something that should be handled by the search API? (And now for a very bold question, as I know the reasoning is making it all static and more scalable...) If the search API does handle this, why do we still need the other API endpoints?

    And this of course also relates to my earlier question on "Dynamic generation of segment files".

Jeff Handley

Jul 18, 2014, 5:23:18 AM
to Maarten Balliauw, nuget-e...@googlegroups.com

Thanks for jumping in so quickly and eagerly and getting right down to some awesome questions!

 

We should be able to get back with you on these questions early next week.


John Taylor

Jul 21, 2014, 2:59:52 PM
to nuget-e...@googlegroups.com


On Tuesday, July 15, 2014 11:52:00 PM UTC-7, Maarten Balliauw wrote:
First of all, congratulations on the NuGet v3.0 progress! For those new to the game, read up on http://blog.nuget.org/20140715/nuget-3.0-ctp1.html.

I've been exploring some of the new API endpoints and here's my understanding of how it all works, for future reference and getting brains aligned.

Do correct me if I'm wrong on the above, but they are the baseline for discussion on how the new API seems to work.

Again, congratulations on this; it's all pretty straightforward to consume and generate. I do have some remarks up for discussion, though. I'll break them up into several bullets; if I should split them into separate topics, let me know and I'll happily do that.


Thanks, and your breakdown looks right. But also note that some aspects of this are very specific to how the intercept code works in the client. A clean-room v3 API will simplify some of the constructs and behaviors. There are a couple of services we will add in the next few weeks, but this is the basic model.
 
  • Number of requests having to be done by the client on average
    It seems to me that for every action that will be performed by the client, at least two HTTP requests are to be made: the intercept.json and one of the endpoints listed in there. Search being the most widely used feature probably, I guess two requests for search + one (or two if you count redirects) to download will be the average in the client (3-4 in total). Package restore will be quicker: intercept, packageregistration, package download.

    Looping packages becomes a bit more funky: intercept, segment index, segment, count * package registration -> that's a lot of traffic if I want to list some packages somewhere without searching.

One of the aspects of the new protocol is that it is specifically tailored to make use of client caching behavior.

For example, consider "get-package -ListAvailable" in the current PowerShell console. With the current OData endpoint the client makes a call for every 30 packages; for the 25K listed packages on nuget.org that is more than 800 HTTP calls, all of which have to be handled by the SQL server on the back end. In the new API the client makes a single call to the segment root page and, having cached that, makes one subsequent call per segment. We currently format the segments to contain 1,000 packages each. We intercept the call for 30 packages, fetch 1,000 instead, and cache the result. So overall we'd make around 26 HTTP calls for the basic package listing. We also only include the fields the scenario requires, which in this case are id, version and description. If you want more data you follow links (the current intercept shim doesn't).
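
To make that flow concrete, here is a minimal sketch in TypeScript of the kind of client-side walk described above (assuming Node 18+ for the global fetch; the segment index and segment JSON shapes are invented for illustration and are not the actual CTP1 schema):

    // Minimal sketch of the segment walk: one call for the root, one per segment, all cached.
    // NOTE: the SegmentIndex and Segment shapes are assumptions for illustration only.
    interface SegmentIndex { segments: { url: string; lowest: string; highest: string }[]; }
    interface Segment { packages: { id: string; version: string; description: string }[]; }

    const cache = new Map<string, unknown>();

    async function getJson<T>(url: string): Promise<T> {
      if (cache.has(url)) return cache.get(url) as T;   // client caching does the heavy lifting
      const doc = (await (await fetch(url)).json()) as T;
      cache.set(url, doc);
      return doc;
    }

    // For ~25K packages at 1,000 per segment this is roughly 1 + 25 = 26 HTTP calls,
    // instead of one OData page request per 30 packages.
    async function listAvailable(segmentIndexUrl: string): Promise<void> {
      const index = await getJson<SegmentIndex>(segmentIndexUrl);
      for (const entry of index.segments) {
        const segmentUrl = new URL(entry.url, segmentIndexUrl).toString();
        const segment = await getJson<Segment>(segmentUrl);
        for (const p of segment.packages) {
          console.log(`${p.id} ${p.version} - ${p.description}`);
        }
      }
    }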

Note that we arrange the resources (the "segments") into a tree. We experimented a lot with this, and a tree ended up being the simplest from a couple of different angles.

Also note that because we pre-create the results, the HTTP calls are always served from our CDN and not from any web service.

So the performance improvement is a combination of factors, some of which we certainly could have achieved by tweaking the OData endpoint; however, we also gained high availability, which is massive. (It's also much cheaper in terms of resources, so it's faster, more available and much cheaper.)

The code is also significantly simpler. If you want to serve this data without a CDN, that's no problem; we don't really presume the CDN or storage or anything like that. We just designed it so the simplest implementation can be a load of static files. Certainly you could simulate that with code and a database, but we're not; we literally materialize the files.

We also took the opportunity to fix the client code around search. The client was making two calls every time you opened the Manage Packages dialog box: the first to get the count, the second to get the results, even though the results also contained the count. Now we just make the one call to get the results, and again we cache that. This alone makes the search roughly twice as fast. We also bypass the gallery code and call the back-end search service directly, which helps a little more. There is more we could do here; for example, the first "empty" search call could be served by a static blob from the CDN too, with no need to even hit the search service for that. Anyhow, you'll notice it's faster.
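
In client terms, the "one call instead of two" pattern looks roughly like this (the /query path and the totalHits/data field names here are placeholders for illustration, not the real endpoint):

    // Sketch only: a single search request whose response carries both the hits and the count.
    // The endpoint path and response shape below are placeholders.
    interface SearchResult { id: string; version: string; description: string; }
    interface SearchResponse { totalHits: number; data: SearchResult[]; }

    async function search(searchServiceUrl: string, q: string, skip = 0, take = 30): Promise<SearchResponse> {
      const url = `${searchServiceUrl}/query?q=${encodeURIComponent(q)}&skip=${skip}&take=${take}`;
      // No separate /count round trip: the count ships with the results, and the response is cacheable.
      return (await (await fetch(url)).json()) as SearchResponse;
    }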

 
  • Authentication
    Different URLs and lots of opportunity to split requests over domains and all. I love that from a scalability point of view!

    But... what about authentication? Basic authentication will be a mess here if these endpoints are indeed split across multiple domains, at least basic authentication where the RFC is followed (https://www.ietf.org/rfc/rfc2617.txt; in short, the canonical URL + auth realm are used to identify a protected resource). A deviation could be to use only the realm and not the canonical URL, but then what about servers/clients that do follow the standard?

I don't think authentication changes much here. If you need to authenticate, then you need to put that data behind a secure endpoint or encrypt it. Obviously that would limit the applicability of things like a content delivery network, but the basic model of lots of interlinked static files doesn't change.
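
For clients that do need credentials, one way to keep basic auth workable when the interlinked resources live on several hosts is to key the stored credentials by host rather than by a single feed URL. A rough sketch (the host names and the credential store are hypothetical):

    // Sketch: per-host credential lookup so resources on different domains each get the
    // right Authorization header. Hosts and credentials are made up for illustration.
    const credentials = new Map<string, { user: string; pass: string }>([
      ["feed.example.com", { user: "alice", pass: "secret" }],
      ["cdn.example.com",  { user: "alice", pass: "secret" }],
    ]);

    async function fetchAuthenticated(url: string): Promise<Response> {
      const host = new URL(url).host;
      const cred = credentials.get(host);
      const headers: Record<string, string> = {};
      if (cred) {
        headers["Authorization"] = "Basic " + Buffer.from(`${cred.user}:${cred.pass}`).toString("base64");
      }
      return fetch(url, { headers });
    }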
 
  • Dynamic generation of segment files could be hard
    Imagine having 100,000 packages. That's quite a few segment index and segment files to write out, but feasible to do every now and then. An open question from me: imagine a 100,001st package is added: AAAAA-1.0. Will this have to be the first entry in the segment index? Or can it be appended to the segment index, saying lowest = AAAAA-1.0 and highest = whatever comes next?

    If the answer is: it has to be in the first, then it's pretty crazy stuff to have to generate the segment index and relevant segments over and over again for added packages. In other words: is a full database dump really required on database mutations? Or can it somehow be generated incrementally?

    In addition to that, MyGet and I'm also guessing ProGet and Artifactory have the notion of upstream package sources. Imagine having a feed that aggregates three other feeds, of which all three can be a mixture of v2 and v3 endpoints. Generating the segment index and segments themselves would mean these three package sources will have to be queried for ALL the packages they have. Relatively easy to do on a v3 endpoint, but on v2 that means fetching all pages of the OData feed. Just for fun I tried this on NuGet.org's v2 endpoint and I can tell you it's quite the background job to run and will bring eventual consistency of such an aggregate feed to a whole new level :-)
    If appending is feasible, then this "fetch from upstream" job is more feasible as it means only one full crawl and then incremental crawls (for example based on LastModified date of packages). So again: is appending feasible or does AAAAA-1.0 have to be in the first few segments?

We are still working on the incremental update code; I've got some prototype implementations that I'm reasonably happy with. What you are seeing in the published segments was generated directly from the database. A couple of things are interesting here: we have had some success when we arrange the data into trees, because trees are easy to update consistently when we work up from the leaf, and it's also easy to find things in ordered trees. Internally we use an append-only structure ordered by commit time, which we call the Catalog. We love append-only structures for various reasons, and that idea is fundamental to the Catalog. This stuff is surprisingly simple and works fast at serious scale. One thing we have learnt is that it can be nice to have initial builds behave very differently from incremental updates; there are certainly optimizations you want to make use of on an initial build that aren't necessary, or even available, with an incremental update.

So I don't know whether this answers your questions other than to say this is still a work in progress. Perhaps take a look at the Catalog stuff in NuGet.Services.Metadata; I think the segment stuff will end up similar, basically with additional smarts around incremental change within the structure, not just at the tail.
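
As a rough illustration of the append-only idea (not the actual Catalog code; the page size, field names and JSON shapes are assumptions): a new commit only ever touches the tail page and its index entry, so pages written earlier never need to be rewritten.

    // Sketch of an append-only catalog ordered by commit time. All shapes are assumptions,
    // and the index and page list are assumed to be kept consistent by the writer.
    interface CatalogItem { id: string; version: string; commitTimeStamp: string; }
    interface CatalogPage { items: CatalogItem[]; }
    interface CatalogIndex { pages: { url: string; commitTimeStamp: string }[]; }

    const PAGE_SIZE = 1000;

    function appendCommit(index: CatalogIndex, pages: CatalogPage[], item: CatalogItem): void {
      let page: CatalogPage | undefined = pages[pages.length - 1];
      if (!page || page.items.length >= PAGE_SIZE) {
        // Roll over to a fresh page; previously written pages are never modified.
        page = { items: [] };
        pages.push(page);
        index.pages.push({ url: `page${pages.length - 1}.json`, commitTimeStamp: item.commitTimeStamp });
      }
      page.items.push(item);
      // Only the tail entry of the index moves forward in time.
      index.pages[index.pages.length - 1].commitTimeStamp = item.commitTimeStamp;
    }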

Specifically regarding the replication from upstream package sources, I suspect that will be very similar to the [currently] internal Catalog. If you are interested, http://nugetprod0.blob.core.windows.net/ng-catalogs/0/index.json is the root of the internal Catalog. If you look at the NuGet.Services.Metadata CatalogTest project there are some examples of code to traverse this; there is a simple node.js program, for example. A real client will test against the "type" and only walk from the last timestamp it recorded (see the ResolverCollector for an example). I could imagine the upstream source protocol behaving in a similar way.
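
A client doing that incremental walk could look something like the following sketch (the index/page shapes mirror the assumed ones in the sketch above and may not match the actual Catalog schema):

    // Sketch of an incremental walk: only pages newer than the last recorded commit
    // timestamp are fetched. The shapes are assumptions about the Catalog format.
    interface CatalogItem { id: string; version: string; commitTimeStamp: string; }
    interface CatalogPage { items: CatalogItem[]; }
    interface CatalogIndex { pages: { url: string; commitTimeStamp: string }[]; }

    async function collectSince(indexUrl: string, lastSeen: Date): Promise<CatalogItem[]> {
      const index = (await (await fetch(indexUrl)).json()) as CatalogIndex;
      const collected: CatalogItem[] = [];
      for (const pageRef of index.pages) {
        if (new Date(pageRef.commitTimeStamp) <= lastSeen) continue;   // already processed
        const pageUrl = new URL(pageRef.url, indexUrl).toString();
        const page = (await (await fetch(pageUrl)).json()) as CatalogPage;
        for (const item of page.items) {
          if (new Date(item.commitTimeStamp) > lastSeen) collected.push(item);
        }
      }
      return collected;
    }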

 

John Taylor

Jul 21, 2014, 3:05:25 PM
to nuget-e...@googlegroups.com


On Thursday, July 17, 2014 11:57:38 PM UTC-7, Maarten Balliauw wrote:
One more to add :-)

  • Segments for individual packages
    Looking through some of our feeds, I found a couple using their MyGet feed for CI purposes big time. The package, which I will call Steak as I'm in a BBQ mood, has over 6,000 versions. Most are prerelease, some are stable. What happens if someone tries to query http://preview.nuget.org/ver3-ctp1/packageregistrations/1/steak.json? Does this mean a JSON document with over 6,000 entries will be served? Or will this be segmented as well?

Segmented in a way, I imagine. One thing to note is that even the traversal of the initial result is hyperlinked; in the current case the package URIs on the registration page are fragment URIs, but logically they could be more remote. (Note that the current intercept code does not traverse; it just loads the JSON. In the future we'll make this smarter.)
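
A sketch of what such a traversal could look like once version entries can live either inline (fragment URIs) or on remote pages; the shapes here are placeholders, not the CTP1 registration schema:

    // Sketch: a registration whose version entries are either inlined in the document
    // (fragment URIs) or linked out to remote pages. All shapes are placeholders.
    interface VersionEntry { id: string; version: string; }
    interface RegistrationPage { inline?: VersionEntry[]; pageUrl?: string; }
    interface Registration { pages: RegistrationPage[]; }

    async function allVersions(registrationUrl: string): Promise<VersionEntry[]> {
      const reg = (await (await fetch(registrationUrl)).json()) as Registration;
      const versions: VersionEntry[] = [];
      for (const page of reg.pages) {
        if (page.inline) {
          versions.push(...page.inline);           // fragment: the data lives in this document
        } else if (page.pageUrl) {
          // remote: follow the hyperlink and load the page separately
          const pageUrl = new URL(page.pageUrl, registrationUrl).toString();
          const remote = (await (await fetch(pageUrl)).json()) as { items: VersionEntry[] };
          versions.push(...remote.items);
        }
      }
      return versions;
    }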

Maarten Balliauw

Jul 22, 2014, 2:33:59 AM
to nuget-e...@googlegroups.com
Can a full list of the endpoints, how they relate, and how they have to be structured be posted once the API interface is a bit more stable?