how to download jars


Fredrik Ekholdt

Feb 21, 2013, 12:23:50 PM
to adep...@googlegroups.com

Another question that comes up is how adept should download the actual jars (or any other files that are not in the metadata).
I am therefore assuming that only the metadata is downloaded for offline use, not the actual jars. The reason I think this is fair to assume is that downloading all jars would take too long and would be too error-prone.

The key attributes for me would be that the files are always there to download (e.g. decentralized servers) and that you can download them quickly.

Torrents have both of these properties, so I was thinking that the metadata could contain torrent files that point to the different jars. One torrent would cover all artifacts in a dependency (jar, sources, javadoc, ...) - with torrents you can always choose to download only some of the files inside a torrent.

Any central repository would then just act like a torrent tracker and would seed the files it has. It would be easy to set up more seeders: just grab the torrents, wait till they are downloaded, and seed.
For the case where the central repository is down or experiencing heavy traffic, this scheme would be perfect.
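
Just to make this concrete, here is a rough sketch of what a metadata entry with a torrent reference could look like (all the names here are made up, nothing is decided):

    // Hypothetical sketch: one torrent per module, covering all of its artifacts.
    case class TorrentRef(
      infoHash: String,    // identifies the torrent contents
      announceUrl: String, // the tracker, e.g. the central repository
      files: Seq[String]   // you can pick out just some of these files
    )

    case class ModuleMetadata(org: String, name: String, version: String, torrent: TorrentRef)

    val junit = ModuleMetadata(
      org = "junit", name = "junit", version = "4.7",
      torrent = TorrentRef(
        infoHash = "0123456789abcdef0123456789abcdef01234567",
        announceUrl = "http://repo.example.org/announce",
        files = Seq("junit-4.7.jar", "junit-4.7-sources.jar", "junit-4.7-javadoc.jar")
      )
    )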

What remains to be seen, though, is whether a torrent really is fast enough (I have the impression it takes a bit of time to start one up) and how large a torrent file is. From a random sample it seems to be around 10-20 kB, but that is for bigger files (~700 MB), not jars.

What do you guys think?

I hope this does not overlap with the thread about storing the jars.



Mark Harrah

Feb 21, 2013, 7:46:46 PM
to adep...@googlegroups.com
On Thu, 21 Feb 2013 09:23:50 -0800 (PST)
Fredrik Ekholdt <fre...@gmail.com> wrote:

>
> Another question that comes up is how adept should download the actual jars
> (or any other files that are not in the metadata).
> I am therefore assuming that only the metadata is downloaded for offline
> use, not the actual jars. The reason I think this is fair to assume is that
> downloading all jars would take too long and would be too error-prone.

Agreed.

> The key attributes for me would be that the files are always there to
> download (e.g. decentralized servers) and that you can download them
> quickly.
>
> Torrents have both of these properties,

Can you explain what this would look like for someone who wants to publish a project?

> so I was thinking that the metadata
> could contain torrent files that point to the different jars. One torrent
> would cover all artifacts in a dependency (jar, sources, javadoc, ...) - with
> torrents you can always choose to download only some of the files inside a torrent.
>
> Any central repository would then just act like a torrent tracker and would
> seed the files it has. It would be easy to set up more seeders: just grab
> the torrents, wait till they are downloaded, and seed.
> For the case where the central repository is down or experiencing heavy
> traffic, this scheme would be perfect.
>
> What remains to be seen, though, is whether a torrent really is fast enough
> (I have the impression it takes a bit of time to start one up) and how
> large a torrent file is. From a random sample it seems to be around
> 10-20 kB, but that is for bigger files (~700 MB), not jars.
>
> What do you guys think?

I'm not sure about the torrent information being in the metadata. Having only hashes cleanly decouples dependency resolution from retrieval. Resolution gives you a list of hashes and retrieval takes the hashes and downloads the corresponding files. I would personally like to see experiments and I think that independent components facilitate experiments.
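
To sketch what I mean (the names are illustrative, not a design):

    // Resolution knows nothing about retrieval; a hash is the only shared currency.
    case class Hash(value: String)

    trait Resolver {
      // dependency resolution: from requested modules to the artifact hashes needed
      def resolve(requested: Seq[String]): Seq[Hash]
    }

    trait Retriever {
      // retrieval: from a hash to a local file, by whatever transport
      // (HTTP mirror, torrent, local cache, ...), easy to swap for experiments
      def fetch(hash: Hash): java.io.File
    }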

Also, the dominant factor in torrent file size seems to be the number of pieces. Each piece gets a 20-byte (160-bit) SHA-1 hash in the torrent file, so a 10 kB torrent file corresponds to roughly 500 pieces; for the 700 MB file, that is about a 1.4 MB piece size. Even 1 kB per module just for the download information seems like a lot for a batch update approach.
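
Back-of-the-envelope, assuming the piece hashes dominate the torrent file size:

    val pieceHashBytes = 20L                // SHA-1 is 160 bits = 20 bytes
    val torrentSize    = 10L * 1024         // the ~10 kB torrent file from your sample
    val fileSize       = 700L * 1024 * 1024 // the ~700 MB file it described
    val pieces         = torrentSize / pieceHashBytes // ~512 pieces
    val pieceSize      = fileSize / pieces            // ~1.4 MB per piece
    println(s"$pieces pieces of roughly ${pieceSize / 1024} kB each")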

-Mark

Fredrik Ekholdt

Feb 22, 2013, 8:12:20 AM
to adep...@googlegroups.com


On Friday, 22 February 2013 01:46:46 UTC+1, Mark Harrah wrote:
On Thu, 21 Feb 2013 09:23:50 -0800 (PST)
Fredrik Ekholdt <fre...@gmail.com> wrote:

>
> Another question that comes up is how adept should download the actual jars
> (or any other files that are not in the metadata).
> I am therefore assuming that only the metadata is downloaded for offline
> use, not the actual jars. The reason I think this is fair to assume is that
> downloading all jars would take too long and would be too error-prone.

Agreed.

> The key attributes for me would be that the files are always there to
> download (e.g. decentralized servers) and that you can download them
> quickly.
>
> Torrents have both of these properties,

Can you explain what this would look like for someone who wants to publish a project?
When you publish a project to a repository, adept would push the metadata and the torrent file generated with the tracker you selected (e.g. the repository you are publishing to, or any other), and seed the torrent until it has been shared with the central repository.
When somebody pulls the repository, they download the torrent file. If they decide to download a jar, they use the torrent to do so.
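
In code the publishing side could look something like this (every name below is hypothetical, just how I picture the flow):

    import java.io.File

    case class Torrent(infoHash: String, announceUrl: String)

    // stubs standing in for real torrent machinery
    def makeTorrent(artifacts: Seq[File], tracker: String): Torrent =
      Torrent(infoHash = "0123456789abcdef0123456789abcdef01234567", announceUrl = tracker)
    def pushMetadataAndTorrent(torrent: Torrent): Unit = ()
    def seedUntilShared(torrent: Torrent): Unit = ()

    def publish(artifacts: Seq[File]): Unit = {
      // generate the torrent against the tracker you selected
      val torrent = makeTorrent(artifacts, tracker = "http://repo.example.org/announce")
      pushMetadataAndTorrent(torrent) // metadata and torrent file go to the repository
      seedUntilShared(torrent)        // seed until the central repository has the bits
    }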


> so I was thinking that the metadata
> could contain torrent files that point to the different jars. One torrent
> would cover all artifacts in a dependency (jar, sources, javadoc, ...) - with
> torrents you can always choose to download only some of the files inside a torrent.
>
> Any central repository would then just act like a torrent tracker and would
> seed the files it has. It would be easy to set up more seeders: just grab
> the torrents, wait till they are downloaded, and seed.
> For the case where the central repository is down or experiencing heavy
> traffic, this scheme would be perfect.
>
> What remains to be seen, though, is whether a torrent really is fast enough
> (I have the impression it takes a bit of time to start one up) and how
> large a torrent file is. From a random sample it seems to be around
> 10-20 kB, but that is for bigger files (~700 MB), not jars.
>
> What do you guys think?

I'm not sure about the torrent information being in the metadata.  Having only hashes cleanly decouples dependency resolution from retrieval.  Resolution gives you a list of hashes and retrieval takes the hashes and downloads the corresponding files.  I would personally like to see experiments and I think that independent components facilitate experiments.
I agree that the metadata and the actual files should be cleanly decoupled; the way I was thinking about this was to have the locations of the files (let us say URLs) in the metadata. A torrent file can be viewed as a description of where to download a file. I fully agree on the need for prototyping this, though. As I was saying, I am not sure whether this will scale well for an offline repository.


Also, the dominant factor in torrent file size seems to be the number of pieces. Each piece gets a 20-byte (160-bit) SHA-1 hash in the torrent file, so a 10 kB torrent file corresponds to roughly 500 pieces; for the 700 MB file, that is about a 1.4 MB piece size. Even 1 kB per module just for the download information seems like a lot for a batch update approach.
Yeah, that is what I am worried about. Whether this approach would work depends on the size of the torrent files and how many of them you are likely to have. Do you know how many files there are in Maven Central, for example?

Josh Suereth

Feb 22, 2013, 8:29:13 AM
to Fredrik Ekholdt, adep...@googlegroups.com
The downside to torrents is all the firewalls and things that auto-block them. It's kind of like WebSockets in the sense that you need a degradation path if the torrent is unable to download....
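
Something like this, say (both fetch functions below are hypothetical stand-ins):

    import java.io.File
    import scala.util.Try

    def fetchViaTorrent(hash: String): Try[File] =
      Try(sys.error("blocked by firewall")) // stub: pretend the torrent is blocked
    def fetchViaHttp(hash: String): Try[File] =
      Try(new File(s"/tmp/$hash"))          // stub: plain HTTP fallback

    // degradation path: try the torrent first, fall back to plain HTTP
    def fetch(hash: String): Try[File] =
      fetchViaTorrent(hash).orElse(fetchViaHttp(hash))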

Just something to think about.

Fredrik Ekholdt

Feb 22, 2013, 8:42:55 AM
to Josh Suereth, adep...@googlegroups.com

On Feb 22, 2013, at 2:29 PM, Josh Suereth wrote:

The downside to torrents is all the firewalls and things that auto-block them. It's kind of like WebSockets in the sense that you need a degradation path if the torrent is unable to download….
Yes, that is a good point. I guess choosing a different port in adept might not be enough...

Mark Harrah

Feb 22, 2013, 6:22:35 PM
to adep...@googlegroups.com
On Fri, 22 Feb 2013 05:12:20 -0800 (PST)
Fredrik Ekholdt <fre...@gmail.com> wrote:

>
>
> On Friday, 22 February 2013 01:46:46 UTC+1, Mark Harrah wrote:
> >
> > On Thu, 21 Feb 2013 09:23:50 -0800 (PST)
> > Fredrik Ekholdt <fre...@gmail.com> wrote:
> >
> > >
> > > Another question that comes up is how adept should download the actual
> > jars
> > > (or any other files that are not in the metadata).
> > > I am therefore assuming that only the metadata is downloaded for offline
> > > use, not the actual jars. The reason I think this is fair to assume is
> > that
> > > downloading all jars would take too long and would be too error-prone.
> >
> > Agreed.
> >
> > > The key attributes for me would be that the files are always there to
> > > download (e.g. decentralized servers) and that you can download them
> > > quickly.
> > >
> > > Torrents have both of these properties,
> >
> > Can you explain what this would look like for someone who wants to publish
> > a project?
> >
> When you publish a project to a repository, adept would push the metadata
> and the torrent file generated with the tracker you selected (e.g. the
> repository you are publishing to, or any other), and seed the torrent
> until it has been shared with the central repository.
> When somebody pulls the repository, they download the torrent file. If
> they decide to download a jar, they use the torrent to do so.

So you still need a central host to store all artifacts and be the seed? You also need people to leave a running adept instance to seed others, right?
A problem is that the torrent file might change because the tracker location might change. So now there are changing files located outside of the versioned metadata. Perhaps that is what you meant by not scaling for an offline repository.

> > Also, the dominant factor in torrent file size seems to be the number of
> > pieces. Each piece gets a 20-byte (160-bit) SHA-1 hash in the torrent
> > file, so a 10 kB torrent file corresponds to roughly 500 pieces; for the
> > 700 MB file, that is about a 1.4 MB piece size. Even 1 kB per module just
> > for the download information seems like a lot for a batch update approach.
> >
> Yeah, that is what I am worried about. Whether this approach would work
> depends on the size of the torrent files and how many of them you are
> likely to have. Do you know how many files there are in Maven Central, for
> example?

I think there are something like 50k modules. At NEScala, I think someone mentioned that another community (non-Java) had published that many in much less time.

-Mark

nafg

Mar 19, 2013, 5:33:40 PM
to adep...@googlegroups.com
I think you may have missed the point. Mark wants to *de*couple them; that means the metadata should *not* mention the location of the file. Just as an illustration: if the two are decoupled, you could have 10 servers that each have 80% of the files and adept would still be able to find any file.
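
As a sketch (the names are hypothetical): with only a hash in the metadata, the client can simply try every server it knows about until one of them has the file:

    import java.io.File
    import scala.util.{Success, Try}

    def download(url: String): Try[File] = Try(new File("/tmp/artifact")) // stub

    // any server that happens to hold the artifact can serve it;
    // no single server needs to have all the files
    def fetch(hash: String, servers: Seq[String]): Option[File] =
      servers.view
        .map(server => download(s"$server/$hash"))
        .collectFirst { case Success(f) => f }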

Fredrik Ekholdt

Mar 20, 2013, 4:47:18 AM
to adep...@googlegroups.com
Hey! Thanks for the comment!

I guess what I wanted to figure out in this thread is how to actually download the files, and I suggested torrents as a possibility. Torrents would indeed enable a very flexible hosting solution where you can have 10 servers that each have 80% of the files. You could also just add seeds for files that are popular, etc.
 
Then the question that immediately arises is how the metadata will reflect the chosen approach. I mean: you have to get the download location from somewhere if you do not want the user to have to specify it, which I think would be a very bad idea.

The way I am thinking about this now is that the module information/metadata has a hash that is unique to it (hashed from org, version, name and artifact hash) and a hash of the artifact it represents. This metadata is downloaded locally and versioned. In addition to the metadata you get the artifacts, which are linked to the metadata only by the hash of the artifact. So you have two different indexes/databases/sets of files: one for metadata and one for artifacts.
A repository is something that has metadata and possibly also the artifact locations (and even the artifacts themselves).
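
As data structures, roughly (a sketch of how I picture it, nothing is implemented):

    import java.security.MessageDigest

    def sha1(s: String): String =
      MessageDigest.getInstance("SHA-1")
        .digest(s.getBytes("UTF-8"))
        .map("%02x".format(_)).mkString

    case class Module(org: String, name: String, version: String, artifactHash: String) {
      // the unique id, hashed from org, name, version and the artifact hash
      val uniqueHash: String = sha1(s"$org:$name:$version:$artifactHash")
    }

    // two separate stores, linked only by artifactHash:
    type MetadataIndex = Map[String, Module]      // uniqueHash -> module metadata
    type LocationIndex = Map[String, Seq[String]] // artifactHash -> locations (urls, torrents, ...)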

Locally on your HD/SSD, you would have several repositories, which are completely unrelated to each other. When querying for the locations of an artifact, though, you would get all the locations from all the repositories. The reason for this is that you would have more possibilities to actually download the artifact.
That means that if you want the dependencies of junit:junit:4.7, for example, adept would figure out which repositories it has junit in. It quickly finds all the transitive deps it needs (because this is offline and only requires a query against your local index) and lists the artifact hashes it needs. Then you can ask for the locations of these hashes. A location could be a URL, a torrent, or a link to a torrent. I am not sure what is best; that is up to some experimentation I will (hopefully) have the time to do one of these days.
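
For the junit example, the flow could look something like this (a standalone, simplified sketch):

    // stand-in for a module entry in the local, versioned metadata index
    case class Dep(id: String, artifactHash: String, dependsOn: Seq[String])

    // offline: walk the local metadata index to collect all transitive deps
    def transitive(id: String, index: Map[String, Dep]): Seq[Dep] =
      index.get(id).toSeq.flatMap(d => d +: d.dependsOn.flatMap(transitive(_, index)))

    // then ask all local repositories for the locations of the needed hashes;
    // a location could be a url, a torrent, or a link to a torrent
    def locations(deps: Seq[Dep], locIndex: Map[String, Seq[String]]): Seq[String] =
      deps.flatMap(d => locIndex.getOrElse(d.artifactHash, Nil)).distinct

    val index  = Map("junit:junit:4.7" -> Dep("junit:junit:4.7", "cafebabe", Nil))
    val hashes = transitive("junit:junit:4.7", index).map(_.artifactHash)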

If you are wondering: it goes with the story that if there are multiple different versions of the same dependency in your repositories, adept would fail with an error telling you that you have more than one of the same, and ask you to specify which one you mean by giving the unique id of the dependency you actually want.

Phew, a lot of rambling there - not sure if this makes sense to you? If so, is it decoupled enough? If not, it would be cool to hear what you had in mind :)