git based uber repo: act one


Doug Tangren

Feb 24, 2013, 2:10:36 PM
to adep...@googlegroups.com
I've pushed a conscript-driven sketch of adept to https://github.com/softprops/adept which covers most of the metadata needs, I think. After thinking about the uber repo approach, I'm becoming more of a fan. Here's a rough explanation of my sketch.

As a refresher, much of the metadata repository handling is covered by the DVCS, in this case git, which simplifies a lot in terms of storage, updating, syncing, etc. (yay version control)

There's beauty in simplicity.

A metadata repository is just a normal git repo with some expectations about file layout.
It's expected that there is a metadata dir at the top level of the repo. Under metadata you should find a list of dirs named after organizationIds; for each organizationId, a list of module names; and for each module dir, a version dir containing a module.json file with information specific to that version of the module:

/metadata
  /com.mycompany
    /bippy
      /0.1.0
        module.json
      /0.1.1
        module.json

  /com.yourcompany
    /boopy
      /0.1.0
        module.json

This structure makes listing modules by org id and/or name simple and fast (I don't need to add a database here).

I don't actually need to read files until I want detailed info about a version, in which case I'd read the module.json for that version, which includes extensible metadata about it.
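
For illustration, a module.json for bippy 0.1.1 might look something like the following. The exact fields are still open; treat the names below (dependencies, artifacts, hashes, locations) as placeholders rather than a settled schema:

{
  "organization": "com.mycompany",
  "name": "bippy",
  "version": "0.1.1",
  "dependencies": [
    { "organization": "com.yourcompany", "name": "boopy", "version": "0.1.0" }
  ],
  "artifacts": [
    { "hash": "<sha1-of-jar>", "locations": ["http://repo.example.com/bippy-0.1.1.jar"] }
  ],
  "attributes": { "scala-version": "2.10" }
}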


adept simply manages a list of metadata repos with a few high-level commands for updating, locking, adding, and removing repos, delegating most of the work to git.
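
To give a feel for how thin that layer is, here's a rough sketch (not the actual code in my repo; the local clone path is made up) of how those commands could just shell out to git:

import scala.sys.process._
import java.io.File

// rough sketch only: each high-level adept command is little more than a git invocation
object MetadataRepos {
  // made-up location for the local clones of metadata repos
  private val root = new File(sys.props("user.home"), ".adept/repos")

  // adept add <name> <uri>  ~>  git clone
  def add(name: String, uri: String): Int =
    Process(Seq("git", "clone", uri, name), root).!

  // adept update <name>  ~>  git pull
  def update(name: String): Int =
    Process(Seq("git", "pull", "--ff-only"), new File(root, name)).!

  // adept lock <name>  ~>  record the current commit so resolution is reproducible
  def lock(name: String): String =
    Process(Seq("git", "rev-parse", "HEAD"), new File(root, name)).!!.trim
}

(Removing a repo is just deleting the clone.)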


The file structure of the repos makes it easy to quickly get a listing of modules or to get information about a specific module version and its dependency graph.
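
Something along these lines (purely illustrative) is all that's needed to enumerate what a repo contains:

import java.io.File

object Listing {
  // the layout is metadata/<org>/<module>/<version>/module.json
  case class ModuleRef(org: String, name: String, version: String)

  private def children(dir: File): Seq[File] = Option(dir.listFiles).toSeq.flatten

  def listModules(repo: File): Seq[ModuleRef] =
    for {
      org     <- children(new File(repo, "metadata")) if org.isDirectory
      module  <- children(org)                        if module.isDirectory
      version <- children(module)                     if new File(version, "module.json").isFile
    } yield ModuleRef(org.getName, module.getName, version.getName)
}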


A proposal for managing metadata from the library author's side: adept would expose a command that updates a given metadata repo locally with JSON describing the attributes of the version, then issues a pull request remotely (through GitHub). GitHub makes this stupid easy to manage. There's a control queue for privileged users, but there's also a programmatic API exposed that could read the pull request and verify the data (possibly tying in PGP auth here for validating the authenticity of the request), including artifact resolution, before merging. This could be automated, which is a plus. No waiting on authorities and procedural steps to approve your new version!
Once this gets merged into the remote repo, others can just use git to update their local repos. Simple!
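
To make the "verify before merging" part concrete, the automated check could be as simple as re-hashing the artifact a pull request claims to describe and comparing it to the declared hash (again assuming the placeholder hash/location fields sketched above; the PGP part is left out here):

import java.io.File
import java.nio.file.Files
import java.security.MessageDigest

object Verify {
  // sha-1 of a local file, hex encoded (fine for a sketch; stream it for big jars)
  def sha1(file: File): String =
    MessageDigest.getInstance("SHA-1")
      .digest(Files.readAllBytes(file.toPath))
      .map("%02x".format(_))
      .mkString

  // only auto-merge the pull request if the declared hash matches the resolved artifact
  def artifactOk(resolvedJar: File, declaredHash: String): Boolean =
    sha1(resolvedJar) == declaredHash
}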

notes: 

modules are expected (for good reason) to define semantic versions (http://semver.org/), covered by https://github.com/softprops/semverfi. This makes it easy to define a well-structured sorting mechanism for versions.
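
(Not semverfi's actual API, just the gist of why structured versions give you a well-defined ordering; a hand-rolled sketch that ignores pre-release and build metadata:)

object SemVerSketch {
  // hand-rolled illustration only; semverfi covers the full semver spec
  case class SemVer(major: Int, minor: Int, patch: Int)

  def parse(s: String): Option[SemVer] = s.split('.') match {
    case Array(ma, mi, pa) => scala.util.Try(SemVer(ma.toInt, mi.toInt, pa.toInt)).toOption
    case _                 => None
  }

  implicit val ordering: Ordering[SemVer] = Ordering.by(v => (v.major, v.minor, v.patch))

  // e.g. picking the latest published version of a module:
  // Seq("0.1.0", "0.1.1").flatMap(parse).max == SemVer(0, 1, 1)
}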

-Doug Tangren
http://lessis.me

Fredrik Ekholdt

Feb 24, 2013, 6:52:14 PM
to adep...@googlegroups.com
Hey Doug!

Just thought I'd share some thoughts related to using git vs something else. 

In my repo https://github.com/freekh/adept I started on a DB approach. 
Naturally, when seeing this I began thinking about reasons why git-based is not the right choice :) - don't worry though, the conclusion is friendly ;)

So I was considering git in the beginning when Mark mentioned it, but I moved away from it because I started worrying about how to do data consistency (reading/writing based on predefined schemas, cleaning up resources with foreign keys, …), advanced queries, testing harnesses, etc., which are things that are easy enough with a DB (or that I know how to do well with a DB - you can choose ;).
Finally, I managed to convince myself to go for a DB because I was worried the file approach would be slow, in particular because I also want to have adept as a command-line tool in addition to a lib and an sbt plugin.

Looking at what you've got, I am not 100% sure that going for a DB is right though.
Git is perfect for handling the pulling and pushing, which would have to be custom built with a DB.
When it comes to testing, it is probably not that hard to just build it up for adept. (Data) consistency might be a bit tougher without a DB, but it is not something that comes for free with a DB either.
I was also thinking that there were some complicated ways I needed to look up dependencies, but in fact it is only searching for modules that really needs a query language. Each time you know what you want, you have the org, name, and version, which is all you need.
On speed I think your point is good: you always have the org, name, and version as mentioned, so it is just a matter of looking up the files in their respective dirs and parsing them. Looking up in a DB might be faster (or not), but even so, I do not think either approach really takes a lot of time.

Conclusion: the concept is really cool! Looking forward to seeing more of it!
I want to see how pushing and pulling can work with a DB because in my mind it seems easy given the constraints I have, but if it turns out to be hard, git sure sounds like the way to go - it does sound much cooler that it's built on git as well :)


Cheers,
F

Josh Suereth

Feb 24, 2013, 10:16:09 PM
to adept-dev

Fred -

Think of the db more as flushable optimisation bits. When updating the metadata, as a first pass, we can just recreate the db from scratch. A real solution would incrementally update using git-diff, but I think your fears of db maintenance are valid. Git doesn't alleviate them, but it gives us the escape hatch of just dropping the db entirely and rebuilding if we hit a bad state...

Mark Harrah

Mar 2, 2013, 7:51:22 PM
to adep...@googlegroups.com
Unfortunately, I don't think semantic versioning is a reasonable expectation beyond prototyping. At a minimum, Scala doesn't follow semver. sbt has been 0.x for years, which doesn't make any binary compatibility claims. It also doesn't encode binary vs. source or forward vs. reverse compatibility, which are important aspects of compatibility currently being considered in the Scala world.

These are related to cross versioning as well. In my proposal, I suggested making version strings have no internal structure. They are just strings. If you want structure, you put it in the metadata. For example,

version: {
  source: 2.10
  binary: 2.10
  unique: 2.10.1-20130302-194315-3a4f34
  common: 2.10.1-SNAPSHOT
}

Then, a consuming module specifies the constraints it wants. There are details to work out and I don't know if this is viable, but this is my current thinking. Something this affects is the directory layout. You need to hash the version structure and use the hash in the path for uniqueness.
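
Roughly what I mean by hashing the structure (none of this is settled, just to illustrate):

import java.security.MessageDigest

object VersionHash {
  // sketch: a stable hash over the version attributes becomes the directory name,
  // e.g. metadata/org.example/demo/<hash>/module.json
  def apply(attrs: Map[String, String]): String = {
    val canonical = attrs.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString("\n")
    MessageDigest.getInstance("SHA-1")
      .digest(canonical.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
  }
}

// VersionHash(Map("binary" -> "2.10", "unique" -> "2.10.1-20130302-194315-3a4f34"))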

-Mark

> -Doug Tangren
> http://lessis.me

nafg

Mar 19, 2013, 6:05:26 PM
to adep...@googlegroups.com
You don't necessarily know the version number if an artifact depends on a range etc.

Evan Chan

Mar 21, 2013, 3:40:28 AM
to adep...@googlegroups.com
Hi Fredrik,

I think what a DB gives you for free, that a file-based / git approach doesn't, is indexing (you can build an index yourself, but the cost is maintaining consistency).

In mature dependency systems (dpkg/apt, RPM, etc.), either the user or, more commonly, the dependency engine is always asking these kinds of queries:
- who depends on this current version of a package
- what dependencies satisfy a given query (imagine if you had API dependencies, like SLF4J-API or scala-json, that multiple packages could satisfy)
- what newer versions of a dependency there are, and whether they satisfy all your existing requirements

Excellent indexes are crucial to good dependency resolution performance.  The above queries require lots of graph traversals.
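
For example, the first query above basically needs a reverse-dependency index, which is exactly the kind of thing you have to build (and keep consistent) yourself with the file/git approach. A sketch of the idea, assuming the module.json files are already parsed:

object DependencyIndex {
  case class Id(org: String, name: String, version: String)
  case class Module(id: Id, dependencies: Seq[Id])

  // "who depends on this version?" becomes a single map lookup once the index exists
  def reverse(modules: Seq[Module]): Map[Id, Seq[Id]] =
    modules
      .flatMap(m => m.dependencies.map(_ -> m.id))
      .groupBy(_._1)
      .map { case (dep, edges) => dep -> edges.map(_._2) }
}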

I suppose one core question that should be answered is: how flexible do we want adept's dependency support to be?  I would think at a minimum:
- support for version ranges.  For example, "0.8.2+" but not "0.9".

Indexes can always be rebuilt after a git metadata repo update, and if one uses git diffs, then each index update might not even have to be that expensive.
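
A sketch of that incremental piece, assuming the layout Doug described (metadata/<org>/<name>/<version>/module.json):

import scala.sys.process._
import java.io.File

object IncrementalIndex {
  // ask git which metadata files changed between two commits and re-index only those
  def changedModuleFiles(repo: File, oldRev: String, newRev: String): Seq[String] =
    Process(Seq("git", "diff", "--name-only", oldRev, newRev, "--", "metadata"), repo)
      .!!
      .split("\n")
      .toSeq
      .filter(_.endsWith("module.json"))

  // each changed path identifies exactly one (org, name, version) to refresh in the index
  def refOf(path: String): (String, String, String) = path.split("/") match {
    case Array("metadata", org, name, version, "module.json") => (org, name, version)
    case _ => sys.error(s"unexpected path: $path")
  }
}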

-Evan

Mark Harrah

Mar 22, 2013, 5:31:53 PM
to adep...@googlegroups.com
On Thu, 21 Mar 2013 00:40:28 -0700 (PDT)
Evan Chan <vel...@gmail.com> wrote:

> Hi Fredrik,
>
> I think what a DB gives you for free, that a file based / git approach
> doesn't, is indexing (although you can build an index yourself, but the
> cost is maintaining consistency).
>
> In mature dependency systems (dpkg/apt, RPM, etc.), either the user, or
> more commonly, the dependency engine, is always asking these kind of
> queries:
> - who depends on this current version of a package
> - what dependencies satisfies a given query (imagine if you had API
> dependencies, like SLF4J-API, or scala-json, that multiple packages could
> satisfy)
> - what newer versions of a dependency there are, and if it satisfies all
> your existing requirements
>
> Excellent indexes are crucial to good dependency resolution performance.
> The above queries requires lots of graph traversals.

I agree. The original proposal as I understood it was for a database to be the native representation instead of an implementation detail. I believe Fredrik is still assuming a database would be populated from the git repository, but I'll let him clarify that.

> I suppose one core question that should be answered is, how flexible of
> dependencies do we want adept to support? I would think at a minimum:
> - support for version ranges. For example, "0.8.2+" but not "0.9".

This is a good question and one that was discussed rather extensively in the follow up session the day after the talk. Unfortunately it wasn't recorded and there wasn't a solid conclusion anyway!

What I want to at least consider, even if it doesn't ultimately work out, is whether it makes sense for versions to have internal structure at all. I think we really want compatibility information encoded instead (but who knows, maybe not). This would be a producer-side model (the library maintainer explicitly indicates compatibility) instead of consumer-side (the library user infers compatibility or otherwise acquires it from other channels). I think I mentioned it in the talk, but I don't remember how much I elaborated. The example is here:

https://github.com/sbt/adept/wiki/NEScala-Proposal#metadata-example-from-the-slides

and some more here:

https://github.com/sbt/adept/wiki/NEScala-Proposal#version-attributes

The idea is you would depend on a library like:

org=org.example, name=demo, binaryVersion=2.11

instead of cramming all versioning information into a single version. I also like it because it is opt-in and potentially flexible enough to handle the actual compatibility problems in Scala right now.
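
To make that concrete, resolution then becomes matching requested attributes against the version structure rather than comparing version strings; very roughly:

object AttributeMatch {
  // rough sketch: a version is a bag of attributes and a dependency is a set of
  // constraints over those attributes, e.g. Map("binary" -> "2.11")
  type Attrs       = Map[String, String]
  type Constraints = Map[String, String]

  def satisfies(version: Attrs, wanted: Constraints): Boolean =
    wanted.forall { case (key, value) => version.get(key).contains(value) }
}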

Alternatives are of course semver.org or other traditional versions with internal structure. I am interested to hear pros and cons of various things that have been tried.

> Indexes can always be rebuilt after a git metadata repo update, and if one
> uses git diffs, then each index update might not even have to be that
> expensive.

Yep, hopefully.

-Mark

Evan Chan

Mar 23, 2013, 8:48:14 PM
to adep...@googlegroups.com, adep...@googlegroups.com
Mark et al,

 

>> Excellent indexes are crucial to good dependency resolution performance.
>> The above queries requires lots of graph traversals.
>
> I agree.  The original proposal as I understood it was for a database to be the native representation instead of an implementation detail.  I believe Fredrik is still assuming a database would be populated from the git repository, but I'll let him clarify that.

OTOH if you actually check in the database then you negate many of the advantages of git in the first place. But if you generate a DB from git, that might be too slow. The Gremlin folks implemented a graph DB prototype on Git, so this may work:

https://groups.google.com/forum/m/#!msg/gremlin-users/trBKdAW105U/qBfckF0bOiIJ

Perhaps our requirements are just for simple maven deps (which would make me sad), and this would reduce indexing requirements. 



>> I suppose one core question that should be answered is, how flexible of
>> dependencies do we want adept to support?  I would think at a minimum:
>> - support for version ranges.  For example, "0.8.2+" but not "0.9".
>
> This is a good question and one that was discussed rather extensively in the follow up session the day after the talk.  Unfortunately it wasn't recorded and there wasn't a solid conclusion anyway!
>
> What I want to at least consider, even if it doesn't ultimately work out, is whether it makes sense for versions to have internal structure at all.  I think we really want compatibility information encoded instead (but who knows, maybe not).  This would be a producer-side model (the library maintainer explicitly indicates compatibility) instead of consumer-side

Yes, I totally agree the producer side should provide enough metadata to help with compatibility.
The question is, do enough people care to make this a reality?

A good goal would be to get much of the Scala community to hop on board.  I'm afraid too many folks would say, so what's wrong with Maven?

To take a step back, what would motivate people to switch? (And why didn't Ivy repos succeed?) 
I would hope:
- ease of publishing (compared to Sonatype)
- more sane predictable dep resolution 
- more conflicts resolved at dep resolution instead of runtime

-Evan

> (the library user infers compatibility or otherwise acquires it from
> other channels).  I think I mentioned it in the talk, but I don't remember how much I elaborated.  The example is here:
>
> https://github.com/sbt/adept/wiki/NEScala-Proposal#metadata-example-from-the-slides
>
> and some more here:
>
> https://github.com/sbt/adept/wiki/NEScala-Proposal#version-attributes
>
> The idea is you would depend on a library like:
>
> org=org.example, name=demo, binaryVersion=2.11


Is the org field an absolute necessity?

Mark Harrah

Mar 25, 2013, 9:32:56 AM
to adep...@googlegroups.com
On Sat, 23 Mar 2013 17:48:14 -0700
Evan Chan <vel...@gmail.com> wrote:

> Mark et al,
>
> >>
> >>
> >> Excellent indexes are crucial to good dependency resolution performance.
> >> The above queries requires lots of graph traversals.
> >
> > I agree. The original proposal as I understood it was for a database to be the native representation instead of an implementation detail. I believe Fredrik is still assuming a database would be populated from the git repository, but I'll let him clarify that.
>
> OTOH if you actually check in the database then you negate many of the advantages of git in the first place. But if you generate a DB from git that might be too slow. The Gremlin folks implemented a graph DB prototype on Git, so this may work:
>
> https://groups.google.com/forum/m/#!msg/gremlin-users/trBKdAW105U/qBfckF0bOiIJ

Yes, that is a possibility. One thing we'd need to see is whether it would actually be readable by users. For example, if the encoding of metadata into a text file is ultimately one or two steps removed from what the user writes, they wouldn't understand the contents or diffs.

By this I mean, if dependencies have to be encoded as a string ID, we'd probably have to have a hash and then the user wouldn't be able to quickly read off the dependencies of a module.

> Perhaps our requirements are just for simple maven deps (which would make me sad), and this would reduce indexing requirements.

If you mean limited to what Maven metadata can describe, I doubt they would be sufficient. Those limitations are half of the reason I bothered to give the talk.

> >> I suppose one core question that should be answered is, how flexible of
> >> dependencies do we want adept to support? I would think at a minimum:
> >> - support for version ranges. For example, "0.8.2+" but not "0.9".
> >
> > This is a good question and one that was discussed rather extensively in the follow up session the day after the talk. Unfortunately it wasn't recorded and there wasn't a solid conclusion anyway!
> >
> > What I want to at least consider, even if it doesn't ultimately work out, is whether it makes sense for versions to have internal structure at all. I think we really want compatibility information encoded instead (but who knows, maybe not). This would be a producer-side model (the library maintainer explicitly indicates compatibility) instead of consumer-side
>
> Yes, I totally agree producer side should provide enough metadata to help compatibility.
> The question is, do enough people care to make this work/a reality?

For me, the emphasis should be on a metadata format that facilitates reliable automation. Then, you just make it work for users. As an example, the build tool can automatically manage compatibility if it has places to put the information. Right now, the cross-versioning hack is necessary because the underlying system doesn't have a place to put that information in a way that works properly.

> A good goal would be to get much of the Scala community to hop on board. I'm afraid too many folks would say, so what's wrong with Maven?
> To take a step back, what would motivate people to switch? (And why didn't Ivy repos succeed?)

I expect involving Ivy is complicated because it wasn't just about dependency management, but also the build tool and the types of users using those build tools. I don't know the history so these are guesses. It had a fairly high barrier to entry for small projects and there wasn't a large central repository. Maven bundled dependency resolution with a build tool that took advantage of it and was based on convention that included a central repository so there wasn't much initial work to configure it.

> I would hope:
> - ease of publishing (compared to Sonatype)
> - more sane predictable dep resolution
> - more conflicts resolved at dep resolution instead of runtime

Yes, I think these are the minimum, but also a proper user experience, with things like searching for dependencies locally instead of via a website.

-Mark