JGit DFS backend - has anyone tried to implement Cassandra?


Luca Milanesio

Jan 7, 2016, 7:44:47 AM
to repo-discuss, JGit Developers list
Hi JGit and Gerrit devs :-)

I was looking for a more distributed, fault-tolerant and scalable backend for Git objects ... and I was thinking about using Apache Cassandra :-)
Has anyone tried the approach? Succeeded? Failed?

I've found some references from Shawn in a post a few years ago at [1] but have no idea if there was any further action on that.

Thank you in advance for the feedback.

Luca.

Alex Blewitt

Jan 7, 2016, 8:41:51 AM
to Luca Milanesio, repo-discuss, JGit Developers list
These things are possible, though there hasn't been an open-sourced version of that approach. The implementation requires DfsRefDatabase to be subclassed and supplied as part of (a subclass of) DfsRepository. The ref database can then reach out to something like Cassandra to store individual ref updates. The abstract methods are quite easy to implement: scanAllRefs needs to return the full list of refs, and updates are controlled with compareAndPut and compareAndRemove.
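
As a rough sketch of what the Cassandra side of those two hooks could look like (illustrative only: the refs table and the CassandraRefStore wrapper are invented here, the DataStax Java driver is just one possible client, and wiring this into an actual DfsRefDatabase subclass is left out), using lightweight transactions so the compare step happens on the cluster:

    import com.datastax.driver.core.Session;

    // Compare-and-swap style ref updates backed by a hypothetical "refs" table.
    class CassandraRefStore {
      private final Session session;
      private final String repo;

      CassandraRefStore(Session session, String repo) {
        this.session = session;
        this.repo = repo;
      }

      // Create or update a ref atomically; false means another writer won the race.
      boolean compareAndPut(String refName, String expectedOldSha1, String newSha1) {
        if (expectedOldSha1 == null) {
          return session.execute(
              "INSERT INTO refs (repo, name, target) VALUES (?, ?, ?) IF NOT EXISTS",
              repo, refName, newSha1).wasApplied();
        }
        return session.execute(
            "UPDATE refs SET target = ? WHERE repo = ? AND name = ? IF target = ?",
            newSha1, repo, refName, expectedOldSha1).wasApplied();
      }

      // Delete a ref only if it still points at the commit the caller saw.
      boolean compareAndRemove(String refName, String expectedOldSha1) {
        return session.execute(
            "DELETE FROM refs WHERE repo = ? AND name = ? IF target = ?",
            repo, refName, expectedOldSha1).wasApplied();
      }
    }

Conditional (IF ...) writes go through Cassandra's Paxos-based lightweight transactions, so they cost more than plain writes, but they give the compare-and-swap semantics a ref update needs.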

The DfsObjDatabase is the one that needs more work - everything is treated as a list of .pack files. A pack file is the same as in the ordinary Git implementation but has additional metadata. You really need the pack files to be synchronised with the refs, so that if a ref is in the ref database then the pack is available everywhere. The internal implementation of the library does most of the work for you, but if e.g. Cassandra is shared across many machines while the packs are written to (e.g.) NFS, then an immediate read of an updated ref might not find the pack on a remote NFS mount.

The approach that DfsGarbageCollector takes is to compress everything down to a single* pack file (or used to, anyway). I'm not sure whether Cassandra would make sense for pack files - my gut instinct would be "no" unless they are small repositories. 

As part of the implementation I built in a retry for when a pack wasn't immediately available, and because the DfsReader is random access while the DfsOutputStream is one-shot, I ended up caching incoming content to a spool file and then calculating checksums to upload to the remote server (due to back-end limitations).

So these things are possible, but implementations outside the InMemoryRepository aren't open source. You can use that as a template, and I'd be happy to answer questions where I can.

Alex

Sent from my iPhat 6

Luca Milanesio

Jan 7, 2016, 8:56:57 AM
to Alex Blewitt, repo-discuss, JGit Developers list
Hi Alex,
thank you for your quick reply: as "you know someone" who did a real JGit DFS implementation ... and you believe it is possible ... we gain more confidence in starting this work.
Should you have spare time to support us with answers or code, it would be really appreciated :-)

Cassandra is gaining momentum for its ability to be scalable, very fast on reads, and distributed over single or multiple geographical zones, which would make it a perfect candidate for Gerrit.
We may have a Cassandra expert helping us with this work ... and maybe someone from DataStax could help as well.

Waiting for Shawn to wake up to see if he has any updates on his 5-year-old post on this topic.

Luca.

David Ostrovsky

Jan 7, 2016, 9:12:08 AM
to Repo and Gerrit Discussion

On Thursday, January 7, 2016 at 14:56:57 UTC+1, lucamilanesio wrote:
> Hi Alex,
> thank you for your quick reply: as "you know someone" who did a real JGit DFS implementation ... and you believe it is possible ... we gain more confidence in starting this work.
> Should you have spare time to support us with answers or code, it would be really appreciated :-)
>
> Cassandra is gaining momentum for its ability to be scalable, very fast on reads, and distributed over single or multiple geographical zones, which would make it a perfect candidate for Gerrit.
> We may have a Cassandra expert helping us with this work ... and maybe someone from DataStax could help as well.
>
> Waiting for Shawn to wake up to see if he has any updates on his 5-year-old post on this topic.

If I understand correctly the thread that you referenced in your previous mail,
it was about switching Gerrit's SQL database to NoSQL for the Gerrit metadata,
not the Git backend. I guess that Gerrit@Google was using an SQL database backend
before it was ported to Google Megastore (a multi-tenant database built on Bigtable).


Alex Blewitt

Jan 7, 2016, 9:41:07 AM
to Luca Milanesio, repo-discuss, JGit Developers list


> On 7 Jan 2016, at 13:56, Luca Milanesio <luca.mi...@gmail.com> wrote:
>
> Hi Alex,
> thank you for your quick reply: as "you know someone" who did a real JGit DFS implementation ... and you believe it is possible ... we gain more confidence in starting this work.

Cassandra will be OK for supporting references, but I don't believe it will work for storing the pack files (objects). So the question becomes: where do you store the data? You'd really need to have a plan for that first. Cassandra has limited sizes for individual entries (2 GB max, in practice orders of magnitude less according to the wiki), which would make it unsuitable for storing pack files in general.

You might consider storing pack files in S3 and the references in Cassandra for example to provide a scalable solution. But you shouldn't rely on Cassandra for the pack files.

So where did you intend on storing the large data blobs?

Alex

Luca Milanesio

Jan 7, 2016, 10:30:38 AM
to Alex Blewitt, repo-discuss, JGit Developers list
Hi Alex,
I see your point, but wouldn't it make more sense to have smaller pack files when you have a system like Cassandra?

In practice you would use pack files of around 250 MB (which, assuming a 1:5 compression ratio, means a maximum file size of 1 GB), which is still reasonable for keeping a Git server "healthy and well scalable".
For large files in Git I would rather opt for Git LFS support than store them natively in Git.

Do you foresee other problems with managing a Git repo with smaller pack files?

P.S. On GerritHub.io the largest packfile we have is 1.6 GB, big but still within Cassandra's limitations: the majority of packfiles are < 100 MB anyway.

Luca.

Shawn Pearce

Jan 7, 2016, 10:37:32 AM
to Luca Milanesio, Alex Blewitt, repo-discuss, JGit Developers list
On Thu, Jan 7, 2016 at 5:56 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:
> Hi Alex,
> thank you for your quick reply: as "you know someone" who did a real JGit
> DFS implementation ... and you believe it is possible ... we gain more
> confidence in starting this work.
> Should you have spare time to support us with answers or code, it would be
> really appreciated :-)

FWIW, Alex wrote his DFS implementation with almost no help from me. :)

> Cassandra is gaining momentum for its ability to be scalable, very fast
> on reads, and distributed over single or multiple geographical zones,

How does current Cassandra do on the Jepsen tests?
https://aphyr.com/posts/294-jepsen-cassandra

> which would
> make it a perfect candidate for Gerrit.
> We may have a Cassandra expert helping us with this work ... and maybe
> someone from DataStax could help as well.
>
> Waiting for Shawn to wake up if he has some updates on his 5 years old post
> on this topic.

The Cassandra work I did 5 years ago was based on JGit DHT, which is a
different design. No database could keep up with JGit DHT, so I
abandoned that approach and deleted the code from JGit. JGit DFS was
the outcome of all of that.

_If_ you wanted to put everything into Cassandra, I would chunk pack
files into say 1 MiB chunks and store the chunks in individual rows.
This means configuring the DfsBlockCache using
DfsBlockCacheConfig.setBlockSize(1 * MB). When creating a new pack
generate a random unique name for the DfsPackDescription and use that
name and the block offset as the row key.

DfsOutputStream buffers 1 MiB of data in RAM and then passes that
buffer off as a row insert into Cassandra.
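
A minimal sketch of that write path, assuming a pack_blocks table keyed by (pack name, block offset); the ChunkedPackWriter name and the schema are invented for illustration, and wiring it up as a real DfsOutputStream is left out:

    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import com.datastax.driver.core.Session;

    class ChunkedPackWriter {
      static final int BLOCK_SIZE = 1024 * 1024; // matches DfsBlockCacheConfig.setBlockSize(1 * MB)

      private final Session session;
      private final String packName; // the random unique name from the DfsPackDescription
      private final byte[] buf = new byte[BLOCK_SIZE];
      private int used;
      private long offset; // offset of the first byte of the block being filled

      ChunkedPackWriter(Session session, String packName) {
        this.session = session;
        this.packName = packName;
      }

      void write(byte[] data, int off, int len) {
        while (len > 0) {
          int n = Math.min(len, BLOCK_SIZE - used);
          System.arraycopy(data, off, buf, used, n);
          used += n; off += n; len -= n;
          if (used == BLOCK_SIZE)
            flushBlock();
        }
      }

      void close() {
        if (used > 0)
          flushBlock(); // final, possibly short, block
      }

      private void flushBlock() {
        // Row key is (pack name, offset of first byte): 0, 1048576, 2097152, ...
        session.execute(
            "INSERT INTO pack_blocks (pack, block_offset, data) VALUES (?, ?, ?)",
            packName, offset, ByteBuffer.wrap(Arrays.copyOf(buf, used)));
        offset += used;
        used = 0;
      }
    }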

The DfsObjDatabase.openFile() method supplies a ReadableChannel that
is accessed in aligned blockSize units, so 1 MB alignments. If your
row keys are the pack name and the offset of the first byte of the
block (so 0, 1048576, 2097152, ...) read method calls nicely line up
to row reads from Cassandra. The DfsBlockCache will smooth out
frequent calls for rows.
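
The matching read path is then a positional read that always fetches whole aligned rows (again only a sketch with invented names; plugging it into the ReadableChannel returned by openFile(), and using setReadAheadBytes() during clones, is left out):

    import java.nio.ByteBuffer;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    class ChunkedPackReader {
      static final int BLOCK_SIZE = 1024 * 1024;

      private final Session session;
      private final String packName;

      ChunkedPackReader(Session session, String packName) {
        this.session = session;
        this.packName = packName;
      }

      // Read up to dst.remaining() bytes starting at pos; returns bytes copied, or -1 past EOF.
      int read(long pos, ByteBuffer dst) {
        long blockStart = (pos / BLOCK_SIZE) * BLOCK_SIZE; // aligned row key
        Row row = session.execute(
            "SELECT data FROM pack_blocks WHERE pack = ? AND block_offset = ?",
            packName, blockStart).one();
        if (row == null)
          return -1;
        ByteBuffer data = row.getBytes("data"); // the driver returns blobs as ByteBuffer
        int skip = (int) (pos - blockStart);
        if (skip >= data.remaining())
          return -1;
        data.position(data.position() + skip);
        int n = Math.min(data.remaining(), dst.remaining());
        ByteBuffer slice = data.duplicate();
        slice.limit(slice.position() + n);
        dst.put(slice);
        return n;
      }
    }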

Use another row in Cassandra to store the list of packs. The
listPacks() method then just loads that row. commitPacks() updates
that row by inserting some values and removing other values. What you
really want to store here is the pack name and the length so that you
can generate the row keys.
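
One possible table layout for all of the above; the table names are invented, and the pack list is modelled as a single row per repository mapping pack name to length, which is what listPacks() and commitPacks() would read and rewrite:

    import com.datastax.driver.core.Session;

    class CassandraGitSchema {
      static void create(Session session) {
        // One row per blockSize chunk of a pack, keyed by (pack, block_offset).
        session.execute(
            "CREATE TABLE IF NOT EXISTS pack_blocks ("
            + " pack text, block_offset bigint, data blob,"
            + " PRIMARY KEY (pack, block_offset))");
        // One row per repository listing the live packs: pack name -> length in bytes,
        // enough to regenerate the block row keys on the read side.
        session.execute(
            "CREATE TABLE IF NOT EXISTS pack_list ("
            + " repo text PRIMARY KEY,"
            + " packs map<text, bigint>)");
      }
    }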

Reference API in DfsRefDatabase is simple. But I just committed a
change to JGit to allow other uses of RefDatabases. Because...


The new RefTree type[1] is part of a larger change set to allow
storing references inside of Git tree objects. (Git, in Git! Ahh the
recursion!) This may simplify things a little bit as we only really
need to store the pack and object data. Reference data is derived from
pack data.

[1] https://git.eclipse.org/r/62967

RefTree on its own is incomplete. I should get another few commits
uploaded today that provide a full RefDatabase around the RefTree
type. I have it coded and working; I'm just finishing the unit tests
that verify it works.


The longer term trend here is I'm doing some Git multi-master work
inside JGit now. RefTree is an important building block, but is far
from complete. $DAY_JOB is evolving our Git multi-master system for
$REASONS, and in the process trying to put support into JGit.

Alex Blewitt

Jan 7, 2016, 2:45:53 PM
to Shawn Pearce, Luca Milanesio, repo-discuss, JGit Developers list

On 7 Jan 2016, at 15:37, Shawn Pearce <spe...@spearce.org> wrote:

_If_ you wanted to put everything into Cassandra, I would chunk pack
files into say 1 MiB chunks and store the chunks in individual rows.
This means configuring the DfsBlockCache using
DfsBlockCacheConfig.setBlockSize(1 * MB). When creating a new pack
generate a random unique name for the DfsPackDescription and use that
name and the block offset as the row key.

My recollection is that the DfsGarbageCollector coalesced everything that wasn't garbage into a single pack (the garbage got its own pack and gets deleted later)


Of course I may have got that bit wrong too :)

That would mean that triggering a repo.gc() would potentially overflow the Cassandra based storage, wouldn’t it?

Alex

Shawn Pearce

Jan 7, 2016, 2:49:00 PM
to Alex Blewitt, Luca Milanesio, repo-discuss, JGit Developers list
On Thu, Jan 7, 2016 at 11:45 AM, Alex Blewitt <alex.b...@gmail.com> wrote:
>
> On 7 Jan 2016, at 15:37, Shawn Pearce <spe...@spearce.org> wrote:
>
> _If_ you wanted to put everything into Cassandra, I would chunk pack
> files into say 1 MiB chunks and store the chunks in individual rows.
> This means configuring the DfsBlockCache using
> DfsBlockCacheConfig.setBlockSize(1 * MB). When creating a new pack
> generate a random unique name for the DfsPackDescription and use that
> name and the block offset as the row key.
>
>
> My recollection is that the DfsGarbageCollector coalesced everything that
> wasn't garbage into a single pack (the garbage got its own pack and gets
> deleted later)

It makes two or three packs:

1) refs/heads/*
2) everything else (e.g. refs/meta/config)
3) remaining garbage

> https://github.com/eclipse/jgit/blob/master/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/dfs/DfsGarbageCollector.java#L354
>
> Of course I may have got that bit wrong too :)
>
> That would mean that triggering a repo.gc() would potentially overflow the
> Cassandra based storage, wouldn’t it?

Why would it overflow? DfsOutputStream is chunking into 1 MiB blocks
and storing each block into its own row. Cassandra is infinitely
scalable with its hashring distributing rows over the server pool. You
could store terabytes if you have enough disk and enough Cassandra
nodes.

Alex Blewitt

Jan 7, 2016, 3:02:41 PM
to Shawn Pearce, Luca Milanesio, repo-discuss, JGit Developers list

> On 7 Jan 2016, at 19:48, Shawn Pearce <spe...@spearce.org> wrote:
>
> Why would it overflow? DfsOutputStream is chunking into 1 MiB blocks
> and storing each block into its own row. Cassandra is infinitely
> scalable with its hashring distributing rows over the server pool. You
> could store terabytes if you have enough disk and enough Cassandra
> nodes.

Ah, so it was me - I had implemented the DfsOutputStream to just append to a file (effectively), so that on disk I had a .pack file which could be understood by native git tools. But if each DfsOutputStream writes to its own file, then I understand that you could do the partitioning at that point.

Alex

Luca Milanesio

Jan 8, 2016, 6:26:51 AM
to Shawn Pearce, Alex Blewitt, repo-discuss, JGit Developers list
Hi Shawn,
thanks for your feedback; see my comments below.

> On 7 Jan 2016, at 15:37, Shawn Pearce <spe...@spearce.org> wrote:
>
> On Thu, Jan 7, 2016 at 5:56 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:
>> Hi Alex,
>> thank you for your quick reply: as "you know someone" who did a real JGit
>> DFS implementation ... and you believe it is possible ... we gain more
>> confidence in starting this work.
>> Should you have spare time to support us with answers or code, it would be
>> really appreciated :-)
>
> FWIW, Alex wrote his DFS implementation with almost no help from me. :)

@Alex: I know your DFS isn't open source, but did you present the overall approach at any conference without mentioning the client's name?

>
>> Cassandra is gaining momentum for its ability to be scalable, very fast
>> on reads, and distributed over single or multiple geographical zones,
>
> How does current Cassandra do on the Jepsen tests?
> https://aphyr.com/posts/294-jepsen-cassandra

Actually Cassandra chooses to let the last write win: if you have client A and client B in two different zones writing the same key at exactly the same time, whoever comes last (in Cassandra's timing) will be written to disk.
I was thinking about Cassandra for extensibility and scalability of storage in the *same* geographical zone (local consistency), and Cassandra serves that scenario well and quickly.

If we want to push Cassandra to the global consistency level across remote geo-locations without a powerful dedicated CDN, it wouldn't work in practice, as the latency introduced by the consistency checks would be too high.

Another approach is to "endorse inconsistency" and manage it.

Let's say that client A wants to push commit CA to the master branch and at the same time client B wants to push commit CB to the same master branch of the same project, at the same millisecond, across two different zones.
CA and CB, assuming that they are different commits with different content, will have different SHA1s for their BLOBs, Trees and commit objects: we then don't have any conflict from a Cassandra perspective.
The trouble is when we want to update the ref: will the new master point to CA or CB?

A solution could be: append both and let the read operation resolve the conflict.
If you add both CA and CB, client A and client B will find that after the push there are two values appended instead of one => they can then treat this case as a conflict and retry the operation.

As both CA and CB have all their objects already pushed (but only the ref wasn't updated), their second push attempt will be very quick. The retry wouldn't then create too much trouble for either of them.

Do you (or Alex) foresee problems with this approach?

> _If_ you wanted to put everything into Cassandra, I would chunk pack
> files into say 1 MiB chunks and store the chunks in individual rows.

Thanks for the suggestion, that would definitely work.

If we choose 50 MiB for a Cassandra row (which is still very acceptable), the number of packs across multiple rows will be quite limited anyway.
It shouldn't then create a significant performance penalty.

One of the reasons for choosing Cassandra is to make the Git storage virtually "unlimited" with zero downtime.
Imagine that I now have 10 Cassandra nodes with 10 TB capacity ... if the volume increases a lot, it would be very easy to add more nodes and get extra storage without a significant performance penalty on reads or writes.

Honestly, I was thinking about both Cassandra AND GlusterFS to scale Git storage, but Cassandra looked more promising as it requires less work at the infrastructure level.
Additionally I could use Cassandra for the DB side as well, so you have the same backend for Git data and the DB even now in Gerrit 2.13.

All the feedback received so far on the feasibility of the Cassandra approach looks promising :-)

Luca.

David Ostrovsky

Jan 8, 2016, 7:31:37 AM
to Repo and Gerrit Discussion

On Friday, January 8, 2016 at 12:26:51 UTC+1, lucamilanesio wrote:
> Hi Shawn,
> thanks for your feedback; see my comments below.
>
> [...]
>
> Additionally I could use Cassandra for the DB side as well, so you have the same backend for Git data and the DB even now in Gerrit 2.13.

Wouldn't this require a No-SQL Cassandra driver for gwtorm?
Or do you have it already ;-) ?

Luca Milanesio

Jan 8, 2016, 8:44:23 AM
to David Ostrovsky, Repo and Gerrit Discussion
Hi David,
yes, you are right, it will, and it doesn't exist yet ... nothing exists yet, to be honest :-)

Luca.

David Ostrovsky

Jan 8, 2016, 8:53:04 AM
to Repo and Gerrit Discussion

On Friday, January 8, 2016 at 14:44:23 UTC+1, lucamilanesio wrote:
> Hi David,
> yes, you are right, it will, and it doesn't exist yet ... nothing exists yet, to be honest :-)


Thanks, Luca, for confirming that. TBH I would be interested in this project ;-)
I talked to Shawn years ago about adding a Cassandra dialect to gwtorm, to have
an open source No-SQL option for Gerrit, but he argued that it wouldn't really
make sense, as the Gerrit metadata backend is going to be migrated to Git.

Luca Milanesio

Jan 8, 2016, 9:00:00 AM
to David Ostrovsky, Repo and Gerrit Discussion
Shawn is actually right :-) It all depends on how quickly Gerrit 3.0 progresses compared to the Cassandra work.
In either case, a Cassandra JGit backend would be useful, with or without the metadata in a DB.

Luca.

Shawn Pearce

Jan 8, 2016, 11:29:11 AM
to Luca Milanesio, Alex Blewitt, repo-discuss, JGit Developers list
On Fri, Jan 8, 2016 at 3:26 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:
>
>> On 7 Jan 2016, at 15:37, Shawn Pearce <spe...@spearce.org> wrote:
>>
>> On Thu, Jan 7, 2016 at 5:56 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:
>>
>>> Cassandra is gaining momentum for its ability to be scalable, very fast
>>> on reads, and distributed over single or multiple geographical zones,
>>
>> How does current Cassandra do on the Jepsen tests?
>> https://aphyr.com/posts/294-jepsen-cassandra

Did you read this ^^ ?

> Actually Cassandra chooses to let the last write win: if you have client A and client B in two different zones writing the same key at exactly the same time, whoever comes last (in Cassandra's timing) will be written to disk.

Last time I looked, Cassandra's timing is just the wall/system clock
of the computer it's running on. If you have 5 machines, even in the
same rack of the same data center with NTP, their times will be off by
at least a few milliseconds. It's a well-known concept in distributed
computing that you cannot rely on the clock. This is why vector clocks
came about. And then of course Google went and proved otherwise with
Spanner[2]. But Cassandra does not incorporate what Spanner does to
make system clocks work.

[2] http://research.google.com/archive/spanner.html

> I was thinking about Cassandra for extensibility and scalability of storage in the *same* geographical zone (local consistency), and Cassandra serves that scenario well and quickly.
>
> If we want to push Cassandra to the global consistency level across remote geo-locations without a powerful dedicated CDN, it wouldn't work in practice, as the latency introduced by the consistency checks would be too high.
>
> Another approach is to "endorse inconsistency" and manage it.
>
> Let's say that client A wants to push commit CA to the master branch and at the same time client B wants to push commit CB to the same master branch of the same project, at the same millisecond, across two different zones.
> CA and CB, assuming that they are different commits with different content, will have different SHA1s for their BLOBs, Trees and commit objects: we then don't have any conflict from a Cassandra perspective.
> The trouble is when we want to update the ref: will the new master point to CA or CB?
>
> A solution could be: append both and let the read operation resolve the conflict.
> If you add both CA and CB, client A and client B will find that after the push there are two values appended instead of one => they can then treat this case as a conflict and retry the operation.

You may not be able to see CA yet at the end of the CB push. How long
does client B wait while polling to see if a CA is created?

What if the timestamps of CA and CB differ by only a couple of
milliseconds? From the point of view of Cassandra CB wins because its
timestamp is 2 milliseconds later than CA. But from the point of view
of the users, CA and CB happened at the same time. Humans can't
really perceive a 2 millisecond time. Heck that 2 millisecond
difference could have just been introduced by a kernel scheduling
hiccup alternating one thread off a core to run some other thread.

> As both CA and CB have all their objects already pushed (but only the ref wasn't updated), their second push attempt will be very quick. The retry wouldn't then create too much trouble for either of them.

For that 2nd push to be really quick you need the ls-remotes during
push to advertise both CA and CB. But only one value can be shown on
the reference. So the other will have to be hidden in a ".have" line.
Entirely doable, but requires a bit more coding. You have to override
RefDatabase.getAdditionalRefs() and return values there so they can be
folded into the ".have" advertisement. IIRC Gerrit ignores this list
by default, so you may also have to tweak Gerrit.
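
A rough illustration of that shape; the pendingTips source is hypothetical (e.g. the "losing" values parked next to the ref), and only getAdditionalRefs() itself is existing JGit API:

    import java.util.ArrayList;
    import java.util.List;
    import org.eclipse.jgit.lib.ObjectId;
    import org.eclipse.jgit.lib.ObjectIdRef;
    import org.eclipse.jgit.lib.Ref;

    class PendingTipAdvertiser {
      private final List<String> pendingTips; // SHA-1s that lost the ref race but are already stored

      PendingTipAdvertiser(List<String> pendingTips) {
        this.pendingTips = pendingTips;
      }

      // The kind of list a getAdditionalRefs() override would return so the
      // extra tips get folded into the ".have" advertisement.
      List<Ref> additionalRefs() {
        List<Ref> extra = new ArrayList<Ref>();
        for (String sha1 : pendingTips) {
          extra.add(new ObjectIdRef.Unpeeled(
              Ref.Storage.NETWORK, "refs/pending/" + sha1, ObjectId.fromString(sha1)));
        }
        return extra;
      }
    }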

> Do you (or Alex) foresee problems with this approach?

Yes. :)

> If we choose 50 MiB for a Cassandra row (which is still very acceptable), the
> number of packs across multiple rows will be quite limited anyway.

Be careful about that. If the Cassandra server and client treat that
as a single byte array 50 MiB in size, moving that around is going to
be very, very slow. Latency to the first byte, for example, is about the
same as latency for the last byte, as the entire array has to be loaded
in from Cassandra's storage files, then put into the wire protocol
buffer, then read in by the client library in the JGit process before
JGit can see any of the data.

If you use smaller chunks (e.g. 1 MiB) then latency for the first MiB
is far lower. You spread the latency out over the chunks of the file,
which may help if you don't need very much of the file. E.g. to compute
a diff between two commits for Gerrit you may only need 512 KiB worth
of data, but it's spread around 3 parts of the pack file. At 50 MiB per
chunk you may have to load 50 MiB to use 512 KiB. At 1 MiB per chunk
you may load 3 MiB to use 512 KiB. Latency here to move 3 MiB is
probably going to be lower than to move 50 MiB.

This is one reason our DFS system at $DAY_JOB actually uses a block
size of 64 KiB. Small reads are common in things like Gerrit and
Gitiles because the entire repository is not being accessed.

During clone we catch this and increase our read size considerably
using the setReadAheadBytes() method of JGit's dfs.ReadableChannel
interface. We use that advice to start asynchronously loading larger
amounts of data, but the small chunk size means we can start making
data available to the Git client almost immediately rather than
waiting for 50 MiB to come into our server process.

> the number of packs across multiple rows will be quite limited anyway.
> It shouldn't then create a significant performance penalty.
>
> One of the reasons for choosing Cassandra is to make the Git storage virtually "unlimited" with zero downtime.
> Imagine that I now have 10 Cassandra nodes with 10 TB capacity ... if the volume increases a lot, it would be very easy to add more nodes and get extra storage without a significant performance penalty on reads or writes.

True, IIRC you can grow a Cassandra cluster while it is online, so you
can add capacity with no impact.

Luca Milanesio

Jan 8, 2016, 1:19:50 PM
to Shawn Pearce, Alex Blewitt, repo-discuss, JGit Developers list

> On 8 Jan 2016, at 16:28, Shawn Pearce <spe...@spearce.org> wrote:
>
> On Fri, Jan 8, 2016 at 3:26 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:
>>
>>> On 7 Jan 2016, at 15:37, Shawn Pearce <spe...@spearce.org> wrote:
>>>
>>> On Thu, Jan 7, 2016 at 5:56 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:
>>>
>>>> Cassandra is gaining momentum for its ability to be scalable, very fast
>>>> on reads, and distributed over single or multiple geographical zones,
>>>
>>> How does current Cassandra do on the Jepsen tests?
>>> https://aphyr.com/posts/294-jepsen-cassandra
>
> Did you read this ^^ ?

Yes I did, and I discarded the idea of updating cells :-) You cannot reliably know who was the "last one to update the cell": you may think it was you (on your replica), but in fact someone else on another replica on the other side of the planet can think the same thing ... and at the end of the day only Cassandra knows who won, not you, which isn't nice, as one of the two has lost their write.

I am not concerned about Cassandra losing data on updates, as long as the client issuing the write has a way to detect the situation and try again.
The problem with cell updates is that you can never know whether you lost your data or not :-(

>
>> Actually Cassandra chooses to let the last write win: if you have client A and client B in two different zones writing the same key at exactly the same time, whoever comes last (in Cassandra's timing) will be written to disk.
>
> Last time I looked, Cassandra's timing is just the wall/system clock
> of the computer it's running on. If you have 5 machines, even in the
> same rack of the same data center with NTP, their times will be off by
> at least a few milliseconds. It's a well-known concept in distributed
> computing that you cannot rely on the clock. This is why vector clocks
> came about. And then of course Google went and proved otherwise with
> Spanner[2]. But Cassandra does not incorporate what Spanner does to
> make system clocks work.
>
> [2] http://research.google.com/archive/spanner.html

Agreed, that's bad and makes the updates unreliable.

>
>> I was thinking about Cassandra for extensibility and scalability of storage in the *same* geographical zone (local consistency), and Cassandra serves that scenario well and quickly.
>>
>> If we want to push Cassandra to the global consistency level across remote geo-locations without a powerful dedicated CDN, it wouldn't work in practice, as the latency introduced by the consistency checks would be too high.
>>
>> Another approach is to "endorse inconsistency" and manage it.
>>
>> Let's say that client A wants to push commit CA to the master branch and at the same time client B wants to push commit CB to the same master branch of the same project, at the same millisecond, across two different zones.
>> CA and CB, assuming that they are different commits with different content, will have different SHA1s for their BLOBs, Trees and commit objects: we then don't have any conflict from a Cassandra perspective.
>> The trouble is when we want to update the ref: will the new master point to CA or CB?
>>
>> A solution could be: append both and let the read operation resolve the conflict.
>> If you add both CA and CB, client A and client B will find that after the push there are two values appended instead of one => they can then treat this case as a conflict and retry the operation.
>
> You may not be able to see CA yet at the end of the CB push. How long
> does client B wait while polling to see if a CA is created?

The approach I was proposing was for geographically co-located Cassandra instances, for disk-space scalability: it wouldn't work if the nodes are in completely different locations with very high latency or even moments of "temporary disconnection".

I need to investigate Cassandra further to understand what it does in these conditions: will the nodes just be dropped from the cluster? How quickly does that happen? Needs a bit of investigation ...

>
> What if the timestamps of CA and CB differ by only a couple of
> milliseconds? From the point of view of Cassandra CB wins because its
> timestamp is 2 milliseconds later than CA. But from the point of view
> of the users, CA and CB happened at the same time. Humans can't
> really perceive a 2 millisecond time. Heck that 2 millisecond
> difference could have just been introduced by a kernel scheduling
> hiccup alternating one thread off a core to run some other thread.

If both add to the same collection, Cassandra allows the two objects to be added but cannot guarantee the order, as it is not reliable.
The decision of "who won" cannot be based on the "last one", as both will be there in the collection. It will need to be resolved at read time, making a decision on who is to be considered the winner.

Let's put it this way: the "winner" will eventually be elected somehow amongst the commits that ended up in the collection. We may even decide that there is "no winner" if the collection contains more than one element.
There must be a moment in time when the collection is going to be compacted to a single element: that is the moment when you read it and decide the "winner" or "no winners".

In the above example:
For the cell project.master that contains the SHA1 of the master branch, there will be an associated collection, project.master.pushed.
In project.master.pushed there will be both CA and CB.

If we use the "no winners" logic, this means that both CA and CB will be considered failed.
The problem is the cleanup or compaction logic and when to do it ... I still have no answer or thoughts on that :-(

It could even be that Cassandra is the right tool for storing Git objects and packs but NOT AT ALL for refs, and I may need to use something else.
To be honest with you, my current scalability problems are with the Git objects (BLOBs, Trees, Commits) and not the refs. It would still solve my problem if I chose to use Cassandra for objects and something else for refs.

>
>> As both CA and CB have all their objects already pushed (but only the ref wasn't updated), their second push attempt will be very quick. The retry wouldn't then create too much trouble for either of them.
>
> For that 2nd push to be really quick you need the ls-remotes during
> push to advertise both CA and CB. But only one value can be shown on
> the reference. So the other will have to be hidden in a ".have" line.
> Entirely doable, but requires a bit more coding. You have to override
> RefDatabase.getAdditionalRefs() and return values there so they can be
> folded into the ".have" advertisement. IIRC Gerrit ignores this list
> by default, so you may also have to tweak Gerrit.

Good to know; more work on the horizon then :-(

>
>> Do you (or Alex) foresee problems with this approach?
>
> Yes. :)

That's the type of feedback I needed :) You guys have done this before me, and "knowledge review" is always as good as "code review"!
Good to know; it is true that even if I use Cassandra as a "distributed disk for JGit", everything will still need to go through the network, and latency is then going to be an issue.
What is acceptable and very fast on your local disk (maybe even an SSD) may take much longer over a network of Cassandra nodes, maybe even a geographically distributed one!

lucamilanesio

Jan 12, 2016, 11:06:10 AM
to Repo and Gerrit Discussion, spe...@spearce.org, alex.b...@gmail.com, jgit...@eclipse.org
Here is my 2c for getting Cassandra to work when updating a ref, see below.
My proposal would be the following:
- keep two keys on Cassandra for each ref: <ref> and <ref>.updates
- always append data to <ref>.updates as a collection of values: we do not have the update-loss problem
- each element of the <ref>.updates collection contains a pair (old-sha1, new-sha1)
- during a read, we get all the values in the <ref>.updates collection and "navigate" them starting from the initial SHA1 of <ref>
- when we reach the end of the chain => that's the final tip of the ref

A collision is more than one pair in the <ref>.updates collection with the same old-sha1.
I now need to run some intensive Cassandra write tests to see if the above algorithm works ... but it should in theory :-)

The <ref>.updates collection can be resolved, compacted and reduced back to <ref> at GC time.
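
As a quick illustration of that read-time resolution, with plain collections standing in for the Cassandra reads and all names made up:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class RefUpdateChain {
      // Walk the (old-sha1 -> new-sha1) pairs appended to <ref>.updates, starting
      // from the last value folded into <ref>; null means a collision was detected
      // (two pairs sharing the same old-sha1), so the caller should treat it as a conflict.
      static String resolve(String baseSha1, List<String[]> updates) {
        Map<String, String> next = new HashMap<String, String>();
        for (String[] pair : updates) { // pair[0] = old-sha1, pair[1] = new-sha1
          if (next.put(pair[0], pair[1]) != null)
            return null; // same old-sha1 appended twice: concurrent writers collided
        }
        String tip = baseSha1;
        while (next.containsKey(tip))
          tip = next.remove(tip); // remove() also protects against looping forever
        return tip;
      }
    }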

Martin Fick

Jan 12, 2016, 3:58:17 PM
to repo-d...@googlegroups.com, lucamilanesio, spe...@spearce.org, alex.b...@gmail.com, jgit...@eclipse.org
On Tuesday, January 12, 2016 08:06:10 AM lucamilanesio
wrote:
>
> A collision is more than one pair in the <ref>.updates
> collection with the same old-sha1.

"Collisions" as defined above are possible from perfectly
legal git operations (rewinds),

-Martin


--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

Luca Milanesio

Jan 12, 2016, 4:14:16 PM
to Martin Fick, Repo and Gerrit Discussion, Shawn Pearce, alex.b...@gmail.com, jgit...@eclipse.org
Fair point ... I can go back and forth multiple times and thus I would always "think" they are collisions, whilst they are not ... bugger!
Need to think again and find something else :-)

@Martin thanks for the feedback :-)

Luca.