IDE research: putting objects in databases


Jeremy Morse

Feb 9, 2015, 10:19:12 AM
to srobo...@googlegroups.com
Hi,

A while ago I suggested on this mailing list that having the IDE store
its data in a real database (e.g., MySQL or similar) might be a better
storage model, presuming the IDE design doesn't change significantly.
Out of pure escapism (and a desire to understand libmysqlclient) I took
a shot at this over Christmas and reached a sub-prototype state. I
don't have time to beat on it any more, so I figured I should report
some findings. I see Peter as maintaining all IDE-related things: this
is research into alternatives and whether they would be feasible.

The base idea is to dump all git data into a real database (with
indexes), using MySQL as an example. This would be achievable with:
* libgit2
* a libgit2 custom backend (an already-supported facility) to pump git
  objects and references into the database
* pygit2, to use libgit2 from Python, which is much more pleasant than
  writing an IDE in C.
The overall aim would be to make the IDE perform better, and to
simplify the way it manages the data it operates on.

The essential database structure winds up with _all_ git objects and
references stored in just two database tables. References are named:

refs/heads/$team/$master_or_username/$projname
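
To make that concrete, here's a minimal sketch of the sort of schema I
mean. It is not the schema from the libgit2-backends patches [4]; the
table and column names here are made up for illustration, and pymysql
stands in for whatever driver you prefer:

    import pymysql  # hypothetical driver choice; the backend itself
                    # talks to MySQL via libmysqlclient from C

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS git_objects (
        sha1 CHAR(40) PRIMARY KEY,  -- hex object id; the primary key
                                    -- is the index we get for free
        type TINYINT NOT NULL,      -- commit / tree / blob / tag
        size BIGINT NOT NULL,
        data LONGBLOB NOT NULL      -- raw object body
    );
    CREATE TABLE IF NOT EXISTS git_refs (
        name VARCHAR(255) PRIMARY KEY,  -- refs/heads/$team/$user/$proj
        target CHAR(40) NOT NULL        -- hex sha1 the ref points at
    );
    """

    def create_schema(conn):
        with conn.cursor() as cur:
            for stmt in SCHEMA.split(';'):
                if stmt.strip():
                    cur.execute(stmt)
        conn.commit()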

I used all the git repos from sr2014 as an example data set. I can
distribute this to other people; however, the code is copyrighted by
the competitors, so it can't be distributed everywhere. ~200MB of
on-disk data was reduced to 36MB in the database. It's worth noting
that running `git gc` on the on-disk copies reduced them to 110MB.

The real benefit comes from a) not having to invoke `git` repeatedly
and b) code simplicity; see this example of poll.py [0]. (That's not a
full implementation of poll, although the remaining corner cases are
small.) This example delivers an order-of-magnitude performance
improvement: an average response time of 200ms when the poll endpoint
is beaten by 100 clients [1], compared to 2 seconds for the vanilla
IDE. Peter pointed out to me that poll is probably the endpoint with
the greatest potential for improvement from this technique, so the
gain might not carry across to other endpoints.
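
To give a flavour of the code-simplicity point: with everything in one
pygit2 repository, finding what a team's projects currently point at
is a couple of library calls rather than a `git` invocation per repo.
This is a hypothetical sketch rather than the poll.py change in [0],
and it assumes a repo object already wired up to the database backend
via [2]/[3]:

    import pygit2

    def team_ref_tips(repo, team):
        """Map each of a team's refs to the sha1 it points at.

        A polling client can compare this dict against the one from
        its previous poll to see which projects have changed.
        """
        prefix = 'refs/heads/%s/' % team
        tips = {}
        for name in repo.listall_references():
            if name.startswith(prefix):
                # str() copes with .target being an Oid or a hex
                # string, depending on the pygit2 version.
                tips[name] = str(repo.lookup_reference(name).target)
        return tips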

Benefits of this technique:
* It's potentially faster (according to preliminary results)
* We don't rely on the filesystem cache for performance
* Things are indexed magnificently
* Web service and data storage are decoupled when it comes to scaling.
  In particular, we could just buy a database service (think AWS) and
  scale up provisioning as necessary.
* IDE locking can be put in the database (e.g. by locking special
  references; see the sketch after this list)
* Backup / restore / movement becomes a database operation rather than
  running rsync / tar, and having it be an atomic operation would be
  good.
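
On the locking point, one possible scheme (purely illustrative; the
prototype doesn't implement this) is to treat a row in the refs table
as the lock, relying on the primary key to make acquisition atomic.
This assumes the hypothetical git_refs table sketched above:

    import pymysql

    def try_lock(conn, team, project, holder_sha1):
        """Try to take a per-project lock via a special ref name.

        The PRIMARY KEY on git_refs.name means exactly one contender's
        INSERT succeeds; the rest get a duplicate-key error.
        """
        name = 'refs/locks/%s/%s' % (team, project)
        try:
            with conn.cursor() as cur:
                cur.execute('INSERT INTO git_refs (name, target) '
                            'VALUES (%s, %s)', (name, holder_sha1))
            conn.commit()
            return True
        except pymysql.err.IntegrityError:
            conn.rollback()
            return False

    def unlock(conn, team, project):
        with conn.cursor() as cur:
            cur.execute('DELETE FROM git_refs WHERE name = %s',
                        ('refs/locks/%s/%s' % (team, project),))
        conn.commit()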

I was going to cite the ability to keep a competition backup VM and
the live copy of the website in sync; however, after speaking to
someone who's done database replication for a reasonably large
website, apparently this is 'geo-replication' and would lead to us
hating ourselves.

Limitations of this technique:
* We're sunk if there's a SHA1 collision; we're sunk if that happens
  anyway, it's just more likely here
* Project names have to be valid branch names, which are more
  restrictive (although they allow unicode)
* pylint and zipping will still require putting files onto a
  filesystem, although this could be a memory filesystem (see the
  sketch after this list)
* One cannot simply frob a team's git repos to perform maintenance
* Raw access to git repos would become funky, possibly too risky
* Over time a pygit2 repo would build up a cache of objects from the
  database, which is undesirable
* It's unclear whether the increased amount of context switching to
  communicate with the database will outweigh having the data in a
  real storage engine.
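
For the memory-filesystem point, materialising a project for pylint or
zipping could look roughly like the following. This is an
assumption-laden sketch: it walks a commit's tree with pygit2 and
writes the blobs under a directory, which you'd place on a tmpfs mount
(e.g. /dev/shm on Linux) to keep it off disk. File modes and symlinks
are ignored for brevity:

    import os
    import pygit2

    def materialise(repo, commit_oid, dest):
        """Write the files of a commit's tree into dest."""
        commit = repo[commit_oid]
        _write_tree(repo, commit.tree, dest)

    def _write_tree(repo, tree, path):
        for entry in tree:
            obj = repo[entry.oid]  # entry.id on newer pygit2
            target = os.path.join(path, entry.name)
            if isinstance(obj, pygit2.Tree):
                os.makedirs(target)
                _write_tree(repo, obj, target)
            else:
                with open(target, 'wb') as f:
                    f.write(obj.data)  # Blob.data: raw file contents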

It also doesn't help that I had to seriously mangle the libgit2
"custom backend" for MySQL, and patch pygit2 to accept custom repos
(I'll submit a pull request at some point). For reference, the
substance of this material is:
* This [2] small series of IDE patches
* This [3] Python package for installing libgit2 backends, and
  creating git_repository objects that are connected to a database
* This [4] large series of patches to libgit2-backends to support
references being stored in mysql
* This [5] patch to pygit2 to enable creation of pygit2 repos from
custom git_repositories.
* Some scripts for sucking git repos into the database (a sketch of
  that loading step follows this list), and a modified version of
  Sam's Locust script to exercise the IDE poll endpoint randomly.
  These aren't published, as they contain a list of (sometimes
  offensive) project names that teams used last year.
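
Since the loading scripts aren't published, here's a hypothetical
sketch of the same idea: walk every commit reachable from a repo's
master branch with plain on-disk pygit2, recurse through the trees,
and insert the raw objects into the made-up git_objects table from
earlier. The real scripts differ in their details:

    import pygit2

    def load_repo(conn, repo_path, team, user, proj):
        """Copy an on-disk repo's objects and master ref into the DB."""
        repo = pygit2.Repository(repo_path)
        head = repo.lookup_reference('refs/heads/master')
        seen = set()
        with conn.cursor() as cur:
            for commit in repo.walk(head.target,
                                    pygit2.GIT_SORT_TOPOLOGICAL):
                _store(cur, repo, commit.oid, seen)
                _store_tree(cur, repo, commit.tree, seen)
            # Rename the branch to the convention described above.
            cur.execute('REPLACE INTO git_refs (name, target) '
                        'VALUES (%s, %s)',
                        ('refs/heads/%s/%s/%s' % (team, user, proj),
                         str(head.target)))
        conn.commit()

    def _store_tree(cur, repo, tree, seen):
        _store(cur, repo, tree.oid, seen)
        for entry in tree:
            obj = repo[entry.oid]
            if isinstance(obj, pygit2.Tree):
                _store_tree(cur, repo, obj, seen)
            else:
                _store(cur, repo, entry.oid, seen)

    def _store(cur, repo, oid, seen):
        if oid in seen:
            return
        seen.add(oid)
        otype, data = repo.read(oid)  # raw (type, bytes) pair
        cur.execute('INSERT IGNORE INTO git_objects '
                    '(sha1, type, size, data) VALUES (%s, %s, %s, %s)',
                    (str(oid), otype, len(data), data))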

If this is the kind of thing you might be interested in pursuing further
to evaluate whether it's a feasible option, drop me a line and I can
help set up a working environment.

[0] https://github.com/jmorse/cyanide/commit/7ff5f465d29110570e9ac450a2a43c218a640439
[1] On a VM on my SSD laptop, with 2GB of memory, the team being
polled chosen at random from the ~50 teams with repos.
[2] https://github.com/jmorse/cyanide/tree/pygit2
[3] https://github.com/jmorse/pygit2-backends
[4] https://github.com/jmorse/libgit2-backends
[5] https://github.com/jmorse/pygit2/tree/pygit2-0.19-backends

--
Thanks,
Jeremy
