Needless to say, this changes things a bit. I also noticed the venti
stats: many of the ventis I've seen don't have more than about 30G
used. 30G? I can get a 1U with 512G for 18K nowadays. 1TB for 36K.
Finally, it's always bothered me that to most people, Plan 9 file
systems are a black box. Plan9ports has a very nice "devnull" venti
that seemed like it could be used for a real system.
I thought it might be fun to have a very simple venti that uses an
mmapped backing store and a common hash library (posix hash, available
everywhere) as a demo and to maybe get people wanting to hack in this
area. So on the long boring flight to Spain, I threw one together.
So, see it here: http://pastebin.com/47aC2XGv
To use, just drop it into src/cmd/venti, fix the mkfile, and mk all.
To make an arena, do the usual dd from /dev/zero.
To run, you can, e.g.:
./o.devram -s 2 -a 'tcp!*!6666' /tmp/arenas
the '2' here means two GB. To grow the arena, tack zeros onto the end,
and restart.
To test, use randtest. Or vac. Or whatever.
I've set -s to 32 and had a 32GB venti. Sync is a no-op; it always
syncs on each write. Performance: well, you try it :-)
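The core of the approach described above (an mmap'ed arena file, with a sync after every write) can be sketched in POSIX C roughly as follows. This is not the pastebin code: Arena, arenaopen and arenawrite are hypothetical names, and the sketch assumes the arena file already exists at its full size and always starts appending at offset 0, where a real server would recover the fill point from its index.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical types and names, for illustration only. */
typedef struct Arena Arena;
struct Arena {
	unsigned char *base;	/* start of the mapping */
	size_t size;		/* bytes mapped */
	size_t used;		/* next free byte */
};

/* Map an existing, pre-sized arena file read/write. Returns 0 or -1. */
int
arenaopen(Arena *a, const char *path, size_t size)
{
	int fd;
	void *p;

	fd = open(path, O_RDWR);
	if(fd < 0)
		return -1;
	p = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the close */
	if(p == MAP_FAILED)
		return -1;
	a->base = p;
	a->size = size;
	a->used = 0;
	return 0;
}

/*
 * Append n bytes and msync them before returning.
 * Returns the offset of the block, or -1 if the arena is full
 * or the sync fails.
 */
long long
arenawrite(Arena *a, const void *buf, size_t n)
{
	size_t off, start, pagesz;

	assert(a->base != NULL);
	if(n > a->size - a->used)
		return -1;
	off = a->used;
	if(n == 0)
		return (long long)off;
	memcpy(a->base + off, buf, n);
	a->used += n;
	/* msync wants a page-aligned address, so round the start down */
	pagesz = (size_t)sysconf(_SC_PAGESIZE);
	start = off & ~(pagesz - 1);
	if(msync(a->base + start, off + n - start, MS_SYNC) < 0)
		return -1;
	return (long long)off;
}
```

With this shape, a vtsync request has nothing left to do, which is why sync can be a no-op.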
Mistakes and bugs and things to be done better? Of course! Lucho
pointed out some to me already. You can find more. Hack away!
And, there's going to be a better version soon. I think. But the more
versions floating around, the better.
have fun.
ron
I'm half-way there, but the boat takes priority this month.
--lyndon
hang in there.
ron
Ach! Ye of little Faithe!
1) Write drivers for obtuse RAID controllers.
or
2) Port venti to POSIX.
Hmm ... let me think about that for a minute ...
Time's up! Back to dealing with POSIX :-) And given enough
tequila, it can revert to almost pure ANSI C.
--lyndon
just use aoe.
- erik
Isn't p9p venti good enough?
Nope. It only works where p9p works. I want code that will compile
on any POSIX-compliant host.
hang in there for just a bit longer. I understand what you want.
ron
Isn't p9p POSIX enough? Confused I am, but wasn't that the point of p9p?
> Isn't p9p POSIX enough? Confused I am, but wasn't that the point of p9p?
>
p9p gives you a runtime environment just like Plan 9's. From the point
of view of a programmer you can even pretend you're not in a POSIX
world. It's wonderful, but there are times when people want the
functionality (e.g. a venti server) built on POSIX libraries rather
than on the p9p libraries.
We hit that issue a lot in the early days of xcpu. The first few
versions were very much p9p code. Users complained about the need for
the extra libraries and unfamiliar programming environment. Later
versions of xcpu were all POSIX, no p9p at all.
Hope I said that right.
ron
I don't know if it's still the case, but when I was playing with venti
a few years ago it had problems with chunks of memory > 2G. I was
trying to run p9p venti on a server with 64GB of RAM but could only use
a fraction of that for the venti caches. Now that may have been more
of a venti problem than a p9p problem; sadly I didn't have the time to
track it down.
-eric
9base? (http://tools.suckless.org/9base) Doesn't stick to just
libraries but already has packages on Ubungo anyways. I think it at
least cuts the GUI tools which eliminates the xorg-dev requirement
which is probably the most onerous.
-eric
It's very nice code.
There will soon be a googlecode repo (lucho is setting it up now) with
a non-plan9-ports version (vtmm). Find it in googlecode at libvt.
It's also quite nice and much more capable than what I posted yesterday.
We now have 3 very simple implementations of venti. I hope people look
at this stuff. I think it makes the concepts of venti much more
accessible.
Note a difference between lucho and me: I ignore vtsync (I always sync
on writes) and he properly pays attention to it. Question for the
student: which one is better? Why?
Could we make little venti files and finally try to build an SCM using
these files?
Have fun!
ron
p.s. while you can't run this on plan9 for anything big (you need to
be able to have processes that can get bigger than 4G) you will be
able to run it in small scale on Plan 9 and bigger on nix. You'll have
to remove the use of mmap and replace the msync with writes to a file,
but that's pretty trivial. A good project for someone.
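Replacing the mmap/msync pair with plain file I/O could look roughly like the sketch below. It is POSIX C with hypothetical helper names (filewrite, fileread); a native Plan 9 version would use that system's own calls for forcing data to disk instead of fsync, and this sketch does not retry on EINTR.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write n bytes at offset off, continuing after short writes,
 * then force them to disk. Returns 0 on success, -1 on error.
 */
int
filewrite(int fd, const void *buf, size_t n, off_t off)
{
	const char *p = buf;
	ssize_t m;

	assert(fd >= 0);
	while(n > 0){
		m = pwrite(fd, p, n, off);
		if(m <= 0)
			return -1;
		p += m;
		off += m;
		n -= (size_t)m;
	}
	return fsync(fd);	/* stands in for the msync of the mmap version */
}

/* Read exactly n bytes at offset off. Returns 0, or -1 on error or EOF. */
int
fileread(int fd, void *buf, size_t n, off_t off)
{
	char *p = buf;
	ssize_t m;

	assert(fd >= 0);
	while(n > 0){
		m = pread(fd, p, n, off);
		if(m <= 0)
			return -1;
		p += m;
		off += m;
		n -= (size_t)m;
	}
	return 0;
}
```

Because pread and pwrite take an explicit offset, the process never needs the whole arena in its address space, which is the point of the exercise on a system where processes cannot grow past 4G.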
question cannot be answered due to insufficient
information about what "better" means. are you after
performance or reliability?
- erik
this is what i was looking into just this morning. i wanted to
package factotum - and others - individually in hope that more
programs use it.
i spent about 3 hours faffing with debian's packaging tools, then
remembered that i have work to do :-(
That's part of the question: which is better? -> Why? <-
Maybe I should say 'explain your answer' :-)
ron
OK, there is a go version that lucho wrote: https://code.google.com/p/govt/
you have to define better first, and you have to define
what you mean by "flushing immediately". i see three
general approaches to this problem, flush eventually,
flush immediately, and flush before ack. this is the same
dilemma any non content-addressed disk has. performance
vs. safety. and of course one size doesn't fit all, so there are knobs in
most disks to turn off write caching.
this is a cs101 prerequisite question, is it not?
- erik
are you alluding to the fact that the client has no
way of doing a synchronize cache with the venti disk?
- erik
it's not as obvious a tradeoff as it seems.
Anyway, I'm more interested in hearing from people who do something
with the code.
ron
It's a matter of laziness; I'd rather port venti to POSIX once rather
than port p9p to many things. There are just enough
platform-dependent bits in p9p to make it enough of an annoyance for
me to go the POSIX route.
Funny you should say that!
Maybe I should post my half-assed ideas on extending hgfs for
commits etc. I was thinking a proper frontend/backend
separation of an SCM-FS would allow one to choose hg/git/venti
or even some distributed hash thingy as the backend.
Pay attention to vtsync? Maybe not for your mythical multiTB
ramflash but in real life syncing on every write is expensive.
[As I see it] in a sense venti has an atomic `changeset'
concept (each changeset maps to a single "fingerprint"). A
partial changeset is of not much use.
> Pay attention to vtsync? Maybe not for your mythical multiTB
> ramflash but in real life syncing on every write is expensive.
are you sure? On a multicore server, why not have a syncing task and a
serving task? Since all of the arena is in ram, the syncing task will
not interfere with the serving task, esp. if sata controller and
network are on different PCI busses.
I don't think the tradeoffs are obvious at all.
ron
I'm about to bench lucho's server on a 32GB arena (all of which will
be mmap'ed of course).
ron
flash has noticeable latency.
> [As I see it] in a sense venti has an atomic `changeset'
> concept (each changeset maps to a single "fingerprint"). A
> partial changeset is of not much use.
on the other hand, not every write is a meaningful state to
the client.
- erik
Not sure we are on the same page.... Possible I missed what
you are really asking!
I thought you were comparing your implementation with lucho's.
From a quick scan of your mmap based code it seems you do an
msync on every write which I think is excessive.
I don't know under what conditions vtsync is sent but
presumably the client sends it at least at the end of an
update. But that doesn't stop the server from doing
opportunistic syncs in a separate thread to reduce the amount
of work that remains to be done when it receives an actual
vtsync from the client. But when it does receive one it has to
ensure that all the data is synced before responding.
> I don't think the tradeoffs are obvious at all.
I thought that was obvious!
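The scheme described above (opportunistic background syncs, plus a full sync before answering vtsync) can be modelled with two watermarks. This is a single-threaded sketch with hypothetical names (SyncState, syncsome, syncall); the flush callback is an assumption standing in for msync or fsync, and countflush exists only so the example can be exercised without a disk. A real server would call syncsome from its background thread and would need locking around the state.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SyncState SyncState;
struct SyncState {
	size_t written;	/* bytes appended to the log so far */
	size_t synced;	/* bytes known to be durable; synced <= written */
	int (*flush)(size_t from, size_t to);	/* make [from, to) durable */
};

/* Stand-in flush for the example: just counts the bytes it was asked to sync. */
static size_t nflushed;

static int
countflush(size_t from, size_t to)
{
	nflushed += to - from;
	return 0;
}

/* Opportunistic step: sync at most max of the outstanding bytes. */
int
syncsome(SyncState *s, size_t max)
{
	size_t n;

	assert(s->synced <= s->written);
	n = s->written - s->synced;
	if(n > max)
		n = max;
	if(n == 0)
		return 0;
	if(s->flush(s->synced, s->synced + n) < 0)
		return -1;
	s->synced += n;
	return 0;
}

/* On vtsync: everything outstanding must be durable before the reply goes out. */
int
syncall(SyncState *s)
{
	return syncsome(s, s->written - s->synced);
}
```

The more the background step has already covered, the less syncall has left to do when the client's vtsync arrives.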
that doesn't sound synchronous to me. what i think of when
i think of flush on write is that the i/o is done before the reply
to the write. this has two implications, there's no way to do
any elevatoring, and you take a full round-trip to the disk
delay for each write, no amortization is possible.
i would think that the client is in the best position
to tell the storage when things must be flushed.
it might be best to only write when told to flush and do so in
such a way that it's clear if the transaction has finished. that way,
if you're really careful and flush caches down to the storage media,
you can recover if things go sideways.
- erik
> Is it goinstallable? If so, I'm not sure what I'm doing wrong. I very
> rarely use any 3rd party Go code but my own :-).
no idea. I just hg clone'd and did a make
ron
goinstall govt.googlecode.com/hg/vt/vtclnt
goinstall govt.googlecode.com/hg/vt/vtsrv
Works for me.
fhs
ron
yes, SplitN was introduced in release r59. Latest weekly will also work.
fhs
Venti would make a great backend for Git. I believe Git's commit and tree
format are simple enough to be re-implemented, if porting proves to be too
bothersome.
either the whole .git/ directory would be held in venti, or just the
.git/objects/ -- the storage.
----
on the other hand, Git has that interesting feature that a bunch of recent
commits is held in loose files, while older commits are re-packed into space-
efficient format using deltas.
was similar strategy ever considered for Venti? as in, to keep fresh data in
present-day format, but migrate older data into denser format?
--
dexen deVries
> Venti would make a great backend for Git.
I guess you could do this; it'd be interesting to do something more in
keeping with the Unix model than git is.
Create repo:
    dd if=/dev/zero etc. etc.
checkout:
    unvac
commit:
    vac etc.
compare two trees:
    vacfs score1 /tree1
    vacfs score2 /tree2
    diff-somehow /tree1 /tree2
It seems to me that all the things that are done with git today, with
all its special purpose commands, might be done with a Unix tool
approach. Plus, some of the git commands are pretty interesting and
might be useful if they could be applied to other contexts.
ron
#include <u.h>
#include <libc.h>
#include <bio.h>
#include <flate.h>
#include <thread.h>
#include <venti.h>
#include <libsec.h>
typedef uchar byte;
typedef u64int uint64;
typedef u32int uint32;
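
// in-memory index entry: maps a score to its offset in the data log
typedef struct IEntry IEntry;
struct IEntry
{
	IEntry *link;
	// disk data
	byte score[VtScoreSize];
	uint64 offset;
};

// one block of data, uncompressed, with its score
typedef struct Chunk Chunk;
struct Chunk
{
	byte score[VtScoreSize];
	uint32 size;
	byte *data;
};

IEntry **ihash;	// hash table of index entries
uint nihash;	// number of buckets, always a power of two
uint nientry;	// number of entries in the table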

void
rehash(void)
{
	IEntry **new, *e, *next;
	uint i, n;
	uint32 h;

	// double the table; it must be sized for the new bucket count n
	n = nihash<<1;
	new = vtmallocz(n*sizeof new[0]);
	for(i=0; i<nihash; i++) {
		for(e = ihash[i]; e; e = next) {
			next = e->link;
			h = *(uint32*)e->score & (n - 1);
			e->link = new[h];
			new[h] = e;
		}
	}
	free(ihash);
	ihash = new;
	nihash = n;
}

IEntry*
ilookup(byte *score)
{
	uint32 h;
	IEntry *e;

	// Not a great hash; assumes we are
	// seeing all blocks, not just some chosen subset.
	h = *(uint32*)score & (nihash - 1);
	for(e = ihash[h]; e; e = e->link)
		if(memcmp(e->score, score, VtScoreSize) == 0)
			return e;
	return nil;
}

IEntry*
iinsert(byte *score)
{
	uint32 h;
	IEntry *e;

	if(nihash < (1<<28) && nientry > 2*nihash)
		rehash();
	h = *(uint32*)score & (nihash - 1);
	e = vtmallocz(sizeof(IEntry));
	e->link = ihash[h];
	ihash[h] = e;
	memmove(e->score, score, VtScoreSize);
	nientry++;	// count entries so the rehash test above can fire
	return e;
}

// load the text index: one "score offset" pair per line
void
iload(Biobuf *b)
{
	char *p;
	char *f[10];
	int nf;
	byte score[VtScoreSize];
	uint64 offset;
	IEntry *e;

	while((p = Brdline(b, '\n')) != nil) {
		p[Blinelen(b)-1] = '\0';
		nf = tokenize(p, f, nelem(f));
		if(nf != 2 || vtparsescore(f[0], nil, score) < 0
		|| (offset = strtoull(f[1], 0, 0)) == 0) {
			sysfatal("malformed index");
			return;
		}
		e = iinsert(score);
		e->offset = offset;
	}
}

void
iwrite(int fd, IEntry *e)
{
	fprint(fd, "%V %-22llud\n", e->score, e->offset);
}

enum {
	ArenaBlock = 1<<30
};
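
// append a chunk to the data log; returns the offset of its header
uint64
dwrite(int fd, Chunk *c)
{
	byte *zdat, *w;
	int nzdat;
	uint nw;
	uint64 offset, eoffset;

	zdat = vtmallocz(c->size + 1024);
	nzdat = deflateblock(zdat, c->size + 1024, c->data, c->size, 6, 0);
	// compare as nzdat + 512 > size: c->size is unsigned, so
	// c->size - 512 would wrap around for blocks under 512 bytes
	if(nzdat < 0 || (uint)nzdat + 512 > c->size) {
		// don't bother with compression
		w = c->data;
		nw = c->size;
	} else {
		w = zdat;
		nw = nzdat;
	}
	offset = seek(fd, 0, 1);
	// header is 2*VtScoreSize hex digits plus two 12-byte fields and a newline
	eoffset = offset + 2*VtScoreSize + 12 + 12 + 1 + nw;
	if(eoffset / ArenaBlock != offset / ArenaBlock) {
		// never let a record straddle an ArenaBlock boundary
		offset /= ArenaBlock;
		offset++;
		offset *= ArenaBlock;
		seek(fd, offset, 0);
	}
	fprint(fd, "%V %-11ud %-11ud\n", c->score, c->size, nw);
	write(fd, w, nw);
	free(zdat);
	return offset;
}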

// read the chunk at the current offset of b, inflating it if needed
void
dread(Biobuf *b, Chunk *c)
{
	char *p, *f[10];
	int nf;
	uint zsize;
	byte *r;
	uint64 offset;
	char buf[100];

	offset = Boffset(b);
	p = Brdline(b, '\n');
	if(p == nil || Blinelen(b) >= sizeof buf)
		sysfatal("malformed data - EOF");
	memmove(buf, p, Blinelen(b));
	buf[Blinelen(b)-1] = '\0';
	nf = tokenize(buf, f, nelem(f));
	if(nf != 3 || vtparsescore(f[0], nil, c->score) < 0
	|| (c->size = strtoul(f[1], 0, 0)) == 0
	|| (zsize = strtoul(f[2], 0, 0)) == 0) {
		sysfatal("malformed data at %llud / %d", offset, nf);
		return;
	}
	c->data = vtmalloc(c->size);
	// equal sizes mean the block was stored uncompressed
	if(c->size == zsize)
		r = c->data;
	else
		r = vtmallocz(zsize);
	if(Bread(b, r, zsize) != zsize)
		sysfatal("short read at %llud", offset);
	if(c->size != zsize) {
		if((nf = inflateblock(c->data, c->size, r, zsize)) < 0)
			sysfatal("inflateblock fail %d %d %d %.10H...", c->size, zsize, nf, r);
		free(r);
	}
}

Biobuf *bindexr;
Biobuf *bdatar;
int indexw, dataw;
Biobuf *bsha1; // TODO
int doCreate;
int verbose;

// open name for writing (*w) and buffered reading (*r),
// creating it with a header line if -c was given
void
doOpen(char *name, char *what, int *w, Biobuf **r)
{
	int fd;
	char buf[100];
	char *p;

	snprint(buf, sizeof buf, "# sventi %s\n", what);
	if((*w = open(name, OWRITE)) < 0) {
		if(!doCreate)
			sysfatal("open %s: %r", name);
		if((*w = create(name, OWRITE, 0644)) < 0)
			sysfatal("create %s: %r", name);
		write(*w, buf, strlen(buf));
	}
	if((fd = open(name, OREAD)) < 0)
		sysfatal("open %s: %r", name);
	*r = Bfdopen(fd, OREAD);
	p = Brdline(*r, '\n');
	// the first line must be exactly the header written above
	if(p == nil || Blinelen(*r) != strlen(buf) || memcmp(p, buf, strlen(buf)) != 0)
		sysfatal("%s is not a sventi %s file", name, what);
}

void
usage(void)
{
	fprint(2, "usage: sventi [-cv] [-a address] [-d dir]\n");
	threadexitsall("usage");
}

void
threadmain(int argc, char **argv)
{
	VtReq *r;
	VtSrv *srv;
	char *address, *dir;
	Chunk c;
	IEntry *e;

	nihash = 16;
	ihash = vtmallocz(nihash * sizeof ihash[0]);
	fmtinstall('F', vtfcallfmt);
	fmtinstall('V', vtscorefmt);
	fmtinstall('H', encodefmt);
	address = "tcp!*!venti";
	dir = ".";
	deflateinit();
	inflateinit();

	ARGBEGIN{
	case 'v':
		verbose++;
		break;
	case 'a':
		address = EARGF(usage());
		break;
	case 'c':
		doCreate = 1;
		break;
	case 'd':
		dir = EARGF(usage());
		break;
	default:
		usage();
	}ARGEND
	if(argc != 0)
		usage();

	if(chdir(dir) < 0)
		sysfatal("chdir %s: %r", dir);
	doOpen("sventi.index", "index", &indexw, &bindexr);
	doOpen("sventi.data", "data", &dataw, &bdatar);
	iload(bindexr);
	Bterm(bindexr);
	// append-only: position both writers at end of file
	seek(indexw, 0, 2);
	seek(dataw, 0, 2);

	srv = vtlisten(address);
	if(srv == nil)
		sysfatal("vtlisten %s: %r", address);
	while((r = vtgetreq(srv)) != nil) {
		r->rx.msgtype = r->tx.msgtype+1;
		if(verbose)
			fprint(2, "<- %F\n", &r->tx);
		switch(r->tx.msgtype) {
		case VtTping:
			break;
		case VtTgoodbye:
			break;
		case VtTsync:
			fsync(dataw);
			fsync(indexw);
			break;
		case VtTread:
			e = ilookup(r->tx.score);
			if(e == nil) {
				r->rx.msgtype = VtRerror;
				r->rx.error = vtstrdup("block not found");
				break;
			}
			Bseek(bdatar, e->offset, 0);
			dread(bdatar, &c);
			if(memcmp(c.score, r->tx.score, VtScoreSize) != 0)
				sysfatal("data/index score mismatch");
			r->rx.data = packetforeign(c.data, c.size, free, c.data);
			// recompute the score from the data as a consistency check
			packetsha1(r->rx.data, c.score);
			if(memcmp(c.score, r->tx.score, VtScoreSize) != 0)
				sysfatal("data/data score mismatch");
			break;
		case VtTwrite:
			packetsha1(r->tx.data, r->rx.score);
			e = ilookup(r->rx.score);
			if(e != nil)	// already stored
				break;
			c.size = packetsize(r->tx.data);
			c.data = vtmalloc(c.size);
			packetconsume(r->tx.data, c.data, c.size);
			memmove(c.score, r->rx.score, VtScoreSize);
			e = iinsert(c.score);
			e->offset = dwrite(dataw, &c);
			free(c.data);
			iwrite(indexw, e);
			break;
		}
		if(verbose)
			fprint(2, "-> %F\n", &r->rx);
		vtrespond(r);
	}
	threadexitsall(nil);
}
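
The offset arithmetic in dwrite above may be easier to see in isolation. Each record is a fixed-width text header (40 hex digits of score, two space-separated 11-column numbers, and a newline, 65 bytes in all) followed by the payload, and a record that would reach into the next 1GB ArenaBlock is pushed to the start of that block instead. A standalone sketch in portable C; recordstart, HdrSize and ARENABLOCK are hypothetical names that mirror the listing rather than anything it exports:

```c
#include <assert.h>
#include <stdint.h>

enum {
	ScoreSize = 20,				/* VtScoreSize: bytes in a SHA-1 score */
	HdrSize = 2*ScoreSize + 12 + 12 + 1	/* "%V %-11ud %-11ud\n" */
};

#define ARENABLOCK ((uint64_t)1 << 30)

/*
 * Given the current end of the log and a payload of nw bytes,
 * return the offset at which the record header would be written.
 */
uint64_t
recordstart(uint64_t off, uint32_t nw)
{
	uint64_t end;

	end = off + HdrSize + nw;
	/* same test as dwrite: start and end must fall in the same block */
	if(end / ARENABLOCK != off / ARENABLOCK)
		off = (off / ARENABLOCK + 1) * ARENABLOCK;
	return off;
}
```

Note that, as in the listing, a record that would end exactly on a boundary is also moved, so up to one record's worth of space can go unused at the tail of each block.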
Shouldn't it be 'ventina'?
Venti seems feminine.