But I don't think the re-serializability will be a common problem. I'd value easy parsing over easy serialization.
I imagine y'all have considered this, but I'm curious: What about
requiring all blobs have Mime headers? Since you're transacting them via
http anyway, you have to handle mime headers coming and going; why not just
store them with the header included? You could potentially make the blob
server work with mime headers only, and then the indexer could use JSON
schemas (wrapped under mime headers) for the more complex issues. This makes
it trivial (and robust) to identify schema blobs, and also makes it easier
down the road to introduce other kinds of schema blobs (e.g. maybe some day
you decide to use non-JSON schemas for some reason). Also provides a Content-Type:
for direct http access to blobs. And you could potentially put claim signatures
in the mime header rather than in the JSON, which not only eliminates the JSON
serialization fudging, but now allows the signature verification to happen as a
sort of pre-filter before even parsing the claim, which is one more level of
security against attacks (and also means if you do have non-JSON claim processing
down the road the signature verification is already handled). And in a way it
makes the blob server simpler because it no longer needs JSON, and the indexer
simpler because its JSON is unadulterated. And now the indexer can do things
like catalog all images.
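
To make the header-plus-body idea concrete, here is a minimal Python sketch (not anything from the Camli code). It assumes a stored blob is CRLF-terminated header lines, a blank line, then the raw body. The function name split_stored_blob and the X-Claim-Signature header are made up for illustration.

```python
from typing import Dict, Tuple


def split_stored_blob(raw: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split a stored blob into (headers, body) at the first blank line."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line between headers and body")
    headers: Dict[str, str] = {}
    for line in head.split(b"\r\n"):
        name, colon, value = line.partition(b":")
        if not colon:
            raise ValueError("malformed header line: %r" % line)
        # Header names are case-insensitive, so normalize to lowercase.
        key = name.decode("ascii").strip().lower()
        headers[key] = value.decode("ascii").strip()
    return headers, body


# Hypothetical stored blob: a type header, a signature header, then JSON.
stored = (
    b"Content-Type: application/json\r\n"
    b"X-Claim-Signature: hypothetical-signature\r\n"
    b"\r\n"
    b'{"example": true}'
)
headers, body = split_stored_blob(stored)
print(headers["content-type"])
print(body)
```

The blob server only needs the split; the body stays untouched for whoever parses it later.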
-Brandyn
--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------
When you reach the end of your rope,
cut it and start a new metaphor.
Kenton Varda said (on Jan 31):
> How is the indexer going to figure out the type of blobs in general? [...]
> I understand the desire to avoid attaching any kind of type information to
> blobs, but it seems like it may make a lot of things difficult. (Speaking
> of which, what are you going to return in the Content-Type: header for a
> random blob?)
Why does the indexer care if it's a camli blob or not? Doesn't it need to index other blobs, too?
How is the indexer going to figure out the type of blobs in general?
Not all file types have unique magic numbers, so it seems like the indexer would need some pretty complicated content detection to be effective.
Detecting camli blobs -- even without the magic sequence -- ought to be relatively easy. 99% of non-JSON files can be ruled out after just a few bytes; for the rest, load and parse the whole thing, and then inspect the parsed object.
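
A rough Python sketch of that two-stage check, under assumptions: the cheap test is "first non-whitespace byte is '{'", and the inspection step looks for a top-level "camliVersion" key. The key name and the function name are assumptions for illustration, not something stated in this thread.

```python
import json


def looks_like_camli_blob(raw: bytes) -> bool:
    """Cheap prefix check first, then a full parse and inspection."""
    # Stage 1: most non-JSON data is ruled out by its first byte.
    if raw.lstrip()[:1] != b"{":
        return False
    # Stage 2: parse the whole thing; any failure means "not a schema blob".
    try:
        obj = json.loads(raw.decode("utf-8"))
    except ValueError:
        # Covers both UnicodeDecodeError and json.JSONDecodeError.
        return False
    # Stage 3: inspect the parsed object (assumed marker key).
    return isinstance(obj, dict) and "camliVersion" in obj


print(looks_like_camli_blob(b'{"camliVersion": 1, "camliType": "permanode"}'))
print(looks_like_camli_blob(b"\x89PNG\r\n\x1a\n"))
print(looks_like_camli_blob(b'{"unrelated": "json"}'))
```

Note that this only says the bytes parse as a schema-shaped object; it cannot say whether the author meant them as one.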
I understand the desire to avoid attaching any kind of type information to blobs, but it seems like it may make a lot of things difficult.
(Speaking of which, what are you going to return in the Content-Type: header for a random blob?)
I totally agree with the goal of keeping it as simple as possible, and
mostly like the current approach in spirit, but my spidey-sense has me worried. :)
Since you ask, here's my answer to the above Q's, starting with explanation:
In any robust computational environment (everything from assembly language
to high level language to data structures to file formats to whatever), "type"
is defined externally to the object itself. A float in a C struct is just bits,
but you know it's a float because of where it is in the struct. An "object"
in C++ is firstly a C++ object -- that is its type. Knowing that, you can find
the pointer to its class (just more bits--but you know what they mean because of
where they are) and the class tells you how to interpret the body of the object,
so still the bits in the object are externally typed. I.e., even "self-identifying"
objects are just more of the usual hierarchy of externally typed data.
Knowing the type is crucial to proper application of any data. If you confuse
an int for a float (which you can't tell apart just from their bits...) you think
it means one thing when in fact it means another. This is why all designed systems
have 100% reliable ways of externally identifying type.
The rare exception to this is something like the "file" command, which you
mention, which actually parses the raw data to decide what it is. But file was
an afterthought for how to deal with an evolved system--not a designed solution.
File extensions are another attempt to provide external type information for the
contents of a file. The "file" command to some extent was necessitated into
existence because people weren't reliably adhering to the file extension protocol,
and because there was no central authority on file extensions--resulting in multiple
types sharing the same extensions. I.e., file systems did not provide a standard
mechanism for externally typing a file, and the result is a mish-mash of hacks to
try to work around that. The consequence is occasional failure -- "file" sometimes
makes mistakes, and things break when that happens.
To give a concrete example of where things could go haywire with Camli as
currently defined: I write a camli-backed IM client. I stuff the
raw body of each message into a blob, put the meta data in another blob. Pretty
straightforward, clean, seems fine... Then you guys start using it, and in the
middle of a technical discussion about the indexer, someone starts pasting the
bodies of schema blobs into the chat window as illustrative examples.
Oops!
What _should_ happen is the body of the chat text blob is only ever treated
as chat text. What _does_ happen is the indexer sniffs out a claim and starts
editing state, and some time later you realize your data is screwed up and it
becomes a mysterious bug that takes five years to track down.
In my opinion, there are two clean options: Either all blobs are completely
opaque (the blob server never looks inside, ever, other than to verify checksum),
or they are self-identifying (maybe nothing more than the mime type--not nec.
a general "header").
Really either of those is probably fine, it just distributes the burden
slightly differently.
The self-identifying case I already kinda described. The opaque case would
work something like this:
The blob server knows nothing about what's in the blobs. Any code that does
look inside, like the indexer, always knows the type first. So, for instance, to
submit a new claim, I can't just build the claim and throw it into my collection
and expect my indexer to pick it up. Rather, I submit it directly to my indexer
as a claim and my indexer in turn stows it via the blob server while also adding
the claim's blobref to a list of known claims. Assuming the indexer maintains
its own state in blobs, the indexer needs to allocate a permanode for that state
when I first set it up, and it better remember the blobref of that permanode
(e.g., in a config file)!
That approach is conceptually more pure and closer to how most databases
and programming languages actually work, but it is _very brittle_ against any
mishaps. The self-identifying case (mime type) is a little mucky because
it's not super clear what "type" to assign some blobs. (Is a C source file
a C source file or a Text file? The best answer to this is probably: It is
both, because C source file is a subclass of Text file, but this opens up the
can of worms of type taxonomy...) So... I appreciate that both of what I am
calling the two "clean" options are not without problems, and I grok why you
have chosen what you have. However, personally, all things weighed in, I
would probably choose self-identifying blobs myself, and probably with Only
a cr-terminated type name (not full mime headers), and probably using mime
type names barring good reason not to. Any type taxonomies would be up to
the higher level code (typically you would label something a "C" file and
the higher level would know this is a subclass of "Text"). This would make
a Blob analogous to an Object in any object-oriented language, where there
is a class identifier, and the data, and all further interpretation of the
data is up to the class.
blob = (type, data)
So given all that context...
> What happens when I have a blob with bytes "{}" with a mime type of
> "text/json" and bytes "{}" with type "application/json"?
In my opinion, these are two different things, just like an integer
and a float that happen to have the same binary representation are two
different things. The point of content addressability is to find the same
conceptual object by the same name, not to capture some coincidental similarity
between two unrelated things that happen to be represented by the same bits.
I would hash the body and type together. If you change the type, it's
a new object.
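
A small Python sketch of hashing the type and body together, assuming a newline-terminated type line in front of the data and SHA-1 as the hash. The function name typed_blobref and the exact framing are assumptions for illustration.

```python
import hashlib


def typed_blobref(blob_type: str, data: bytes) -> str:
    """Name a blob by the hash of its type line plus its data."""
    if "\n" in blob_type or "\r" in blob_type:
        raise ValueError("type name must fit on one line")
    # blob = (type, data): the type line is part of what gets hashed.
    framed = blob_type.encode("ascii") + b"\n" + data
    return "sha1-" + hashlib.sha1(framed).hexdigest()


a = typed_blobref("text/json", b"{}")
b = typed_blobref("application/json", b"{}")
print(a)
print(b)
print(a == b)
```

Same bytes, different type, different name; the same (type, data) pair always maps to the same name.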
For claims, I would probably do the same thing again: The second
line of the blob (after the type) would be a version number and the
signature, followed by a CR, followed by unadulterated JSON. (Or version
number \n signature \n)
Point is, once the type ("camli/claim") is well known, we're free
to do whatever we want with the next layer in.
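
One possible reading of that claim layout, sketched in Python. It assumes newline terminators, a "version signature" second line separated by a single space, and a placeholder signature string; parse_claim_blob is a hypothetical helper and no real signature verification is done here.

```python
import json
from typing import Tuple


def parse_claim_blob(raw: bytes) -> Tuple[int, str, bytes]:
    """Return (version, signature, json_bytes) without touching the JSON."""
    type_line, sep, rest = raw.partition(b"\n")
    if not sep or type_line != b"camli/claim":
        raise ValueError("not a camli/claim blob")
    sig_line, sep, body = rest.partition(b"\n")
    if not sep:
        raise ValueError("missing version/signature line")
    version, space, signature = sig_line.partition(b" ")
    if not space:
        raise ValueError("expected '<version> <signature>'")
    return int(version.decode("ascii")), signature.decode("ascii"), body


raw = b'camli/claim\n1 placeholder-signature\n{"claimType": "set-attribute"}'
version, signature, body = parse_claim_blob(raw)
# A real implementation would verify `signature` over `body` here,
# before parsing the JSON at all.
claim = json.loads(body)
print(version, signature, claim["claimType"])
```

The JSON bytes come back exactly as stored, so a signature over them needs no re-serialization tricks.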
Dean suggests:
> perhaps something like <hash>-<hexdigest>-<type/subtype;parameters>.
Actually, that might be even better... The downside is the blobrefs
get bigger (and they're replicated a lot...). The upside is it's easy
for the blob server to symlink like <hash>-<hexdigest>'s together... and
also the type now is visible on a blobref which makes human-inspection
of goings-on easier. Offhand I would say "parameters" should probably
be pushed inside the file, so I might use:
<hash>-<hexdigest>.<subtype.type>
Which is generally backward compatible with file-extensions and
would allow file (and blobref) expansions like "*.camli" or "*.claim.camli"
or "*.text". (Alternately, <subtype.type>.<hash>-<hexdigest> would sort
very nicely.)
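
A quick Python sketch of the <hash>-<hexdigest>.<subtype.type> naming and the glob-style expansions. It assumes SHA-1 over the data only (so like digests can still be grouped), and the helper names and the "permanode.camli" suffix are made up for illustration.

```python
import hashlib
from fnmatch import fnmatchcase
from typing import Tuple


def make_typed_blobref(data: bytes, subtype_dot_type: str) -> str:
    """Build '<hash>-<hexdigest>.<subtype.type>' for the given data."""
    return "sha1-%s.%s" % (hashlib.sha1(data).hexdigest(), subtype_dot_type)


def split_typed_blobref(ref: str) -> Tuple[str, str, str]:
    """Split a typed blobref into (hash name, hexdigest, type suffix)."""
    name, dot, type_suffix = ref.partition(".")
    hash_name, dash, hexdigest = name.partition("-")
    if not (dot and dash):
        raise ValueError("not a typed blobref: %r" % ref)
    return hash_name, hexdigest, type_suffix


refs = [
    make_typed_blobref(b"{}", "claim.camli"),
    make_typed_blobref(b"{}", "permanode.camli"),
    make_typed_blobref(b"hello", "text"),
]
# File-extension style expansion over blobrefs.
claims = [r for r in refs if fnmatchcase(r, "*.claim.camli")]
all_camli = [r for r in refs if fnmatchcase(r, "*.camli")]
print(claims)
print(len(all_camli))
```

The first two refs share a hexdigest and differ only in the suffix, which is the case where the blob server could link them together.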
Anyway, just ideas.
-Brandyn
--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------
A cold pot never boils.
Brandyn Webb said (on Feb 1):
> Assuming the indexer maintains
> its own state in blobs, the indexer needs to allocate a permanode for that state
> when I first set it up, and it better remember the blobref of that permanode
> (e.g., in a config file)!
Actually, this doesn't work at all on second thought. The indexer would
have to maintain at least some persistent dynamic state outside of the blobs,
which is kind of hazardous. (Realistically an indexer will end up doing that
anyway, but conceptually that's all "cache" whereas in the above example it
would be critical, conceptually unrecoverable state.)
> > perhaps something like <hash>-<hexdigest>-<type/subtype;parameters>.
>
> Actually, that might be even better... The downside is the blobrefs
> get bigger (and they're replicated a lot...). The upside is it's easy
> for the blob server to symlink like <hash>-<hexdigest>'s together...
(But I still think this would almost never happen, so don't
consider it important personally.)
-Brandyn
--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------
"A romantic is a person who seeks sublime moments, those
rare experiences that are at the limits of human emotion,
endurance, and understanding." -Greg Robbins
I was agnostic on this issue until I read your argument, but I think I
reached a different conclusion than you did.
With the above (and the rest of what you wrote) in mind, it seems to me
that blobs should just be blobs, with no type information whatsoever,
including magic, and it is up to the layers above to define meaning for
those blobs.
And further, the layers above should define meaning by *context*, not by
sniffing. If I'm holding some data that tells me that
sha1-d26e20e8bcbd7911b0ad257d65c1440c00681687 is where I can find a
profile picture for you, then my app needs to know what "profile
picture" means in this context in order to know what to do with the blob.
This doesn't mean that sniffing can't be a valid processing model for
certain application protocols built on top of the blob store. For
example, whatever application protocol defined "profile picture" in my
example above might define it as "either a JPEG, PNG or GIF image, to be
distinguished using header sniffing". The sniffing method here is
well-defined, so the behavior is predictable and the set of possible
outcomes is much smaller.
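
For instance, a context-scoped sniffer for the "profile picture" case might look like the Python sketch below. The function name is hypothetical; the byte signatures are the standard JPEG, PNG and GIF file prefixes.

```python
def sniff_profile_picture(blob: bytes) -> str:
    """Return the MIME type of a profile picture, or raise if it is none
    of the three formats this (hypothetical) protocol allows."""
    if blob.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if blob.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if blob.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # Anything else is an error in this context, not a cue to keep guessing.
    raise ValueError("profile picture must be JPEG, PNG or GIF")


print(sniff_profile_picture(b"\x89PNG\r\n\x1a\n" + b"rest of file"))
print(sniff_profile_picture(b"GIF89a" + b"rest of file"))
```

Because the caller already knows it is holding a profile picture, a blob of JSON here is simply rejected rather than interpreted.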
But to just pluck a random blob out of the blob store and try to guess
its type with no context whatsoever seems like folly, since the same
blob could be referred to in two contexts with different processing
expected for each.
> How is the indexer going to figure out the type of blobs in general?
Like file(1).
> Not all file types have unique magic numbers, so it seems like the indexer would need some pretty complicated content detection to be effective.
Almost all interesting files do.
Anything else can have a camli json wrapper for other files/blobs.
Indeed, the camlistore source tree already includes just such a file
(doc/json-signing/example/some-notes.txt.camli); will uploading the
camlistore source confuse/break a future version of camlistore?
Expecting developers to sanitize application data to avoid this seems
likely to lead to a lot of difficult to track down bugs and (possibly)
security vulnerabilities.
I actually agree with this philosophically.
The problem is that the blobs are immutable. It's perhaps a little
subtle why this is an issue, but you have to think all the way through how
you would actually implement the indexer.
I assume there is a design objective that the collection of blobs
completely defines the database, and that anything outside the collection of
blobs is conceptually just cache. This is analogous to, say, the tuples
in an SQL db completely defining the data, and any indices are just caches
for speed and can be (reliably) regenerated if lost.
As far as I can figure (maybe I'm just missing an implementation or
representation trick), you can't have immutable blobs AND no external
data AND no blob sniffing.
If it's not clear why, let me know and I can elaborate.
For what it's worth, the architecture I was starting to sketch out
before finding camli was similar, but allows mutable objects. Similar
to camli, the names are content hashes, and I have essentially
the same thing as a permanode as the root of mutable objects, but
my "blob server" directly supports something analogous to Camli's
"permanode-become" (but for any object, not just file systems...),
which are essentially "pointers" and provide a single and concise
point of mutability.
Now, I suppose you could say these are still two separate layers,
and you could do the same thing in Camli by having the indexer store
essential state (current permanode root), but to me that feels unclean
because now you have two supposedly separate data stores which Must be
kept perfectly in sync or there is corruption. To me that pretty well
defines one inseparable layer, so I think of it as such. May just be
semantics.
So, I agree with you, actually, but with the condition that the
blob server directly supports permanode-become. Without that, afaict,
you Must store essential state in the Indexer, which imo is hazardous.
-Brandyn
--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------
Humans are like Slinkies. Not good for much, but you just can't
help but smile when you see one go down a flight of stairs.
(Anonymous)
Likewise. I'm going to return to working on my own model, since for my
purposes I need "provable" robustness and a sniffing model will always be
slightly leaky, but there remains a great deal of overlap so if I can find
a way down the road to use the camli back-end I will. (Open to collaboration
if anybody else is working on a similar variant.)
-Brandyn
--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------
When the going gets tough, the tough get going.
But the rest of the time they're just bored.