the magic camliVersion key


Martin Atkins

Jan 31, 2011, 4:58:04 PM
to Camlistore
Hi,

Still trying to load all of the information about this project into my
brain, but I did want to chime in on this magic "camliVersion" JSON
property.

I have some mixed feelings about the approach of using a JSON key for
the magic and essentially constraining JSON syntax in order to make it
work.

In order for the "camliVersion" property to survive a
deserialize/serialize round-trip it'll require special touches to ensure
that you don't either end up with two camliVersion properties (due to
re-serializing a structure with it already embedded) or break the magic
sequence (by not handling it at all).

Given that this requires special handling anyway, I wonder what the
advantage is of making it parsable with a standard JSON parser. The
magic could easily be a fixed sequence of bytes placed *before* the
{ which a consumer would strip off (or replace with spaces) before
feeding it to the JSON parser. This still requires special handling
but this handling is outside of the JSON handling rather than
intermingled with it, and it will make it harder to accidentally get
it wrong without noticing.

Just a thought...

-Martin
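
A minimal sketch of the prefix approach described above: a fixed byte sequence sits in front of the JSON and is stripped before parsing, so the magic never enters the parsed structure. The prefix value and the helper names `wrap` and `unwrap` are assumptions for illustration; the thread does not settle on any particular bytes.

```python
import json

# Hypothetical magic prefix; no specific byte sequence is proposed in the thread.
MAGIC = b"CAMLI1\n"


def wrap(obj):
    """Serialize obj as JSON and put the magic prefix in front of it."""
    return MAGIC + json.dumps(obj).encode("utf-8")


def unwrap(blob):
    """Return the parsed JSON body, or None if the blob lacks the prefix."""
    if not blob.startswith(MAGIC):
        return None
    return json.loads(blob[len(MAGIC):].decode("utf-8"))


blob = wrap({"type": "permanode"})
print(unwrap(blob))              # the dict comes back without any magic key in it
print(unwrap(b'{"type": "x"}'))  # no prefix, so it is not treated as a schema blob
```

Because the prefix handling lives outside the JSON layer, a round-trip through `unwrap` and `wrap` cannot produce a duplicate key.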

Brad Fitzpatrick

Jan 31, 2011, 5:08:59 PM
to camli...@googlegroups.com
Yeah, I'm not sure how I feel about this either.  Remember that the only reason for the magic number prefix is to make the indexer's life easier.  It only needs to fetch the very beginning (with an HTTP GET with a Range header) to figure out what the thing is.  But we could relax that in the future saying that if the blob is under 1KB, the "camliVersion" could be anywhere.

Or say that the "magic number" for a JSON schema blob is the first byte being '{'.  If the first byte is '{' and camliVersion is found anywhere in the first 4KB, then it's fully parsed as JSON and we see if it has a "camliVersion" top-level key. 
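
A rough sketch of the relaxed check described above, assuming the whole blob is already in memory: look only at the first 4KB for a leading '{' and the string camliVersion, and only then pay for a full parse to confirm a top-level key. The function name and the constant are hypothetical.

```python
import json

SNIFF_BYTES = 4096  # the "first 4KB" window mentioned above


def looks_like_schema_blob(blob):
    """Cheap prefix check first, then a full parse to confirm a top-level key."""
    head = blob[:SNIFF_BYTES]
    if not head.startswith(b"{") or b"camliVersion" not in head:
        return False
    try:
        doc = json.loads(blob.decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and bad JSON
        return False
    return isinstance(doc, dict) and "camliVersion" in doc


print(looks_like_schema_blob(b'{"type": "permanode", "camliVersion": 1}'))
print(looks_like_schema_blob(b'{"nested": {"camliVersion": 1}}'))
```

The second call shows why the full parse matters: the string appears in the first 4KB, but not as a top-level key.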

I do like the property that the blobs are valid JSON, though.

Alternatively, I suppose the prepended magic you're proposing could be a mix of whitespace.  Maybe " \t    \t\t" (binary out of spaces and tabs for 67, ASCII for "C" as in Camli).  :-)
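
To make the whitespace idea concrete, here is a small sketch that encodes one byte as eight characters, with a space for a 0 bit and a tab for a 1 bit. The helper names are made up for illustration. Since JSON allows leading whitespace, a blob carrying this prefix still parses with a standard parser.

```python
import json


def whitespace_magic(byte_value):
    """Encode one byte as 8 characters: space for a 0 bit, tab for a 1 bit."""
    bits = format(byte_value, "08b")
    return "".join("\t" if bit == "1" else " " for bit in bits)


def decode_whitespace_magic(text):
    """Inverse of whitespace_magic for an 8-character space/tab string."""
    bits = "".join("1" if ch == "\t" else "0" for ch in text)
    return int(bits, 2)


magic = whitespace_magic(ord("C"))
blob = magic + '{"type": "permanode"}'

print(repr(magic))
print(decode_whitespace_magic(blob[:8]))
print(json.loads(blob))  # leading whitespace is ignored by the JSON parser
```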

But I don't think the re-serializability will be a common problem.  I'd value easy parsing over easy serialization.

Anybody else have preferences on magics of:

a) '{"camliVersion":'
b) " \t    \t\t"
c) '{'

?

Kenton Varda

Jan 31, 2011, 6:45:59 PM
to camli...@googlegroups.com
Why does the indexer care if it's a camli blob or not?  Doesn't it need to index other blobs, too?

How is the indexer going to figure out the type of blobs in general?  Not all file types have unique magic numbers, so it seems like the indexer would need some pretty complicated content detection to be effective.  Detecting camli blobs -- even without the magic sequence -- ought to be relatively easy.  99% of non-JSON files can be ruled out after just a few bytes; for the rest, load and parse the whole thing, and then inspect the parsed object.

I understand the desire to avoid attaching any kind of type information to blobs, but it seems like it may make a lot of things difficult.  (Speaking of which, what are you going to return in the Content-Type: header for a random blob?)

Torben Weis

Feb 1, 2011, 4:13:35 AM
to camli...@googlegroups.com


2011/1/31 Brad Fitzpatrick <br...@danga.com>
> But I don't think the re-serializability will be a common problem.  I'd value easy parsing over easy serialization.

My understanding of the JSON signing is that the signature won't survive serialization/deserialization because the signature depends on the ordering of fields and white spaces used by the original JSON byte sequence. Unless I misunderstood this, the re-serializability is not possible anyway, at least not for claims.
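
A short illustration of the point above, under stated assumptions: a SHA-1 digest stands in for the real signature (the actual signing scheme is not shown here), and the field names in the sample blob are invented. Parsing and re-serializing keeps the data but changes the bytes, so anything computed over the original bytes no longer matches.

```python
import hashlib
import json

# Hypothetical claim bytes with arbitrary whitespace, as the original author wrote them.
original = b'{"camliVersion": 1,\n  "type": "claim", "claimDate": "2011-02-01"}'


def digest(data):
    """Stand-in for a signature: any function of the exact byte sequence."""
    return hashlib.sha1(data).hexdigest()


reserialized = json.dumps(json.loads(original)).encode("utf-8")

print(json.loads(original) == json.loads(reserialized))  # same data
print(original == reserialized)                          # different bytes
print(digest(original) == digest(reserialized))          # so the digest differs too
```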

Greetings
Torben 

Brandyn Webb

Feb 1, 2011, 5:03:57 AM
to camli...@googlegroups.com
Kenton Varda said (on Jan 31):
> How is the indexer going to figure out the type of blobs in general? [...]

> I understand the desire to avoid attaching any kind of type information to
> blobs, but it seems like it may make a lot of things difficult. (Speaking
> of which, what are you going to return in the Content-Type: header for a
> random blob?)

I imagine y'all have considered this, but I'm curious: What about
requiring all blobs have Mime headers? Since you're transacting them via
http anyway, you have to handle mime headers coming and going; why not just
store them with the header included? You could potentially make the blob
server work with mime headers only, and then the indexer could use JSON
schemas (wrapped under mime headers) for the more complex issues. This makes
it trivial (and robust) to identify schema blobs, and also makes it easier
down the road to introduce other kinds of schema blobs (e.g. maybe some day
you decide to use non-JSON schemas for some reason). Also provides a Content-Type:
for direct http access to blobs. And you could potentially put claim signatures
in the mime header rather than in the JSON, which not only eliminates the JSON
serialization fudging, but now allows the signature verification to happen as a
sort of pre-filter before even parsing the claim, which is one more level of
security against attacks (and also means if you do have non-JSON claim processing
down the road the signature verification is already handled). And in a way it
makes the blob server simpler because it no longer needs JSON, and the indexer
simpler because its JSON is unadulterated. And now the indexer can do things
like catalog all images.

-Brandyn

--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------

When you reach the end of your rope,
cut it and start a new metaphor.

Dean Landolt

Feb 1, 2011, 7:21:04 AM
to camli...@googlegroups.com
I believe this is the point of canonical json -- defining a serialization that will be byte-equivalent no matter how many times it's roundtripped.
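
As a rough approximation of that idea (not an implementation of any particular canonical JSON specification), sorting keys and fixing the separators already gives a serialization that is stable across round-trips for simple data. The function name `canonical` is an assumption.

```python
import json


def canonical(obj):
    """Serialize with sorted keys and no optional whitespace."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


a = b'{"b": 2,   "a": 1}'
b = b'{"a":1,"b":2}'

once = canonical(json.loads(a))
twice = canonical(json.loads(once))

print(once)
print(once == canonical(json.loads(b)), once == twice)
```

A real canonical form also has to pin down number formatting and string escaping, which this sketch leaves to the library defaults.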

Torben Weis

Feb 1, 2011, 9:09:34 AM
to camli...@googlegroups.com

2011/2/1 Dean Landolt <de...@deanlandolt.com>
It is. However, my understanding of the signing process is that it is NOT canonical. The docs state that any serialization is ok, then you sign it and then you combine data and signature in one JSON document.

Greetings
Torben

Brad Fitzpatrick

Feb 1, 2011, 9:09:38 AM
to camli...@googlegroups.com
Exactly.  Re-serializability isn't a goal.  Blobs are the fundamental unit and you should just hold on to the blob if you want to route it along.  No need to parse it just to reserialize it to send it.

Dean Landolt

Feb 1, 2011, 7:46:16 AM
to camli...@googlegroups.com
This is an interesting idea -- having a MIME type around would be extraordinarily handy, and yeah, it could be parameterized with a signature and cut out the awkwardness around signing, and bypass the need for a magic number entirely. But where do you maintain this data without violating content-addressability? One option could be to inline it into the key itself, perhaps something like <hash>-<hexdigest>-<type/subtype;parameters>. Sure, this looks a little sloppy, but you already have a data structure buried in the key that needs to be parsed (simple as that parse may be). You could treat the key like a custom URI scheme -- in fact, it's already kind of designed that way.
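
A sketch of how such a key might be taken apart, assuming exactly the `<hash>-<hexdigest>-<type/subtype;parameters>` layout suggested above. The function name is hypothetical, and the split is limited to the first two hyphens because media types such as application/octet-stream contain hyphens themselves.

```python
def parse_typed_ref(ref):
    """Split '<hash>-<hexdigest>-<type/subtype;parameters>' into four parts."""
    hash_name, hexdigest, media = ref.split("-", 2)
    media_type, _, params = media.partition(";")
    return hash_name, hexdigest, media_type, params


ref = "sha1-d26e20e8bcbd7911b0ad257d65c1440c00681687-application/octet-stream;charset=utf-8"
print(parse_typed_ref(ref))
```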

Brad Fitzpatrick

Feb 1, 2011, 12:28:46 PM
to camli...@googlegroups.com
On Tue, Feb 1, 2011 at 2:03 AM, Brandyn Webb <brandy...@gmail.com> wrote:
Kenton Varda said (on Jan 31):
> How is the indexer going to figure out the type of blobs in general? [...]
> I understand the desire to avoid attaching any kind of type information to
> blobs, but it seems like it may make a lot of things difficult.  (Speaking
> of which, what are you going to return in the Content-Type: header for a
> random blob?)

> I imagine y'all have considered this,

yes   :-)

> but I'm curious:

You're not the only one.  This is quickly becoming a FAQ!  (Time to build a FAQ page.)

> What about
> requiring all blobs have Mime headers?  Since you're transacting them via
> http anyway, you have to handle mime headers coming and going; why not just
> store them with the header included?

That's metadata.  That's at the wrong layer.  The blobserver is dumb bytes.  If you want metadata, you go up a layer or so.

I asked a couple of people this question the other day (when they proposed/fought for the same thing):

What happens when I have a blob with bytes "{}" with a mime type of "text/json" and bytes "{}" with type "application/json"?  Are those different blobs (different blobrefs) or not?  If they're the same, how do I handle sync conflicts if two parties have each version and share?  Which metadata wins and why?  If they're different blobs, then what are you arguing for?  You want a prescribed encapsulation format for the tuple (mime type, data)?  That's what the camli "schema" is for:  prescribing ways to encode metadata.
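
The conflict in that question can be sketched as follows. `TypedStore` is a hypothetical store that keeps a MIME type beside each blob while still addressing blobs by the hash of their bytes alone; it is not how the blobserver works. With that design the two "{}" blobs collide on one blobref and the store has to pick a winner or refuse.

```python
import hashlib


def blobref(data):
    """Content address computed from the bytes only."""
    return "sha1-" + hashlib.sha1(data).hexdigest()


class TypedStore:
    """Hypothetical store that attaches a MIME type to each blobref."""

    def __init__(self):
        self.blobs = {}  # blobref -> (bytes, mime type)

    def put(self, data, mime_type):
        ref = blobref(data)
        existing = self.blobs.get(ref)
        if existing is not None and existing[1] != mime_type:
            raise ValueError(f"type conflict for {ref}: {existing[1]} vs {mime_type}")
        self.blobs[ref] = (data, mime_type)
        return ref


store = TypedStore()
store.put(b"{}", "text/json")
try:
    store.put(b"{}", "application/json")
except ValueError as err:
    print(err)
```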

Amusingly, when confronted with the above question, one person told me they were different blobs (they were thinking at the wrong layer) and one person told me they were the same blob (and had no answer for the sync conflict problem).

I feel like because the blobserver is the most complete implementation we have right now (and the easiest layer to understand conceptually), everybody wants to shove stuff into it.  I refuse to logically expand the blobserver's role until we flesh out the whole picture end-to-end and we absolutely need to add stuff to the base.  I'm pretty confident we can do all this with the blobserver staying dumb as hell.

> ...
> [snip]
> And now the indexer can do things like catalog all images

It already can.  You can magic-sniff an image in the same way you can magic-sniff a JSON document.

Brad Fitzpatrick

Feb 1, 2011, 12:35:39 PM
to camli...@googlegroups.com
On Mon, Jan 31, 2011 at 3:45 PM, Kenton Varda <temp...@gmail.com> wrote:
> Why does the indexer care if it's a camli blob or not?  Doesn't it need to index other blobs, too?

No.  It doesn't *need* to.  It'll index things that are known by the camli schema and maybe a handful of other things (JPEG EXIF, spatial index on GPS coordinates maybe for cuteness...).  But other blobs are just dumb blobs, unknown by the indexer.

> How is the indexer going to figure out the type of blobs in general?

Like file(1).

> Not all file types have unique magic numbers, so it seems like the indexer would need some pretty complicated content detection to be effective.

Almost all interesting files do.  Certainly everything we care about.  JPEG, PNG, GIF, JSON, music, videos are all easy to tell in the first few bytes.  Anything else can have a camli json wrapper for other files/blobs.

> Detecting camli blobs -- even without the magic sequence -- ought to be relatively easy.  99% of non-JSON files can be ruled out after just a few bytes; for the rest, load and parse the whole thing, and then inspect the parsed object.

Yeah, I'm leaning towards saying that '{' is the "magic" for JSON and removing the "camliVersion" header requirement.

> I understand the desire to avoid attaching any kind of type information to blobs, but it seems like it may make a lot of things difficult.
> (Speaking of which, what are you going to return in the Content-Type: header for a random blob?)

Undefined.  I'll make it explicitly undefined in the docs.

I return application/octet-stream right now but I made the front-end for the sharing demo (at /docs/sharing) do some content-sniffing to guess the right mime type (text/plain; charset=utf-8, image/jpeg, image/png, image/gif, etc) just for ease of demos.  But that's a front-end's job, not the blobserver's job.  Blobs are opaque series of bytes.
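
A minimal sketch of that kind of front-end sniffing, assuming only the types listed above. The function and table names are hypothetical, and this is not the demo's actual code. The byte prefixes are the standard file signatures for JPEG, PNG and GIF; anything else that decodes as UTF-8 is guessed to be text.

```python
# Well-known file signatures, checked in order.
MAGIC_PREFIXES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_mime(data):
    """Guess a Content-Type from the leading bytes of a blob."""
    for prefix, mime in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return mime
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain; charset=utf-8"


print(sniff_mime(b"GIF89a" + b"\x00" * 10))
print(sniff_mime(b"hello, camli"))
```

The guess lives entirely in the front-end; the blob and its blobref are unchanged whichever answer comes back.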

Brandyn Webb

Feb 1, 2011, 11:02:43 PM
to camli...@googlegroups.com
Brad Fitzpatrick said (on Feb 1):

> That's metadata. That's at the wrong layer. The blobserver is dumb bytes.
> If you want metadata, you go up a layer or so.
> [...]

> What happens when I have a blob with bytes "{}" with a mime type of
> "text/json" and bytes "{}" with type "application/json"?
> [...]

> I feel like because the blobserver is the most complete implementation we
> have right now (and the easiest layer to understand conceptually),
> everybody wants to shove stuff into it. I refuse to logically
> expand the blobserver's role until we flesh out the whole picture end-to-end
> and we absolutely need to add stuff to the base. I'm pretty confident we
> can do all this with the blobserver staying dumb as hell.
> [...]

> You can magic-sniff an image in the same way you can
> magic-sniff a JSON document.

I totally agree with the goal of keeping it as simple as possible, and
mostly like the current approach in spirit, but my spidey-sense has me worried. :)
Since you ask, here's my answer to the above Q's, starting with explanation:

In any robust computational environment (everything from assembly language
to high level language to data structures to file formats to whatever), "type"
is externally defined to the object itself. A float in a C struct is just bits,
but you know it's a float because of where it is in the struct. An "object"
in C++ is firstly a C++ object -- that is its type. Knowing that, you can find
the pointer to its class (just more bits--but you know what they mean because of
where they are) and the class tells you how to interpret the body of the object,
so still the bits in the object are externally typed. I.e., even "self-identifying"
objects are just more of the usual hierarchy of externally typed data.

Knowing the type is crucial to proper application of any data. If you confuse
an int for a float (which you can't tell apart just from their bits...) you think
it means one thing when in fact it means another. This is why all designed systems
have 100% reliable ways of externally identifying type.

The rare exception to this is something like the "file" command, which you
mention, which actually parses the raw data to decide what it is. But file was
an after-thought for how to deal with an evolved system--not a designed solution.
File extensions are another attempt to provide external type information for the
contents of a file. The "file" command to some extent was necessitated into
existence because people weren't reliably adhering to the file extension protocol,
and because there was no central authority on file extensions--resulting in multiple
types sharing the same extensions. I.e., file systems did not provide a standard
mechanism for externally typing a file, and the result is a mish-mash of hacks to
try to work around that. The consequence is occasional failure -- "file" sometimes
makes mistakes, and things break when that happens.

To give a concrete example of where things could go haywire with Camli as
currently defined: I write a camli-backed IM client. I stuff the
raw body of each message into a blob, put the metadata in another blob. Pretty
straightforward, clean, seems fine... Then you guys start using it, and in the
middle of a technical discussion about the indexer, someone starts pasting the
bodies of schema blobs into the chat window as illustrative examples.

Oops!

What _should_ happen is the body of the chat text blob is only ever treated
as chat text. What _does_ happen is the indexer sniffs out a claim and starts
editing state, and some time later you realize your data is screwed up and it
becomes a mysterious bug that takes five years to track down.
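
The failure mode can be shown with a toy indexer that enumerates blobs with no context and treats anything that parses as a JSON object with a camliVersion key as a schema blob. The function name and the sample field names are assumptions for illustration, not the real schema.

```python
import json


def sniff_claims(blobs):
    """Return every blob that parses as a JSON object with a camliVersion key."""
    claims = []
    for data in blobs:
        try:
            doc = json.loads(data.decode("utf-8"))
        except ValueError:
            continue
        if isinstance(doc, dict) and "camliVersion" in doc:
            claims.append(doc)
    return claims


real_claim = b'{"camliVersion": 1, "type": "claim"}'
# Body of a chat message in which someone pasted an illustrative example.
chat_text = b'{"camliVersion": 1, "type": "claim", "note": "example only"}'
small_talk = b"did you see the indexer thread?"

found = sniff_claims([real_claim, chat_text, small_talk])
print(len(found))  # the pasted example is counted alongside the real claim
```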

In my opinion, there are two clean options: Either all blobs are completely
opaque (the blob server never looks inside, ever, other than to verify checksum),
or they are self-identifying (maybe nothing more than the mime type--not nec.
a general "header").

Really either of those is probably fine, it just distributes the burden
slightly differently.

The self-identifying case I already kinda described. The opaque case would
work something like this:

The blob server knows nothing about what's in the blobs. Any code that does
look inside, like the indexer, always knows the type first. So, for instance, to
submit a new claim, I can't just build the claim and throw it into my collection
and expect my indexer to pick it up. Rather, I submit it directly to my indexer
as a claim and my indexer in turn stows it via the blob server while also adding
the claim's blobref to a list of known claims. Assuming the indexer maintains
its own state in blobs, the indexer needs to allocate a permanode for that state
when I first set it up, and it better remember the blobref of that permanode
(e.g., in a config file)!

That approach is conceptually more pure and closer to how most databases
and programming languages actually work, but it is _very brittle_ against any
mishaps. The self-identifying case (mime type) is a little mucky because
it's not super clear what "type" to assign some blobs. (Is a C source file
a C source file or a Text file? The best answer to this is probably: It is
both, because C source file is a subclass of Text file, but this opens up the
can of worms of type taxonomy...) So... I appreciate that both of what I am
calling the two "clean" options are not without problems, and I grok why you
have chosen what you have. However, personally, all things weighed in, I
would probably choose self-identifying blobs myself, and probably with Only
a cr-terminated type name (not full mime headers), and probably using mime
type names barring good reason not to. Any type taxonomies would be up to
the higher level code (typically you would label something a "C" file and
the higher level would know this is a subclass of "Text"). This would make
a Blob analogous to an Object in any object-oriented language, where there
is a class identifier, and the data, and all further interpretation of the
data is up to the class.

blob = (type, data)

So given all that context...

> What happens when I have a blob with bytes "{}" with a mime type of
> "text/json" and bytes "{}" with type "application/json"?

In my opinion, these are two different things, just like an integer
and a float that happen to have the same binary representation are two
different things. The point of content addressability is to find the same
conceptual object by the same name, not to capture some coincidental similarity
between two unrelated things that happen to be represented by the same bits.
I would hash the body and type together. If you change the type, it's
a new object.
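
A sketch of hashing the type and body together, assuming the type name is written as one terminated line ahead of the data. The newline terminator and the function name are assumptions; the message only says the type name would be terminated. Under this scheme the two "{}" blobs from the question get different names.

```python
import hashlib


def typed_blobref(type_name, data):
    """Hash a terminated type line together with the body."""
    framed = type_name.encode("ascii") + b"\n" + data
    return "sha1-" + hashlib.sha1(framed).hexdigest()


as_text = typed_blobref("text/json", b"{}")
as_app = typed_blobref("application/json", b"{}")
print(as_text != as_app)  # changing the type yields a new object
```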

For claims, I would probably do the same thing again: The second
line of the blob (after the type) would be a version number and the
signature, followed by a CR, followed by unadulterated JSON. (Or version
number \n signature \n)

Point is, once the type ("camli/claim") is well known, we're free
to do whatever we want with the next layer in.
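
One way the layout described above could be read back, assuming the "version number and the signature" share the second line separated by a single space and that lines end with a newline. The function name and the sample signature text are placeholders; no real signature format is implied.

```python
import json


def parse_framed_claim(blob):
    """Split a 'type / version signature / JSON body' blob into its parts."""
    type_line, meta_line, body = blob.split(b"\n", 2)
    version, signature = meta_line.split(b" ", 1)
    return (
        type_line.decode("ascii"),
        int(version),
        signature.decode("ascii"),
        json.loads(body.decode("utf-8")),
    )


blob = b'camli/claim\n1 PLACEHOLDER-SIGNATURE\n{"attribute": "title"}'
print(parse_framed_claim(blob))
```

The signature is available before the JSON body is parsed at all, which is the pre-filter property mentioned earlier in the thread.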

Dean suggests:

> perhaps something like <hash>-<hexdigest>-<type/subtype;parameters>.

Actually, that might be even better... The downside is the blobrefs
get bigger (and they're replicated a lot...). The upside is it's easy
for the blob server to symlink like <hash>-<hexdigest>'s together... and
also the type now is visible on a blobref which makes human-inspection
of goings-on easier. Offhand I would say "parameters" should probably
be pushed inside the file, so I might use:

<hash>-<hexdigest>.<subtype.type>

Which is generally backward compatible with file-extensions and
would allow file (and blobref) expansions like "*.camli" or "*.claim.camli"
or "*.text". (Alternately, <subtype.type>.<hash>-<hexdigest> would sort
very nicely.)
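
The glob-style expansion can be tried with the standard library's shell-pattern matcher. The blobrefs below use shortened placeholder digests, and `select` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def select(refs, pattern):
    """Return the blobrefs whose type suffix matches a shell-style pattern."""
    return [ref for ref in refs if fnmatchcase(ref, pattern)]


refs = [
    "sha1-0a0a.claim.camli",
    "sha1-1b1b.permanode.camli",
    "sha1-2c2c.text",
]

print(select(refs, "*.camli"))
print(select(refs, "*.claim.camli"))
print(select(refs, "*.text"))
```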

Anyway, just ideas.

-Brandyn

--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------

A cold pot never boils.

Brandyn Webb

Feb 1, 2011, 11:43:34 PM
to camli...@googlegroups.com
Minor addendums to prior tome...

Brandyn Webb said (on Feb 1):


> Assuming the indexer maintains
> its own state in blobs, the indexer needs to allocate a permanode for that state
> when I first set it up, and it better remember the blobref of that permanode
> (e.g., in a config file)!

Actually, this doesn't work at all on second thought. The indexer would
have to maintain at least some persistent dynamic state outside of the blobs,
which is kind of hazardous. (Realistically an indexer will end up doing that
anyway, but conceptually that's all "cache" whereas in the above example it
would be critical, conceptually unrecoverable state.)

> > perhaps something like <hash>-<hexdigest>-<type/subtype;parameters>.
>
> Actually, that might be even better... The downside is the blobrefs
> get bigger (and they're replicated a lot...). The upside is it's easy
> for the blob server to symlink like <hash>-<hexdigest>'s together...

(But I still think this would almost never happen, so don't
consider it important personally.)

-Brandyn

--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------

"A romantic is a person who seeks sublime moments, those
rare experiences that are at the limits of human emotion,
endurance, and understanding." -Greg Robbins

Martin Atkins

Feb 2, 2011, 12:06:27 PM
to camli...@googlegroups.com
On 02/01/2011 08:02 PM, Brandyn Webb wrote:
>
> In any robust computational environment (everything from assembly language
> to high level language to data structures to file formats to whatever), "type"
> is externally defined to the object itself. A float in a C struct is just bits,
> but you know it's a float because of where it is in the struct. An "object"
> in C++ is firstly a C++ object -- that is its type. Knowing that, you can find
> the pointer to its class (just more bits--but you know what they mean because of
> where they are) and the class tells you how to interpret the body of the object,
> so still the bits in the object are externally typed. I.e., even "self-identifying"
> objects are just more of the usual hierarchy of externally typed data.
>

I was agnostic on this issue until I read your argument, but I think I
reached a different conclusion than you did.

With the above (and the rest of what you wrote) in mind, it seems to me
that blobs should just be blobs, with no type information whatsoever,
including magic, and it is up to the layers above to define meaning for
those blobs.

And further, the layers above should define meaning by *context*, not by
sniffing. If I'm holding some data that tells me that
sha1-d26e20e8bcbd7911b0ad257d65c1440c00681687 is where I can find a
profile picture for you, then my app needs to know what "profile
picture" means in this context in order to know what to do with the blob.

This doesn't mean that sniffing can't be a valid processing model for
certain application protocols built on top of the blob store. For
example, whatever application protocol defined "profile picture" in my
example above might define it as "either a JPEG, PNG or GIF image, to be
distinguished using header sniffing". The sniffing method here is
well-defined, so the behavior is predictable and the set of possible
outcomes is much smaller.

But to just pluck a random blob out of the blob store and try to guess
its type with no context whatsoever seems like folly, since the same
blob could be referred to in two contexts with different processing
expected for each.

Rhett

Feb 2, 2011, 12:44:09 PM
to Camlistore
Excited to see this thread after spending a later night than I should
mulling over some of the aspects of Camli that rubbed me the wrong way.

To risk a little armchair architecture, I think one of the conceptual
issues comes from the main design doc where 'level 1' covers both the
dumb blob store, and this 'camliVersion' sniffing. We might be able to
talk about these things more precisely by being clear where level
boundaries are:

Level 0 - Blob store. No types, essentially just key-value
Level 1 - Camli Meta Layer ? Structured Data
Level 2 - Application layer

I think this problem becomes a little easier if we sacrifice a little on
the ultimate flexibility that I assume is just being called 'future
proofing'. Again, I don't have any code invested to back any of this up,
just thinking about it as a newcomer:

- Accept JSON-only for Level 1 layer. No sniffing. camliVersion just
has to be a key in there somewhere.
- Indexing is done with context (at Level 1 rather than Level 0).

So you would never just grab a random blob (from enumerating I guess)
and programmatically inspect it. From a client application perspective,
you'd probably rarely do this anyway, as you'd have references to
permanodes or results from indexers. Those would always be Level 1
objects. Level 1 objects might reference raw data in Level 0, and tell
you the type (image, text, whatever).

The downside here is that you now have to define some interface for an
indexer. It can't just work directly off of the data store. I wouldn't
be surprised though if this ended up being much more convenient to deal
with as I can imagine very quickly wanting to prune down the blobs an
indexer needs to deal with. More than likely it will be of little use to
have the indexer tell me about EXIF data in the jpeg I uploaded rather
than represent that image by whatever metadata is provided in Level 1.

That interface to the indexer could simply key off mime-type on the way
in. When inserting a new blob with content-type 'application/camlijson'
accept that it's a Level 1 blob and index it. Or maybe if you ever
indicate that it's JSON in the mime type it will try to parse it and
look for that camliVersion key. Mime type doesn't need to be stored,
just indicating to Camli that it might be able to parse it if it cares.
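
A small sketch of such an interface, assuming an in-memory store. The class and attribute names are hypothetical, and 'application/camlijson' is the content type suggested above, not a registered one. The content type is used as a hint at insertion time and is never stored with the blob.

```python
import hashlib
import json

CAMLI_JSON = "application/camlijson"


class Store:
    """Hypothetical two-level store: untyped blobs plus an index fed on insert."""

    def __init__(self):
        self.blobs = {}    # Level 0: blobref -> bytes, no type kept
        self.indexed = {}  # Level 1: blobref -> parsed schema object

    def put(self, data, content_type="application/octet-stream"):
        ref = "sha1-" + hashlib.sha1(data).hexdigest()
        self.blobs[ref] = data
        if content_type == CAMLI_JSON:
            try:
                doc = json.loads(data.decode("utf-8"))
            except ValueError:
                return ref  # stored, but not indexable
            if isinstance(doc, dict) and "camliVersion" in doc:
                self.indexed[ref] = doc
        return ref


store = Store()
pasted = store.put(b'{"camliVersion": 1, "example": true}', "text/plain")
print(pasted in store.indexed)  # looks like a schema blob, but was never declared as one
```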

Rhett

Kenton Varda

Feb 2, 2011, 7:42:22 PM
to camli...@googlegroups.com
On Tue, Feb 1, 2011 at 9:35 AM, Brad Fitzpatrick <br...@danga.com> wrote:
 
>> How is the indexer going to figure out the type of blobs in general?
>
> Like file(1).

file(1) is a great heuristic tool for users trying to figure things out, because it works 99% of the time.  But I don't think I'd feel comfortable building complex software on top of it.

Just the other day I read about an interesting problem in reiserfs's fsck implementation:  If you have a reiserfs filesystem, and on it you happened to be storing a disk image of another reiserfs filesystem (e.g. for a VM), it can completely confuse fsck.  Basically, fsck looks for things that look like filesystem structures in an effort to repair the system, and it thinks that the stuff inside the image is actually part of the outer filesystem.

I worry that camli's design is going to lead to the same problem.

Brandyn's example is good too...  What happens if I am a camlistore developer, and in development I happen to save an "example" camlistore object as a text blob?  Camlistore is going to attach meaning to that blob that I didn't intend, leading to unexpected effects.

This could even lead to security problems.  If I store my e-mail as text blobs in camlistore, someone could send me an e-mail containing a camli directive and cause unexpected things to happen.

Sure, you can say "Apps should make sure to frame their data such that it can't be misinterpreted".  But it seems excessively easy to accidentally get this wrong.
 
>> Not all file types have unique magic numbers, so it seems like the indexer would need some pretty complicated content detection to be effective.
>
> Almost all interesting files do.

I don't think that's true.  Most text-based formats do not contain magic numbers.  JSON is easy enough to identify as JSON, but much harder to distinguish as any particular kind of JSON.  Similarly with XML.
 
> Anything else can have a camli json wrapper for other files/blobs.

That doesn't prevent misinterpretation, unless you make sure to always search for said wrapper before making any assumptions about a blob.  Of course, you have to search for the wrapper's wrapper too, and the wrapper's wrapper's wrapper, ...

I think there are a couple solutions here:

1) For each blob, maintain a single bit of metadata that simply indicates whether the blob is a camli object.  When receiving the same blob from multiple sources, you can simply "or" the bits -- if either one is known to be a special camli object, you want to treat it as such.

2) Treat no object as a camli object unless you have seen a reference to that object in a context that indicates that it must be a camli object.  This requires that you manually track some "root" objects which you know to be special.  Those root objects will somehow indicate which of their blobrefs point to other camli objects, and so on.

Both of these solutions could actually be handled by the indexer and garbage collector rather than the storage layer itself.  However, this would mean that when a user writes a camli object, they'd have to talk to the indexer and/or GC to let them know, rather than letting them just find the object by enumeration later.  Personally, I think that's fine -- I never liked enumeration in the first place.  :)
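
Both solutions can be sketched in a few lines, assuming the indexer keeps this bookkeeping in plain dictionaries. All names here are hypothetical. The first function merges the per-blob bit from two sources with a logical OR; the second walks outward from known roots and accepts only objects reached through references that are declared to point at camli objects.

```python
def merge_flags(local, remote):
    """Solution 1: OR the 'is a camli object' bit for each blobref."""
    merged = dict(local)
    for ref, flag in remote.items():
        merged[ref] = merged.get(ref, False) or flag
    return merged


def reachable_camli_objects(roots, camli_refs):
    """Solution 2: collect objects reachable from trusted roots.

    camli_refs maps a blobref to the blobrefs it declares to be camli objects.
    """
    seen = set()
    stack = list(roots)
    while stack:
        ref = stack.pop()
        if ref in seen:
            continue
        seen.add(ref)
        stack.extend(camli_refs.get(ref, ()))
    return seen


print(merge_flags({"a": False, "b": True}, {"a": True, "c": False}))
graph = {"root": ["perm1"], "perm1": ["claim1"], "stray": ["claim2"]}
print(sorted(reachable_camli_objects(["root"], graph)))
```

In the second case a blob that merely looks like a claim, such as "claim2" above, is never interpreted because nothing trusted points at it.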

Michael Stephens

Feb 2, 2011, 10:43:40 PM
to camli...@googlegroups.com
On Wed, Feb 2, 2011 at 7:42 PM, Kenton Varda <temp...@gmail.com> wrote:
> On Tue, Feb 1, 2011 at 9:35 AM, Brad Fitzpatrick <br...@danga.com> wrote:
>>
>>
>>>
>>> How is the indexer going to figure out the type of blobs in general?
>>
>> Like file(1).
>
> file(1) is a great heuristic tool for users trying to figure things out,
> because it works 99% of the time.  But I don't think I'd feel comfortable
> building complex software on top of it.
> Just the other day I read about an interesting problem in reiserfs's fsck
> implementation:  If you have a reiserfs filesystem, and on it you happened
> to be storing a disk image of another reiserfs filesystem (e.g. for a VM),
> it can completely confuse fsck.  Basically, fsck looks for things that look
> like filesystem structures in an effort to repair the system, and it thinks
> that the stuff inside the image is actually part of the outer filesystem.
> I worry that camli's design is going to lead to the same problem.
> Brandyn's example is good too...  What happens if I am a camlistore
> developer, and in development I happen to save an "example" camlistore
> object as a text blob?  Camlistore is going to attach meaning to that blob
> that I didn't intend, leading to unexpected effects.

Indeed, the camlistore source tree already includes just such a file
(doc/json-signing/example/some-notes.txt.camli); will uploading the
camlistore source confuse/break a future version of camlistore?
Expecting developers to sanitize application data to avoid this seems
likely to lead to a lot of difficult to track down bugs and (possibly)
security vulnerabilities.

Brandyn Webb

Feb 3, 2011, 2:19:29 AM
to camli...@googlegroups.com
Martin Atkins said (on Feb 2):

> I was agnostic on this issue until I read your argument, but I think I
> reached a different conclusion than you did.
>
> With the above (and the rest of what you wrote) in mind, it seems to me
> that blobs should just be blobs, with no type information whatsoever,
> including magic, and it is up to the layers above to define meaning for
> those blobs.
>
> And further, the layers above should define meaning by *context*, not by
> sniffing. [...]

I actually agree with this philosophically.

The problem is that the blobs are immutable. It's perhaps a little
subtle why this is an issue, but you have to think all the way through how
you would actually implement the indexer.

I assume there is a design objective that the collection of blobs
completely defines the database, and that anything outside the collection of
blobs is conceptually just cache. This is analogous to, say, the tuples
in an SQL db completely defining the data, and any indices are just caches
for speed and can be (reliably) regenerated if lost.

As far as I can figure (maybe I'm just missing an implementation or
representation trick), you can't have immutable blobs AND no external
data AND no blob sniffing.

If it's not clear why, let me know and I can elaborate.

For what it's worth, the architecture I was starting to sketch out
before finding camli was similar, but allows mutable objects. Similar
to camli, the names are content hashes, and I have essentially
the same thing as a permanode as the root of mutable objects, but
my "blob server" directly supports something analogous to Camli's
"permanode-become" (but for any object, not just file systems...),
which are essentially "pointers" and provide a single and concise
point of mutability.

Now, I suppose you could say these are still two separate layers,
and you could do the same thing in Camli by having the indexer store
essential state (current permanode root), but to me that feels unclean
because now you have two supposedly separate data stores which Must be
kept perfectly in sync or there is corruption. To me that pretty well
defines one inseparable layer, so I think of it as such. May just be
semantics.

So, I agree with you, actually, but with the condition that the
blob server directly supports permanode-become. Without that, afaict,
you Must store essential state in the Indexer, which imo is hazardous.

-Brandyn

--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------

Humans are like Slinkies. Not good for much, but you just can't
help but smile when you see one go down a flight of stairs.

(Anonymous)

Brandyn Webb

Feb 3, 2011, 2:43:33 PM
to camli...@googlegroups.com
Kenton Varda said (on Feb 2):

> file(1) is a great heuristic tool for users trying to figure things out,
> because it works 99% of the time. But I don't think I'd feel comfortable
> building complex software on top of it.

Likewise. I'm going to return to working on my own model, since for my
purposes I need "provable" robustness and a sniffing model will always be
slightly leaky, but there remains a great deal of overlap so if I can find
a way down the road to use the camli back-end I will. (Open to collaboration
if anybody else is working on a similar variant.)

-Brandyn

--
---------- bra...@sifter.org ------- http://www.sifter.org/~brandyn ----------

When the going gets tough, the tough get going.
But the rest of the time they're just bored.
