* THE OBVIOUS
A clean abstraction to store and retrieve objects in Thrudoc, and index
and search in Thrucene.
* SCHEMA DEFINITION
I'm not too happy with the idea of using thrift files to declare
object schemas. It's almost trivial to implement a small DSL to
declare the fields directly in the class. Something like:
class User < ActiveDocument::Model
  field :login, :string, :indexed, :sortable
  field :email, :string, :indexed
  field :created_on, :datetime
  field :password, :string
end
It declares the fields, and also flags which ones should be indexed.
And the implementation simply builds the FIELDS hash used by the
thrift module.
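A minimal sketch of how such a "field" macro could work. The
ActiveDocument::Model class and the layout of the per-class field table
are assumptions for illustration, not the actual FIELDS format that the
thrift-generated code expects:

```ruby
# Hypothetical sketch: the class name and the shape of the field table
# are assumptions, not the real thrift FIELDS layout.
module ActiveDocument
  class Model
    # Each subclass keeps its own field table.
    def self.fields
      @fields ||= {}
    end

    # Usage: field :login, :string, :indexed, :sortable
    def self.field(name, type, *flags)
      fields[name] = {
        :type     => type,
        :indexed  => flags.include?(:indexed),
        :sortable => flags.include?(:sortable)
      }
      attr_accessor name
    end
  end
end

class User < ActiveDocument::Model
  field :login, :string, :indexed, :sortable
  field :email, :string, :indexed
  field :created_on, :datetime
  field :password, :string
end
```

User.fields then holds one entry per declared field, in declaration
order, which is all the information the serializer and the indexer
would need.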
* BACKWARD COMPATIBLE SCHEMAS
Thrift handles this by letting the user specify the ID for each field
(which is the only identifier stored in the datastream). The schema
definition DSL might allow the user to specify an ID, but ideally this
should be handled behind the scenes.
A simple counter won't work, because any change in the class
declaration would break the sequence. Removing the "email" field in
the example above would mean that "password" would now get id 3
instead of 4, breaking compatibility with existing documents.
We could store some schema metadata on the database itself. A list of
field names and types, along with their "assigned id". This way, all
"title" fields of type "string" would be assigned id 42 (which might
be different on another server, but that doesn't matter as long as
it's consistent in each server).
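Roughly what I have in mind, as a sketch only. FieldRegistry is a
made-up name, and the in-memory Hash stands in for metadata that would
really be stored in the database (and would need some locking there):

```ruby
# Hypothetical sketch: FieldRegistry and its Hash store are stand-ins
# for schema metadata kept in the database itself.
class FieldRegistry
  def initialize(store = {})
    @store = store # maps [name, type] => assigned id
  end

  # Returns the id already assigned to this name/type pair, or assigns
  # the next unused one. Existing ids are never renumbered.
  def id_for(name, type)
    key = [name.to_s, type.to_s]
    @store[key] ||= (@store.values.max || 0) + 1
  end
end

registry = FieldRegistry.new
registry.id_for(:login, :string)
registry.id_for(:email, :string)
registry.id_for(:created_on, :datetime)
password_id = registry.id_for(:password, :string)
```

Because the id is looked up by name and type instead of by position,
dropping "email" from the class later leaves the id of "password"
untouched.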
* SELF-IDENTIFYING CLASSES
Thrift structures don't know which class they are. They are just a
stream of field ids and data. To deserialize a document, one needs to
know which class it should be deserialized into. This is fine if all
you want to do is User.find(params[:id]) but what happens if you want
to retrieve all objects (regardless of class) tagged by a given user?
Thrucene will give you a list of ids, but you can't deserialize them
because you have no clue as to what class to use.
The simple solution is to store the class name in the document itself
as yet another attribute. The simple implementation is to decode the
object twice, as suggested in
http://3.rdrail.net/blog/working-with-thrift-structures/, but we can
improve on that by refactoring the thrift library a little.
And of course, we won't have to do all this extra work in those cases
where we already know the class to use.
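To make the idea concrete, here is a sketch that uses Marshal as a
stand-in for the thrift encoding. Document, Link and Book are invented
names, and a real version would have to validate the stored class name
before trusting it:

```ruby
# Hypothetical sketch: Marshal stands in for the thrift encoding, and
# Document/Link/Book are made-up classes.
class Document
  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes
  end

  # Store the class name alongside the regular attributes.
  def dump
    Marshal.dump(attributes.merge("class_name" => self.class.name))
  end

  # Decode once, then pick the class from the stored name.
  # Passing known_class skips the lookup when the type is already known.
  def self.load(blob, known_class = nil)
    data = Marshal.load(blob)
    name = data.delete("class_name")
    klass = known_class || Object.const_get(name)
    klass.new(data)
  end
end

class Link < Document; end
class Book < Document; end

blobs = [Link.new("title" => "Rails"), Book.new("title" => "Rails")].map { |d| d.dump }
objects = blobs.map { |b| Document.load(b) }
```

Each blob comes back as an instance of the class that wrote it, without
the caller having to know the class up front.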
* VALIDATIONS, CALLBACKS, ETC
Rick suggested we use ActiveModel (can't find a link right now) which
is a refactoring/extraction of those features from ActiveRecord.
* RELATIONS
I'm not keen on converting ThruDB into a SQL abstraction, but
providing simple mechanisms to relate objects among themselves is a
very useful feature. And reusing existing concepts to represent those
relations can ease adoption. ActiveDocument should provide
implementations of "has_many", "belongs_to" and
"has_and_belongs_to_many". They are implemented as fields storing one
or many guids and indexed in thrucene.
Obviously you can't do joins in thrudb, but we might be able to figure
out some caching mechanism that reduces the number of calls to the
server when retrieving related objects.
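A very rough sketch of relations as guid fields. STORE is an in-memory
stand-in for Thrudoc, the Doc base class is hypothetical, and the naive
singularization (chopping a trailing "s") is only there to keep the
example short. No indexing or caching is shown:

```ruby
# Hypothetical sketch: STORE stands in for Thrudoc; the macros only show
# the shape of the idea.
STORE = {}

class Doc
  attr_reader :guid, :attrs

  def initialize(guid, attrs = {})
    @guid, @attrs = guid, attrs
    STORE[guid] = self
  end

  # belongs_to :group reads the single guid stored in attrs[:group_id]
  def self.belongs_to(name)
    define_method(name) { STORE[attrs[:"#{name}_id"]] }
  end

  # has_many :members reads the guid list stored in attrs[:member_ids]
  def self.has_many(name)
    key = :"#{name.to_s.chomp('s')}_ids"
    define_method(name) { (attrs[key] || []).map { |g| STORE[g] } }
  end
end

class Group < Doc
  has_many :members
end

class Member < Doc
  belongs_to :group
end

Member.new("u1", :group_id => "g1")
Member.new("u2", :group_id => "g1")
group = Group.new("g1", :member_ids => ["u1", "u2"])
```

group.members resolves the stored guids one by one, which is exactly
where a caching layer or a multi-object fetch would pay off.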
* COMPATIBILITY WITH OTHER DATABASES
At least for some of us, the idea of ActiveDocument was triggered by
ThruDB. This doesn't mean that we can't implement it for other
databases such as CouchDB or SimpleDB (let google deal with BigTable
on their own).
I'm fine with that. But just like with ActiveRecord, I'd rather have
that "compatibility" be very thin. That is, there is no guarantee that
your code will run without changes in all the databases. Query
languages are too different across the databases for this. But the api
will be the same, and simple calls will work as expected.
Maybe in the future someone can come up with a query abstraction that
can work across all doc-dbs, but for now I won't even bother.
Well, that's it... those are the big ideas I had in my little head.
Let the feedback flow.
Sebastian
On 1/13/08, sroske <sro...@gmail.com> wrote:
<snip>
> > I'm fine with that. But just like with ActiveRecord, I'd rather have
> > that "compatibility" be very thin. That is, there is no guarantee that
> > your code will run without changes in all the databases. Query
> > languages are too different across the databases for this. But the api
> > will be the same, and simple calls will work as expected.
> >
> > Maybe in the future someone can come up with a query abstraction that
> > can work across all doc-dbs, but for now I won't even bother.
>
> I think the query abstraction and database adapter setup is important
> starting out. This is the key reason we can't just create an
> ActiveRecord adapter for document-oriented databases
> (ActiveRecord::Base generates SQL statements that are passed to the
> database adapters).
>
> We should use something like Ezra's ez_where[2] DSL for query
> abstraction, so for a very simple example instead of:
>
> User.find(:first, :conditions => ['login:? AND status:active',
> 'mylogin'])
>
> It could be:
>
> User.find(:first, :conditions => { login == 'mylogin' && status ==
> 'active' })
>
> Then we could inflect on that block to discover what kind of query it
> is, but actually pass the block into the adapter itself. The adapter
> can worry about converting it from the DSL to that database's specific
> query logic (lucene for ThruDB, Javascript functions for CouchDB,
> etc.).
I agree with this point, strongly. I like the idea of abstracting out
the exact query language syntax for a couple reasons:
1. developers don't have to learn another query language syntax in
order to use lucene, or couchdb, and so on.
2. it makes switching data storage backends easier as your actual
model code is agnostic
I brought this up in my comment on Sebastian's blog. Have you looked
at ActiveRecord::Extensions? I like the syntactical approach used
there:
Post.find( :all, :conditions => {
  :title => "Title",                  # title = 'Title'
  :author_contains => "Zach",         # author LIKE '%Zach%'
  :author_starts_with => "Zach",      # author LIKE 'Zach%'
  :author_ends_with => "Dennis",      # author LIKE '%Dennis'
  :published_at => (Date.today - 30 .. Date.today),
                                      # published_at BETWEEN xxx AND xxx
  :rating => [ 4, 5, 6 ],             # rating IN ( 4, 5, 6 )
  :rating_not_in => [ 7, 8, 9 ],      # rating NOT IN ( 7, 8, 9 )
  :rating_ne => 4,                    # rating != 4
  :rating_gt => 4,                    # rating > 4
  :rating_lt => 4,                    # rating < 4
  :content => /(a|b|c)/               # content REGEXP '(a|b|c)'
} )
the above is from: http://www.rubyinside.com/advent2006/17-extendingar.html
Instead of translating the hash to SQL we would of course translate it
to Lucene syntax, and so on. I'm curious what you guys think of the
section on the above URL under the header "Database Adapter Support
and Compatibility" that Zach wrote. I haven't looked too far into AR.
I got about 75% of the way through writing an adapter for a SQL
proxying type app called SQL Relay then ended up not needing it, and
got busy, the usual, but perhaps you're more well versed in it.
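Just to illustrate the hash-to-Lucene translation I mean, here is a
small sketch. to_lucene is a made-up helper, it only covers a few of
the operators above, and it does no escaping of values, so treat it as
a starting point and nothing more:

```ruby
# Hypothetical sketch: to_lucene and its suffix handling are illustrative
# only; values are not escaped and most operators are left out.
def to_lucene(conditions)
  clauses = conditions.map do |key, value|
    field = key.to_s
    if field.end_with?("_starts_with")
      # Prefix query: author:Zach*
      "#{field.sub(/_starts_with\z/, '')}:#{value}*"
    elsif value.is_a?(Range)
      # Inclusive range query: rating:[1 TO 5]
      "#{field}:[#{value.first} TO #{value.last}]"
    elsif value.is_a?(Array)
      # Any of the listed values: rating:(4 OR 5 OR 6)
      "#{field}:(#{value.join(' OR ')})"
    else
      # Exact phrase match: title:"Title"
      "#{field}:\"#{value}\""
    end
  end
  clauses.join(" AND ")
end

query = to_lucene(:title => "Title",
                  :author_starts_with => "Zach",
                  :rating => [4, 5, 6])
```

A CouchDB adapter would take the same hash and build a JavaScript
function instead, so the model code never sees either syntax.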
Back to rearranging the apartment:) ttyl,
Jacqui
I'd rather build it as a layer on top of the rest of the system (like
ez_where does with AR) once we've had more experience using the query
languages themselves and we can better understand the usage patterns.
Your point about "the same conventions that ActiveRecord uses" is
exactly what I'm concerned about. We don't need to impose those
conventions on a database system that is not table-oriented.
It's incorrect to say "you're only ever querying or retrieving a
single model record". On your "Media Library" site, you can have a
query for "title:Rails" returning Links, Books, ScreenCasts, etc.
Another danger of the current ThruDB implementation (or rather,
suggestion) of using thrift structures is that if you were to lose the
"original pointer", you can end up with a bunch of binary orphans in
your database that you have no clue what to do with. How can you
retrieve all your "User" objects if you were to lose your lucene
index? The data will be there, but it won't be possible (or at least
easy) to figure out which blobs are users and which are other objects.
Just that "data recoverability" feature alone makes it worth
implementing an automatic class_name field, at least in my opinion.
class User
  attribute :username, :string
  attribute :memberships, :Membership
end
Then you could just have the code for the "attribute" method handle
the marshaling. Just an idea.
On an unrelated note, what do you guys think of having a hack session
on Wednesday afternoon/evening?
Best,
Paul
The only issue then becomes schema changes, but I thought one of the
points of the document model was to lessen the problem with schema
changes. Just my $.02.
On the hack session: What about Wednesday afternoon / evening? I'm
available from 2:30 EST on.
Cheers,
Paul
On Jan 14, 2008 7:34 AM, rick <techno...@gmail.com> wrote:
You can easily retrieve Users belonging to a given group, by looking
for the ids in the "group members" attribute of the group, or by
querying lucene for users with a "member_of" field containing the
group id.
But to retrieve members of any group with a name that ends in
"friends", you will have to first get all the groups matching that
name, and then retrieve the users for each group.
If we cannot do a join and have to do several queries, then network
latency has a larger effect, so caching closer to the client (thrudb
client, not the browser) becomes more important. Or we might want to
make sure thrudb provides efficient multi-object retrieval in a single
call (it might already).
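A sketch of what the client-side piece might look like. CachingClient
is a made-up class, and the backend lambda stands in for a multi-object
fetch that thrudb may or may not provide:

```ruby
# Hypothetical sketch: the backend lambda stands in for a multi-object
# fetch; whether thrudb offers one is an open question.
class CachingClient
  attr_reader :calls

  def initialize(backend)
    @backend = backend # callable: list of ids => { id => document }
    @cache = {}
    @calls = 0
  end

  # Fetches all uncached ids in a single backend call.
  def fetch_many(ids)
    missing = ids.uniq - @cache.keys
    unless missing.empty?
      @calls += 1
      @cache.merge!(@backend.call(missing))
    end
    ids.map { |id| @cache[id] }
  end
end

docs = { "u1" => "alice", "u2" => "bob", "u3" => "carol" }
backend = lambda { |ids| ids.each_with_object({}) { |id, h| h[id] = docs[id] } }
client = CachingClient.new(backend)

# Step 1 (not shown): a lucene query returns the groups named "*friends".
groups = [
  { :name => "school friends", :member_ids => ["u1", "u2"] },
  { :name => "work friends",   :member_ids => ["u2", "u3"] }
]

# Step 2: collect every member id and fetch them in one round trip.
member_ids = groups.map { |g| g[:member_ids] }.flatten
members = client.fetch_many(member_ids).uniq
```

The second step costs one call no matter how many groups matched, and a
repeated lookup for the same user never reaches the server.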
I was thinking something like:
class Foo < ActiveDocument::Base
  schema do
    string :bar
  end
end
Through the power of ruby open classes, you can easily partition this
off to another file (which is what I'll be doing). Or you could just
keep it all in one large file. Everyone wins :)
Also, it might be nice to keep the current module format of the
library. This way you can include the module into any ruby class that
has #serialize and #deserialize.
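Here is a rough sketch of the schema block done as a module. The names
ActiveDocument::Schema, Builder and ClassMethods are placeholders, and
only two field types are shown:

```ruby
# Hypothetical sketch: module and class names are placeholders.
module ActiveDocument
  module Schema
    def self.included(base)
      base.extend(ClassMethods)
    end

    # Collects the declarations made inside a schema block.
    class Builder
      attr_reader :fields

      def initialize
        @fields = {}
      end

      def string(name)
        @fields[name] = :string
      end

      def datetime(name)
        @fields[name] = :datetime
      end
    end

    module ClassMethods
      def fields
        @fields ||= {}
      end

      # Each schema block adds to the fields declared so far.
      def schema(&block)
        builder = Builder.new
        builder.instance_eval(&block)
        fields.merge!(builder.fields)
      end
    end
  end
end

class Foo
  include ActiveDocument::Schema

  schema do
    string :bar
  end
end

# Reopened later, for example from a separate schema file.
class Foo
  schema do
    datetime :created_on
  end
end
```

Since schema merges into the existing table, the declarations can live
in one file or be spread over several.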
--
Rick Olson
http://lighthouseapp.com
http://weblog.techno-weenie.net
http://mephistoblog.com
> Thanks for the further explanation. I had assumed that you'd have to
> pay the latency penalty for the query you described, but multi-object
> retrieval (does someone who knows better know if this is possible in
> ThruDB?) could certainly be a handy tool in the box.
thrudb doesn't know anything about the data more than key and value.
the actual data in them is opaque so it doesn't really support the
idea of functions like count, sum, max, min as it's unclear what
they'd be on.
in the past in building systems on top of things like thrudb and
thrudex (thrucene) i've used a "metrics" service for things similar to
what you'd normally do with count(foo) as cnt ... where ... sort by
cnt desc. something like this might fit in to the thrudb suite and is
pretty straightforward, but doesn't fulfill all of the use cases a
database could.
-rm