WRFS - An Idea    

From: http://cowbell.floe.tv/WRFS.html

 

WRFS

by Josh Patterson and Josh Lewis
Todo:
  • Add example uses
  • Add Images
  • Add References

When Can I Use That

During development of any application, the team typically sits down with users for a testing session, asking questions and gauging responses to see how well the product satisfies market demand, or, put simply, does it "do the job"? While developing floe.tv, a media mashup system, I was demoing the beta to two videographers one day, showing off features and asking what was important to them. We came to the subject of data storage, local hard drives, and getting media online, and just as a thought exercise I asked, "Well, what if floe.tv just knew about all your online media by your login name, and referenced it automatically in your libraries the first time you logged in, just as if it were an app installed locally on your hard drive?" Both of them immediately became excited, and one asked, "Can I do that right now? When can I use that?" I knew from experience that the market was speaking very loudly and clearly in my direction, and that I had better listen very closely.

The very next meeting I posed this question to our team:

What if our app was "inherently installed" in the internet? What if someone logged in, and the app just acted like a desktop app that "knew" about your flickr images, your youtube videos, it knew about your myspace friends, facebook friends, and automatically treated them as one logical database, one logical social graph? And someone started right into an app tutorial right off the bat with their contacts, files, and assets already referenced (but fully respecting privacy, control, etc)?

So the next question naturally becomes: that all sounds really great, but ... how do we get there?

In order to even think about "getting there", we need to map out and discuss a few things first. Ideally we need to at least sketch out reasonable answers to the following:
  1. What are some of the ways we can execute a strategy that satisfies the stated needs?
  2. What are some existing technologies that exhibit the properties we want?
  3. How do we condense a layered abstraction of the utility from the vapor of its required properties?
  4. Which current technologies fit the properties of that abstraction, so that we don't reinvent the wheel?
  5. How might a prototype of the proposed system execute?
  6. What are the obstacles, opportunities, strengths, and weaknesses of a proposed system such as this?

A Rough Sketch

So exactly what are we trying to construct here?

From a user's perspective we want this system to:

  • Allow them to use their data in a {web, desktop} application regardless of where the data is stored (in one place, in multiple locations, etc)
  • Not force the user to configure each data store, each application, or identity provider for each application. It simply manages the permissions for the user, securely, and allows data { identity, social graph, other } aggregation seamlessly behind the scenes.
  • Keep things simple. Let the user do what the user wants to do, as opposed to trying to keep the user in a "walled garden". Let the market decide how data should be used.

As developers we want a system that:

  • Makes all of a user's data accessible from any application on the internet while being stored in a variety of containers across many different sites.
  • Makes a user's data accessible from a query language or a proxy-api
  • Implicitly knows how to "discover" where a user has data stored.
  • Knows how to aggregate a user's data together across multiple containers and identities.
  • Allows the user to control who can see and query what on their behalf
  • Presents a universal data api for web data allowing data to be queried against as if it were in a single filesystem or database.

Hey, while we're dreaming big, let's go for it --- We might also want it to:

  • Allow an application to view the data from the perspective of a filesystem and from the perspective of a database
  • Potentially act as a "Data Cloud" drive on a portable device (example: an iPhone thinks its I: drive is a local disk, but it's actually the system we are proposing)

So exactly what are you saying here?

Basically, we want an api that allows us to view, query, and aggregate a user's data, regardless of location (restriction: data must be accessible through a webserver), in much the same way that we do with a local disk-based filesystem (at least in most ways). Say, how does a filesystem work, anyway? That sounds like a good place to start on our model!

That Same Ol Song

So we said we want to be able to:

  • Store
  • Aggregate
  • Protect
  • Relate
  • Query
our data? Huh. What other systems do these same things?
  • A Filesystem
  • A Database
  • DNS
So really we want to do some things that have already been done quite well in computer science. So let's take a look at how they do it, and build a roadmap/model of how we might create an abstraction.

The Filesystem as a Metaphor

What are some interesting properties of a filesystem that are very applicable to our situation?
  • A filesystem abstracts away the details of storing bits on a storage system from the user or programmer so they can focus on working with files as, well, "files". The programmer simply says "I want a stream to read from a file called 'foo.txt' in my '/usr' directory; go get it and return me a data structure or stream". The programmer generally doesn't worry about inodes, bmap, or the size of a disk block, because at the application level of the abstraction model those underlying details should simply be "taken care of". Just think about this: if you had to worry about managing free disk blocks in a linked list every time you wanted to open a text file, you might start thinking about changing professions. Abstractions are your friend.
  • So exactly what happens when we open a file to read? A file is stored all in one spot, right? Not quite. A hard drive is divided into a series of "disk blocks", which are all the same size (commonly 4KB), considerably smaller than your average mp3 file. So how does a file get read from disk if it's stored in all those 4KB chunks? It goes roughly like:
    1. Translation of filename and directory to inode
    2. Translation of inode and offset into disk block using the kernel's "bmap" routine
    3. File reader/writer is returned at offset in block
    Where:
    • Inode - a data structure on disk that represents a file and has pointers to all of the disk blocks on disk that contain the actual bytes for the file.
  • So now you say "great, you've told us how a basic filesystem works, but that doesn't help flesh out this grand distributed file system...", and to that I say "hold on, lemme finish, I'm going somewhere with this". What property of the filesystem can we apply to our design goal abstraction?

    Store and Aggregate. We have to be able to store our data in whatever container we want on the internet, and we need to be able to aggregate that data back together again, right? Well, what if we said:
    • The concept of a disk block could be related to a web server itself to satisfy the storage requirement. The bmap() routine is mapped to a standard web api { soap, rest, json } on the webserver for exposing data.
    • The concept of an inode could be related to a data index store to satisfy the aggregation requirement. This online store would be needed to tell a program/agent "hey, userID X has images in flickr, smugmug, and myspace"; the agent would then take those uris and make the proper data queries via the standard web data apis on the respective data containers. This eliminates the need to actually cache someone's data in a third party site, since the data should be able to be "discovered" and aggregated at runtime. The data aggregation layer might do something like take an openID identifier and query a data index store (possibly referenced by the openID provider itself) to get the list of data containers, query each one, and then return a data structure (or recordset) of relevant data stubs+uris for specific user files to the application layer.
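The lookup sequence above can be sketched as a toy model. Everything here is invented for illustration (the tiny 4-byte block size, the `directory`, `inodes`, and `disk` tables, and the `readFile` helper), and it is not how any real filesystem lays out data on disk.

```javascript
// Toy model of the read path: filename -> inode -> block numbers -> bytes.
// All names and the 4-byte block size are assumptions made for illustration.
const BLOCK_SIZE = 4;

// The "disk": fixed-size blocks, with one file's blocks scattered among others.
const disk = ["o, w", "....", "hell", "....", "rfs!"];

// Directory: filename -> inode number.
const directory = { "/usr/foo.txt": 7 };

// Inode table: inode number -> file size plus the ordered list of block numbers.
const inodes = { 7: { size: 12, blocks: [2, 0, 4] } };

// bmap: translate (inode, byte offset) into a disk block number and
// an offset inside that block.
function bmap(inode, offset) {
  const index = Math.floor(offset / BLOCK_SIZE);
  return { block: inode.blocks[index], within: offset % BLOCK_SIZE };
}

// Read up to `length` bytes starting at `offset`, one block lookup per byte.
function readFile(path, offset, length) {
  const inode = inodes[directory[path]];
  if (!inode) throw new Error("no such file: " + path);
  const end = Math.min(offset + length, inode.size);
  let out = "";
  for (let pos = offset; pos < end; pos++) {
    const { block, within } = bmap(inode, pos);
    out += disk[block][within];
  }
  return out;
}

console.log(readFile("/usr/foo.txt", 0, 12)); // prints "hello, wrfs!"
console.log(readFile("/usr/foo.txt", 7, 4));  // prints "wrfs"
```

In the web analogy, `disk` would become the set of web servers holding a user's data, `inodes` would become the data index store, and `bmap` would become a web api call.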

The Database as a Metaphor

I think really the aspects of a database that are interesting in this context are Protect, Relate, and Query.
  • Protect - The data stubs returned from the data containers might not be accessible from all applications or third parties. A database can set who can view and update a table down to the record level. Our system might want that level of granularity.
  • Relate - The data stubs returned from the data containers might have relational properties and have transformations performed against them before being returned to the application layer.
  • Query - just like how a SELECT statement is parsed into a query tree before being executed against a database table/view, we might want to execute similar type functions/filters against the returned aggregated data.
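As a rough sketch of how Protect, Relate, and Query might compose over returned data stubs: the stub fields (`visibleTo`, `container`), the `.example` URIs, and the three helper functions are all assumptions for illustration, not part of any existing api.

```javascript
// Hypothetical data stubs as they might come back from two containers.
const stubs = [
  { uri: "http://flickr.example/joe/1.jpg", container: "flickr", visibleTo: ["*"] },
  { uri: "http://flickr.example/joe/2.jpg", container: "flickr", visibleTo: ["floe.tv"] },
  { uri: "http://smugmug.example/joe/9.jpg", container: "smugmug", visibleTo: ["other.example"] },
];

// A second "table" to relate the stubs against (also hypothetical).
const containers = [
  { name: "flickr", api: "rest" },
  { name: "smugmug", api: "soap" },
];

// Protect: keep only the records the requesting application may see.
function protect(records, app) {
  return records.filter((r) => r.visibleTo.includes("*") || r.visibleTo.includes(app));
}

// Relate: join each stub to its container record on the container name.
function relate(records, containerTable) {
  const byName = new Map(containerTable.map((c) => [c.name, c]));
  return records.map((r) => ({ ...r, api: byName.get(r.container).api }));
}

// Query: apply a WHERE-style predicate to the related records.
function query(records, where) {
  return records.filter(where);
}

const result = query(relate(protect(stubs, "floe.tv"), containers), (r) => r.api === "rest");
console.log(result.map((r) => r.uri));
```

Filtering by permission before relating or querying means a record the application may not see never reaches the later steps.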

DNS as a Metaphor

DNS allows us to take a domain name and translate it into an IP address; this is interesting from the standpoint of our need to resolve an openID-like token into a set of data container uris for a given data type { social graph, images, videos, ? }
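A minimal sketch of that kind of resolution, assuming a two-step lookup loosely modeled on DNS delegation. The `providers` and `indexes` tables, the `.example` URIs, and the `resolve` function are hypothetical; the identifier form follows the 'joe@oid.floe.tv' example used later in this document.

```javascript
// Step 1 table (hypothetical): identity provider domain -> the data index it points to.
const providers = {
  "oid.floe.tv": "http://inode.example/index",
};

// Step 2 table (hypothetical): data index -> per-user, per-type container uris.
const indexes = {
  "http://inode.example/index": {
    "joe@oid.floe.tv": {
      "images": ["http://flickr.example/api", "http://smugmug.example/api"],
      "social graph": ["http://myspace.example/api"],
    },
  },
};

// Resolve an identifier plus a data type into a list of container uris.
function resolve(identifier, dataType) {
  const domain = identifier.split("@")[1];
  const indexUri = providers[domain];
  if (!indexUri) return []; // unknown provider, comparable to a name that does not resolve
  const records = indexes[indexUri][identifier] || {};
  return records[dataType] || [];
}

console.log(resolve("joe@oid.floe.tv", "images"));
console.log(resolve("joe@oid.floe.tv", "videos"));
```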

Abstract Art

So at this point we've talked about what we want, listed some properties of those requirements, and talked about existing technologies that perform similar functions. It might be a good time to try and develop an abstraction of what we want so we can get a clearer understanding of how it might work.

Application Layer

At the top of most stacks or abstraction layers is the application layer. It is simply the endpoint where the request begins (most of the time) and where the result comes back to. (Example: a sql query in a winform application is executed via an ADO.NET connection at the application layer, gets sent to the database, is checked for validity, is translated into a query tree, and is executed against tables; the database constructs and relates the data into a set of records, which are then returned to the application layer.)

With something like what we are proposing, our application layer could be a number of platforms, but a good example might be a flash application that wants to construct a slide show of all images for openID = 'joe@oid.floe.tv' regardless of whether they are stored on smugmug or flickr or wherever.

Query and Aggregation Layer

Processes data, relates data, aggregates data together, returns it back to the application layer.

Needs to be accessible either as an API or as a SQL-Like query language.

Discovery and Translation Layer

Knows how to find data based on an openID-type identifier. This layer might query the Identity provider for the location of the data-indexing service for this user, which for now we will call the "WebInode Server", and its interface might look like this (should return dummy stub data). This mechanism simply tells the discovery layer "hey, this user has data in X, Y, and Z data providers". These results are then used to query each data store for what data the user might have in each of them, and the results are returned to the parent layer for aggregation.

Storage Layer

So now we've hit the bottom layer of our model, where the data itself is actually stored. This layer needs to represent a place where any type of data is stored and be able to be queried about its contents for a specific userID. It also should protect the data so as to protect the user and the data provider, but allow for access, given the correct credentials, to the actual data/file itself to the proper parties.

Filling in the Blanks

So now we've got these nice layers to break things up, abstract away the details, and let us focus on managing the process. But really, if you leave those layers, well, abstract, then all they really do is end up in a research paper, and we are trying to make something work (sooner rather than later) here. So let's see what we have lying around that might work for our model.

For the time being, we'll call the system "WRFS" as in "Web Relational File System"

Application Layer

Pretty much everything { web-app, flash app, winform, shell script } can be an application, so we don't have to do much here.

Query and Aggregation Layer

To make things simple, let's say javascript is our api language, since this is aimed initially at the web2.0 world. Let's say that we might make a call like:

var oProxy = new WRFSProxy();
oProxy.GetDataFor( strOpenIDIdentifier, WRFS.DATA_TYPE_IMAGE, MyCallbackFunction );


Under the hood of that javascript call this layer would use an http-calling mechanism (depending on implementation) such as:
  • SOAP
  • REST
  • JSON
Let's say for the sake of this exercise that we are using openID to manage our identity, and we send off a quick webservice call to the domain in the openID identifier (along with any security tokens) to say "hey, where does this person store their data index?", and let's assume that at this point openID had a mechanism or field in place that pointed to that server with a URI.
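To make the two hops concrete, here is one way the proxy call might be sketched. This is not a real library: the `transport` parameter, the request shapes, the value of `WRFS.DATA_TYPE_IMAGE`, and the `.example` URIs are all assumptions. Unlike the call shown earlier, this constructor takes a transport argument so the sketch can run against an in-memory fake instead of a network.

```javascript
// Constants namespace used by the call in the text; the value is an assumption.
const WRFS = { DATA_TYPE_IMAGE: "image" };

// `transport` is a hypothetical stand-in for the http-calling mechanism.
// It takes a request object and a callback, so a real client could be swapped in.
function WRFSProxy(transport) {
  this.transport = transport;
}

WRFSProxy.prototype.GetDataFor = function (identifier, dataType, callback) {
  // First hop: ask the domain in the identifier where the data index lives.
  const domain = identifier.split("@")[1];
  this.transport({ ask: "index-location", domain, identifier }, (indexUri) => {
    // Second hop: ask that index for stubs of the requested type.
    this.transport({ ask: "stubs", indexUri, identifier, dataType }, callback);
  });
};

// In-memory fake transport (illustrative data only).
function fakeTransport(request, done) {
  if (request.ask === "index-location") {
    done("http://inode.example/index");
  } else {
    done([{ uri: "http://flickr.example/joe/1.jpg", type: request.dataType }]);
  }
}

const oProxy = new WRFSProxy(fakeTransport);
oProxy.GetDataFor("joe@oid.floe.tv", WRFS.DATA_TYPE_IMAGE, function MyCallbackFunction(stubs) {
  console.log(stubs);
});
```

The callback style matches the original call signature; a real transport would invoke the callbacks asynchronously once each http response arrives.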

Discovery and Translation Layer

The system then calls a service that would return a list of indexed data storage URIs for a given openID identifier. This api might look like this. (Notice the name of the sample server? Inode.asmx --- We make the allusion that the data indexing service is essentially the "inode" of the web file system.) The system then takes each URL/URI and queries that server for a list of relevant files, which are returned possibly as xml in the RDF dialect. There are some issues here, though. Is everyone going to let us come tromping through their front door just looking for random people's data? More than likely not, but maybe, just maybe, if we knock the right way, they might let us in.
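Since some containers may well refuse us, the fan-out step probably has to tolerate refusals rather than fail outright. A minimal sketch, assuming a hypothetical `fetchStubs(uri, identifier)` function that either returns stubs or throws; the `.example` URIs and the refusal rule in the fake are made up.

```javascript
// Query every container for a user's stubs, collecting results and refusals
// separately so one closed door does not sink the whole request.
function discover(identifier, containerUris, fetchStubs) {
  const found = [];
  const refused = [];
  for (const uri of containerUris) {
    try {
      found.push(...fetchStubs(uri, identifier));
    } catch (err) {
      refused.push({ uri, reason: err.message });
    }
  }
  return { found, refused };
}

// Fake container access for illustration: one container says no.
function fakeFetchStubs(uri, identifier) {
  if (uri === "http://smugmug.example/api") throw new Error("access denied");
  return [{ uri: uri + "/" + identifier + "/1.jpg", container: uri }];
}

const report = discover(
  "joe@oid.floe.tv",
  ["http://flickr.example/api", "http://smugmug.example/api"],
  fakeFetchStubs
);
console.log(report.found.length, report.refused);
```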

Data Storage Layer

Now, we just established that we look at the data-indexing service as a sort of filesystem "inode". So if we are storing the actual user's data on a webserver, what does that make it? A disk block, in a lot of ways. Just like with a normal filesystem in Linux, a single jpeg image might be stored in 50 different disk blocks scattered around the hard drive, and a single inode points to all the disk blocks. Here, the data index service, or inode, points to all of the servers that a user has data in. And just as the Linux filesystem uses "bmap" to take an inode and find disk blocks, our "javascript system call" uses the data-indexing service to find the user's stored data "blocks".

But whoa now: who can do what with whose data, and how? That's a very big issue that a lot of people are looking at, and we believe the emerging OAuth spec might just be the solution for this. OAuth does a lot of cool things, but in the end, it essentially says "yes, the application at floe.tv can mashup the image data at flickr for user X". That sounds very, very handy for what we would like to do here. What if each data storage service, like say a flickr, exposed an open, standard web API that allowed for automated OAuth mechanisms? (Generally, the way we understand it, a user currently has to be redirected to the data storage page, click "yes", etc., and then be redirected back to the 3rd party application site.)

So let's pretend for a moment that OAuth worked with our "Web Inode" and allowed for caching of its security tokens, and allowed us to quickly query the data storage layer via the js/webservice mechanism to get the list of relevant data stored on that particular web server for that particular person. This stub data is then returned to the "Discovery and Translation Layer" via { SOAP, REST, JSON } and then sent back up the stack to the "Query and Aggregation Layer" to be combined or aggregated into a single recordset or data structure { RDF, SIOC, XML } to be passed back to the Application Layer.

The one missing piece in this sequence is: how did the so-called "web inode" KNOW in the first place that there was data in flickr for openID X? To that I'd suggest simply adding a web-method like this to the "web inode" and then requiring each data storage container to register the fact that the user has data in their container. Just a simple, single web-method call the first time a user enters some data in their service.
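The registration call and the token cache might be sketched like this. The `WebInode` class and its method names are hypothetical, and the token is just an opaque string: no actual OAuth flow is implemented or implied here.

```javascript
// Hypothetical in-memory "web inode". A real one would sit behind a web api.
class WebInode {
  constructor() {
    this.index = new Map();  // identifier -> Map(dataType -> Set of container uris)
    this.tokens = new Map(); // "identifier|containerUri" -> cached opaque token
  }

  // Called by a container the first time a user stores data there.
  registerContainer(identifier, dataType, containerUri) {
    if (!this.index.has(identifier)) this.index.set(identifier, new Map());
    const byType = this.index.get(identifier);
    if (!byType.has(dataType)) byType.set(dataType, new Set());
    byType.get(dataType).add(containerUri); // a Set makes repeat calls harmless
  }

  // Answers "where does this user keep data of this type?"
  listContainers(identifier, dataType) {
    const byType = this.index.get(identifier);
    return byType && byType.has(dataType) ? [...byType.get(dataType)] : [];
  }

  // Stores whatever token the user's one-time approval produced.
  cacheToken(identifier, containerUri, token) {
    this.tokens.set(identifier + "|" + containerUri, token);
  }

  getToken(identifier, containerUri) {
    return this.tokens.get(identifier + "|" + containerUri) || null;
  }
}

const inode = new WebInode();
inode.registerContainer("joe@oid.floe.tv", "image", "http://flickr.example/api");
inode.registerContainer("joe@oid.floe.tv", "image", "http://flickr.example/api");
inode.cacheToken("joe@oid.floe.tv", "http://flickr.example/api", "opaque-token");
console.log(inode.listContainers("joe@oid.floe.tv", "image"));
```

Keeping registration idempotent means a container can call it on every upload without the index growing duplicates.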

Example Use Case

Let's say we want to tackle another problem, say like getting all friends of a user regardless of social network (hey! wait a minute! that sounds an awful lot like that whole Open Social Graph nonsense. You can't do that! --- but oh wait, we can). In our as3 code we might use an api call, or we might use some sort of WebSql call like

SELECT * FROM [Global Social Graph] WHERE [openID] = 'joe'

The Query and Aggregation Layer would break this query down into a query tree, and then pass on the query to the discovery layer. The discovery layer might find out that userID = 'joe' has social graph data in myspace, and then it sends a REST request to myspace.com (with cached OAuth tokens handling security and permissions), possibly with parameters, to find out which friends 'joe' has stored there, which might be returned as RDF data. This data would then be passed up to the Query and Aggregation layer, to be recombined with the data the process also got from ning.com, and presented to the Application layer as a unified recordset.
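A toy version of that flow, to show the shape of it. The one-pattern `parseQuery` is nowhere near a real SQL parser, the friend lists are invented, and treating the same name on two networks as the same person is an assumption a real system could not make.

```javascript
// Turn the one supported query shape into a tiny "query tree".
function parseQuery(sql) {
  const m = /^SELECT \* FROM \[(.+?)\] WHERE \[(.+?)\] = '(.+?)'$/.exec(sql.trim());
  if (!m) throw new Error("unsupported query: " + sql);
  return { from: m[1], where: { field: m[2], equals: m[3] } };
}

// Invented per-network friend lists standing in for the REST responses.
const networks = {
  "myspace.com": { joe: ["ann", "bob"] },
  "ning.com": { joe: ["bob", "cat"] },
};

// Run the query against every network and merge into one recordset.
function run(sql) {
  const tree = parseQuery(sql);
  if (tree.from !== "Global Social Graph") throw new Error("unknown source: " + tree.from);
  const merged = new Map(); // friend name -> networks the friend was found on
  for (const [network, friendsByUser] of Object.entries(networks)) {
    for (const friend of friendsByUser[tree.where.equals] || []) {
      if (!merged.has(friend)) merged.set(friend, []);
      merged.get(friend).push(network);
    }
  }
  return [...merged].map(([friend, sources]) => ({ friend, sources }));
}

const rows = run("SELECT * FROM [Global Social Graph] WHERE [openID] = 'joe'");
console.log(rows);
```

Each row keeps the list of networks it came from, so the application layer can still tell where a friend was found after aggregation.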

Color By Number

So really, if you think about it, most everything we need is either already in use, or sitting in a final spec stage (ok, we still need some standardization, and some minor extensions to openID and OAuth, but those aren't particularly monster obstacles). It would really only take a group of open-minded people that said "you know, that sounds cool, and we can be a part of a sum that is greater than its parts. Let's give that a shot!" and form up into a sort of "open data alliance" and get to work. But ...

Why would they? Where does that take us? What are the consequences?

No one is ever going to open their data up into some crazy united disk system. How do you make money? You mean we can't lock the user into our site? What? This is completely LUDICROUS. Simply crazy talk.

Markets and Equilibrium

The Funny Thing Is...

There are always pressures on any market, especially markets that are far from mature, and especially one like the internet and data storage.

Let's think about a few things first...
  • Why would any sane commercial company allow their data to be pulled into another web site? Now think about how that is phrased for a moment, and then replace "web site" with "application". We'll come back to that in a minute.
  • Why would they allow that? That's a trick question, really: the data does not belong to them, because after a certain point of "market equilibrium" users simply will not allow a "silo" to tell them how to use their data. Oh, you don't think so? Let's go back a few years.
  • What if, back in 1996, you had a desktop application like Quicken, and you could never get your data out of that application for backup, or transfer, or in any type of format? What would you do? I'm going to guess that you might find another application to manage your finances, but that might not hold for everyone. However, at this point, for desktop applications that simply would not be tolerated.
  • So let's ask another question: Does Intuit "own" your Quicken data? If they told you that you could not export your data, and it was locked into one machine, at this point what would your reaction be? Over time, the currently held notion that a "silo" can control your data will melt away due to market forces. Why, you ask?

The Equilibrium of the Economics of Data Storage

Because users will tend towards who gives them the best deal, and they can and will move their data to a system that gives them the maximum value for their data, and being able to inter-relate data makes data more valuable. In the end, since the commodity of hosting images, social graphs, and videos is easily setup, the "containers" hold no real power in the long run. To illustrate this, let's take a look at the convenience store business.

Myspace as the Gas-N-Go of the internet

Convenience stores make nearly no money off of selling gasoline, maybe a penny a gallon. However, they continue to sell gasoline. Why? Because it attracts motorists to their location, and provides them with the opportunity to sell milk, candy, bread, etc. Carrying the gas is essentially the overhead of marketing their location and attracting customers. The store owners are simply a third party to the gasoline transaction yet are able to make a profit from the percentage of motorists who also drop into the store. Who holds the power? Ultimately the consumer does, since they can get gas elsewhere if the terms of the station do not meet their needs, although it could be argued that gas prices are controlled by the gas companies themselves, but I digress. So what does this have to do with the economics of storing, aggregating, and protecting data?
  • Users, just like motorists, can take their business elsewhere. They can. Really.
  • The user is essentially the motorist AND the gas company (crazy, I know) in this metaphor. The user "creates" the product, and also "accesses" the product. If a gas station/data container wants to begin to restrict how or where the user can use their data, the user can simply "go to a different gas station", or push their data into a new "container" that supports the open data alliance. At some point the business of being a "data container" will come down to being essentially a "convenience store": the overhead of data storage and network costs is offset by the opportunity to get people to look at your ads (which are essentially the bread, milk, and candy of the store metaphor). Not everyone will go to the site to look at the ads, but just like with the gas station, they don't have to go inside to buy things; they can just get gas. The total of all the users' data storage and network costs is simply overhead to deal with in order to get the opportunity to serve a percentage of the users a few ads.
  • Some very large internet companies seem to really have an issue with the idea of open data, and possibly, if I was on their board, I might be concerned with its threat to a very profitable business model. Don't some of these sites also have positions that might not align with old business models of entities like the MPAA and the RIAA? Just like they are telling these "old media" companies their attitudes need to change, some of the new media companies might need to change their attitudes as well.

But in the end

It all comes down to the perception of value in the eyes of the customer, what they will bear in terms of restrictions, and what alternative choices they have.

I for one would like to give them a new choice and let the market decide.

Josh Patterson
(email: jpatterson @ [insert floe.tv here ] )
floe.tv


 
