Hi, I read most of it and I also read your posting on mongodb-users ;)
Basically, most of it is also what I had in mind ... just some notes:
1. I wouldn't do a "isAdmin" property for users. Rather, create one or
more groups that are marked as isAdmin and then add the users to that
group. This is basically how it is done nowadays in Linux via the
/etc/sudoers file where a group "admin" is marked as being special and
the sudo command checks if the user is in the group admin.
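A minimal sketch of this idea in Python, assuming hypothetical group documents with "isAdmin" and "users" fields (these field names are only for illustration and are not taken from the proposed schema):

```python
# Hypothetical in-memory stand-ins for documents in a groups collection.
groups = [
    {"name": "admin", "isAdmin": True, "users": ["alice"]},
    {"name": "classA", "isAdmin": False, "users": ["alice", "bob"]},
]

def is_admin(username, groups):
    """Return True if the user belongs to at least one group flagged isAdmin."""
    return any(g.get("isAdmin", False) and username in g["users"] for g in groups)

print(is_admin("alice", groups))
print(is_admin("bob", groups))
```

The user documents themselves carry no admin flag here; being an admin is purely a matter of group membership.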
2. The permissions: I don't really understand them. Why are they in each
group? I understand that each worksheet has a list of users (owners)
with permissions and groups with permissions, but at the group level I
don't get it. What does db.groups.users.{ perms: 1 or 0 } actually do?
I think just a list of usernames is enough.
Do you know ACLs [1]? Maybe compare your mix of usernames and
permissions with the approach over there and do it that way.
I would therefore attach such an ACL to each worksheet.
3. Do worksheets reference a collection of cells? Each document has a
4MB limit ... I know, that's a lot and it will probably never be hit,
but with some crazy long output it might happen. Second,
updates on worksheets only happen at the cell level, never on the
whole document. I know mongodb has the ability to update a part of a
document via the update command, but I think it's easier to have a
collection of all cells and reference them.
But still, when a cell is updated, only its "out" field is modified.
Therefore, I propose to define a worksheet as a list of cells [or
later and more advanced, also as a nested list of worksheets ... i.e.
to make it possible to reference to another worksheet, to do "dynamic"
embeddings, sections/parts, meta-worksheet-documents ...].
Additionally, I could also envision cases where each of those cells
gets additional permissions, e.g. a "lock" that adds an "all: ---"
permission, so that nobody is able to edit a cell. (editing the
permissions is of course still possible for the users and the users in
the associated groups, if they are allowed)
4. something trivial, instead of
out: [{ t:"stdout", data: "..."} , {t:"stderr", data: "..."}]
please just do
out: { stdout: "...", stderr: "..." }
Mongodb can list all keys of such an associative array, so there is no
need for this {t: "..."} thing.
(or even better, get rid of "out"; just a stdout and a stderr key are
good enough since their relative ordering doesn't matter.)
5. Images should probably be referenced explicitly, i.e. out: { img:
<file-id-reference> }
6. Same as 5 for data files attached to worksheets. That's probably
what you mean by db.files anyway ... but it would be nice to know
when attached files are no longer needed, to be able to run a
background process that removes unreferenced files.
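A rough Python sketch of how such a cleanup pass could find unreferenced files, assuming worksheets keep file ids in a hypothetical "attached" list and in the "img" key of cell outputs (both layouts are assumptions for illustration only):

```python
def unreferenced_file_ids(worksheets, files):
    """Return ids of stored files that no worksheet or cell output points to."""
    referenced = set()
    for ws in worksheets:
        # Files attached to the worksheet as a whole (hypothetical "attached" key).
        referenced.update(ws.get("attached", []))
        # Images referenced from individual cell outputs.
        for cell in ws.get("cells", []):
            img = cell.get("out", {}).get("img")
            if img is not None:
                referenced.add(img)
    return {f["_id"] for f in files} - referenced

worksheets = [
    {"_id": "ws1", "attached": ["f1"], "cells": [{"out": {"img": "f2"}}, {"out": {}}]},
]
files = [{"_id": "f1"}, {"_id": "f2"}, {"_id": "f3"}]
orphans = unreferenced_file_ids(worksheets, files)
print(orphans)
```

A background job would then delete the returned ids; a real one would also have to cope with files uploaded while the scan is running.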
h
[1] http://linux.die.net/man/5/acl
first, look at the "long text form", then read the algorithm for checking it.
I'm really glad you're thinking about architecture already, and even
put the effort in to write something this detailed up.
That said, I am personally not going to do any work on something where
the goal is just to "get something up and running during bug days".
Whatever I do, the goal will be to do something that is possible to
_finish_ by Jan 14, and that is genuinely usable. Implementing something
with the same functionality as the notebook, but scalable -- basically
from scratch -- is hard.
But the actual work you're doing (e.g., database schemas, diagrams to
explain how things should work) is all very applicable to doing
additional work directly to the notebook that will make it much more
scalable.
-- William
> (The document is also available
> at https://docs.google.com/document/d/1uYJXPAWypGgb92QStJ19cW-29y4-hn5hi8oXMR-11TU/edit?hl=en&authkey=CISp9cQB
> )
--
William Stein
Professor of Mathematics
University of Washington
http://wstein.org
Alex -- can you also post your document to the wiki (or a link to it)?
http://wiki.sagemath.org/Notebook%20scalability
>
> Hi, I read most of it and I also read your posting on mongodb-users ;)
>
> Basically, most of it is also what I would have in my mind ... just some notes:
>
> 1. I wouldn't do a "isAdmin" property for users. Rather, create one or
> more groups that are marked as isAdmin and then add the users to that
> group. This is basically how it is done nowdays in linux via the
> /etc/sudoers file where a group "admin" is marked as being special and
> the sudo command checks if the user is in the group admin.
>
> 2. The permissions, I don't really understand it. Why are they in each
> group? I understand that each worksheet has a list of users (owners)
> with permissions and groups with permissions, but at the group level I
> don't get it. What does db.groups.users.{ perms: 1 or 0 } actually do?
> I think, just a list of usernames is enough.
>
> Do you know ACLs [1]? Maybe compare your mix of usernames and
> permissions with the approach over there and do it this way!
> Therefore, I would attach such a ACL list to each worksheet.
Harald, any chance you could create some example MongoDB documents
that illustrate the use of ACLs? This would be very helpful.
> 3. Worksheets reference to a collection of cells? Each document has a
> 4mb limit ... I know, that's a lot and it will probably never be hit,
> but if there is some crazy long output it might happen.
Two comments:
* the 4MB limit is officially going to be raised to 16MB soon.
* all this database stuff is aimed (in my mind) mainly to be used
by large notebook server deployments like sagenb.org, which will have
say 100,000 users. Having a <=16MB limit per worksheet is a really
good idea no matter what, even if mongodb didn't enforce it. So I
have no problem with having such a limit in our database (per
worksheet). We really really don't want people trivially making
1 terabyte worksheets (right now with sagenb.org, it would be possible
for somebody to do that!).
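A minimal Python sketch of enforcing such a per-worksheet cap in the application itself. The size is approximated with the UTF-8 JSON encoding, which is an assumption made to keep the example dependency-free; the real BSON size will differ somewhat. The names are hypothetical.

```python
import json

MAX_WORKSHEET_BYTES = 16 * 1024 * 1024  # assumed cap, mirroring the planned 16MB limit

def worksheet_size(worksheet):
    """Approximate the stored size by the length of the UTF-8 JSON encoding."""
    return len(json.dumps(worksheet).encode("utf-8"))

def check_size(worksheet, limit=MAX_WORKSHEET_BYTES):
    """Return the approximate size, or raise ValueError if it exceeds the limit."""
    size = worksheet_size(worksheet)
    if size > limit:
        raise ValueError("worksheet is %d bytes, limit is %d" % (size, limit))
    return size

small = {"title": "demo", "cells": [{"in": "1+1", "out": {"stdout": "2"}}]}
print(check_size(small))
```

Calling check_size before every save would reject an oversized worksheet regardless of what the database allows.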
> Second,
> updates on worksheets only happen on the cell level, never on the
> whole document. I know, mongodb has the ability to update a part of a
> document via the update command, but I think it's easier to have a
> collection of all cells and reference to them.
I'm not sure. If you read the mongodb documentation/books, the way Alex
laid things out (with all cells in a single document) is repeatedly
recommended as the way to go. Updating parts of documents with mongodb
is very robust, in my experience.
Also, the data locality (having all the cells in the same document) is
evidently a big win efficiency-wise.
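To make the partial update concrete, here is a small Python sketch that builds a "$set" update document addressing one embedded cell by its array index using dot notation. The field names ("cells", "out") are assumptions about the schema, and apply_set is only a toy local stand-in for what the server does, so that the example runs without a database.

```python
def cell_output_update(cell_index, stdout, stderr):
    """Build a $set update document that touches only one embedded cell's output."""
    prefix = "cells.%d.out" % cell_index
    return {"$set": {prefix + ".stdout": stdout, prefix + ".stderr": stderr}}

def apply_set(document, update):
    """Toy stand-in for $set on dotted paths (handles existing paths only)."""
    for path, value in update["$set"].items():
        *parents, last = path.split(".")
        target = document
        for key in parents:
            # Numeric path components index into arrays, others into sub-documents.
            target = target[int(key)] if isinstance(target, list) else target[key]
        target[last] = value
    return document

ws = {
    "title": "demo",
    "cells": [
        {"in": "1+1", "out": {"stdout": "", "stderr": ""}},
        {"in": "2+2", "out": {"stdout": "", "stderr": ""}},
    ],
}
update = cell_output_update(1, "4", "")
apply_set(ws, update)
print(ws["cells"][1]["out"])
```

With a driver, the same update document would be passed to the collection's update call together with a filter on the worksheet's _id (in current PyMongo that is update_one), so only the changed fields travel to the server.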
> But still, when a cell is updated, only it's "out" field is modified.
Its "in" field can also be modified, right, e.g., when you modify the
input? And maybe somebody even changes the type (why not?).
> Therefore, I propose to define a worksheet as a list of cells [or
> later and more advanced, also as a nested list of worksheets ... i.e.
> to make it possible to reference to another worksheet, to do "dynamic"
> embeddings, sections/parts, meta-worksheet-documents ...].
A worksheet can't be defined to be a list of cells, since there is
lots of other meta data, e.g., the title, owner, etc.
One can certainly have references to other worksheets, etc., with
Alex's proposed schema, right? (via the _id field).
> Additionally, I could also envision cases, where each of those cells
> get additional permissions, e.g. a "lock" that adds a "all: ---"
> permission, so that nobody is able to edit a cell. (editing
> permissions is of course still possible by the users and users in the
> associated groups if they are allowed)
That would already fit fine with Alex's proposal. It would be good to
add it as an example to his document though. It's just another
key:value in one of the cells.
Alex, I don't think you should use an _id field in the individual
cells though. They aren't complete mongodb documents themselves, so
don't have to have an "_id" field, and if they do it isn't treated
specially like the _id of a complete mongodb document (which is
forced to be unique, etc.). Thus using _id could be misleading.
>
>
> 4. something trivial, instead of
> out: [{ t:"stdout", data: "..."} , {t:"stderr", data: "..."}]
> please just do
> out: { stdout: "...", stderr: "..." }
> Mongodb allows to list all keys in such an associative list and no
> need for this {t: "..."} thing.
> (or even better, get rid of "out" and just a stdout and stderr key is
> good enough since their relative ordering doesn't matter.)
+1 -- very good idea.
> 5. Images might probably be referenced explicitly, i.e. out: { img:
> <file-id-reference> }
There should be no actual disk-based files. The images should
themselves be stored in mongodb. It can store binary data (like
images) just fine -- mongodb's main target domain is web applications,
where multimedia data is very common.
> 6. same as 5. for data files attached to worksheets. That's probably
> what you mean with db.files anyways ... but it might be nice to know
> when attached files are no longer needed to be able to run a
> background process that removes unreferenced files.
db.files uses "GridFS" which is something mongodb provides for storing
large binary data (which can be much bigger than 4MB).
I'm fine with attached files also having a 4MB (or later 16MB) limit,
again at least for big servers like sagenb.org. Thus using GridFS
for this application (the Sage notebook) isn't (in my mind) necessary.
> h
>
>
> [1] http://linux.die.net/man/5/acl
> first, look at the "long text form", then read the algorithm for checking it.
>
--
> 4. something trivial, instead of
> out: [{ t:"stdout", data: "..."} , {t:"stderr", data: "..."}]
> please just do
> out: { stdout: "...", stderr: "..." }
> Mongodb allows to list all keys in such an associative list and no
> need for this {t: "..."} thing.
> (or even better, get rid of "out" and just a stdout and stderr key is
> good enough since their relative ordering doesn't matter.)
+1 -- very good idea.
> 5. Images might probably be referenced explicitly, i.e. out: { img:
> <file-id-reference> }
A compromise between your two suggestions is:
out: [{stdout:"out1"}, {stderr:"err1"}, {stdout:"out2"},
{stderr:"err2"}, {image:"foo.png"}]
>
>
>> > 5. Images might probably be referenced explicitly, i.e. out: { img:
>> > <file-id-reference> }
>
> I was thinking that there would be a Plot(...) message, a JMol(,,,) message,
> etc, which would reference files.
out: [{stdout:"out1"}, {stderr:"err1"}, ..., {image:"foo.png"},
{jmol:"foo.jmol"}, ...]
?
In some cases it might make sense to be able to specify coordinates or
other rich data:
{image:"foo.png", position:[3,7]}
This argues for making the output document have a type like you
suggested above, e.g.,
{t:'image', data:'foo.png', position:[3,7]}
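As a sketch of why the typed form is convenient, here is how a consumer could dispatch on the "t" key in Python. The render function and its output format are made up for illustration; only the message shape comes from the examples above.

```python
def render(message):
    """Turn one output message into a display string, dispatching on its type 't'."""
    kind, data = message["t"], message["data"]
    if kind in ("stdout", "stderr"):
        return "[%s] %s" % (kind, data)
    if kind == "image":
        # 'position' is optional rich data attached to the message.
        pos = message.get("position")
        return "<image %s at %s>" % (data, pos) if pos else "<image %s>" % data
    # Unknown types are not an error; they just get a placeholder.
    return "<unknown %s>" % kind

out = [
    {"t": "stdout", "data": "out1"},
    {"t": "stderr", "data": "err1"},
    {"t": "image", "data": "foo.png", "position": [3, 7]},
]
rendered = [render(m) for m in out]
print(rendered)
```

Because the list keeps its order, interleaved stdout and stderr chunks are shown in the order they were produced, and extra keys such as position do not disturb consumers that ignore them.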
> Currently in the notebook, any computation output is just a stream of bytes.
> But that stream contains different kinds of data - stdout, stderr, latex,
> plots, html tables, jmol plots, references to data files that the cell
> created, etc. So why not have the computation output be that series of
> "messages"?
> - Alex
That does make sense.
William
Thanks for the clarification. But that's already exactly how I
interpreted what you are proposing.
Ok, here is my idea of how access control could be implemented in a
general way for worksheets. I understood that you wanted to optimize
it, but I think that's not possible the way you did it (or I didn't
understand it).
First, let me define a permissions datatype I'm comfortable with; you
can change it to whatever you like. I think it's just better to store
it in one entry as a string for easy understanding. It's a string of
three characters: "rwt" where r=read, w=write, t=delete. If a
permission is set, the letter is in the string; if the permission is
not set, there is a dash ("-") in its place, e.g. "r--" means
"read-only". There are no shortened forms, i.e. it is always a
three-character string of r, w, t or "-". And yes, that's a bitmask.
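To illustrate the bitmask remark, a small Python sketch that converts between the three-character string and an integer mask. It uses the letters r, w, t in that order, as in the examples in this message; the bit assignment (r=1, w=2, t=4) is an arbitrary choice for the sketch.

```python
FLAGS = "rwt"  # position i in the string corresponds to bit i in the mask

def to_mask(perms):
    """Convert e.g. 'r--' to an integer bitmask; reject malformed strings."""
    if len(perms) != len(FLAGS):
        raise ValueError("expected %d characters: %r" % (len(FLAGS), perms))
    mask = 0
    for i, (ch, flag) in enumerate(zip(perms, FLAGS)):
        if ch == flag:
            mask |= 1 << i
        elif ch != "-":
            raise ValueError("unexpected %r at position %d" % (ch, i))
    return mask

def to_string(mask):
    """Convert a bitmask back to its three-character form."""
    return "".join(f if mask & (1 << i) else "-" for i, f in enumerate(FLAGS))

print(to_mask("r--"), to_mask("rwt"), to_string(3))
```

Storing the string keeps the documents readable, while the mask form is handy if permissions ever need to be combined with bitwise operations.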
Basic elements:
1. There are users, they have a UID.
2. There are groups, identified by a GID; each of them is a list of
UIDs. All UIDs in such a group are the members of the group, and a user
can be a member of any group.
3. The worksheets, that's where the permissions are set and they
correspond to files in the ACL scheme of unix filesystems.
A worksheet has three entries for permissions:
1. A list of users:
example: [ UID1 : "rwt"] ... this means the user with UID1 has r, w and t rights.
example: [ UID1: "rwt", UID2: "---" ] ... this means the same as
above, but additionally (no matter what else is defined in the
worksheet) UID2 has no access to anything! (This can be used to
explicitly block access for UID2 although he/she might be a member of
a group GID1 that has access)
2. A list of groups (basically the same as above...)
example: [GID1: "rwt"] ... if current_user in GID1 -> rwt permissions.
3. An "other" permission, that is checked last if none of the above
match. This is "---" by default and if it is set to "r--" the
worksheet is "published" and "rwt" makes it "wiki-like".
The algorithm starts with the users, first to last, then the groups,
first to last, then "other". As soon as there is a match, it applies
that entry's permissions and returns. That is how users can be excluded
although they are members of a given group, and so on.
Classroom example: a teacher gives read access to his/her worksheet to
his/her students:
users = [ "teacherA": "rwt" ]
groups = [ "classA" : "r--" ]
other = "---"
(note that teacherA is a member of classA, but he/she is the only one
with full access)
"Inverted Classroom": student gives access to his/her worksheet to
others, but not to the teacher (who is also member of group "classA"):
users = [ "student1" : "rwt", "teacherA" : "---" ]
groups = ["classA": "r--" ]
other = "---"
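The lookup order and the two classroom examples above can be sketched in Python as follows. The ACL entries are kept as ordered lists of (id, permissions) pairs so that "first to last" is well defined; the function and variable names and the extra student2 are assumptions for this sketch.

```python
def effective_permissions(user, worksheet, groups):
    """Return the three-character permission string for `user` on `worksheet`.

    Order: explicit user entries, then group entries, then "other".
    The first match wins, so a user entry overrides any group membership.
    """
    for uid, perms in worksheet["users"]:
        if uid == user:
            return perms
    for gid, perms in worksheet["groups"]:
        if user in groups.get(gid, ()):
            return perms
    return worksheet.get("other", "---")

def can(user, action, worksheet, groups):
    """True if `action` (one of 'r', 'w', 't') is granted to the user."""
    return action in effective_permissions(user, worksheet, groups)

# Group membership: GID -> list of UIDs.
groups = {"classA": ["teacherA", "student1", "student2"]}

classroom = {
    "users": [("teacherA", "rwt")],
    "groups": [("classA", "r--")],
    "other": "---",
}
inverted = {
    "users": [("student1", "rwt"), ("teacherA", "---")],
    "groups": [("classA", "r--")],
    "other": "---",
}

print(effective_permissions("student2", classroom, groups))
print(effective_permissions("teacherA", inverted, groups))
```

In the inverted case teacherA is matched in the user list before the group list is ever consulted, which is exactly what blocks the access.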
I hope that everything is more or less clear?
My proposal for the default permission:
users = [ "current user" : "rwt" ]
groups = []
other = "---"
The remaining question is, who is allowed to add users and groups? I
think they are exactly those who have write access, but that could be
a separate flag, too. For the beginning, I would start simple and add
more permissions later.
What happens on cloning a worksheet? Cloning means copying the
worksheet and setting users = [ current_user : "rwt"]. Additionally, there
should be a dialog asking the user if the permissions should also be
copied (I think, default should be to not copy the permissions, but an
option box to enable copying them)
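A possible Python sketch of the cloning rule, with the copy option off by default. How copied permissions combine with the new owner's entry is not specified above; placing the cloning user first with "rwt" is an assumption of this sketch, as are the names.

```python
import copy

def clone_worksheet(worksheet, current_user, copy_permissions=False):
    """Copy a worksheet; by default the clone is private to `current_user`."""
    clone = copy.deepcopy(worksheet)
    clone.pop("_id", None)  # the database would assign a fresh id on insert
    if copy_permissions:
        # Assumption: keep the old entries, but the cloning user gets full
        # access and is listed first so that this entry matches first.
        others = [(u, p) for u, p in clone["users"] if u != current_user]
        clone["users"] = [(current_user, "rwt")] + others
    else:
        clone["users"] = [(current_user, "rwt")]
        clone["groups"] = []
        clone["other"] = "---"
    return clone

original = {
    "_id": "ws1",
    "title": "Lecture 1",
    "users": [("teacherA", "rwt")],
    "groups": [("classA", "r--")],
    "other": "---",
}
private_copy = clone_worksheet(original, "student1")
shared_copy = clone_worksheet(original, "student1", copy_permissions=True)
print(private_copy["users"], shared_copy["users"])
```

The original worksheet is left untouched in both cases because the clone is a deep copy.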
Last word about admins: I propose that there should be a "special"
group called "admin", and all users who are in that group override
those permissions. Maybe there should also be a central
configuration explicitly defining admin rights (like /etc/sudoers
in Linux) - but that's something for later ...
H