Like Justin said, I have put some thought toward bridging the impasse between shared and personal trees. What follows are my thoughts on how to accomplish this goal. This represents a work in progress, and I don't claim to have all of the answers. I will try to explain my thinking as I go along, and due to the length of my thoughts I will break it up into several parts. Also note that while I will attempt to keep my explanations as simple as possible, I may make reference to computer science theory and/or leave things unstated because they are well understood in my field. Feel free to ask questions, make comments, or vehemently disagree.
The parts I have planned out so far are as follows, with more(?) to come:
Part N=1
- Requirements (What are they?)
Parts N>1:
- Anti-Requirements (What are they?)
- Why use a graph (A case for representing the data as a graph)
- Graph Requirements (Now that we are using a graph, what requirements do we have?)
- Public and private repositories (How do we represent repositories, both public and private, in a way that scales)
- Modeling the tree on the graph (How can we best represent a tree, sources, events, places, etc)
- Abstracting for End Users (How we leverage the power of the system to simplify things for users)
- More parts as necessary...
Part 1 - Requirements:
Before I begin I want to define a few things:
- Element - A discrete set of properties that we care about in the abstract. It could be a person, it could be a person's birth date, it could be a marriage or family. The point is that it doesn't matter what they are when working in the abstract. (As long as our proofs hold in the abstract, real numbers will also work. Go algebra!)
- Atomic Change - A set of changes (create, update, delete) on a set of Elements (in the abstract of course :) ).
- User - An individual who uses the resulting product(s). For the purposes of these discussions, developers are NOT users.
CRUD
We have to Create, Read, Update, and Delete Elements.
Note: You may be thinking "But wait, we also need merge, split, etc." The User does, but we don't need them at a low level. They are merely abstractions. For example, a merge is just a set of updates on one Element and a delete of another Element grouped into one Atomic Change. A split is just a fancy way of saying "read all of the properties of Element X and create Element Y with the same properties". Repeat ad nauseam.
Why do we need this: No CRUD, no data, no point.
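To make the note above concrete, here is a minimal sketch in Python of merge and split expressed purely as create/update/delete operations. Everything in it (the `Op` class, `apply_ops`, `merge`, `split`, and the dict used as a store) is a hypothetical name chosen for illustration, not part of any real design, and the merge rule (copy only the properties the kept Element lacks) is just one assumed policy.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

Store = Dict[str, Dict[str, Any]]  # element_id -> properties (illustrative only)


@dataclass
class Op:
    """One low-level operation on a single Element."""
    kind: str  # "create", "update", or "delete"
    element_id: str
    props: Dict[str, Any] = field(default_factory=dict)


def apply_ops(store: Store, ops: List[Op]) -> None:
    """Apply a group of operations as one unit: all succeed or none do."""
    draft = {eid: dict(props) for eid, props in store.items()}  # work on a copy
    for op in ops:
        if op.kind == "create":
            if op.element_id in draft:
                raise KeyError(f"{op.element_id} already exists")
            draft[op.element_id] = dict(op.props)
        elif op.kind == "update":
            draft[op.element_id].update(op.props)  # KeyError if missing
        elif op.kind == "delete":
            del draft[op.element_id]  # KeyError if missing
        else:
            raise ValueError(f"unknown op kind: {op.kind}")
    # Only reached if every operation succeeded.
    store.clear()
    store.update(draft)


def merge(store: Store, keep_id: str, drop_id: str) -> List[Op]:
    """A merge is an update of one Element plus a delete of the other."""
    missing = {k: v for k, v in store[drop_id].items() if k not in store[keep_id]}
    return [Op("update", keep_id, missing), Op("delete", drop_id)]


def split(store: Store, source_id: str, new_id: str) -> List[Op]:
    """A split is a read of one Element plus a create of another."""
    return [Op("create", new_id, dict(store[source_id]))]
```

The point of the sketch is only that the low level never learns the words "merge" or "split"; it sees a list of CRUD operations that is applied as a unit.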
Full Version History
Any representation of a tree must include a FULL version history. A list of required properties follows:
- Infinite undo - Every Atomic Change must be able to be undone (redo is a nice to have, but not a requirement). What the user sees as an Atomic Change is up for discussion. It may be one date change, or a set of changes, or an import of 1000 people with their sources, "conclusions", etc. All that matters is that we can define an atomic change and undo until the beginning of time.
- Who performed the change - Every Atomic Change must be able to be associated with a User.
- When - When the Atomic Change was performed. A standard UTC timestamp in milliseconds should suffice.
- Explanation - A note detailing what/why/etc. should be able to be attached to every Atomic Change. A UTF-8 string should be sufficient. This may come from the User (probably not), or may be set by the User-facing system to enable programmatic manipulation of changes beyond undo/redo. For example, you could store JSON in here.
- UUID - Every Atomic Change should be able to be universally identified. UUID v4 or v5 should suffice.
Why do we need this: Two reasons. The first is that we need to see who did what. The second is to allow programmatic undo. I will not go over why version control is necessary here, and leave that as an exercise to you, the reader.
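As a rough illustration of the properties listed above, here is a sketch in Python of an Atomic Change record and a history that can be undone back to the beginning. The names `AtomicChange`, `History`, `commit`, and `undo` are my own assumptions for the example, as is the choice to store full before/after snapshots of each touched Element and to have undo simply pop the latest entry. A real system would likely keep undone changes around as well.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

Props = Optional[Dict[str, Any]]  # None means "this Element does not exist"


@dataclass(frozen=True)
class AtomicChange:
    user: str                 # who performed the change
    explanation: str          # free-form UTF-8 note (could hold JSON)
    before: Dict[str, Props]  # state of each touched Element before the change
    after: Dict[str, Props]   # state of each touched Element after the change
    change_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # UUID v4
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


def _copy(state: Dict[str, Props]) -> Dict[str, Props]:
    """Copy a snapshot so later edits cannot alter recorded history."""
    return {eid: (None if p is None else dict(p)) for eid, p in state.items()}


class History:
    def __init__(self) -> None:
        self.elements: Dict[str, Dict[str, Any]] = {}
        self.log: List[AtomicChange] = []

    def _write(self, state: Dict[str, Props]) -> None:
        for eid, props in state.items():
            if props is None:
                self.elements.pop(eid, None)  # delete
            else:
                self.elements[eid] = dict(props)  # create or update

    def commit(self, user: str, explanation: str,
               after: Dict[str, Props]) -> AtomicChange:
        """Record and apply one Atomic Change touching any number of Elements."""
        before = {eid: self.elements.get(eid) for eid in after}
        change = AtomicChange(user, explanation, _copy(before), _copy(after))
        self._write(change.after)
        self.log.append(change)
        return change

    def undo(self) -> Optional[AtomicChange]:
        """Undo the most recent change; returns None at the beginning of time."""
        if not self.log:
            return None
        change = self.log.pop()
        self._write(change.before)
        return change
```

Note that `commit` does not care whether `after` contains one date or an import of 1000 people; the size of an Atomic Change is left to the caller, as described above.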
Public vs Private
Any Element (or any combination of Elements) of the tree must be able to be made "private". I will define private as the ability to allow N Users to access an Element while simultaneously denying all others access. If done correctly and simply, this also serves as a foundation for collaboration, in which N Users share access to the same Elements. Also, whether "access" implies read or read/write is not terribly important, as that problem has many known solutions.
Why do we need this: One good reason is privacy. You can probably think of other good reasons.
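The definition of private above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `AccessControl` class and its method names are hypothetical, an Element with no entry is treated as public, and the read vs read/write question is deliberately ignored.

```python
from typing import Dict, Iterable, Set


class AccessControl:
    """Tracks which Users may access each private Element."""

    def __init__(self) -> None:
        # element_id -> set of allowed users; no entry means public
        self._allowed: Dict[str, Set[str]] = {}

    def make_private(self, element_id: str, users: Iterable[str]) -> None:
        """Allow exactly these N users and deny everyone else."""
        self._allowed[element_id] = set(users)

    def make_public(self, element_id: str) -> None:
        self._allowed.pop(element_id, None)

    def can_access(self, user: str, element_id: str) -> bool:
        allowed = self._allowed.get(element_id)
        return allowed is None or user in allowed
```

With N=1 this is a personal Element, and with N>1 the same mechanism gives a small collaborating group.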
Disagreements
There must exist a mechanism that allows 2 Elements to simultaneously co-exist, be considered the "same", and hold different values. This does NOT mean that a user needs to see and operate on both simultaneously, but that user A can have Element.x=1 and user B have Element.x=2.
Why do we need this: A case has been made by several people that most (if not all) differences can and should be worked out. In this case, that is irrelevant. During the time in which the two parties are trying to agree on the proper disposition of the data, both parties' data needs to be represented and available to view. This is best done natively within the system. It is also good to note that there is a subset of Users who will never agree for one reason or another, and that disagreement must be natively supported. Allowing disagreements has the advantage of reducing edit thrashing.
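One possible way to picture this requirement, sketched in Python: a single Element holds shared values plus per-User overrides, so user A sees Element.x=1 while user B sees Element.x=2. The `DisputedElement` class and its methods are hypothetical names for illustration, and per-User overrides are only one assumed representation, not a proposal for the final model.

```python
from typing import Any, Dict


class DisputedElement:
    """One logical Element whose properties may differ per User."""

    def __init__(self, element_id: str, shared: Dict[str, Any]) -> None:
        self.element_id = element_id
        self.shared = dict(shared)                      # values nobody disputes
        self.overrides: Dict[str, Dict[str, Any]] = {}  # user -> their values

    def set_for(self, user: str, key: str, value: Any) -> None:
        """Record this User's value without touching anyone else's."""
        self.overrides.setdefault(user, {})[key] = value

    def view(self, user: str) -> Dict[str, Any]:
        """What this User sees: shared values, overlaid with their own."""
        return {**self.shared, **self.overrides.get(user, {})}

    def disagreements(self) -> Dict[str, Dict[str, Any]]:
        """Properties on which at least two Users hold different values."""
        by_key: Dict[str, Dict[str, Any]] = {}
        for user, props in self.overrides.items():
            for key, value in props.items():
                by_key.setdefault(key, {})[user] = value
        return {
            key: values
            for key, values in by_key.items()
            if len({repr(v) for v in values.values()}) > 1
        }
```

Neither User has to see the other's value in normal use, but the system can still list every open disagreement when the parties want to work it out.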
Scaling
Any model must allow for scaling OUT (not up). This means that it must be trivial to "shard" or segregate the data in a way that allows a near-linear scaling by adding additional resources and redistributing the data.
Why do we need this: Google for permutations of "why scale out instead of up" or "scale up vs scale out".
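To show what "adding resources and redistributing the data" can look like, here is a small Python sketch that assigns Elements to shards with rendezvous (highest random weight) hashing. This is a technique I am swapping in purely as an example; the requirement above does not prescribe it, and the shard names and the `shard_for` function are hypothetical.

```python
import hashlib
from typing import Iterable


def _score(shard: str, element_id: str) -> int:
    """Deterministic pseudo-random weight for a (shard, Element) pair."""
    digest = hashlib.sha256(f"{shard}:{element_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(element_id: str, shards: Iterable[str]) -> str:
    """Pick the shard with the highest weight for this Element."""
    return max(shards, key=lambda shard: _score(shard, element_id))
```

The useful property is that when a shard is added, the only Elements that move are the ones the new shard wins, so redistribution touches roughly 1/N of the data rather than all of it.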
Power vs Simplicity
The power provided by the system must NOT be exposed directly to the average consumer, but must be made available to the developer. Much like the relationship between git and github.com, the underlying tool and data model must be powerful and flexible enough to support the simplified interface that is presented to the end user. Additional lessons may be drawn from the TCP/IP stack, in which each layer only knows about the layer directly above or below it, with an increasing level of abstraction as one travels up the stack.
Note: This does NOT mean that we should limit ourselves to a simplified data model, only that we must be able to simplify the data model conceptually in order to communicate with end users.
Why do we need this: The system we are designing requires a fairly high level of complexity due to the requirements. While this complexity is necessary, any product built on it will fail if that complexity is exposed to a User. In short, we need that power to design the system, but we must keep it hidden from Users so that they can actually use the product.
Questions and insightful comments are welcomed and greatly appreciated.