The Tree Problem: Part 1 - Requirements

64 views
Skip to first unread message

John Clark

unread,
Feb 11, 2014, 8:50:37 PM2/11/14
to root...@googlegroups.com
Like Justin said, I have put some thought toward bridging the impasse between shared and personal trees. What follows are my thoughts on how to accomplish this goal. This represents a work in progress, and I don't claim to have all of the answers. I will try to explain my thinking as I go along, and due to the length of my thoughts I will break it up into several parts. Also note that while I will attempt to keep my explanations as simple as possible, I may make reference to computer science theory and/or leave things unstated because they are well understood in my field. Feel free to ask questions, make comments, or vehemently disagree.

The parts I have planned out so far are as follows, with more(?) to come:

Part N=1
- Requirements (What are they?)
Parts N>1:
- Anti-Requirements (What are they?)
- Why use a graph (A case for representing the data as a graph)
- Graph Requirements (Now that we are using a graph, what requirements do we have?)
- Public and private repositories (How do we represent repositories, both public and private, in a way that scales)
- Modeling the tree on the graph (How can we best represent a tree, sources, events, places, etc)
- Abstracting for End Users (How we leverage the power of the system to simplify things for users)
- More parts as necessary...

Part 1 - Requirements:

Before I begin I want to define a few things:
  • Element - A discreet set of properties that we care about in the abstract. It could be a person, it could be a person's birth date, it could be a marriage or family. The point is that it doesn't matter what they are when working in the abstract. (As long as our proofs hold in the abstract, real numbers will also work. Go algebra!)
  • Atomic Change - A set of changes(create, update, delete) on a set of Elements (in the abstract of course :) ).
  • User - An individual who uses the resulting product(s). For the purposes of these discussions, developers are NOT users.

CRUD
We have to Create, Read, Update, and Delete Elements.

Note: You may be thinking "But wait. we also need merge, split, etc." The User does, but we don't need them at a low level. They are merely an abstraction. For example, a merge is just a set of updates on one element and a delete of another element grouped into one Atomic Change. A split is just a fancy way of saying "read all of the properties of element X and create element Y with the same properties". Repeat ad nauseam.

Why do we need this: No CRUD, no data, no point.

Full Version History
Any representation of a tree must include a FULL version history. A list of required properties follows:
  • Infinite undo - Every Atomic Change must be able to be undone (redo is a nice to have, but not a requirement). What the user sees as an Atomic Change is up for discussion. It may be one date change, or a set of changes, or an import of 1000 people with their sources, "conclusions", etc. All that matters is that we can define an atomic change and undo until the beginning of time.
  • Who performed the change - Every Atomic Change must be able to be associated with a user
  • When - When was this Atomic Change performed. Standard UTC timestamp in milliseconds should suffice.
  • Explanation - A note detailing what/why/etc should be able to be attached to every Atomic Change. UTF-8 string should be sufficient. This may come from the user (probably not), or may be set by the User facing system to enable programatic manipulation of changes beyond undo/redo. For example, you could store JSON in here.
  • UUID - Every Atomic Change should be able to be universally identified. UUID v4 or v5 should suffice.
Why do we need this: 2 reasons. The first is that we need to see who did what. The second is to allow programatic undo. I will not go over why version control is necessary here, and leave that as an exercise to you, the reader. 

Public vs Private
Any Element (or any combination of Elements) of the tree must be able to be made "private". I will define private as the ability to allow N users to access an Element while simultaneously denying all others access. If done correctly and simply, this also serves as a foundation for collaboration, in which N Users. Also, wether or not "access" implies read vs read write is not terribly important, as that problem has many known solutions.

Why do we need this: One good reason is privacy. You can probably think of other good reasons. 

Disagreements
There must exist a mechanism that allows 2 Elements to simultaneously co-exist, be considered the "same", and hold different values. This does NOT mean that a user needs to see and operate on both simultaneously, but that user A can have Element.x=1 and user B have Element.x=2.

Why do we need this: A case has been made by several people that most (if not all) differences can and should be worked out. In this case, that is irrelevant. During the time in which the two parties are trying to agree on the proper disposition of the data, both parties' data needs to be represented and available to view. This is best done natively within the system. It is also good to note that there are a subset of users who will never agree for one reason or another, and that disagreement must be natively supported. Allowing disagreements has the advantage of reducing edit thrashing.

Scaling
Any model must allow for scaling OUT (not up). This means that it must be trivial to "shard" or segregate the data in a way that allows a near-linear scaling by adding additional resources and redistributing the data.

Why do we need this: Google for permutations of "why scale out instead of up" or "scale up vs scale out".

Power vs Simplicity
The power provided by the system must NOT be exposed directly to the average consumer, but must be made available to the developer. Much like the relationship between git and github.com, the underlying tool and data model must be powerful and flexible enough to support the simplified interface that is presented to the end user. Additional lessons may be drawn from the TCP/IP stack, in which each layer only knows about the layer directly above or below it, with an increasing level of abstraction as one travels up the stack.

Note: This does NOT mean that we should limit ourselves to a simplified data model, only that we must be able to simplify the data model conceptually in order to communicate with end users.

Why do we need this: The system we are designing requires a fairly high level of complexity due to the requirements. While this complexity is necessary, any product built on it will fail if that complexity is exposed to a User. In short, we need the power to design the system, but we had better not expose it to Users so they can actually use the product.

Questions and insightful comments are welcomed and greatly appreciated.

- John Clark

Justin York

unread,
Feb 11, 2014, 10:20:55 PM2/11/14
to root...@googlegroups.com
I agree that it's very useful to have a version history. However, it's difficult for me to articulate why, so I want to explore that.

One advantage of versioning is the ability to rollback (undo the last N actions). Some versioning systems (like git; does anybody else?) take it a step further and let you undo any given action in the past. However, it's not necessarily easy nor pretty. In fact the difficulty of doing that is so high that it's hardly ever done. Ive done it once, ever. Have any of you ever tried? I'm more likely to browse the commit history, copy the old code, and then paste it in and recommit.

FamilySearch has a model we can learn from. They version each fact/event about a person separately so undo is more like "restore this previous value" which is fairly trivial.

Being able to undo all of a given user's actions, or undo an entire gedcom upload, is very compelling. But I'm not convinced we need it if the model becomes too complex. Does versioning enable disagreements? Or is that for a later conversation?


--
 
---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Wayne Pearson

unread,
Feb 11, 2014, 10:30:57 PM2/11/14
to root...@googlegroups.com
I've played with git undos in an artificial manner -- purposefully made a variety of interesting, overlapping changes, and undone one in the middle, just to see how well it went. It went fine, as much as I tried to trick it, but that's because git boils the CRUDding down to patch(1)/diff(1)-level processes, which has a formulaic method to its madness when you chain a bunch together.

Admittedly, however, browse back and copy old code is a hard habit to break.

Versioning doesn't enable disagreements, but branching does. Then we're back to deciding how much of the git model is appropriate... does everyone keep rebasing with the latest (agreed-upon) data until their disagreements are resolved? What if I have three different disagreements - branches of branches, merging them back, and rebasing... luckily, it's only the software developers, and not the users, that ever have to understand how that's done under the hood.

Dallan Quass

unread,
Feb 12, 2014, 10:28:49 AM2/12/14
to root...@googlegroups.com
It all boils down to 3-way diff and patch/merge as Wayne points out.  You mainly need to make sure that the objects you're comparing share a common ancestor version. As for the diff'ing, you could stringify json and use google-diff-match-patch to do this, or you could try one that's json-aware: https://github.com/benjamine/JsonDiffPatch

Ryan Heaton

unread,
Feb 12, 2014, 2:24:12 PM2/12/14
to root...@googlegroups.com
John, I was getting along fine with what you were saying until I got to your "Power vs. Simplicity" section. All the talk about system requirements seemed pretty straightforward (not much there that hasn't been considered before), but when I got to this part:

The power provided by the system must NOT be exposed directly to the average consumer, but must be made available to the developer

...you lost me.

If the power of the system is going to be hidden to the end-user, why even do it at all? That's like saying: "the underlying VCS system uses Git, but you can only use SVN commands to access it." Umm... okay... then why don't I just use SVN?

What I'm looking for is a system that does expose the power provided by the system but in a way that end-users can grok it. And if someone can figure out how to do that, can we have them talk to the git guys, too? 'Cause I answer a lot of questions about git. And the questions come from fellow developers; trying to explain git to my mom would be an exercise in futility.







John Clark

unread,
Feb 12, 2014, 3:16:03 PM2/12/14
to root...@googlegroups.com
Ryan,

A few examples to illustrate what's going through my mind.

1. Commits/Atomic Changes - The underlying system has a very abstract definition of an Atomic Change. You can essentially call any set of CRUD operations an Atomic Change. As a developer that is great. If I am designing an import mechanism, I may group all of those changes in to one Atomic Change so it can be undone simply. Another place I may have the ability to create the existence of a birth. In this case I may want to define an Atomic Change as: a) the creation of the birth element, b) the association of birth element with a person, and c) the attachment of a source to support that birth element. The developer needs the power to define Atomic Changes as s/he sees fit, but the User should never have to know of or make that decision. It is just to abstract and needs to be constrained by the UI.

2. Implementation Details - This will make more sense when I can fully communicate my vision of a graph-based data model, but consider the example of a cross-table in a relational database. You never want your user's exposed to the fact that you need to create a record in that cross table to associate two pieces of data. All they should know is that A -> B, and should never know that A -> X -> B. In a graph model, we may not store the relationship between a parent and child directly. We may store a relationship between the child and "Birth Event A" that s/he participated in. We may then store a relationship between the mother and "Birth Event A" that s/he also participated in. The details of this model should not be presented to the User directly, especially when presenting a person and their parents in the UI. We should just show Person X with parents Y and Z, etc. The main point is that we may need to model the data in a certain way to enable a powerful flexible system, but the UI should not (and I dare say cannot) mirror the model exactly. 

Ryan, you are absolutely right that the User should have the ability to leverage the power of the system. I should have phrased it closer to "As much as possible, simplify the User's experience so that they are not required to understand the power of the system in order to use it" (Or something along those lines). Additionally, the developer should have the power and tools necessary to create a superior User experience, along with the flexibility to adapt the system and model to what s/he needs.

Another point I should have made is the distinction between the the system that the User interacts with and the system that the developer interacts with. In my head there are several layers that look somewhat like this.

UX/UI - What the user sees and does

UX/UI API - A simplified API to increase the speed of UI development. (has functions like "get parents", "add child", "find x", etc.)

DATA API - A comprehensive API that is tightly coupled to the CVS/Data Model (Think CRUD, commit, rollback, push, pull, etc.)

VCS/Data model - Concerned with and enables Versioning, Public/Private/Scaling 

Database - A backing for the VCS/Data model

Each layer in this system should only know about and interact with the one(s) directly above and below it. I have been mainly referring to the DATA API and downward in my designs. My thinking is that if you get those pieces right, you can use them to develop many UX/UI specific APIs and allow a common back end.

- John C.

John Clark

unread,
Feb 12, 2014, 3:23:34 PM2/12/14
to root...@googlegroups.com
Justin,

As for your question of "Does versioning enable disagreements" I would have to say, not really. Versioning becomes more important when thinking about collaboration, as it gives discreet sets of changes that can be operated on. The size and complexity of each Atomic Change/commit depends on how granular you want to be. You could version at a very fine level, almost like the Journaling that is present right now in FamilySearch's Family Tree. When you are "syncing" or "collaborating" versioning gives you few things. One is a batching of changes ordered by time and annotated in such a way that in the ideal world someone could "merge" all of the changes in a commit at once. Another advantage is that enabling versioning can give you a common point of reference in a branching model, which Wayne and Dallan touched on earlier.


- John C.


On Tue, Feb 11, 2014 at 8:20 PM, Justin York <justi...@gmail.com> wrote:

Robert Hoare

unread,
Feb 13, 2014, 3:14:44 AM2/13/14
to root...@googlegroups.com
On Tuesday, February 11, 2014 6:50:37 PM UTC-7, John Clark wrote:

Disagreements
There must exist a mechanism that allows 2 Elements to simultaneously co-exist, be considered the "same", and hold different values. This does NOT mean that a user needs to see and operate on both simultaneously, but that user A can have Element.x=1 and user B have Element.x=2.

I don't quite see how "disagreements" becomes an important requirement, at this abstract level, it's too narrow.  What is needed, as a basic requirement, is a way to show relationships (or connections) between elements.  That will cover disagreements, as well as normal usage.

For example, showing that two (or more) elements are the same is one type of connection.  Showing that they are proved NOT to be the same is another.  And maybe different users each use different connections between certain elements (that's the disagreement).  Also, one element may be contained in another.  Another element could be the successor to another (such as a name change).

In other words:

Connections
There must exist a mechanism to connect elements, and describe the type of connection.  These connections could be for a user, or globally.  The connection is itself an element (so has full versioning).

It would be great if that could avoid RDF, and I've been looking for a way to, but it's really easy to get sucked into that over-complex morass and end up with a verbose mess ...

Rob

John Clark

unread,
Feb 13, 2014, 10:54:00 AM2/13/14
to root...@googlegroups.com
Rob,

To make sure I understood what you are saying, let me rephrase and you can tell me if I'm correct.

First, there exist multiple levels of disagreements, all of which need to be accounted for. The levels as I understand them are as follows:
  • Inter-Element disagreements. These are represented by the connections you described. For example, I have 2 elements, person X and person Y. User 1 may connect them together stating that they are the same, while User 2 may connect them together stating that they are NOT the same.
  • Intra-Element disagreements. Another example: Element X represents person X, and a person is allowed only 1 birth date under an example model. User 1 has '1900-01-01' as the birth date for person X, and User 2 has '1900-03-13' as the birth date for person X
If you disallow Intra-Element disagreements in your model, then you are correct, Disagreements as I have defined them do not need to exist. But as a result of that, the model is susceptible to n "versions" or "copies" of an Element (where n is the number of Elements where at least 1 property differs), along with n(n-1)/2 "Same" connections made between them all. 
 

Also, you are correct that Elements need to be able to be connected to each other. In my head I made the assumption that Elements could and should be related together. I probably should have added that so that the Element definition looks more like this:
  • Element - A discreet set of properties that we care about in the abstract. It could be a person, it could be a person's birth date, it could be a marriage or family. Elements may also contain references to other Elements. The point is that it doesn't matter what they are when working in the abstract. (As long as our proofs hold in the abstract, real numbers will also work. Go algebra!)


- John C.


--

Robert Hoare

unread,
Feb 13, 2014, 2:35:17 PM2/13/14
to root...@googlegroups.com
On Thursday, February 13, 2014 8:54:00 AM UTC-7, John Clark wrote:

First, there exist multiple levels of disagreements, all of which need to be accounted for. The levels as I understand them are as follows:
  • Inter-Element disagreements. These are represented by the connections you described. For example, I have 2 elements, person X and person Y. User 1 may connect them together stating that they are the same, while User 2 may connect them together stating that they are NOT the same.
  • Intra-Element disagreements. Another example: Element X represents person X, and a person is allowed only 1 birth date under an example model. User 1 has '1900-01-01' as the birth date for person X, and User 2 has '1900-03-13' as the birth date for person X
This is where we're talking about slightly different things.  I understood an element is a data value (a field) on a source record, or a data value derived from one or more of those (a combination, correction or conclusion).  So, in the second case, it's still an inter-element disagreement - one person connects from an assembly of elements (a person) to one particular birth date element, others may connect to different birth date elements.  Or, more likely, there are multiple birth dates connected, and different people (or different programs) connect different ones as their preferred primary birth date.  

In other words, an element is not a fact like "birth date", it's an occurance of birth date (maybe estimated from age) in a source (or conclusion), so it would be very likely that most people have multiple birth date elements, and the area for disagreement is which one(s) to choose.   Birth and death dates might be something where you expect only one (your intra definition) but they're very much the exception and if you expand it to "evidence of birth" then they're something that has multiple opinions and often no one correct answer.  So I wouldn't want any model to restrict any fact to just one "correct" answer, but a "best" answer alongside others is fine.

If you disallow Intra-Element disagreements in your model, then you are correct, Disagreements as I have defined them do not need to exist. But as a result of that, the model is susceptible to n "versions" or "copies" of an Element (where n is the number of Elements where at least 1 property differs), along with n(n-1)/2 "Same" connections made between them all.  

I misread your description of elements.  To me, an element would be a single fact reported in a single place, and (for example) a census record would be an element that is a collection of such elements.  If you start with a base element that is a "set" of properties, then  I can see how it would be hard to make sure it always had the same set.  There is some of that when collecting elements into a group (a record) as well of course.

Problems with starting with an element that is a set of properties: 

    1. where do you draw the line?  Is the element a line of a census record (which then loses the household relationship), or a household, or a page?  How can different users be consistent on this, over a wide range of record types?
    2. you'll need some way to refer to a property within the element, which is an exception to the connecting of elements

If you start with an element being a totally atomic fact (such as an age as reported in one place), this can be grouped in a connection with other elements on that line, and then up to household, page, and so on (with places being connected to standard versions - while keeping the as-transcribed original).  Most of these connections could be auto-created by transcription software.  So each element is a fact, connected to others by elements that are describe connection types.  Sources can be recreated (including into sets of properties) by following the connections.

Rob

Luther Tychonievich

unread,
Feb 14, 2014, 1:40:51 PM2/14/14
to root...@googlegroups.com
Re Version History:
A slight adjustment of perspective can move us from "version history" to "persistent data structure." A PDS can be thought of as "write-once" or "immutable" or "purely-functional". Each change creates a new pointer that indicates the newly adjusted data, typically designed to share most of the unchanged nodes as well as a pointer to the previous version. They cannot avoid storing history (it is intrinsic to their design) and offer a variety of advantages in scaling out as well.

Another very important reason to maintain history is it is archival (and we are engaged in history, right? Archival matters). If I cite a portion of the data in a conversation, explanation, etc, I want that portion of the data to be the same no matter when someone reads my text: that is, I want to cite the version that exists when I write, not whatever version happens to exist when someone reads it.


I personally believe we ought to add one more category:

Handle Disagreement and Uncertainty
Research is, almost by definition, not given to clear and unambiguous conclusions. Any conscientious researcher has extended periods of time where, somewhere in their "tree," they are of two minds, uncertain which of a set of possibilities is true. Once any reasonably large number of researchers come together this state of disagreement is continuous. If the data model does not allow researchers to store disagreeing and uncertain opinions then other avenues will be used to store them (i.e., some other data model, likely on paper) and express them (i.e., edit wars, arguments, and general bad feeling).



--

John Clark

unread,
Feb 14, 2014, 2:46:02 PM2/14/14
to root...@googlegroups.com
Luther,

Interesting perspective on the Version History tweak. It sounds familiar :). Would you mind helping me understand the reasons why one would use a PDS instead of a full version history system? They seem to solve the same set of problems based on my limited understanding of PDSs. Maybe an example or two would help me understand.

As to the "Handle Disagreement and Uncertainty", I don't understand how that is different than my "Disagreements" section? Maybe I'm missing something, but it sounds like we are saying the same thing.

- John C.

Luther Tychonievich

unread,
Feb 14, 2014, 3:32:55 PM2/14/14
to root...@googlegroups.com
Ah, I somehow missed noticing your disagreements section altogether. Embarrassing.

From an end-user perspective PDS and version history are basically the same. But from a developer perspective they have a few differences:
1. All versions of the data exist at all times in a first-class way. Roll-back is as simple as updating a single pointer.
2. Versioning in a PDS is inherent in the API and enforced the the type system; there is no additional modules to worry about and possibly forget and, for most PDSs, no time overhead.
3. Because it is immutable, all of the optimizations and parallelism of purely functional languages come to bear.
4. Because it is write-once, it can be copied, distributed, cached, etc, without restriction.

Admittedly all of these things can be realized by mutable change-logging structures too, but with some work. As with so many language features, it's more a difference of feel and ease.

Reply all
Reply to author
Forward
0 new messages