The Tree Problem: Part 2 - Why Use a Graph


John Clark

Feb 13, 2014, 10:23:23 AM
to root...@googlegroups.com
Representations

When thinking about representations for trees and genealogical data, the first thing that came to my mind was a binary tree. It seems simple: you have an individual, who has two parents (mother and father, male and female), and each of them has two parents, and so on. But that model breaks down rather quickly when you factor in things like adoption (which may happen multiple times), someone having more than two parents, or two parents of the same gender (which technically means three parents total: the couple, plus the individual of the opposite gender required to conceive and/or give birth). This essentially means that binary trees are out.

The other problem with using any type of tree is where to store the associated places, sources, and collaboration-based information. As properties in the tree? As references to an entirely different data structure (relational database, NoSQL, etc.)? If we are seeking a unified approach, it had better be unified: it would be nice to have all of the information represented in the same type of data structure.

The current model (as far as I can tell) is some type of normalized relational database. MyHeritage has a nice picture of their Family Graph here. Normalization is really nice to work with until you introduce the concept of versioning. You can version each individual table fairly easily, but it is rather messy to batch multiple changes across many tables into an atomic change, and then be able to roll them back and replay them at will (doable, but rather messy).
Another issue arises when you need to disagree. Since you can disagree on anything, each table (and cross table, shudder) would need to support this disagreement model, unless you allowed each person to have his/her own copy of the database. And while that is possible, that on top of versioning in a relational database is a level of complexity I don't want.

What we need is a system where we are free to model any structure we want, but in a way that is easily versioned, and can support disagreements.

Enter the Graph

The type of graph I am going to talk about is a Property Graph. TinkerPop has an excellent primer available here. Specifically, we are talking about an undirected or directed graph that allows properties to be stored on both nodes and edges.
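To make the discussion concrete, here is a minimal sketch of a property graph in Python. This is a toy illustration only; the class names are mine, not TinkerPop's actual API.

    # A minimal property-graph sketch: nodes and edges both carry
    # an arbitrary dict of properties. Illustrative only.
    class Node:
        def __init__(self, node_id, **props):
            self.id = node_id
            self.props = dict(props)

    class Edge:
        def __init__(self, label, src, dst, **props):
            self.label = label   # e.g. "childOf", "sourcedBy"
            self.src = src       # Node
            self.dst = dst       # Node
            self.props = dict(props)

    class Graph:
        def __init__(self):
            self.nodes = {}      # node_id -> Node
            self.edges = []

        def add_node(self, node_id, **props):
            self.nodes[node_id] = Node(node_id, **props)
            return self.nodes[node_id]

        def add_edge(self, label, src, dst, **props):
            edge = Edge(label, src, dst, **props)
            self.edges.append(edge)
            return edge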

Now let's look at the issues we were running into above in the context of a graph.

Modeling
Because we have properties, we can mimic a relational database if we really want to. But the true power of graphs lies in their edges and the types of queries they expose (more on that later). For now, all we need to know is that we can model whatever we want in a flexible way; i.e., there exists some model that allows us to do what we want.
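For instance, here is how the adoption case that breaks a binary tree falls out naturally in the toy Graph sketched above (names and dates invented, of course): a child simply has as many "childOf" edges as needed, and each edge carries its own qualifiers.

    # Multiple parents are just multiple edges; the edge properties
    # say what kind of link each one is. Data invented.
    g = Graph()
    child    = g.add_node("p1", name="Alex")
    mother   = g.add_node("p2", name="Maria")
    father   = g.add_node("p3", name="Josef")
    adoptive = g.add_node("p4", name="Ruth")

    g.add_edge("childOf", child, mother,   relation="biological")
    g.add_edge("childOf", child, father,   relation="biological")
    g.add_edge("childOf", child, adoptive, relation="adoptive", year=1872)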

Versioning
Now that we have only two kinds of concepts to worry about, we can devise a general solution for versioning propertied nodes and edges. With that in place as a transparent layer on top of the graph, we can operate using our regular CRUD operations and not have to worry about integrating versioning into our model. Think of writing code in git: we don't care how the versioning happens while we are writing our app; we just know that git will manage it transparently, and we don't have to add markers or any such nonsense to our files. I will go into more detail on versioning a graph later.
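As a sketch of what such a transparent layer might look like (my own toy scheme, not any existing library's API): every property write appends a revision instead of overwriting, so any past state can be read back.

    # Toy versioning layer for a propertied node or edge: writes
    # append (revision, value) pairs; reads return the latest value
    # or the value as of any past revision.
    class VersionedProps:
        def __init__(self):
            self.history = {}   # key -> list of (rev, value)
            self.rev = 0

        def set(self, key, value):
            self.rev += 1
            self.history.setdefault(key, []).append((self.rev, value))

        def get(self, key, at_rev=None):
            entries = self.history.get(key, [])
            if at_rev is not None:
                entries = [(r, v) for r, v in entries if r <= at_rev]
            return entries[-1][1] if entries else None

    props = VersionedProps()
    props.set("birthDate", "abt 1850")
    props.set("birthDate", "4 Mar 1851")
    assert props.get("birthDate") == "4 Mar 1851"
    assert props.get("birthDate", at_rev=1) == "abt 1850"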

Disagreements
Because we now have a graph, all we have to concern ourselves with is the case where N properties on N nodes/edges are in conflict with one another. Again, we can devise a general solution that sits transparently atop the graph and exposes the ability to see and resolve those differences. Diff is now trivial (does my Node.x equal your Node.x?). It is important to note that we haven't specified the exact mechanism for allowing two simultaneously different Node.x's; I will talk about that later.
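To show just how trivial the diff becomes, here is the whole operation sketched in Python over plain property maps (a toy, not a proposal for the final mechanism):

    # Compare two propertied nodes key by key; anything that differs
    # is a conflict to surface to the users.
    def diff_nodes(mine, theirs):
        conflicts = {}
        for key in set(mine) | set(theirs):
            if mine.get(key) != theirs.get(key):
                conflicts[key] = (mine.get(key), theirs.get(key))
        return conflicts

    mine   = {"name": "John Smith", "birthYear": 1850}
    theirs = {"name": "John Smith", "birthYear": 1851}
    print(diff_nodes(mine, theirs))   # {'birthYear': (1850, 1851)}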

Summary
Let's review our original requirements:
  • CRUD - We have that and much more, thanks to TinkerPop's excellent set of open source software (http://www.tinkerpop.com/)
  • Full Version History - If we can version propertied nodes and edges we've got this taken care of
  • Public vs Private - If we can come up with some way of forming "git repo-like" subgraphs, then we meet this requirement (more on this in another section)
  • Disagreements - If we can simultaneously allow 2 versions of Node.x, we've got this covered
  • Scaling - There has been some excellent progress on this front recently. Neo4j scales well as long as you can shard (maybe using the same scheme as Public vs Private above, wink wink), and Titan scales out really well.
  • Power vs Simplicity - A graph is a concept that is explainable to an end user, as most people understand the idea of a relationship network with people as nodes. Another advantage is that graphs support a rich query language and come with a powerful set of algorithms.
We have now shown that a generic solution can be represented within a property graph. And now that we can generalize to a graph, we can reason about that graph in future discussions and NOT have to visualize an actual family tree.

- John C.

Thomas Wetmore

Feb 13, 2014, 11:59:47 AM
to root...@googlegroups.com
John,

I think you should be VERY seriously thinking about non-relational databases.

You can FORCE genealogical data into formal relational tables (as nearly EVERYONE has done for the past 30 years), but genealogical data is so sloppy, with so many special exceptions, that force-fitting everything you might wish to record into restrictively-formatted tables is essentially impossible. And relational databases REQUIRE the restrictions. This is why we get so D**MED frustrated with EVERY existing genealogical program and online system, when we discover we cannot record data (either a name, a place, a date, a fact of any type) in the form that would be correct based on the evidence. We are so USED to this frustration that most of us probably figure that this is the way it MUST be, and we stopped griping about it long ago.

It is time for genealogical databases to climb up out of the relational mire. I seriously suggest you consider MongoDB or Neo4j and probably a number of other "no-SQL" databases. These are the only two that I have investigated for my own genealogical software.

In MongoDB there are analogs to relational tables, rows, and columns, but they are MUCH MORE flexible. In MongoDB you never have to "normalize" your data into a combinatorial explosion of relational tables. In MongoDB you can think of every one of your records exactly as what they are, as records, e.g., persons, families, events, places. The "native" representation language of MongoDB is JSON, but it is TRIVIAL to convert these JSON records to GEDCOM or XML (or GEDCOM-X) or whatever else. Note that MongoDB supports a QUERY language that is just as powerful as SQL, so you lose NOTHING by moving to MongoDB, and you GAIN as much flexibility as you WILL EVER NEED over relational databases. In my opinion the use of relational databases has been the largest and ugliest ANCHOR holding back improvement to almost every aspect of advancement in genealogical software.
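To illustrate how mechanical that conversion is, here is a person record in a JSON-ish shape and a trivial GEDCOM-style rendering of it, in Python (the field names and sample data are invented for illustration, not any standard's):

    import json

    # An illustrative person record; field names are invented.
    person = {
        "type": "person",
        "name": "John /Smith/",
        "events": [
            {"tag": "BIRT", "date": "4 MAR 1851", "place": "Boston, MA"},
        ],
    }

    # Rendering such a record as GEDCOM-style lines is mechanical.
    def to_gedcom(rec, xref="@I1@"):
        lines = [f"0 {xref} INDI", f"1 NAME {rec['name']}"]
        for ev in rec.get("events", []):
            lines.append(f"1 {ev['tag']}")
            lines.append(f"2 DATE {ev['date']}")
            lines.append(f"2 PLAC {ev['place']}")
        return "\n".join(lines)

    print(to_gedcom(person))
    print(json.dumps(person, indent=2))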

Neo4j is already a graph-based database, in which relationships between nodes are directly inherent in the database structure. When you add records to a Neo4j database you establish those relationships. Graph databases accelerate all database operations where the RELATIONSHIPS between the elements are key to searching and other database operations.

I get the feeling that these issues of TREE structures versus GRAPH structures are getting ready to go out of control. There are only two concepts involved here: there are elements, and there are relationships between elements. These two concepts are sufficient to build trees and graphs of any type. It is essentially trivial to represent these two concepts in every major programming language, every database format, and every major archival text format (e.g., GEDCOM, JSON, XML, Protocol Buffers, ...).

Excuse all the SHOUTING.

Tom Wetmore

Ryan Heaton

Feb 13, 2014, 12:14:07 PM
to root...@googlegroups.com
I think John's leading up to something that is at a layer above the relational vs. nosql vs. graph database debate. At least that seems to be the direction given the focus on atomic updates and versioning.

I like non-relational databases, especially for this domain space. I'm confident that we'll all eventually get there. But I think it's terribly naive to think that our frustrations with "EVERY existing genealogical program and online system" are rooted in relational databases. For example, I have no doubt that after FamilySearch finishes their migration to Cassandra, you'll still be frustrated with what you can't record.






Thomas Wetmore

Feb 13, 2014, 12:48:23 PM
to root...@googlegroups.com
Ryan,

> I think John's leading up to something that is at a layer above the relational vs. nosql vs. graph database debate. At least that seems to be the direction given the focus on atomic updates and versioning.

I will be negative here. I think all the current stress on GIT, versioning, and atomic updates is wrong. The stress should be on the genealogical process, and on the model and records required to support it. Then one can talk about any version control issues that may actually exist. I think complex version control in genealogical databases is so far over the top, in terms of what is really needed now and possibly forever, that it is preventing work that would get us somewhere. If I am wrong, so be it, and I will be happy, as I always am, to admit my wrong-headedness. For me all this GIT stuff is just so much sophomoric exuberance that we will have to wait to get past.

> I like non-relational databases, especially for this domain space. I'm confident that we'll all eventually get there. But I think it's terribly naive to think that our frustrations with "EVERY existing genealogical program and online system" is rooted in relational databases. For example, I have no doubt that after FamilySearch finishes their migration to Cassandra, you'll still be frustrated with what you can't record.

I stand by my statement. I will, however, add the contrapositive you imply -- using a no-SQL database (e.g., Cassandra) will NOT guarantee that a genealogical system based on it will be flexible and not cause frustration. The use of relational databases PREVENTS the development of non-frustrating, flexible software implementations in the genealogical area; and, although the use of non-relational databases ENABLES non-frustrating, flexible software, it does not GUARANTEE it.

Tom

John Clark

Feb 13, 2014, 12:59:39 PM
to root...@googlegroups.com
Tom,

I am actually advocating for using a No-SQL backend - a Property Graph. I also considered more traditional No-SQL backends such as couchdb or mongodb, but there are many advantages to be gained by using a Property Graph. Allow me to share some of my thoughts.
  • Property Graphs are inherently schema-less at the database level. They have mechanisms that allow the enforcement of a schema at the application level (Neo4j's Node Labels and Relationship Types, Constraints, etc.). And while Node and Edge Properties have a type, Property X on Node 1 can have a different type than Property X on Node 2.
  • Graphs that implement Gremlin have a rich traversal query language that simplifies data retrieval - queries like "What married people within 5 generations have 1 or fewer children associated with them?", "Given any 2 people in the graph, what is the most direct lineage between them?", or even "What birth events have fewer than 2 high-confidence sources attached to them?" (Please note that the data must be modeled in a way that enables these types of queries. I am merely pointing out what one could do, not what has already been done. See the sketch after this list for the flavor of such a traversal.)
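To give a flavor of the traversal style, here is a self-contained sketch in plain Python rather than Gremlin itself, over invented data, so the shape of the query is visible. In a Gremlin dialect this would be a short traversal expression.

    # "Ancestors of a person within N generations", sketched over a
    # plain adjacency map where "childOf" points child -> parents.
    child_of = {
        "alex":  ["maria", "josef"],
        "maria": ["ada", "karl"],
    }

    def ancestors(person, depth):
        if depth == 0:
            return set()
        found = set()
        for parent in child_of.get(person, []):
            found.add(parent)
            found |= ancestors(parent, depth - 1)
        return found

    print(ancestors("alex", 2))   # {'maria', 'josef', 'ada', 'karl'}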
In regards to your frustration with the restrictive nature of the current systems, I absolutely agree with you that there is a huge area for improvement. I will be sharing my thoughts on data models a little later, but your comment points out a very real truth.

> genealogical data is so sloppy, with so many special exceptions, that force-fitting everything you might wish to record into restrictively-formatted tables is essentially impossible.

In a later part I will be discussing the dissonance between designing a data model around what I want to display to a user vs designing a data model around what actually happened.

As to the focus on versioning, atomic updates, Git, etc., my thought is that we need a flexible system that natively deals with those concepts, which we can then build on. Adding them after the fact leads to the same kinds of issues we have now with collaboration, disagreements, and privacy.

- John C.


Wayne Pearson

Feb 13, 2014, 1:00:11 PM
to root...@googlegroups.com
As usual, I find myself agreeing with our local curmudgeon (which I mean in the nicest way possible).

I took part in the git discussion more for the technical merits than any perceived need (yet) for such functionality in genealogy research. I feel that versioning is required for reversing imports or merges, but anything that needs reversing belongs at the local genealogist's end, not at any common repository, and thus versioning can take place in the form of backups before such imports and merges. Yes, it does mean that I can't come back a few months later, after adding more of my own data, and decide to magically undo all the importing, but I think tagging such imported data, and allowing removal based on that tag, is a much easier (if less interesting) approach to an undo.
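A rough sketch of the tag-and-remove approach I have in mind, in Python (the record shape and import id are invented for illustration):

    # Stamp every imported record with an import id; undoing the
    # import is then just a filter, no version history needed.
    records = []

    def import_records(new_records, import_id):
        for rec in new_records:
            rec["import_id"] = import_id
            records.append(rec)

    def undo_import(import_id):
        global records
        records = [r for r in records if r.get("import_id") != import_id]

    import_records([{"name": "John Smith"}], import_id="import-2014-02-13")
    undo_import("import-2014-02-13")
    assert records == []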

I was going to say the same thing re: Cassandra/no-SQL/relational, but knew Tom would do so as well or better than I.




Ryan Heaton

Feb 13, 2014, 1:15:02 PM
to root...@googlegroups.com
> I think all the current stress on GIT, versioning, and atomic updates is wrong.

Shrug. Fair enough. You might be right, but it's interesting to me anyway. At least interesting enough to listen to what other people are thinking about it.

I can tell you that moving from SVN to Git in my development work was a difficult transition, but it significantly changed for the better the way I do collaborative development, and I will never, ever go back. If someone can apply the same shift in collaborative development to the genealogical domain, I'm listening.


> The use of relational databases PREVENTS the development of non-frustrating, flexible software implementations in the genealogical area

That hasn't been my experience.

I agree that using non-relational databases will make the needed development work easier when a new feature or element of the product is defined. It might even cut the development work down by as much as 50%.

But in my experience, the work of providing a new feature or element in the product is about 10% development work and 90% feature definition, product management, and architecture. So cutting out half of that 10% doesn't buy you much.




Thomas Wetmore

Feb 13, 2014, 3:18:39 PM
to root...@googlegroups.com
John,

Your obvious grasp of no-SQL is VERY encouraging, and I look forward to seeing more of the results of your research! Thanks for sharing your thoughts.

My overall concerns involve getting the genealogical data model to the level of excellence required for supporting genealogical (really historical) research. I've been involved with that issue for almost 30 years. My frustration with the GenTech model (I was a curmudgeon all the way back then too), then Better GEDCOM, and now FHISO (was there even a peep from them at RootsTech?), has led me to pretty much bow out. I bet I have a much longer record of failure than anyone else in this discussion! I spell doom to these efforts, and you really don't want me around. But I hate all the resetting to zero that happens with any new effort. I am happily retired, writing genealogical software in the manner I think best, using the models I think best, and am more or less content to let things unfold, or not unfold, as they will. It's a crying shame, though, isn't it, that genealogical software is so icky? Does it have to be that way, or are we all just collectively too dumb to figure out how to do it right?

Clearly SQL and relational databases are flashpoint issues for me. Sends me up the wall. So, by the way, is XML, especially when someone claims that it is inherently better than, say, GEDCOM (syntax!!) for archiving genealogical data.

Version control is a bit of a flashpoint for me, pretty obviously, also. For me the proper model of a person is a single summary record with an associated set (or multi-tiered tree) of persona (evidence person) records that provide all the explicit evidence serving to prove the existence of that person. The "primary operation of genealogical research", as it affects the structure of a properly designed genealogical database, is to create new evidence records (properly linked to source records), associate them with the right summary person (individual), and update the summary information about that individual from the new data. The glomming of the new evidence record to the individual is an explicit conclusion made by the researcher. It may be so obvious as to not require anything other than the glomming operation, or it may require an actual conclusion statement. The properties that exist at the individual level must be an intelligent summarization of all the facts from the evidence records. Each of those "intelligent summarizations" is obviously a conclusion, in the classical inductive and deductive senses, based on the evidence.

If version control is to be used, I imagine that it would primarily be used in the context of adding evidence persons to individuals, removing evidence persons from individuals, and partitioning the evidence persons from n individuals (n may be 1) into m individuals (m may be 1). These are certainly operations that it might be useful to track using version control.
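If it helps to see the shape of it, here is the persona/conclusion structure and those glom operations in a toy Python form (the names and fields are illustrative only, not a design):

    # A conclusion person summarizes a set of evidence personas;
    # glom/unglom edit that link set, optionally leaving a log that
    # version control could track.
    class ConclusionPerson:
        def __init__(self, summary):
            self.summary = summary   # e.g. {"name": "John Smith"}
            self.personas = set()    # ids of evidence-person records
            self.log = []            # optional operation history

        def glom(self, persona_id):
            self.personas.add(persona_id)
            self.log.append(("glom", persona_id))

        def unglom(self, persona_id):
            self.personas.discard(persona_id)
            self.log.append(("unglom", persona_id))

    john = ConclusionPerson({"name": "John Smith"})
    john.glom("1860-census-line-123")   # attach an evidence persona
    john.glom("1865-deed-party-9")
    john.unglom("1865-deed-party-9")    # turned out to be a different John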

How important do you (all of you, not just John) think it is to be able to reconstruct the history of every glom, unglom, rearrange, split, lump operation performed? Do you really think it is useful to keep around the history of all your conclusions and conclusion un-windings? Frankly, and maybe this applies only to me, I would never bother to do it. In almost every single case imaginable, I will glom the right evidence person (persona, please) into the right conclusion person, modify the conclusion person's summary information to reflect the new evidence, and leave things alone until more evidence shows up to add. Sure, once in a while I will decide that I have glommed together evidence records from more than one real person into a single individual, and I will break the individual person into two individuals, but I can't imagine any productive reason that I would want to keep a formal trace of that partitioning operation. In the extremely rare event that I might decide I have to go back on one of these glomming, unglomming, partitioning operations, I will simply do it. Call me a luddite.

I don't think it is all that hard to find the theoretically right way for this model of individual and evidence records in a personal database to work alongside a "master" online database that is based on exactly the same constructs. Of course, if the two databases don't share the same evidence records and source records, all bets are off. Right there is the first almost insurmountable problem of bringing the worlds together. If there isn't a lingua franca for evidence and source records that all share, a seamless system is impossible. If you truly want this world, that is the first problem to solve. Do you have a spare thirty years lying around? To do this there must be an agreed upon standard for exchanging source and evidence records, agreed upon by the two big gorillas, all the yammering monkeys, and all the little lemurs. There must be an agreed upon standard so that the 1860 census record for Joe Shmoe is the same, uniquely identifiable evidence record, in every place it shows up, and, of course, the source record for the 1860 census as a whole must be identical wherever it shows up. Does anyone honestly think we can make this happen? Does anyone honestly think we don't have to solve this problem to get a generic solution to the online/master, offline/personal database problem?

Tom

Wayne Pearson

Feb 13, 2014, 3:52:27 PM
to root...@googlegroups.com
My concern for the "uniquely identifiable evidence record" is the fact that we have multiple large-scale sources already, which duplicate some of the references to some of the final, actual evidence records.

FamilySearch has an ID for the 1901 Canada Census: "1584577". We could seed our hypothetical centralized database with that ID as *THE* ID for the census. But Automated Genealogy also has records based on this census... but no ID, per se, for the census as a whole.

However, Page 1 of Subdistrict A-1 of District "1" (Burrard in British Columbia) of their census transcription *does* have an ID, 13754. But FamilySearch doesn't have IDs per-page (that I'm aware of) as a source.
And then there's the Ancestry reference, dbid 8826. Or the Canadian Government's ID of AMICUS 7196327.

Do you start from scratch with brand-new UUIDs for everything, requiring everyone who has ever cited anything to redo their citations? Perhaps you provide aliases, over time, to well-known and not-so-well-known alternate sources, to allow software to remap those references?

I think this is the larger problem -- mapping untold existing references to the centralized source, not creating the source itself. Is this a problem for the genealogist to solve -- go back through untold records and re-cite the "proper" evidence? Or is it a problem for the genealogy developer, together with the genealogical community -- alias everything, and then provide conversions?
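To make the alias idea concrete, a small Python sketch (the canonical id is invented here; the repository ids are the real ones listed above):

    import uuid

    # One canonical id per source, plus a map from each repository's
    # native id, so existing citations can be remapped by software.
    canonical = str(uuid.uuid4())   # our id for the 1901 Canada Census

    aliases = {
        ("familysearch", "1584577"):   canonical,
        ("ancestry",     "dbid:8826"): canonical,
        ("amicus",       "7196327"):   canonical,
    }

    def resolve(repo, native_id):
        """Map an existing citation to the canonical source id."""
        return aliases.get((repo, native_id))

    assert resolve("familysearch", "1584577") == canonical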





Daniel Zappala

Feb 13, 2014, 4:15:24 PM
to root...@googlegroups.com


On Thursday, February 13, 2014 1:18:39 PM UTC-7, Tom Wetmore wrote:

> How important do you (all of you, not just John) think it is to be able to reconstruct the history of every glom, unglom, rearrange, split, lump operation performed? Do you really think it is useful to keep around the history of all your conclusions and conclusion un-windings. Frankly, and maybe this applies only to me, I would never bother to do it. In almost every single case imaginable, I will glom the right evidence person (persona, please) into the right conclusion person, modify the conclusion person's summary information to reflect the new evidence, and leave things alone until more evidence shows up to add. Sure, once in awhile I will decide that I have glommed together evidence records from more than one real person into a single individual, and I will break the individual person into two individuals, but I can't imagine any productive reason that I would want to keep a formal trace of that partitioning operation. In the extremely rare event that I might decide I have to go back on one of these glomming, unglomming, partitioning operations, I will simply do it. Call me a luddite.


Dear Luddite (you said to!):

Does your opinion change if this glomming is occurring in a shared database instead of a personal one? Wouldn't you then want a history so you could undo a disastrous glomming of more than one real person into a single individual?

I will admit that it is easier to think of versioning when your versioned objects are files, as they are in git. When instead your objects are *links* between files (evidence persona and conclusion person), then that seems trickier.

-- Daniel Zappala

Dallan Quass

Feb 13, 2014, 4:18:55 PM
to root...@googlegroups.com
Wayne,

Coming up with unique IDs for evidence records would be awesome, but I don't know how much of an impact independent software developers could have in that area. But we may be able to achieve a smaller goal: getting people to cite some kind of evidence in the first place. Then we can address matching evidence records from different repositories in the future. Especially with FamilySearch promising to open up their "record hinting" API sometime soon, this smaller goal seems doable.




Thomas Wetmore

Feb 13, 2014, 10:00:37 PM
to root...@googlegroups.com
On Feb 13, 2014, at 4:15 PM, Daniel Zappala <daniel....@gmail.com> wrote:

> Dear Luddite (you said to!):
>
> Does your opinion change if this glomming is occurring in a shared database instead of a personal one? Wouldn't you then want a history so you could undo a disastrous glomming of more than one real person into a single individual?
>
> I will admit that it is easier to think of versioning when your versioned objects are files, as they are in git. When instead your objects are *links* between files (evidence persona and conclusion person), then that seems trickier.
>
> -- Daniel Zappala

Daniel,

Luddite and proud!

I will admit as far as a "maybe" to your question.

The question is whether I would want to do a bunch of "undo's" until I reach a past arrangement that might be easier to make the fixes from, and then move forward again to make the fix; or whether I would want to just fix the rat's nest from the state in which I find it.

Note that if there were an undo feature, one of the steps we might easily undo through could be the addition of an evidence record. I would say that undoing past the addition of a new evidence record is a major issue, as this isn't just rearranging data to an earlier state; this is losing evidence. And note that if the engineer providing this undo feature put in a hack to prevent the actual loss from happening, it isn't an honest undo anymore. Here a correct, reversible version control system causes problems instead of solving them! Ever seen one of those before?

I think in the long run I would just want to fix the mess as I found it. I have done this a number of times in existing on-line trees when I have found screwed up trees of ancestors I am intimately familiar with.

How prevalent are these edit wars of changing things back and forth? In the trees I have worked on it hasn't been much of a problem, but maybe I don't have enough experience. Even if there were horrible edit wars, how much would versioning help (other than making it easier to speed up the edit wars!)?

I do agree that it would be very nice and handy to have a way to graphically see exactly the evolution history of individuals and relationships as they have been built up over time with the adding and possible occasional removal of evidence. And it would be cool to see the splitting and lumping and rearranging of individuals over time. But, would I really want to go back in history as shown by that graph to a certain point, and then go off in a different direction; or would I just use that graph to help me understand the mistakes that have been made, and use it as a guide to help rearrange evidence to get things back in order?

I think I would definitely not do the former, and would definitely do the latter.

Tom


Dallan Quass

Feb 13, 2014, 10:26:29 PM
to root...@googlegroups.com
I think when you have multiple people editing the tree, a change history is valuable for seeing who made what changes when. A change history is also important when you are trying to find a common-ancestor version for a 3-way diff. But I agree that it's unlikely most people would want to go back in time and create new branches from historical versions.
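For what it's worth, a 3-way diff over propertied nodes is a small amount of code. A sketch in Python, over plain property maps with invented data:

    # Merge two edits of a node against their common ancestor: a key
    # changed on both sides to different values is a conflict.
    def three_way(base, mine, theirs):
        merged, conflicts = dict(base), {}
        for key in set(base) | set(mine) | set(theirs):
            b, m, t = base.get(key), mine.get(key), theirs.get(key)
            if m == t:
                merged[key] = m
            elif m == b:
                merged[key] = t
            elif t == b:
                merged[key] = m
            else:
                conflicts[key] = (m, t)
        return merged, conflicts

    base   = {"birthYear": 1850}
    mine   = {"birthYear": 1851}
    theirs = {"birthYear": 1850, "deathYear": 1910}
    print(three_way(base, mine, theirs))
    # ({'birthYear': 1851, 'deathYear': 1910}, {})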

Git allows me to see a change history, and I do that once in a while. It also shows me 3-way diffs, and I use those once in a while. But I've never gone back in time, checked out an old commit, and started a new branch from it. I'm sure people have, but it seems like a pretty niche use case - certainly not something the average user would do.




Daniel Zappala

Feb 14, 2014, 12:33:31 PM
to root...@googlegroups.com


On Thursday, February 13, 2014 8:00:37 PM UTC-7, Tom Wetmore wrote:

> I do agree that it would be very nice and handy to have a way to graphically see exactly the evolution history of individuals and relationships as they have been built up over time with the adding and possible occasional removal of evidence. And it would be cool to see the splitting and lumping and rearranging of individuals over time. But, would I really want to go back in history as shown by that graph to a certain point, and then go off in a different direction; or would I just use that graph to help me understand the mistakes that have been made, and use it as a guide to help rearrange evidence to get things back in order.



Really good point. I tend to agree. 