Graphical Databases


Thomas Wetmore

Aug 2, 2022, 9:12:52 PM
to root...@googlegroups.com
A graph database seems to me just right for genealogy. The hierarchical structure of GEDCOM can be mapped directly to graphical structure. Ditto for any other genealogical data formats. Genealogical data is naturally graphical in many ways, not just because of the relationships between persons, but the relationships between persons and events, events with dates and places, events with their sources, and so on.

I tend to distinguish two types of genealogical databases:

1. Conclusion databases -- centered around persons, their biological relationships, and their basic vital events. They store family trees with the genealogists' best attempts to identify the persons in their pedigrees. They tend to be sloppy in the area of holding evidence or sources. In the arguments I used to make in the past I would call the persons in a conclusion database 'individuals'. Individuals are the genealogists' best guesses about the persons who lived in the past.

2. Research databases -- centered around the records genealogists collect in order "to do" genealogy. In those same arguments I used to call the persons mentioned in real evidence 'personas'. In my model of genealogy, researchers collect the best evidence about as many personas as possible, and their work consists of deciding which personas most properly combine into individuals; that is, partitioning the many personas they collect into the one, two or more individuals that the personas seem to best fit.
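The persona/individual split described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names, and the `partition` helper are hypothetical, not taken from LifeLines or any other program.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Persona:
    """A person as mentioned in one evidence record (hypothetical fields)."""
    persona_id: str
    name: str
    source: str


@dataclass
class Individual:
    """A researcher's conclusion: personas judged to describe one person."""
    individual_id: str
    personas: list = field(default_factory=list)


def partition(personas, assignment):
    """Group personas into individuals using a persona_id -> individual_id map."""
    individuals = {}
    for persona in personas:
        ind_id = assignment[persona.persona_id]
        if ind_id not in individuals:
            individuals[ind_id] = Individual(ind_id)
        individuals[ind_id].personas.append(persona)
    return individuals
```

The `assignment` map stands in for the researcher's judgment. Changing it regroups the same evidence without touching the personas themselves.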

In reality most users combine aspects of both. For example, Ancestry trees and the FamilySearch tree have this combined nature. They allow users to provide their own values for names and events and to link to other persons (in other words, to express their opinions), while also collecting together the various records that led them to choose those names, events and relations.

This is a big topic that goes off in many directions, many more than can be mentioned in a short email. But to me the main point is that both these types of databases, as well as the combined type, seem naturally well suited for a graphical structure.

I've been building genealogical data models and writing genealogical software for decades, and hope to experiment with reimplementing some of the things I have done using a graphical database. I think it's a good idea, but only some experimentation will show whether it is or not. The main question in my mind is whether a graphical database will improve, simplify, or speed up genealogical software that is currently built on other technologies. My own LifeLines uses a custom-written B-tree database that is extremely performant, and I frankly do not believe that a graphical database will improve on the performance already achieved.

So it's not clear yet whether a graphical database is important for genealogical software. After all, all the current programs run fine now. If that software were to have no new features, how would conversion to graphical databases help at all? So I want to know whether using a graphical database will enable other features that I'm not able to imagine right now.

Tom Wetmore



Wayne Pearson

Aug 3, 2022, 12:32:29 AM
to root...@googlegroups.com
Hi Tom,

 I expect that a graph database has the potential to take the conclusion databases, the research databases, *and* the hybrid ones you mention, and represent them. With a bit/lot of work, multiple sources could theoretically be combined, if common citations between trees can be identified and "attached" to an existing graph database.

With a level of abstraction, individuals' trees could be stored/represented in a graph database without "polluting" the data, which I think adds value (under the assumption that such external trees can be merged in without complete hand-holding). Often I've taken the hybrid trees I've found in my searches, and wanted to have them loosely associated with my own trees, but no more than that.

I adopted your same individual/persona terminology years ago, after reading about it on this list; my short stint in genealogy + graph databases used it to good effect.

I feel the real power in the graph databases comes from the queries, which is closely tied to getting a good schema defined. Knowing the relationships between all the nodes that you have helps to refine the queries further. 


Another nice feature is that importing other graph databases can be done in a safe, additive way, so if external databases became a thing, updates would be simple enough to bring into one's own. 


I suppose whether they're "important" comes down to what's left to be solved, or improved upon. For many, genealogical software is for storing their conclusions, as you said, and not so much for the retrieval afterwards. However, if the myriad data sources out there now were to be converted, they could support queries that today's databases do not; and support importing fragments of other trees and sources better than many programs do today. I think whether that makes it worthwhile or important depends on the researcher.

--
  Wayne



--

---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rootsdev/4F8430AA-4722-4F1C-995D-0A8E8E035B52%40gmail.com.

Thomas Wetmore

Aug 3, 2022, 10:20:51 AM
to root...@googlegroups.com


> On Aug 3, 2022, at 12:32 AM, Wayne Pearson <cry...@gmail.com> wrote:
>
> Hi Tom,
>
> I expect that a graph database has the potential to take the conclusion databases, the research databases, *and* the hybrid ones you mention, and represent them. With a bit/lot of work, multiple sources could theoretically be combined, if common citations between trees can be identified and "attached" to an existing graph database.

Wayne,

I agree. I believe that nearly all relationships and entities needed in genealogical databases are well-suited to be nodes and edges in a graph. The mappings are usually obvious.
>
> With a level of abstraction, individuals' trees could be stored/represented in a graph database without "polluting" the data, which I think adds value (under the assumption that such external trees can be merged in without complete hand-holding). Often I've taken the hybrid trees I've found in my searches, and wanted to have them loosely associated with my own trees, but no more than that.
>
> I adopted your same individual/persona terminology years ago, after reading about it on this list; my short stint in genealogy + graph databases used it to good effect.
>
> I feel the real power in the graph databases comes from the queries, which is closely tied to getting a good schema defined. Knowing the relationships between all the nodes that you have helps to refine the queries further.

I also agree. This is where new features would probably come from. I have some experience with this. My LifeLines program has a programming feature where users can write programs to do almost anything with their data. I provide many operations, such as all-descendants-of, all-ancestors-of, all-spouses-of, and all-children-of (all-descendants-of iterates all-children-of until there ain't no more), so traversals of all descendants and the like are supported. These would be prime fodder for any infrastructure using a graphical database. The algorithms for doing these operations use the graph-based structure of GEDCOM. Users can combine these graphical primitives to fairly quickly write programs that, say, find nearest common ancestors, end-of-line ancestors, and so on, exactly the kinds of features that the GraphConnect video on Graphs for Genealogists also accentuates.
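A rough sketch of how such primitives compose, assuming a toy parent-to-children map. The function names mirror the operations named above but are hypothetical Python stand-ins, not the LifeLines API, and the data is invented.

```python
from collections import deque

# Hypothetical toy data: parent -> list of children.
children = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def all_children_of(person, links):
    """Return the direct children of a person (empty list if none)."""
    return links.get(person, [])


def all_descendants_of(person, links):
    """Iterate all_children_of breadth-first until no one is left."""
    seen = []
    queue = deque(all_children_of(person, links))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(all_children_of(current, links))
    return seen


def all_ancestors_of(person, links):
    """Invert the child links, then reuse the same traversal going upward."""
    parents = {}
    for parent, kids in links.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return all_descendants_of(person, parents)


def common_ancestors(a, b, links):
    """Return the set of people who are ancestors of both a and b."""
    return set(all_ancestors_of(a, links)) & set(all_ancestors_of(b, links))


print(all_descendants_of("A", children))     # ['B', 'C', 'D', 'E']
print(common_ancestors("D", "E", children))  # {'A'}
```

In a graph database the same traversals would typically be expressed as path queries rather than hand-written loops, which is where the potential for new features lies.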

> I suppose whether they're "important" comes down to what's left to be solved, or improved upon. For many, genealogical software is for storing their conclusions, as you said, and not so much for the retrieval afterwards. However, if the myriad data sources out there now were to be converted, they could support queries that today's databases do not; and support importing fragments of other trees and sources better than many programs do today. I think whether that makes it worthwhile or important depends on the researcher.

Good point. I have often wondered about automating the research process, and whether it is at all feasible. That is, take a hybrid tree full of sources, events and personas, and make suggestions as to which personas to include in which individuals. Clearly Ancestry and FamilySearch have elements of this available in their hinting systems. Wouldn't it be nice if some algorithm could process the research part of your database and come up with suggestions as to which personas most likely apply to your individuals, the most likely form of a person's name, most likely date of birth, most likely birth place, and so on? I would question whether such a feature would be simplified by using a graphical database or whether the software would just be too ugly and ad hoc to organize around neat structures. I have to hope for the former.
>
> Wayne

Tom Wetmore


Ken Finnigan

Aug 3, 2022, 7:55:09 PM
to root...@googlegroups.com
I agree with what you're both saying Wayne and Tom.

With respect to the research process and multiple personas, I've been thinking through possible approaches to this where you can mark all possible personas for an individual. You can then compare them to determine the most appropriate fit based on sources, etc. It could then be possible to mark a persona, or set of sources, as being aligned with an individual, or to create an individual from a set of personas if it's not someone already present. All the personas would still be available through the individual for use as comparison or review in the future.

Ken


Thomas Wetmore

Aug 3, 2022, 8:51:14 PM
to root...@googlegroups.com
Ken, I really enjoyed reading this. The research process seems very well understood, both by big purveyors of family trees (e.g., Ancestry and FamilySearch) and by many rank-and-file genealogists and family historians. The notions of a persona as an evidence concept and an individual as a conclusion concept have advanced so much in the past 20 years. Now almost everyone understands the ideas behind them. I think that Ancestry and FamilySearch had a lot to do with advancing the concepts. Even though they do not talk about these ideas in these terms, we know that every 'hint' provides one or more personas along with source info, and gives us the option of 'merging' those personas into individuals we are building. I used to blab and blab and blab about this; I should have just waited for Ancestry and FamilySearch to make it a fait accompli.

I have to believe that since the big guys have software that searches out likely personas to merge, similar software can be packaged into the hybrid databases themselves to take some of the load of doing all the comparisons off the family historian. I do believe that such a process is going to require software to fuzzily match properties of personas to properties of individuals to make recommendations for combination. My belief and hope is that we can take advantage of graphical database techniques to do the fuzzy matching.
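One way to picture the fuzzy matching is a score that blends name similarity with closeness of birth years. The sketch below uses Python's standard `difflib.SequenceMatcher`. The dictionary keys, the 0.7/0.3 weights, the five-year window, and the 0.8 threshold are all assumptions chosen for illustration, not anything Ancestry or FamilySearch is known to use.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a ratio in [0, 1] for how alike two names are (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(persona, individual, max_year_gap=5):
    """Blend name similarity with closeness of birth years (illustrative weights)."""
    name_part = name_similarity(persona["name"], individual["name"])
    gap = abs(persona["birth_year"] - individual["birth_year"])
    year_part = max(0.0, 1.0 - gap / max_year_gap)
    return 0.7 * name_part + 0.3 * year_part


def suggest(persona, individuals, threshold=0.8):
    """Return (score, individual id) pairs at or above the threshold, best first."""
    scored = [(match_score(persona, ind), ind["id"]) for ind in individuals]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
```

The output is a ranked list of recommendations, not a merge. The decision to combine a persona into an individual stays with the researcher.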

Best,

Tom Wetmore



Ken Finnigan

Aug 3, 2022, 10:46:55 PM
to root...@googlegroups.com
I must admit Tom that I'd not heard of the concept of "persona" until this thread! Previously I'd been thinking of them as a "possibility". Though I've used Ancestry and Family Search for many years, I'd never made the connection. I appreciate the explanation you provided in the thread, as it really helped me solidify some of the thoughts I'd been having around the idea.

I would also agree that a level of fuzzy matching would be required to provide recommendations that are reasonable, without needing to be perfect. I do believe graph databases will provide interesting possibilities for this type of solution.

Regards
Ken


Thomas Wetmore

Aug 4, 2022, 4:56:23 AM
to root...@googlegroups.com
Ken,

In my opinion persona is an off-putting word. I don't remember where it came from. The idea is so simple. There are names found in evidence records and sources. We want to know if those names are of persons we are interested in and researching. To separate the concept of the names found in evidence and the persons we are researching, someone, somewhere came up with the idea of calling the names found in evidence, along with any associated properties, personas, and the persons being researched individuals. I think this idea is well understood, but the words get in the way.

When you use FamilySearch it is hard to tell the difference. Everything is just a person record. Some come from evidence, some are built up by genealogists. The goal of the FamilySearch single world tree is to have the world's genealogists slowly but surely merge all their billions of person (persona) records into a single world tree of persons (individuals).

In Ancestry, where every tree is unique, it can also be hard to tell the difference. Some persons you manually create, so they start out as unsourced individuals; some persons get added by adding records either by your own searching or by Ancestry's hints. Often a person gets into the tree because he/she is mentioned in evidence and you click on save. So the person starts out as a single persona. Then you typically find other records mentioning that person and you add/merge that record (persona) in. Soon you see conflicts between records (personas) in birth dates and other basic vital information, so then you directly edit the person to show the vital information you believe to be correct. As soon as you do that, your Ancestry person becomes an individual because it includes the evidence records, but it also includes some of your own research conclusions.

Best,

Tom Wetmore



Good birding,

Tom Wetmore, http://bartonstreet.com/tom/birds
Newburyport, Mass.
Think globally, bird locally.



paul...@gmail.com

Aug 4, 2022, 6:59:38 AM
to rootsdev
Sorry for the amateur intrusion...

" In my model of genealogy, researchers collect the best evidence about as many personas as possible, and their work consists of deciding which personas most properly combine into individuals"

I have to assume that all the items of evidence and their personae in (any kind of) a database each have a unique identifier. Could anyone spare the time to enlighten me as to how these are created and made available for global reference? That's something that has been bugging me for years.

Many thanks

Paul White

paul...@gmail.com

Aug 4, 2022, 7:21:04 AM
to rootsdev
" You can then compare them to determine the most appropriate fit based on sources, etc. It could then be possible to mark a persona, or set of sources, as being aligned with an individual"

A piece of evidence (persona being just one attribute of it) is to be "linked" to an individual. How?
Well, it's one researcher's judgment/opinion about a fit. But in real life there is usually some kind of qualification, isn't there?
e.g. the birth certificate includes a middle name not found so far anywhere else in the records. Or (very often) the other way around.

Piling up elements of evidence leads to many such "minor" discrepancies - all fine, all in a day's work.
But there should be an argument weighing up the pros and cons as part of "judgment" about the assembly as a whole?

Ken Finnigan

Aug 4, 2022, 9:44:27 AM
to root...@googlegroups.com
I was fairly loose in my description, but the idea is for a piece of software to provide a way to visualize/compare/examine all the different pieces of evidence to then make a judgement as to what is reasonably proven as likely, and what is not. What I didn't mention is I've also been thinking about the ability to assign weights to pieces of evidence to assist in distinguishing between primary, derived primary, and secondary, but in a way that acknowledges different pieces of information may have different levels of confidence from the same piece of evidence.
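A minimal sketch of the weighting idea, assuming each claim about a single fact (say, a birth year) carries both the kind of evidence it came from and its own confidence. The weight table, the claim layout, and the function name are hypothetical, and the numbers are illustrative only.

```python
from collections import defaultdict

# Hypothetical weights per kind of evidence; the values are illustrative only.
KIND_WEIGHT = {"primary": 1.0, "derived primary": 0.7, "secondary": 0.4}


def rank_values(claims):
    """Sum weight * confidence for each asserted value of one fact, best first.

    Each claim is (value, evidence_kind, confidence). The confidence, in [0, 1],
    is set per piece of information, so one source can be trusted for a name
    but doubted for an age.
    """
    totals = defaultdict(float)
    for value, kind, confidence in claims:
        totals[value] += KIND_WEIGHT[kind] * confidence
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Keeping the confidence on the claim instead of on the source is what lets the same record count strongly for one field and weakly for another.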

Regards
Ken

Thomas Wetmore

Aug 4, 2022, 10:33:06 AM
to root...@googlegroups.com
Paul,

The big providers of data have accumulated databases from different sources. Different databases have different ways to identify records and persons. Most probably have unique identifiers for each persona. There were probably many headaches integrating those databases so that they appear as a single database when users search.

In the case of FamilySearch each persona, whether found in a source of evidence, or created by a user, has a short alphanumeric string that is shown along with the name on all screens. For example, the great-great-great grandfather whose descendants I work on the most assiduously has the code LHK6-JF3. He is 'composed' of 16 personas, mostly done by myself. All of them except a couple 'legacy' personas have codes with the same pattern. A couple years ago I asked FamilySearch how those codes were generated, and was told not to worry about it. I have done some checking and nearly every mention of every person in every FamilySearch record has one of these unique codes. For example this person shows up in the 1860 and 1870 US censuses and has a unique code for each of those appearances. I did point out to FamilySearch (naturally, as I am a geek) that they would run out of possible codes in the not too distant future. Again, I was told not to worry about it.

There are now UUIDs -- universally unique identifiers. Most software packages have functions to generate them on the fly. UUIDs are 'guaranteed' to ALWAYS be unique, no matter how many people are generating how many IDs on how many computer systems (strictly, a collision is possible but so improbable that it can be ignored in practice). There are so many bits in a UUID (128) that there are about 3.4 x 10^38 possible IDs. That is not as many as there are atoms in the universe, but it is far more than genealogy will ever need. UUIDs are now used in many database applications to give each entity a unique id. Call UUID() and you're done. You've seen many of these UUIDs already in the form of 32 hex digits with a few hyphens inserted.
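As a concrete illustration, Python's standard library exposes this through the `uuid` module, where the random variant is `uuid.uuid4()`. The variable name `persona_id` is just an example.

```python
import uuid

# uuid4() builds a version 4 identifier from random bits.
persona_id = uuid.uuid4()

print(persona_id)           # 32 hex digits in 8-4-4-4-12 groups
print(len(persona_id.hex))  # 32
print(persona_id.version)   # 4
print(2 ** 128)             # size of the full 128-bit space
```

A random UUID says nothing about the record it labels, so it identifies an entity within one database but does not by itself tell two databases that they hold the same record.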

Tom Wetmore

Thomas Wetmore

Aug 4, 2022, 10:58:44 AM
to root...@googlegroups.com

On Aug 4, 2022, at 7:21 AM, paul...@gmail.com <paul...@gmail.com> wrote:

> A piece of evidence (persona being just one attribute of it) is to be "linked" to an individual. How?

Paul,

The 'how' is done by Ancestry and FamilySearch. I assume they add more links and connections to either our Ancestry trees or to the FamilySearch world family tree.

> Well, it's one researcher's judgment/opinion about a fit. But in real life there is usually some kind of qualification, isn't there?
> e.g. the birth certificate includes a middle name not found so far anywhere else in the records. Or (very often) the other way around.

The 'doing of genealogy' is up to each researcher. I think you can too aggressively over-combine or too conservatively under-combine. One of the great fun effects of combining personas is how a fuzzy picture of a person can come into focus. As you mention, one persona might unambiguously provide a middle name, another might unambiguously pinpoint the exact birth date or place. One of my favorite record types is the WWI and WWII draft registration cards. These are wonderful because they are filled out by a person about himself: name, birth date, birth place, address, place of employment, usually the name of a person 'who will always know where you are,' etc. If you pick up a full middle name from one of these you're in clover. There is nothing better than a source where a person himself or herself provides the information. Death records are the opposite. The person the record is about just died, and someone else is providing the information. If you were the 'informant' on your grandmother's death certificate, would you know her birth name, her birth place, who her parents were?

> Piling up elements of evidence leads to many such "minor" discrepancies - all fine, all in a day's work.
> But there should be an argument weighing up the pros and cons as part of "judgment" about the assembly as a whole?

As someone who has been doing genealogy for decades, I find these minor discrepancies are part of everyday life. I know that many of my conclusion individuals have errors because I have interpreted the evidence incorrectly, or don't yet have enough evidence (if it even exists). It is the true agony of doing genealogy, that we will never, ever really know. It can be so frustrating that I often find myself asking why I subject myself to such pain.

In terms of there being a way to judge the quality of the assembly as a whole, it's a jungle out there. There are guidelines and best practices you can find that discuss this problem, but most genealogists don't seem to worry much about it.

Best,

Tom Wetmore


paul...@gmail.com

Aug 4, 2022, 2:26:31 PM
to rootsdev
"In the case of FamilySearch each persona, whether found in a source of evidence, or created by a user, has a short alphanumeric string that is shown along with the name on all screens."

Many thanks, Tom, for that detailed reply. In one post you have covered rather a large amount of ground, so best if I stick to one point for now.
And, as a general remark, I should think my input would be a digression from the theme you so interestingly started. So I don't intend to pursue this here much further :).

That alphanumeric FamilySearch string is certainly not unique to any degree, simply because the transcribed details may be revised later and (as far as I know) simply update the same record. What *should* be unique is the combination of FSID with date retrieved (which, of course, is part of the proposed citation string).

Points I've never seen discussed are (a) how much we can trust the uniqueness of the collection name and (b) what is the exact meaning of "ark:/61903/1:2".

Sometimes, given access to images, we can tell there were different "paper" copies - besides repeated scans/photos. In such cases, for my own transcription, the problem has been how to uniquely identify the version I rely on. On FindMyPast, for example, there is usually no identifying string so how can I tell anyone specifically what I looked at? IMO that's vital because any kind of copy can differ from the original (BTs often miss out useful information).

I'd better stop.

Thanks and best wishes
Paul

paul...@gmail.com

Aug 4, 2022, 2:35:17 PM
to rootsdev
Ahem, with apologies, couldn't let that one pass....

" There is nothing better than a source where a person himself or herself provides the information."

(a) My grandfather's age at enlistment was deliberately inflated.
(b) My current subject falsified his age substantially at marriage, blocking discovery of baptism. By a real stroke of luck, marriage reference to his regiment happened to turn up a pension record that completely changed the picture.
(c) Census ages are notoriously unreliable. Sometimes the informant at death finds birth or baptism papers and gets it right.
(d) ad nauseam...

But otherwise I agree, hehe.

Thomas Wetmore

Aug 4, 2022, 4:09:55 PM
to root...@googlegroups.com
Mea culpa. Of course you are right in many cases!



Jan Murphy

Aug 4, 2022, 4:30:58 PM
to root...@googlegroups.com
On Thu, Aug 4, 2022 at 7:58 AM Thomas Wetmore <ttwetmore4@gmail.com> wrote:

> The 'doing of genealogy' is up to each researcher. I think you can too aggressively over-combine or too conservatively under-combine. One of the great fun effects of combining personas is how a fuzzy picture of a person can come into focus. As you mention, one persona might unambiguously provide a middle name, another might unambiguously pinpoint the exact birth date or place. One of my favorite record types is the WWI and WWII draft registration cards. These are wonderful because they are filled out by a person about himself: name, birth date, birth place, address, place of employment, usually the name of a person 'who will always know where you are,' etc. If you pick up a full middle name from one of these you're in clover. There is nothing better than a source where a person himself or herself provides the information. Death records are the opposite. The person the record is about just died, and someone else is providing the information. If you were the 'informant' on your grandmother's death certificate, would you know her birth name, her birth place, who her parents were?

Tom,

I was trained as a linguist. One important part of data collection and analysis is to keep track of who the informant is for samples of language collected. If someone decides to play a prank on the anthropological linguist, and gives out nonsense words or other suspect material, you need to be able to pull that data out of your corpus. Because of this training, I am perhaps more aware than others that we do not know who the informant is for many historical records, and I am more interested than many in how the records are created.

You say that the WWI and WWII draft registration cards "are filled out by a person about himself." Strictly speaking, this is not the case. They were filled out by the registrar, and signed by the registrant. Take a look at any local board and compare the handwriting of the cards which have the same registrar, or compare the signature of the registrar on the back of the card to the handwriting in the main body of the card. For example, here's a card that is indexed by Ancestry as belonging to "Sarrepan Kaeron". Compare the registrar's signature to the front of the card. The initial S in the first name matches the registrar's writing, not the registrant's signature.
 

The contrast between the US Censuses that are open to us and the 1911 and 1921 Censuses, where the householders filled out the schedules, is striking.

I'm sorry to be pedantic about this, but for us historical linguists, the 'mistakes' that happen when a person dictates answers to a clerk or census enumerator (especially when the two individuals are from different language groups) are our bread and butter. Having been trained to look for these things, and because of my experiences doing transliteration while taking my degree work in Classical Studies, I can't not see them.

With the WWI and WWII draft cards, we at least know the registrants got a chance to see the cards because they had to sign them, unlike US Census records, passenger lists, and so many other historical records we re-purpose for genealogy.

Your point about the death certificates is spot-on.  One of the languages studied by the professor who taught my anthropological linguistics classes has data source as a grammatical category, as English has singular and plural.  In addition to personal knowledge, "someone told me", and inferred knowledge, one of the options for data source  is "no one alive could know".  I think of this every time I look at a death certificate.  

Best,

Jan

Jan Murphy
Moderator Pro Tempore

 

Przemek Więch

Aug 5, 2022, 1:45:33 PM
to root...@googlegroups.com
This is a very interesting discussion. Last year I attempted to start writing up my ideas about storing genealogy information on the Web and linking between different resources (post 1, post 2). Unfortunately, I didn't have the time to follow up.

I haven't heard the persona/individual terminology before but I can see the concepts fit very well into what genealogy research is about. Also, wouldn't an individual for one researcher become a persona for another? Say researcher A publishes their findings on their website and researcher B finds them and incorporates some information into their own database. This way, information about a specific person on the website is an individual for A but a persona for B.

One of the things that was mentioned in this discussion is unique identifiers. This is something that is badly needed by genealogy researchers. Being able to reference a source with a unique identifier is a necessity. URLs or URIs are the Web's unique identifiers. It doesn't really matter how Ancestry creates the identifiers as long as they're unique and stable in time.

If every piece of genealogy information on the Web had a unique identifier and references between them were expressed using these identifiers, then we could get all this information and fit it into a graph database. Conversely, you could also look at the Web as a giant distributed graph database where you can (at least theoretically) query information using graph queries. However, this is more of a fantasy than reality if the most basic requirement of unique and stable identifiers is not met by so many Internet resources.
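A toy sketch of that idea, assuming every fact is a (subject, predicate, object) triple whose identifiers are URIs. All URIs and predicate names below are made up for illustration (example.org is reserved for such use), and no real site publishes data in this exact form.

```python
# Each fact is a (subject, predicate, object) triple; identifiers are URIs.
# All URIs and predicates below are invented for illustration.
site_a = {
    ("https://example.org/a/person/1", "name", "John Smith"),
    ("https://example.org/a/person/1", "citedIn", "https://example.org/records/77"),
}
site_b = {
    ("https://example.org/b/person/9", "name", "J. Smith"),
    ("https://example.org/b/person/9", "citedIn", "https://example.org/records/77"),
}

# Merging is additive: a set union never overwrites existing triples.
graph = site_a | site_b


def subjects_citing(graph, record_uri):
    """Return every subject that cites the given record URI, sorted."""
    return sorted(s for s, p, o in graph if p == "citedIn" and o == record_uri)


print(subjects_citing(graph, "https://example.org/records/77"))
# ['https://example.org/a/person/1', 'https://example.org/b/person/9']
```

The query only works because both sites refer to the record by the same stable URI. If each site minted its own identifier for record 77, the shared citation would be invisible to the merged graph.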

— Przemek



Ken Finnigan

Aug 5, 2022, 1:54:57 PM
to root...@googlegroups.com
Thanks for your thoughts Przemek!

I would agree that one researcher's "individual" is another one's "persona". Until each researcher has satisfied their specific criteria for judging the source information sufficient to be confident a specific "persona" is that "individual", the second researcher needs to be able to consider them a "persona". Your comment got me thinking about world trees and whether they will ever be fruitful given everyone's very personal "definition of done", to use engineering parlance.

I like the idea of unique identifiers for sources, as the same source can have different URLs/URIs depending on where that source is stored. Ancestry and FindMyPast have many of the same sources, but they're not available on the same URLs. In this case, does it mean they're different sources, or because it's from the same underlying data set they should be considered the same? It's an interesting question to think about in terms of a unique source identifier.

Ken

Enno Borgsteede

Aug 5, 2022, 2:22:50 PM8/5/22
to root...@googlegroups.com
Hello Przemek,

> If every piece of genealogy information on the Web had a unique
> identifier and references between them were expressed using these
> identifiers, then we could get all this information and fit it into a
> graph database. Conversely, you could also look at the Web as a giant
> distributed graph database where you can (at least theoretically)
> query information using graph queries. However, this is more of a
> fantasy than reality if the most basic requirement of unique and
> stable identifiers is not met by so many Internet resources.

You are very right that this is a fantasy, just as much as the idea of
all people using the same OS, instead of Linux, macOS, and Windows, and
a whole lot of others. Do you remember OS/2? :-)

Another thing, even older, I think, is Xanadu:

https://en.wikipedia.org/wiki/Project_Xanadu

And I remember seeing the documentary mentioned at the end of that
article. And I was quite amazed to see a person that actually believed
that this could be done.

Personally, I only invest time in FamilySearch, Geni, and Gramps. And I
hope that some time, we may have time to add personae in that.

But I'm afraid that is a fantasy too ...

Regards,

Enno


Richard Light

Aug 5, 2022, 2:37:57 PM8/5/22
to root...@googlegroups.com
Now joined the group - trying again!

Richard

-------- Forwarded Message --------
Subject: Re: [rootsdev] Graphical Databases
Date: Fri, 5 Aug 2022 19:33:46 +0100
From: Richard Light <richard...@gmail.com>
To: root...@googlegroups.com


On 05/08/2022 18:54, Ken Finnigan wrote:
> Thanks for your thoughts Przemek!
>
> I would agree that one researcher's "individual" is another one's "persona". Until each researcher has satisfied their own criteria for judging the source information sufficient to be confident that a specific "persona" is that "individual", the second researcher needs to be able to consider them a "persona". Your comment got me thinking about world trees and whether they will ever be fruitful given everyone's very personal "definition of done", to use engineering parlance.

Well, there is WikiTree, which offers the prospect of a single world tree for those who want to cooperate on their genealogical research. It's very keen on getting users to specify the sources for their assertions, though obviously people will offer sources with different levels of quality/precision. Anyone who has a record on WikiTree is considered (by one researcher, at least!) to be an "individual". They have a unique, persistent URL which dereferences (resolves) to their web page. It's all public; it's all free.

Obviously, as with any crowd-sourced web resource (think Wikipedia) you have to take the less good along with the excellent, but if you find that someone in your tree appears in someone else's, it's like being offered a whole block of pieces to fit into your jigsaw puzzle.

> I like the idea of unique identifiers for sources, as the same source can have different URLs/URIs depending on where that source is stored. Ancestry and FindMyPast have many of the same sources, but they're not available on the same URLs. In this case, does it mean they're different sources, or because it's from the same underlying data set they should be considered the same? It's an interesting question to think about in terms of a unique source identifier.

The process of putting a source onto the web is itself an act of interpretation. So "the Ancestry view" of a given source may not be identical to "the FindMyPast view" of exactly the same source. In our FreeBMD project, we aspire to transcribe each GRO index entry twice, to give increased assurance that the data is correct. Transcribing from handwritten sources, especially sources from earlier times when literacy couldn't be taken for granted and spelling wasn't standardized (and typewriters hadn't been invented!) involves considerable intellectual input to decide (a) what letters they actually wrote and (b) what they meant by them.

Richard


--

Richard Light
richard...@gmail.com
@richardofsussex

Ken Finnigan

Aug 5, 2022, 4:35:25 PM8/5/22
to root...@googlegroups.com
Hi Richard,

I'm aware of WikiTree, and have spent a little time on there, but honestly it appears super confusing at first glance. Maybe that impression alters with time, will have to see. I would agree that they endeavor to be more rigorous with source requirements, but to your point, it's not infallible, and likely nothing is.

I agree with your comments on Ancestry and FindMyPast using different approaches to digitization, which can lead to transcription differences. I applaud every effort by groups to ensure transcriptions are error free, as far as is feasible between people and computers. Though there may be transcription differences between the two services, I still see a benefit in a single reference for the source data set each site defines. One approach is to give the source data set one unique identifier, plus a sub-identifier, also unique, for each specific transcription of it: one for Ancestry and one for FindMyPast. There are approaches for representing the singular nature of the source type while also representing instance variations of it.
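
To make the identifier-plus-sub-identifier idea concrete, here is a small Python sketch. The class name, the field names and the `ds-0001` value are assumptions invented for illustration; neither Ancestry nor FindMyPast publishes identifiers in this form.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TranscriptionId:
    dataset_id: str   # hypothetical shared identifier for the underlying source data set
    provider: str     # who produced this transcription, e.g. "ancestry"
    version: int = 1  # assumed to be bumped whenever the provider revises the transcription

    def __str__(self):
        return f"{self.dataset_id}/{self.provider}/v{self.version}"

def same_source(a, b):
    """Two transcriptions describe the same source if they share a data set identifier."""
    return a.dataset_id == b.dataset_id
```

With this shape, `TranscriptionId("ds-0001", "ancestry")` and `TranscriptionId("ds-0001", "findmypast", 2)` compare as different transcriptions, while `same_source` still reports that they rest on the same data set.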

Ken

paul...@gmail.com

Aug 5, 2022, 7:01:49 PM8/5/22
to rootsdev
Hello, Ken, and many thanks, Przemek, for the support for UUIDs.

You have gone close to the heart of my enduring fury at the utter sloppiness of every "genealogy" "source" I have ever come across during some years of amateur family history research. They are almost exclusively either commercially driven (and wouldn't waste money on structure or features that don't make more profit), or non-profit (and could not afford to do so).

Picking up on the Ancestry example, what is a usable source and what does the id point to?
  • The full text of a "result"?
  • Any associated image?
The former is essentially useless as the content is unstable (transcriptions can be updated in different ways at different times).
The latter is close to what I want as an "original" (uninterpreted) source.
Both are vulnerable to corporate death.

An image ID is all very well, but needs also to be explicit about its provenance. It can be exceedingly difficult to locate the text of interest in early manuscripts (or poorly rendered ones, or where handwriting or original entry date is ambiguous), so some kind of "highlighting" layer would be a massive help.

FreeBMD has similar transcription stability problems plus the added absurdity of multiple transcription "records" for one published "line record". I never had a reply to the request for "line identity".

You might think that something easy like the UK GRO quarterly indexes could be re-indexed with page and line number (except that handwritten pages and MS amendments need special attention), and a project like FreeBMD would start off creating this.

But, in general, we can never achieve this level of source specificity without the whole-hearted commitment of the archives themselves. I can't see that ever happening.

Finally, in an ideal world of original source elements/records, every researcher makes their own transcription and subsequent interpretation.

Which is how it should be because those of others have limited authority.

Paul White

Ken Finnigan

Aug 5, 2022, 8:19:54 PM8/5/22
to root...@googlegroups.com
Hi Paul,

You're raising great questions.

I can see the benefit in a unique ID representing the source data set, i.e. "Ireland, Church of Ireland Search Forms for Baptisms, Marriages and Burials, 1731-1870". Such an ID can be utilized by any tool or website to uniquely represent the physical source data set being referenced.

From a source data set unique ID, there could then be a unique ID variation of it for particular sites/locations. As you point out, this is where it gets murky, both in terms of what information to collect and how to link/refer to it. In some respects, one could argue the link to a full-text result or image is secondary to capturing the details of where in the source data set the source record of interest can be found. Links/URLs change, and sites shut down, but knowing the source data set details, and details of the page/line within the source data set, enables researchers to find the same result even if the original link/URL is no longer pointing at the correct location. Provided there are other means to find that particular source data set!

Is FamilySearch a suitable place for "image source", even though they offer transcriptions inline? I also like the idea of a pure image with the option of separately linking to the transcription a particular site provides. Ideally, and this is pie-in-the-sky thinking, it would be great for images of all sources/records to be freely available on a central site and sites like Ancestry offer transcriptions of them, for a fee, for anyone not wanting to learn about handwriting to transcribe for themselves.

Ken Finnigan

paul...@gmail.com

Aug 6, 2022, 12:58:58 AM8/6/22
to rootsdev
Ken, Hey!

Of all the great ideas you mention, the one I have the most trouble with is reliance on the whim of a commercial data provider.

To put it bluntly, "Ireland, Church of Ireland Search Forms for Baptisms, Marriages and Burials, 1731-1870" means absolutely nothing to anybody outside Ancestry's data archive team. All it says is that they scanned a batch of images from goodness knows where and decided to call it that.

I think it's scandalous that there is no identification of repository and its library index "number", let alone a serious attempt to quote a meaningful page number (what the heck is an image number?).

And technology is well up to letting transcribers mark up the image to focus on the zone being transcribed. That becomes the "zone ID", which completes the hierarchy: Repository, Volume, Page, Zone. Oh dear, so expensive.
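
For what it's worth, the Repository, Volume, Page, Zone hierarchy could be sketched in Python along these lines. The colon-separated string format, the `p`/`z` prefixes and the example values are assumptions of mine, not an existing standard, and the parser assumes repository and volume codes contain no colons.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ZoneRef:
    repository: str  # hypothetical code for the holding archive
    volume: str      # the archive's own shelf mark or index number
    page: int        # page as numbered in the volume, not an image number
    zone: int        # marked-up region on the page that was transcribed

    def to_id(self):
        return f"{self.repository}:{self.volume}:p{self.page}:z{self.zone}"

    @classmethod
    def parse(cls, text):
        # Assumes exactly four colon-separated parts, as produced by to_id().
        repository, volume, page, zone = text.split(":")
        return cls(repository, volume, int(page[1:]), int(zone[1:]))
```

An identifier like this says nothing about which website hosts the image, which is the point: it survives a change of "supplier".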

But, that way, every researcher can raise a glass to professionally-presented source images.

Not to mention the time/money saved checking if the FindMyPast original is the same! And we would potentially become "supplier"-independent (as in the days before Ancestry & Co.) in the sense of knowing where to go for the original records. And, BTW, repositories ought to clean up their act and publish detailed contents of their holdings.

Yep, in my dreams.

Paul White

Ken Finnigan

Aug 6, 2022, 3:08:23 PM8/6/22
to root...@googlegroups.com
Thanks for opening my eyes to this Paul. For some reason I'd presumed the titles of a collection on Ancestry or FindMyPast were common between them, that's my mistake as I didn't try and verify it.

I really like your notion of a "zone id", but agree it will be difficult to get commercial vendors on board.

To that end, would there be any benefit in a community-driven catalog to capture "common" names and details of source data collections, with links to those collections on the various commercial and free sites? Is there already a site with such a catalog?

Ken

Richard Light

Aug 8, 2022, 8:25:57 AM8/8/22
to root...@googlegroups.com
On 06/08/2022 20:08, Ken Finnigan wrote:
> Thanks for opening my eyes to this Paul. For some reason I'd presumed the titles of a collection on Ancestry or FindMyPast were common between them, that's my mistake as I didn't try and verify it.
>
> I really like your notion of a "zone id", but agree it will be difficult to get commercial vendors on board.
>
> To that end, would there be any benefit in a community-driven catalog to capture "common" names and details of source data collections, with links to those collections on the various commercial and free sites? Is there already a site with such a catalog?

Ken,

One possible platform for such a catalogue would be WikiData. This is a Linked Data-friendly platform, like Wikipedia, which anyone can contribute to. Each concept which is recorded there gets a unique, persistent URL. WikiData would allow us to specify a canonical title for a collection (in multiple languages, if we wish!), and to indicate what name is given to that collection by various providers.

I see from a quick search that there are already records on WikiData for both Ancestry (https://www.wikidata.org/wiki/Q26878196) and FindMyPast (https://www.wikidata.org/wiki/Q5449873).

If someone can come up with a good example or two, I would be happy to have a go at expressing its properties and relationships in a WikiData-compatible format, so we can get a sense of what we might be able to produce.

Richard


paul...@gmail.com

Aug 8, 2022, 11:30:24 AM8/8/22
to rootsdev
Hi, Ken
Another point is that my ideal would be that all source citations include "pedigree".
That way, every citation can be traced up the hierarchy to any point up to the original repository.
And the very act of citation should add "you" to the chain as a node owned by you.
And if every node was "rich", anyone could add comments.
The owner may then choose to deprecate that citation (and perhaps raise another).
That's what I believe FreeBMD should do with its "erroneous" transcriptions, not just leave them intact as misleading "noise".
Similarly, Ancestry, on accepting a user correction, should deprecate the old citation and create a new one with a new ID.
(You can see all sorts of benefit here - I could run a scan of everything I had used to check for possibly important updates.)
I wish.
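
A citation "pedigree" with deprecation might look something like the following Python sketch. Every name here (`Citation`, `deprecate`, `stale`, the `c1`-style ids) is hypothetical; no site mentioned in this thread exposes such a structure as far as I know.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Citation:
    cid: str
    owner: str
    parent: Optional["Citation"] = None  # the citation this one was derived from
    deprecated: bool = False
    superseded_by: Optional[str] = None  # id of the replacement, if any
    comments: list = field(default_factory=list)

    def pedigree(self):
        """Owners from this citation up to the original repository."""
        chain, node = [], self
        while node is not None:
            chain.append(node.owner)
            node = node.parent
        return chain

def deprecate(old, new_id):
    """Mark a citation as deprecated and raise a replacement with a new id."""
    old.deprecated = True
    old.superseded_by = new_id
    return Citation(new_id, old.owner, old.parent)

def stale(citations):
    """Ids of citations that rest on a deprecated citation further up the chain."""
    result = []
    for citation in citations:
        node = citation.parent
        while node is not None:
            if node.deprecated:
                result.append(citation.cid)
                break
            node = node.parent
    return result
```

The `stale` scan is the "check for possibly important updates" step: run it over everything you have cited and it lists the citations whose upstream link has since been deprecated.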

Paul White


Thomas Wetmore

Aug 8, 2022, 1:35:37 PM8/8/22
to root...@googlegroups.com
Back on the topic of graphical databases.

(No complaints about the discussions about unique ids and problems with sources, but it seems that those problems are too huge to expect any movement on anyone's part within two lifetimes. What is wrong with using the old cite-your-sources techniques, other than that you have to do it yourself? There is bibliography software, though.)

Anyway. The GEDCOM FAMS, FAMC, WIFE, HUSB, CHIL links provide one set of lineage-linking relationships perfect for setting up a graphical database. In my LifeLines I have a number of functions that perform graphical analysis based on that structure, so I have good experience with those graphs.

A simplistic start to a genealogical database, based directly on GEDCOM, would be to use Person and Family (another approach calls this a Union, to endow it with less emotion) as the key node types, and husband, wife (or just two spouses), child, familyAsChild, familyAsSpouse relationships to connect them. Augment these with nodes for Event, Place and Date, with their associated relationships, and you've got a good family tree graph, very useful for many queries. Then add in the subgraph for Sources and Evidence and you've got a full-function conclusion system. Then add the sub-structure for personas, and you're approaching a research-level database.
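
As a minimal sketch of that starting point, here is the Person/Family core in plain Python, with no database behind it. The `Graph` class, the `add_family` and `parents` helpers, and the `I1`/`F1` ids are assumptions for illustration only; the relationship names are the ones listed above.

```python
from collections import defaultdict

class Graph:
    def __init__(self):
        self.labels = {}                # node id -> "Person" or "Family"
        self.edges = defaultdict(list)  # (node id, relationship) -> [node id]

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, src, rel, dst):
        self.edges[(src, rel)].append(dst)

    def out(self, src, rel):
        return self.edges.get((src, rel), [])

def add_family(g, fam, husband, wife, children):
    """Add one Family node and link it both ways, as GEDCOM does with FAMS/FAMC."""
    g.add_node(fam, "Family")
    for spouse, rel in ((husband, "husband"), (wife, "wife")):
        g.add_node(spouse, "Person")
        g.add_edge(fam, rel, spouse)
        g.add_edge(spouse, "familyAsSpouse", fam)
    for child in children:
        g.add_node(child, "Person")
        g.add_edge(fam, "child", child)
        g.add_edge(child, "familyAsChild", fam)

def parents(g, person):
    """Follow familyAsChild, then husband and wife, to reach a person's parents."""
    result = []
    for fam in g.out(person, "familyAsChild"):
        result += g.out(fam, "husband") + g.out(fam, "wife")
    return result
```

Event, Place, Date, Source and persona nodes would be further labels and relationship names on the same structure.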

I've got Neo4j on my Mac. I can write a LifeLines program to generate a CSV file to import my master GEDCOM into a Neo4j database. Then I can try some Cypher (the 'SQL' of graphical databases) to run some graph-based queries, e.g., find nearest common ancestors.
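
As a stand-in for the Cypher version, here is what a nearest-common-ancestors query amounts to, sketched in plain Python rather than in Neo4j. The `parents` dictionary (child id to list of parent ids) and both function names are hypothetical, and "nearest" is taken here to mean the smallest combined number of generations from the two people, which is only one possible definition.

```python
from collections import deque

def ancestor_depths(parents, person):
    """Map `person` and each of their ancestors to the minimum generation distance."""
    depths, queue = {person: 0}, deque([person])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in depths:
                depths[parent] = depths[current] + 1
                queue.append(parent)
    return depths

def nearest_common_ancestors(parents, a, b):
    """Common ancestors of a and b with the smallest combined generation distance.

    A person counts as their own ancestor at distance 0, so if a is an
    ancestor of b the result is {a}.
    """
    da, db = ancestor_depths(parents, a), ancestor_depths(parents, b)
    common = set(da) & set(db)
    if not common:
        return set()
    best = min(da[x] + db[x] for x in common)
    return {x for x in common if da[x] + db[x] == best}
```

For two first cousins this returns the shared pair of grandparents; for two unrelated people it returns an empty set.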