The new GEDCOM parser

Ron Savage

unread,

Nov 5, 2012, 2:04:42 AM11/5/12

to perl-...@perl.org

Hi

The new GEDCOM parser
This document is a collection of ideas which have been percolating
in my mind for a long time.

Comments welcome.

Ideas
Module name
Genealogy::Gedcom::Parser.

A place-holder, Genealogy::Gedcom
<http://metacpan.org/release/Genealogy-Gedcom>, is already on CPAN.

Note: This module was written before the new, major tools now
available were released. See Tools below.

ETA
There is no ETA for the parser.

However, certain Perl-based tools are now available which will make
coding a simple task. See Tools below.

See also 'Famous Last Words' :-).

UTF-8
The code will accept input files in utf-8, and generate files
containing utf-8 characters.

Apache and mod_perl
These will not be required. I only mention these because references
to them appear in the Gedcom.pm distro.

Logging
The code will have a built-in logger, so debugging, e.g., can be
turned on with a parameter to new().

This logger will use Log::Handler. See Tools below.

Sub-classing
Sub-classing the main module will be trivial, and samples will be
provided.

Sub-classing will be done with Hash::FieldHash. See Tools and the
FAQ below.

Grammars and grammar generators
Like Gedcom.pm, the code will read a GEDCOM grammar in BNF from a file.
I'll run this phase before shipping the module, so you don't have to.
See Tools below, specifically Marpa::Rules::Simple.

Bascially, this means the startling complexity of the code in
Gedcom.pm is a thing of the past.

Operating the parser
Using Marpa, callbacks are triggered when input is recognized.

So, when lines like these are encountered:

1 @<XREF:FAM>@ FAM
2 RIN <AUTOMATED_RECORD_ID>

Marpa will automatically call the callback attached to each tag.

Callbacks will probably have names like 'do_fam' and 'do_rin', i.e.
of the format 'do_$tag'.

The parameters passed to the callback include the non-tag text on
the line.

Default callbacks for all tags will be provided, each one doing its
part in parsing the parameters to the tag, and storing the result.

The result will probably be stored in a tree. See Tools below,
specifically Tree::DAG_Node.

Database support
A DBD::SQLite database is possible.

Tools
o Hash::FieldHash
Simplifies class-building.

As for the obvious question, why not use Moose, see the FAQ below.

o Log::Handling
Simplifies logging.

o Marpa::R2
This is the modern way to do parsing.

Home page <http://jeffreykegler.github.com/Marpa-web-site/>.

Jeffrey's blog about Marpa
<http://jeffreykegler.github.com/Ocean-of-Awareness-blog/>.

My recent article about lexing and parsing with Marpa

<http://www.perl.com/pub/2012/10/an-overview-of-lexing-and-parsing.html>.

o MarpaX::Simple::Rules
This module reads a grammar in BNF and generates a Marpa grammar.

Hence it will read a BNF version of the GEDCOM spec and output
the matching Marpa grammar.

o Tree::DAG_Node
The most sophisticated tree-handling code on CPAN. I've recently
become co-maintainer of this module.

FAQ
Why did you choose Hash::FieldHash over Moose?
My policy is to use the light-weight Hash::FieldHash for stand-alone
modules and Moose for applications.

Why did you choose to store the data in a tree?
A GEDCOM file's structure can be viewed as a tree, so my initial
plan is to store the data likewise.

--
Ron Savage
http://savage.net.au/
Ph: 0421 920 622

Stephen Woodbridge

unread,

Nov 5, 2012, 8:54:49 AM11/5/12

to perl-...@perl.org

Hi Ron,

This all sounds great. I have a question on your choice of using a tree
structure, can you explain that more? Are you thinking of the file being
the root, then having leaves like: indi, fams, famc, etc and then each
of these have their respective data hanging off those objects? Or are
you thinking the tree would represent the family relationships? I don't
see how the later will work.

-Steve

Ron Savage

unread,

Nov 5, 2012, 4:36:01 PM11/5/12

to perl-...@perl.org

Hi Steve

On 06/11/12 00:54, Stephen Woodbridge wrote:
> Hi Ron,
>
> This all sounds great. I have a question on your choice of using a tree
> structure, can you explain that more? Are you thinking of the file being
> the root, then having leaves like: indi, fams, famc, etc and then each
> of these have their respective data hanging off those objects? Or are
> you thinking the tree would represent the family relationships? I don't
> see how the later will work.

I got the idea from the nested structure in the GEDCOM doc itself. Any
nesting in the doc could be represented by children of a node in a tree.

But frankly, I have not thought it through. I really mentioned it to
help clarify my thoughts, and I strongly suspect actual coding will
modify my plan.

But in a bit more detail: An individual can be seen as a node in a tree,
in which case:

o They have a list of (2) grandparents (IVF aside!)

o They have a list of (N) spouses

o They have a list of (N) children

o (As a child) They have a list of (N) care-givers

o (As a patient) They have a list of (N) donors or organs or whatever

And fundamentally, in a tree, every node has N links (normally 1 up, N
sideways and N down), and those links have metadata which would be used
to represent the type of link.

The most obvious problem with a tree is the cross-links due to
divorce/re-marriage/adoption/...

Still, whatever the data structure chosen, /something/ has to be chosen
just to hold the data in memory.

Ron Savage

unread,

Nov 5, 2012, 5:06:44 PM11/5/12

to perl-...@perl.org

Hi Rich

On 06/11/12 01:53, rich...@comcast.net wrote:
> I didn't want to clog the list with this question, unless you feel it it worth it.

Post all questions and ideas. I don't want to develop code in a vacuum.
I only that if I /really/ know what I'm doing. Not so in this case...

> But would this new parser have reporting capabilities?

The basic module - no. It would be purely to parse the file and build a
data structure.

Code for traversing a tree is standard, from which any imaginable output
can be produced.

In the simplest case, any node is the root of a sub-tree, and the tree
can be traversed and (say) printed from that point on.

What you're talking about I assume is formatting that output.

> For example, I really like the layout of the FamilyTree add-on for GRAMPS, but it looks like it is not longer being updated, and I wish it was a little more modular, that it could display more fields than it does.

I googled and found this link:
http://www.gramps-project.org/wiki/index.php?title=Family_Tree

I have to say that's a very fancy report. Amazingly, to some extent,
such a report is (almost) already available from a tree structure.
Peruse the 4 links on this page to get the idea:

http://savage.net.au/Graph-module-demos.html

Hence my preference for trees.

> Plus I wish it was written in perl ; )

Don't we all.

> Rich

Philip Durbin

unread,

Nov 5, 2012, 5:47:42 PM11/5/12

to perl-...@perl.org

On 11/5/12 5:06 PM, Ron Savage wrote:
> I googled and found this link:
> http://www.gramps-project.org/wiki/index.php?title=Family_Tree
>
> I have to say that's a very fancy report. Amazingly, to some extent,
> such a report is (almost) already available from a tree structure.
> Peruse the 4 links on this page to get the idea:
>
> http://savage.net.au/Graph-module-demos.html
>
> Hence my preference for trees.

I just want to say that those GraphViz2 demos look great.

Well done, Ron.

Phil

Ron Savage

unread,

Nov 5, 2012, 6:41:23 PM11/5/12

to perl-...@perl.org

Hi Phil

$many x $thanx;

AKA

$thanx x $many;

Stephen Woodbridge

unread,

Nov 5, 2012, 7:23:01 PM11/5/12

to perl-...@perl.org

Ron,

I think this is a graph not a tree or at best an interconnect forest of
trees. Given a focus node like an individual or a family you can view
look at the trees up or down from that node.

In graph theory you have nodes and edges, and you can use Dijkstra's
shortest path to find the shortest route through the graph between the
start and end node. It does this by converting the graph into a tree,
but doing so does not maintain all the node because many of them are
parallel paths. You can do the same thing with a gedcom file where nodes
are individuals and families and the relationships are the edges.

Nodes and edges for sure. Tree versus graph, probably better to think of
it as a graph. Somethings you need to be able to model are:

IVF as you mentioned.
divorce/re-marriage/adoption/... again as you mentioned.
marriage and offspring of relations.

If it is a graph, you can use graph theory to process it and these are
well defined algorithms for traversal and manipulation of graphs.

-Steve

Ron Savage

unread,

Nov 5, 2012, 8:08:34 PM11/5/12

to perl-...@perl.org

Hi Steve

On 06/11/12 11:23, Stephen Woodbridge wrote:
> Ron,
>
> I think this is a graph not a tree or at best an interconnect forest of
> trees. Given a focus node like an individual or a family you can view
> look at the trees up or down from that node.
>
> In graph theory you have nodes and edges, and you can use Dijkstra's
> shortest path to find the shortest route through the graph between the
> start and end node. It does this by converting the graph into a tree,
> but doing so does not maintain all the node because many of them are
> parallel paths. You can do the same thing with a gedcom file where nodes
> are individuals and families and the relationships are the edges.
>
> Nodes and edges for sure. Tree versus graph, probably better to think of
> it as a graph. Somethings you need to be able to model are:
>
> IVF as you mentioned.
> divorce/re-marriage/adoption/... again as you mentioned.
> marriage and offspring of relations.
>
> If it is a graph, you can use graph theory to process it and these are
> well defined algorithms for traversal and manipulation of graphs.

Agreed.

In the module I was planning to use, Tree::DAG_Node, DAG is in fact
Directed Acyclic Graph :-).

Of course it may turn out to not be the best choice in these circumstances.

I'm also trying somewhat to avoid graph theory terminology at this
stage, since it's all hand-waving so far.

Further reading for interested parties:

https://en.wikipedia.org/wiki/Graph_theory

John Washburn

unread,

Nov 5, 2012, 8:32:17 PM11/5/12

to Stephen Woodbridge, perl-...@perl.org

I agree with Mr. Woodbridge. A directed graph is a better model.

Consider adoption where the adoptee knows their biological parents. This
individual has four "parents" two adoptive and two biological; or more
precisely two mothers and two fathers.

This does not fit well into a Tree (with 1 up) but is no problem for the a
directed graph between these 5 individuals. Some are connected by an edge
named "birth" and some by an edge called "adoption." A directed graph
handles the extension of this situation where the adoptive parents have a
biological child as well and/or the biological parent has other children not
put up for adoption or adopted by another family. A strict tree structure
is at best messy for this situation.

A graph traversal though can report that this person is my brother by
adoption not blood and that this other person is my sister by blood only
since we were adopted by different families.

The problem you will encounter is to take this nuanced, graph structure and
"squashing" it down into a GEDCOM tree which has a design biased toward
bloodlines over other human connections which people cherish. The directed
graph lets model the human connections and ignore this bias until it is time
to export the data or create the report.

I could even see the edge having a truth value in the closed interval:
[0..1]. For example: I am 70% sure this Percilla Chase is my ancestor and
30% that this Prissy Chase born in the same year one town over is my
ancestor. The edge connecting my ancestor has two sets of birth edges; one
for the 70% connection and one set for the 30% connection. The only one up
nature of a tree makes such uncertain/tentative connections difficult to
model.

-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2013.0.2742 / Virus Database: 2617/5876 - Release Date: 11/05/12

Ron Savage

unread,

Nov 5, 2012, 10:21:50 PM11/5/12

to perl-...@perl.org

Hi John

On 06/11/12 12:32, John Washburn wrote:
> I agree with Mr. Woodbridge. A directed graph is a better model.

I aggree. Conceptually it must be graph.

Sure. The question is: Is there a module on CPAN which will do the job?

I was a bit careless with the terminology. One issue to do with this
which I have not previously stated is that there is a Graph module on
CPAN: https://metacpan.org/release/Graph

I had a little play with it, and the results were incomprehensible.

Another issue is that the docs are a bit terse, and are (reasonably)
aimed at experts in the field. The author does refer to 'my fiendish code'.

I'm not an expert on graphs, and find the docs too difficult to follow
to make this module a possible contender.

So I still have the problem as to which CPAN module if any I adopt. I
have no intention to rewrite Graph, so I picked Tree::DAG_Node as a
first choice, not implying it's the best or most appropriate.

Solution: Don't know.

More below...

Weighting factors on edges are definitely nice-to-have, and I would do
nothing to preempt their implementation.

Stephen Woodbridge

unread,

Nov 6, 2012, 12:11:01 AM11/6/12

to perl-...@perl.org

Hi Ron,

I work with graphs for doing vehicle routing so have some familiarity
with them.

I think this is the family of graph tool to look at using:

http://search.cpan.org/~jhi/Graph/

and I think these work with:

Graph::Writer
Graph::Reader

And it is likely that this can trivially be integrated with graph
rendering tools like GraphViz and/or one of the other tools using this:

http://search.cpan.org/~neilb/Graph-ReadWrite-2.03/lib/Graph/Writer/Dot.pm

I also found a couple of modules that might be interesting to play with:

Graph::Similarity
Graph::Matching

If these can be applied to the task matching and merging overlaping
gedcom files.

And this module can be used to save and restore Graph structures in
relational databases.

Ok, this is starting to look very interesting. I guess it all starts
with being able to move data to/from Graph structures and GEDCOM files.

Ron, you got this coded yet? ;)

-Steve

Ron Savage

unread,

Nov 6, 2012, 12:42:15 AM11/6/12

to perl-...@perl.org

Hi

I should say that I think storing a GEDCOM file's content, sort of
literally, in a tree ought to work, because of the nested nature of the
defined structure of such file.

Creating another data structure to represent the linkage between people,
etc will be a separate issue.

Jeremy Slade

unread,

Nov 6, 2012, 1:54:49 AM11/6/12

to Ron Savage, perl-...@perl.org

You seem to be mixing two paradigms here: a stream parsing engine
(having all the callbacks per entity) and a document model builder. I
recommend to focus on making a solid GEDCOM stream parser (handling
UTF-8, well-defined grammar). There are many applications that don't
need to hold all the data in memory.

Once you have a solid *parser*, it's another matter to build object
models on top of that. The parser is universal functionality. Defining a
single object model that all consumers will be happy with is an
impossible task, so you'll be much better served in separating them.

Jeremy

Ron Savage

unread,

Nov 6, 2012, 2:14:56 AM11/6/12

to Jeremy Slade, perl-...@perl.org

Hi Jeremy

On 06/11/12 17:54, Jeremy Slade wrote:
> You seem to be mixing two paradigms here: a stream parsing engine
> (having all the callbacks per entity) and a document model builder. I
> recommend to focus on making a solid GEDCOM stream parser (handling
> UTF-8, well-defined grammar). There are many applications that don't
> need to hold all the data in memory.
>
> Once you have a solid *parser*, it's another matter to build object
> models on top of that. The parser is universal functionality. Defining a
> single object model that all consumers will be happy with is an
> impossible task, so you'll be much better served in separating them.

Agreed.

I've tried to clarify that with my last post:

> I should say that I think storing a GEDCOM file's content, sort of
> literally, in a tree ought to work, because of the nested nature of
> the defined structure of such file.
>
> Creating another data structure to represent the linkage between
> people, etc will be a separate issue.

Ron Savage

unread,

Nov 6, 2012, 4:49:36 AM11/6/12

to perl-...@perl.org

Hi Steve

On 06/11/12 16:11, Stephen Woodbridge wrote:
> Hi Ron,
>

> I work with graphs for doing vehicle routing so have some familiarity
> with them.

Good.

> I think this is the family of graph tool to look at using:
>
> http://search.cpan.org/~jhi/Graph/

Yes, that's the one I earlier said /I/ had trouble with. See a prevous msg.

Perhaps I will need to adopt it despite my recent experience!

> and I think these work with:
>
> Graph::Writer
> Graph::Reader

Yes they do. The author of Graph recommends them.

> And it is likely that this can trivially be integrated with graph
> rendering tools like GraphViz and/or one of the other tools using this:
>
> http://search.cpan.org/~neilb/Graph-ReadWrite-2.03/lib/Graph/Writer/Dot.pm
>
> I also found a couple of modules that might be interesting to play with:
>
> Graph::Similarity
> Graph::Matching

Clearly work has been done to write add-ons for Graph.

> If these can be applied to the task matching and merging overlaping
> gedcom files.

Interesting idea...

> And this module can be used to save and restore Graph structures in
> relational databases.

I did not check that deeply into their capabilities.

> Ok, this is starting to look very interesting. I guess it all starts
> with being able to move data to/from Graph structures and GEDCOM files.
>
> Ron, you got this coded yet? ;)

No. I seem to be spending way too much time answering emails. Hahahaha.

Perhaps the familiarity you mention above will provide you with the
wherewithal to beat me to it :-)). I look forward to your coming upload
to CPAN....

Nigel Horne

unread,

Nov 6, 2012, 6:12:43 AM11/6/12

to perl-...@perl.org

I'm putting together a website where you can upload a Gedcom
(temporarily, it won't be stored) and the site will then
give you links to the burial sites of your ancestors (where known). It
does this by using the Gedcom CPAN module to parse the data, then it
interrogates sites such as billiongraves.com,
findagrave.com and tombfinder.com. I have written the first draft
version of the engine which does the matching and am looking for Gedcom
data to test it with. You can upload data at http://nigelhorne.force9.co.uk/~njh,
or if you prefer you can e-mail a Gedcom to me.

Yes - the code will be open sourced (probably on Github) once I've put enough data through it to test it.

Thanks,

-Nigel Horne

--
Arranger, Adjudicator, Band Trainer and Clinician, Composer, Tutor, Typesetter.
NJH Music, ICQ#20252325, twitter: @nigelhorne @bbportal @brasscomposer
n...@bandsman.co.uk http://www.bandsman.co.uk

Stephen Woodbridge

unread,

Nov 6, 2012, 9:24:46 AM11/6/12

to perl-...@perl.org

Hi Ron,

I totally agree with Jeremy, that the parser is the key. Obviously you
need to supply it with test callbacks for verification as you build it.
But once you have that then it should be straight forward for people (or
you) to hook it into say Graph, or load it into a database or whatever.

I think having a reader and writer the conform to a grammar is the place
to start.

-Steve

Jeremy Slade

unread,

Nov 6, 2012, 1:17:50 PM11/6/12

to Ron Savage, perl-...@perl.org

I think you missed my point: The parser's job should not be to store
anything in memory. You deal with all the nitty-gritty of handling
different formats, understanding the grammar, etc, then provide a rich
set of callbacks so a client can implement whatever they need on top of
it. Parser != model.

Jeremy

Chris Clonch

unread,

Nov 6, 2012, 2:27:54 PM11/6/12

to perl-...@perl.org

Hi everybody!

With a graph theory connection, you could easily implement Randy
Wilson's ideas [1] on merging GEDCOMs. I've attempted to give thought
to how this could be implemented with Graph.pm (as seen in O'Reilly's
Algorithms with Perl) & Gedcom.pm but I couldn't wrap my brain around
the ins and outs of the data as a graph.

Like Jeremy stated, if a solid parser is written, then it would likely
be trivial (for others :) to extend using graph, tree, whatever for
advanced manipulation of the data. And the reverse could be true,
additional lexer's could be wrote to slurp in XML, FOAF, etc that comes
down the road...

Got me excited again!

-Chris

1. http://synapse.cs.byu.edu/~randy/gen/Remerge.html

> Scanned and tagged as non-SPAM with DSPAM 3.9.0 by Your ISP.com

Scanned and tagged as non-SPAM with DSPAM 3.9.0 by Your ISP.com

Ron Savage

unread,

Nov 6, 2012, 5:10:51 PM11/6/12

to perl-...@perl.org

Hi Chris

On 07/11/12 06:27, Chris Clonch wrote:
> Hi everybody!
>
> With a graph theory connection, you could easily implement Randy
> Wilson's ideas [1] on merging GEDCOMs. I've attempted to give thought to
> how this could be implemented with Graph.pm (as seen in O'Reilly's
> Algorithms with Perl) & Gedcom.pm but I couldn't wrap my brain around
> the ins and outs of the data as a graph.

Guess I'll have to reassess the appropriateness of Graph.pm. I may well
have to use it.

> Like Jeremy stated, if a solid parser is written, then it would likely
> be trivial (for others :) to extend using graph, tree, whatever for
> advanced manipulation of the data. And the reverse could be true,
> additional lexer's could be wrote to slurp in XML, FOAF, etc that comes
> down the road...
>
> Got me excited again!

Good!

> 1. http://synapse.cs.byu.edu/~randy/gen/Remerge.html

Thanx for the reference.

Ron Savage

unread,

Nov 6, 2012, 5:31:39 PM11/6/12

to perl-...@perl.org

Hi Jeremy

On 07/11/12 05:17, Jeremy Slade wrote:
> I think you missed my point: The parser's job should not be to store
> anything in memory. You deal with all the nitty-gritty of handling
> different formats, understanding the grammar, etc, then provide a rich
> set of callbacks so a client can implement whatever they need on top of
> it. Parser != model.

Have you considered sub-classing Genealogy::Gedcom, a module of mine on
CPAN?

Attached is a run with sample data which ships in the data/ dir:

As for new work, you /could/ use that module, since
Genealogy::Gedcom::Reader::Lexer already has a long list of methods
called tag_*() which you could override.

The input data is stored in an array for any further processing desired.

I encourage people to play with it, at least.

What I'm talking about is new code, which would have 1, but better 2,
data structures holding parsed data.

o A representation of the literal contents of the input file, perhaps as
an array (as above) but probably in a tree structure.

o Interpreted data, probably in the graph structure.

Basically, having a built-in set of callbacks, a la the tag_*() above
simply save everyone re-inventing the wheel.

And my policy would be to always store the data in memory.

Sub-classes can then use a documented interface to access that data to
copy it into other formats.

The reason for storing it is to provide a basic set of features such as:

o List all individuals (and optionally their data of course)
o List all families
o For a given individual:
- List all siblings
- List all descendants
- List all ancestors
- List all ancestors siblings
- Etc
o Likewise something for a given family
o Report (display, print) also

ged.report.log