What do you hope to see in the next GEDCOM version?

Luther Tychonievich

Apr 13, 2020, 11:56:36 AM
to root...@googlegroups.com
RootsDevers,

Although this list has gone quiet, I've still got my hands in half a dozen pots related to our interests. One of them is FamilySearch's early work on a new version of GEDCOM.

Do you want a new version of GEDCOM? If you could have your way, what would it look like? Any changes you've longed to see? Any changes you think others might suggest that you don't want to see, either because they are bad ideas or too involved to implement?

— Luther

Thomas Wetmore

Apr 13, 2020, 1:11:47 PM
to root...@googlegroups.com, LifeLines System
Luther,

A topic I discussed vociferously for years. My only pleas would be the same as I have always had for genealogical data formats:

1. Simplicity.

2. Terseness.

3. Flexibility. FLEXIBILITY. FLEXIBILITY.

4. Human readability.

5. Get rid of as much kowtowing to "standards" as possible.

The move to standardize genealogical data with international standards (e.g., dates, geographic locations, personal names, ...) is what turned me against efforts to build the next generation formats. I have always been firmly convinced that, in the genealogical domain, dates, locations, and names (and many other things as well) are inherently unstandardizable, and efforts to standardize them are wrong-headed. The belief that standards are required to glean and process information is, in my opinion, an excuse to not face up to and solve the problems of understanding rich information. And until this recognition occurs, genealogical data standards, though they can be made to work in the majority of cases, will not rise to the level of their true possibility.

The incredible richness of the information that is of interest to genealogists must be embraced. The efforts I saw developing that turned me away were the obvious attempts, by very well-meaning yet very unwise persons, to try to legislate that richness away. And the polemics those unwise persons spread about. I have been very happy down here in my hole ever since!

A genealogical standard should be flexible in the face of nonstandard information. This was the design principle of the LifeLines program going on forty years ago now. The LifeLines design principles were to use a simple syntax (lineage-linked GEDCOM was chosen; twenty years later it would have been XML), but to allow that syntax to build structures with minimal restrictions.

Good luck,

Tom Wetmore

Louis Kessler

Apr 13, 2020, 3:58:03 PM
to rootsdev
Luther,

Yes, we need to bring GEDCOM forward, getting rid of obsolete constructs, fixing some mistakes, and adding a very few new features only if they are absolutely needed.

Current GEDCOM is very good. Its biggest problem is that many of the features it contains are mostly ignored (e.g., PAGE, TYPE, ASSO tags) and are not being used as well as they should be. Vendors are instead using their own non-standard alternatives.

Breaking changes should not happen. Every single change will affect all vendors, so each must be absolutely needed and be the simplest change possible. If it can be done from current GEDCOM syntax, then it should (e.g., EVEN.TYPE or FACT.TYPE instead of new tags).

We need to teach all vendors how to use current GEDCOM properly, and to get them to update their output and input to 5.5.1, eliminating as many of their own illegalities and custom tags as possible. The goal is that the data must be transferable.

Louis

Dallan Quass

Apr 14, 2020, 1:58:13 AM
to root...@googlegroups.com
I'd suggest being conservative in changes. Yes, GEDCOM has problems, but it's mostly water under the bridge at this point. People who process GEDCOMs have already written a lot of code to handle the inconsistencies. (It's like parsing HTML in the 90's.) Changes at this point aren't going to make GEDCOM parsing any easier because you still have to import the old file formats. Instead, changes are likely to result in two things:

1) Organizations who export and import GEDCOMs will have to spend development resources modifying their GEDCOM processing code (which may not have been touched for years). Most of these organizations are small shops and there is only so much time in the day, so we need to make sure that the organizations' customers are going to see value from the changes, otherwise the organizations may choose to ignore them.

2) Some of the organizations will implement some of the changes incorrectly, so every organization that imports GEDCOMs will now have to deal with yet more inconsistencies.

I think that a couple of changes to GEDCOM would certainly be worthwhile. 

1) An extension to allow people to export and import their media alongside their data in a zip file format would be fantastic. It would make it possible for us to be in compliance with the GDPR's right to data portability: https://gdpr-info.eu/art-20-gdpr/ 

2) Modifying the specification to be more in line with the GEDCOMs that organizations are actually generating, and then writing a validation program and creating a certification process that organizations can go through to 'certify' that their GEDCOM importer and exporter are able to correctly import and export GEDCOMs would be very welcome. In fact I'd say this should be a requirement if we are thinking of changing the specification, in order to avoid misunderstandings of the changed specifications leading to more inconsistencies in the future.
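
As a rough illustration of the first suggestion (media exported alongside data in a zip file), here is a minimal Python sketch. The archive layout is purely an assumption made for the example: the file name `tree.ged`, the `media/` folder, and the helper `bundle_gedcom` are hypothetical and not part of any GEDCOM specification.

```python
import io
import zipfile


def bundle_gedcom(gedcom_text: str, media: dict) -> bytes:
    """Pack a GEDCOM file and its media into one zip archive (hypothetical layout)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Assumption: the data file sits at the archive root under a fixed name.
        zf.writestr("tree.ged", gedcom_text.encode("utf-8"))
        for rel_path, data in media.items():
            # Assumption: media keep the relative paths used in the FILE lines.
            zf.writestr(rel_path, data)
    return buf.getvalue()


gedcom = "0 HEAD\n1 CHAR UTF-8\n0 @M1@ OBJE\n1 FILE media/grandma.jpg\n0 TRLR\n"
archive = bundle_gedcom(gedcom, {"media/grandma.jpg": b"placeholder image bytes"})

# Reading the archive back shows that data and media travel together.
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()
print(names)
```

Keeping the paths in the FILE lines relative to the archive root is one way an importer could resolve media without knowing anything about the exporting machine.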

Dallan

--

---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rootsdev/CAA%3D1x%3DWH9KvQSfCda27Mk70NhH76geO_QvgJd3Xr5bnWJ8nb%3Dw%40mail.gmail.com.

John Cardinal

Apr 14, 2020, 11:14:02 AM
to root...@googlegroups.com

I agree that sweeping changes are not the correct next step. Still, some changes would help a lot. While many software publishers have already implemented code to handle some of GEDCOM's quirks and/or some of its characteristics that are based on 1995 technology, that doesn't help people or organizations who are starting from a blank slate.

 

I could list at least 50 items off the top of my head. I'd expect push-back on some (if not all) of them, and I'd also expect that few would be accepted and implemented.

 

Some changes would be changes/fixes to the spec only and wouldn't affect any commercial-quality products. For example, the OBJE.FILE value is limited to 30 characters in 5.5.1. Similarly, the MAP.LATI and MAP.LONG values are too short. The 5.5.1 spec says "{5:8}" for both, but even the examples in 5.5.1 for both LATI and LONG exceed the spec limit of 8 characters. (Too short, and internally inconsistent!) For FILE, LATI, LONG, and several others, implementors ignore the spec in order to transfer common, legitimate values. It's not good when a standard is routinely ignored.
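
To make the length problem concrete, here is a small Python sketch that checks coordinate values against the {5:8} size range quoted above. The two sample values are illustrative assumptions in the N/E-prefixed decimal style, and `fits_spec` is a hypothetical helper name.

```python
# Size range quoted above for both LATI and LONG in 5.5.1.
MIN_LEN, MAX_LEN = 5, 8


def fits_spec(value: str) -> bool:
    """True if the value's length lies within the declared {5:8} range."""
    return MIN_LEN <= len(value) <= MAX_LEN


# Assumption: illustrative values with six decimal places of precision.
samples = {"LATI": "N18.150944", "LONG": "E168.150944"}
results = {tag: (len(value), fits_spec(value)) for tag, value in samples.items()}
print(results)
```

Both values are longer than 8 characters, so an importer that enforced the limit strictly would have to reject or truncate them.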

 

Another change is to allow CHAR UTF-8 only. ANSEL is officially obsolete and not included as a built-in capability in development environments, so many programs don't support it, and those that do often implement it incorrectly. "UNICODE" is ambiguous and unnecessary given UTF-8 is unambiguous and does everything "UNICODE" can do. "ASCII" is too limited and has been prone to incorrect implementations, such as allowing characters > 0x7F in such files, using "ASCII" when the file is actually Windows-1252 (or another Windows code page), etc.
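
A minimal Python sketch of the kind of mismatch described above: a file that declares one CHAR value while its bytes say otherwise. The function name `check_declared_encoding` and its return strings are hypothetical, and only the ASCII and UTF-8 cases are handled.

```python
def check_declared_encoding(raw: bytes, declared: str) -> str:
    """Compare a declared CHAR value against the bytes actually present."""
    has_high_bytes = any(b > 0x7F for b in raw)
    if declared == "ASCII" and has_high_bytes:
        return "invalid: bytes above 0x7F in a file declared ASCII"
    if declared == "UTF-8":
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            return "invalid: not well-formed UTF-8"
    return "ok"


name_line = "1 NAME José /Núñez/\n"
windows_bytes = name_line.encode("cp1252")  # what a Windows-1252 exporter would write
utf8_bytes = name_line.encode("utf-8")

print(check_declared_encoding(windows_bytes, "ASCII"))
print(check_declared_encoding(windows_bytes, "UTF-8"))
print(check_declared_encoding(utf8_bytes, "UTF-8"))
```

With UTF-8 as the only permitted value, the check reduces to a single strict decode, which every mainstream development environment provides.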

 

Viewing things more broadly, here are some areas of interest to me:

 

  • Changes to adapt to modern technology (such as UTF-8)
  • Changes to fix/clarify/disambiguate the spec
  • Changes to eliminate dormant/obsolete ideas from the spec
  • Additions to standardize some common extensions

 

I'd go further than that, too, but I suspect there would be a lot of pushback and I'd get shouted down. Still, it's 2020, and the last changes were in 1999. Technology has changed an incredible amount since 1999, as has the practice of genealogy and the software tools used by genealogists. The standard needs to be updated.

 

John

Robert Gardner

Apr 14, 2020, 11:15:51 AM
to root...@googlegroups.com
To Dallan's point, if you are going to make changes to GEDCOM you will be more successful at getting them adopted if you provide libraries for importing/exporting GEDCOM in your new format. They would need to provide significant functionality that makes using the new format compelling, and dead simple to incorporate into an existing app.

Also, a large fraction of new development is moving to mobile, so providing mobile-friendly libraries might be a big deal. Yes, the developer can send the files to the server for processing, but the trend is to do more on the mobile device, and with a parser running on the device, functionality can be enabled that might otherwise be challenging.

A quick search for GEDCOM libraries shows ones in Python, C, and TypeScript. I'm not sure what would be most common server-side, but these certainly aren't that useful on mobile.

John Cardinal

Apr 14, 2020, 11:34:16 AM
to root...@googlegroups.com

Robert,

 

I don't think it's practical to provide libraries and I am skeptical that any organization would accept that responsibility. I use C#, but others use Java, C++, Swift, Python, PHP, etc. A library written in C++ and called from other platforms is technically feasible, but such libraries are often problematic and often at odds with the strengths/idioms of the host language. Perhaps more importantly, would a shared library support my requirements while also supporting the requirements of other software publishers? I doubt it.

 

If an authorized organization published a reference program that implemented the standard and was open source with an MIT-type license, that would help new developers and would also help resolve issues with a new spec without necessarily providing components used in other software.

 

John

Robert Gardner

Apr 14, 2020, 12:16:58 PM
to root...@googlegroups.com
I used to have that same attitude until I started doing work that relies on npm, which hosts tens of thousands of open source reusable JS and TS libraries. They have made an enormous impact on my productivity and development enjoyment. 

I would think that a well thought out interface with an implementation in a small number of languages would greatly improve your chances of adoption. 

On the mobile front the problem is even easier since you only have 2 platforms to worry about and well-defined mechanisms for delivering libraries (such as Pods on iOS). My current mobile app project uses about a dozen of these native libraries and hundreds of npm modules. 


John Cardinal

Apr 14, 2020, 2:25:10 PM
to root...@googlegroups.com

Robert,

 

We'll have to agree to disagree. Restricting implementations to "a small number of languages" would never be satisfactory to me, and I doubt that any library--even in my preferred language/platform--that would satisfy Ancestry.com or RootsMagic would ever satisfy me, and vice-versa.

 

I've been working in JavaScript a lot lately and npm has its uses. Half the reason for its existence, however, is the lack of an adequate run-time library for JavaScript. Overall, developing JavaScript-based solutions is better than it has ever been, but still lags far, far behind truly integrated IDEs.

 

John

Gordon Clarke

Apr 14, 2020, 2:31:06 PM
to rootsdev
Everyone,

It's great to see this activity on RootsDev !!  I need to get back in touch with Tom and Robert.  It's been too long.  Everyone keep the conversation going.

Gordon

Thomas Wetmore

Apr 14, 2020, 4:14:01 PM
to root...@googlegroups.com
Luther,

Some questions back to you.

Are there real plans to come up with a new version of GEDCOM? Or are you on an exploratory mission? If there are real plans, is it the church that's behind it, or some other group?

If there are plans, would the new version be based on the GEDCOM syntax of old, or would there be an evolution to another structure type?

How important is backward compatibility?

In my earlier response I made the tacit assumption that any effort for a new version of GEDCOM, this many years after 5.5.1, would be to a whole new format, probably based on XML or JSON, and with more up to date internal models for genealogical data. But this need not be the case. My comments only make sense in the context of a redesign. If you are talking about tweaking 5.5.1, then I wish you good luck, but I wouldn't see enough point behind that effort to be involved.

Best,

Tom Wetmore



Ryan Heaton

Apr 14, 2020, 4:47:59 PM
to root...@googlegroups.com
Hi Tom.

The initiative for which Luther is gathering input in this thread is a new version of GEDCOM. Yes, there are real plans being initiated and driven by FamilySearch. It is a new version "based on the GEDCOM syntax of old."

FamilySearch's initiative for "whole new format [...] based on XML or JSON [...] with more up to date internal models for genealogical data" is GEDCOM X, which is still an active project being used in a number of different applications and ever welcomes your input and involvement.

Thanks!

-Ryan

P.S. Nice to hear from you. It's been a while. I'm still hoping to someday meet you face to face.


Luther Tychonievich

Apr 14, 2020, 4:48:47 PM
to root...@googlegroups.com, Gordon Clarke
On Tue, Apr 14, 2020 at 4:14 PM Thomas Wetmore <ttwet...@gmail.com> wrote:
> Luther,
>
> Some questions back to you.
>
> Are there real plans to come up with a new version of GEDCOM? Or are you on an exploratory mission? If there are real plans, is it the church that's behind it, or some other group?

I am not a representative of the church nor in a position to comment on its plans, but I can say that the church has assigned at least some FamilySearch personnel time to preparing for a new version of GEDCOM, and that I have been in contact with them about this. Perhaps Gordon or other FamilySearch employees on this list care to share more?
 
> If there are plans, would the new version be based on the GEDCOM syntax of old, or would there be an evolution to another structure type?

I have been in discussions on the pros and cons of each view. As I understand it, there is a camp for "let's make a few housekeeping fixes" and a camp for "let's start from a clean slate" and not yet enough opinions collected from people who'd have to implement whatever decision is made to know which is the more palatable decision.
 
> How important is backward compatibility?

A fascinating question. My personal stance is that if we create a suitable bridge between GEDCOM-5.3/5.4/5.5/5.5.1/X and whatever comes next, we can get the best of both worlds: old files remain useful and new files don't have baggage. But such a bridge is more easily mentioned than specified, and more easily specified than realized, especially given the 20+ GEDCOM 5.5.1 dialects that exist given the various extension tag sets and non-standard implementations out there. It is not clear whether others in the conversation share my view on this.

Of course, if we go the "less standard, more flexible" route, a bridge may become easier, depending on how it works out in practice.
 

Luther Tychonievich

Apr 14, 2020, 4:51:49 PM
to root...@googlegroups.com
On Tue, Apr 14, 2020 at 4:48 PM Ryan Heaton <ry...@webcohesion.com> wrote:
> Hi Tom.
>
> The initiative for which Luther is gathering input in this thread is a new version of GEDCOM. Yes, there are real plans being initiated and driven by FamilySearch.

If I had just waited two more minutes to press send...  Thanks, Ryan!


Thomas Wetmore

Apr 14, 2020, 5:23:12 PM
to root...@googlegroups.com
Luther,

Quoting Louis:

> Luther,
>
> Yes, we need to bring GEDCOM forward, getting rid of obsolete constructs, fixing some mistakes, and adding a very few new features only if they are absolutely needed.
>
> Current GEDCOM is very good. Its biggest problem is that many of the features it contains are mostly ignored (e.g., PAGE, TYPE, ASSO tags) and are not being used as well as they should be. Vendors are instead using their own non-standard alternatives.
>
> Breaking changes should not happen. Every single change will affect all vendors, so each must be absolutely needed and be the simplest change possible. If it can be done from current GEDCOM syntax, then it should (e.g., EVEN.TYPE or FACT.TYPE instead of new tags).
>
> We need to teach all vendors how to use current GEDCOM properly, and to get them to update their output and input to 5.5.1, eliminating as many of their own illegalities and custom tags as possible. The goal is that the data must be transferable.
>
> Louis

If you are sticking with GEDCOM syntax, and want backward compatibility, I think Louis's comments are where to begin.

Beyond what Louis says, I'd add two things:

1. Absolutely insist that UTF-8 is the one and only supported character format. There is no excuse for any developer, anywhere in the world, not embracing this wholeheartedly.

2. Provide detailed "semantic" specifications for exactly what each tag and tag structure means in every context in which they can occur. Misinterpretation is usually cited as the overarching problem with current GEDCOM, vis-à-vis vendors, so this problem should be addressed from the beginning.

Other ideas that I think would be very useful are:

1. Provide a reference tool to show proper interpretation of all features. A vendor could pass it a file, and the tool would report on validity and, if valid, report on the exact semantics found in the file, maybe even output a graphical diagram of exactly what the file contains. This would be a lot of fun to program.

2. Provide a reference library that does the parsing, validation, and even the conversion of a parsed file into an internal "DOM" structure. Vendors around the world use libraries like this for XML and JSON, and it's not at all far-fetched to imagine genealogical vendors recognizing the value and doing the same. If this library made its debut at the same time as the new standard, that could be the quantum leap point for vendors to come into line. Imagine if all vendors were able to read files and build up exactly the same internal DOMs for those files. There is current discussion on this point here. I'd be in favor of choosing a few languages (obviously C++, Java, and Swift) on server side, and one or two client side languages. I can't choose those as I have only been a server side programmer (which used to be called just a programmer) for the past 50 years. But this implies either real bucks or great volunteers.

(If such a reference library existed, it would make up the input side of the reference tool, if that isn't already obvious.)
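
To give a feel for the smallest possible version of such a validity check, here is a Python sketch that tests only line shape and level nesting. It is an assumption-laden toy, not a reference tool: the regular expression is a simplified reading of the "level [xref] tag [value]" line form, and the names `LINE_RE` and `validate` are hypothetical.

```python
import re

# Simplified line shape: level, optional @xref@, tag, optional value.
LINE_RE = re.compile(r"^(\d+) (?:(@[^@ ]+@) )?([A-Za-z0-9_]+)(?: (.*))?$")


def validate(text: str) -> list:
    """Return (line_number, problem) pairs; an empty list means no problems found."""
    problems = []
    prev_level = -1
    for num, line in enumerate(text.splitlines(), start=1):
        match = LINE_RE.match(line)
        if match is None:
            problems.append((num, "does not match 'level [xref] tag [value]'"))
            continue
        level = int(match.group(1))
        # A line may go at most one level deeper than the line before it.
        if level > prev_level + 1:
            problems.append((num, f"level jumps from {prev_level} to {level}"))
        prev_level = level
    return problems


good = "0 HEAD\n1 CHAR UTF-8\n0 @I1@ INDI\n1 NAME John /Doe/\n0 TRLR"
bad = "0 HEAD\n2 CHAR UTF-8\nNAME John /Doe/\n0 TRLR"
print(validate(good))
print(validate(bad))
```

A real reference tool would go much further, checking which tags may appear under which parents and what each value means, which is where the semantic specifications mentioned earlier come in.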

Best,

Tom Wetmore



Doug Kennard

Apr 14, 2020, 7:26:42 PM
to rootsdev
I agree that UTF-8 should be the only character encoding allowed. I also like Dallan's suggestion of a standard way to export/import media alongside data in a .zip file.

I would also suggest getting rid of continuing lines (CONC/CONT) and arbitrary limits on lengths (255 wide characters for gedcom_line, 32K record limit, etc.). They were necessary in the past, but in a day when even mobile devices have gigabytes of memory, splitting and recombining individual lines based on a short, arbitrary limit is just unnecessary complexity that should be completely removed. Going forward, lengths and sizes should really not be limited by the specification itself, only by practical considerations of the software and systems using the data: "Sorry, there is not enough free memory to load the 83-TB file 'entire_human_race.ged.'"
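
For readers who have not had to implement it, this Python sketch shows the recombination step every importer currently carries. It follows the 5.5.1 behavior in which CONT starts a new line of text and CONC continues the same line with nothing added; the function name `join_continuations` and the (tag, value) input shape are assumptions made for the example.

```python
def join_continuations(lines: list) -> str:
    """Rebuild one logical value from a line and its CONC/CONT sub-lines.

    `lines` holds (tag, value) pairs: the parent line first, then its
    continuation sub-lines in file order.
    """
    _, text = lines[0]
    for sub_tag, sub_value in lines[1:]:
        if sub_tag == "CONT":
            text += "\n" + sub_value  # CONT begins a new line of text
        elif sub_tag == "CONC":
            text += sub_value  # CONC appends directly, with no added space
        else:
            raise ValueError(f"not a continuation tag: {sub_tag}")
    return text


note = [
    ("NOTE", "Born in a small vil"),
    ("CONC", "lage near the coast."),
    ("CONT", "Emigrated in 1892."),
]
joined = join_continuations(note)
print(repr(joined))
```

The exporter has to do the reverse, choosing split points that do not lose leading or trailing spaces, which is where many of the well-known CONC bugs come from.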

Thomas Wetmore

Apr 14, 2020, 10:00:41 PM
to root...@googlegroups.com
Good points here. The CONC/CONT issue is notorious in GEDCOM lore. I agree that there should be no limits on the length of a GEDCOM line's value. However, I don't like the idea of having no way to break lines when there is good reason for doing so. CONC/CONT isn't the answer, but I'd hope that there is one.

I agree that there should be no limit on the size of GEDCOM "records". My own contribution to genealogical software, LifeLines, allows GEDCOM values of any length, GEDCOM "records" of any size, and sub-tagging to any depth. It's hard to imagine that a new official GEDCOM standard would actually have semantics allowing substructures to a depth deeper than, I don't know, ten or so, but LifeLines has so few restrictions that depth of tags to 10, 100, 1000, ..., work fine. It's HARDER to write software to enforce these limitations.

Tom Wetmore

Louis Kessler

Apr 14, 2020, 11:36:12 PM
to rootsdev
In the days of BetterGEDCOM, Tom and I and others had many excellent discussions, ad infinitum, leading to nothing being built. FHISO was created out of that, and it produced many excellent discussions and papers, leading to nothing being built.

We have many differing opinions here. There are already some things said just in this thread that I do not fully agree with, and we could start discussion that can go on, ad infinitum, with nothing being built.  Doing that once again is not worth my time and effort. If progress is really desired to update GEDCOM, then it must be done in a manner where all involved are after the same goal and are making progress together, without running off onto interruptive tangents and innumerable trivialities. We're talking about advancing what is now a practical standard that must be understandable and easily implementable by developers who currently have imperfectly implemented something resembling GEDCOM.

Divergent views will need to be quickly adjudicated based on some concrete criteria, such as:
(a) Will the data that needs to transfer correctly transfer?
(b) How can this be done with GEDCOM now?
(c) How can this be done in the way that would be easiest for developers to implement and cause them the fewest changes to their programs?

Every change, no matter how trivial, will cause developers pain. It will require that developers support the new change, or it will result in data not transferring.

With regard to other formats, Tom and I both agreed in the BetterGEDCOM days that the GEDCOM data representation grammar is excellent - simple and powerful. Anything written in it can be mechanically transferred to XML or JSON with a trivial program. So those issues should not be the concern. Libraries of code happen after the standard is created and should not be the concern. Code checking programs and monitoring compliance can happen after the standard is created and should not be the concern.
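
As a sketch of how mechanical that transfer is, the following Python turns GEDCOM lines into nested dictionaries and dumps them as JSON. The JSON shape (`tag`, `xref`, `value`, `children`) and the function name `gedcom_to_tree` are assumptions chosen for the example, and the input is assumed to be well formed.

```python
import json


def gedcom_to_tree(text: str) -> list:
    """Turn GEDCOM lines into nested dicts, using the level numbers for nesting."""
    roots = []
    stack = []  # stack[i] is the most recent node seen at level i
    for line in text.splitlines():
        level_str, _, rest = line.partition(" ")
        level = int(level_str)
        xref = None
        if rest.startswith("@"):
            # A leading @...@ token is the record's cross-reference id.
            xref, _, rest = rest.partition(" ")
        tag, _, value = rest.partition(" ")
        node = {"tag": tag}
        if xref:
            node["xref"] = xref
        if value:
            node["value"] = value
        node["children"] = []
        del stack[level:]  # forget nodes at this level or deeper
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots


sample = "0 @I1@ INDI\n1 NAME John /Doe/\n1 BIRT\n2 DATE 1 JAN 1900\n0 TRLR"
tree = gedcom_to_tree(sample)
print(json.dumps(tree, indent=2))
```

Nothing in the loop limits nesting depth or value length; the only state is the stack of open parents.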

This is not an easy project. And it is an impossible project unless everyone involved stays focused.

Louis 


Thomas Wetmore

Apr 15, 2020, 12:55:47 AM
to root...@googlegroups.com
Louis,

God, it's great to hear your voice again!

I think there is a big difference between approaching the next version of GEDCOM as either:

1. A relatively minor tweak to get the garbage removed, and get the specifications very well defined, with an underlying goal of impacting current developers as little as possible; and

2. Doing anything significant at all.

I think that any of the reasonable changes that "should" be made to GEDCOM would be significant enough that it would disrupt current developers quite a bit if they choose to embrace them. Frankly I would expect most NOT to take changes to GEDCOM seriously. I have never seen a future path that breaks up the balkanization of current developers who interpret GEDCOM differently. No argument about the common good or standardization has had much impact in the past, and I don't see how a minor change to GEDCOM will make any difference. The small developer with their small clientele has little incentive to change their software; what good could it really do for them?

(But I have been away for so many years now, that I don't have any feel for what is really going on in current development, so what I've just written may be total poopoo.)

I have always been an advocate of going for significant change. Not change for change's sake, but change for genealogy's sake.

Our major difference, I believe, is that you see current GEDCOM as pretty close to what it should be, and I've always thought of GEDCOM (as a semantic standard, NOT as a syntactic one, as you rightly pointed out) as pretty far from where it should be.

My opinion is that the changes needed to GEDCOM are significant enough that they would force developers adopting them to rewrite significant chunks of the input features of their software anyway, so if you're going to make a change, you might as well do it all.

But you know how long this debate has raged with very little movement. BetterGEDCOM burned through my passion and energy about the area. I'm too close to the end of my days to let this upset my digestion. My interests are my grandchildren, birding and slide rules (remember them?), with genealogy data standards and software in fourth place.

Hoping you are in great health and enjoying the beauties of your province.

All the best,

Tom Wetmore


Sean Gates

Apr 15, 2020, 10:35:06 AM
to rootsdev
It may be somewhat naive of me to respond with so many experts involved, but I'm hoping my comments will be helpful in some way, albeit very anecdotal.

I worked on an open source project building a design system library for my employer. We had a similar issue with backward compatibility, broken features in the old design system, and wanting a painless migration path to the new system for developers.

What we found was that we had to do both ideas suggested here: tweak the old system to *prepare* for the new system and create migration tools, as well as create the new system.

To be clear, though, we made updates to the old system which were “breaking changes” (in SemVer terms they would be new version numbers), and developers had to upgrade to those in order to prepare for the jump to the new system. But it was way better than them having to fix everything at once in preparation for a jump to the new system.

Sean Gates

Apr 15, 2020, 10:36:35 AM
to rootsdev
Also: UTF-8 is an absolute no-brainer. Should have been done a decade ago.

Stephen Woodbridge

Apr 15, 2020, 11:24:52 AM
to root...@googlegroups.com
I would like to build on a few points that Tom and others have already
stated because I think they are key to this discussion.

* someone mentioned needing a convertor/intermediate format to help with
the migration from old to new. By definition this already exists for all
products today and it is the vendor's own internal format. Simply
keeping their old import/export for GEDCOM 5.x requires little to no
additional work, so they can focus on adding a new import/export for
betterGEDCOM, and this part of the problem is solved.

* "Go big or not at all"! There needs to be a real value proposition for
betterGEDCOM or it will not be adopted. I don't think I would know what
that should be, but as a developer and product manager for both
open source and commercial software products, I understand the issues of
legacy products and pain of migration for both the vendor and the users.
Minor changes have all the downside and little of the upside.

* If betterGEDCOM is successful and widely adopted, convertors from the
old format will materialize as/if needed.

I have been using RootsMagic for about a year and one of the most
valuable things to me is the linking to FamilySearch and other services.
It seems to me that preserving this linking in betterGEDCOM could be a
small addition to the value proposition.

Like Tom, I'm more of a user of genealogy software than a developer of
it these days and grandchildren and family are higher on my priority list.

Best regards,
  -Steve Woodbridge

Louis Kessler

Apr 15, 2020, 11:33:10 AM
to rootsdev
Good to see you here too, Tom. 

I've also lurkingly seen you muckin' around on the LifeLines mailing list, and laughed out loud when I saw you offering them the "only design document" you ever made when writing LifeLines: just a single page.

Louis

--- Real programmers write scripts.


Luther Tychonievich

Apr 15, 2020, 12:16:38 PM
to root...@googlegroups.com
Since I know these threads can get long and daunting for new readers to join, and to ensure I'm following the suggestions, I wanted to give what I think is a summary of the conversation so far:

One big disagreement:

A. Make a clean break from GEDCOM's data model (Wetmore, Gates, Woodbridge)
    A.i. focus on being terse, flexible, and human-readable (Wetmore)
    A.ii. later, not as first step (Gates)
    A.iii. focus on big new value-proposition (Woodbridge)
B. Avoid big changes (Kessler, Quass, Gates)
    B.i. and only change what cannot fit in current spec (Kessler)
    B.ii. at first, preparing for later big change (Gates)

And several suggestions assuming we go with "B" on that disagreement, some with dissenting opinions:

1. Clean up ambiguity (Kessler, Cardinal)
    1.a. and remove obsolete items (Cardinal)
    1.b. provide semantic specifications (Wetmore)
2. Match spec to current practice (Quass)
    2.a. including standardizing common extensions (Cardinal)
3. Encourage correct use of existing parts (Kessler)
    3.a. e.g., by certifying correctness (Quass)
    3.b. with a validator, perhaps visual (Wetmore)
    (see also item 8)
4. Support exporting media with GEDCOM (Quass, Kennard)
5. Increase or remove length limits (Cardinal, Kennard, Wetmore)
6. UTF-8 only (Cardinal, Wetmore, Kennard, Gates)
7. Remove CONC/CONT (Kennard, Wetmore)
8. Provide code (Gardner, Cardinal, Wetmore, Gates)
    8.a. in form of mobile and web libraries (Gardner)
    8.b. in form of permissive-license reference (Cardinal)
    8.c. in form of reference GEDCOM-to-DOM tool (Wetmore)
    8.d. including a migration tool (Gates)
        8.d. dissent: not needed, this will solve itself (Woodbridge)

If I have missed or mischaracterized anything (highly likely in such a short summary), please provide additions and/or corrections.
And, of course, I welcome other thoughts, including more from the respondents so far and any other voices that want to chime in with a new thought or add their support to an item mentioned here.

Thomas Wetmore

Apr 15, 2020, 12:39:56 PM
to root...@googlegroups.com
Luther,

Great summary. I'm looking forward to seeing it play out.

Best,

Tom Wetmore


Dallan Quass

Apr 17, 2020, 6:03:52 AM
to root...@googlegroups.com
Thank you Luther for starting a discussion that has revived this group! :-)


Colin Spencer

Apr 17, 2020, 6:17:15 AM
to rootsdev

Sorry I am late to this recent discussion.

Below are the items I agree with and would like to add my weight to:

1. Clean up ambiguity (Kessler, Cardinal)
    1.a. and remove obsolete items (Cardinal)
    2.a. including standardizing common extensions (Cardinal)
3. Encourage correct use of existing parts (Kessler)
4. Support exporting media with GEDCOM (Quass, Kennard)
5. Increase or remove length limits (Cardinal, Kennard, Wetmore)
6. UTF-8 only (Cardinal, Wetmore, Kennard, Gates)
7. Remove CONC/CONT (Kennard, Wetmore)
8. Provide code (Gardner, Cardinal, Wetmore, Gates)
    8.a. in form of mobile and web libraries (Gardner)
    8.b. in form of permissive-license reference (Cardinal)
    8.c. in form of reference GEDCOM-to-DOM tool (Wetmore)
    8.d. including a migration tool (Gates)
        8.d. dissent: not needed, this will solve itself (Woodbridge)

One final comment: why stick with UTF-8? Why not UTF-16, for better compatibility with non-English languages?

Thomas Wetmore

Apr 17, 2020, 7:20:06 AM
to root...@googlegroups.com
Colin,

All UTF formats encode all Unicode characters, so all have the same compatibility for non-English languages. Choosing which is best (if using the least storage space is your criterion) is a statistics problem, based on the probability distribution of the characters required for a particular application. For applications dominated by the old ASCII character set, UTF-8 is best because it encodes ASCII characters as single bytes. For languages way outside the ASCII world, UTF-16 would likely be better. I suppose there are some applications where UTF-32 might be best. UTF-16 requires at least 2 bytes per character and UTF-32 always requires 4 bytes per character.
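To make the storage comparison concrete, here is a minimal Python sketch that counts the bytes each encoding needs. The sample names and the `encoded_sizes` helper are illustrative assumptions, not taken from any real GEDCOM file or tool.

```python
# Sample names are illustrative assumptions, not data from a real GEDCOM file.
SAMPLES = ["Smith", "Иванов", "山田"]


def encoded_sizes(text):
    """Return the number of bytes needed to store text in each UTF encoding."""
    # The "-le" variants are used so that no byte order mark is added to the count.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


for name in SAMPLES:
    print(name, encoded_sizes(name))
```

For the ASCII name UTF-8 is smallest, for the Cyrillic name UTF-8 and UTF-16 tie, and for the Japanese name UTF-16 is smallest, which is the statistical trade-off described above.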

But note that arguments based on the amount of storage space required no longer have the weight they once did. Because of Moore's law operating over at least five decades now, the cost of storage is essentially zero today.

You really want to choose the one with the best software support, though even this is a bit spurious, since plenty of software exists to convert the characters every which way. Any tools that would be used today to write genealogical software would understand all the UTF formats, right down at the programming language level, and be able to read and write them all transparently. No developer of genealogical software and no user of genealogical software will have to care or even know which encoding is being used.

A disadvantage of UTF-8 and UTF-16 is that their characters are not randomly accessible, because they both use variable-length formats. To get to the nth character in a string you have to read the n-1 characters before it in order to know exactly where and what it is. But applications that require random access will read the string first and put the characters into arrays or some other internal, fixed-size format, which then allows the random access. A good modern programming language will make this invisible to the developer.
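A rough Python sketch of that point: finding the nth character in raw UTF-8 bytes means walking over every character before it, while a decoded string can simply be indexed. The function names and the sample name "Åsa Öberg" are hypothetical, chosen only for illustration.

```python
def sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence occupies, judged by its first byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def nth_code_point_utf8(data, n):
    """Return the n-th (0-based) code point of UTF-8 bytes by scanning from the start."""
    offset = 0
    for _ in range(n):
        # Each earlier character must be inspected to learn where the next one begins.
        offset += sequence_length(data[offset])
    length = sequence_length(data[offset])
    return data[offset:offset + length].decode("utf-8")


raw = "Åsa Öberg".encode("utf-8")
decoded = raw.decode("utf-8")  # Python's str allows direct indexing by code point

print(nth_code_point_utf8(raw, 4))  # found by scanning the bytes
print(decoded[4])                   # found by direct indexing
```

Both lines print the same character; the second form is the "invisible" conversion to a randomly accessible representation.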

Best,

Tom Wetmore


John Cardinal

Apr 17, 2020, 8:00:02 AM
to root...@googlegroups.com
Colin,

You asked, “One final comment: why stick with UTF-8? Why not UTF-16, for better compatibility with non-English languages?”

Tom Wetmore has already replied. I will provide some information copied from the UTF-8 Everywhere web site (https://utf8everywhere.org/):

———————————
  • In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes.
  • UTF-8 is endianness independent. UTF-16 comes in two flavors: UTF-16LE and UTF-16BE (for the two different byte orders, respectively). Here we name them collectively as UTF-16.
  • UTF-8 and UTF-32 yield the same order when sorted lexicographically. UTF-16 does not.
  • UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character) while UTF-16 favors several Asian character sets (2 bytes instead of 3 in UTF-8). This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text. Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.
[...]

In the UNIX world, narrow strings are considered UTF-8 by default almost everywhere.

———————————
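Two of the quoted claims (code points taking up to 4 bytes in both encodings, and the sort-order difference) can be checked with a short Python sketch. The two characters and the `sort_by_encoding` helper are arbitrary choices for illustration, not something from the quoted site.

```python
tilde = "\uff5e"     # FULLWIDTH TILDE, U+FF5E (inside the Basic Multilingual Plane)
face = "\U0001F600"  # GRINNING FACE, U+1F600 (outside it)


def sort_by_encoding(strings, encoding):
    """Sort strings by comparing their encoded bytes lexicographically."""
    return sorted(strings, key=lambda s: s.encode(encoding))


by_code_point = sorted([face, tilde])  # Python compares str values by code point
by_utf8 = sort_by_encoding([face, tilde], "utf-8")
by_utf16 = sort_by_encoding([face, tilde], "utf-16-be")
by_utf32 = sort_by_encoding([face, tilde], "utf-32-be")

print(by_utf8 == by_code_point)   # UTF-8 byte order matches code point order
print(by_utf32 == by_code_point)  # so does UTF-32
print(by_utf16 == by_code_point)  # UTF-16 does not, because of surrogate pairs

# A code point outside the Basic Multilingual Plane takes 4 bytes in both encodings.
print(len(face.encode("utf-8")), len(face.encode("utf-16-be")))
```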

If better is defined as “using fewer bytes”, UTF-8 is better for English, as you mentioned, but it is also better for any language based on Latin characters. As described above, UTF-16 is better for several Asian character sets.

The “endianness” issue with UTF-16 is a minor issue but adds an annoying opportunity to really screw things up.
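As a small illustration of how that can go wrong, the following Python sketch encodes one letter in both UTF-16 byte orders and then decodes it with the wrong one. The variable names are illustrative only.

```python
import codecs

text = "A"
little = text.encode("utf-16-le")  # bytes 0x41 0x00
big = text.encode("utf-16-be")     # bytes 0x00 0x41

# Reading little-endian bytes as big-endian silently yields a different character.
misread = little.decode("utf-16-be")
print(repr(misread))  # U+4100 instead of "A"

# A byte order mark at the start lets the generic "utf-16" decoder pick the right order.
print((codecs.BOM_UTF16_LE + little).decode("utf-16"))
print((codecs.BOM_UTF16_BE + big).decode("utf-16"))

# UTF-8 has only one byte order, so there is nothing to get wrong here.
print(text.encode("utf-8"))
```

No error is raised in the misread case, which is what makes the mistake easy to miss.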

In Linux, text files are assumed to be UTF-8. That is also becoming the norm in Windows.

The web has adopted UTF-8.

The only time one can assume that character number n is stored in the nth element of an array is when using UTF-32, so UTF-8 and UTF-16 are equal in that regard. Windows is hamstrung somewhat by the unfortunate timing of MS adopting “widechars” before Unicode grew beyond 16-bit code points. There are some indications that MS is moving towards UTF-8 for in-memory API use, but we shall see.

Overall, technology is moving to UTF-8.

John

Louis Kessler

Apr 17, 2020, 10:23:04 AM
to rootsdev
Obviously, lots of different thoughts and opinions. This unstructured thread is not the best place to begin discussing the specifics as, other than a summary like Luther's, all the details will get lost.

Louis 
