Feature Request - INDI, SOUR... Numbering

David

Jun 27, 2024, 11:20:13 AM
to GEDitCOM II Discussions
I looked through my ged file and found long blank intervals in the numbering for the various ged tags, i.e. SOURs were numbered into the 1000s but there were only 100s of entries. It looks like new entries get the next available number. Is it possible that they could be numbered using the 1st available number so that the numbers of deleted records could be reused?

John Nairn

Jun 27, 2024, 8:32:21 PM
to geditcom-ii-discussions@googlegroups.com
Which numbers do you mean? Do you mean the GEDCOM record IDs (such as @S45@)? Their size is of no consequence. The official rule is a maximum of 22 characters, and the characters can be almost anything. I just used numbers for convenience. I will check on what the code does. The only real need is that each one must be unique in the file.

John

David

Jun 28, 2024, 5:14:08 AM
to GEDitCOM II Discussions
I looked at the ged file and the max SOUR id was @SR1206@, a lot more than the number of sources in use. I went back to @SR1@ and the interval @SR1@ to @SR211@ was unused (NOT 0 citations; these ids are not present in the ged file at all). After that there are small islands of unused ids. If I generate a new Source, the new id is @SR917@.

Current Status
Whenever I need a new SOUR, it is generated as @SR917@, and I then edit it to the 1st available id. A new SOUR @SR917@ with 0 citations would be updated to @SR1@ or @SR2@ and so on. I am currently at @SR122@ in the @SR1@ to @SR211@ group without an issue.

My maximum id is now @SR985@ instead of @SR1206@ as I slowly update and delete. Why a new SOUR should be @SR917@ and not @SR985@ + 1, i.e. @SR986@, I cannot explain. Could it be that there are 916 SR entries in use and @SR917@ is available? Also, after the new source is added, cited, and given a different id, @SR917@ is always available again for the next new source, so the numbering does not extend but instead fills in available blank entries?

There are no problems with the existing allocation of ids even though about one third of my SOURs are not present. It is personal preference to use the hundreds of available SOUR ids before generating a new SOUR.

SR or S - any difference?

Richard Blake

Jun 28, 2024, 5:22:21 AM
to geditcom-ii...@googlegroups.com
I think the challenge would be that there is no standard for the ID of any record other than the length. Therefore every app uses a different notation to label sources. So if you have imported a GEDCOM into GEDitCOM II you may have different naming conventions.

For example, in my file the first two sources in my Sources list are:

0 @75196978@ SOUR

0 @SR87@ SOUR


The first was generated before I moved to GEDitCOM and the second in GEDitCOM itself.


So deciding how to re-use deleted source IDs is not a trivial task IMO.


Just my 2p worth.


Regards, Richard



David

Jun 28, 2024, 6:53:01 AM
to GEDitCOM II Discussions
No imports. Any source deletions were mostly made from Individuals and some from the Sources Album. I use BBEdit to track down SOURs or their absence in the ged file (an AppleScript with a top-down grep search) and then the GEDitCOM II Source Editor. These comments will always relate to the Default.gfrmt.

I would update 0 @75196978@ SOUR to the GEDitCOM notation, and at that point my question about allocation of the SR id would be relevant. The existing allocation will always give a unique SR id, but why have hundreds of unused SR ids?
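The search for present and absent SOUR ids described above could also be done with a short script. The following is only a sketch in Python, not the AppleScript/BBEdit workflow mentioned in the post; the function name `unused_source_numbers` is hypothetical, and it assumes source records appear as level-0 lines of the form `0 @SR<number>@ SOUR`, so ids in other notations (such as @75196978@) are ignored.

```python
import re

# Matches a level-0 source record line such as "0 @SR87@ SOUR".
# Assumption: only ids in the "SR<number>" notation are of interest.
SOUR_RECORD = re.compile(r"^0 @SR(\d+)@ SOUR\b")


def unused_source_numbers(gedcom_lines):
    """Return the SR numbers between 1 and the largest one in use
    that have no source record in the file."""
    used = set()
    for line in gedcom_lines:
        match = SOUR_RECORD.match(line.strip())
        if match:
            used.add(int(match.group(1)))
    if not used:
        return []
    return [n for n in range(1, max(used) + 1) if n not in used]


if __name__ == "__main__":
    sample = [
        "0 HEAD",
        "0 @SR2@ SOUR",
        "1 TITL Parish register",
        "0 @SR5@ SOUR",
        "0 @75196978@ SOUR",
        "0 @I1@ INDI",
        "1 SOUR @SR2@",
    ]
    print(unused_source_numbers(sample))
```

Citation lines such as `1 SOUR @SR2@` are not counted because they do not start at level 0.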

John Nairn

Jun 28, 2024, 1:03:24 PM
to geditcom-ii-discussions@googlegroups.com
As Richard pointed out, there is no standard for IDs, and whether numbers are used or unused has no effect on performance. When I started GEDitCOM (an app before GEDitCOM II) I noticed that most software used numbers and sometimes some indication of record type. To minimize the number of characters (I like to minimize file size when I can), I decided to create new records with the ID

     @T#@

where T is a one- or two-character indicator for the type of record and # is a number. I picked “I” for individuals, “F” for families, etc. I picked “SR” for sources to distinguish them from other record types beginning with “S” (submitters use “SM” and the submission record uses “SUBN”, four characters, but only one of these is allowed).

To get the number, the code starts with the current number of sources (say 100) and then increments as much as needed until it finds a unique ID. If all records are created in GEDitCOM II, this process should quickly find unique numbers and all will be used. If you delete or import sources, however, numbers will increase with the number of sources, but numbers of deleted records will usually not be found.
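The numbering rule described above can be sketched as follows. This is an illustration in Python, not GEDitCOM II's actual code; the function name `next_record_id` and its parameters are hypothetical.

```python
def next_record_id(existing_ids, prefix, record_count):
    """Return an unused ID of the form @<prefix><number>@.

    existing_ids: set of all IDs already in the file (treated as strings)
    prefix:       type indicator such as "SR", "I" or "F"
    record_count: current number of records of this type
    """
    # Start at the current record count and step upward until the
    # candidate is not already taken. Lower, deleted numbers are
    # never examined, which is why gaps are not refilled.
    number = record_count
    candidate = f"@{prefix}{number}@"
    while candidate in existing_ids:
        number += 1
        candidate = f"@{prefix}{number}@"
    return candidate


if __name__ == "__main__":
    # 40 sources: @SR1@..@SR36@ plus @SR40@..@SR43@ (a few were deleted).
    ids = {f"@SR{i}@" for i in range(1, 37)} | {f"@SR{i}@" for i in range(40, 44)}
    print(next_record_id(ids, "SR", len(ids)))
```

Under this rule, a file with many deleted sources keeps allocating near the current record count, which is consistent with a new source repeatedly receiving the same id once the previous one has been renamed.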

GEDitCOM II does not make any effort to fill in unused numbers because there is no benefit to performance. All IDs are treated as strings (in fact they could be all letters if one wanted), so whether or not numbers are sequential does not matter.

I don’t see how code could be getting very large numbers (that would be a bug). Here is a test to check for problems:

1. Open a file and note the number of sources. I tried in one of my files and there are 40 sources.
2. Create a new source record and look at the ID in the raw GEDCOM data
3. It will be @SR#@ where # is greater than or equal to 40. In my case I got @SR44@ because I already had @SR40@ to @SR43@ (I must have deleted a few sources).

Having the code start at 1 to catch unused IDs would be inefficient because once you get a lot of records (especially individuals), it would take time to recheck all numbers.
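For comparison, the "first available" alternative that starts at 1 might look like the sketch below. It is written in Python for illustration only, with a hypothetical function name, and is not something GEDitCOM II does.

```python
def first_available_id(existing_ids, prefix):
    """Return the lowest-numbered unused ID of the form @<prefix><number>@."""
    # Every call rechecks numbers from 1 upward, so in a file with no
    # gaps it tests every existing number before finding a free one.
    number = 1
    while f"@{prefix}{number}@" in existing_ids:
        number += 1
    return f"@{prefix}{number}@"


if __name__ == "__main__":
    print(first_available_id({"@SR1@", "@SR2@", "@SR4@"}, "SR"))
```

The number of checks per new record grows with the length of the unbroken run of ids starting at 1, which is the rechecking cost mentioned above.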

It is usually not a good idea to change IDs in the GEDCOM data (GEDitCOM II warns you if you try). I think it would be possible to write an extension to renumber IDs (it would be tricky to fix all links that refer to any record with a new ID), but it would not help anything.
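To illustrate why the links matter when renumbering, here is a rough sketch that works on raw GEDCOM text lines rather than through a GEDitCOM II extension. It is written in Python, the function name `renumber_ids` and the `mapping` argument are hypothetical, and it assumes no free text in the file contains something that looks like one of the old IDs.

```python
import re

# A record ID or a pointer to one: "@", then characters other than "@"
# or whitespace, then "@".
XREF = re.compile(r"@[^@\s]+@")


def renumber_ids(gedcom_lines, mapping):
    """Replace old IDs with new ones everywhere they appear.

    mapping: dict such as {"@SR917@": "@SR1@"}. Both the record's own
    level-0 line and every line that points to it are rewritten, so
    the links stay consistent.
    """
    def swap(match):
        # IDs that are not in the mapping are left unchanged.
        return mapping.get(match.group(0), match.group(0))

    return [XREF.sub(swap, line) for line in gedcom_lines]


if __name__ == "__main__":
    lines = [
        "0 @SR917@ SOUR",
        "1 TITL Census",
        "0 @I1@ INDI",
        "1 SOUR @SR917@",
    ]
    for line in renumber_ids(lines, {"@SR917@": "@SR1@"}):
        print(line)
```

A real tool would also need to confirm that each new ID is not already in use before applying the mapping.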

John Nairn

Jim Eggert

Jun 28, 2024, 1:43:54 PM
to GEDitCOM II Discussions
I actually change IDs on a regular basis in my GCII data, but only for one record type. I make use of the SOUR.ABBR field, and my application requires that these source abbreviations be unique. Since that is not a GEDCOM or a GCII requirement, I had to come up with a way to ensure uniqueness. The approach I use is when I enter a new source, I enter its ABBR field and then change the source ID to @SR-ABBR@ (whatever abbreviation I have chosen, not the letters ABBR!). GCII warns me that I am changing the ID, but I expect that. If I choose an abbreviation or source ID that is already in use, GCII doesn’t allow it, guaranteeing the sought uniqueness.

The only other minor gotcha is that I have to avoid special characters in source IDs. So if my desired abbreviation includes a special character, I have to replace that character with a regular ASCII character or sequence in the source ID. Not a problem.

=Jim

David

Jun 29, 2024, 4:57:39 AM
to GEDitCOM II Discussions
Re.
"To get the number, the code starts with the current number of sources (say 100) and then increments as much as needed until it finds a unique ID. If all records are created in GEDitCOM II, this process should quickly find unique numbers and all will be used. If you delete or import sources, however, numbers will increase with the number of sources, but numbers of deleted records will usually not be found."

The above answers the question and I can see why there are gaps. My ged file started with the original GEDitCOM, so there has been lots of time for deletions and changes in formats. Speed is not an issue. I can remember a similar discussion pre 2012!
