A discussion has arisen about the need for ids and for guids. This is an area of the standard that is a little rough around the edges, so please bear with me as I try to expand my understanding of it, which has grown in recent discussions.
First a few definitions:
An id is a file-unique string that is used when the item (entry or sense) is referenced from somewhere else in the file. Its only constraint is that it is unique within the file; theoretically, it can change between instantiations of the file.
A guid is an optional string that is guaranteed not to change once it has been set (despite only having a SHOULD on that requirement in 0.14).
LIFT in its original inception was never really designed to be easily mergeable in a version control sense. This need has grown and I am hearing that people want the mergeability requirements to be made part of the core standard. Is there anyone who does not want to see that happen?
For mergeability, the stability requirement of a guid is essential. But why do we need both a guid and an id? How about we just raise the stability requirement of an id? Then we are done, aren't we?
One problem that has been raised is that ids are designed to be somewhat human readable (they often consist of the lexical-unit followed by a guid: ugly, but effective). And when you create a new entry (say), you want to be able to uniquely identify it, so you give it a guid, but you don't want to give it an id until the lexical-unit is set.
There are a number of aspects to this:
1. If the object is empty, then it is the same as any other empty object. So deleting it and recreating it makes no difference to the file.
2. If you want to cross refer to an empty element (or at least one without a lexical-unit) then perhaps we just take the confusion and have a more opaque id.
So I would be happy to increase the stability requirement on id and deprecate guid. Or we can leave things as they are. There is nothing to say that an application can't add guids on import. And if we tighten the SHOULD to MUST in the guid spec, we fulfill the merging requirement. By making no change at this time, but warning of future deprecation, we let applications start to make ids stable and, for the moment, hold both, while switching from identifying by guid to identifying by id.
Thoughts?
Yours,
Martin
> So, I will counter propose that we make GUID mandatory, and do away with id
> altogether. Not as an obviously better approach, but as the simplest of
> variously ok approaches. Maybe a compromise would be to call it "id", but
> say it must contain a guid? Less clear that way...
In effect, my proposal (as summarised by Cambell) would have the same effect. The only difference is that the standard doesn't say that it has to be a *G*UID. But I would be most surprised if people didn't use guids for them. Once they are set, it doesn't really matter what they are because they aren't going to change.
> Before I forget, we need to move towards unique ids on all complex
> merge-able things that aren't otherwise machine-distinguishable. Without
> this, for example, Example Sentences suffer during 3-way merging where both
> parties edited something. This team-collaboration of LIFT is now more than
> just a future thing, we've got one such dictionary project underway.
Can you give me a list of objects that you consider need such ids?
GB,
Martin
Right, but now imagine the job of the guy maintaining the FLEx importer
code. His model (which predates FLEx by many years) has real guids. So now
how does he map the incoming LIFT file to FLEx's internal model? The very
first time, no big deal, he just generates new objects and guids. But the
next day, he has to import the latest version of that lift file again. Now
what does he do? He can't just go through and rewrite all the ids that may
or may not be guids. Under the "any unique id" plan, I'll tell you what I
would do in his shoes. I'd introduce an attribute called "guid", write that
out to the lift, get other apps to round-trip it, and now we're back to
exactly the same situation/problem as we have today, with both an id and a
guid. Alternatively, I'd let my life get more complicated by keeping a
lookup table somewhere... but I'd be grumbling... "why do they put me
through this pain"?
John Hatton
SIL PNG, Palaso, & SIL International Software Development
Google Talk chat: hattonjohn
> Right, but now imagine the job of the guy maintaining the FLEx importer
> code. His model (which predates FLEx by many years) has real guids. So now
> how does he map the incoming LIFT file to FLEx's internal model? The very
> first time, no big deal, he just generates new objects and guids. But the
> next day, he has to import the latest version of that lift file again. Now
> what does he do? He can't just go through and rewrite all the ids that may
> or may not be guids. Under the "any unique id" plan, I'll tell you what I
> would do in his shoes. I'd introduce an attribute called "guid", write that
> out to the lift, get other apps to round-trip it, and now we're back to
> exactly the same situation/problem as we have today, with both an id and a
> guid. Alternatively, I'd let my life get more complicated by keeping a
> lookup table somewhere... but I'd be grumbling... "why do they put me
> through this pain"?
Given that no change in this area will happen without a version bump, an importer can tell from the version what id and guid mean. Are you also telling me that FLEx will fall over if the unique id is not an MS-formatted 128-bit GUID?
I think you are also not hearing that the newly proposed id has to be unique and can't change, and that one of the easiest ways of achieving that is to use some kind of guid. So in effect the id becomes a guid. I'm just not saying it has to be a strict 128-bit number as formatted and generated by MS libraries.
Of course, if we are all happy with the status quo, then we can leave it as it is. But I have a feeling that people will want to change this at the next version bump (when they have to deal with other changes anyway).
GB,
Martin
Yup. Guid is a type in .NET. Guids aren't just strings. So if I say
entry.Guid = Guid.Parse("somereallyreallyuniquestring");
then it's certainly not gonna be happy.
Martin Hosken wrote:
Dear John,
As I have previously explained on this list, saying that a process must be able to round-trip arbitrary id strings means that FieldWorks would need to be rewritten to use string data structures, just to support this standard, or would need to maintain some form of mapping. As before, I find it unreasonable to request this of them.
So what you are saying is that instead of the id being an immutable string, you want it to be an immutable 128-bit number that can be rendered in any form that other people recognise as a UUID (e.g. with or without hyphens). I would like to hear from the FLEx team on this. If we switch to saying this is a 128-bit number which can have variable form, and then say FLEx switches to using SQL server which doesn't have an atomic GUID type, then they are going to want to use a string as a key rather than a number.
I think that would be MySql Server. At any rate, in code many of us currently use the .Net GUID class to model the guid attribute. This is compatible with the python UUID class.
So which is it to be? Do you want to key off a number or a string?
GB,
Martin
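To make the "number or string" question concrete, here is a minimal Python sketch using the standard uuid module mentioned above. The guid value is an arbitrary example and is_guid is a hypothetical helper, not part of LIFT or of any application discussed here. It only illustrates that several textual forms map to one 128-bit value, while an arbitrary unique string does not parse at all.

```python
import uuid

# One 128-bit value written with hyphens, without hyphens, and in uppercase.
# The value itself is an arbitrary example.
forms = [
    "c1ed46b7-e382-11de-8a39-0800200c9a66",
    "c1ed46b7e38211de8a390800200c9a66",
    "C1ED46B7-E382-11DE-8A39-0800200C9A66",
]
parsed = {uuid.UUID(form) for form in forms}


def is_guid(text: str) -> bool:
    """Return True if text can be read as a 128-bit UUID (hypothetical helper)."""
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True
```

Keyed as numbers, the three forms collapse into a single entry in parsed; keyed as strings, they would be three different ids. An importer that models ids with a guid type would have to reject, or map, any id for which is_guid returns False.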
The guid attributes should disappear, and the id attribute values should
be based purely on the object's guid value. Later, I'll provide a pair
of C# methods that would allow those id values to shrink to either 22 or
23 characters while fully representing the 128-bit guid and also being
totally valid XML ID values. (The normal hexadecimal representation of
a guid with embedded hyphens takes 36 characters, plus possibly a 37th
character if the first hexadecimal character is not a letter.)
Recognizing the desire behind Martin's proposal (and the current
behavior in practice although not specified), I further propose that
relation elements (those that provide actual links between entry and
sense elements) have an additional (optional) attribute to supplement
the required ref attribute. This attribute, which could be named "name"
or "target" or possibly have a target-specific name like "form" or
"gloss", would provide the most relevant human readable value from the
target object. For entries, this would be the lexical-unit form in the
(primary) vernacular language. For senses, this would be the gloss in
the (primary) analysis language (or possibly the definition if the gloss
is empty). This additional attribute would not be used by the program
since the ref attribute provides the formal link, but would allow a
human scanning the file to have at least some understanding of what the
target is, which is all that the alternative type of id could provide
anyway. The advantage of this approach is twofold: programs don't need
to remember anything for the id values but the guids that they are
already using, and the human readable values displayed in the relation
elements can change when the actual values do change. Admittedly, those
changes aren't necessarily frequent, but even some lexeme forms change
over time as different decisions are made for spelling choices, or what
are primary forms as opposed to allomorphs.
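As a rough illustration of the supplementary attribute, the following Python sketch builds a relation element with xml.etree.ElementTree. Everything specific in it is an assumption made for the example: the attribute name "target" is just one of the candidate names above, and the guid values, the relation type and the form "kucing" are invented sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: a sense linking to an entry whose
# lexical-unit form is "kucing". Guids and names are invented.
sense = ET.Element("sense", {"id": "e5f0c2e0-9d3b-4a57-8c1e-2b7d6a4f9e10"})
relation = ET.SubElement(sense, "relation", {
    "type": "synonym",
    "ref": "c1ed46b7-e382-11de-8a39-0800200c9a66",  # formal link, used by programs
    "target": "kucing",  # informative only, for a human reading the file
})

xml_text = ET.tostring(sense, encoding="unicode")


def resolve(relation_element):
    """Follow a relation using only ref; the readable attribute is ignored."""
    return relation_element.get("ref")
```

Because resolve never looks at the readable attribute, a program can rewrite that value whenever the target's lexical-unit form changes without affecting any link.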
As promised earlier, here are a pair of C# methods that convert guids
into the shortest (UTF-8) strings that both represent all the bits in
the guid and form valid XML ID values. If the latter requirement is
dropped, then the methods could be simplified in obvious ways.
// Both methods need "using System;" and must live inside a class.
public static string ToLiftId(Guid guid)
{
    string id = Convert.ToBase64String(guid.ToByteArray());
    // -, ., and _ are valid ID chars, but only _ of those can start an ID
    id = id.Replace('+', '-');
    id = id.Replace('/', '.');
    id = id.Replace("=", ""); // eliminate useless padding for now.
    if (Char.IsDigit(id[0]) || id[0] == '-' || id[0] == '.')
        id = '_' + id; // make it a valid XML ID value
    return id;
}

public static Guid FromLiftId(string id)
{
    if (id.StartsWith("_"))
        id = id.Substring(1); // strip leading nonsense character
    id = id.Replace('-', '+'); // convert back to native Base64 value
    id = id.Replace('.', '/');
    byte[] data = Convert.FromBase64String(id + "=="); // length must be multiple of 4
    return new Guid(data);
}
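For applications that are not written in .NET, the same conversion can be sketched in Python. This is an illustrative port and not part of any proposal: the function names to_lift_id and from_lift_id are made up here, and the sketch assumes the ids should be byte-compatible with the C# methods, which is why it uses bytes_le (the byte order that .NET's Guid.ToByteArray() produces) rather than the big-endian bytes attribute.

```python
import base64
import uuid


def to_lift_id(guid: uuid.UUID) -> str:
    """Encode a UUID as a 22- or 23-character string usable as an XML ID."""
    # bytes_le matches the byte order produced by .NET's Guid.ToByteArray().
    text = base64.b64encode(guid.bytes_le).decode("ascii")
    # '+' and '/' are not allowed in XML IDs; '-' and '.' are.
    text = text.replace("+", "-").replace("/", ".").rstrip("=")
    # An XML ID may not start with a digit, '-' or '.', so prefix '_'.
    if text[0].isdigit() or text[0] in "-.":
        text = "_" + text
    return text


def from_lift_id(lift_id: str) -> uuid.UUID:
    """Decode a string produced by to_lift_id back into a UUID."""
    if lift_id.startswith("_"):
        lift_id = lift_id[1:]  # '_' never occurs in the encoded data itself
    text = lift_id.replace("-", "+").replace(".", "/") + "=="
    return uuid.UUID(bytes_le=base64.b64decode(text))
```

A round trip such as from_lift_id(to_lift_id(uuid.uuid4())) should give back the original value. The leading underscore is unambiguous because the modified Base64 alphabet used here never contains '_'.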
John Hatton
SIL Papua New Guinea, Palaso, & SIL International Software Development
> The guid attributes should disappear, and the id attribute values should
> be based purely on the object's guid value. Later, I'll provide a pair
> of C# methods that would allow those id values to shrink to either 22 or
> 23 characters while fully representing the 128-bit guid and also being
> totally valid XML ID values. (The normal hexadecimal representation of
> a guid with embedded hyphens takes 36 characters, plus possibly a 37th
> character if the first hexadecimal character is not a letter.)
What you are saying, I think (and this matches our face-to-face discussions), is that you want an id to be some kind of representation of a 128-bit number. Thus arbitrary unique strings are not appropriate.
>
> Recognizing the desire behind Martin's proposal (and the current
> behavior in practice although not specified), I further propose that
> relation elements (those that provide actual links between entry and
> sense elements) have an additional (optional) attribute to supplement
> the required ref attribute. This attribute, which could be named "name"
> or "target" or possibly have a target-specific name like "form" or
> "gloss", would provide the most relevant human readable value from the
> target object.
Or is a comment sufficient for this? If we want an informative attribute here, I'm happy with that too. What do people think?
GB,
Martin
If we do agree on guid based ids, and on using the normal guid
hexadecimal-with-hyphens representation, could we agree on whether the
alphabetic characters are uppercase or lowercase? It makes no
difference for interpreting the string, but keeping the case consistent
facilitates searching when viewing the LIFT file in a text editor.
--
Steve McConnel
A comment would be sufficient. Doing it that way would facilitate
giving more specific information, so it might be even better.
--
Steve McConnel
Ok, for raw guids, my brief research suggests lowercase.
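If lowercase is agreed, an application could normalise guids on export along these lines. This is a small Python sketch with a made-up helper name, canonical_guid, and an arbitrary example value; it relies only on the standard uuid module, whose string form is the lowercase hexadecimal-with-hyphens representation.

```python
import uuid


def canonical_guid(text: str) -> str:
    """Return the lowercase hexadecimal-with-hyphens form of a guid string."""
    # uuid.UUID accepts upper or lower case, with or without hyphens or braces.
    return str(uuid.UUID(text))
```

Writing every guid through one such function keeps the case consistent, so a plain text search in the LIFT file finds all references to the same object.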