Important: Candidate for Approved Specification, Last call for comments


Roberto Di Cosmo

Jun 13, 2023, 5:55:07 AM
to swhid-...@googlegroups.com
Dear all,
   this is the last call for comments on the current status of the specification that incorporates the changes to fix the issues that have been identified during this review period (and many thanks to those of you who made contributions!).

This is now the official Candidate for Approved Specification: if no blockers are identified, it will become our Approved Specification, version 1.0, this Friday June 16th.

--
Roberto (on behalf of the SWHID specification core team)


On Tue, 30 May 2023 at 12:09, Roberto Di Cosmo <rob...@dicosmo.org> wrote:
Dear all,
   all the blocker issues have been resolved, thanks to all of you who contributed to making this possible!

As a consequence, the current status of the specification is now officially a Candidate for Approved Specification.

According to the governance document, we now call on all of you to review the current document. The review will last for a two-week period, and if no major issues arise, we will then be able to announce our first “Approved Specification”.

Here is the planned timeline:
  • May 30th 2023 (today): start of the review period, triggered by this email
  • Tuesday June 13th: end of the review period, last call for comments
  • Friday June 16th: if no blockers, formal announcement of the Approved Specification 1.0
Looking forward to your help!

--
Roberto (on behalf of the SWHID specification core team)

Miguel Colom

Jun 14, 2023, 8:55:38 AM
to swhid-...@googlegroups.com
Hi Roberto and all,

Great work!

I have a small question, from https://www.swhid.org/specification/5_Core_identifiers/

- We write that "author (arbitrary byte sequence, mandatory)" and I understand from "arbitrary" that we can add spaces.
Then below we find:
  • the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space
  • an ASCII space
My question is: won't we have a problem if in the name's arbitrary byte sequence the user adds a space? Shouldn't we escape the spaces?

Probably these questions are already answered somewhere (otherwise, SH wouldn't work!), but these are the only doubts I have.

Cheers,
Miguel


--
You received this message because you are subscribed to the Google Groups "SWHID (Software Heritage Identifiers) discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to swhid-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/swhid-discuss/CAJBwKuU6UHHMpGD5H6eKrNHCv6OX36SxLq9WYDfBkYNomNGE2Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

robbie....@posteo.de

Jun 14, 2023, 10:43:40 AM
to swhid-...@googlegroups.com
Just briefly and without reading the draft spec, does the current
wording imply that email addresses can contain ASCII spaces?
Robbie

On 14.06.2023 14:55, Miguel Colom wrote:
> Hi Roberto and all,
>
> Great work!
> I have a small question, from
> https://www.swhid.org/specification/5_Core_identifiers/
>
> - We write that "author (arbitrary byte sequence, mandatory)" and I
> understand from "arbitrary" that we can add spaces.
> Then below we find:
>
> * the string of bytes provided for the author name and email, with
> each LF replaced by LF followed by an ASCII space
> * an ASCII space
>
> My question is: won't we have a problem if in the name's arbitrary
> byte sequence the user adds a space? Shouldn't we escape the spaces?
>
> Probably these questions are already answered somewhere (otherwise, SH
> wouldn't work!), but these are the only doubts I have.
>
> Cheers,
> Miguel

Roberto Di Cosmo

Jun 17, 2023, 6:45:18 AM
to robbie....@posteo.de, swhid-...@googlegroups.com
Hi Miguel and Robbie,
thank you for these questions.

There actually seem to be two points to clarify here:

1) in the specification, the "author" field is intended to hold the information
related to the author that comes from a version control system revision/release;
the specification makes no assumptions about what this information is, even if,
in general, it is the name and the email.

2) the only provision about the content is that the field termination
character, the ASCII LF, must be escaped if present in the byte string
(otherwise, the end of the field may not be uniquely determined), and
the escaping mechanism is "replace any LF in the field by an LF followed by
an ASCII space". If the user adds spaces in the byte string (e.g. puts a space
after an LF), that's totally fine, as that cannot convert this LF occurrence
into a field terminator. The LF will nonetheless be escaped (so, LF followed by
two spaces ;-)): when re-reading the field, one space after the LF
will be suppressed, and we'll get back the original text.
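
A minimal Python sketch of the escaping rule described in point 2. The function names `escape_lf` and `unescape_lf` and the sample address are illustrative assumptions, not taken from the specification or from any existing implementation:

```python
def escape_lf(field: bytes) -> bytes:
    """Replace each LF with LF followed by one ASCII space."""
    return field.replace(b"\n", b"\n ")


def unescape_lf(escaped: bytes) -> bytes:
    """Drop the single space that escaping added after each LF."""
    return escaped.replace(b"\n ", b"\n")


# Hypothetical input: the user's own bytes already have a space after the LF.
original = b"Jane Doe\n <jane@example.org>"
escaped = escape_lf(original)

# Escaping adds one more space, so the LF is now followed by two spaces.
assert escaped == b"Jane Doe\n  <jane@example.org>"
# Re-reading removes exactly one of them and restores the original bytes.
assert unescape_lf(escaped) == original
```

Because every LF in the escaped form is followed by the space that escaping inserted, a bare LF (one not followed by a space) can only be a field terminator.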

I hope this helps

Have a great week-end

--
Roberto

--
Roberto Di Cosmo

------------------------------------------------------------------
Computer Science Professor
(on leave at INRIA from IRIF/Université Paris Cité)

Director
Software Heritage https://www.softwareheritage.org
INRIA http://y2u.be/Ez4xKTKJO2o
Bureau C328 E-mail : rob...@dicosmo.org
2, Rue Simone Iff Web page : http://www.dicosmo.org
CS 42112 Twitter : http://twitter.com/rdicosmo
75589 Paris Cedex 12 Tel : +33 1 80 49 44 42
------------------------------------------------------------------
GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3

Miguel Colom

Jun 19, 2023, 5:13:28 AM
to Roberto Di Cosmo, robbie....@posteo.de, swhid-...@googlegroups.com
Hi Roberto,

On Sat, 17 Jun 2023 at 12:45, Roberto Di Cosmo (<rob...@dicosmo.org>) wrote:
> 2) the only provision about the content is that the field termination
>    character, the ASCII LF, must be escaped if present in the byte string
>    (otherwise, the end of the field may not be uniquely determined), and
>    the escaping mechanism is "replace any LF in the field by an LF followed by
>    an ASCII space". If the user adds spaces in the byte string (e.g. puts a space
>    after an LF), that's totally fine, as that cannot convert this LF occurrence
>    into a field terminator. The LF will nonetheless be escaped (so,
>    two spaces ;-)): when re-reading the field, one space after the LF
>    will be suppressed, and we'll get back the original text.

Thanks for your answer.
Perhaps I'm mistaken, but let's see: yes, the LFs are escaped, and there's no way that after unescaping the string one could find a misleading LF terminator.

However, in the document it seems that the terminator is not an LF, but a SPACE:
    * the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space
    * an ASCII space

For example, if I want to add "Miguel Colom <em...@example.com>", I'd have to do:

author
<SPACE>
Miguel<SPACE>Colom<SPACE><em...@example.com><LF> --> Miguel<SPACE>Colom<SPACE><em...@example.com><LF><SPACE> (LF escaped).
<SPACE>
 
If I assume that the terminator is <SPACE>, I'd wrongly stop after reading "Miguel" when decoding.

That's why I wonder whether the terminator <SPACE> in the specification should be an <LF> instead, given that this is, after all, what we're escaping.
If the name + email is assumed to be LF-terminated, then there's no ambiguity, but the specification states that it's an arbitrary sequence of bytes, which reinforces the idea that the terminator is the <SPACE> we're not escaping.

Best,
Miguel


 

Roberto Di Cosmo

Jun 19, 2023, 10:38:56 AM
to Miguel Colom, robbie....@posteo.de, swhid-...@googlegroups.com
Hi Miguel,
    thanks a lot for your message: there was a formatting mistake in the core identifier specification, and it's possible that's what created the confusion. A fixed version with proper formatting is now online.

For the author, the right fragment is now this one, which should be more readable:

  • the author line:
    • the ASCII string "author" (6 bytes)
    • an ASCII space
    • the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space
    • an ASCII space
    • the ASCII-encoded decimal representation of the author timestamp
    • an ASCII space
    • the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space
    • a LF
Here we can see that the author line is terminated by an LF (not followed by a space; it will be followed by the "committer" line), so there should be no ambiguity as to where the line ends. As for parsing this header line, one should go from right to left, starting from the LF terminator, and collecting the timezone, then the timestamp, then getting the author name and email by stripping "author" SPACE from what is left.
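
The layout and the right-to-left parsing described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `format_author_line` and `parse_author_line` are hypothetical, the sample author is made up, and the parser assumes that the timestamp and the timezone offset contain no spaces or LFs.

```python
def escape_lf(field: bytes) -> bytes:
    """Replace each LF with LF followed by one ASCII space."""
    return field.replace(b"\n", b"\n ")


def unescape_lf(field: bytes) -> bytes:
    """Drop the single space that escaping added after each LF."""
    return field.replace(b"\n ", b"\n")


def format_author_line(name_email: bytes, timestamp: int, tz: bytes) -> bytes:
    """Assemble the author line from its parts, in the order listed above."""
    return (
        b"author "
        + escape_lf(name_email)
        + b" "
        + str(timestamp).encode("ascii")
        + b" "
        + escape_lf(tz)
        + b"\n"
    )


def parse_author_line(line: bytes):
    """Parse from right to left: timezone, then timestamp, then name and email."""
    if not line.startswith(b"author ") or not line.endswith(b"\n"):
        raise ValueError("not an author line")
    body = line[len(b"author "):-1]  # strip "author" SPACE and the final LF
    # Assumption: timestamp and timezone hold no spaces, so the last two
    # spaces delimit them and everything before is the name and email.
    name_email, timestamp, tz = body.rsplit(b" ", 2)
    return unescape_lf(name_email), int(timestamp.decode("ascii")), tz


line = format_author_line(b"Jane Doe <jane@example.org>", 1687166008, b"+0200")
assert line == b"author Jane Doe <jane@example.org> 1687166008 +0200\n"
assert parse_author_line(line) == (b"Jane Doe <jane@example.org>", 1687166008, b"+0200")
```

Splitting from the right is what lets the name and email contain unescaped spaces: only the last two spaces of the line are treated as separators.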

It is indeed not super elegant, but that's what is done in git (we're keeping compatibility with it for the moment) ;-)

Do you think this is clearer?

And thanks again for your careful reading

--
Roberto and Nicolas



Miguel Colom

Jun 19, 2023, 11:19:58 AM
to Roberto Di Cosmo, robbie....@posteo.de, swhid-...@googlegroups.com
Hi Roberto,

On Mon, 19 Jun 2023 at 16:38, Roberto Di Cosmo (<rob...@dicosmo.org>) wrote:
> Hi Miguel,
>     thanks a lot for your message: there was a formatting mistake in the core identifier specification, and it's possible that's what created the confusion. A fixed version with proper formatting is now online.
>
> For the author, the right fragment is now this one, which should be more readable:
>
>   • the author line:
>     • the ASCII string "author" (6 bytes)
>     • an ASCII space
>     • the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space
>     • an ASCII space
>     • the ASCII-encoded decimal representation of the author timestamp
>     • an ASCII space
>     • the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space
>     • a LF
> Here we can see that the author line is terminated by an LF (not followed by a space; it will be followed by the "committer" line), so there should be no ambiguity as to where the line ends. As for parsing this header line, one should go from right to left, starting from the LF terminator, and collecting the timezone, then the timestamp, then getting the author name and email by stripping "author" SPACE from what is left.

Thanks, now I see how it works.
Indeed, if you parse it from the end to the beginning, you can delimit the author's name+email between the first space and the second-to-last space.
OK, no ambiguity!

I don't know if the instructions on how to parse it properly are given somewhere else, but perhaps it'd be good to write it explicitly in this document.

> It is indeed not super elegant, but that's what is done in git (we're keeping compatibility with it for the moment) ;-)
>
> Do you think this is more clear?

With your explanation, sure.
What I find a bit misleading is that for the name+email it's the <LF> that is escaped, but the terminator is a <SPACE>,
whereas for the timezone, for example, the <LF>s are escaped and the terminator is also an <LF>.

Anyway, it's clear now, no need to spend more time on it.
Thanks again!

Best,
Miguel

Roberto Di Cosmo

Jun 19, 2023, 4:16:25 PM
to Miguel Colom, robbie....@posteo.de, swhid-...@googlegroups.com
Hi Miguel,
thanks a lot for your feedback, we should now be ready to move forward!

--
Roberto
