Unicode


Ron Pressler

Mar 28, 2016, 8:11:35 AM
to tlaplus

Hi.


This issue was mentioned here over a year ago, but I think its intent got lost in the rest of the discussion, so let me bring it up again.

The TLA+ pdf pretty-printer does a fantastic job, but its output is optimized for the printed page, while I look at a screen all day and don’t even own a printer (not that I remember, at least). After months of reading and writing TLA+ almost all day, every day, the ASCII representation of various symbols is now second nature to me, but I still find it less readable than proper symbols (and people less familiar with TLA+ find it far less readable), while the pdf output sits pretty much unused. 


All (or virtually all) TLA+ symbols have Unicode characters. How about letting the Toolbox present the Unicode representation of the escaped symbols, i.e., display ∈ when the file says \in? Better yet, as nearly all modern text editors support Unicode, how about supporting Unicode as a legal, alternative representation in all tools? Namely, the .tla file could contain \cup, \union, or ∪ to represent union. The Toolbox would then have an option of saving the symbols to the file as Unicode or keeping the ASCII representation. Agda also works with Unicode symbols (but I think that is the only valid representation there).


Ron

Leslie Lamport

Mar 28, 2016, 11:45:44 AM
to tlaplus
Hi Ron,


There is one potential problem that could make a Unicode version of TLA+ impossible: because of TLA+'s use of indentation, it would require a fixed-width Unicode font.  So I suggest that you investigate whether one exists and can be used by ordinary Java programs.  If it does, then this would not be too hard to implement, and I would be happy to help a volunteer do it.

Leslie

Ron Pressler

Mar 28, 2016, 12:10:26 PM
to tlaplus
DejaVu Sans Mono is a fixed-width TTF font that seems to have all the necessary glyphs (see page 26 of this document). It is free, and while the Toolbox uses SWT and I don't know how that toolkit renders fonts, I see no reason why it shouldn't be possible.

Ron 

Ron Pressler

Mar 28, 2016, 12:57:27 PM
to tlaplus
There's also PragmataPro, which isn't free, but if TLA+ supports unicode (and is shipped with, say, DejaVu Mono), then users will be free to purchase PragmataPro, as an Agda/Coq/HOL developer told me he'd done.

Leslie Lamport

Mar 28, 2016, 1:10:19 PM
to tlaplus
That looks good.  The next problem is whether javacc 4.0, the version of JavaCC that produces the lexing code for SANY, works with Unicode.  (The odds of SANY's JavaCC input working with a newer version seem slim.)  The documentation I have for it says that it does handle Unicode characters, but the code it produces contains a lot of chars, so that claim needs to be tested.  If it does, then it shouldn't be too hard to modify SANY--assuming that the relevant code works with strings rather than chars.  TLC shouldn't need any changes.   Modifications to the Toolbox shouldn't be too difficult, if the appropriate fonts can be used and the Eclipse code can deal with them.   The interface for typing the non-ASCII characters needs to be designed.  And we'd need to decide if non-ASCII characters and things like Greek letters should be allowed in identifiers.  I don't know about TLAPS; but it shouldn't be a problem since I imagine OCaml handles Unicode gracefully and the new version of TLAPS will use the SANY parser.  Modifications to the pretty-printer should be straightforward.

It sounds like about two person-months of work.  As I wrote, I'd be happy to help guide a volunteer.  But, before any work is done on it, we should find out how many people would use it.  I expect that Unicode would introduce problems sharing specs that don't exist with ASCII.  I personally will not trust any non-ASCII files to be usable 20 years from now.

Leslie

Ron Pressler

Mar 28, 2016, 1:18:32 PM
to tlaplus
Well, there's always the (less ideal, IMO, but still very good) option of implementing this only at the presentation layer, in the Toolbox alone. This is an orthogonal feature and can be implemented regardless of whether SANY is made to support Unicode (although that would require adding the option of saving as Unicode).

BTW, I believe Java has supported Unicode since the beginning (which is already twenty years), and a Java char is a UTF-16 code unit, which should cover all the characters we need.
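
To make that concrete, here's a small sketch (the symbol choices are just examples) showing that the common TLA+ operator symbols sit in Unicode's Basic Multilingual Plane, so each one fits in a single Java char with no surrogate pairs:

```java
public class UnicodeCheck {
    public static void main(String[] args) {
        // A few TLA+ symbols and their code points; all are in the
        // Basic Multilingual Plane, so each is one UTF-16 code unit.
        String[] symbols = {
            "\u2208",  // ∈  (\in)
            "\u222A",  // ∪  (\cup, \union)
            "\u2227",  // ∧  (/\)
            "\u2228"   // ∨  (\/)
        };
        for (String s : symbols) {
            // length() counts UTF-16 code units; 1 means a single char suffices
            System.out.println(s + " : " + s.length() + " code unit");
        }
    }
}
```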

Ron

Leslie Lamport

Mar 28, 2016, 2:01:41 PM
to tlaplus
Just using Unicode symbols to display an ASCII source would be significantly easier to implement.   Making the display read-only, so you'd have to toggle into ASCII mode to edit the module, should be fairly easy using code from the pretty-printer.   If you want to edit the view with symbols, I suggest you write a precise specification of how it should work. 

Leslie

Steve Glaser

Mar 28, 2016, 3:15:40 PM
to tla...@googlegroups.com
Files should remain ASCII. That way they are portable to folks without Unicode fonts.

A simple implementation would be to create a third window type for viewing code. This would be read-only and show the same content as the editing window, but using Unicode. The existing "duplicate my window" and "model/view support" features might allow side-by-side viewing and live updates (where changing the ASCII text updates the Unicode view).

Steve Glaser




Ron Pressler

Mar 29, 2016, 9:28:46 AM
to tlaplus
So I propose we split this into three subproblems, and I think this can work for everybody:

1. Write a program that can convert an ASCII .tla file to a Unicode one and vice versa. This is where most of the logic is (indentation...), and the code can be based on the pretty-printer's code.

2. Add an option in the Toolbox so that it replaces, say, \in with ∈ as I type. Then it uses program 1 to run SANY/TLC. I see no reason why this should be a read-only mode, and there is no complex indentation logic here (the file produced is Unicode with correct indentation, and program 1 then converts it to ASCII with correct indentation).

3. Modify SANY so that it can parse unicode directly. Once this is done, program 1 would be used only for conversions from each representation for upgrade/support reasons.

I volunteer to work on 1 (I can start in a few weeks), and then maybe 3 when we get there. I think 2 should be done by someone with good familiarity with Eclipse.
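
For concreteness, here's a minimal sketch of the kind of symbol table programs 1 and 2 could share (the mapping and names are hypothetical; the real list would cover all of TLA+'s operators):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SymbolTable {
    // Hypothetical ASCII -> Unicode mapping that the converter (program 1)
    // and the Toolbox's as-you-type replacement (program 2) could share.
    static final Map<String, String> ASCII_TO_UNICODE = new LinkedHashMap<>();
    static {
        ASCII_TO_UNICODE.put("\\in", "\u2208");    // ∈
        ASCII_TO_UNICODE.put("\\cup", "\u222A");   // ∪
        ASCII_TO_UNICODE.put("\\union", "\u222A"); // ∪
        ASCII_TO_UNICODE.put("/\\", "\u2227");     // ∧
        ASCII_TO_UNICODE.put("\\/", "\u2228");     // ∨
        ASCII_TO_UNICODE.put("\\A", "\u2200");     // ∀
        ASCII_TO_UNICODE.put("\\E", "\u2203");     // ∃
    }

    // Replace a single just-typed token. Deliberately naive: no awareness of
    // strings or comments, and no indentation repair -- that's program 1's job.
    static String replaceToken(String token) {
        return ASCII_TO_UNICODE.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        System.out.println(replaceToken("\\in")); // prints ∈
        System.out.println(replaceToken("foo"));  // unknown tokens pass through
    }
}
```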

Ron

Stephan Merz

Mar 29, 2016, 10:14:56 AM
to tla...@googlegroups.com
2. Add an option in the Toolbox so that it replaces, say, \in with ∈ as I type. Then it uses program 1 to run SANY/TLC. I see no reason why this should be a read-only mode, and there is no complex indentation logic here (the file produced is Unicode with correct indentation, and program 1 then converts it to ASCII with correct indentation).

That's a bit over-simplifying. For example, you could have

 ...
 x \in { s \in S : /\ \/ A(s)
                      \/ B(s)
                   /\ C(s) }


If you replace "\in" in the first line by the corresponding Unicode symbol, you break indentation.

Stephan

Ron Pressler

Mar 29, 2016, 11:06:44 AM
to tlaplus


On Tuesday, March 29, 2016 at 5:14:56 PM UTC+3, Stephan Merz wrote:

That's a bit over-simplifying. For example, you could have

 ...
 x \in { s \in S : /\ \/ A(s)
                      \/ B(s)
                   /\ C(s) }


If you replace "\in" in the first line by the corresponding Unicode symbol, you break indentation.


That's not a problem, because the text would be replaced as you type (i.e., once you hit space after \in it gets replaced with ∈, etc.), so you'd really be typing this:

  x ∈ { s ∈ S : ∧ ∨ A(s)
                  ∨ B(s)
                ∧ C(s) }

which is aligned in unicode, and then program 1 would convert it to your ASCII version, fixing indentation to yield a version aligned in ASCII. So all the complicated logic is in program 1.


Leslie Lamport

Mar 29, 2016, 11:25:26 AM
to tlaplus
   So all the complicated logic is in program 1

Indeed.  You will, of course, write a specification of the two translations for others to examine.  I think you will find it to be an interesting exercise.  The Toolbox's Decompose Proof command has to solve the problem of maintaining indentation while expanding definitions in a proof step.  It works right most of the time, but it's not perfect.  Program 1 will have to be perfect if people are to trust it. 

Ron Pressler

Mar 29, 2016, 11:39:39 AM
to tlaplus

On Tuesday, March 29, 2016 at 6:25:26 PM UTC+3, Leslie Lamport wrote:

The Toolbox's Decompose Proof command has to solve the problem of maintaining indentation while expanding definitions in a proof step.

I don't know how the toolbox communicates with TLAPS, but assuming this decomposition work is done in the toolbox, then isn't it just a matter of doing it either entirely in unicode or entirely in ASCII? Program 1 can then work on complete files only (both input and output). Or did you just mention this as an example showing that maintaining indentation can be tricky?

Jaak Ristioja

Mar 29, 2016, 11:45:01 AM
to tla...@googlegroups.com
Hi!

On 29.03.2016 18:06, Ron Pressler wrote:
> On Tuesday, March 29, 2016 at 5:14:56 PM UTC+3, Stephan Merz wrote:
>>
>>
>> That's a bit over-simplifying. For example, you could have
>>
>> ...
>> x \in { s \in S : /\ \/ A(s)
>>                      \/ B(s)
>>                   /\ C(s) }
>>
>>
>> If you replace "\in" in the first line by the corresponding Unicode
>> symbol, you break indentation.
>>
>>
> That's not a problem because the text would be replaced *as you type*

For users of the TLA+ Toolbox only. Not everybody fits into the proposed
box. Nor would the Unicode conversion box fit the following

pred(S) == \A value \in S: \/ x == 1
(*NoteThisNiceCommentHere*)\/ x == 2

One workaround would be to pad the replacements with spaces, e.g. "\in"
-> " ∈ ".

J

Ron Pressler

Mar 29, 2016, 11:59:41 AM
to tlaplus, jaak.r...@cyber.ee


On Tuesday, March 29, 2016 at 6:45:01 PM UTC+3, Jaak Ristioja wrote:

For users of the TLA+ Toolbox only. Not everybody fits into the proposed
box.

Sure. Those not using the Toolbox could either (1) write in Unicode using whatever editor they like and then feed it to the rest of the tools (directly if 3 is implemented, or after running program 1 if not), or (2) do everything in ASCII as they do today.

 
Nor would the Unicode conversion box fit the following

pred(S) == \A value \in S: \/ x = 1
(*NoteThisNiceCommentHere*)\/ x = 2


In the Toolbox this wouldn't be a problem, as you'd edit the code in Unicode and align it accordingly. In program 1 this would be a problem (I assume you meant the disjunctions to align), which would have to be addressed.
 
One workaround would be to pad the replacements with spaces, e.g. "\in"
-> " ∈ ".

 
Note that running program 1 to convert ASCII -> Unicode is rare, done at most once per file when you decide to transition, so the padding could be cruder than what you suggest. For example, all the padding could be added after the == or even at the beginning of the line. But you're right, I don't think the logic is trivial.

Ron

Ron Pressler

Mar 30, 2016, 5:08:31 AM
to tlaplus
Before we get to the spec, I'd like to outline the goals and assumptions for the algorithm(s) in program 1:

1. The conversions must be correct in the following sense: (a) the semantics of the specification must not change in any way, (b) no syntax errors may be introduced, (c) no syntax errors may be resolved, (d) no exported symbols (i.e., names) may change (although probably no names -- even local ones -- will change).
2. The conversion from ASCII to Unicode (A->U) must be "pretty" in the sense that it should maintain the original formatting intent -- non-syntax or semantics related -- as much as possible. 
3. The A->U conversion will be rare, and so may require some manual re-beautification of the output. This mostly applies to the placement of left comments, as in Jaak Ristioja's example.
4. The U->A conversion need not be "pretty" (though it may be).
5. The algorithms should be stable, in the sense that applying conversions back and forth (without manual intervention) should eventually converge (to two character-by-character stable representations).
6. It is not a goal of the algorithms to preserve the exact locations in errors reported by SANY/TLC. That will be addressed by the changes to SANY made to support Unicode.
7. The program should be as cheap as possible in terms of development effort, and should employ existing code as much as possible, thus possibly leading to:
8. The algorithms may be conservative, in the sense of imposing more constraints than necessary for complying with these goals for the sake of simplicity/cost.

As to the spec, I propose that it be based on this specification of alignment from the pretty-printer's code, with the addition that left comments may be moved to become right comments in the contracting A->U conversion (see point 3 above) in order to preserve alignment. Also, I think the following invariant would arise naturally: every token's beginning column may only decrease (or stay the same) under the A->U conversion, and may only increase (or stay the same) under the U->A conversion.
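
That invariant is cheap to check mechanically. A minimal sketch, assuming a hypothetical token representation with recorded positions (nothing here is SANY's actual API):

```java
import java.util.List;

public class ColumnInvariant {
    // A token with its position in the source; purely illustrative.
    record Token(String text, int line, int beginColumn) {}

    // The proposed invariant: under the contracting A->U conversion, every
    // token's beginning column may only decrease or stay the same.
    static boolean holds(List<Token> ascii, List<Token> unicode) {
        if (ascii.size() != unicode.size()) return false;
        for (int i = 0; i < ascii.size(); i++) {
            if (unicode.get(i).beginColumn() > ascii.get(i).beginColumn())
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // "x \in S" -> "x ∈ S": the "S" moves left, from column 7 to 5.
        List<Token> a = List.of(new Token("x", 1, 1),
                                new Token("\\in", 1, 3),
                                new Token("S", 1, 7));
        List<Token> u = List.of(new Token("x", 1, 1),
                                new Token("\u2208", 1, 3),
                                new Token("S", 1, 5));
        System.out.println(holds(a, u)); // true
    }
}
```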

Ron

Leslie Lamport

Mar 30, 2016, 1:19:23 PM
to tlaplus
Ron,

What you are proposing to do involves modifying the core of SANY and
TLC. These programs were written on a shoestring and are neither well
structured nor well documented.  They now have very few unknown bugs
because they have been largely unchanged for 15 years.  Any plan to
make such extensive modifications needs to include plans for
extensively (and intelligently) testing the new version to make sure
that it doesn't introduce any bugs for ASCII specs.  (For example, I
have worked to avoid having to make even the smallest changes to the
syntax based on experience with the prover because I am afraid of any
change to the JavaCC code.)


Given that, I will be reluctant to incorporate such changes without
evidence that there is sufficient demand for handling unicode symbols.
Chris Newcombe has told me that engineers do not like the
pretty-printed version of the specs because they find the symbols
unfamiliar, and they are more comfortable with ASCII specs.  While I
like the pretty-printed versions, I find very little difference in
ease of reading between the two--the significant difference occurring
in symbols like \subseteq or \prec that don't occur very often.  I
expect the advantages of ASCII (for example, the ability to do serious
editing in Emacs) mean that I wouldn't use unicode symbols in my
specs.  So, I think you should find out how many users would find
your proposal a significant improvement.

Now to get to your proposal.  There seem to be two viable approaches:


1. Make the actual .tla file ASCII and only have the unicode symbols
displayed in the Toolbox module editor.  This means that there is no need
for the ASCII to be easy to read.  For example, it would be quite
all right for the spec displayed as follows (where & stands for the
\in symbol)

   pred(S) == \A value & S:   \/ x == 1
   (*NoteThisNiceCommentHere*)\/ x == 2

to produce this in the ASCII file

   pred(S) == \A value \in S:   \/ x == 1
   (*NoteThisNiceComment  Here*)\/ x == 2

Note that the ASCII file could have arbitrary information required to
display it encoded in special comments at the end of lines.

This would require almost no changes to the existing tools.  The
Toolbox could translate locations in the ASCII file to locations in
the displayed text for displaying locations of errors reported by TLC
and of proof obligations reported by TLAPS. I'm not sure what should
be done about displaying proof obligations created by TLAPS in the
course of a proof.

2.  Put the Unicode in the .tla files.  This potentially requires
changes to code scattered throughout SANY and TLC, making it hard to
avoid introducing bugs.  One problem is in the communication between
the Toolbox and the tools.  This is all done with stdio and stderr.
Do those interfaces support Unicode?  If not, then thousands of print
and read statements in a hundred or so Java files need to be modified.


----


In any case, a lot of Toolbox code needs to be rewritten.  You will
need to find someone to do that, because neither Markus nor I will
have time to do more than answer questions.


Cheers,


Leslie


P.S. The alignment algorithm of the pretty-printer does not guarantee
correctness.  It's a heuristic that works only most of the time.  But in
Plan 2, correctness of the translation to unicode isn't necessary, and
it's OK if people who want ASCII are required to write the spec in ASCII. 


L.


Ron Pressler

Mar 30, 2016, 3:23:45 PM
to tlaplus


On Wednesday, March 30, 2016 at 8:19:23 PM UTC+3, Leslie Lamport wrote:
Ron,

What you are proposing to do involves modifying the core of SANY and
TLC.


That's only required for precise error reporting, and I believe this not to be too hard anyway (e.g. SANY recognizes the \cup and \union tokens as a union; would it be hard to add the token ∪, too?). In any event, I can abort the minute it becomes non-trivial. Whatever the outcome, I think any project could only benefit from another set of eyes once in a while.
 
expect the advantages of ASCII (for example, the ability to do serious
editing in Emacs) mean that I wouldn't use unicode symbols in my
specs. 

Emacs (and vim) work very well with unicode. In fact, Emacs has modes that replace TeX escapes with their unicode equivalents as you type. Agda works like this. All of the popular editors nowadays support unicode, and Java has been unicode-only since 1996.

So, I think you should find out how many users would find
your proposal a significant improvement.

Yes. In any event, I can play around a bit to get a feel for just how hard this should be; you are always free to reject my code.
 

Now to get to your proposal.  There seem to be two viable approaches:


This means that there is no need for the ASCII to be easy to read. 

Right.
 
For example, it would be quite
all right for the spec displayed as follows (where & stands for the
\in symbol)

   pred(S) == \A value & S:   \/ x == 1
   (*NoteThisNiceCommentHere*)\/ x == 2

to produce this in the ASCII file

   pred(S) == \A value \in S:   \/ x == 1
   (*NoteThisNiceComment  Here*)\/ x == 2

Right. But the program should still support conversion in the opposite direction (without any comment "pragmas") for people who want to migrate, and that output should be easy to read. In that case, I see no way other than to move the left comment over to the end of the column, and then let the user decide what they want to do with it.
 

Note that the ASCII file could have arbitrary information required to
display it encoded in special comments at the end of lines.


That's a good idea, but I don't know if I'll need it before I start studying the issue more in depth.
 
This would require almost no changes to the existing tools.  The
Toolbox could translate locations in the ASCII file to locations in
the displayed text for displaying locations of errors reported by TLC
and of proof obligations reported by TLAPS. 

Oh, right. I guess that if that turns out to be much easier, then that's the way to go. 
 

2.  Put the Unicode in the .tla files.  This potentially requires
changes to code scattered throughout SANY and TLC, making it hard to
avoid introducing bugs. 


Right, but it all depends on how pervasive the changes need to be. I'm optimistic, but it's an optimism based largely on ignorance of the code and familiarity with Java, which is very unicode-friendly.
 

One problem is in the communication between
the Toolbox and the tools.  This is all done with stdio and stderr.
Do those interfaces support Unicode? 


Absolutely. They just transport bytes and need to be given an encoding (a Charset in Java parlance) to/from strings (which in Java are always Unicode). They could either be told to work with UTF-8 by passing that charset as an extra parameter, or UTF-8 could be set as the default encoding for all streams at JVM startup. The UTF-8 encoding preserves the ASCII-ness of ASCII data and only does new things for character codes beyond 127, i.e., it is completely backwards compatible with ASCII.
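
A minimal sketch of the first option (the class name is made up; PrintStream's Charset constructor is standard Java):

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8Stdio {
    public static void main(String[] args) {
        // Wrap stdout in an explicitly UTF-8 PrintStream so symbols such as
        // ∈ survive the pipe between the Toolbox and the tools regardless of
        // the platform's default encoding.
        PrintStream out =
            new PrintStream(System.out, true, StandardCharsets.UTF_8);
        out.println("x \u2208 Nat"); // the ASCII part is byte-identical on the wire
    }
}
```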


In any case, a lot of Toolbox code needs to be rewritten.  You will
need to find someone to do that, because neither Markus nor I will
have time to do more than answer questions.


OK. 
 
P.S. The alignment algorithm of the pretty-printer does not guarantee
correctness. 

Hmm, that's good to know. 
 

Ron