Unicode characters in variable names


Jeffrey Arnold

Feb 21, 2016, 12:56:53 AM
to stan...@googlegroups.com
Since there's been discussion of major changes to Stan's syntax in Stan 3, I thought I'd throw this idea out there if it hasn't already been mentioned: allow unicode characters in variable names. Then Greek letters like α, β, μ, and δ can be used as variable names. This is something that Julia allows in its variable names. It's a little gimmicky, but if it is sufficiently easy to alter the grammar and parser, then it would be cool to write models like this one:
data {
  int n;
  int p;
  vector[n] y;
  matrix[n, p] X;
}
parameters {
  real<lower = 0.0> σ;
  vector[p] β;
}
transformed parameters {
  vector[n] μ;
  μ <- X * β;
}
model {
  y ~ normal(μ, σ);
}

Bob Carpenter

Feb 21, 2016, 1:27:56 AM
to stan...@googlegroups.com
I worry about corrupted character encodings if we do that.
I spent nearly 10 years working on natural language processing
and can't tell you how many hours I spent correcting
coding gaffes.

Just about everything can handle ASCII (other than most LaTeX fonts
choking on ~, that is).

I'm very very keen to keep Stan running the same way
across interfaces. So if we could demonstrate that it would
work on all the platforms (Windows/Linux/Mac) in all of
our interfaces (CmdStan/RStan/PyStan/...), then I'd be willing
to risk the users messing things up in cut-and-paste or on web
pages or whatever.

I think we can get LaTeX to take unicode input and print
a µ the right way in a Verbatim environment.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "stan development mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-dev+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Michael Betancourt

Feb 21, 2016, 6:52:57 AM
to stan...@googlegroups.com
I don't have anywhere near as much Unicode experience as Bob, but what I do have also makes me hesitant to move away from ASCII. I hate encodings with a righteous passion.

Jeffrey Arnold

Feb 21, 2016, 10:25:52 AM
to stan...@googlegroups.com
The encodings problem is a good point, and definitely outweighs anything else. Also, after looking a little more at Julia, they also had to write a library to handle issues like glyphs that look similar. 

Allen B. Riddell

Feb 21, 2016, 11:12:30 AM
to stan...@googlegroups.com
I think this is a good thing to think about. In the long-term, it would
be great if Stan allowed people in situations where the use of the Latin
alphabet is not dominant (e.g., Russia, China, Japan, etc.) to use
UTF-8 in their Stan program code.

I imagine at some point a standard will emerge for dealing cleanly with
UTF-8 in C++ (C++11 or C++14?).

Best,

Allen

Bob Carpenter

Feb 21, 2016, 11:38:56 AM
to stan...@googlegroups.com
The problem isn't C++, it's unicode itself. And the
fact that unicode isn't the only game in town. So what
people will do is open an editor in Word or something,
get their default encoding, think they're writing "unicode"
because in their head that means "fancy characters", save
it as some Win-1252 variant, then go into encoding hell as
it renders as random characters when they run it through Stan.
Or it won't render in the Windows shell.
Then they'll have to figure out how to make LaTeX take
unicode inputs (not that hard, but a flag you need to set).
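
A minimal sketch of that failure mode, written in Python purely for illustration (the helper name mojibake is hypothetical and not part of Stan or any interface). It assumes the file was written as UTF-8 and then read back as Windows-1252, which is one common way an editor and a tool can disagree:

```python
def mojibake(text: str, written_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Encode text with one encoding and decode it with another,
    as a toolchain that disagrees about the file encoding would."""
    return text.encode(written_as).decode(read_as)


line = "y ~ normal(μ, σ);"
garbled = mojibake(line)

# Each Greek letter is two bytes in UTF-8, so it comes back as two
# unrelated Windows-1252 characters.
print(garbled)

# A pure-ASCII line survives the same round trip unchanged.
print(mojibake("y ~ normal(mu, sigma);"))
```

The ASCII version is unaffected because both encodings agree on the first 128 code points, which is why ASCII-only programs rarely hit this problem.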

What Jeffrey Arnold's bringing up is that there are standard
forms. If you have o + umlauts, it can either be encoded
as a regular o followed by an umlaut combining character, or as the
single o+umlaut character. Then there's a whole copy of ASCII
at full width in the Chinese character plane. And then a whole
bunch of characters in the Chinese name plane that go beyond
16 bits, which limits the number of systems that deal with them
properly. Then there's the fact that unicode's evolving, so we
could go with unicode A.B, but they keep adding more characters.
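
The three issues in this paragraph can be sketched with Python's standard unicodedata module. This is only an illustration of the Unicode behavior, not a proposal for how Stan's parser would do it; the variable names are made up for the example:

```python
import unicodedata

composed = "\u00f6"     # o-umlaut as one precomposed code point
decomposed = "o\u0308"  # plain o followed by COMBINING DIAERESIS

# The two spellings render the same but are different strings
# until both are put into the same normal form.
same_raw = composed == decomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(same_raw, same_nfc)

# Fullwidth Latin letters are a second copy of ASCII; the
# compatibility normal form NFKC folds them back to ASCII.
fullwidth_mu = "\uff4d\uff55"  # fullwidth "m" and "u"
folded = unicodedata.normalize("NFKC", fullwidth_mu)
print(folded)

# A CJK ideograph beyond 16 bits needs two 16-bit units in UTF-16.
big = "\U00020000"
utf16_units = len(big.encode("utf-16-le")) // 2
print(hex(ord(big)), utf16_units)
```

If identifiers were compared only after normalization, the first two cases would stop being distinct names; the third matters mainly for systems that assume one character fits in 16 bits.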

But like I said, as long as the I/O with proper encodings
can be shown to work on all platforms, we can try it.

- Bob

Andrew Gelman

Feb 21, 2016, 4:36:57 PM
to stan...@googlegroups.com
Fuckin latex chokes on apostrophes. I type my apostrophes like this on keyboard: ‘
But when I put them in a blog post or latex to pdf they appear like this: ’
And then when I put _that_ character into my latex documents it becomes invisible.
Grrrr.

Bob Carpenter

Feb 21, 2016, 6:53:55 PM
to stan...@googlegroups.com
And if you have two apostrophes, it looks like a double-quote: ”
The problem with escape-based syntaxes, like LaTeX's, is why
this xkcd is funny: https://xkcd.com/1638/

You usually don't want a vertical quote ' anywhere other
than in typewriter text for code. LaTeX does the right thing
for you there with verbatim, but still messes up ~ with
most fonts (not Lucida!).

If you're in ordinary text and really want that vertical apostrophe,
you need to escape: \textquotesingle

Alas, \textasciitilde is also broken in most LaTeX fonts.

- Bob

Andrew Gelman

Feb 21, 2016, 7:59:55 PM
to stan...@googlegroups.com
I don’t want the vertical quote! What annoys me is that I just want the regular quotation mark but Latex doesn’t recognize it!

Bob Carpenter

Feb 21, 2016, 8:52:56 PM
to stan...@googlegroups.com
Try this:

\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
\section*{Fun with Unicode}
Single open: ‘
\\
Single close: ’
\\
Double open: “
\\
Double close: ”
\end{document}

I'm attaching both the input file and the output pdf from
pdflatex. I prefer to use ` and ' and `` and '' in
LaTeX because it's more robust and I've been doing it that
way since before Unicode existed.

- Bob





foo.tex
foo.pdf

Jeffrey Arnold

Feb 21, 2016, 11:40:20 PM
to stan...@googlegroups.com
I tend to use xelatex with the fontspec package, which also doesn't cause any issues. See attached: 

On the topic of the unicode characters: I was thinking of the benefits as (1) making the models in Stan similar to the mathematical notation, and (2) it is the sort of feature that looks really good in presentations (especially for new users)! However, if the technical debt of it is too high, it's not worth it.

foo.pdf
foo.tex

Bob Carpenter

Feb 22, 2016, 1:30:55 AM
to stan...@googlegroups.com
I don't think it'd be too bad if it could be made
portable across interfaces. I don't know how hard
that'd be. It looks like RStan and CmdStan would be
OK on the Mac.

- Bob

Jeffrey Arnold

Feb 23, 2016, 7:01:43 AM
to stan...@googlegroups.com
If you do want to do this at some point, the way that Julia handles it (apart from arbitrary operators) seems like a good set of rules:

Variable names must begin with a letter (A-Z or a-z), underscore, or a subset of Unicode code points greater than 00A0; in particular, Unicode character categories Lu/Ll/Lt/Lm/Lo/Nl (letters), Sc/So (currency and other symbols), and a few other letter-like characters (e.g. a subset of the Sm math symbols) are allowed. Subsequent characters may also include ! and digits (0-9 and other characters in categories Nd/No), as well as other Unicode code points: diacritics and other modifying marks (categories Mn/Mc/Me/Sk), some punctuation connectors (category Pc), primes, and a few other characters.
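
As a rough sketch of how rules like these could be checked, here is a simplified validator in Python using the standard unicodedata module. It is an assumption-laden approximation: the function names are hypothetical, it covers only the general categories named above, and it leaves out the "subset of the Sm math symbols", primes, and the "few other characters" that Julia also permits.

```python
import unicodedata

# Categories allowed for the first character (letters, letter numbers,
# currency and other symbols), per the description above.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Sc", "So"}
# Later characters may also be digits, modifying marks, and connectors.
CONTINUE_CATEGORIES = START_CATEGORIES | {"Nd", "No", "Mn", "Mc", "Me", "Sk", "Pc"}


def is_start(ch: str) -> bool:
    """True if ch may begin an identifier under the simplified rules."""
    if ch.isascii():
        return ch.isalpha() or ch == "_"
    return ord(ch) > 0x00A0 and unicodedata.category(ch) in START_CATEGORIES


def is_continue(ch: str) -> bool:
    """True if ch may appear after the first character."""
    if ch.isascii():
        return ch.isalnum() or ch in "_!"
    return ord(ch) > 0x00A0 and unicodedata.category(ch) in CONTINUE_CATEGORIES


def is_identifier(name: str) -> bool:
    """Check a whole name: one start character, then continue characters."""
    return bool(name) and is_start(name[0]) and all(is_continue(c) for c in name[1:])


for candidate in ["β", "σ2", "μ_hat", "x!", "2σ", "a-b"]:
    print(candidate, is_identifier(candidate))
```

A real implementation would probably also normalize names before comparing them, so that two spellings of the same letter do not become two different variables.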
 
Operators like + are also valid identifiers, but are parsed specially. In some contexts, operators can be used just like variables; for example (+) refers to the addition function, and (+) = f will reassign it. Most of the Unicode infix operators (in category Sm), such as ⊕, are parsed as infix operators and are available for user-defined methods (e.g. you can use const ⊗ = kron to define ⊗ as an infix Kronecker product).
 
The only explicitly disallowed names for variables are the names of built-in statements:

Bob Carpenter

Feb 23, 2016, 3:36:57 PM
to stan...@googlegroups.com
You'll see that they're using Unicode categories. This is
easy to parse out with something like the International
Components for Unicode (ICU):

http://site.icu-project.org

It has all the classification into categories and also
all the normalization. So none of this is that hard, and
I've had quite a bit of experience dealing with Unicode.

There's a chapter of Mitzi's book that goes over all of
the gory details, including the UTF encodings, the categories,
the normalizations, and the ICU toolkit:

http://www.amazon.com/gp/product/B00HYLZ2O2

- Bob

Tamas Papp

May 9, 2016, 9:06:37 AM
to stan development mailing list
Sorry to revive an old (but recent) thread. In Julia, I have been staying away from Unicode variable names for a while, being concerned about font support, editor support, and corrupted files. And frankly, the whole thing looked like a gimmick.

Then I tried it, and within a few days I changed my opinion completely. I find that I can organize my code better, and use variable names that are short yet correspond to formulas in papers. This leads to very compact equations. Corruption has not been a problem in practice (using UTF-8 all the time).

AFAIK most editors support some form of entry. In Emacs, for example, you can use autocomplete, company-mode, and quail (see, e.g., the packages listed at the end of https://github.com/vspinu/math-symbol-lists ) to enter \alpha for α. So it would be great to have Unicode in Stan.

Best,

Tamas

Bob Carpenter

May 9, 2016, 11:07:07 AM
to stan...@googlegroups.com
Thanks for the report. I don't think this will be a priority
for us for a while given that we're very busy with other matters.

Any idea what the support is in all of our interface languages?
One thing I want to do is keep compatibility of models across
interfaces.

Feel free to create a feature request issue on stan-dev/stan.
Any implementation hints for C++ would be appreciated.

The nice thing about UTF-8 is that it's backward compatible with
ASCII.
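
A small Python check of that property, for illustration only (the sample program line is made up): ASCII text produces exactly the same bytes under both encodings, and every byte UTF-8 uses for a non-ASCII character has its high bit set, so it can never be mistaken for an ASCII character.

```python
ascii_program = "y ~ normal(mu, sigma);"

# Existing ASCII-only Stan programs are already valid UTF-8, byte for byte.
same_bytes = ascii_program.encode("ascii") == ascii_program.encode("utf-8")
print(same_bytes)

# Greek mu takes two bytes in UTF-8, both outside the ASCII range.
mu_bytes = list("μ".encode("utf-8"))
print(mu_bytes)
print(all(b >= 0x80 for b in mu_bytes))
```

So a parser that reads UTF-8 would treat every current ASCII program exactly as before; only programs that actually contain non-ASCII characters would take a new code path.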

- Bob

Tamas Papp

May 9, 2016, 11:44:20 AM
to stan...@googlegroups.com
Opened https://github.com/stan-dev/cmdstan/issues/394 .

Please comment there if you know about UTF8 support in your favorite
editor, I will include it in the issue.

Bob Carpenter

May 9, 2016, 1:22:09 PM
to stan...@googlegroups.com
Thanks for the detailed request! The extent of support is
encouraging. We'd rule out Python 2, but I think C++11 is
going to do that for us eventually anyway.

- Bob

Krzysztof Sakrejda

May 11, 2016, 5:41:22 PM
to stan development mailing list
We've already had problems with people sending us cut-and-paste programs broken due to multiple similar characters for hyphens. Isn't this going to be a nightmare if people start sending us reports of bugs due to using the wrong omega symbol or something like that? Are these bugs easy to trap and warn about?

K
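
One possible way to trap these, sketched in Python with the standard unicodedata module (the function name non_ascii_report and the sample program are hypothetical, and this is not an existing Stan feature): list every non-ASCII character with its position and official Unicode name, so that look-alikes become visible in a warning.

```python
import unicodedata


def non_ascii_report(source: str):
    """Return (line, column, char, Unicode name) for each non-ASCII character."""
    found = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                found.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return found


# Two omegas that look alike, plus a typographic minus pasted from a web page.
program = "real \u03a9;\nreal \u2126;\nx = a \u2212 b;"
for entry in non_ascii_report(program):
    print(entry)

# Normalization merges some look-alikes: NFC maps OHM SIGN to Greek omega,
# and NFKC maps MICRO SIGN to Greek mu.
print(unicodedata.normalize("NFC", "\u2126") == "\u03a9")
print(unicodedata.normalize("NFKC", "\u00b5") == "\u03bc")

# It does not help with the minus sign, which has no decomposition.
print(unicodedata.normalize("NFKC", "\u2212") == "-")
```

Normalizing identifiers would remove some of the duplicate-symbol cases, but characters such as the minus sign or the various hyphens would still need an explicit warning like the report above.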