OT: entering Unicode characters

Stefan Karpinski

Jan 15, 2014, 11:26:57 AM

to Julia Users

Since Julia source code can use Unicode identifiers, I thought this slightly off-topic blog post by John D. Cook might be useful to people:

http://www.johndcook.com/symbols/2013/12/how-to-enter-unicode-characters/

In particular, I learned about the Unicode Hex input mode for OS X.

David van Leeuwen

Jan 17, 2014, 3:07:31 AM

to julia...@googlegroups.com

Hi,

My two cents,

I started using μ and Σ in my code for normal distributions a while ago. On the Mac, these symbols happen to exist on the US keyboard, but they turn out to be at different Unicode code points than the Greek alphabet letters, so apparently these Option-key entries on the US keyboard have the meanings "micro" and "sum". On the screen, they look identical to the Greek letters, so this can lead to a bit of confusion.

In my current working environment I have a full Greek keyboard defined, and switch between US and Greek using the shortcut Command-Space. This gives me fairly quick access to the full Greek symbol repertoire without having to remember Unicode code points.

Cheers,

---david

John Myles White

Jan 17, 2014, 11:41:57 AM

to julia...@googlegroups.com

Sigh. These kinds of homographs are the major problem I’ve found with using Unicode, including my favorite phishing strategy: http://en.wikipedia.org/wiki/IDN_homograph_attack

On my machine, I get exactly the same results as you:

# Using U.S. keyboard on OS X, type alt-m
julia> int('µ')
181

# Using Greek keyboard on OS X, type m
julia> int('μ')
956

I think this means that we need to change all of the code we’ve written that uses Unicode to use only unambiguous ASCII characters. Allowing homographs in any code that more than one person will ever edit is almost certainly going to induce outbursts of rage at some point.

— John

Jiahao Chen

Jan 17, 2014, 11:55:16 AM

to julia...@googlegroups.com

> I think this means that we need to change all of the code we’ve written that uses Unicode to use only unambiguous ASCII characters.

If we do this, I will have to bow out of maintaining any code I've
written with Unicode characters. There is a lot of numerical code that
is simply too hard for me to read if I'm forced to do an extra layer of
transliteration for the sake of charset purity.

I agree that the homograph problem is an issue, but it is something
that is not hard to check in practice so long as one is consistent
within the same scope.

John Myles White

Jan 17, 2014, 12:01:30 PM

to julia...@googlegroups.com

As you can probably imagine, I don’t agree. It seems like a huge problem to me if, when I read your code, I can’t tell which characters you’re using just by reading your code.

— John

Jiahao Chen

Jan 17, 2014, 12:07:33 PM

to julia...@googlegroups.com

> It seems like a huge problem to me if, when I read your code, I can’t tell which characters you’re using just by reading your code.

Seems ridiculous, doesn't it? Welcome to the world of non-ASCII
character sets ;-)

I think we can agree that it would be nice to have a better way of
detecting Unicode homograph collisions than manual verification. We
could try to standardize on various codepoints. But I think we'll just
have to agree to disagree on making this a prescription.
Thanks,

Jiahao Chen, PhD
Staff Research Scientist
MIT Computer Science and Artificial Intelligence Laboratory

Jiahao Chen

Jan 17, 2014, 12:28:59 PM

to julia...@googlegroups.com

Leaving aside the nuclear option and proselytizing for the moment, the
Unicode consortium does helpfully (?) provide a long list of
confusable characters

http://www.unicode.org/Public/security/revision-05/confusables.txt

and a related technical standard

http://www.unicode.org/reports/tr39/

which builds upon a general standard addressing the general security
question of spoofing

http://www.unicode.org/reports/tr36/

Hmm, writing a text checker to detect potential confusables could
actually make for a decent undergraduate thesis project...

Milan Bouchet-Valat

Jan 17, 2014, 12:32:31 PM

to julia...@googlegroups.com

On Friday, January 17, 2014 at 12:07 -0500, Jiahao Chen wrote:
> > It seems like a huge problem to me if, when I read your code, I
> can’t tell which characters you’re using just by reading your code.
>
> Seems ridiculous, doesn't it? Welcome to the world of non-ASCII
> character sets ;-)
>
> I think we can agree that it would be nice to have a better way of
> detecting Unicode homograph collisions than manual verification. We
> could try to standardize on various codepoints. But I think we'll just
> have to agree to disagree on making this a prescription.

If a tool is written to automatically format Julia code, it could also
check that two homograph characters are not used at the same time in the
same project. This would catch all problematic cases, and I don't think
there would be many false positives.

That wouldn't catch the case where you do not call the tool at all before
running the code -- but doing the check in the compiler itself may slow
down compilation.
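A check of this kind could be sketched as below. This is only an illustration under stated assumptions: it uses NFKC normalization from the `Unicode` standard library of current Julia releases as a stand-in for a real confusables table, and the function name `mixed_lookalikes` and the sample source text are hypothetical. NFKC is a rough proxy: it would also flag a superscript 2 next to a plain "2", and it would miss Latin/Cyrillic look-alikes, which NFKC does not merge.

```julia
using Unicode

# Report characters in `src` whose NFKC form is a different single character
# that is ALSO present in `src`, i.e. both variants are used side by side.
function mixed_lookalikes(src::AbstractString)
    present = Set(src)                      # distinct characters in the text
    hits = Pair{Char,Char}[]
    for c in present
        isascii(c) && continue              # ASCII is never rewritten by NFKC
        n = Unicode.normalize(string(c), :NFKC)
        if length(n) == 1 && first(n) != c && first(n) in present
            push!(hits, c => first(n))
        end
    end
    return hits
end

# Hypothetical source text mixing GREEK SMALL LETTER MU (U+03BC)
# with MICRO SIGN (U+00B5), written as escapes to keep them apart.
src = "const \u03bc = 3\n\u00b5 + 1\n"
for (c, n) in mixed_lookalikes(src)
    println("U+", uppercase(string(UInt32(c), base = 16, pad = 4)),
            " is used alongside its NFKC form U+",
            uppercase(string(UInt32(n), base = 16, pad = 4)))
end
```

Running something like this over every file of a project, rather than one string, would give the project-wide check described above.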

Regards

Mike Nolta

Jan 17, 2014, 12:36:16 PM

to julia...@googlegroups.com

On Jan 17, 2014, at 12:28, Jiahao Chen <jia...@mit.edu> wrote:

> Leaving aside the nuclear option and proselytizing for the moment, the
> Unicode consortium does helpfully (?) provide a long list of
> confusable characters
>
> http://www.unicode.org/Public/security/revision-05/confusables.txt
>
> and a related technical standard
>
> http://www.unicode.org/reports/tr39/
>
> which builds upon a general standard addressing the general security
> question of spoofing
>
> http://www.unicode.org/reports/tr36/
>
> Hmm, writing a text checker to detect potential confusables could
> actually make for a decent undergraduate thesis project...

Eh, life's too short:

http://www.icu-project.org/apiref/icu4c/uspoof_8h.html

-Mike

Leah Hanson

Jan 17, 2014, 12:36:47 PM

to julia...@googlegroups.com

If we had a script to check for this, it could be set up as part of the default Travis thing generated for packages.

There could be a package/tool to generate helpful error messages when you try to use a function/MathConstant by the wrong (confusable) Unicode character. Something like "You're using the wrong pi; try this one or the name `pi`".

-- Leah

Jiahao Chen

Jan 17, 2014, 12:53:05 PM

to julia...@googlegroups.com

>> Hmm, writing a text checker to detect potential confusables could
>> actually make for a decent undergraduate thesis project...
>
> Eh, life's too short:
>
> http://www.icu-project.org/apiref/icu4c/uspoof_8h.html

Hmm, writing a Julia package to wrap libicu to detect potential
confusables in Julia code could actually make for a decent
undergraduate thesis project...

Eric Davies

Jan 17, 2014, 1:17:07 PM

to julia...@googlegroups.com

Someone could add uspoof support to ICU.jl as a first step.

Jiahao Chen

Jan 17, 2014, 1:21:44 PM

to julia...@googlegroups.com

Ok, the joke wears thin the third time round.

Raphael Sofaer

Jan 17, 2014, 1:32:08 PM

to julia...@googlegroups.com

I think the ideal behavior would be for Julia itself to have an opinion on which character in each set of identical-looking characters was right, and to warn on using a homograph that was not canonical. Combined with a tool that would substitute any character causing a warning with the appropriate homograph, that would end any problems.

Toivo Henningsson

Jan 17, 2014, 3:17:25 PM

to julia...@googlegroups.com

Perhaps Julia could canonicalize symbols at parse time (besides warning for non-canonical ones?). I think that whichever homograph is chosen as canonical, it won't be the one that is easiest to type for everyone.

Jeff Bezanson

Jan 17, 2014, 3:29:39 PM

to julia...@googlegroups.com

This is my secret weapon for entering unicode characters:

https://gist.github.com/JeffBezanson/8480786

After adding that to a .emacs, you can switch to symbol-input mode and
type e.g. \theta to enter a theta. The set of characters is obviously
easy to extend.

Eric Davies

Jan 17, 2014, 3:51:37 PM

to julia...@googlegroups.com

Sublime Text (2 and 3) has a package called UnicodeMath which has similar functionality and covers much of Unicode by default. It also lets you add symbols and add synonyms for existing names.

P.S.: \upSigma produces Σ and \sum produces ∑ (different).

Steven G. Johnson

Jan 17, 2014, 4:07:36 PM

to julia...@googlegroups.com

On Friday, January 17, 2014 1:32:08 PM UTC-5, Raphael Sofaer wrote:

> I think the ideal behavior would be for Julia itself to have an opinion on which character in each set of identical-looking characters was right, and to warn on using a homograph that was not canonical. Combined with a tool that would substitute any character causing a warning with the appropriate homograph, that would end any problems.

Yes, I think this is essential. Otherwise I can foresee users banging their heads on the keyboard for hours because they don't understand why

const μ = 3
µ + 1

gives a "µ not defined" exception.

Steven G. Johnson

Jan 17, 2014, 4:22:02 PM

to julia...@googlegroups.com

I opened an issue for this:

https://github.com/JuliaLang/julia/issues/5434

My preference would be for Julia to silently canonicalize all homoglyphs in identifiers (rather than issuing a warning or whatever).

Ivar Nesje

Jan 17, 2014, 4:58:18 PM

to julia...@googlegroups.com

+10 for automatic canonicalization. If we could have an optional warning when canonicalization is needed, for Travis to barf at, that would be great too.

Marcus Urban

Jan 18, 2014, 12:34:44 AM

to julia...@googlegroups.com

I'm not sure whether people are using "canonicalize" in the generic sense or if they mean canonical mappings as defined by the Unicode standard. Just to be clear, the initial issue raised about U+00B5 MICRO SIGN versus U+03BC GREEK SMALL LETTER MU would not be fixed by a canonical decomposition. However, U+00B5 does have a compatibility decomposition to U+03BC.

The official definitions are given in http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf, and some relevant suggestions about handling identifiers in the context of Unicode are in http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf.
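The difference between the canonical and the compatibility mapping for these two characters can be checked directly. A minimal sketch, assuming a current Julia release in which `normalize` lives in the `Unicode` standard library (that location postdates this thread); the variable names are only for illustration:

```julia
using Unicode

micro = "\u00b5"   # U+00B5 MICRO SIGN
mu    = "\u03bc"   # U+03BC GREEK SMALL LETTER MU

# Canonical normalization (NFC) leaves the micro sign alone...
println(Unicode.normalize(micro, :NFC) == mu)    # false
# ...while compatibility normalization (NFKC) maps it to Greek mu.
println(Unicode.normalize(micro, :NFKC) == mu)   # true
```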

Marcus Urban

Jan 18, 2014, 12:36:37 AM

to julia...@googlegroups.com

Make that last link to the FAQ http://www.unicode.org/faq/normalization.html

Patrick O'Leary

Jan 18, 2014, 9:34:05 AM

to julia...@googlegroups.com

We've been specifically discussing normalization form KC as defined by UAX #15 (http://unicode.org/reports/tr15/) in the issue (https://github.com/JuliaLang/julia/issues/5434), which is a compatibility normalization.


Steven G. Johnson

May 22, 2014, 1:27:41 PM

to julia...@googlegroups.com

A quick update for people who haven't been tracking git closely:

The Julia REPL (#6911), IJulia, and (soon) Emacs julia-mode (#6920) now allow you to type many mathematical Unicode characters simply by typing the LaTeX symbol and hitting TAB.

e.g. you can type \alpha<TAB> and get α, or x\hat<TAB> and get x̂.

There are currently 736 supported symbols (though not all of them are valid in Julia identifiers). This should provide a consistent, cross-platform Julian idiom for entering Unicode math.
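The substitution idea can be sketched in a few lines. This is not the REPL's actual implementation: the three-entry table `SYMBOLS` and the function `tab_substitute` are hypothetical names made up for illustration, and the sketch only handles a LaTeX-style name at the very end of the input.

```julia
# A tiny, hypothetical stand-in for the completion table; the real one is far larger.
const SYMBOLS = Dict(
    "\\alpha" => "\u03b1",   # Greek small alpha
    "\\mu"    => "\u03bc",   # Greek small mu
    "\\hat"   => "\u0302",   # combining circumflex; attaches to the preceding character
)

# Replace a LaTeX-style name at the end of `line` with its symbol, as if TAB
# had been pressed there. Return the text unchanged if nothing matches.
function tab_substitute(line::AbstractString)
    m = match(r"\\[A-Za-z]+$", line)
    m === nothing && return String(line)
    sym = get(SYMBOLS, String(m.match), nothing)
    sym === nothing && return String(line)
    # Keep everything before the backslash, then append the symbol.
    return line[1:prevind(line, m.offset)] * sym
end

println(tab_substitute("\\alpha"))   # alpha
println(tab_substitute("x\\hat"))    # x followed by a combining circumflex
println(tab_substitute("A\\b"))      # unknown name, left as typed
```

Because nothing happens until the explicit call (the TAB press), an expression such as A\b can be typed normally and is left alone.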

Hopefully this can also be added to other popular editors at some point, e.g. presumably vim can be programmed to do this, and there is a somewhat similar mode for Sublime (https://github.com/mvoidex/UnicodeMath). (Less-programmable editors might need source-level patches, but it doesn't seem like an unreasonable patch to suggest.)

Mike Innes

May 22, 2014, 1:33:12 PM

to julia...@googlegroups.com

Great! This feature will be in Light Table soon, too – complete with fuzzy searching, so that it's easy to browse all available symbols :)

harven

May 22, 2014, 3:00:42 PM

to julia...@googlegroups.com

On Thursday, May 22, 2014 at 19:27:41 UTC+2, Steven G. Johnson wrote:

> A quick update for people who haven't been tracking git closely:
>
> The Julia REPL (#6911), IJulia, and (soon) Emacs julia-mode (#6920) now allow you to type many mathematical Unicode characters simply by typing the LaTeX symbol and hitting TAB.
>
> e.g. you can type \alpha<TAB> and get α, or x\hat<TAB> and get x̂.

– Nice. 'Course there's an emacs command to do that.
– Oh yeah! Good ol' M-x set-input-method RET TeX RET
– Dammit, Emacs.

http://www.emacswiki.org/emacs/TeXInputMethod

Miguel Bazdresch

May 22, 2014, 3:03:39 PM

to julia...@googlegroups.com

In vim, you can do something like

imap \alpha<TAB> <C-V>u03b1

to reproduce this behavior.

-- mb

Steven G. Johnson

May 22, 2014, 3:28:24 PM

to julia...@googlegroups.com

On Thursday, May 22, 2014 3:00:42 PM UTC-4, harven wrote:

> – Nice. 'Course there's an emacs command to do that.
> – Oh yeah! Good ol' M-x set-input-method RET TeX RET
> – Dammit, Emacs.
>
> http://www.emacswiki.org/emacs/TeXInputMethod

(Unfortunately, I find this mode is too insanely annoying to actually leave turned on all the time ... as soon as you type a backslash, it starts crazily jumping the cursor around as it substitutes one character after another, not to mention the difficulty of typing valid Julia expressions like A\b. Which means you need to come up with some keybinding to turn it on only when you need it.)

Daniel Jones

May 22, 2014, 3:56:04 PM

to julia...@googlegroups.com

Also for vim users who aren't aware of this: vim has a convenient way to enter common special characters in the form of digraphs, which you can enter by pressing ctrl-k in insert mode. You have to learn the digraph for the symbol, but they are pretty mnemonic in their assignment (e.g. 'C(' -> ⊂, 'm*' -> μ, 's*' -> σ, 'Fm' -> ♀), and honestly, you wouldn't be using vim if you weren't into maximizing efficiency by learning short cryptic commands.

Steven G. Johnson

May 22, 2014, 3:59:16 PM

to julia...@googlegroups.com

On Thursday, May 22, 2014 3:03:39 PM UTC-4, Miguel Bazdresch wrote:

> In vim, you can do something like
>
> imap \alpha<TAB> <C-V>u03b1
>
> to reproduce this behavior.

This works, sort of, but I find it a bit annoying. If you are too slow in typing "\alpha" then it doesn't perform the substitution. If you type it quickly, it works, but you have to type it blindly because vim doesn't move the cursor (the characters "\alpha" fall on top of one another as you type). Worse, it makes it harder t

I find it much nicer to be able to type \alpha, see what I'm doing, and then type <TAB> at any later point in time, only when I'm ready to make the substitution. Presumably you can program vim to do this, but it may not be as simple as "imap"?

On the other hand, I'm not a vi user. Maybe an editing mode that requires rapid, blind typing would fit right in with that editor. ;-)

Stefan Karpinski

May 22, 2014, 4:03:39 PM

to Julia Users

No true vim user types so slowly that this is a problem.

Patrick O'Leary

May 22, 2014, 5:06:08 PM

to julia...@googlegroups.com

On Thursday, May 22, 2014 2:28:24 PM UTC-5, Steven G. Johnson wrote:

> ...Which means you need to come up with some keybinding to turn it on only when you need it.

C-\ is bound to toggle-input-method by default.

Miguel Bazdresch

May 22, 2014, 7:34:39 PM

to julia...@googlegroups.com

That's true, but I think the best solution would be to have the same keybindings in the julia REPL and in vim. I think it'd be terribly confusing otherwise.

-- mb

Carlo Baldassi

May 23, 2014, 8:38:44 PM

to julia...@googlegroups.com

Update: the vim plug-in now includes this feature. If you press Tab after a valid LaTeX sequence, it substitutes it; otherwise it falls back to whatever was previously mapped for Tab. Or at least that's what it's supposed to do.

Carlo Baldassi

May 29, 2014, 8:27:38 PM

to julia...@googlegroups.com

Yet another small update, since most users might miss this: Vim now has an optional on-the-fly-as-you-type LaTeX-to-Unicode substitution mode, which however is off by default (so as to emulate the Julia REPL as closely as possible).
See the documentation on how to enable it at https://github.com/JuliaLang/julia-vim, or use ":help julia-vim" after updating the julia-vim plug-in.
Opinions, bug reports etc. welcome.


Henri Girard

Apr 20, 2016, 1:38:25 PM

to julia-users

Wonderful answer! I am new to Julia (IJulia), as of today exactly... But I was wondering how to get symbols... You saved me a good lot of time!
So I will get to enjoy my favorite "Saumur Champigny" longer! lol
