Unicode domain names issue (Encrypting a "fake" domain name)

Martin Heaps

Apr 17, 2017, 10:38:25 AM

to mozilla-de...@lists.mozilla.org

I was led today to this article - https://www.wordfence.com/blog/2017/04/chrome-firefox-unicode-phishing/ - about a lookalike of the domain name 'www.epic.com' which can be written using Unicode (encoded as 'www.xn--e1awd7f.com') to make the URL of a fake website look exactly like the URL of the real website.

There is a workaround in Firefox for this, but possibly more seriously:

LetsEncrypt has a valid encryption certificate for the fake domain, and in Firefox the domain certificate (when clicking on the padlock icon to view the drop down box) still shows the processed unicode URL rather than the "raw" URL (showing that the Encryption certificate is for "epic.com" rather than "xn--e1awd7f.com" which is the domain given to LetsEncrypt).

This issue is clearly quite serious: certificate names can be made to look like other names in certain situations, further undermining the value of the certificate as an authenticity check.

I would like this post to serve as a warning and to share this information. The issue may well lie in the hands of browsers, but Certificate Authorities should have an understanding with browsers that their certificate addresses can and should only be shown in a fixed character set and never ever "interrepted" by the browser.

Article date: 14th April 2017.

Links:

- https://www.wordfence.com/blog/2017/04/chrome-firefox-unicode-phishing/

- https://www.xudongz.com/blog/2017/idn-phishing/

- https://www.reddit.com/r/netsec/comments/65csdk/phishing_with_unicode_domains/

Martin Heaps

Apr 17, 2017, 10:41:28 AM

to mozilla-de...@lists.mozilla.org

Sorry; "interrepted" should be "interpreted" .

Anne van Kesteren

Apr 17, 2017, 11:54:27 AM

to Martin Heaps, mozilla-de...@lists.mozilla.org

On Mon, Apr 17, 2017 at 4:38 PM, Martin Heaps <mar...@mhcreations.co.uk> wrote:
> LetsEncrypt has a valid encryption certificate for the fake domain,

It's not a fake domain, it just looks alike, but it's a valid domain
for all intents and purposes and Let's Encrypt should not do a
registrar's job. That registrars allow lookalikes to be assigned to
different entities is a problem.
https://bugzilla.mozilla.org/show_bug.cgi?id=1332714 tracks this.

--
https://annevankesteren.nl/

Martin Heaps

Apr 17, 2017, 2:43:20 PM

to mozilla-de...@lists.mozilla.org

On Monday, 17 April 2017 16:54:27 UTC+1, Anne van Kesteren wrote:

> On Mon, Apr 17, 2017 at 4:38 PM, Martin Heaps wrote:
> > LetsEncrypt has a valid encryption certificate for the fake domain,
>
> It's not a fake domain, it just looks alike, but it's a valid domain
> for all intents and purposes and Let's Encrypt should not do a
> registrar's job. That registrars allow lookalikes to be assigned to
> different entities is a problem.
> https://bugzilla.mozilla.org/show_bug.cgi?id=1332714 tracks this.
>
>
> --
> https://annevankesteren.nl/

No, a clear distinction should be made; the certificate is for the domain "xn--e1awd7f.com" but in certain browsers this is presented to the user as "epic.com". It may not be the certificate authority's role to assure the user that the certified domain is valid, but they should at a bare minimum have an understanding with browser providers that the certificate name is not misconstrued in any way, and that a certificate for "xn--e1awd7f.com" can never be confused with one applying to "epic.com".

THIS is the issue I am raising here.

The word "fake" is probably an imprecise word to use, but in the sense that the domain name as registered is purporting to be another domain name which it is not, it is fake. The SSL provider, LetsEncrypt in this case, seems unable to ensure with browsers that there is a clear distinction between the two names of the domain certified by the CA.

Martin Thomson

Apr 18, 2017, 3:34:55 AM

to Martin Heaps, mozilla-de...@lists.mozilla.org

On Tue, Apr 18, 2017 at 4:43 AM, Martin Heaps <mar...@mhcreations.co.uk> wrote:
> No, a clear distinction should be made; the certificate is for the domain "xn--e1awd7f.com" but in certain browsers this is presented to the user as "epic.com". It may not be the certificates authority to provide trust to the user that the domain certified is valid; but they should as a bare minimum have an understanding with browser providers that the certificate name is not misconstrued in any way; and that a certificate for "xn--e1awd7f.com" can never be confused with applying to "epic.com".

Anne is right here. But so also is Martin (I on the other hand cannot
claim any sort of authority). Let's Encrypt have a firm policy here;
the line is grey and they are (rightly) uninterested in trying to set
where that line exists. But a browser might be able to do
something... maybe.

This is a hard problem, because even if this case seems obvious,
others are much less so. People want to use names they know and that
means using their native script. There are things that a browser can
do, but the task of effectively communicating the true security status
of a site is one of the hard problems.

We're still open to ideas. I've heard of many ideas, but the gap
between theory and practice is much larger in practice than we'd like.
For example, we could define a set of characters that form the native
script and warn if characters outside that script are used. That
would work for browsers in an English-native locale, but would
disadvantage English-speakers from Japan in this particular example,
for whom both of these names appear to be special.

Gervase Markham

Apr 18, 2017, 5:29:59 AM

to mozilla-de...@lists.mozilla.org

On 17/04/17 19:43, Martin Heaps wrote:
> No, a clear distinction should be made; the certificate is for the
> domain "xn--e1awd7f.com" but in certain browsers this is presented to
> the user as "epic.com".

This is perhaps a philosophical issue, but I see this viewpoint as
(perhaps inadvertently) telling users of non-Latin scripts that they are
second-class citizens.

A better way of looking at it is:

* The internet community wished to make all currently-used scripts
first-class citizens on the web, including allowing people to have
domain names in their own script

* It was discovered that the current DNS did not support using Unicode
directly for this purpose

* An encoding was developed such that domains in non-Latin letters could
be encoded using Latin letters

So the domain "еріс.com" (written in Cyrillic) really is the domain "еріс.com". The fact
that it's encoded on the wire as "xn--e1awd7f.com" is an implementation
detail that, in ideal circumstances, should never be exposed to users.
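To make the encoding step concrete, here is a minimal Python sketch. It assumes Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so it only illustrates the idea and is not what a browser or registry actually runs. The Cyrillic label is written with escapes so that it cannot be mistaken for Latin "epic".

```python
# Cyrillic е (U+0435), р (U+0440), і (U+0456), с (U+0441): renders like Latin "epic".
cyrillic_label = "\u0435\u0440\u0456\u0441"

# Unicode form -> ASCII "wire" form, using the stdlib IDNA 2003 codec.
wire_name = (cyrillic_label + ".com").encode("idna")

# ASCII wire form -> Unicode form, roughly what a browser renders.
display_name = b"xn--e1awd7f.com".decode("idna")

print(wire_name)                                # the xn-- form used in DNS
print(display_name == cyrillic_label + ".com")  # round-trips to the Cyrillic name
print(display_name == "epic.com")               # not equal to the Latin name
```

The two names compare as different strings even though most fonts draw them identically.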

> The word "fake" is probably an imprecise word to use, but in the
> context that the domain name as registered is purporting to be
> another domain name it is not; this is fake. The SSL provider;
> LetsEncrypt in this case seems to not be able to ensure with browsers
> that there is a clear distinction between the two names of the domain
> certified by the CA.

Neither browsers nor CAs have a database of all domain names, such that
they can see that one is visually confusable with another. Registries
have this data, and it is their responsibility to deal with this problem.

Gerv

Igor Bukanov

Apr 18, 2017, 6:04:13 AM

to Martin Thomson, Martin Heaps, mozilla-de...@lists.mozilla.org

There is a difference between domains for non-latin scripts and an
xn-- domain that represents a name that can also be written using
plain latin letters. It is just too bad that registrars are allowed to
issue such certificates. So as a simple heuristic a browser can show
the domain as xn-- if it encodes a name that does not require xn--
encoding. Of course that does not help when somebody uses
xn--eic-0ed.com which is eрic.com with the Cyrillic letter "р", but
that is a different and indeed hard-to-solve problem with homographs
in Unicode.


Henri Sivonen

Apr 18, 2017, 6:29:03 AM

to mozilla-de...@lists.mozilla.org

On Tue, Apr 18, 2017 at 12:29 PM, Gervase Markham <ge...@mozilla.org> wrote:
> Neither browsers nor CAs have a database of all domain names, such that
> they can see that one is visually confusable with another. Registries
> have this data, and it is their responsibility to deal with this problem.

Indeed, registries could normalize names first to remove diacritics
and then to map confusables to a canonical exemplar of each confusion
group (e.g. map Cyrillic о to Latin o or the other way round) and then
require that names that become the same under such confusability
normalization belong to the same registrant.
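The normalization described above could be sketched roughly as follows in Python. The `CONFUSABLE_TO_LATIN` table, the `skeleton` and `conflicts` helpers and the registrant names are all hypothetical and invented for this illustration; a real registry would need a much larger mapping (for instance one derived from Unicode's confusables data) and its own policy decisions.

```python
import unicodedata

# Hypothetical, deliberately tiny confusables table: Cyrillic letter -> Latin exemplar.
CONFUSABLE_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}

def skeleton(label: str) -> str:
    """Strip diacritics, then map each confusable to its canonical exemplar."""
    decomposed = unicodedata.normalize("NFD", label.lower())
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in without_marks)

def conflicts(new_label: str, registered: dict, registrant: str) -> bool:
    """True if new_label collides with a name held by a different registrant.

    `registered` maps skeleton -> registrant for names already in the zone.
    """
    owner = registered.get(skeleton(new_label))
    return owner is not None and owner != registrant

# The Latin name is already held by "registrant-a" (a made-up identifier).
registered = {skeleton("epic"): "registrant-a"}
lookalike = "\u0435\u0440\u0456\u0441"  # all-Cyrillic lookalike of "epic"
print(conflicts(lookalike, registered, "registrant-b"))  # different owner: rejected
print(conflicts(lookalike, registered, "registrant-a"))  # same owner: allowed
```

Under this rule the lookalike could still be registered, but only by whoever already holds the name it collides with.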

Sadly, it seems that .com in particular doesn't care and it seems that
ICANN doesn't care to require this kind of thing. Solving this would
require putting security ahead of greed at the ICANN level, and the
financial incentives (ability to make money by allowing confusing
names to be sold) go against security.

(I think ICANN's failure to solve this doesn't mean that CA policy
should try to solve this.)

--
Henri Sivonen
hsiv...@hsivonen.fi
https://hsivonen.fi/

Daniel Veditz

Apr 18, 2017, 11:27:53 AM

to Igor Bukanov, Martin Heaps, Martin Thomson, mozilla-de...@lists.mozilla.org

On Tue, Apr 18, 2017 at 3:03 AM, Igor Bukanov <ig...@mir2.org> wrote:

> So as a simple heuristic a browser can show
> the domain as xn-- if it encodes a name that does not require xn--
> encoding.

Are there no legitimate Russian words made only of the 11 or so
letters that look like latin script? Should we tell Russians (and
Ukrainians, Bulgarians, Kazakhs, etc) to get the hell off our American
Internet? Should the browser ship with dictionaries of words in those
languages (or a subset using only the confusable characters) and check that
a cyrillic domain name maps to a legit word? What about non-word brand
names? Or the other way checking against English: we could map 'epic' and
'apple' to latin script and find those in an English dictionary. But what
about 'plaece.com', a new hypothetical social brand? Or what if it's a
cyrillic German or French word?

> Of course that does not help when somebody uses
> xn--eic-0ed.com which is eрic.com with the Cyrillic letter "р", but
> that is a different and indeed hard-to-solve problem with homographs
> in Unicode.
>

That one is the easy case: mixing scripts is not allowed. It's extremely
unlikely to be legitimate (the toys-Я-us brand notwithstanding) and that
domain will only be shown in the punycode form.
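A rough Python sketch of the mixed-script rule follows. It rests on a loud assumption: the standard library has no Unicode Script property, so the first word of each character's name ("LATIN", "CYRILLIC", ...) stands in for the script. The helper names `scripts_in` and `display_form` are made up for this example, and the real display algorithm is more nuanced than "more than one script means punycode".

```python
import unicodedata

def scripts_in(label: str) -> set:
    """Approximate the set of scripts in a label from Unicode character names.

    Assumption: the first word of the character name is used as the script.
    Digits and hyphens are treated as script-neutral.
    """
    found = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue
        found.add(unicodedata.name(ch).split()[0])
    return found

def display_form(label: str) -> str:
    """Show mixed-script labels in their xn-- form, everything else as-is."""
    if len(scripts_in(label)) > 1:
        return label.encode("idna").decode("ascii")
    return label

mixed = "e\u0440ic"                 # Latin e, i, c with Cyrillic р
whole = "\u0435\u0440\u0456\u0441"  # every letter Cyrillic

print(scripts_in(mixed))
print(display_form(mixed))           # falls back to the xn-- form
print(display_form(whole) == whole)  # single script, so rendered as Unicode
```

The all-Cyrillic label passes this check untouched, which is why a mixed-script rule alone does not catch the case that started this thread.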

-
Dan Veditz

Boris Zbarsky

Apr 18, 2017, 11:55:05 AM

to mozilla-de...@lists.mozilla.org

On 4/18/17 6:03 AM, Igor Bukanov wrote:
> There is a difference between domains for non-latin scripts and an
> xn-- domain that represents a name that can also be written using
> plain latin letters. It is just too bad that registrars are allowed to
> issue such certificates. So as a simple heuristic a browser can show
> the domain as xn-- if it encodes a name that does not require xn--
> encoding.

Note that this would not have helped with the "epic.com" case, because
the actual Unicode characters involved were not ASCII and hence did need
the xn-- encoding.
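This point can be checked with a short Python sketch, again assuming the standard library's IDNA 2003 `idna` codec; the function name `needs_xn_encoding` is invented for the illustration.

```python
def needs_xn_encoding(a_label: str) -> bool:
    """True if the decoded form of the label contains any non-ASCII character."""
    decoded = a_label.encode("ascii").decode("idna")
    return not decoded.isascii()

print(needs_xn_encoding("xn--e1awd7f"))  # the all-Cyrillic lookalike
print(needs_xn_encoding("xn--eic-0ed"))  # the mixed Latin/Cyrillic variant
print(needs_xn_encoding("epic"))         # the plain Latin label
```

Both lookalike labels decode to non-ASCII text, so a "does it really need xn--?" test cannot tell them apart from legitimate internationalized names.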

-Boris

Eli the Bearded

Apr 18, 2017, 3:58:30 PM

to mozilla-de...@lists.mozilla.org

In mozilla.dev.security, Daniel Veditz <dve...@mozilla.com> wrote:
> > So as a simple heuristic a browser can show
> > the domain as xn-- if it encodes a name that does not require xn--
> > encoding.
>
> Are there no legitimate Russian words made only of the 11 or so
> letters that look like latin script?

The original example was one.

U+0435 е CYRILLIC SMALL LETTER IE
U+0440 р CYRILLIC SMALL LETTER ER
U+0456 і CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
U+0441 с CYRILLIC SMALL LETTER ES

Which is a given name. It's better than the old ѕсоре example, which
doesn't seem to have any Russian look-a-likes.
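A listing like the one above can be reproduced with Python's `unicodedata` module. This is just a small sketch; the helper name `describe` is made up, and the label is decoded with the standard library's IDNA 2003 codec.

```python
import unicodedata

def describe(label: str) -> list:
    """Return one 'U+XXXX NAME' line per character in the label."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in label]

# Decode the xn-- label from the original report and list its characters.
for line in describe(b"xn--e1awd7f".decode("idna")):
    print(line)
```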

Elijah
------
has seen ѕсоре.com next to рaypal.com in IDN warning documents

Kai Engert

Apr 18, 2017, 4:13:29 PM

to mozilla-de...@lists.mozilla.org

Trying to come up with some ideas, without knowing the history of ideas already
discussed.

Could the browser UI always display the xn-- encoding in addition, if it isn't
plain ASCII, like:
URL bar showing: https:// www. еріс .com (xn--e1awd7f.com)
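As a sketch of what producing such a combined string might involve, the following Python uses the standard library's IDNA 2003 `idna` codec. The function name `annotate_host` is hypothetical and no real browser API is implied.

```python
def annotate_host(host: str) -> str:
    """Return the rendered host, followed by its xn-- form when they differ."""
    # Accept either form as input and derive both representations from it.
    unicode_host = host.encode("ascii").decode("idna") if host.isascii() else host
    ascii_host = unicode_host.encode("idna").decode("ascii")
    if unicode_host == ascii_host:
        return unicode_host  # plain ASCII name: nothing to add
    return f"{unicode_host} ({ascii_host})"

print(annotate_host("www.xn--e1awd7f.com"))  # Cyrillic rendering plus xn-- form
print(annotate_host("www.example.com"))      # unchanged
```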

If there's worry that the hostname could be very long, longer than the user's
display, which results in the trailing (xn--) not being inside the visible
display, maybe the URL bar display could alternate between showing the nicely
rendered hostname, and the xn-- name, for 2 seconds each?

Alternatively, could the URL bar show the name of the identified script?

https:// www. еріс .com (cyrillic)

Maybe "cyrillic" and the subset of the letters in that script could be displayed
using highlighting, like a different color (same color for the word cyrillic and
the cyrillic letters), or underlining?

https:// www. epic .com (cyrillic)
              ----       --------

How about indicating the script as part of the protocol? Absent means
latin/ascii.

https-cyrillic:// www. epic .com

Kai

Kai Engert

Apr 18, 2017, 4:27:52 PM

to mozilla-de...@lists.mozilla.org

How the ideas could look with URLs that include a path:

On Tue, 2017-04-18 at 22:13 +0200, Kai Engert wrote:
> Trying to come up with some ideas, without knowing the history of ideas
> already
> discussed.
>
> Could the browser UI always display the xn-- encoding in addition, if it isn't
> plain ASCII, like:
> URL bar showing: https:// www. еріс .com (xn--e1awd7f.com)

https:// www. еріс .com (xn--e1awd7f.com) /more/of/url.html

> If there's worry that the hostname could be very long, longer than the user's
> display, which results in the trailing (xn--) not being inside the visible
> display, maybe the URL bar display could alternate between showing the nicely
> rendered hostname, and the xn-- name, for 2 seconds each?
>
> Alternatively, could the URL bar show the name of the identified script?
>
> https:// www. еріс .com (cyrillic)
> Maybe "cyrillic" and the subset of the letters in that script could be
> displayed
> using highlighting, like a different color (same color for the word cyrillic
> and
> the cyrillic letters), or underlining?
>
> https:// www. epic .com (cyrillic)
>               ----       --------

https:// www. epic .com (cyrillic) /more/of/url.html
              ----       --------

> How about indicating the script as part of the protocol? Absent means
> latin/ascii.
>
> https-cyrillic:// www. epic .com

https-cyrillic:// www. epic .com /more/of/url.html

Kai Engert

Apr 18, 2017, 4:40:07 PM

to mozilla-de...@lists.mozilla.org

Another idea, inspired by the UI for sites having an extended validation (EV)
certificate:

If the URL isn't plain ASCII, but a better rendering is available, then display
the better rendering to the left of the URL bar, but display the URL itself in
plain xn-- style.

For an extended validation site that used non-ascii characters, the URL bar
could look like this:

[lock] [company name(country)] [cyrillic: epic.com] https://xn--e1awd7f.com/

non-EV:

[lock] [cyrillic: epic.com] https://xn--e1awd7f.com/

Highlighting could be used to make it likely that [cyrillic epic.com] is
properly noticed.

Maybe the extra term "cyrillic" isn't necessary, when this approach is used.

Given that the location to the left of the URL is already being used to display
browser determined/approved information, this could be seen as the appropriate
place to display pretty hostname renderings.

Given that the https:// location is a "technical" string, and given that the
underlying standards allow ascii characters here, only, maybe this could be an
acceptable compromise?

Kai

Kyle Hamilton

Apr 18, 2017, 4:52:57 PM

to Kai Engert, mozilla-de...@lists.mozilla.org

Why not display the encoding of the host name in the area with the
security indicator, to the left of the address bar? That is the area
where security-related displays are already made.

-Kyle H

On Tue, Apr 18, 2017 at 1:13 PM, Kai Engert <ka...@kuix.de> wrote:
> Trying to come up with some ideas, without knowing the history of ideas already
> discussed.
>
> Could the browser UI always display the xn-- encoding in addition, if it isn't
> plain ASCII, like:
> URL bar showing: https:// www. еріс .com (xn--e1awd7f.com)
>
> If there's worry that the hostname could be very long, longer than the user's
> display, which results in the trailing (xn--) not being inside the visible
> display, maybe the URL bar display could alternate between showing the nicely
> rendered hostname, and the xn-- name, for 2 seconds each?
>
> Alternatively, could the URL bar show the name of the identified script?
>
> https:// www. еріс .com (cyrillic)
>
> Maybe "cyrillic" and the subset of the letters in that script could be displayed
> using highlighting, like a different color (same color for the word cyrillic and
> the cyrillic letters), or underlining?
>
> https:// www. epic .com (cyrillic)
>               ----       --------
>
> How about indicating the script as part of the protocol? Absent means
> latin/ascii.
>
> https-cyrillic:// www. epic .com
>
> Kai

Boris Zbarsky

Apr 18, 2017, 5:01:02 PM

to mozilla-de...@lists.mozilla.org

On 4/18/17 4:52 PM, Kyle Hamilton wrote:
> Why not display the encoding of the host name in the area with the
> security indicator, to the left of the address bar? That is the area
> where security-related displays are already made.

Here's a question that I think any proposal here should be able to
address: Given two strings of glyphs, both non-ASCII, which are
homographs of each other, does the proposed solution help?

I think showing two different blobs of "xn--****" noise does not help
distinguish cases like that, unless someone is doing a careful
side-by-side comparison.

-Boris

Kai Engert

Apr 18, 2017, 5:33:24 PM

to dev-se...@lists.mozilla.org, Boris Zbarsky, mozilla-de...@lists.mozilla.org

On 18 April 2017 23:00:24 GMT+02:00, Boris Zbarsky <bzba...@mit.edu> wrote:
>Here's a question that I think any proposal here should be able to
>address: Given two strings of glyphs, both non-ASCII, which are
>homographs of each other, does the proposed solution help?
>
>I think showing two different blobs of "xn--****" noise does not help
>distinguish cases like that, unless someone is doing a careful
>side-by-side comparison.

IIUC, Dan said, mixed scripts are always shown as xn--, never rendered.

I think that means that if there are two different domains that are rendered similarly, they must originate from two different scripts.

If correct, then displaying the name of the script next to the rendering (cyrillic: epic.com) could be sufficient to allow them to be distinguished, without requiring comparison of the xn-- strings?

Kai

Boris Zbarsky

Apr 18, 2017, 5:36:24 PM

to mozilla-de...@lists.mozilla.org

On 4/18/17 5:32 PM, Kai Engert wrote:
> IIUC, Dan said, mixed scripts are always shown as xn--, never rendered.

Who said anything about mixed scripts?

> I think that means, if there are two different domains, that are rendered similarly, must originate from two different scripts.

Yes, correct. Like the cyrillic/latin case here.

> If correct, then displaying the name of the script next to the rendering (cyrillic: epic.com) could be sufficient to allow them to be distinguished

To be clear, that wasn't the proposal I was replying to.

Will we display "Latin:" next to every non-punycode domain, so we're not
making non-English-speakers second-class citizens?

-Boris

Kai Engert

Apr 18, 2017, 5:41:46 PM

to dev-se...@lists.mozilla.org, Boris Zbarsky, mozilla-de...@lists.mozilla.org

On 18 April 2017 23:32:44 GMT+02:00, Kai Engert <ka...@kuix.de> wrote:
>On 18 April 2017 23:00:24 GMT+02:00, Boris Zbarsky <bzba...@mit.edu> wrote:
>>Here's a question that I think any proposal here should be able to
>>address: Given two strings of glyphs, both non-ASCII, which are
>>homographs of each other, does the proposed solution help?
>>
>>I think showing two different blobs of "xn--****" noise does not help
>>distinguish cases like that, unless someone is doing a careful
>>side-by-side comparison.
>
>IIUC, Dan said, mixed scripts are always shown as xn--, never rendered.
>
>I think that means, if there are two different domains, that are
>rendered similarly, must originate from two different scripts.
>
>If correct, then displaying the name of the script next to the
>rendering (cyrillic: epic.com) could be sufficient to allow them to be
>distinguished, without requiring comparison of the xn-- strings?
>
>Kai

But it would require that users notice that the shown script name isn't the expected one.

Could the browser use the configured default language to know the expected usual script, and use special highlighting (looking like a warning) whenever the domain uses a non-matching script?

A user having configured Russian as their language would see the term cyrillic in neutral display. Users not using a Cyrillic-script language would see the term cyrillic with visually emphasized highlighting.
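The idea could be sketched as below in Python. Everything here is hypothetical: the `EXPECTED_SCRIPT` table, the `script_indicator` function and the "neutral"/"warning" style names are invented for illustration, and a real browser would have far more locales and a proper script lookup.

```python
# Hypothetical mapping from a configured UI language to its usual script.
EXPECTED_SCRIPT = {
    "en": "LATIN",
    "de": "LATIN",
    "ru": "CYRILLIC",
    "uk": "CYRILLIC",
}

def script_indicator(label_script: str, ui_language: str) -> str:
    """Return the script tag to show, styled by whether it matches the locale."""
    expected = EXPECTED_SCRIPT.get(ui_language)
    style = "neutral" if label_script == expected else "warning"
    return f"[{style}] {label_script.lower()}"

print(script_indicator("CYRILLIC", "ru"))  # matches the user's language
print(script_indicator("CYRILLIC", "en"))  # unexpected script for this user
```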

Kai

Kai Engert

Apr 18, 2017, 5:55:13 PM

to dev-se...@lists.mozilla.org, Boris Zbarsky, mozilla-de...@lists.mozilla.org

On 18 April 2017 23:35:46 GMT+02:00, Boris Zbarsky <bzba...@mit.edu> wrote:
>Will we display "Latin:" next to every non-punycode domain, so we're not
>making non-English-speakers second-class citizens?

Maybe it's sufficient to show Latin: for users having configured a default language that isn't based on latin characters.

Kai

Kai Engert

Apr 18, 2017, 6:01:39 PM

to dev-se...@lists.mozilla.org, Boris Zbarsky, mozilla-de...@lists.mozilla.org

On 18 April 2017 23:35:46 GMT+02:00, Boris Zbarsky <bzba...@mit.edu> wrote:
>On 4/18/17 5:32 PM, Kai Engert wrote:
>> IIUC, Dan said, mixed scripts are always shown as xn--, never rendered.
>
>Who said anything about mixed scripts?

Dan Veditz did.

I repeated his statement to support my conclusion.

Kai

L. David Baron

Apr 18, 2017, 6:58:31 PM

to Gervase Markham, mozilla-de...@lists.mozilla.org

On Tuesday 2017-04-18 10:29 +0100, Gervase Markham wrote:
> Neither browsers nor CAs have a database of all domain names, such that
> they can see that one is visually confusable with another. Registries
> have this data, and it is their responsibility to deal with this problem.

So we used to have a whitelist of registries that had sensible
policies for dealing with this, but we stopped using it in
https://bugzilla.mozilla.org/show_bug.cgi?id=843689 .

Should we enable the whitelist approach again?

(One of the big issues with it was that some of the most prominent
domains, like .com, had policies that we saw as unacceptable.)

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)


Kyle Hamilton

Apr 18, 2017, 9:13:43 PM

to L. David Baron, Gervase Markham, mozilla-de...@lists.mozilla.org

How did the algorithm in
https://bugzilla.mozilla.org/show_bug.cgi?id=722299 (which points to
https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm ) fail to
help in this instance?

Are there other instances in which it could be expected to fail?

If there are, the hypothesis set forth in
https://bugzilla.mozilla.org/show_bug.cgi?id=843689 (that the new IDN
display algorithm was sufficient enough to prevent IDN weirdnesses
that the whitelist could be removed) is shown to be false, and Mozilla
needs to either find a better solution, or go back to the
whitelist.

-Kyle H

Boris Zbarsky

Apr 18, 2017, 9:26:12 PM

to mozilla-de...@lists.mozilla.org

On 4/18/17 9:13 PM, Kyle Hamilton wrote:
> How did the algorithm in
> https://bugzilla.mozilla.org/show_bug.cgi?id=722299 (which points to
> https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm ) fail to
> help in this instance?

All the characters are from a single script.

-Boris

Igor Bukanov

Apr 19, 2017, 3:39:08 AM

to Daniel Veditz, Martin Heaps, Martin Thomson, mozilla-de...@lists.mozilla.org

On 18 April 2017 at 17:27, Daniel Veditz <dve...@mozilla.com> wrote:
> Are there no legitimate Russian words made only of the 11 or so letters
> that look like latin script?

If mixed scripts are not allowed, then the browser should show the
type of language of the script, perhaps using native abbreviations,
like Lat for Latin, Кир for Cyrillic etc. That should be sufficient to
cover this case.

Anne van Kesteren

Apr 19, 2017, 3:43:12 AM

to Igor Bukanov, Martin Thomson, Martin Heaps, mozilla-de...@lists.mozilla.org, Daniel Veditz

On Wed, Apr 19, 2017 at 9:38 AM, Igor Bukanov <ig...@mir2.org> wrote:
> If mixed scripts are not allowed, then the browser should show the
> type of language of the script, perhaps using native abbreviations,
> like Lat for Latin, Кир for Cyrillic etc. That should be sufficient to
> cover this case.

No, we should strive to offer less and simpler UI to users, not more,
and certainly not something as complicated as that.

--
https://annevankesteren.nl/

Igor Bukanov

Apr 19, 2017, 3:48:04 AM

to Kai Engert, Boris Zbarsky, dev-se...@lists.mozilla.org, mozilla-de...@lists.mozilla.org

On 18 April 2017 at 23:41, Kai Engert <ka...@kuix.de> wrote:
> Could the browser use the configured default language, to know the expected usual script, and use special highlighting (looking like a warning) whenever the domain uses a non-matching script?

The default language does not work for countries using Cyrillic
script. The vast majority of domains there are in Latin. That makes
phishing attacks more effective as domains are not expected to be typed
at all. They either come from search engines or links in email or
social media.

Gervase Markham

Apr 19, 2017, 7:37:53 AM

to L. David Baron

On 18/04/17 14:35, L. David Baron wrote:
> On Tuesday 2017-04-18 10:29 +0100, Gervase Markham wrote:
>> Neither browsers nor CAs have a database of all domain names, such that
>> they can see that one is visually confusable with another. Registries
>> have this data, and it is their responsibility to deal with this problem.
>
> So we used to have a whitelist of registries that had sensible
> policies for dealing with this, but we stopped using it in
> https://bugzilla.mozilla.org/show_bug.cgi?id=843689 .
>
> Should we enable the whitelist approach again?

The reason we stopped using a whitelist is that it didn't scale when the
gTLD explosion happened. This is still the case - no-one has the time or
energy to keep track of 1000+ anti-spoofing policies.

We could perhaps have a blacklist, but see below.

> (One of the big issues with it was that some of the most prominent
> domains, like .com, had policies that we saw as unacceptable.)

The new mechanism has the advantage of allowing IDN domains in .com,
which users both want and use. If we returned to a white or blacklist,
would we whitelist .com? If so, we are no further forward. If not, we
break all those domains.

The newly-authored https://wiki.mozilla.org/IDN_Display_Algorithm_FAQ
sets out the position in what is hopefully a clear fashion.

Gerv

Gervase Markham

Apr 19, 2017, 7:41:39 AM

to Kyle Hamilton, L. David Baron

On 19/04/17 02:13, Kyle Hamilton wrote:
> How did the algorithm in
> https://bugzilla.mozilla.org/show_bug.cgi?id=722299 (which points to
> https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm ) fail to
> help in this instance?

Because it is a known issue that it does not deal with whole-script
confusables. This was documented at the time we adopted it - see:
https://wiki.mozilla.org/IDN_Display_Algorithm#Downsides

> Are there other instances in which it could be expected to fail?

No.

> If there are, the hypothesis set forth in
> https://bugzilla.mozilla.org/show_bug.cgi?id=843689 (that the new IDN
> display algorithm was sufficient enough to prevent IDN weirdnesses
> that the whitelist could be removed) is shown to be false, and Mozilla
> needs to either find a better solution, or go back to the
> whitelist.

That was not the hypothesis. As noted above, this edge case was a known
and accepted part of the solution, because all of the alternatives are
worse.

The argument is that the browser only has sufficient knowledge to solve
a part of this problem; we can't solve the entire thing using an
algorithm without privileging some scripts over others, which is not an
appropriate action for an organization which believes in a truly World
Wide web. Fixing whole-script spoofing is the responsibility of those
who have databases of all the existing registrations - i.e. registries.

See https://wiki.mozilla.org/IDN_Display_Algorithm_FAQ for more details.

Gerv

Martin Heaps

Apr 19, 2017, 9:45:33 AM

to mozilla-de...@lists.mozilla.org

I have been away for a few days; otherwise I would have added the clarification below earlier:

My issue is NOT with the character sets used or the ambiguity of these character sets, but with the browsers' complete inability to tell the human user that epic.com !== epic.com at any point short of opening up and reading the core TLS certificate itself.

Reading the certificate is relatively simple (albeit 3-4 clicks) in Firefox, but it is hidden away in Google Chrome, where the user needs to know where to find the certificate to reach it, rather than just exploring and clicking suitable-looking buttons (as seems to be the case with Firefox).

Some examples using the epic.com domain name follow below, showing that ALL views of the security of the website, short of viewing the full certificate, display the "output" name rather than the "raw" name of the website.

THIS is the issue I am raising, and I feel it should be fixed by browsers, by certificate providers and all parties in between.

I have NO issue with the character set of the certificate or the character set of the URL. This does not need the data held by registrars, just a patch to browsers to note that:

- A certificate for "xn--e1awd7f.com" can be interpreted as "epic.com"

Some screen shots of the core issue, please review:

https://www.imageupload.co.uk/images/2017/04/19/chrome-epic-dot-com.png

https://www.imageupload.co.uk/images/2017/04/19/chrome-epic-dot-com2.png

https://www.imageupload.co.uk/images/2017/04/19/Firefox-epic-dot-com.png

https://www.imageupload.co.uk/images/2017/04/19/Firefox-epic-dot-com3.png

https://www.imageupload.co.uk/images/2017/04/19/Firefox-epic-dot-com2.png

Thanks

Martin Heaps

Apr 19, 2017, 9:56:20 AM

to mozilla-de...@lists.mozilla.org

A possible solution that I envisage would have the following practical effect:

If the name can appear valid in another character set (as xn--e1awd7f.com can be epic.com, and Russian characters can stand in for Latin characters, etc.), the specific name the certificate is for should be detailed in the browser, as illustrated in my edited screenshots below.

Please review this possible approach in clarifying the exact nature of which URL is being certified by a TLS certificate.

https://www.imageupload.co.uk/images/2017/04/19/certificated_example2.png

https://www.imageupload.co.uk/images/2017/04/19/certificated_example.png

Cheers

Craig Francis

Apr 19, 2017, 1:28:21 PM

to Igor Bukanov, Boris Zbarsky, dev-se...@lists.mozilla.org, Kai Engert, mozilla-de...@lists.mozilla.org

For those who use Latin characters "most of the time" (US, UK, etc), then why not apply a highlight to any non-Latin characters? i.e. characters you would not expect to see normally.

As per the screenshot attached, or if attachments get removed, at this URL:

https://www.krang.org.uk/misc/unicode-domain.jpg

Notes:

- I only highlighted the first character, in this case it should have been the whole word.

- I am a little unsure about this approach from an accessibility point of view (which might not be as much of an issue for screen readers, e.g. VoiceOver says something like "yeris dot com").

- This does not consider that Chinese and Spanish are the most spoken languages.

> On 19 Apr 2017, at 08:47, Igor Bukanov <ig...@mir2.org> wrote:
>
> On 18 April 2017 at 23:41, Kai Engert <ka...@kuix.de> wrote:
>> Could the browser use the configured default language, to know the expected usual script, and use special highlighting (looking like a warning) whenever the domain uses a non-matching script?
>
> The default language does not work for countries using Cyrillic
> script. The vast majority of domains there are in Latin. That makes
> phishing attacks more effective as domains are not expected to be typed
> at all. They either come from search engines or links in email or
> social media.

Justin Dolske

Apr 19, 2017, 8:40:51 PM

to mozilla-de...@lists.mozilla.org

On 4/18/17 2:29 AM, Gervase Markham wrote:

>> The word "fake" is probably an imprecise word to use, but in the
>> context that the domain name as registered is purporting to be
>> another domain name it is not; this is fake. The SSL provider;
>> LetsEncrypt in this case seems to not be able to ensure with browsers
>> that there is a clear distinction between the two names of the domain
>> certified by the CA.

>
> Neither browsers nor CAs have a database of all domain names, such that
> they can see that one is visually confusable with another. Registries
> have this data, and it is their responsibility to deal with this problem.

[As much as I hate to wade into this...]

Hmm. One thing browsers do have is the user's browsing history.

Half-baked thought for an imperfect mitigation:

When visiting a page, compute the normalized version of the domain, and see if
that exists as a history entry. If the entry exists, display the
punycode version of the domain. Otherwise, display the unicode version
of the domain.

That makes it more difficult to trick an existing user of a site. If
I've previously visited epic.com (ascii), visiting xn--e1awd7f.com will
show xn--e1awd7f.com instead of еріс.com (cyrillic). But does nothing
for attacks against domains a user might recognize but hasn't visited.

To handle bz's case of two non-ascii domains that are homographs of each
other, I think you'd need to store the normalized domain in history too?

Bah, but visiting one such homograph would then cause both to display as
punycode (as both history entries exist). So the history check would
need to be a little more complex, to see which is the oldest site in the
user's history. (If the oldest site is the fake site, the user is
screwed for a while.)

So, I dunno. I'm sure there are other issues too. But maybe some kind of
imperfect mitigation (perhaps not this one) is better than nothing?

Justin
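
The history check described in the post above can be written down roughly as follows. This is a minimal Python sketch under stated assumptions, not anything Firefox implements: `CONFUSABLE_TO_LATIN` is a hand-picked, hypothetical subset of Unicode's confusables data, `to_unicode`, `skeleton` and `display_host` are made-up names, and the browsing history is assumed to be a plain set of hostnames.

```python
# Hypothetical, hand-picked subset of Cyrillic -> Latin confusable pairs.
# A real implementation would load Unicode's confusables.txt instead.
CONFUSABLE_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}


def to_unicode(label):
    """Decode one DNS label from its xn-- (punycode) form, if it has one."""
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def skeleton(label):
    """Fold confusable characters to the Latin letters they resemble."""
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in label)


def display_host(host, history):
    """Pick the form of `host` (given in ASCII/punycode form) to display."""
    labels = [to_unicode(label) for label in host.lower().split(".")]
    unicode_host = ".".join(labels)
    normalized = ".".join(skeleton(label) for label in labels)
    # Show punycode only when the look-alike is a *different* host
    # that the user has already visited.
    if normalized != unicode_host and normalized in history:
        return host.lower()
    return unicode_host
```

With `{"epic.com"}` as the history, `display_host("xn--e1awd7f.com", ...)` keeps the punycode form; with an empty history it returns the Cyrillic spelling. The case of two non-ASCII homographs (deciding which history entry is older) is not handled in this sketch.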

Boris Zbarsky

Apr 19, 2017, 9:57:31 PM

to mozilla-de...@lists.mozilla.org

On 4/19/17 8:40 PM, Justin Dolske wrote:
> When visiting a page, compute the normalized version of the domain

Normalized in what sense?

-Boris

Gervase Markham

Apr 20, 2017, 5:55:28 AM

to mozilla-de...@lists.mozilla.org

Presumably, in a "fold it to a canonical form using (the subset used in
IDN of) ftp://ftp.unicode.org/Public/security/latest/confusables.txt" sense.

Gerv

Eli the Bearded

Apr 20, 2017, 6:55:16 PM

to mozilla-de...@lists.mozilla.org

In mozilla.dev.security, Justin Dolske <dol...@mozilla.com> wrote:
> [As much as I hate to wade into this...]
>
> Hmm. One thing browsers do have is the user's browsing history.

Objection. Configuration to not record history is trivial, and even
if not configured such, some confusables could easily be sites that
the user doesn't visit often enough to have in history.

> Half-baked thought for an imperfect mitigation:
>
> When visiting a page, compute the normalized version of the domain, and see if
> that exists as a history entry. If the entry exists, display the
> punycode version of the domain. Otherwise, display the unicode version
> of the domain.

More baked: Using the confusables list from Unicode, if a domain label
consists entirely of letters in one script that are "confusable" to
another (single) script, start raising red flags.

Probably special case things that can be confused with a FULL STOP for
attacks that attempt to just confuse part of the DNS name.

Elijah
------
has not checked to see what can be confused with a FULL STOP

Justin Dolske

Apr 20, 2017, 8:00:23 PM

to mozilla-de...@lists.mozilla.org

Correct. To be more concrete, with the simple case of apple.com (real)
vs аррӏе.com (spoof): both "normalize" to apple.com (plain ascii). If
that's in the user's history, the spoofed domain would be displayed as
xn--80ak6aa92e.com.

Justin

Justin Dolske

Apr 20, 2017, 8:38:04 PM

to mozilla-de...@lists.mozilla.org

On 4/20/17 3:54 PM, Eli the Bearded wrote:
> In mozilla.dev.security, Justin Dolske <dol...@mozilla.com> wrote:
>> [As much as I hate to wade into this...]
>>
>> Hmm. One thing browsers do have is the user's browsing history.
>
> Objection. Configuration to not record history is trivial, and even
> if not configured such, some confusables could easily be sites that
> the user doesn't visit often enough to have in history.

Yep.

I don't think "browsing history disabled" is necessarily common enough
to worry about (for an imperfect mitigation), but in any case
not-yet-visited is certainly an issue. Hence, again, "imperfect
mitigation". :-)

> More baked: Using the confusables list from Unicode, if a domain label
> consists entirely of letters in one script that are "confusable" to
> another (single) script, start raising red flags.

Sure, but the angle I found interesting here was to make a guess as to
which one is the legitimate site for the user, based solely on local
user data and avoiding favoring a particular script.

Justin

Kyle Hamilton

Apr 20, 2017, 9:36:32 PM

to Gervase Markham, L. David Baron, mozilla-de...@lists.mozilla.org

Perhaps, only display non-punycode from codepoint sets used in
languages already installed on the computer?

i.e., if the Russian language is installed on the computer, it might
be a strong indicator that Cyrillic codepoints should be shown as
Cyrillic. Otherwise, it's someone who probably can't even read it,
and so the commitment to displaying non-punycode probably can only be
damaging.

-Kyle H

On Wed, Apr 19, 2017 at 4:40 AM, Gervase Markham <ge...@mozilla.org> wrote:
> On 19/04/17 02:13, Kyle Hamilton wrote:

>> How did the algorithm in
>> https://bugzilla.mozilla.org/show_bug.cgi?id=722299 (which points to
>> https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm ) fail to
>> help in this instance?
>

> Because it is a known issue that it does not deal with whole-script
> confusables. This was documented at the time we adopted it - see:
> https://wiki.mozilla.org/IDN_Display_Algorithm#Downsides
>

>> Are there other instances in which it could be expected to fail?
>

> No.

>
>> If there are, the hypothesis set forth in
>> https://bugzilla.mozilla.org/show_bug.cgi?id=843689 (that the new IDN
>> display algorithm was sufficient to prevent IDN weirdnesses, so
>> that the whitelist could be removed) is shown to be false, and Mozilla
>> needs to either find a better solution, or go back to the
>> whitelist.
>

Eli the Bearded

Apr 20, 2017, 11:00:50 PM

to mozilla-de...@lists.mozilla.org

In mozilla.dev.security, Justin Dolske <dol...@mozilla.com> wrote:
> On 4/20/17 3:54 PM, Eli the Bearded wrote:
>> Objection. Configuration to not record history is trivial, and even

> Yep.
> I don't think "browsing history disabled" is necessarily common enough

Common or not, it's broken to ignore that case.

>> More baked: Using the confusables list from Unicode, if a domain label
>> consists entirely of letters in one script that are "confusable" to
>> another (single) script, start raising red flags.
> Sure, but the angle I found interesting here was to make a guess as to
> which one is the legitimate site for the user, based solely on local
> user data and avoiding favoring a particular script.

I'm not proposing favoring any particular script, just highlight to the
user that an IDN is composed entirely of confusables to a single
different script. There may be false positives, particularly on short
hostnames, but I suspect that will be unlikely in practice.

This site, https://www.xn--80ak6aa92e.com/, uses the Cyrillic
alphabet to create a URL that resembles the Latin alphabet
"www.apple.com". Do you wish to continue?

Elijah
------
bonus for defaulting to "No"

Boris Zbarsky

Apr 20, 2017, 11:55:46 PM

to mozilla-de...@lists.mozilla.org

On 4/20/17 6:54 PM, Eli the Bearded wrote:
> More baked: Using the confusables list from Unicode, if a domain label
> consists entirely of letters in one script that are "confusable" to
> another (single) script, start raising red flags.

So just to be clear, per that proposal we should be raising red flags on
the "real" epic.com and keep.com and so forth, right?

-Boris

Gervase Markham

Apr 21, 2017, 6:26:50 AM

to mozilla-de...@lists.mozilla.org

On 20/04/17 23:54, Eli the Bearded wrote:
> More baked: Using the confusables list from Unicode, if a domain label
> consists entirely of letters in one script that are "confusable" to
> another (single) script, start raising red flags.

So we raise a red flag on http://apple.com, the site for the largest
tech company in the world?

Or did you mean "if a domain label consists entirely of letters in one
non-Latin script..."? If so, we've just got back to a situation of
privileging one script over another.

Gerv

Gervase Markham

Apr 21, 2017, 6:28:41 AM

to Kyle Hamilton, L. David Baron

On 21/04/17 02:36, Kyle Hamilton wrote:
> Perhaps, only display non-punycode from codepoint sets used in
> languages already installed on the computer?

Before we do another canter through the six different ideas that always
occur to people when first presented with this issue, the very unofficial
https://wiki.mozilla.org/Gerv%27s_IDN_Display_Algorithm_FAQ
might help shortcut the process. In this case, questions 7-9 are
relevant, as your proposal is a variant of the one those address.

Gerv

Igor Bukanov

Apr 21, 2017, 7:05:58 AM

to Eli the Bearded, mozilla-de...@lists.mozilla.org

On 21 April 2017 at 00:54, Eli the Bearded <*@eli.users.panix.com> wrote:
> Objection. Configuration to not record history is trivial, and even

> if not configured such, some confusables could easily be sites that
> the user doesn't visit often enough to have in history.

Still, the history is too valuable to dismiss just because it does not
work in all cases. At the very least, in a typical case the history
shows which scripts the user has ever visited per top-level domain,
which allows flagging an unexpected script. For example, personally I
would be very suspicious of a Cyrillic .com domain, as those are awkward
to type (one has to change the keyboard layout in the middle). If a site
needs a Russian domain, I expect it to use .рф, not .com.

> More baked: Using the confusables list from Unicode, if a domain label
> consists entirely of letters in one script that are "confusable" to
> another (single) script, start raising red flags.

The problem is that, at least for Russian and English, words
consisting only of confusable characters are frequent enough to bring
way too many false positives for this to be usable.
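
The per-TLD script bookkeeping mentioned in this post could look something like the sketch below. It is only an illustration under assumptions: the function names are invented, the history is taken to be a list of hostnames, and a character's "script" is approximated by the first word of its Unicode name (a heuristic, not the real Unicode Script property).

```python
import unicodedata
from collections import defaultdict


def to_unicode(label):
    """Decode one DNS label from its xn-- (punycode) form, if it has one."""
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def label_scripts(label):
    """Approximate the scripts in a label via Unicode character names."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in to_unicode(label)
        if ch.isalpha()
    }


def scripts_seen_per_tld(history_hosts):
    """Map each TLD in the history to the scripts seen to its left."""
    seen = defaultdict(set)
    for host in history_hosts:
        *labels, tld = host.lower().split(".")
        for label in labels:
            seen[tld] |= label_scripts(label)
    return seen


def is_unexpected(host, seen):
    """True if `host` uses a script never seen before under its TLD."""
    *labels, tld = host.lower().split(".")
    if tld not in seen:
        return False  # no history for this TLD, so no basis for a flag
    scripts = set()
    for label in labels:
        scripts |= label_scripts(label)
    return bool(scripts - seen[tld])
```

A user whose .com history is all Latin would get a flag on `xn--e1awd7f.com`, while a TLD that never appears in the history produces no flag at all, which is one of the gaps raised in the objection quoted above.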

Chaddaï Fouché

Apr 21, 2017, 1:25:36 PM

to mozilla-de...@lists.mozilla.org

Le vendredi 21 avril 2017 12:28:41 UTC+2, Gervase Markham a écrit :
>
> Before we do another canter through the six different ideas that always
> occur to people when first presented with this issue, the very unofficial
> https://wiki.mozilla.org/Gerv%27s_IDN_Display_Algorithm_FAQ
> might help shortcut the process.

The initial proposal by Igor Bukanov (that every domain name be preceded by an icon indicating the writing system, yes, even the Latin ones) seems completely neutral regarding non-Latin languages and not very obtrusive.

I'm pretty sure most non-technical people don't even look at the url bar and don't care about the icons that already appear there : info, lock, read mode, zooming state, reload. Adding just one more probably won't suddenly overload them and it would allow every technically minded person to see instantly if the domain name is in the writing system they expect. Your IDN algorithm already computes this information anyway.

Your reaction, amounting to "we give priority to our ideal of handling every language equally over security (of everyone, regardless of their language) because we consider 1) that it's the fault of the registrars (irrelevant from the user point of view, and unlikely to be fixed from that side) and 2) that our users are fragile little flowers that will be scared by any additional UI element (that's insulting by the way even if a cleaner UI is a worthy goal)", is giving me second thoughts about staying with Firefox after almost 15 years with Mozilla (since the 1.0 version)... Ideals are one of the reasons I stayed with the Mozilla Foundation, so I'm not faulting you on that but on your priorities : security for your users should really be more important than avoiding *anything* that could offend their sensibilities.

--
Chaddaï Fouché

Kyle Hamilton

Apr 21, 2017, 2:42:29 PM

to Gervase Markham, L. David Baron, mozilla-de...@lists.mozilla.org

The idea may be a variant, but it's by no means the same as what has
been presented there.

In the instant case, the font used in the address bar uses the same
glyph shapes for both Latin and Cyrillic. Might it be appropriate to
use (and provide) a font that uses different glyphs for every
confusable code point, and then provide some kind of user training on
how if the shapes don't match what they're used to it might be
phishing? This would be demonstrably script-neutral.

The downside is that it would unduly burden users whose
shape-recognition is sub-par, but pretty much every other idea for
protecting the users has been shot down by Mozilla reps on this list.
I'm sorry, but this is not "somebody else's problem". The users use
your software, and you are the only ones they can hope to save them
from threats that others refuse to take responsibility for.

Mozilla has always claimed that it's focused on user security. If
you're enforcing the rule "if it works on one Firefox, it works on all
Firefoxes" (in the context of "IDN owners might not use IDN if IDN
doesn't work everywhere") to the detriment of user security and
increasing phishability, are you really focused on user security? Why
is IDN display a sacred cow, when it increases the risk for your users
to be scammed? IDN owners don't apparently provide mindshare to
Mozilla, nor contribute to the installed base.

Mozilla reps on this list have tried to push the problem off on
everyone else -- the registrars (of which a subset refuse to accept
the responsibility, and cannot be compelled to do so), and the users
who have no means to differentiate in the scripts (because IDN owners
can't abide uncertainty). For some reason, though, you're not trying
to push the problem off onto the IDN owners who, as web site owners,
already have to deal with the uncertainty of their users using
whatever browsers with whatever security policies baked in that they
choose to.

-Kyle H

On Fri, Apr 21, 2017 at 3:28 AM, Gervase Markham <ge...@mozilla.org> wrote:
> On 21/04/17 02:36, Kyle Hamilton wrote:

>> Perhaps, only display non-punycode from codepoint sets used in
>> languages already installed on the computer?
>

> Before we do another canter through the six different ideas that always
> occur to people when first presented with this issue, the very unofficial
> https://wiki.mozilla.org/Gerv%27s_IDN_Display_Algorithm_FAQ

Alex Gaynor

Apr 21, 2017, 2:59:18 PM

to Kyle Hamilton, L. David Baron, Gervase Markham, mozilla-de...@lists.mozilla.org

This thread is very focused on what's the right algorithm for preventing
homoglyph attacks. But why do we care about homoglyphs? Because they make
phishing for credentials easier.

But do homoglyphs need unique handling vs. other phishing attacks? First,
it's clear that homoglyphs are not a necessary component of a phishing
attack, facebook-com.nonsense.biz is also an effective strategy for
phishers. Homoglyphs are particularly scary to us though, because no amount
of training or vigilance on the part of users will work, we must have a
machine solution.

Given that, what machine solutions do we have for _all_ phishing attacks?

- There's ongoing work to implement U2F, to provide phishing-proof
credentials
- We use Google's SafeBrowsing API to check for known phishing sites.
- Probably there's lots of other things we already do that I don't even
realize!

But I agree with those who've argued for "the buck stops here" -- if
there's more we can do, we must, because even if registrars prevent
homoglyphs, there are 1001 other effective phishing techniques that need
to be stopped.

- Are there clever clientside algorithms we can use to detect phishing
websites above what SafeBrowsing gives us?
- Should the UI surface "this is your first time visiting this website"?

There's probably other good ideas out there.

Cheers,
Alex

> >> Perhaps, only display non-punycode from codepoint sets used in
> >> languages already installed on the computer?
> >

> > Before we do another canter through the six different ideas that always
> > occur to people when first presented with this issue, the very unofficial
> > https://wiki.mozilla.org/Gerv%27s_IDN_Display_Algorithm_FAQ
> > might help shortcut the process. In this case, questions 7-9 are
> > relevant, as your proposal is a variant of the one those address.
> >
> > Gerv
> >

Eli the Bearded

Apr 21, 2017, 3:07:53 PM

to mozilla-de...@lists.mozilla.org

The (alas unwritten) bit that is important in my proposal is that this
confusable check only happens for punycode DNS labels.

This example I posted elsewhere might be helpful to demonstrate my idea:

This site, https://www.xn--80ak6aa92e.com/, uses the Cyrillic
alphabet to create a URL that resembles the Latin alphabet
"www.apple.com". Do you wish to continue?

Elijah
------

might not be good enough for "fully baked" yet

Eli the Bearded

Apr 21, 2017, 3:31:40 PM

to mozilla-de...@lists.mozilla.org

Hmmm. Checking for Cyrillic to Latin / Digit issues in confusables.txt,
there are more than I had suspected.

A644 ; 0032 ; MA # ( Ꙅ → 2 ) CYRILLIC CAPITAL LETTER REVERSED DZE → DIGIT TWO # →Ƨ→
0417 ; 0033 ; MA # ( З → 3 ) CYRILLIC CAPITAL LETTER ZE → DIGIT THREE #
04E0 ; 0033 ; MA # ( Ӡ → 3 ) CYRILLIC CAPITAL LETTER ABKHASIAN DZE → DIGIT THREE # →Ʒ→
0431 ; 0036 ; MA # ( б → 6 ) CYRILLIC SMALL LETTER BE → DIGIT SIX #

0430 ; 0061 ; MA # ( а → a ) CYRILLIC SMALL LETTER A → LATIN SMALL LETTER A #
0410 ; 0041 ; MA # ( А → A ) CYRILLIC CAPITAL LETTER A → LATIN CAPITAL LETTER A #
042C ; 0062 ; MA # ( Ь → b ) CYRILLIC CAPITAL LETTER SOFT SIGN → LATIN SMALL LETTER B # →Ƅ→
0412 ; 0042 ; MA # ( В → B ) CYRILLIC CAPITAL LETTER VE → LATIN CAPITAL LETTER B #
0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C #
0421 ; 0043 ; MA # ( С → C ) CYRILLIC CAPITAL LETTER ES → LATIN CAPITAL LETTER C #
0501 ; 0064 ; MA # ( ԁ → d ) CYRILLIC SMALL LETTER KOMI DE → LATIN SMALL LETTER D #
0435 ; 0065 ; MA # ( е → e ) CYRILLIC SMALL LETTER IE → LATIN SMALL LETTER E #
0415 ; 0045 ; MA # ( Е → E ) CYRILLIC CAPITAL LETTER IE → LATIN CAPITAL LETTER E #
050C ; 0047 ; MA # ( Ԍ → G ) CYRILLIC CAPITAL LETTER KOMI SJE → LATIN CAPITAL LETTER G #
04BB ; 0068 ; MA # ( һ → h ) CYRILLIC SMALL LETTER SHHA → LATIN SMALL LETTER H #
041D ; 0048 ; MA # ( Н → H ) CYRILLIC CAPITAL LETTER EN → LATIN CAPITAL LETTER H #
0456 ; 0069 ; MA # ( і → i ) CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I → LATIN SMALL LETTER I #
0458 ; 006A ; MA # ( ј → j ) CYRILLIC SMALL LETTER JE → LATIN SMALL LETTER J #
0408 ; 004A ; MA # ( Ј → J ) CYRILLIC CAPITAL LETTER JE → LATIN CAPITAL LETTER J #
043A ; 006B ; MA # ( к → k ) CYRILLIC SMALL LETTER KA → LATIN SMALL LETTER K #
041A ; 004B ; MA # ( К → K ) CYRILLIC CAPITAL LETTER KA → LATIN CAPITAL LETTER K #
0406 ; 006C ; MA # ( І → l ) CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I → LATIN SMALL LETTER L #
041C ; 004D ; MA # ( М → M ) CYRILLIC CAPITAL LETTER EM → LATIN CAPITAL LETTER M #
043F ; 006E ; MA # ( п → n ) CYRILLIC SMALL LETTER PE → LATIN SMALL LETTER N #
043E ; 006F ; MA # ( о → o ) CYRILLIC SMALL LETTER O → LATIN SMALL LETTER O #
041E ; 004F ; MA # ( О → O ) CYRILLIC CAPITAL LETTER O → LATIN CAPITAL LETTER O #
0440 ; 0070 ; MA # ( р → p ) CYRILLIC SMALL LETTER ER → LATIN SMALL LETTER P #
0420 ; 0050 ; MA # ( Р → P ) CYRILLIC CAPITAL LETTER ER → LATIN CAPITAL LETTER P #
051B ; 0071 ; MA # ( ԛ → q ) CYRILLIC SMALL LETTER QA → LATIN SMALL LETTER Q #
0433 ; 0072 ; MA # ( г → r ) CYRILLIC SMALL LETTER GHE → LATIN SMALL LETTER R #
0455 ; 0073 ; MA # ( ѕ → s ) CYRILLIC SMALL LETTER DZE → LATIN SMALL LETTER S #
0405 ; 0053 ; MA # ( Ѕ → S ) CYRILLIC CAPITAL LETTER DZE → LATIN CAPITAL LETTER S #
0442 ; 0074 ; MA # ( т → t ) CYRILLIC SMALL LETTER TE → LATIN SMALL LETTER T # →τ→
0422 ; 0054 ; MA # ( Т → T ) CYRILLIC CAPITAL LETTER TE → LATIN CAPITAL LETTER T #
0446 ; 0075 ; MA # ( ц → u ) CYRILLIC SMALL LETTER TSE → LATIN SMALL LETTER U #
0475 ; 0076 ; MA # ( ѵ → v ) CYRILLIC SMALL LETTER IZHITSA → LATIN SMALL LETTER V #
0474 ; 0056 ; MA # ( Ѵ → V ) CYRILLIC CAPITAL LETTER IZHITSA → LATIN CAPITAL LETTER V #
051C ; 0057 ; MA # ( Ԝ → W ) CYRILLIC CAPITAL LETTER WE → LATIN CAPITAL LETTER W #
0445 ; 0078 ; MA # ( х → x ) CYRILLIC SMALL LETTER HA → LATIN SMALL LETTER X #
0425 ; 0058 ; MA # ( Х → X ) CYRILLIC CAPITAL LETTER HA → LATIN CAPITAL LETTER X #
0443 ; 0079 ; MA # ( у → y ) CYRILLIC SMALL LETTER U → LATIN SMALL LETTER Y #
04AF ; 0079 ; MA # ( ү → y ) CYRILLIC SMALL LETTER STRAIGHT U → LATIN SMALL LETTER Y # →γ→
04AE ; 0059 ; MA # ( Ү → Y ) CYRILLIC CAPITAL LETTER STRAIGHT U → LATIN CAPITAL LETTER Y #

I don't have a Russian wordlist to compare to any of the English
wordlists I do have (and that would be subpar because my English word
lists don't include internet names like "paypal" (CYRILLIC: ER A U ER A
BYELORUSSIAN-UKRAINIAN I)). But, yeah, I can see potential for there
being a lot of natural overlaps.

Elijah
------
possibly some in ARMENIAN and other widely used alphabets, too
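
For illustration, lines in the layout quoted above can be parsed and applied to a punycode label along the following lines. This is a hedged Python sketch: `SAMPLE` holds five lines copied from the excerpt, the function names are invented, only single-code-point sources are looked up, and the "script" of a character is guessed from the first word of its Unicode name rather than taken from the Unicode Script property.

```python
import unicodedata

# Five lines copied from the excerpt above (source ; target ; type # comment).
SAMPLE = """\
0430 ; 0061 ; MA # ( а → a ) CYRILLIC SMALL LETTER A → LATIN SMALL LETTER A #
0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C #
0435 ; 0065 ; MA # ( е → e ) CYRILLIC SMALL LETTER IE → LATIN SMALL LETTER E #
0456 ; 0069 ; MA # ( і → i ) CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I → LATIN SMALL LETTER I #
0440 ; 0070 ; MA # ( р → p ) CYRILLIC SMALL LETTER ER → LATIN SMALL LETTER P #
"""


def parse_confusables(text):
    """Build a {source: target} table from confusables-style lines."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or not fields[0]:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def script_of(ch):
    """Heuristic script guess: first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def whole_script_lookalike(punycode_label, table):
    """Return (script, lookalike_script, lookalike) or None.

    Only xn-- labels are checked, as in the proposal; a hit means every
    character is confusable and both sides are single, different scripts.
    """
    if not punycode_label.startswith("xn--"):
        return None
    label = punycode_label[4:].encode("ascii").decode("punycode")
    if not label or not all(ch in table for ch in label):
        return None
    lookalike = "".join(table[ch] for ch in label)
    source_scripts = {script_of(ch) for ch in label}
    target_scripts = {script_of(ch) for ch in lookalike}
    if len(source_scripts) == 1 and len(target_scripts) == 1 \
            and source_scripts != target_scripts:
        return source_scripts.pop(), target_scripts.pop(), lookalike
    return None
```

Run against `xn--e1awd7f`, the returned tuple carries the three pieces the warning text needs: the script in use, the script being imitated, and the Latin look-alike. Plain ASCII labels are skipped by construction, which is exactly the asymmetry objected to elsewhere in this thread.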

Boris Zbarsky

Apr 21, 2017, 3:36:04 PM

to mozilla-de...@lists.mozilla.org

On 4/21/17 3:07 PM, Eli the Bearded wrote:
> The (alas unwritten) bit that is important in my proposal is that this
> confusable check only happens for punycode DNS labels.

Right, which comes back to the "yeah, English is the only language
anyone should be speaking, and screw people who might be using other
languages" problem. I know that's not the intention, but that's the effect.

-Boris

Gervase Markham

Apr 24, 2017, 5:00:24 AM

to Chaddaï Fouché

On 21/04/17 18:25, Chaddaï Fouché wrote:
> I'm pretty sure most non-technical people don't even look at the url
> bar and don't care about the icons that already appear there : info,
> lock, read mode, zooming state, reload.

So why do you think adding another icon will solve this problem?

> Adding just one more probably
> won't suddenly overload them and it would allow every technically
> minded person to see instantly if the domain name is in the writing
> system they expect. Your IDN algorithm already computes this
> information anyway.

So we are trying to solve the problem only for technically minded people?

> Your reaction amounting to "we give priority to our ideal of handling
> every language equally over security (of everyone, regardless of
> their language) because we consider 1) that it's the fault of the
> registrars (irrelevant from the user point of view, and unlikely to
> be fixed from that side) and 2) that our users are fragile little
> flowers that will be scared by any additional UI element (that's
> insulting by the way even if a cleaner UI is a worthy goal)"

Well, that's basically what you just said above :-)

Gerv

Gervase Markham

Apr 24, 2017, 5:04:49 AM

to Kyle Hamilton, L. David Baron

On 21/04/17 19:42, Kyle Hamilton wrote:
> In the instant case, the font used in the address bar uses the same
> glyph shapes for both Latin and Cyrillic. Might it be appropriate to
> use (and provide) a font that uses different glyphs for every
> confusable code point, and then provide some kind of user training on
> how if the shapes don't match what they're used to it might be
> phishing? This would be demonstrably script-neutral.

Whose letters get distorted and whose letters get to stay the same?

If the differences are only small, the chances are people won't notice.
If they don't notice apple.com.example.com, then they won't notice this.

> The downside is that it would unduly burden users whose
> shape-recognition is sub-par, but pretty much every other idea for
> protecting the users has been shot down by Mozilla reps on this list.
> I'm sorry, but this is not "somebody else's problem". The users use
> your software, and you are the only ones they can hope to save them
> from threats that others refuse to take responsibility for.

Is it a bird? Is it a plane? No... it's a dinosaur!

> Mozilla has always claimed that it's focused on user security. If
> you're enforcing the rule "if it works on one Firefox, it works on all
> Firefoxes" (in the context of "IDN owners might not use IDN if IDN
> doesn't work everywhere") to the detriment of user security and
> increasing phishability, are you really focused on user security? Why
> is IDN display a sacred cow, when it increases the risk for your users
> to be scammed? IDN owners don't apparently provide mindshare to
> Mozilla, nor contribute to the installed base.

Are you properly assessing the level of the risk? Unlike mixed-script
systems, there is at most 1 and normally 0 Cyrillic whole-script
homographs of any domain. That means that now we've done this dance,
no-one can ever do this to Apple again. I would expect other major
domain owners who are paying attention to be going out there and
spending all of $7 on the Cyrillic homograph of their domain, if there
is one.

> Mozilla reps on this list have tried to push the problem off on
> everyone else -- the registrars (of which a subset refuse to accept
> the responsibility, and cannot be compelled to do so),

So it's OK for them to say it's not their responsibility but not OK for
us to say it's not our responsibility?

Or is what you mean that because we have an open process, there's more
chance of shouting at us until we do something than there is of shouting
at the registries until they do something?

Gerv

Gervase Markham

Apr 24, 2017, 5:06:57 AM

to Alex Gaynor, Kyle Hamilton, L. David Baron

On 21/04/17 19:59, Alex Gaynor wrote:
> But I agree with those who've argued for "the buck stops here" -- if
> there's more we can do, we must

Can I just note in passing that, independent of what we do here, "if
there's more we can do, we must" is a terrible, terrible argument?

Everything's a trade-off. Time, money, complexity, risk. Taking one
particular problem and saying "this risk must be eliminated to the
uttermost, regardless of how much time, money and added complexity is
needed" is just not a reasonable position.

Gerv

L. David Baron

Apr 24, 2017, 6:53:42 AM

to Igor Bukanov, mozilla-de...@lists.mozilla.org

On Friday 2017-04-21 13:05 +0200, Igor Bukanov wrote:
> On 21 April 2017 at 00:54, Eli the Bearded <*@eli.users.panix.com> wrote:

> > Objection. Configuration to not record history is trivial, and even
> > if not configured such, some confusables could easily be sites that
> > the user doesn't visit often enough to have in history.
>
> Still, the history is too valuable to dismiss just because it does not
> work in all cases. At the very least, in a typical case the history
> shows which scripts the user has ever visited per top-level domain,
> which allows flagging an unexpected script. For example, personally I
> would be very suspicious of a Cyrillic .com domain, as those are awkward
> to type (one has to change the keyboard layout in the middle). If a site
> needs a Russian domain, I expect it to use .рф, not .com.

This makes me wonder: could we become more suspicious (in terms of
UI indications) of sites where the script changes between different
parts of the hostname (or eTLD+1), i.e., move towards expecting that
non-Latin domain names will be using a non-Latin TLD?

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)
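
A rough sketch of the check floated in this post, comparing the script of the registrable label with the script of the TLD, might look like this in Python. The assumptions are significant: the last label is treated as the whole public suffix (a real eTLD+1 computation needs the Public Suffix List), the script is guessed from the first word of each character's Unicode name, and all function names are invented for the example.

```python
import unicodedata


def to_unicode(label):
    """Decode one DNS label from its xn-- (punycode) form, if it has one."""
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def scripts_in(label):
    """Approximate the set of scripts used by the letters of a label."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def script_changes_at_tld(host):
    """True if the label left of the TLD uses a different script than it."""
    labels = [to_unicode(label) for label in host.lower().split(".")]
    if len(labels) < 2:
        return False
    name, tld = labels[-2], labels[-1]
    name_scripts = scripts_in(name)
    # Labels without letters (e.g. all digits) give no signal either way.
    return bool(name_scripts) and name_scripts != scripts_in(tld)
```

Under these assumptions a Cyrillic label under .com is flagged, while a Cyrillic label under .рф and an ordinary Latin .com name are not. Legitimate non-Latin names registered under Latin TLDs would be flagged too, so this is a suspicion signal at best.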

Robert Kaiser

Apr 24, 2017, 7:41:08 AM

to mozilla-de...@lists.mozilla.org

Gervase Markham schrieb:

> Everything's a trade-off. Time, money, complexity, risk. Taking one
> particular problem and saying "this risk must be eliminated to the
> uttermost, regardless of how much time, money and added complexity is
> needed" is just not a reasonable position.

While that's true, right now our position risks creating the
completely wrong impression that Mozilla doesn't care if phishing
happens to our users or, by extension, about their security. Now, we
all know that this is extremely far from the truth - but esp. if other
browsers "do something" (no matter how useful that "something" is) and
we "do nothing" and "play the blame game" by saying it's someone else's
fault (Douglas Adams fans would call it a "SEP field") then it's easy
for outsiders to get that wrong impression.

It's also a truism that being right is not always enough to do the
right thing. I think we need to do some kind of "mitigation" so that
we don't look worse than our competitors (which we do often
enough, unfortunately).

I think, in hindsight, it was a bad idea to abandon the whitelisting
approach - even though it looked perfectly fine back then. We probably
would have needed a model somewhat similar to how we run the root CA
list: After seeding the whitelist with what we had from our own research
before, only add new TLDs that request that from their side and prove
that they follow certain rules (esp. anti-homograph-attack ones). Move
as much of the burden as possible to those that actually make money selling IDN
domains. The maintenance of the list and setting of rules could
potentially even have been done in some common form between all or some of
Mozilla, Google, Microsoft, and Apple. The alternative would have been
to not allow IDN for new TLDs at all unless ICANN enforces rules of that
kind (given that ICANN makes money with granting TLDs, that also would
fit the "put the burden where the financial incentive is" principle) - and that
could potentially have been in accordance with other browser vendors as
well.

Maybe we can now have some influence on ICANN so they actually set up
anti-homograph-attack rules for TLDs, are we in talks with them on that?

That said, if any activities on the side of registries cannot be achieved
fast, I fear we need to ship "something" in Firefox that does
"something" to help or we risk being stamped "insecure" or
"phishing-friendly" even if that's not true.

KaiRo

Chaddaï Fouché

Apr 24, 2017, 8:38:20 AM

to Gervase Markham, mozilla-de...@lists.mozilla.org

Le lun. 24 avr. 2017 à 11:00, Gervase Markham <ge...@mozilla.org> a écrit :

> On 21/04/17 18:25, Chaddaï Fouché wrote:

> > I'm pretty sure most non-technical people don't even look at the url
> > bar and don't care about the icons that already appear there : info,
> > lock, read mode, zooming state, reload.
>

> So why do you think adding another icon will solve this problem?
>
>

Clearly, _as I wrote in this quote_, I don't think it will help
*non-technical people*, but neither would deactivating IDN support,
since most people only see the content of the site, and if it
looks like an Apple site, it must be from Apple, right? A minority has now
trained themselves to look at the lock but that won't help much here. The
only protection that would work is SafeBrowsing, which is very good when it
works but obviously can't protect against every phishing website
instantaneously after its creation (and they don't want to make a blanket
block against this potential problem, from what I understand?).

> > Adding just one more probably
> > won't suddenly overload them and it would allow every technically
> > minded people to see instantly if the domain name is in the writing
> > system they expect. Your IDN algorithm already compute this
> > information anyway.
>

> So we are trying to solve the problem only for technically minded people?
>
>

And? How is a solution that won't inconvenience non-technical people but
helps technically minded people bad? Or are you implying that technically
minded people shouldn't be considered in Mozilla's decisions? Because in
case you've forgotten, an important minority of your public is technically
minded, and most of those who aren't installed Firefox because they were
advised to by their technically minded friends... unless you think your
enormous PR prowess and your unlimited ad budget were the main reason for
your success?

The proposed solution is not miraculous but it solves the problem that even
technically minded people can't see that the site is fake even by looking
at the URL bar, even by looking at the information from the certificate
that appears when clicking on the lock... Sure, you can copy and paste the
URL into an ASCII editor but nobody will do that for every website, even
sensitive websites. Convenience is important for technically minded people
too, even if they have a higher threshold for convenience vs security.

> > Your reaction amounting to "we give priority to our ideal of handling
> > every language equally over security (of everyone, regardless of
> > their language) because we consider 1) that it's the fault of the
> > registrars (irrelevant from the user point of view, and unlikely to
> > be fixed from that side) and 2) that our users are fragile little
> > flowers that will be scared by any additional UI element (that's
> > insulting by the way even if a cleaner UI is a worthy goal)"
>

> Well, that's basically what you just said above :-)
>
>

So the solution that proposes to add an icon in the URL bar to improve
security is somehow synonymous with:
1) give priority to every other ideal over security
2) it's the fault of the registrars
3) our users are fragile little flowers that will be scared by any
additional UI element
??
Words must not have the same meaning for you and me...

> Everything's a trade-off. Time, money, complexity, risk. Taking one
> particular problem and saying "this risk must be eliminated to the
> uttermost, regardless of how much time, money and added complexity is
> needed" is just not a reasonable position.

Sure, but some of the proposed solutions aren't huge time sinks; mine
(Igor's), for instance, only means adding a UI element in the URL bar,
something that has already been done and can probably be reproduced without
too much innovation. What would take the most time would probably be to
find reasonable abbreviations for the writing systems and testing
afterward. The time spent discussing this issue would have been enough to
implement this several times over.

--
Chaddaï Fouché

Alex Gaynor

Apr 24, 2017, 9:31:18 AM

to Gervase Markham, L. David Baron, mozilla-de...@lists.mozilla.org, Kyle Hamilton

You're right Gerv, everything is a trade-off. If I could resend that email,
I'd stress that "we" is the most important part of that sentence, not "if
there's more, it must be done". Specifically, my position is that just
because the registrars, or someone else, _should_ have done something,
doesn't make a good reason for us _not_ to do something.

We are the user's agent, and we should balance priorities in serving our
users against each other, not against what someone else could have done :-)

As I said, I think the right solution is to invest in more general phishing
mitigation, but if one believes that homoglyphs are particularly
threatening, I don't think it's relevant that registrars could have done
something.

Alex

On Mon, Apr 24, 2017 at 5:06 AM, Gervase Markham <ge...@mozilla.org> wrote:

> On 21/04/17 19:59, Alex Gaynor wrote:

> > But I agree with those who've argued for "the buck stops here" -- if

> > there's more we can do, we must
>
> Can I just note in passing that, independent of what we do here, "if
> there's more we can do, we must" is a terrible, terrible argument?
>

> Everything's a trade-off. Time, money, complexity, risk. Taking one
> particular problem and saying "this risk must be eliminated to the
> uttermost, regardless of how much time, money and added complexity is
> needed" is just not a reasonable position.
>

> Gerv
>

Frederik Braun

Apr 24, 2017, 12:11:25 PM

to dev-se...@lists.mozilla.org

On 24.04.2017 13:40, Robert Kaiser wrote:
> Gervase Markham schrieb:

>> Everything's a trade-off. Time, money, complexity, risk. Taking one
>> particular problem and saying "this risk must be eliminated to the
>> uttermost, regardless of how much time, money and added complexity is
>> needed" is just not a reasonable position.
>

> While that's true, right now our position risks creating the
> completely wrong impression that Mozilla doesn't care if phishing happens to
> our users or, by extension, about their security. Now, we all know that
> this is extremely far from the truth - but esp. if other browsers
> "do something" (no matter how useful that "something" is) and we "do
> nothing" and "play the blame game" by saying it's someone else's fault
> (Douglas Adams fans would call it a "SEP field") then it's easy for
> outsiders to get that wrong impression.

FWIW, if this web page was actively phishing users, we would block it
through Safe Browsing. This one is not.

So the problem here is about deducing (from the domain name) if a
website is phishy. That's admittedly hard.

Kyle Hamilton

Apr 24, 2017, 4:37:01 PM

to Gervase Markham, L. David Baron, mozilla-de...@lists.mozilla.org

On Mon, Apr 24, 2017 at 2:04 AM, Gervase Markham <ge...@mozilla.org> wrote:
> On 21/04/17 19:42, Kyle Hamilton wrote:

>> In the instant case, the font used in the address bar uses the same
>> glyph shapes for both Latin and Cyrillic. Might it be appropriate to
>> use (and provide) a font that uses different glyphs for every
>> confusable code point, and then provide some kind of user training on
>> how if the shapes don't match what they're used to it might be
>> phishing? This would be demonstrably script-neutral.
>

> Whose letters get distorted and whose letters get to stay the same?

Anything that is not U+0020 to U+007F gets altered.

> If the differences are only small, the chances are people won't notice.
> If they don't notice apple.com.example.com, then they won't notice this.

Then it's probably a user training issue: how many people even look at
the address bar after they click a link? Create a UI that trains them
to look at it.

Or, you know, a UI that pops up a "if this is supposed to be Apple
Computer, it should look like 'apple.com'. If the letters don't look
the same as this, you are not at Apple Computer's site." or something.

>> The downside is that it would unduly burden users whose
>> shape-recognition is sub-par, but pretty much every other idea for
>> protecting the users has been shot down by Mozilla reps on this list.
>> I'm sorry, but this is not "somebody else's problem". The users use
>> your software, and you are the only ones they can hope to save them
>> from threats that others refuse to take responsibility for.
>

> Is it a bird? Is it a plane? No... it's a dinosaur!
>

>> Mozilla has always claimed that it's focused on user security. If
>> you're enforcing the rule "if it works on one Firefox, it works on all
>> Firefoxes" (in the context of "IDN owners might not use IDN if IDN
>> doesn't work everywhere") to the detriment of user security and
>> increasing phishability, are you really focused on user security? Why
>> is IDN display a sacred cow, when it increases the risk for your users
>> to be scammed? IDN owners don't apparently provide mindshare to
>> Mozilla, nor contribute to the installed base.
>

> Are you properly assessing the level of the risk? Unlike mixed-script
> systems, there is at most 1 and normally 0 Cyrillic whole-script
> homographs of any domain. That means that now we've done this dance,
> no-one can ever do this to Apple again. I would expect other major
> domain owners who are paying attention to be going out there and
> spending all of $7 on the Cyrillic homograph of their domain, if there
> is one.

...I'm sure that Apple and Epic and whoever else are going to be quite
happy going through the UDRP process to seize ownership of their
homograph domains from whoever currently owns them. You realize
that costs a bit more than $7, right?

>> Mozilla reps on this list have tried to push the problem off on
>> everyone else -- the registrars (of which a subset refuse to accept
>> the responsibility, and cannot be compelled to do so),
>

> So it's OK for them to say it's not their responsibility but not OK for
> us to say it's not our responsibility?

Correct. They don't have contracts in place (EULAs) with individual
users, whereas you do. They don't advertise that they operate for the
benefit or safety of the people who look up using their software or
services, whereas you do. You're the ones whose software is actually
used by the users, with one-on-one relationships with you. You're the
last and only line of defense for your users against the uncaring
policies of the rest of the Internet world. (More importantly, Chrome
and Safari have accepted responsibility for their users' security on
this topic, and Mozilla is now the outlier.)

> Or is what you mean that because we have an open process, there's more
> chance of shouting at us until we do something than there is of shouting
> at the registries until they do something?

Wow. Someone's taking this personally.

No, I mean that Mozilla has always in the past acted toward its users'
security, has always acted against homograph and easily-confusable
domains, and has always advertised that record. This lack of action
is a marked departure from what it has done, and a marked departure
from its advertising. It's not far-fetched to worry that there could
be legal liability here for false advertising.

Mozilla does have the ability to act for the benefit of its users
here. But it seems that it no longer feels interested in doing so.

-Kyle H

Gervase Markham

Apr 25, 2017, 5:16:37 AM

to Robert Kaiser

On 24/04/17 12:40, Robert Kaiser wrote:
> I think, in hindsight, it was a bad idea to abandon the whitelisting
> approach

I don't agree. If ubiquity of IDNs is a goal, every IDN-using client
having its own whitelist would kill that goal stone dead.

> That said, if any activities on the side of registries cannot be achieved
> fast, I fear we need to ship "something" in Firefox that does
> "something" to help or we risk being stamped "insecure" or
> "phishing-friendly" even if that's not true.

So you are suggesting we prioritize product changes based on
ill-informed media pressure?

Gerv

Gervase Markham

Apr 25, 2017, 5:19:23 AM

to L. David Baron, Igor Bukanov

On 24/04/17 11:53, L. David Baron wrote:
> This makes me wonder: could we become more suspicious (in terms of
> UI indications) of sites where the script changes between different
> parts of the hostname (or eTLD+1), i.e., move towards expecting that
> non-Latin domain names will be using a non-Latin TLD?

That ends up basically being "no .com for _you_, suspicious-looking
non-Latin script". It's another way of treating some scripts as second
class. Admittedly, it's not the worst way of doing so, and a very
measured approach to this (basically, a TLD _black_list for TLDs which
are actively allowing their customers to attack each other) isn't a
totally terrible idea. The trouble is the collateral damage - those
companies and businesses who are happily using <some Cyrillic
string>.com as their domain name and now find it appears as gibberish in
major browsers after they've spent years building their brand, just
because the letters in their name happen all to have Latin homographs.

Gerv

L. David Baron

Apr 25, 2017, 6:29:04 AM

to Gervase Markham, mozilla-de...@lists.mozilla.org, Igor Bukanov

On Tuesday 2017-04-25 10:19 +0100, Gervase Markham wrote:
> On 24/04/17 11:53, L. David Baron wrote:

> > This makes me wonder: could we become more suspicious (in terms of
> > UI indications) of sites where the script changes between different
> > parts of the hostname (or eTLD+1), i.e., move towards expecting that
> > non-Latin domain names will be using a non-Latin TLD?
>

> That ends up basically being "no .com for _you_, suspicious-looking
> non-Latin script". It's another way of treating some scripts as second
> class. Admittedly, it's not the worst way of doing so, and a very
> measured approach to this (basically, a TLD _black_list for TLDs which
> are actively allowing their customers to attack each other) isn't a
> totally terrible idea. The trouble is the collateral damage - those
> companies and businesses who are happily using <some Cyrillic
> string>.com as their domain name and now find it appears as gibberish in
> major browsers after they've spent years building their brand, just
> because the letters in their name happen all to have Latin homographs.

Couldn't it be done in a pretty limited way? For example, we could
use the punycode representation if:

* the component before the eTLD consists entirely of characters
that are homographs for characters in a single other script, and

* the component before the eTLD is in a different script from the
eTLD.

If there are some legitimate sites that this would catch, maybe we
could then whitelist them?

(I'm assuming we already require each component to be
single-script.)
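
The two conditions above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not Firefox's or Chrome's actual algorithm: the look-alike table `CYRILLIC_TO_LATIN` is a hypothetical, deliberately tiny stand-in for a real confusables data set, the script of a character is guessed from the first word of its Unicode name, the last label is treated as the TLD, and `show_as_punycode` is an invented name.

```python
import unicodedata

# Hypothetical, deliberately small table of Cyrillic letters that look
# like Latin letters. A real implementation would need a full
# confusables data set covering every script.
CYRILLIC_TO_LATIN = {
    "\u0430": "a", "\u0441": "c", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0445": "x", "\u0443": "y",
    "\u0456": "i", "\u0458": "j", "\u0455": "s",
}


def script_of(label):
    """Return the single script of a label, or None if mixed or letterless.

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", ...) is a good enough proxy for the script.
    """
    scripts = {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }
    return scripts.pop() if len(scripts) == 1 else None


def show_as_punycode(hostname):
    """Apply both conditions to the label just before the TLD."""
    labels = hostname.lower().split(".")
    if len(labels) < 2:
        return False
    name, tld = labels[-2], labels[-1]
    letters = [ch for ch in name if ch.isalpha()]
    # Condition 1: every letter is a homograph of a letter in one other script.
    all_lookalikes = bool(letters) and all(ch in CYRILLIC_TO_LATIN for ch in letters)
    # Condition 2: the label and the TLD are in different scripts.
    different_script = script_of(name) != script_of(tld)
    return all_lookalikes and different_script
```

With this table, an all-look-alike Cyrillic label under .com would be flagged, while the same label under .рф, or a Cyrillic label containing letters with no Latin look-alike, would not.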

Gervase Markham

Apr 25, 2017, 6:49:04 AM

to mozilla-de...@lists.mozilla.org

On 25/04/17 11:28, L. David Baron wrote:
> * the component before the eTLD consists entirely of characters
> that are homographs for characters in a single other script, and

(I assume you mean s/eTLD/TLD/ in each case.)

> * the component before the eTLD is in a different script from the
> eTLD.

AIUI this is what Chrome did, for Cyrillic only, and they said it
affected 2,800 sites in .com. I don't know if they did more analysis for
other TLDs - .ru, I suspect, would have a large number, and there would
be more if we extended to all possible homographs across all scripts. A
whitelist might solve that, but of course that would grandfather in
existing examples and not allow for businesses not yet existing or not
yet on the net.

One guiding principle I have found useful here is "what if the Internet
were invented by the Russians, and Latin was the script late to the
party?". I am trying to avoid doing anything to Cyrillic that I would
think were unfair were it done to Latin if the boot were on the other foot.

The trouble with Cyrillic in particular is that there are quite a few
clashing letters:
https://en.wikipedia.org/wiki/IDN_homograph_attack#Cyrillic
In Russian, you have a, c, e, o, p, x and y. Add in numbers, and you
have 3, 4 and 6. Cyrillic non-Russian languages add i, j and s, and if
you go rare/archaic (which may or may not be supported in the font
and/or noticeably different) you can add d, h, l and v. And that's just
lowercase. In the worst case, that's 14 of Latin's 26 letters, including
4 of the 5 vowels. It would be a significant crimp on Cyrillic domain
names if all names using only those letters didn't work except in .рф
and the like.
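
The count above can be checked with a few lines of Python. The letter groups are copied from the paragraph; the variable names are only illustrative.

```python
# Latin lowercase letters with a Cyrillic look-alike, grouped as in the text.
russian = set("aceopxy")          # letters used in Russian
other_cyrillic = set("ijs")       # added by non-Russian Cyrillic languages
rare_or_archaic = set("dhlv")     # rare/archaic, font-dependent

clashing = russian | other_cyrillic | rare_or_archaic
clashing_vowels = sorted(clashing & set("aeiou"))

print(len(clashing))      # 14
print(clashing_vowels)    # ['a', 'e', 'i', 'o']
```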

> (I'm assuming we already require each component to be
> single-script.)

Yes, we do. That is what solves 99% of the problem.

Gerv

Jonathan Kingston

Apr 25, 2017, 1:11:51 PM

to Gervase Markham, mozilla-de...@lists.mozilla.org

Besides the fact that lists are hard to maintain, there isn't anything
technical preventing Firefox from having one for existing popular sites
that registries have registered and shouldn't have, right?
This could just make the punycode show in the browser for sites in this
list.

On Tue, Apr 25, 2017 at 11:48 AM, Gervase Markham <ge...@mozilla.org> wrote:

> On 25/04/17 11:28, L. David Baron wrote:

> > * the component before the eTLD consists entirely of characters
> > that are homographs for characters in a single other script, and
>

> (I assume you mean s/eTLD/TLD/ in each case.)
>

> > * the component before the eTLD is in a different script from the
> > eTLD.
>

> AIUI this is what Chrome did, for Cyrillic only, and they said it
> affected 2,800 sites in .com. I don't know if they did more analysis for
> other TLDs - .ru, I suspect, would have a large number, and there would
> be more if we extended to all possible homographs across all scripts. A
> whitelist might solve that, but of course that would grandfather in
> existing examples and not allow for businesses not yet existing or not
> yet on the net.
>
> One guiding principle I have found useful here is "what if the Internet
> were invented by the Russians, and Latin was the script late to the
> party?". I am trying to avoid doing anything to Cyrillic that I would
> think were unfair were it done to Latin if the boot were on the other foot.
>
> The trouble with Cyrillic in particular is that there are quite a few
> clashing letters:
> https://en.wikipedia.org/wiki/IDN_homograph_attack#Cyrillic
> In Russian, you have a, c, e, o, p, x and y. Add in numbers, and you
> have 3, 4 and 6. Cyrillic non-Russian languages add i, j and s, and if
> you go rare/archaic (which may or may not be supported in the font
> and/or noticeably different) you can add d, h, l and v. And that's just
> lowercase. In the worst case, that's 14 of Latin's 26 letters, including
> 4 of the 5 vowels. It would be a significant crimp on Cyrillic domain
> names if all names using only those letters didn't work except in .рф
> and the like.
>

> > (I'm assuming we already require each component to be
> > single-script.)
>

> Yes, we do. That is what solves 99% of the problem.
>
> Gerv
>
>

Daniel Veditz

Apr 25, 2017, 9:24:57 PM

to L. David Baron, mozilla-de...@lists.mozilla.org, Igor Bukanov

On Mon, Apr 24, 2017 at 3:53 AM, L. David Baron <dba...@dbaron.org> wrote:

> This makes me wonder: could we become more suspicious (in terms of
> UI indications) of sites where the script changes between different
> parts of the hostname (or eTLD+1), i.e., move towards expecting that
> non-Latin domain names will be using a non-Latin TLD?
>

It would be nice, and sometimes we could (I think I read that the .ru
registrar only allows ASCII domains, and the Cyrillic version of their
ccTLD only has Cyrillic domains), but not in other cases. Of course .com is
a complete mess, but even with more thoughtful registries you have .eu
which explicitly accepts Cyrillic domains because Bulgaria is an EU member.

That would come back around to a TLD whitelist (or blacklist?) scheme.

-Dan Veditz

Gervase Markham

Apr 26, 2017, 4:58:47 AM

to Jonathan Kingston

On 25/04/17 14:59, Jonathan Kingston wrote:
> Besides the fact that lists are hard to maintain, there isn't anything
> technical preventing Firefox from having one for existing popular sites
> that registries have registered and shouldn't have, right?
> This could just make the punycode show in the browser for sites in this
> list.

We could do this, but it seems to me like it would be whack-a-mole, with
a bad press round at each whack because we are ostensibly taking
responsibility for the problem but not resolving it.

Gerv

akost...@gmail.com

Mar 12, 2018, 1:36:59 PM

to mozilla-de...@lists.mozilla.org

On Tuesday, April 25, 2017 at 1:49:04 PM UTC+3, Gervase Markham wrote:
...

> One guiding principle I have found useful here is "what if the Internet
> were invented by the Russians, and Latin was the script late to the
> party?". I am trying to avoid doing anything to Cyrillic that I would
> think were unfair were it done to Latin if the boot were on the other foot.

If the Internet had been invented in a Cyrillic-using country, then the whole domain would have been in Cyrillic, not only parts of it.

I'm from such a country (Cyrillic alphabet) and I find mixed domains useless. I mean mixed like "www.cyrillic-part.com". Am I expected to switch my keyboard layout to type the domain name in the URL bar?

If DNS had been invented by a country with a Cyrillic alphabet, would you want to type some parts in Latin and some parts in Cyrillic?

I don't care that many people bought mixed-charset domains. Let them buy non-mixed ones and resolve the issue long-term. As a technical user, I want to be able to easily recognize when domains use mixed charsets.

It is strange for me to see many Latin-only users blocking any progress on this issue because non-Latin users would potentially be alienated. If you are concerned about this, then ask your non-Latin users what they want. You are just guessing and blocking any sensible decision. There are polls and other strategies that can be used.

IMO, at the very least, there should be some highlighting when a domain uses mixed charsets, whether within a single component of the domain name or not. This is pretty much equal treatment IMO and wouldn't kill anybody.

Even better would be if mixed domains showed up in punycode by default, with some UI to switch them to Unicode if the user so decides. But looking at the sentiment here, I don't have much hope for this. At least *please* add some highlighting, no matter what it is, pretty please.
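
A check of the kind asked for here could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `scripts_in` and `needs_highlight` are invented, and the script of a character is approximated by the first word of its Unicode name, which is cruder than real Unicode script data (for example, it would also flag Japanese names that legitimately combine kanji and kana).

```python
import unicodedata


def scripts_in(hostname):
    """Return the set of scripts used by the letters of a hostname.

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", ...) stands in for the script. Digits,
    dots and hyphens are ignored.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in hostname
        if ch.isalpha()
    }


def needs_highlight(hostname):
    """True when the hostname as a whole mixes more than one script."""
    return len(scripts_in(hostname)) > 1
```

Because the whole hostname is examined rather than each label on its own, a Cyrillic name under a Latin TLD is highlighted even though every individual label is single-script.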

> The trouble with Cyrillic in particular is that there are quite a few
> clashing letters:
> https://en.wikipedia.org/wiki/IDN_homograph_attack#Cyrillic
> In Russian, you have a, c, e, o, p, x and y. Add in numbers, and you
> have 3, 4 and 6. Cyrillic non-Russian languages add i, j and s, and if
> you go rare/archaic (which may or may not be supported in the font
> and/or noticeably different) you can add d, h, l and v. And that's just
> lowercase. In the worst case, that's 14 of Latin's 26 letters, including
> 4 of the 5 vowels. It would be a significant crimp on Cyrillic domain
> names if all names using only those letters didn't work except in .рф
> and the like.
>
> > (I'm assuming we already require each component to be
> > single-script.)
>
> Yes, we do. That is what solves 99% of the problem.

Not really. There are many high-profile sites that can be abused. The first ones that come to my mind are ерау.bg and ебау.com

The former is impossible to spot. The latter requires looking at it carefully. For the "b", "в" and "ь" could also be hard to spot. An icon, different colors for the letters, or whatever would be much more useful. For example, a warning icon that shows an explanation with more info about the problem when you hover over it.

In fact, such a warning icon might be a good idea on many occasions. Firefox could detect different kinds of warnings going forward. An interested user (usually technical) would be able to make an informed decision whether the warning is relevant or not.

I'm not suggesting abandoning other long-term solutions that might be better for non-technical users. On the other hand, if Firefox ignores technical users, I doubt it would be good for it. I always preferred Firefox for the ability to make it behave as you want.
Presently Quantum has blocked many useful plugins with apparently no better stability, in my personal observation (yes, I had issues with replacements that used only the new APIs, which made my whole browsing experience a mess until I figured out what was going on). Now let's ignore the need for technical people to be sure of what they read in the address bar. I really hope Firefox can be good for technical and non-technical people. Otherwise it will not matter anymore which browser I am using. It could be whatever comes pre-installed.