Non-ASCII characters in a node body make edit operations produce unintended results


SegundoBob

Apr 13, 2022, 2:48:06 PM
to leo-editor
I don't know if this is a bug or just the way PyQt works, but this is a very annoying problem.  Sometimes HOME takes you to the end of line instead of the start.  Sometimes select and Ctrl+C copies unselected characters.  The "mistakes" are endless because the displayed cursor position is not "correct".

I first noticed this problem in 2022-02 because more and more articles posted on the Internet contain non-ASCII characters, and every day I copy many articles to node bodies and then edit them slightly.

2022-04-13 Wed I definitely identified the problem with the help of this command:

grep --color='auto' -P -n "[^\x00-\x7F]" x.txt

which I obtained from

https://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters
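
For anyone who wants to run the same check from inside a Leo script rather than the shell, here is a rough Python equivalent of the grep command above. It is only a sketch: the function name `find_non_ascii` and the sample text are made up for illustration and are not part of Leo.

```python
import re

# Same character class as the grep pattern: anything outside 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def find_non_ascii(text):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in NON_ASCII.finditer(line):
            hits.append((line_no, match.start(), match.group()))
    return hits

# Hypothetical sample text; in practice this would be a node body or file contents.
sample = "plain line\ncaf\u00e9 \u2014 done\n"
for line_no, col, ch in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: {ch!r} (U+{ord(ch):04X})")
```

Line numbers are 1-based, as with `grep -n`; columns are 0-based Python string indices.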

Here is an example article containing many non-ASCII characters:

https://newsletter.pragmaticengineer.com/p/scoop-atlassian

There are many suggestions on the Internet for removing non-ASCII characters using Python.  So far this is the best workaround that I've come up with.  If we don't come up with a fix or a better workaround, I'll eventually figure out how to replace non-ASCII characters that have similar ASCII characters with the appropriate ASCII characters.  Someone has probably implemented this, but so far I have not found it.
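
A minimal sketch of that replacement idea, using only the standard library. The `REPLACEMENTS` table, the function name `to_ascii`, and the `?` placeholder are assumptions chosen for illustration; the table covers only a few common typographic characters and would need extending for real articles.

```python
import unicodedata

# Hypothetical hand-picked mapping for common typographic characters.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # no-break space
}

def to_ascii(text, placeholder="?"):
    """Replace non-ASCII characters with ASCII look-alikes where possible."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
        else:
            # NFKD splits accented letters into base letter + combining marks;
            # dropping the non-ASCII parts leaves the base letter, if any.
            decomposed = unicodedata.normalize("NFKD", ch)
            ascii_part = decomposed.encode("ascii", "ignore").decode("ascii")
            out.append(ascii_part or placeholder)
    return "".join(out)

print(to_ascii("caf\u00e9 \u201cok\u201d \u2026 \U0001F600"))
```

Characters with no ASCII counterpart, such as emojis, become the placeholder, so this is lossy and should be run on a copy of the text.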

Unfortunately, I have higher priority problems right now that prevent me from devoting much time to this problem.

Versions tested:

Leo 6.6b2-devel, devel branch, build 0ce2fa9ad5
2022-02-24 09:55:29 -0600
Python 3.8.10, PyQt version 5.12.8
linux
---------------
Leo 6.6.1-devel, devel branch, build 90bad4f475
2022-04-13 09:33:47 -0500
Python 3.8.10, PyQt version 5.12.8

tbp1...@gmail.com

Apr 13, 2022, 4:14:07 PM
to leo-editor
It looks like, on that particular page, the non-ASCII characters are emojis.  I copied part of that page with two of the emojis into a Leo node and didn't see any unusual behavior.  <Home>, <End> and copying with <CTRL-C> worked as expected.  Do you have an example that didn't work right for you?

Here's an online checker for non-ascii characters:  Non-Ascii Checker. You can paste suspect text in or point it to a file.

Since Python by default uses UTF-8 and Unicode, text that isn't encoded in UTF-8 could cause problems, as could text that is wrongly encoded.  Some text editors can figure out the encoding, and you can tell them to save a file in a different encoding.  EditPlus is the one I use for this.  Not free but worth the $35.  Notepad++ also can do it, though I haven't used it.

Characters that your font does not have a glyph for might be troublesome too, but I'm not sure.  Again, emojis probably would be the most likely if we're not getting into CJK characters, since so many new emojis are getting introduced.

If we see the kind of behavior you experienced in properly encoded strings, then for sure we'd have a problem.  Unfortunately there is a lot of incorrectly encoded material out there.  Hmm, I wonder if Leo should have an encoding checker built in?
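
As a rough idea of what such an encoding check could look like, here is a sketch that only tests whether raw bytes are valid UTF-8. The function name `check_utf8` is hypothetical and nothing like it is claimed to exist in Leo; it cannot tell which other encoding the bytes are in, only that strict UTF-8 decoding fails and where.

```python
def check_utf8(data: bytes):
    """Return None if data is valid UTF-8, else (byte_offset, reason)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        return (err.start, err.reason)
    return None

good = "caf\u00e9".encode("utf-8")
bad = "caf\u00e9".encode("latin-1")   # same text saved in a different encoding
print(check_utf8(good))
print(check_utf8(bad))
```

Text pasted from the clipboard has already been decoded by the time Python sees it, so a check like this would apply mainly to files read from disk.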

tbp1...@gmail.com

Apr 13, 2022, 8:55:33 PM
to leo-editor

There could also be a problem with a specific version of Qt, so if you can try a later version (or possibly an earlier one) it might behave differently. Supposedly, all Qt widgets and strings work correctly with Unicode and/or UTF-8 encoding.
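
One thing that might be worth checking, offered as an assumption and not as a confirmed diagnosis of this report: Python strings are indexed by code point, while Qt's QString stores UTF-16 code units, so a character outside the Basic Multilingual Plane (most emojis) counts as one position in Python and two in Qt. If positions are passed between the two without conversion, they would disagree after the first such character. The helper names below are made up for illustration.

```python
def utf16_length(text):
    """Number of UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2

def utf16_offset(text, index):
    """Convert a Python code point index into a UTF-16 code unit offset."""
    return utf16_length(text[:index])

s = "a\U0001F600b"          # 'a', an emoji outside the BMP, 'b'
print(len(s))               # positions as Python counts them
print(utf16_length(s))      # positions as a UTF-16 string counts them
print(utf16_offset(s, 2))   # where 'b' sits in UTF-16 code units
```

For text containing only BMP characters (accented letters, curly quotes, most CJK), the two counts are identical, so this would only matter for emojis and other supplementary-plane characters.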

Arjan

May 5, 2022, 10:27:25 AM
to leo-editor

Edward K. Ream

May 6, 2022, 10:15:40 AM
to leo-editor
On Thu, May 5, 2022 at 9:27 AM Arjan <arjan...@gmail.com> wrote:
Yes. I have no idea how it could be fixed.  Anyone have any suggestions?

Edward

tbp1...@gmail.com

May 6, 2022, 1:01:57 PM
to leo-editor
I'm in the dark too, but when i encountered the same problem several leo versions ago, i raised the issue and then after a few more merges the problem was gone.  actually, that time it was more serious because every use of the ctrl key (iirc) inserted those symbols.  it's always been those exact strange symbols, so there must be some very specific thing going on.  the symbols involved do not change with the font, so it's not a matter of missing or wrong glyphs getting drawn.  it might be something like utf8 decoding getting shifted by a bit or a nibble, or getting an extra byte inserted spuriously into the decoded stream.  if so, it would probably be an issue with the decoding library that qt uses.

i never knew what had changed - i assumed that @edward had done something to fix it - but if someone can find my report - maybe 2 years ago by now?  - and has some skill with tracking changes through git - something might come to light.  sorry, i'm not in a position to do it myself right now.

Edward K. Ream

May 6, 2022, 1:07:12 PM
to leo-editor
On Fri, May 6, 2022 at 12:02 PM tbp1...@gmail.com <tbp1...@gmail.com> wrote:

the symbols involved do not change with the font, so it's not a matter of missing or wrong glyphs getting drawn.  it might be something like utf8 decoding getting shifted by a bit or a nibble, or getting an extra byte inserted spuriously into the decoded stream.  if so, it would probably be an issue with the decoding library that qt uses.

I agree with this analysis. The only workaround I can think of is not to paste the offending glyphs/characters.

Edward