Unicode Ms


Mohammed Faerber

Aug 5, 2024, 2:08:13 PM

to gaconafing


For my day job, I'm the co-founder and CEO of Stack Overflow, the largest online community for programmers to learn, share their knowledge, and level up. Each month, more than 40 million professional and aspiring programmers visit Stack Overflow to ask and answer questions and find better jobs. Stack Overflow is also the flagship site of the Stack Exchange network, 160+ question and answer sites dedicated to all kinds of topics from cooking to gaming. According to Quantcast, Stack Overflow is the 30th largest web property in the United States and in the top 100 in the world.

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

There is no practical limit on the number of letters that Unicode can define (the code space runs up to U+10FFFF, over a million code points), and in fact they have gone beyond 65,536, so not every Unicode letter can really be squeezed into two bytes, but that was a myth anyway.

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in big-endian or little-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark, and if you are swapping your high and low bytes it will look like FF FE, so the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
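As a rough illustration of the byte order mark described above, here is a minimal Python sketch. The helper name `detect_utf16_byte_order` is made up for this example; the BOM constants come from the standard `codecs` module.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"  # no BOM: the reader has to guess or be told

# The same text stored both ways, each prefixed with its BOM.
hello_be = codecs.BOM_UTF16_BE + "Hello".encode("utf-16-be")
hello_le = codecs.BOM_UTF16_LE + "Hello".encode("utf-16-le")

print(detect_utf16_byte_order(hello_be))
print(detect_utf16_byte_order(hello_le))

# The generic "utf-16" codec reads the BOM, picks the byte order, and strips it.
print(hello_be.decode("utf-16"), hello_le.decode("utf-16"))
```

A string with no BOM falls through to "unknown", which is exactly the situation the last sentence above warns about.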

Thus was invented the brilliant concept of UTF-8. UTF-8 is another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed sequences of up to 6 bytes, but the current standard, RFC 3629, caps UTF-8 at 4.)
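To make the variable-length idea concrete, here is a small Python sketch that counts how many bytes UTF-8 needs for a few sample characters. The helper `utf8_length` and the sample characters are chosen only for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to store a single character."""
    return len(ch.encode("utf-8"))

# One sample from each length class: ASCII, Latin-1 range, BMP, beyond the BMP.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex(' ')}")
```

Code points below 128 come out as a single byte identical to their ASCII value, which is why plain English text looks the same in UTF-8 as it did in ASCII.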

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF-7, UTF-8, UTF-16, and UTF-32 all have the nice property of being able to store any code point correctly.
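The question-mark effect is easy to reproduce. The sketch below uses Python's built-in codecs; note that the question marks only appear because `errors="replace"` is passed explicitly (with Python's default strict mode the same call raises `UnicodeEncodeError` instead). The sample text is arbitrary.

```python
text = "Привет שלום"  # Russian and Hebrew letters, none of which exist in Latin-1

# errors="replace" substitutes "?" for every character the encoding cannot represent.
latin1 = text.encode("latin-1", errors="replace")
print(latin1)  # b'?????? ????'

# The UTF encodings can represent every code point, so the text survives a round trip.
round_trips = {
    enc: text.encode(enc).decode(enc) == text
    for enc in ("utf-7", "utf-8", "utf-16", "utf-32")
}
print(round_trips)
```

Once the text has been squeezed through Latin-1 this way, the original letters are gone for good; decoding the bytes again only gives back the question marks.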

I don't understand the difference between the two options. Is one better than the other? In my particular case, should I leave all 3 files alone or should I rename two of them? Only the middle one works for now as it is the only one that has a size bigger than 0.

Yes, the file name extension is the part of the name after the last dot in that name (if any; it may be missing). It's usually a few letters (typically 3 or 4, but it can be any number on most present-day systems). In particular, for the Portable Document Format file type it's "pdf" or ".pdf" (the dot is included for a more expressive representation, but formally it isn't an integral part of the extension itself; the last dot is just a separator between the base name and the extension).
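For what it's worth, the "everything after the last dot" rule is what Python's standard `os.path.splitext` implements, so a quick sketch can show what a given tool will treat as the extension. The file names here are invented examples.

```python
import os.path

def extension(name: str) -> str:
    """Return the extension including the dot, or '' if the name has none."""
    return os.path.splitext(name)[1]

for name in ["report.pdf", "archive.tar.gz", "README",
             "report.pdf (Unicode Encoding Conflict)"]:
    print(repr(name), "->", repr(extension(name)))
```

The last example is the interesting one: once text is appended after ".pdf", the "extension" becomes ".pdf (Unicode Encoding Conflict)", which no program will recognize as a PDF.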

No, definitely not! If the extension were different, then, since it is part of the file name, the file name would be different too. In that case a conflict would be impossible. Since you have a conflict (of whatever type), that means there is a name match (from Dropbox's point of view, at least).

Are you sure the extensions still match? Maybe the unchanged name's extension is something like ".pdf", but the changed name's extension looks like ".pdf (Unicode Encoding Conflict)" or something similar. Isn't that so? Again, the name's extension is everything after the last dot! You can try moving ".pdf" to the end of the name wherever it isn't there already. That's it.
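If there are many such files, the "move .pdf to the end" step could be scripted. This is only a sketch under the assumption that the conflict text was appended after the extension, as in the example above; `move_extension_to_end` is a hypothetical helper, and it only computes the new name without touching any file.

```python
def move_extension_to_end(name: str, ext: str = ".pdf") -> str:
    """Return `name` with `ext` moved to the very end, if it sits in the middle."""
    if name.endswith(ext):
        return name  # already fine
    idx = name.rfind(ext)
    if idx == -1:
        return name  # extension not found anywhere; leave the name alone
    # Cut the extension out of the middle and append it at the end.
    return name[:idx] + name[idx + len(ext):] + ext

print(move_extension_to_end("report.pdf (Unicode Encoding Conflict)"))
print(move_extension_to_end("report.pdf (Unicode Encoding Conflict (1))"))
print(move_extension_to_end("report.pdf"))
```

The result can then be passed to `os.rename` once you have checked that the new name looks right.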

As you probably know, a Unicode encoding conflict happens when two files or two folders with the same name are saved in the same location on your Dropbox account. To resolve this, Dropbox will append one with the words (Unicode Encoding Conflict).

You can resolve it by either of the ways you described, while it's recommended to avoid creating a file or folder with the same name as a file or folder in the same location to prevent this from happening again.


I also want to know whether all 3 files are the same file or whether they are different. I uploaded them to Dropbox a long time ago, so I don't remember. One has "Unicode Encoding Conflict" in its name, another has "Unicode Encoding Conflict (1)", and one has a normal name.

This is a deep topic related to how file and folder names (and not only names) are represented. In short, every symbol you see is represented by a corresponding code, and a sequence of codes is rendered as a sequence of corresponding glyphs on the display. The key point is that some letters can be represented by one or by several codes, i.e. the same-looking text can be represented by different code sequences. For most present-day systems that's not an issue: they just treat different representations as different names (even when they look exactly the same). Some old systems (ones that can't use Unicode, or support it only partially) can have trouble. Honestly, I haven't seen such a system for a very long time. Dropbox keeps supporting systems that effectively no longer exist (even though some lack of support for new systems can be noted) and normalizes names to some extent. As a result, files that can coexist on your local file system cannot coexist in the Dropbox namespace. When Dropbox meets such names, it renames some of them and only one remains unchanged (the issue you are observing). There is no guarantee which one remains unchanged. The rename doesn't affect the files' validity in any way. On some systems you can expect the file type not to be recognized correctly because of the improper naming, i.e. the changed extension. You can fix this by renaming the file by hand to restore the correct extension. Keep in mind that Dropbox doesn't fully support Unicode: some symbols can be valid on your local file system but invalid for Dropbox (it just isn't able to understand them, aside from the normalization issues)!
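Here is a small Python sketch of the "same look, different codes" situation, using the standard `unicodedata` module. The file name is invented, and whether Dropbox applies exactly this normalization form (NFC) is an assumption made for the sake of the example, not something confirmed here.

```python
import unicodedata

composed = "caf\u00e9.pdf"     # "é" stored as one code point, U+00E9
decomposed = "cafe\u0301.pdf"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Both render as the same name, but they are different code sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 8 9

# After normalizing both to the same form they compare equal, which is how
# a service that normalizes names would end up seeing a name clash.
same_after_nfc = (unicodedata.normalize("NFC", composed)
                  == unicodedata.normalize("NFC", decomposed))
print(same_after_nfc)                  # True
```

A local file system that compares names code by code would happily store both files side by side, which matches the situation described above.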

... When Dropbox meets such names, it renames some of them and only one remains unchanged (the issue you are observing). There is no guarantee which one remains unchanged. The rename doesn't affect the files' validity in any way. On some systems you can expect the file type not to be recognized correctly because of the improper naming, i.e. the changed extension. You can fix this by renaming the file by hand to restore the correct extension. ...

Again, some systems and other software can determine the file type and handle the file properly without any particular file extension being set. Unfortunately, there are cases where the exact file extension is a must (the Dropbox web interface is one such example)! Automatic file renaming usually leads to an inconsistent file name (more precisely, an inconsistent extension). That's what your file name repair needs to target. I hope you know what a file name extension is.

So, what you're basically saying, is that I need to figure out what the original (correct) file extension of that particular file was. So maybe it was a .docx or something, and I need to figure that out. Is that it? Thanks.
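If the original extension can't be remembered, one possible approach (not something suggested in this thread) is to look at the first few bytes of the file, since many formats start with a fixed signature. The sketch below is a rough illustration with a deliberately short, non-exhaustive signature list; the function names are made up for this example.

```python
# Leading "magic" bytes for a few common formats (far from a complete list).
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"PK\x03\x04", ".zip"),  # also .docx/.xlsx/.pptx, which are ZIP containers
]

def guess_extension_from_header(header: bytes) -> str:
    """Guess an extension from the first bytes of a file; '' if unrecognized."""
    for magic, ext in SIGNATURES:
        if header.startswith(magic):
            return ext
    return ""

def guess_extension(path: str) -> str:
    """Read the start of the file at `path` and guess its extension."""
    with open(path, "rb") as f:
        return guess_extension_from_header(f.read(16))

print(guess_extension_from_header(b"%PDF-1.7\n"))
print(guess_extension_from_header(b"PK\x03\x04\x14\x00"))
print(repr(guess_extension_from_header(b"")))  # an empty (0-byte) file gives no hint
```

A result of ".zip" for what should be a Word document is expected, because .docx files are ZIP archives internally; in that case the guess only narrows things down.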

It's becoming increasingly harder to have reasonable discussions about the differences between Python 2 and 3 because one language is dead and the other is actively developed. So when someone starts a discussion about the Unicode support between those two languages it's not an even playing field. So I won't discuss the actual Unicode support of those two languages but the core model of how to deal with text and bytes in both.

Since I have to maintain lots of code that deals exactly with the path between Unicode and bytes this regression from 2 to 3 has caused me lots of grief. Especially when I see slides by core Python maintainers about how I should trust them that 3.3 is better than 2.7 makes me more than angry.