
What extended ASCII character set uses 0x9D?


John Nagle

Aug 17, 2017, 8:14:40 PM
I'm cleaning up some data which has text description fields from
multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
And some are in some other character set. So I have to examine and
sanity check each field in a database dump, deciding which character
set best represents what's there.

Here's a hard case:

g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')

g1.decode("utf8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position
21: invalid start byte

g1.decode("windows-1252")
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
21: character maps to <undefined>

0x9d is unmapped in "windows-1252", according to

https://en.wikipedia.org/wiki/Windows-1252

So the Python codec isn't wrong here.
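
For reference, here is a minimal sketch that lists which single-byte values Python's cp1252 codec refuses to decode. The helper name `undefined_bytes` is made up for illustration; only the standard `bytes.decode` call is assumed.

```python
def undefined_bytes(encoding):
    """Return the single-byte values that the given codec cannot decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing

# Show the gaps in Python's Windows-1252 table as hex strings.
print([hex(b) for b in undefined_bytes("cp1252")])
```

0x9d is one of the few holes in that table, which is why the strict decode fails, while Latin-1 has no holes at all.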

Trying "latin-1"

g1.decode("latin-1")
'\\"Perfect Gift Idea\\"\x9d Each time'

That just converts 0x9d in the input to 0x9d in Unicode.
That's "Operating System Command" (the "Windows" key?)
That's clearly wrong; some kind of quote was intended.
Any ideas?


John Nagle

Chris Angelico

Aug 17, 2017, 8:28:08 PM
Another possibility is that it's some kind of dash or ellipsis or
something, but I can't find anything that does. (You already have
quote characters in there.) The nearest I can actually find is:

>>> b'\\"Perfect Gift Idea\\"\x9d Each time'.decode("1256")
'\\"Perfect Gift Idea\\"\u200c Each time'
>>> unicodedata.name("\u200c")
'ZERO WIDTH NON-JOINER'

which, honestly, doesn't make a lot of sense either. :(

ChrisA

John Nagle

Aug 17, 2017, 8:30:51 PM
On 08/17/2017 05:14 PM, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources.
A few more cases:

bytearray(b'miguel \xe3\x81ngel santos')
bytearray(b'lidija kmeti\xe4\x8d')
bytearray(b'\xe5\x81ukasz zmywaczyk')
bytearray(b'M\x81\x81\xfcnster')
bytearray(b'ji\xe5\x99\xe3\xad urban\xe4\x8d\xe3\xadk')
bytearray(b'\xe4\xbdubom\xe3\xadr mi\xe4\x8dko')
bytearray(b'petr urban\xe4\x8d\xe3\xadk')

0x9d is the most common; that occurs in English text. The others
seem to be in some Eastern European character set.

Understand, there's no metadata available to disambiguate this. What I
have is a big CSV file in which different character sets are mixed.
Each field has a uniform character set, so I need character set
detection on a per-field basis.

John Nagle

Ian Kelly

Aug 17, 2017, 8:40:06 PM
On Thu, Aug 17, 2017 at 6:27 PM, Chris Angelico <ros...@gmail.com> wrote:
> On Fri, Aug 18, 2017 at 10:14 AM, John Nagle <na...@animats.com> wrote:
> Another possibility is that it's some kind of dash or ellipsis or
> something, but I can't find anything that does. (You already have
> quote characters in there.) The nearest I can actually find is:
>
>>>> b'\\"Perfect Gift Idea\\"\x9d Each time'.decode("1256")
> '\\"Perfect Gift Idea\\"\u200c Each time'
>>>> unicodedata.name("\u200c")
> 'ZERO WIDTH NON-JOINER'
>
> which, honestly, doesn't make a lot of sense either. :(

In CP437 it's ¥ which makes some sense in the "gift idea" context. But
then I'd expect a number to appear with it.

It could also just be junk data.

Ian Kelly

Aug 17, 2017, 8:53:11 PM
On Thu, Aug 17, 2017 at 6:30 PM, John Nagle <na...@animats.com> wrote:
> A few more cases:
>
> bytearray(b'miguel \xe3\x81ngel santos')

If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the
rest of the name.

> bytearray(b'\xe5\x81ukasz zmywaczyk')

If that were b'\xc5\x81' it would be Ł in UTF-8 which would fit the
rest of the name.

I suspect the others contain similar errors. I don't know if it's the
result of some form of Mojibake or maybe just transcription errors.

Chris Angelico

Aug 17, 2017, 8:54:23 PM
On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <na...@animats.com> wrote:
> On 08/17/2017 05:14 PM, John Nagle wrote:
>> I'm cleaning up some data which has text description fields from
>> multiple sources.
> A few more cases:
>
> bytearray(b'\xe5\x81ukasz zmywaczyk')

This one has to be Polish, and the first character should be the
letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
very similar to the E5 81 that you have.

So here's an insane theory: something attempted to lower-case the byte
stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
like 0x45 or "E", which lower-cases by having 32 added to it, yielding
0xE5. Reversing this transformation yields sane data for several of
your strings - they then decode as UTF-8:

miguel Ángel santos
lidija kmetič
Łukasz zmywaczyk
jiří urbančík
Ľubomír mičko
petr urbančík

That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
are still a puzzle.
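
A minimal sketch of reversing that transformation, assuming the damage is exactly "add 0x20 to any byte whose low seven bits look like an ASCII capital". The function name `undo_ascii_lowercase` is hypothetical. Genuine three-byte UTF-8 lead bytes (0xE1 to 0xEF) would be mangled by this, so it should only be tried on fields that fail to decode as they stand.

```python
def undo_ascii_lowercase(raw):
    """Subtract 0x20 from bytes 0xE1..0xFA, then try strict UTF-8.

    Returns the decoded string, or None if the result is still not
    valid UTF-8 (meaning this theory does not explain the field).
    """
    repaired = bytes(b - 0x20 if 0xE1 <= b <= 0xFA else b for b in raw)
    try:
        return repaired.decode("utf-8")
    except UnicodeDecodeError:
        return None

samples = [
    b'miguel \xe3\x81ngel santos',
    b'lidija kmeti\xe4\x8d',
    b'\xe5\x81ukasz zmywaczyk',
    b'M\x81\x81\xfcnster',
    b'ji\xe5\x99\xe3\xad urban\xe4\x8d\xe3\xadk',
    b'\xe4\xbdubom\xe3\xadr mi\xe4\x8dko',
    b'petr urban\xe4\x8d\xe3\xadk',
]
for raw in samples:
    print(raw, "->", undo_ascii_lowercase(raw))
```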

ChrisA

Ian Kelly

Aug 17, 2017, 8:55:27 PM
On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly <ian.g...@gmail.com> wrote:
> On Thu, Aug 17, 2017 at 6:30 PM, John Nagle <na...@animats.com> wrote:
>> A few more cases:
>>
>> bytearray(b'miguel \xe3\x81ngel santos')
>
> If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the
> rest of the name.
>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
>
> If that were b'\xc5\x81' it would be Ł in UTF-8 which would fit the
> rest of the name.
>
> I suspect the others contain similar errors. I don't know if it's the
> result of some form of Mojibake or maybe just transcription errors.

Oh shit, I think I know what happened. In ASCII you can lower-case
letters by just adding 32 (0x20) to them. Somebody tried to do that
here and fucked up the encoding. That's why all the ASCII letters in
the strings are lower-case while these ones aren't.

Chris Angelico

Aug 17, 2017, 9:03:28 PM
On Fri, Aug 18, 2017 at 10:54 AM, Ian Kelly <ian.g...@gmail.com> wrote:
> On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly <ian.g...@gmail.com> wrote:
>> On Thu, Aug 17, 2017 at 6:30 PM, John Nagle <na...@animats.com> wrote:
>>> A few more cases:
>>>
>>> bytearray(b'miguel \xe3\x81ngel santos')
>>
>> If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the
>> rest of the name.
>>
>>> bytearray(b'\xe5\x81ukasz zmywaczyk')
>>
>> If that were b'\xc5\x81' it would be Ł in UTF-8 which would fit the
>> rest of the name.
>>
>> I suspect the others contain similar errors. I don't know if it's the
>> result of some form of Mojibake or maybe just transcription errors.
>
> Oh shit, I think know what happened. In ASCII you can lower-case
> letters by just adding 32 (0x20) to them. Somebody tried to do that
> here and fucked up the encoding. That's why all the ASCII letters in
> the strings are lower-case while these ones aren't.

That applies to some, but not all.

> bytearray(b'M\x81\x81\xfcnster')

This should be Münster, where the ü is U+00FC. You have 81 81 FC. I don't
know of any encoding that does this, but it looks indicative - and
it's not the lower-casing. And the 0x9d doesn't either, but maybe
that's some relation to 0x2d which is an ASCII hyphen?

ChrisA

Ian Kelly

Aug 17, 2017, 9:07:11 PM
On Thu, Aug 17, 2017 at 6:53 PM, Chris Angelico <ros...@gmail.com> wrote:
> That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
> are still a puzzle.

I'm fairly sure that b'M\x81\x81\xfcnster' is 'Münster'. It decodes to
that in Latin-1 if you remove the \x81 bytes. The question then is
what those extra bytes are doing there. I suspect that they and 0x9d
are just non-printing junk control bytes from the C1 set that got
inserted into the character stream somehow.
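
A minimal sketch of that idea, assuming the field really is Latin-1 with stray C1 control bytes (0x80 to 0x9F) mixed in. The helper name `strip_c1_controls` is made up for illustration.

```python
def strip_c1_controls(raw):
    """Decode as Latin-1, then drop code points U+0080..U+009F (C1 controls)."""
    text = raw.decode("latin-1")  # Latin-1 never fails: byte N becomes U+00NN
    return "".join(ch for ch in text if not ("\x80" <= ch <= "\x9f"))

print(strip_c1_controls(b'M\x81\x81\xfcnster'))
print(strip_c1_controls(b'"green"\x9d living'))
```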

Ben Bacarisse

Aug 17, 2017, 9:07:24 PM
I wrote a little shell script to try every encoding known to iconv and
the two most likely intended characters seem to be cedilla (if someone
mistook it for a comma) and a zero width non-joiner.

The former mainly comes from IBM character sets and the latter from IBM
and MS character sets (WINDOWS-1256 for example).

Neither seems very plausible so I'm betting on an error!
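
The same kind of survey can be sketched in Python instead of iconv. This is not the shell script mentioned above, just an illustration that walks the codec names in the standard `encodings.aliases` table; the function name `decodings_of` is hypothetical, and Python knows fewer encodings than iconv does.

```python
import unicodedata
from encodings.aliases import aliases

def decodings_of(byte_value):
    """Map codec name -> character for every codec that decodes this one byte
    to exactly one character. Codecs that fail for any reason are skipped."""
    results = {}
    for codec in sorted(set(aliases.values())):
        try:
            text = bytes([byte_value]).decode(codec)
        except Exception:
            # Unknown codec, non-text codec, or the byte is invalid there.
            continue
        if len(text) == 1:
            results[codec] = text
    return results

for codec, ch in decodings_of(0x9D).items():
    print(codec, "U+%04X" % ord(ch), unicodedata.name(ch, "<no name>"))
```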

--
Ben.

MRAB

Aug 17, 2017, 9:22:11 PM
It's preceded by something in quotes, so it might be ™ (trademark
symbol, '\u2122') or something similar. No idea which encoding that
would be, though.

MRAB

Aug 17, 2017, 10:16:23 PM
On 2017-08-18 01:53, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <na...@animats.com> wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>> I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few more cases:
>>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
>
> This one has to be Polish, and the first character should be the
> letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
> very similar to the E5 81 that you have.
>
> So here's an insane theory: something attempted to lower-case the byte
> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
> like 0x45 or "E", which lower-cases by having 32 added to it, yielding
> 0xE5. Reversing this transformation yields sane data for several of
> your strings - they then decode as UTF-8:
>
> miguel Ángel santos

I think that's:

miguel ángel santos

> lidija kmetič
> Łukasz zmywaczyk
> jiří urbančík
> Ľubomír mičko
> petr urbančík
>

MRAB

Aug 17, 2017, 10:24:48 PM
On 2017-08-18 01:30, John Nagle wrote:
> On 08/17/2017 05:14 PM, John Nagle wrote:
> > I'm cleaning up some data which has text description fields from
> > multiple sources.
> A few more cases:
>
> bytearray(b'miguel \xe3\x81ngel santos')
> bytearray(b'lidija kmeti\xe4\x8d')
> bytearray(b'\xe5\x81ukasz zmywaczyk')
> bytearray(b'M\x81\x81\xfcnster')

I suspect that it's b'M\xc3\xbcnster', i.e. 'Münster'.encode('utf-8')

Ian Kelly

Aug 17, 2017, 10:24:48 PM
On Thu, Aug 17, 2017 at 8:15 PM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 2017-08-18 01:53, Chris Angelico wrote:
>> So here's an insane theory: something attempted to lower-case the byte
>> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
>> like 0x45 or "E", which lower-cases by having 32 added to it, yielding
>> 0xE5. Reversing this transformation yields sane data for several of
>> your strings - they then decode as UTF-8:
>>
>> miguel Ángel santos
>
>
> I think that's:
>
> miguel ángel santos

It would be if it had been lower-cased correctly. The UTF-8 for á is
\xc3\xa1, not \xe3\x81 (ironically the add-32 method still works in
this particular case; it was just added to the wrong byte).

John Nagle

Aug 17, 2017, 11:46:54 PM
On 08/17/2017 05:53 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <na...@animats.com> wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>> I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few more cases:
>>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
>
> This one has to be Polish, and the first character should be the
> letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
> very similar to the E5 81 that you have.
>
> So here's an insane theory: something attempted to lower-case the byte
> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
> like 0x45 or "E", which lower-cases by having 32 added to it, yielding
> 0xE5. Reversing this transformation yields sane data for several of
> your strings - they then decode as UTF-8:
>
> miguel Ángel santos
> lidija kmetič
> Łukasz zmywaczyk
> jiří urbančík
> Ľubomír mičko
> petr urbančík

I think you're right for those. I'm working from a MySQL dump of
supposedly LATIN-1 data, but LATIN-1 will accept anything. I've
found UTF-8 and Windows-1252 in there. It's quite possible that someone
lower-cased UTF-8 stored in a LATIN-1 field. There are lots of
questions on the web which complain about getting a Python decode error
on 0x9d, and the usual answer is "Use Latin-1". But that doesn't really
decode properly, it just doesn't generate an exception.

> That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
> are still a puzzle.

The 0x9d thing seems unrelated to the Polish names thing. 0x9d
shows up in the middle of English text that's otherwise ASCII.
Is this something that can appear as a result of cutting and
pasting from Microsoft Word?

I'd like to get 0x9d right, because it comes up a lot. The
Polish name thing is rare. There's only about a dozen of those
in 400MB of database dump. There are hundreds of 0x9d hits.

Here's some more 0x9d usage, each from a different data item:


Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"

for example \\"I\\\'ve seen the bull run in Pamplona, Spain\x9d.\\"
Everything

Netwise Depot is a \\"One Stop Web Shop\\"\x9d that provides

sustainable \\"green\\"\x9d living

are looking for a \\"Do It for Me\\"\x9d solution


This has me puzzled. It's often, but not always, after a close quote.
"TM" or "(R)" might make sense, but what non-Unicode character set
has those? And "green"(tm) makes no sense.

John Nagle


Ian Kelly

Aug 18, 2017, 1:13:13 AM
On Thu, Aug 17, 2017 at 9:46 PM, John Nagle <na...@animats.com> wrote:
> The 0x9d thing seems unrelated to the Polish names thing. 0x9d
> shows up in the middle of English text that's otherwise ASCII.
> Is this something that can appear as a result of cutting and
> pasting from Microsoft Word?
>
> I'd like to get 0x9d right, because it comes up a lot. The
> Polish name thing is rare. There's only about a dozen of those
> in 400MB of database dump. There are hundreds of 0x9d hits.
>
> Here's some more 0x9d usage, each from a different data item:
>
>
> Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
> Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"

This one seems like a good hint since \x99 here looks like it should
be an apostrophe. But what character set has an apostrophe there? The
best I can come up with is that 0xE2 0x80 0x99 is "right single
quotation mark" in UTF-8. Also known as the "smart apostrophe", so it
could have been entered by a word processor.

The problem is that if that's what it is, then two out of the three
bytes are outright missing. If the same thing happened to \x9d then
who knows what's missing from it?

One possibility is that it's the same two bytes. That would make it
0xE2 0x80 0x9D which is "right double quotation mark". Since it keeps
appearing after ending double quotes that seems plausible, although
one has to wonder why it appears *in addition to* the ASCII double
quotes.
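
The byte arithmetic behind that guess can be checked directly. A small sketch, where `with_missing_prefix` is a made-up helper that puts the hypothetical lost bytes 0xE2 0x80 back in front of a stray byte:

```python
import unicodedata

def with_missing_prefix(stray):
    """Prepend the assumed-lost bytes E2 80 and decode the triple as UTF-8."""
    return bytes([0xE2, 0x80, stray]).decode("utf-8")

for stray in (0x99, 0x9D):
    ch = with_missing_prefix(stray)
    print(hex(stray), "U+%04X" % ord(ch), unicodedata.name(ch))
```

Both stray bytes land on word-processor "smart" punctuation under this assumption, which fits where they show up in the text.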

> This has me puzzled. It's often, but not always after a close quote.
> "TM" or "(R)" might make sense, but what non-Unicode character set
> has those. And "green"(tm) makes no sense.

CP-1252 has ™ at \x99, perhaps coincidentally. CP-1252 and Latin-1
both have ® at \xae.

Steve D'Aprano

Aug 18, 2017, 1:26:15 AM
On Fri, 18 Aug 2017 10:14 am, John Nagle wrote:

> I'm cleaning up some data which has text description fields from
> multiple sources. Some are are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a database dump, deciding which character
> set best represents what's there.
>
> Here's a hard case:
>
> g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')

py> unicodedata.name(b'\x9d'.decode('macroman'))
'LATIN SMALL LETTER U WITH GRAVE'

Doesn't seem too likely.

This may help:

http://i18nqa.com/debug/bug-double-conversion.html


There's always the possibility that it's just junk, or moji-bake from some other
source, so it might not be anything sensible in any extended ASCII character
set.




--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

John Nagle

Aug 18, 2017, 2:25:20 AM
On 08/17/2017 10:12 PM, Ian Kelly wrote:

>> Here's some more 0x9d usage, each from a different data item:
>>
>>
>> Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
>> Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"
>
> This one seems like a good hint since \x99 here looks like it should
> be an apostrophe. But what character set has an apostrophe there? The
> best I can come up with is that 0xE2 0x80 0x99 is "right single
> quotation mark" in UTF-8. Also known as the "smart apostrophe", so it
> could have been entered by a word processor.
>
> The problem is that if that's what it is, then two out of the three
> bytes are outright missing. If the same thing happened to \x9d then
> who knows what's missing from it?
>
> One possibility is that it's the same two bytes. That would make it
> 0xE2 0x80 0x9D which is "right double quotation mark". Since it keeps
> appearing after ending double quotes that seems plausible, although
> one has to wonder why it appears *in addition to* the ASCII double
> quotes.

I was wondering if it was a signal to some word processor to
apply smart quote handling.

>> This has me puzzled. It's often, but not always after a close quote.
>> "TM" or "(R)" might make sense, but what non-Unicode character set
>> has those. And "green"(tm) makes no sense.
>
> CP-1252 has ™ at \x99, perhaps coincidentally. CP-1252 and Latin-1
> both have ® at \xae.

That's helpful. All those text snippets failed Windows-1252
decoding, though, because 0x9d isn't in Windows-1252.

I'm coming around to the idea that some of these snippets
have been previously mis-converted, which is why they make no sense.
Since, as someone pointed out, there was UTF-8 which had been
run through an ASCII-type lower casing algorithm, that's a reasonable
assumption. Thanks for looking at this, everyone. If a string won't
parse as either UTF-8 or Windows-1252, I'm just going to convert the
bogus stuff to the Unicode replacement character. I might remove
0x9d chars, since that never seems to affect readability.
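
Roughly, that plan could look like the sketch below. The function name `decode_field` and the `drop_9d` flag are illustrative, not anything from the real cleanup code. UTF-8 is tried first because it is far stricter than Windows-1252, so a successful UTF-8 decode is much stronger evidence.

```python
def decode_field(raw, drop_9d=True):
    """Decode one field: strict UTF-8, then strict cp1252, optionally retry
    with 0x9d bytes removed, and finally fall back to U+FFFD replacement."""
    candidates = [raw]
    if drop_9d and b"\x9d" in raw:
        candidates.append(raw.replace(b"\x9d", b""))
    for candidate in candidates:
        for encoding in ("utf-8", "cp1252"):
            try:
                return candidate.decode(encoding)
            except UnicodeDecodeError:
                continue
    # Nothing decoded cleanly: undecodable bytes become U+FFFD.
    return candidates[-1].decode("cp1252", errors="replace")

print(decode_field(b'"Perfect Gift Idea"\x9d Each time'))
print(decode_field(b'M\x81\x81\xfcnster'))
```

One caveat with this sketch: deleting 0x9d from a field that contains a genuine UTF-8 sequence ending in 0x9d, alongside other bad bytes, would damage that sequence, so the flag is best left off where that is a concern.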

John Nagle

Chris Angelico

Aug 18, 2017, 2:31:56 AM
On Fri, Aug 18, 2017 at 4:24 PM, John Nagle <na...@animats.com> wrote:
> I'm coming around to the idea that some of these snippets
> have been previously mis-converted, which is why they make no sense.
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower casing algorithm, that's a reasonable
> assumption. Thanks for looking at this, everyone. If a string won't
> parse as either UTF-8 or Windows-1252, I'm just going to convert the
> bogus stuff to the Unicode replacement character. I might remove
> 0x9d chars, since that never seems to affect readability.

That sounds like a good plan. Unless you can pin down a single
coherent encoding (even a broken one, like "UTF-8, then add 32 to
everything between 0xC1 and 0xDA"), all you have is decoding
individual strings. There just isn't enough context to do anything
smarter than flipping unparseable bytes to U+FFFD.

ChrisA

Paul Rubin

Aug 18, 2017, 2:38:39 AM
John Nagle <na...@animats.com> writes:
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower casing algorithm

I spent a few minutes figuring out if some of the mysterious 0x81's
could be from ASCII-lower-casing some Unicode combining characters, but
the numbers didn't seem to work out. Might still be worth looking for
in some other cases.

Chris Angelico

Aug 18, 2017, 2:49:50 AM
They can't be from anything like that. Lower-casing in ASCII consists
of adding 32 (or setting the fifth bit) on certain byte/character
values. Subtracting 32 from 0x81 gives 0x61 which is lower-case letter
'a'; the fifth bit isn't set in 0x81. So there's no way that UTF-8 +
dumb lowercasing could give you 0x81.
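
That can be confirmed by brute force over all 256 byte values. A small sketch, assuming the "dumb" lower-casing sets bit 0x20 on any byte whose low seven bits fall in A to Z; `naive_lower` is a hypothetical model of the bug, not recovered code.

```python
def naive_lower(b):
    """Model of the suspected bug: ignore the top bit when testing for a
    capital letter, then set bit 0x20."""
    if 0x41 <= (b & 0x7F) <= 0x5A:
        return b | 0x20
    return b

# Which input bytes could ever come out as 0x81?
sources = [b for b in range(256) if naive_lower(b) == 0x81]
print([hex(b) for b in sources])
```

The only byte that maps to 0x81 is 0x81 itself, so the lower-casing bug cannot have created it.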

ChrisA

Marko Rauhamaa

Aug 18, 2017, 2:57:20 AM
Chris Angelico <ros...@gmail.com>:

> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin <no.e...@nospam.invalid> wrote:
>> John Nagle <na...@animats.com> writes:
>>> Since, as someone pointed out, there was UTF-8 which had been
>>> run through an ASCII-type lower casing algorithm
>>
>> I spent a few minutes figuring out if some of the mysterious 0x81's
>> could be from ASCII-lower-casing some Unicode combining characters,
>> but the numbers didn't seem to work out. Might still be worth looking
>> for in some other cases.
>
> They can't be from anything like that. Lower-casing in ASCII consists
> of adding 32 (or setting the fifth bit) on certain byte/character
> values.

How about lower-casing?


Marko

Chris Angelico

Aug 18, 2017, 3:07:04 AM
Huh?

ChrisA

Marko Rauhamaa

Aug 18, 2017, 3:12:03 AM
s/lower/upper/


Marko

Chris Angelico

Aug 18, 2017, 3:25:52 AM
Ohh. We have no evidence that uppercasing is going on here, and a
naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
this one's still a mystery.

ChrisA

Marko Rauhamaa

Aug 18, 2017, 3:39:49 AM
Chris Angelico <ros...@gmail.com>:

> Ohh. We have no evidence that uppercasing is going on here, and a
> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
> this one's still a mystery.

BTW, I was reading up on the history of ASCII control characters. Quite
fascinating.

For example, have you ever wondered why DEL is the odd control character
out at the code point 127? The reason turns out to be paper punch tape.
By backstepping and punching a DEL over the previous ASCII character you
can "rub out" the character.

(I got interested in the control characters after reading the sad spec
RFC 7464.)


Marko

Chris Angelico

Aug 18, 2017, 3:46:13 AM
On Fri, Aug 18, 2017 at 5:39 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Chris Angelico <ros...@gmail.com>:
>
>> Ohh. We have no evidence that uppercasing is going on here, and a
>> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
>> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
>> this one's still a mystery.
>
> BTW, I was reading up on the history of ASCII control characters. Quite
> fascinating.
>
> For example, have you ever wondered why DEL is the odd control character
> out at the code point 127? The reason turns out to be paper punch tape.
> By backstepping and punching a DEL over the previous ASCII character you
> can "rub out" the character.

Yeah. Bvvvvvvvp, no more character there :) I'm not old enough to have
actually worked with those technologies (although I do have a punched
card somewhere around, being used as a bookmark), but a lot of them
have influenced the standards that we still use, so it's well worth
studying the history!

ChrisA

MRAB

Aug 18, 2017, 5:58:52 AM
I googled for """Netwise Depot is a""" and found this page:

https://www.crunchbase.com/organization/netwise-depot#/entity

It has the text:

Netwise Depot is a "One Stop Web Shop" that provides a holistic
solution

Put that through the ascii function and you get:

'Netwise Depot is a "One Stop Web Shop"\x9d that provides a
holistic solution'

OK. Try another one.

Google for """Guitar Pro, JamPlay, RedBana""":

https://www.crunchbase.com/organization/the-rights-workshop#/entity

Look familiar?

That page has:

Guitar Pro, JamPlay, RedBana's Audition, Doppleganger™s

Is that where the data comes from?

Random832

Aug 18, 2017, 1:32:53 PM
On Fri, Aug 18, 2017, at 03:39, Marko Rauhamaa wrote:
> BTW, I was reading up on the history of ASCII control characters. Quite
> fascinating.
>
> For example, have you ever wondered why DEL is the odd control character
> out at the code point 127? The reason turns out to be paper punch tape.
> By backstepping and punching a DEL over the previous ASCII character you
> can "rub out" the character.

I assume this is also why teletypes used even parity - so 0xFF can be
used as DEL on characters that had any parity.

Piet van Oostrum

Aug 18, 2017, 5:05:07 PM
Marko Rauhamaa <ma...@pacujo.net> writes:

> Chris Angelico <ros...@gmail.com>:
>
>> Ohh. We have no evidence that uppercasing is going on here, and a
>> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
>> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
>> this one's still a mystery.
>
> BTW, I was reading up on the history of ASCII control characters. Quite
> fascinating.
>
> For example, have you ever wondered why DEL is the odd control character
> out at the code point 127? The reason turns out to be paper punch tape.
> By backstepping and punching a DEL over the previous ASCII character you
> can "rub out" the character.
>
Sure, I have done that many times. Years ago.
--
Piet van Oostrum <pie...@vanoostrum.org>
WWW: http://piet.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

John Nagle

Aug 18, 2017, 5:43:13 PM
On 08/17/2017 05:53 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle <na...@animats.com> wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>> I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few more cases:
>>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
>
> This one has to be Polish, and the first character should be the
> letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
> very similar to the E5 81 that you have.
>
> So here's an insane theory: something attempted to lower-case the byte
> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
> like 0x45 or "E", which lower-cases by having 32 added to it, yielding
> 0xE5. Reversing this transformation yields sane data for several of
> your strings - they then decode as UTF-8:
>
> miguel Ángel santos
> lidija kmetič
> Łukasz zmywaczyk
> jiří urbančík
> Ľubomír mičko
> petr urbančík

You're exactly right. The database has columns "name" and
"normalized name". Normalizing the name was done by forcing it
to lower case as if in ASCII, even for UTF-8. That resulted in
errors like

KACMAZLAR MEKANİK -> kacmazlar mekanä°k

Anita Calçados -> anita calã§ados

Felfria Resor för att Koh Lanta -> felfria resor fã¶r att koh lanta

The "name" field is OK; it's just the "normalized name" field
that is sometimes garbaged. Now that I know this, and have properly
captured the "name" field in UTF-8 where appropriate, I can
regenerate the "normalized name" field. MySQL/MariaDB know how
to lower-case UTF-8 properly.
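
The three examples above can be reproduced with a short sketch. It assumes the normalizer worked on the UTF-8 bytes, ignored the top bit when testing for capitals, and that the result is then viewed as Latin-1. The function names are made up for illustration; the original normalization code is not shown in this thread.

```python
def naive_ascii_lower(raw):
    """Set bit 0x20 on every byte whose low seven bits look like A-Z."""
    return bytes(b | 0x20 if 0x41 <= (b & 0x7F) <= 0x5A else b for b in raw)

def simulate_normalized_name(name):
    """UTF-8 encode, apply the byte-level lower-casing, view as Latin-1."""
    return naive_ascii_lower(name.encode("utf-8")).decode("latin-1")

for name in ("KACMAZLAR MEKANİK", "Anita Calçados",
             "Felfria Resor för att Koh Lanta"):
    print(name, "->", simulate_normalized_name(name))
```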

Clean data at last. Thanks.

The database, by the way, is a historical snapshot of startup
funding, from Crunchbase.

John Nagle

Gregory Ewing

Aug 19, 2017, 11:18:07 AM
Ian Kelly wrote:
> One possibility is that it's the same two bytes. That would make it
> 0xE2 0x80 0x9D which is "right double quotation mark". Since it keeps
> appearing after ending double quotes that seems plausible, although
> one has to wonder why it appears *in addition to* the ASCII double
> quotes.

Maybe something tried to replace right double quote marks
with ascii double quotes, but got it wrong by only replacing
2 bytes instead of 3.

--
Greg

Gregory Ewing

Aug 22, 2017, 3:15:43 AM
Chris Angelico wrote:
> a naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
> this one's still a mystery.

It's unlikely that even a naive ascii upper/lower casing algorithm
would be *that* naive; it would have to check that the character
appeared to be a letter before changing it.

You might expect bytes >= 0x80 to be classed as non-letters by
that test, but what if it ignores the top bit or assumes it's
a parity bit to be left alone? What do you get under those
assumptions?

--
Greg

Chris Angelico

Aug 22, 2017, 4:52:23 AM
Exactly, I do assume that it's checking for it to be a letter. But
everything previously has been on the assumption that it ignores the
top bit. That's how we got this far.

ChrisA