unicode problems with discogs data

Kurt J

Feb 23, 2010, 10:48:32 AM
to datain...@googlegroups.com
Hello,

I've been back and forth with Leigh about this offline, but thought
I'd post to the list in case anyone knew a quick fix.

The discogs endpoint seems to mangle unicode names. This not only
creates bad looking foaf:names but also mo:discogs urls that are wrong
and 404 as well as several other problems (see
http://www.discogs.com/artist/Piotr+Illitch+Tcha%C3%83%C2%AFkovsky).

The ruby scripts that do the transforms are efficient and well coded
but they don't work around ruby 1.8's lack of unicode support.
Strings are a collection of bytes, not characters, so names like
Tchaïkovsky come out TchaÃ¯kovsky.
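
A minimal sketch of the symptom, assuming a modern Ruby (1.9 or later, where strings carry an encoding) rather than the 1.8 the scripts run on. It does not show where the mangling happens, only that treating each UTF-8 byte as its own character reproduces the broken URL above. The variable names are illustrative.

```ruby
require 'uri'

# "\u00EF" is the single character LATIN SMALL LETTER I WITH DIAERESIS.
i_diaeresis = "\u00EF"
puts i_diaeresis.length    # => 1 (characters)
puts i_diaeresis.bytesize  # => 2 (bytes once encoded as UTF-8)

name = "Piotr Illitch Tcha\u00EFkovsky"

# Misread the UTF-8 bytes as ISO-8859-1 (one character per byte), then
# re-encode as UTF-8: each original byte becomes a character of its own.
mangled = name.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Percent-encode both forms as they would appear in a URL.
good_path    = URI.encode_www_form_component(name)
mangled_path = URI.encode_www_form_component(mangled)

puts good_path     # => Piotr+Illitch+Tcha%C3%AFkovsky
puts mangled_path  # => Piotr+Illitch+Tcha%C3%83%C2%AFkovsky
```

The second path is the one that returns a 404.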

I had a go at fixing this, but I'm a total newbie in ruby. I tried
setting $KCODE and such but to no avail. Using ruby 1.9 would
allegedly fix this but Ubuntu doesn't have libxml bindings for 1.9 so
I didn't bother trying it.

The code in question is here

http://code.google.com/p/dataincubator/source/browse/#svn/trunk/discogs/scripts

Specifically, lib/Utils.rb seems to do the escaping and such.

Any ideas?

-Kurt J

Ian Davis

Feb 23, 2010, 3:35:02 PM
to datain...@googlegroups.com
On Tue, Feb 23, 2010 at 3:48 PM, Kurt J <kur...@gmail.com> wrote:
> The discogs endpoint seems to mangle unicode names.  This not only
> creates bad looking foaf:names but also mo:discogs urls that are wrong
> and 404 as well as several other problems (see
> http://www.discogs.com/artist/Piotr+Illitch+Tcha%C3%83%C2%AFkovsky).
>
> The ruby scripts that do the transforms are efficient and well coded
> but they fail to address ruby 1.8's failure to address unicode.
> Strings are a collection of bytes not characters so names like
> Tchaïkovsky come out TchaÃ¯kovsky.
>
> I had a go at fixing this, but i'm a total NB in ruby.  I tried
> setting $KCODE and such but to no avail.  Using ruby 1.9 would
> allegedly fix this but ubuntu doesn't have libxml bindings for 1.9 so
> i didn't bother trying it.

Sounds like an annoying problem. I'm about equal to you in Ruby skills
though :)

Anyone else got any ideas?

Ian

Chris Clarke

Feb 23, 2010, 3:51:03 PM
to datain...@googlegroups.com

I used $KCODE = 'u' when processing some CSV with Unicode for periodicals@dataincubator, but I guess you've already tried that... I'm on ruby 1.8.6

Chris

--
You received this message because you are subscribed to the Google Groups "Data Incubator" group.
To post to this group, send email to datain...@googlegroups.com.
To unsubscribe from this group, send email to dataincubato...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/dataincubator?hl=en.


 


 
 

Leigh Dodds

Mar 2, 2010, 6:53:19 AM
to dataincubator
Hi Kurt,

Apologies for taking a few days to respond.

I spent some time looking at the issue this morning. It looks like
there are several issues:

- problems with mangled unicode
- problems with URLs

The second is a consequence of the first, but is likely to have a separate fix.

I've been looking at the conversion scripts to see where the mangling
is happening. However I'm not convinced that Ruby is actually the
cause of the problem. I took some of the XML markup from the discogs
exports and ran it through some XML command-line tools (xmllint) to
convert to different encodings, and also investigated how Ruby handles
the characters. The issues still seem to occur outside of a Ruby
environment.

Looking more closely at the discogs data, and at Tchaikovsky
specifically, I see that many of the name variants are marked up as:

Tcha&#195;&#175;kovsky

This says there are two characters, with unicode decimal codepoints
195 (LATIN CAPITAL LETTER A WITH TILDE) and 175 (MACRON). This is what
the conversion is producing, but it's clearly not correct. The hex
values for these are C3 and AF. However, what I think it's supposed
to be is LATIN SMALL LETTER I WITH DIAERESIS, which is unicode 239
(EF). If you encode that in UTF-8 then you get 0xC3 0xAF.
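
The arithmetic above can be checked with a short sketch, assuming a modern Ruby (1.9 or later). decode_char_refs is a hypothetical helper written for this illustration, not something from the discogs scripts, and it only handles decimal character references.

```ruby
# Decode decimal XML character references such as "&#195;" into characters.
# Hypothetical helper for illustration only.
def decode_char_refs(text)
  text.gsub(/&#(\d+);/) { [Regexp.last_match(1).to_i].pack('U') }
end

exported = 'Tcha&#195;&#175;kovsky'
decoded  = decode_char_refs(exported)

# The decoded name contains U+00C3 followed by U+00AF: two characters.
puts decoded.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')

# The intended character, U+00EF (decimal 239), is the bytes C3 AF in UTF-8.
intended = [239].pack('U')
puts intended.bytes.map { |b| format('%02X', b) }.join(' ')  # => C3 AF

# Decimal 195 and 175 are exactly those two byte values.
puts [195, 175].map { |n| format('%02X', n) }.join(' ')      # => C3 AF
```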

So it looks to me like there's a bug in the discogs export that is
taking the two UTF-8 bytes of a single character and writing each of
them to the XML export as a separate unicode character, rather than
writing a single character. This mangling is then preserved all the
way through the process :)
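
If that analysis holds, the mangling should be reversible: take each character's codepoint as a byte value and read the resulting bytes as UTF-8. A rough sketch, assuming a modern Ruby (1.9 or later); repair_double_encoding is a hypothetical name, and nothing like it exists in the conversion scripts.

```ruby
# Undo one round of the mangling described above. If every character fits
# in a single byte (U+0000..U+00FF) and those bytes form valid UTF-8,
# reinterpret them; otherwise return the input unchanged.
# Hypothetical helper for illustration only.
def repair_double_encoding(text)
  return text if text.ascii_only?
  return text unless text.each_char.all? { |ch| ch.ord <= 0xFF }

  candidate = text.codepoints.pack('C*').force_encoding('UTF-8')
  candidate.valid_encoding? ? candidate : text
end

mangled  = "Tcha\u00C3\u00AFkovsky"  # the form found in the export
repaired = repair_double_encoding(mangled)
puts repaired.codepoints.include?(0xEF)  # => true

# A correctly encoded name is not valid UTF-8 when its codepoints are
# read as bytes, so it is left alone.
correct = "Tcha\u00EFkovsky"
puts repair_double_encoding(correct) == correct  # => true
```

This is a heuristic: a string that genuinely contains a sequence such as U+00C3 U+00AF would be rewritten as well, so it is only safe on data known to be mangled.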

Seem like a reasonable analysis to everyone else?

I note that the discogs online API does deliver characters in UTF-8
encoding correctly. I think the best way to get this fixed is by
reporting a problem to discogs themselves.

This may identify further problems in the conversion, but should move
us forward a little.

Cheers,

L.

--
Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh...@talis.com
http://www.talis.com

Kurt J

Mar 2, 2010, 1:38:19 PM
to Data Incubator
Hi Leigh,

Looking more closely, I think you're right. Just like me to bang my
head against ruby for hours and assume the underlying data is fine ;-)

Have you contacted discogs? I went ahead and made the following bug
report to discogs:

http://www.discogs.com/disbugs/1439

Thanks,
Kurt J
