utf-8 via latin-1 view

1,719 views
Skip to first unread message

jnh

unread,
Jul 8, 2010, 5:59:16 PM7/8/10
to Sequel Pro
Greetings,

I'm trying to convert a legacy database which I is set as UTF-8, but
somewhere in the original insert sequence some of the data was
converted to latin1 and is thus stored in as a garbled mess ( ie
ФОМУШКА is in the database as ФОМУШКР). I've found that in
sequel-pro setting Database -> View Using Encoding - > UTF-8 unicode
via Latin 1 makes the data render correctly in the Sequel-Pro view,
yet the web output AND andy other View Using Encoding still shows the
garbled version. I'm wondering what "UTF-8 unicode via Latin 1" does
exactly and if there is anyway to permanently set the data in the
database so that the data is always correctly encoded and displayed.

Thanks,
jnh

Rowan Beentje

unread,
Jul 8, 2010, 6:27:23 PM7/8/10
to seque...@googlegroups.com
Hi jnh,

A slightly tricky one!

Basically, the data which was originally stored was originally UTF8 data, and it may or may not be stored on the server in a UTF8 table. The key thing that's usually responsible for getting the data in this state is that the database connection itself was using Latin1 - often the default state of the connection.

It used to happen quite a lot in web apps a couple of years ago - the front end would be updated to handle UTF8 data, and that data was saved in the database - but over a mysql connection that never had a 'SET NAMES utf8' (or wrapped database method) set. Each UTF8 multibyte character was therefore converted to a pair of Latin1 characters during transport between the database and the web app, although each end treats the data as UTF8.

If the same connection was used to read the data type - no problem, as the whole process is entirely reversible! This is what the "UTF-8 unicode via Latin 1" mode is emulating.

To convert the data, the easiest thing is to use a script in your favoured web language to perform the conversion. The easiest, but slowest, way is to read data while the connection is in Latin1, and then save it back over a connection set to UTF8 (and make sure all your tables are set to UTF8!).

Hope that makes a little bit of sense - it's rather tricky to explain...
--Rowan

> --
> You received this message because you are subscribed to the Google Groups "Sequel Pro" group.
> To post to this group, send email to seque...@googlegroups.com.
> To unsubscribe from this group, send email to sequel-pro+...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/sequel-pro?hl=en.
>

David

unread,
Jul 9, 2010, 2:21:12 AM7/9/10
to Sequel Pro
We used to have good experiences exporting a database via mysqldump
and using --default-character-set=utf8 on the mysql import command,
e.g.

$ mysqldump sourcedb -u username -p > dump.sql

$ mysql targetdb -u username -p --default-character-set=utf8 <
dump.sql

This is from memory, so don't use that on production data directly,
but test it beforehand.

Regards,
David

jnh

unread,
Jul 14, 2010, 5:08:30 PM7/14/10
to Sequel Pro
Thanks to all for the help! The suggestions confirmed my suspicions
and helped me get down to figuring out what was going on. It was a
nasty overall problem, as there were numerous encoding types in the
database not mention any lack of charset specification in the old code/
my.conf file. Uggh.

The end goal was to get properly formatted utf-8 data to use in
Django. What ultimately worked was the following:

- Search specific database fields for troublesome non-latin1 or funky/
mysterious characters (windows/cp1252 for the most part) and replace
them with their ascii equivalents:
python example to replace a single right quote ('): newstring =
field.replace("\x92", "\x27")

- Perform the following sql the table per column with to convert
malformed utf8 via latin1 chars to binary and then rencode them. Watch
out for retaining the field type (varchar, text, etc)

SET NAMES latin1;
ALTER TABLE table_name1 MODIFY COLUMN column_name1 TEXT CHARACTER SET
latin1;
ALTER TABLE table_name1 MODIFY COLUMN column_name1 blob;
ALTER TABLE table_name1 MODIFY COLUMN column_name1 TEXT CHARACTER SET
utf8;
etc for rest of tables/columns
SET NAMES utf8;

-I also updated the my.cnf field to ensure that future databases will
have utf8 as their default encoding (and restarted mysqld):

[mysql]
default-character-set = utf8

[mysqld]
init_connect='SET NAMES utf8'
default-character-set = utf8

[mysql.server]
default-character-set = utf8

[client]
default-character-set = utf8

Hope this helps someone save some hours troubleshooting.

-jnh

Ricky Muniz

unread,
Nov 4, 2013, 10:01:57 AM11/4/13
to seque...@googlegroups.com, nick.r...@gmail.com
I was having this issue with Node.js Sequelize MySQL module and using "SET NAMES latin1" did it for me. Thanks a lot!
Reply all
Reply to author
Forward
0 new messages