Automating charset encoding/collation conversion for MySQL db.

Jens Troeger

Dec 3, 2019, 7:50:46 PM
to sqlalchemy-alembic
Hello,

Using a MySQL database, keeping Unicode strings under control has turned out to be a bit of a challenge. I could have sworn that the character encoding and collation were set up correctly, i.e.

+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8                     | utf8_general_ci      |
+--------------------------+----------------------+

Alas, that turned out not to be the case. So now I’m looking at a few databases where random tables and columns are back to the latin1 character set. I’m trying to get to a solution that takes a database, checks & switches its charset/collation, then iterates over all tables, checks & switches their charset/collation, and then iterates over the columns and (for appropriate types) checks & switches their charset/collation.

I can do that by issuing SQL statements directly:

SELECT @@character_set_database, @@collation_database;

then if need be

ALTER DATABASE db_name CHARACTER SET "utf8" COLLATE "utf8_general_ci";

Similarly for tables (link) and table columns (link). 
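
For what it’s worth, here is roughly the kind of script I have in mind, driving those statements from information_schema (the connection URL, database name, and exact filters are placeholders; this is a sketch, not tested code):

from sqlalchemy import create_engine, text

# Placeholder connection URL and database name.
engine = create_engine("mysql+pymysql://user:password@localhost/db_name")

with engine.connect() as conn:
    # Database level: check, then switch if needed.
    charset, collation = conn.execute(
        text("SELECT @@character_set_database, @@collation_database")
    ).fetchone()
    if (charset, collation) != ("utf8", "utf8_general_ci"):
        conn.execute(text(
            "ALTER DATABASE db_name CHARACTER SET utf8 COLLATE utf8_general_ci"
        ))

    # Table level: find base tables whose collation is off and convert them.
    # CONVERT TO CHARACTER SET also converts the table's character columns,
    # so a separate per-column pass should only be needed for exceptions.
    rows = conn.execute(text(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'db_name' AND table_type = 'BASE TABLE' "
        "AND table_collation != 'utf8_general_ci'"
    ))
    for (table_name,) in rows:
        conn.execute(text(
            "ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8 "
            "COLLATE utf8_general_ci" % table_name
        ))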

However, I wonder if there is a more elegant way with SQLAlchemy than issuing SQL directly?

I have also noticed that some string types use Unicode (which maps to varchar) and others use UnicodeText (which maps to mediumtext). I vaguely remember that there were size constraints: I think a maximum of 65k bytes across all columns of a table, whereas mediumtext doesn’t count against that 65k limit? Disregarding that, converting from UnicodeText to Unicode should not impact the value, right? While cleaning up the character set encoding and collation, I thought it may be good to get the string types right.

Much thanks!
Jens

Mike Bayer

Dec 4, 2019, 9:27:31 AM
to sqlalchem...@googlegroups.com


On Tue, Dec 3, 2019, at 7:50 PM, Jens Troeger wrote:
Hello,

Using a MySQL database, keeping Unicode strings under control has turned out to be a bit of a challenge. I could have sworn that the character encoding and collation were set up correctly, i.e.

+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8                     | utf8_general_ci      |
+--------------------------+----------------------+

Alas, that turned out not to be the case. So now I’m looking at a few databases where random tables and columns are back to the latin1 character set. I’m trying to get to a solution that takes a database, checks & switches its charset/collation, then iterates over all tables, checks & switches their charset/collation, and then iterates over the columns and (for appropriate types) checks & switches their charset/collation.

I can do that by issuing SQL statements directly:

SELECT @@character_set_database, @@collation_database;

then if need be

ALTER DATABASE db_name CHARACTER SET "utf8" COLLATE "utf8_general_ci";

Similarly for tables (link) and table columns (link). 

However, I wonder if there is a more elegant way with SQLAlchemy than issuing SQL directly?


you can get lists of all table names, as well as the MySQL-specific table options, using the inspector.


from there you still need to conn.execute("ALTER ...").
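
something along these lines, as a rough sketch (placeholder URL; the exact option keys that get_table_options() returns for the MySQL dialect should be checked against your SQLAlchemy version):

from sqlalchemy import create_engine, inspect, text

engine = create_engine("mysql+pymysql://user:password@localhost/db_name")  # placeholder
inspector = inspect(engine)

with engine.connect() as conn:
    for table_name in inspector.get_table_names():
        # The MySQL dialect returns reflected table options with keys prefixed
        # "mysql_"; print the dict to see which charset/collation entries your
        # version reports before relying on specific key names.
        options = inspector.get_table_options(table_name)
        print(table_name, options)

        conn.execute(text(
            "ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8 "
            "COLLATE utf8_general_ci" % table_name
        ))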



I have also noticed that some string types use Unicode (which maps to varchar) and others use UnicodeText (which maps to mediumtext). I vaguely remember that there were size constraints: I think a maximum of 65k bytes across all columns of a table, whereas mediumtext doesn’t count against that 65k limit?

the MySQL documentation would help you there





Disregarding that, converting from UnicodeText to Unicode should not impact the value, right? While cleaning up the character set encoding and collation, I thought it may be good to get the string types right.


this would be an ALTER TABLE, and modern MySQL versions should be fairly robust as far as warning if text is being truncated (I would create a restorable dump of all affected databases first). However, it might be a slow-running operation, so I would expect the database may be unavailable to other applications while some of these operations run.
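
to give a concrete, hypothetical example of such an ALTER (table name, column name, and VARCHAR length are placeholders, and the length has to be large enough for the existing data):

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/db_name")  # placeholder

with engine.connect() as conn:
    # Hypothetical table/column; in strict SQL mode MySQL raises an error
    # rather than silently truncating values longer than the new length.
    conn.execute(text(
        "ALTER TABLE some_table MODIFY some_column "
        "VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci"
    ))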




Much thanks!
Jens

