Following the discussions last week about character sets and arthus's
install breaking, it's probably time to look back at all the various
states Habari MySQL databases might be in before we try to write
anything to fix them.
Now, on to the history:
The beginning:
The Habari tables and the database connection followed whatever the
default of the database was. We (naïvely) assumed everything that we
received was UTF-8. This meant that, to function correctly, the
character set had to be either UTF-8 or an SBCS (single-byte character
set; i.e., every character is represented by a single byte; e.g., all
ISO-8859 character sets) in which UTF-8 could be stored as binary data.
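To illustrate why an SBCS could carry UTF-8 untouched, here is a
minimal sketch. It assumes Python's "latin-1" codec as a stand-in for
an ISO-8859-1 column; no database is involved and the sample string is
made up.

```python
# Hypothetical illustration: "latin-1" stands in for an ISO-8859-1
# column. Every byte value maps to exactly one character, so arbitrary
# bytes (including UTF-8 sequences) survive a store/load round trip.
text = "naïve café"
utf8_bytes = text.encode("utf-8")

# What the SBCS column "sees": one character per byte.
stored = utf8_bytes.decode("latin-1")
assert len(stored) == len(utf8_bytes)

# Reading the column back without any conversion returns the same
# bytes, which the application can still interpret as UTF-8.
round_tripped = stored.encode("latin-1")
assert round_tripped == utf8_bytes
assert round_tripped.decode("utf-8") == text
```

This only works while nothing between the application and the table
converts between character sets.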
r1377:
This changed how we interact with the database: we now call `SET NAMES utf8;`.
This broke all blogs that weren't already using UTF-8, or using only
the intersection between the character set in the database and UTF-8.
The database could then be in three states:
- UTF-8,
- Only characters used in the intersection between the database
character set and UTF-8 (normally ASCII only in an ASCII-superset such
as ISO-8859-1);
- Fresh installs are stored in whatever the default database character
set is (this could be something completely different like UCS-2 which
isn't even an ASCII-superset).
Regardless of what the content is stored as in the database, it is now
passed to PHP from MySQL as UTF-8.
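The breakage can be modelled without a database. The sketch below is
an assumption-laden simplification: the function name is hypothetical,
Python's "latin-1" codec stands in for the column character set, and
the server's conversion is reduced to "decode with the column charset,
then send as UTF-8".

```python
def read_with_set_names_utf8(stored_bytes: bytes, column_charset: str) -> str:
    """Model a read over a UTF-8 connection (hypothetical helper).

    The server interprets the stored bytes using the column's character
    set, converts the result to UTF-8 for the connection, and the client
    decodes what it receives as UTF-8.
    """
    as_server_sees_it = stored_bytes.decode(column_charset)
    sent_to_client = as_server_sees_it.encode("utf-8")
    return sent_to_client.decode("utf-8")


# State 1: a real UTF-8 column. Nothing changes.
ok = read_with_set_names_utf8("café".encode("utf-8"), "utf-8")

# State 2: only the ASCII intersection is used in a latin-1 column.
ascii_only = read_with_set_names_utf8("cafe".encode("utf-8"), "latin-1")

# The broken case: UTF-8 bytes sitting in a latin-1 column are
# converted a second time, so each byte of a multi-byte sequence
# becomes its own character.
garbled = read_with_set_names_utf8("café".encode("utf-8"), "latin-1")

assert ok == "café"
assert ascii_only == "cafe"
assert garbled == "caf\u00c3\u00a9"  # displays as "cafÃ©"
```

The same double conversion is what happens when a table holding UTF-8
bytes in an SBCS column is converted to UTF-8 as if its contents
really were in that SBCS.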
r1530:
This converted all installs to UTF-8 tables, and in the process broke
everything that didn't already use UTF-8, or used only the
intersection between the character set in the database and UTF-8.
This brought us down to two states:
- UTF-8;
- Fresh installs are stored in whatever the default database character
set is (this could be something completely different like UCS-2 which
isn't even an ASCII-superset).
r2909:
This made new installs use UTF-8. This also tried to move all existing
installs to UTF-8, but failed (see arthus's breakage). This upgrade
script was the same as in r1530 (this was wrong as we're coming from a
different state).
This resulted in everything being UTF-8, and breaking anything that
was installed between r1530–r2908 where the default database character
set was not UTF-8 (or didn't use only the intersection between the
database character set and UTF-8).
r2927:
This replaced the upgrade script added in r2909. This should be the
upgrade script we want.
This brought us down to knowing the database is UTF-8.
r2932:
This reverted r2927. Both Matthias and I thought the patch was
wrong as the linked IRC discussion shows. This brings us back to the
same undesirable state that r2909 left us in.
This brings us to the present.
Now, to get us out of this hole, the upgrade script in r2927 should be
re-added and the r2909 one removed. Matt and I were wrong because we
did not realize that the r1530 upgrade script would prevent UTF-8
stored in an SBCS from ever reaching this upgrade script. If anyone thinks
this is wrong, please do say.
--
Geoffrey Sneddon
<http://gsnedders.com/>
> Now, to get us out of this hole, the upgrade script in r2927 should be
> re-added and the r2909 one removed. Myself and Matt were wrong because
> we did not realize that the r1530 upgrade script would avoid UTF-8
> stored in a SBCS ever reaching this upgrade script. If anyone thinks
> this is wrong, please do say.
Thanks for the extensive and comprehensive analysis! I agree with your
conclusions and now that we know what should be done, ... let's do it.
-Matt