TYPO3: Sanitize a database that uses Latin1 character encodings in UTF-8 database fields

In TYPO3 systems before v4.5 you could get all sorts of messy database character encodings by setting wrong or incomplete parameters in your typo3conf/localconf.php

The article UTF-8 support in TYPO3Wiki explains all the backgrounds of using UTF-8 in TYPO3 and shows many ways to transition the database between Latin1 and UTF-8.

Unfortunately the article does not account for the possibility that the database already uses UTF-8 tables and columns (= fields), but still holds any extended characters as multi-byte characters as if the columns would still use Latin1 character sets. In this case phpMyAdmin shows any extended characters as weired double-character combinations, although the database, tables and fields are set to UTF-8. This could happen, if your localconf.php in TYPO3 <4.5 already contained…

$TYPO3_CONF_VARS['BE']['forceCharset'] = 'utf-8';

…but did NOT contain…

$TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8;';

Solution

You can use “Possiblity 1″ from the above article, but with some extra steps at the beginning:

  • dump your database (which you should do anyway, see instructions in the Wiki article)
  • use either an editor or the Linux sed command (described in the Wiki article) to edit the dump and
    …exchange all occurrences of
    utf8_general_ci   to   latin1_german1_ci
    …and exchange all occurrences of
    utf8_unicode_ci   to   latin1_german1_ci
    …and exchange all occurrences of
    utf8
      to   latin1
    (in this order please).
    Actually you should only change occurrences that appear in any CHARSET and COLLATE parameters of MySQL table definitions. If the strings can appear anywhere else (e.g. in the data itself) then you will have to check each search hit and replace only those that follow the keywords CHARSET or COLLATE!
  • use your editor to convert the file format from UTF-8 to ASCII and save the file. Don’t try to safe it in UTF-8 (which your original dump probably is) or you will end up with even more weird characters in your database after re-import.
  • read the edited ASCII dump back into the database
  • now use db_utf8_fix.php as described in the article above.

Background

The script db_utf8_fix.php does only change database fields with charset latin1 to charset utf8. If the fields already are set to utf8, nothing happens at all. So we simply change the field settings back to Latin1 encoding and reimport that database. Now the PHP script will work nicely. The conversion of the database content is actually done by MySQL when the script tells the database to convert the table fields to UTF-8.

A lot of related background information about the algorithm used in db_utf8_fix.php can be found in this great posting by Nic Jansma. Nic also provides a PHP conversion script of his own which basically does the same as db_utf8_fix.php and is not limited to TYPO3. It actually looks more mature to me (although I did not test it).

[UPDATE 2011-10-25]

Correction 2011-11-11: Jigal van Hemert has eliminated the following bug. Thanks Jigal!
I found that under certain conditions db_utf8_fix.php outputs the following errors on MySQL 5.x:

Could not execute query!: BLOB/TEXT column '...' can't have a default value

This happens if the Server SQL Mode (parameter sql-mode in my.ini) is set to STRICT_TRANS_TABLES (which is the part of the default setting). The reason is a bug in the PHP code that tries to add “DEFAULT” values to blob and text fields in the DB. This is not allowed in MySQL. To eliminate that bug you have to change the code in two places:

[ ... ]

I have informed the author about the problem and hope he will change the original version.

Another problem I ran into where errors of the form:

static_countries: ..Could not execute query!: Incorrect string value: 'xC3' for column 'cn_official_name_local' at row 119

In this case the DB contained data that could not meaningfully be translated between character sets. Where this data came from is beyond my grasp. In my case the solution was simple: All affected tables where part of the extension “Static Info Tables” so I could simply drop them and have the extension recreate them later. If this happens to you you will have to carefully examine which tables and which table rows are affected. The error messages contain all the information you need to find them. Hopefully you can manually repair the affected rows in your SQL dump.

Dieser Beitrag wurde unter Configuration, MySQL, PHP, TYPO3 veröffentlicht. Setze ein Lesezeichen auf den Permalink.

Eine Antwort auf TYPO3: Sanitize a database that uses Latin1 character encodings in UTF-8 database fields

  1. Pingback: Umstellung auf UTF8 ( nach Update auf Typo3 4.7

Hinterlasse eine Antwort

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind markiert *

Du kannst folgende HTML-Tags benutzen: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>