Zeichensatz im Forum

Hallo Moderator, bei den Umlauten ist irgendwas im Forum “verbogen”… (nervt beim Lesen) Gruss, Markus

Ich mutmaße jetzt einfach mal, dass die Kodierung auf UTF-8 umgestellt wurde (von zuvor iso-8859-1) und deshalb nun alle alten Beiträge (in iso-8859-1) nicht korrekt dargestellt werden (da die Umlaute nicht entsprechend maskiert sind). Neue Beiträge scheinen davon jedenfalls nicht betroffen, da die entsprechend der Seitenkodierung in UTF-8 geschrieben werden.

Yes, the old posts are not shown correctly anymore. New posts should be handled much better now that everything is in UTF-8. Any ideas how I can fix the old posts? If there is a simple fix, like a query on the database then I’m willing to try correcting it.

What kind of database is it? MySQL?

Yep Mysql 4.1. But I already did some research and it looks like it needs some scripting (like in PHP) to do an export, conversion and import of all posts and topics … that’s a lot of work :expressionless:

Yes, that’s what I had expected (the effort). But the most critical thing COULD be the time stamp. It depends on how it’s implemented. In case the “last change” date which appears in the Forum and which is the base for the sorting is directly taken from by the “technical” change date od the record … you would have an issue :wink: If not … yes, then it’s possible to do such scripting.

The timestamp is an integer (UNIX timestamp) field in the posts table named ‘posted’, so updating the charset won’t break post order etc.

--  -- Table structure for table `osm_posts` --   CREATE TABLE `osm_posts` (   `id` int(10) unsigned NOT NULL auto_increment,   `poster` varchar(200) NOT NULL default '',   `poster_id` int(10) unsigned NOT NULL default '1',   `poster_ip` varchar(15) default NULL,   `poster_email` varchar(50) default NULL,   `message` text,   `hide_smilies` tinyint(1) NOT NULL default '0',   `posted` int(10) unsigned NOT NULL default '0',   `edited` int(10) unsigned default NULL,   `edited_by` varchar(200) default NULL,   `topic_id` int(10) unsigned NOT NULL default '0',   PRIMARY KEY  (`id`),   KEY `osm_posts_topic_id_idx` (`topic_id`),   KEY `osm_posts_multi_idx` (`poster_id`,`topic_id`) ) TYPE=MyISAM AUTO_INCREMENT=16182 ;
UPDATE osm_posts SET poster = replace(poster, 'Ã?¤', 'ä'); UPDATE osm_posts SET message = replace(message, 'Ã?¤', 'ä'); UPDATE osm_posts SET edited_by = replace(edited_by, 'Ã?¤', 'ä');

In principle it should work with such an approach. You would need to run it x times each on the listed 3 tables where “x” is the number of characters to be replaced (Ã?¤Ã?¶Ã?¼Ã?â??Ã?â??Ã?Å?Ã?Ÿââ??¬…). In addition to that you need to update the user table where the user names and locations are stored. I have taken the HTML coding as replacement above. If possible I would take the 2-Byte UTF8 coding instead. Kristian

I think that won’t work, cause the data is stored as ISO-8859-1. So you have to take the ISO-8859-1 character code of the umlauts as the key, or set the DB back into ISO-8859-1 mode before update.

Well, that might fix the Roman characters, but it’s not a solution for Cyrillic languages, like Russian

You could also do a table dump, load it into a text editor which automatically converts it to UTF8 and load it back to teh database.

Maybe this works: UPDATE osm_posts SET message = convert(message USING utf8); NOT tested. http://dev.mysql.com/doc/refman/4.1/en/charset-convert.html

Thanks for the tip, I tried that but got error 1064 on USING utf8. Further investigation learned that utf8 is not available as a character set in this instance of MySQL. Btw, the MySQL version used is 4.0.23a, not 4.1 as I said earlier. I’ll try to get my hosting provider give me a second database (MySQL 5 this time) so that I can migrate the whole forum.

Right, the hosting provider has processed my ticket requesting a MySQL 5 database, and I’ve already confirmed UTF8 actually working on that forum. So the next thing is to copy the forum, confirm that everything is working and make the forum migration permanent. This will require the forum to become read-only for a while though.

The forum migration is complete. Most messages could successfully be migrated, however a few still contain weird characters. I hope the migration wasn’t too inconvenient.

Well done, thank you.

But now in all old messages the linebreaks are lost…

I did not notice it before, but you are right. Dunno why they are lost, but there’s little I can do about it now. If you have any ideas how to fix that, the old forum is still available at http://test.na1400.info

The line breaks are not the only issue. It seems that some of the characters have changed the other way round also :wink: See my posting #8 above (in brackets and in the code). You should have stated the creation date in the WHERE clause in order to skip all items which had been created after the change :wink:

Regarding the breaks: The “
” tags are missing now in the HTML output. I don’t know whether they were produced by the PHP script based on certain characters in the DB field values or whether the DB values contained them already.

In the first case it would be interesting to know those control characters. They would have been replaced by something else by the convert function. In the second case the original tag would have been replaced - not likely.

Kristian

Unfortunately it’s not that simple, see this post.

This is the main problem:

I cannot convert them all correctly without knowing which topics, posts, usernames, realnames, signatures etc are using which encoding. There was no method detecting them perfectly, so I took the method that fixed the most posts and applied it. Unfortunately it removed some of the linebreaks (especially in the German forum) and some previously good posts became worse. That is regrettable but, looking at it from a distance, the conversion fixed more then it broke afaik. So as far as I’m concerned I would like to keep things as they are now and go on from here. This migration has cost me more then enough time already.