Frontend transliterator: translit. A battle against the ??????'s

greencaps · November 27, 2009, 12:50pm

Because I was focused on different codepages (that was what I had read here), I decided to make different transliteration tables for different countries. My collection now contains albania, belarus, bulgarije, cyprus-turkish, czech-republic, estonia, greece, hungary, kaliningrad, kazachstan, kosovo, kyrgyzstan, latvia, lithuania, macedonia, moldavie, mongolia, poland, romania, russia, serbia, slowakia, tajikstan, transnistria, turkey, turkmenistan, ukraina, uzbekistan. Work on thailand and china is in progress.

The problem now was to pick the right transliteration table at the right moment: given a lat,lon, find out which country it lies in and then take that country's table.

As it happened, I had just programmed an algorithm to find out whether a given lat,lon lies inside an area, given for example a .gpx file with a track along the border of that area. So implementing it all posed no big problems. Now it had to be done. I needed .gpx files for a lot of countries. As I did not know where to get them (I now know that Geofabrik has them for a lot of countries, but I also know now that they are not always precise), the solution was to first make a boundary2track program that, given the id of a boundary relation, downloads all data for that relation and makes a gpx file out of the border/boundary.
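The post does not say which algorithm was used for the inside-an-area test. One common way to do it is ray casting (the even-odd rule), and the following is a minimal sketch under that assumption. The function name and the representation of the border as a list of (lat, lon) track points are hypothetical, and lat/lon are treated as flat coordinates, so borders crossing the 180° meridian are not handled.

```python
def point_in_area(lat, lon, border):
    """Return True if (lat, lon) lies inside the closed ring `border`.

    `border` is a list of (lat, lon) track points; the last point is
    implicitly connected back to the first one.
    """
    inside = False
    n = len(border)
    for i in range(n):
        lat1, lon1 = border[i]
        lat2, lon2 = border[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a ray running from the point along that latitude.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside  # each crossing flips the state
    return inside


square = [(0.0, 0.0), (0.0, 10.0), (10.0, 10.0), (10.0, 0.0)]
print(point_in_area(5.0, 5.0, square))
print(point_in_area(15.0, 5.0, square))
```

An odd number of crossings means the point is inside. The cost grows with the number of border points, so a bounding-box check before the full test is a cheap way to skip most areas.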

Gpx files were made this way for the above-mentioned areas/countries. The transliteration tables were made in the meantime, character by character, by hand. For every country there are two files. For instance:
russia.frontier.gpx
russia.transliterationtable.txt

All such files are placed in an Areas directory. At startup the transliteration manager module of translit looks in that directory and creates an areatransliterator instance for every pair of files.
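How such a startup scan could look is sketched below. This is only an illustration, not translit's actual code: the function name is hypothetical, and the two file suffixes are an assumption modelled on the russia example and on the later world.transliterationtable.txt.

```python
from pathlib import Path

FRONTIER_SUFFIX = ".frontier.gpx"              # assumed naming convention
TABLE_SUFFIX = ".transliterationtable.txt"     # assumed naming convention


def find_area_pairs(areas_dir):
    """Map each area name to its (border file, table file) pair.

    Areas that have a border file but no table file are skipped.
    """
    pairs = {}
    for gpx in sorted(Path(areas_dir).glob("*" + FRONTIER_SUFFIX)):
        area = gpx.name[: -len(FRONTIER_SUFFIX)]
        table = gpx.with_name(area + TABLE_SUFFIX)
        if table.is_file():
            pairs[area] = (gpx, table)
    return pairs
```

With this layout, adding a country means dropping two files into the directory; nothing else has to change.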

When translit reads the osm data and sees a <node> (at the moment only nodes are transliterated, not <way>s), it extracts the lat,lon values and asks the transliterationmanager which areas the node is in. If it is not in any area, translit is done with that node and outputs it unchanged. (It also does not inspect nodes which consist of only one line.) Otherwise it checks whether there is a place tag and a name tag and not already an int_name or name:en tag. If a transliteration is needed, it invokes the right areatransliterator. Depending on the result, a tag is added and the changed node is written to the output.
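The tag check described above can be written down compactly. The sketch below assumes the node's tags have already been collected into a dict of key to value; the function name is hypothetical and this is not translit's actual code.

```python
def needs_transliteration(tags):
    """True if a node has a place and a name tag but no Latin name yet."""
    has_candidate = "place" in tags and "name" in tags
    already_done = "int_name" in tags or "name:en" in tags
    return has_candidate and not already_done


print(needs_transliteration({"place": "city", "name": "Москва"}))
print(needs_transliteration({"place": "city", "name": "Москва", "name:en": "Moscow"}))
```

Nodes that fail this check are written to the output unchanged.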

My first fear was that adding more countries (by adding their respective files to the Areas directory) would affect the processing time. But if it does, the effect is very minor. Four or twenty-five countries: it does not matter.

greencaps · November 29, 2009, 11:08am

Now that <node>'s were being transliterated, the next step was the <way>'s. I copied the code for my nodehandler, changed "<node" to "<way" and "place" to "highway", and let it run. Well, that did not work out. I had forgotten that ways do not contain a lat,lon.

The <nd ref="Id"/> elements refer to nodes. Did I have to inspect these nodes? Translit is fed xml osm data, and that contains first all the nodes and then the ways. So by the time a way is inspected, the info for its related nodes has already passed.

By this time I already had my doubts whether my approach of different transliteration tables for different countries was the way to go. What I had seen meanwhile, while making transliteration tables for russia, romania, greece and even thailand and china, was that a UTF-8 character used for the Cyrillic alphabet would not be used for Greek or Romanian or for any other.

I did not know much about character sets, but in the old sets, where every character is represented by the value of one byte (8 bits), you need a character set because only 256 values are possible with a byte. So the value 198 is a different character in Cyrillic than in ours.
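To make the byte 198 example concrete, here is a small sketch that decodes that single byte with three one-byte codepages. The choice of codepages (Latin-1, Windows-1251 for Cyrillic, ISO 8859-7 for Greek) is an assumption for illustration; the post does not say which ones were meant.

```python
def decode_byte(value, codec):
    """Interpret a single byte value under the given one-byte codepage."""
    return bytes([value]).decode(codec)


for codec in ("latin-1", "cp1251", "iso8859_7"):
    # Same byte, different character, depending on the codepage.
    print(codec, decode_byte(198, codec))
```

The byte itself carries no information about which codepage applies, which is why one-byte sets need that information from somewhere else.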

But UTF-8 takes one to six bytes to represent a character (that is the RFC 2279 definition; the later RFC 3629 limits it to four). Our characters can be represented by one byte. I found that Cyrillic characters always take two bytes (I found only one exception where three were needed). Greek takes two bytes too. Thai takes three, and the kind of Chinese (What kind is that? Could someone tell me the name?) that is used in osm takes three too.
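These byte counts are easy to check. The sketch below encodes one sample character per script and counts the bytes; the sample characters are arbitrary picks, not taken from the osm data discussed here.

```python
def utf8_length(char):
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(char.encode("utf-8"))


samples = {
    "Latin": "a",
    "Cyrillic": "\u042f",   # Я
    "Greek": "\u03bb",      # λ
    "Thai": "\u0e01",       # ก
    "Chinese": "\u4e2d",    # 中
}
for script, char in samples.items():
    print(script, char, utf8_length(char))
```

The length follows from the code point: up to U+007F takes one byte, up to U+07FF two bytes, up to U+FFFF three bytes.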

So UTF-8 is a character set in itself. If it is UTF-8 you are done. (Do not laugh if you already knew: I had to find out the hard way. http://www.ietf.org/rfc/rfc2279.txt is my friend.)

Only the minor problem that the Garmin does not know UTF-8 forces us to do something. And the way is not to make, for instance, a one-byte Cyrillic character out of two UTF-8 bytes, because the Garmin will not handle that either. The way is also not to do it in two steps: make a one-byte Cyrillic character of the two and then replace that with a transliteration. No, you can do away with all the old codepages. Just make one transliteration table straight from UTF-8 to Garmin-usable characters.
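A minimal sketch of such a direct table follows. The handful of entries is purely illustrative and not taken from the actual tables, and the names WORLD_TABLE and transliterate are hypothetical. Because Cyrillic and Greek characters have different code points, entries for both can sit in one table without clashing.

```python
# Hypothetical excerpt of a combined table; a real one would have
# hundreds of entries.
WORLD_TABLE = str.maketrans({
    "\u041c": "M",   # Cyrillic М
    "\u043e": "o",   # Cyrillic о
    "\u0441": "s",   # Cyrillic с
    "\u043a": "k",   # Cyrillic к
    "\u0432": "v",   # Cyrillic в
    "\u0430": "a",   # Cyrillic а
    "\u0391": "A",   # Greek Α
    "\u03b8": "th",  # Greek θ
    "\u03ae": "i",   # Greek ή
    "\u03bd": "n",   # Greek ν
    "\u03b1": "a",   # Greek α
})


def transliterate(utf8_bytes):
    """Decode UTF-8 bytes and replace every character found in the table.

    Characters without an entry pass through unchanged.
    """
    return utf8_bytes.decode("utf-8").translate(WORLD_TABLE)


mixed = "\u041c\u043e\u0441\u043a\u0432\u0430 / \u0391\u03b8\u03ae\u03bd\u03b1".encode("utf-8")
print(transliterate(mixed))  # Moskva / Athina
```

No intermediate one-byte codepage is involved, and a name that mixes scripts needs no special handling.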

This idea could nicely be applied to the transliteration of ways. First I combined all the transliteration tables I had at that moment into one world.transliterationtable.txt.

to be continued…

greencaps · November 29, 2009, 11:30am

At the moment, with all tables combined (except for the Thai and Chinese ones), the world transliteration table has 339 entries.

It was time to try it on the <way>'s. To my joy, all went as I thought it would. Running on osm data that contained parts of russia, ukraina, lithuania, romania and greece, translit transliterated everything as if it had separate tables for every country.

For instance, for the way in Lithuania shown above (http://api.openstreetmap.org/api/0.6/way/27950733) translit would leave it as:

When I saw this I realised that the algorithm used for the nodes, which determines a transliteration table by means of lat,lon's lying within country borders, was superfluous.

greencaps · November 29, 2009, 11:49am

One size fits all

A nice demonstration of the potential of one transliteration table is this way on the border of Russia and China:

http://api.openstreetmap.org/api/0.6/way/39159352
http://www.openstreetmap.org/browse/way/39159352

If you click the links you will see that your browser has no difficulty displaying names of which the first half consists of Cyrillic characters and the second half of Chinese. This is because it’s utf-8.

... ...

Above you see the same way twice. The first text is copy/pasted from a browser, the second from text in WordPad. (Copy/pasting/displaying utf-8 in different programs is a story of its own…)

Translit does not mind the combination of cyrillic and chinese and transliterates it all and adds the missing tag:

…

Edit: well, in this case adding a tag was not needed, as there is already a name:en. But I found it too beautiful not to tell…

greencaps · November 29, 2009, 12:15pm

http://api.openstreetmap.org/api/0.6/way/10930885
This way from Greece treated by translit:

chris66 · December 1, 2009, 12:49pm

Hi Greencaps,

Is your program available for download ?

Chris

greencaps · December 1, 2009, 5:26pm

No. Not yet. As you can read, the implementation changes and changes. It is at an early stage of development.

It is now tested by Lambertus. The first results are not visible yet (I mean on http://garmin.na1400.info/routable.php ) . I have to spend more time on the transliterationtable(s). I first want to see that it runs at Lambertus like I want it to run.

After that we will see.

Are you interested in a special country/language?

greencaps · December 1, 2009, 9:25pm

Part of 63240172.img (Ukraina) displayed by GPSMapEdit. I hope that after next weekend's update the question marks are gone.

Lambertus · December 2, 2009, 9:04am

It’s my fault that the transliterated names aren’t showing up yet. I simply forgot to add ‘name:engels’ to the list of tags used for displaying the name. This is fixed now, but I’m running into a bug (unrelated to translit) that makes mkgmap crash on a lot of tiles. This has to be fixed before I run a new update (also, a new planet will be available tomorrow, which I want to use for the next update).

I am sure the transliteration will work fine in general, because adding the Chinese name:zh_py worked fine as well.

chris66 · December 2, 2009, 10:24am

Europe.

So, is there some European country missing?

Chris

greencaps · December 2, 2009, 10:54am

Well you have seen the list. Only eastern Europe.

Even the ß will not be transliterated yet.

@Lambertus: the forum clock is one hour off. I am posting at 11:50.

Edit: In my profile checking “Daylight savings is in effect (advance times by 1 hour).” did it.

liosha · December 2, 2009, 11:56am

I use this perl module for transliteration: http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm
Maybe its tables will be useful.

chris66 · December 2, 2009, 1:16pm

But the German “ß” is part of the latin1 charset and does not need to be transcribed to “ss”.

greencaps · December 2, 2009, 1:58pm

It does, because what counts is whether a Garmin device can display it.

I think a GPSmap 60Cx cannot. Well, it is difficult to find out. A ß in street names in osm becomes ss on Lambertus’ site. In City Navigator it is only ss. There will be a reason for that, I think.

If you have/know a small .img file with ß’s please give me a link. I’m eager to try it out.

greencaps · December 2, 2009, 2:02pm

Thank you. I see that all has been done before.

I will dig through it, but at first glance it looks to be a transliteration from two-byte unicode (see the Bei Jing example on that page). But osm comes with utf8 (1 to 6 bytes). Do you first do a conversion from utf8 to unicode-2 before using this function?

liosha · December 2, 2009, 2:14pm

It converts from perl’s internal unicode representation.
So the code is something like this:

use Encode;
use Text::Unidecode;
.....
$transliterated_string = unidecode( decode( 'utf8', $utf8_string ) );

chris66 · December 3, 2009, 8:24am

Speaking for my Legend HCX:

In general the device is able to display all(?) latin1 characters:

But: When compiling the map, mkgmap changes all street names to uppercase
(unless you use the --lower-case option). But there is no upper case
for the “ß”, so it is converted to “SS”.

The Garmin device converts back to lower case in the tooltips and in other fields.
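The effect of that round trip can be reproduced in a few lines. Python's str.upper() follows the Unicode full case mapping, which also turns ß into SS; whether mkgmap and the device do it in the same way internally is an assumption here, and the street name is made up.

```python
def upper_then_lower(name):
    """Uppercase a name and lowercase it again, as in the round trip above."""
    return name.upper().lower()


street = "Hauptstra\u00dfe"          # Hauptstraße, a made-up example
print(street.upper())                # HAUPTSTRASSE
print(upper_then_lower(street))      # hauptstrasse, the ß does not come back
print("\u00df".encode("latin-1"))    # b'\xdf', so ß itself is in latin1
```

So the ß can be lost through the casing alone, even though latin1 has a code for it.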

If the --lower-case option is used, the street names are displayed
as A… in the map (only first letter is shown).

Chris

greencaps · December 3, 2009, 9:14am

Chris, you are talking about what mkgmap can do/does. But I want to know something about the Garmin. I asked if you had/knew an .img file which contained a ß. It does not matter who put it in.

But your picture shows something very nice. Just above the Süntelstasse hint: ÄËäÜß.

Isn’t that a ß at the end? Did you type it in for a waypoint?

chris66 · December 3, 2009, 11:00am

Yes, that is a ß entered in a waypoint name.

Here is a gmapsupp.img generated with --lower-case, so you have a lot of …straße

http://www.megaupload.com/?d=J0LT749R

Chris

greencaps · December 3, 2009, 12:33pm

Thank you.

That is a very small map. I could hardly find the bbox on my device.

This is how it shows in the 60Cx.

Well isn’t this a strange device?
-It can show a ß and it cannot.
-It can show lowercase and it can not.

What happened at Garmin to make this possible?

So I know now that translit should make an ss of it too.