From how I read that thread: it’s difficult to determine automatically which transliteration table you should use.
So ultimately I wish we could automatically transliterate from Japanese, Indian, Malay, Greek, Russian, Chinese, etc. to English so a readable map for the whole world would be possible… I don't know if this is possible at all; my searches all lead to transliteration from a specified language to English instead of language detection…
I found this page which seems to do a pretty solid job and is using Java, although I cannot find any source code. Perhaps a good starting point for Mkgmap? Or this python solution.
Google translate does a fair job. I entered “น้ำไหลลงเขา” and told it to detect the language and translate to English. It came up with “Water flows downhill.” Google detected Thai and translated correctly.
My program to add a missing name:en or name:trans:en or whatever tag works. As a test I let it handle fewer than ten nodes in a changeset. All worked OK.
Now to be sure that only places which are in a selected country are handled I need a borderline around the area.
Hmmm… that can be found in OSM data. So I wrote a module that collects the ways representing the border: starting with a relation id, it recursively retrieves other relations, ways and nodes, then builds a .gpx file. It all took quite a while on 60189 (Russian Federation), so I tested on smaller countries like 102879 Austria and 161033 Mongolia.
For Austria I got a lot of ways, and they 'lay on the frontier'. OK so far. But the ways are in random order: the next way does not start where the former stopped. To make things worse, the direction of the ways is not consistent; most are from east to west (for the frontier line Germany/Austria). I need a closed curve of a country to determine if a "lat,lon" is inside that country. So work is now on sorting the ways and reversing them if needed.
Until now I have not had a look at mkgmap because all this takes time. It is nice to see that others respond. I will study all the links later.
There is another solution if mkgmap cannot handle the transliteration. I read that Mkgmap works on raw OSM data, i.e. the data files in XML format. My program (well, another version of it) could add the missing tags to those XML files.
I posted a question about this on the Mkgmap mailinglist but did not get any response so far.
So it indeed seems like there are two options: upload the transliterated tags with name:trans:en or preprocess the data each time a new update is performed. But I’m afraid that preprocessing the entire planet file on each update will take a very long time.
Besides that, you are only transliterating Cyrillic languages; there are so many more languages that need transliteration. Maybe this needs to be discussed on the main OpenStreetMap mailinglist…
I just noticed the date changed to 14/10/09 so I tried a download again. All the roads I added are still not showing. Maybe I added the tags after 14/10/09.
The easiest way to find out is to see if the changes are visible in Potlatch…if they are, then there’s a problem on Lambertus’ end. If they’re not, then the problem is at Skywoolf/reinholdM’s end.
Yes, the first report of missing ways could be the result of bad timing, but those ways should definitely be in the map by now. So if they aren't, then there's a problem somewhere…
It may be that you got Steve wrong. I think he means that you can do a transliteration from e.g. Cyrillic to Latin without knowing the language, but to do a thorough job you would transliterate a bit differently depending on whether you have e.g. Russian or Bulgarian (both languages use the Cyrillic script). The links you provided don't seem to try to figure out the language; they would just detect whether something is Cyrillic.
I think it would be a start to complete mkgmap's transliteration tables with the perl script Avar provided and see if the resulting maps work for people (even if not perfect). I guess an awkward transliteration is still way, way better than just seeing question marks…
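For what it's worth, a script-level (language-agnostic) table of that kind is essentially one big character map. A tiny Python sketch with a handful of Russian letters (the real BGN/PCGN-style romanization tables are much larger and partly context-sensitive):

```python
# Tiny excerpt of a Cyrillic-to-Latin table (BGN/PCGN-style romanization);
# a real table covers the whole script, including multi-character outputs.
CYR2LAT = {
    'А': 'A', 'а': 'a', 'Б': 'B', 'б': 'b', 'В': 'V', 'в': 'v',
    'К': 'K', 'к': 'k', 'М': 'M', 'м': 'm', 'О': 'O', 'о': 'o',
    'С': 'S', 'с': 's', 'Т': 'T', 'т': 't', 'У': 'U', 'у': 'u',
    'Ш': 'Sh', 'ш': 'sh', 'Я': 'Ya', 'я': 'ya',
}

def transliterate(text):
    # Unknown characters pass through unchanged instead of becoming '?'
    return ''.join(CYR2LAT.get(ch, ch) for ch in text)

print(transliterate('Москва'))  # Moskva
```

The language-dependent part the thread mentions would then be which table to load, not the lookup loop itself.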
Pretty much nothing that I’ve done since the beginning of September shows up in the map that I downloaded from Lambertus’ site today. It’s mostly in these areas:
Yes probably, I haven’t eaten much cheese about transliteration (that’s a Dutch proverb )
I hope the wheel is not being reinvented again with this script. My searches show that it requires a lot of knowledge to transliterate all non-roman languages, so we should definitely use the efforts of existing projects to do so.
So, still, it looks like going for preprocessing is the quickest path to transliteration. If this doesn’t double or triple the processing time then I’m open to add such a script into my toolchain. But let me be clear: this is not going to be a new project for me, I’m not going to develop such a transliteration script.
and so on. I had seen that already some days ago, wondering why there were blocks of five, and today I added code to my program to show it on the map. Aha, 1617 rectangles to look at. But using them is not precise, as the rectangles overlap with bordering countries. So I use a real frontier line (extracted from OSM data). Result for Mongolia: 115 places that already have an int_name or name:en, and 47 places that miss them. Adding a transliteration would be a piece of cake now.
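The inside-the-frontier test is the classic point-in-polygon (ray casting) check; a sketch, assuming the extracted frontier line is available as a list of (lat, lon) vertices:

```python
def point_in_polygon(lat, lon, ring):
    """Ray-casting test: ring is a list of (lat, lon) vertices of a
    closed border polygon (it does not matter if the first vertex is
    repeated at the end)."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        lat_i, lon_i = ring[i]
        lat_j, lon_j = ring[j]
        # Does a horizontal ray from (lat, lon) cross edge j-i?
        if (lat_i > lat) != (lat_j > lat):
            cross = (lon_j - lon_i) * (lat - lat_i) / (lat_j - lat_i) + lon_i
            if lon < cross:
                inside = not inside
        j = i
    return inside

square = [(0, 0), (0, 10), (10, 10), (10, 0)]
print(point_in_polygon(5, 5, square))   # True
print(point_in_polygon(5, 15, square))  # False
```

This treats lat/lon as a flat plane, which is fine for an inside/outside decision except for borders crossing the 180° meridian.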
But I 'discovered' something else. Well, I saw it. For those 115 that already had an int_name or name:en tag, the name tag itself sometimes contained an international or English name too: 528068123 Gachuurt, 528064318 Terelj.
Now what is the purpose of the name tag in OSM data/maps? What should be in the name tag? The name as used in the country? I think so. It is not difficult to detect this automatically and produce a list of node ids for later treatment.
I think so too. And isn't that file 160 GB? But wasn't it split first? How big are the splits? And aren't they compressed when offered to mkgmap? Could you give me an indication of those file sizes?
That is indeed amazing. I have no clue why they define a country’s polygon like that. It’s not very useful it seems.
I think the name tag should be the official local name.
The planet file is about 7.3 GB compressed, which makes it about 80 GB (or so) uncompressed (but no one should use an uncompressed planet, really). I normally split the planet into two sections using Osmosis, because Splitter would need too much memory otherwise. These extracts are then split using Splitter and then rendered with Mkgmap.
What I can envision is that your application loops through the compressed planet once, extracting all the nodes, ways and relations with names. Then it determines in which country each name is, so you know which source language you'd need to transliterate. Then it transliterates the name, updates the name value and outputs the changes in a new compressed planet file. That new planet file could then be used for Mkgmap.
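That loop could look roughly like this in Python, shown here on a tiny in-memory stand-in for the compressed planet (the element and tag names are as in OSM XML; the data itself is invented for the example):

```python
import gzip
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for the compressed planet file
osm_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node id="1" lat="47.9" lon="106.9">
    <tag k="name" v="\xd0\xa3\xd0\xbb\xd0\xb0\xd0\xbd"/>
  </node>
  <node id="2" lat="48.0" lon="107.0">
    <tag k="name" v="Terelj"/>
    <tag k="name:en" v="Terelj"/>
  </node>
</osm>"""
planet = io.BytesIO(gzip.compress(osm_xml))

missing = []  # (id, name) pairs that still need a name:en
with gzip.open(planet) as stream:
    # iterparse streams the XML, so the whole planet never sits in memory
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if "name" in tags and "name:en" not in tags:
                missing.append((elem.get("id"), tags["name"]))
            elem.clear()  # free memory: essential for a planet-sized file

print(missing)
```

The same shape works for ways and relations; writing the updated elements back out through `gzip.open(..., "wb")` would complete the round trip.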
BTW, I just saw an Mkgmap commit in which the transliteration code that already existed for the ASCII code page is also made available for Latin1. I could do a new run after this week's update to see what the results look like.
And what does Splitter produce? A compressed file? And in what sizes?
I can handle uncompressed files up to 2 GB. Or was the limit 4 GB? I have to check. To handle larger files I have to use 64-bit file pointers, which I have never done, but I see no problem implementing this as I have seen code for it. But working on compressed files? I have no idea how to do that. I did not even know that it was possible to work on a piece of a compressed file; I thought that a compressed file first had to be decompressed before use. If not, that is fine, but I have no clue where to start.
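It is possible: gzip (and bzip2) are stream formats, so you can read the uncompressed bytes piece by piece without ever writing the full uncompressed file to disk. A minimal Python illustration:

```python
import gzip
import io

# gzip is a stream format: the reader decompresses on the fly,
# so memory use stays small no matter how big the file is.
data = gzip.compress(b"line one\nline two\nline three\n")
with gzip.open(io.BytesIO(data), "rt") as f:  # a filename works the same way
    for lineno, line in enumerate(f, 1):
        print(lineno, line.rstrip())
```

Random seeking inside a gzip file is not practical, but for a single sequential pass over the planet (which is all the tag-adding needs) streaming is exactly the right fit, and it sidesteps the 2/4 GB file-pointer issue entirely.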
The whole point with mkgmap and converting from one code table to the other is that it places a question mark '?' if it cannot find a match. If it would just keep one byte from the two bytes that it had to convert (or just the byte if the character was one byte), it would do much better. Then you would not have seen me here, as I could then transliterate the .img files downloaded from your site afterwards. Well, at least I think so.
So please find the piece of code where mkgmap places the '?'. (I did not dig into it as I have not finished my tagupdater yet.)
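As an illustration of that fallback idea (emit something readable instead of '?'), here is a toy sketch using Python's codecs error handlers; the table is invented for the example and is not mkgmap's actual code:

```python
import codecs

# Toy per-character fallback table (invented for this example)
TRANSLIT = {'У': 'U', 'л': 'l', 'а': 'a', 'н': 'n',
            'Б': 'B', 'т': 't', 'о': 'o', 'р': 'r'}

def keep_translit(err):
    # Instead of '?', map each unencodable character through the table
    bad = err.object[err.start:err.end]
    return ''.join(TRANSLIT.get(ch, '?') for ch in bad), err.end

codecs.register_error('translit', keep_translit)

print('Улан-Батор'.encode('ascii', 'translit'))  # b'Ulan-Bator'
print('Улан-Батор'.encode('ascii', 'replace'))   # b'?????????' + b'?'
```

The default `'replace'` handler is what produces the wall of question marks; swapping in a table-driven handler is the one-line change the post is asking mkgmap to make.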
Today I extracted the border line of Belarus (White Russia). Then I made a run for place names. Found 22518 places which needed an int_name (or name:en). There were only 36 places which already had a translation. 36!