Hello All,
I am not new to OSM, but this is my first post on this forum.
Some time ago an idea came to my mind which I would like to discuss here.
It’s a matter of fact that “mass data implies power”, and I think that the OSM project could benefit from that fact. Wikipedia is based on mass data, and it’s powerful. Google acquires mass data every minute, and the power of Google is incredible. I think that OSM basically has the same opportunity to benefit from the power of mass data.
With more and more GPS hardware becoming available at comparably low prices, the capabilities for acquiring data for OSM increase rapidly. Unfortunately it’s all raw data which has to be cultivated before it can be used as an accurate OSM database. But the sheer mass of data not only means a mass of manual work: it’s a strength.
I dare say that the statistical information stored in all that raw data offers so many analysis capabilities that it should be possible to do half the work of cultivating that data without manual editing - just by putting it all together.
To sketch my idea, let’s have a look at some roads in and around my village.
- The road where my house is located is a dead-end street.
- The road which leads to that dead-end street goes through a housing area.
- There are three cross-town highways (Bundesstrassen, I am German) in my village.
Whenever I record GPS data whilst driving on these roads, I increase the amount of raw data for these roads, and I get a lot of information, and it’s all for free.
a) in my dead-end street, I always have a speed of 10-40 km/h.
b) I never leave that street at the dead end.
c) all trackpoints in that street are somewhat different for each track, but statistically they increase precision. The more such points I collect, the lower the average error will be.
d) when I drive along the cross-town highways, my speed is anything from 0 km/h to 60 km/h (illegally), but on average (collected from a number of tracks/probes), the speed is simply significantly higher than the speed in my home street or in the housing area.
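Point (c) can even be made quantitative. Here is a small pure-Python simulation (the per-fix GPS error figure is an invented example, not a measured value) showing how the average error of an averaged position shrinks as more trackpoints accumulate:

```python
import random
import statistics

random.seed(42)

TRUE_POS = 100.0   # the "real" position of a point on the street (arbitrary units)
GPS_SIGMA = 5.0    # assumed per-fix GPS error (standard deviation; made-up figure)

def average_error(n_fixes, n_trials=2000):
    """Average absolute error of the mean of n_fixes noisy GPS fixes."""
    errors = []
    for _ in range(n_trials):
        fixes = [random.gauss(TRUE_POS, GPS_SIGMA) for _ in range(n_fixes)]
        errors.append(abs(statistics.mean(fixes) - TRUE_POS))
    return statistics.mean(errors)

for n in (1, 10, 100):
    print(n, round(average_error(n), 2))
```

With 100 fixes per point the average error is roughly a tenth of the single-fix error, which is exactly the “more tracks, lower average error” effect described above.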
If we found a way to put all that data together statistically (and automatically), we could derive a lot of facts regarding the above roads.
- every road which is used in only one direction most of the time (> 98% or so) is presumably a one-way street
- every road which is never left at one of its ends is most likely a dead-end street
- every road which is passed at an average speed of not more than 20-30 km/h (or a similar speed, it’s just an example) might be a street in a housing area
- roads passed at an average speed of 30-40 km/h are most likely bigger than the ones in the housing area
… and so on.
The more data we have for a particular road, the more precisely we could calculate the type of road. But the attributes which can be obtained with statistical procedures go far beyond “Is it a road? Which kind of road? Is it a one-way street?” …
Let me outline some surprising capabilities of statistical mass data analysis.
Consider 1000 tracks recorded on a cross-town highway. Let’s say the tracks were recorded by some pedestrians, some bicycles and quite a number of cars. The tracks would show different ranges of speed within the distance in question. Some would be between 0 km/h and 5 km/h, some up to 30 km/h and some up to 60 km/h or more. It’s very likely that the 60 km/h tracks were not bicycles or pedestrians, and it’s also very likely that the tracks with up to 30 km/h were not pedestrians. With some heuristic rules, it should be possible with some precision (i.e. with some average error as well) to identify pedestrian, car and bicycle tracks. The more data we have for such an analysis, the better the result. Any track which contains data that is very far from the average “profiles” (pedestrian/bicycle/car) could be excluded from the calculation, thus gaining precision in our analysis, e.g. by means of “invalidation triggers”.
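A crude version of that heuristic could look like this in Python. The 5 and 30 km/h cut-offs are just the example figures from the paragraph above, and using a 95th-percentile speed as a robust “top speed” (so single GPS glitches don’t dominate) is my own assumption:

```python
def classify_track(speeds_kmh):
    """Guess the transport mode of one track from its speed samples (km/h).

    The cut-offs (5 / 30 km/h) are illustrative, not calibrated values.
    A high percentile is used instead of max() so a single bad GPS fix
    cannot turn a pedestrian into a "car".
    """
    top = sorted(speeds_kmh)[int(len(speeds_kmh) * 0.95)]  # robust top speed
    if top <= 5:
        return "pedestrian"
    elif top <= 30:
        return "bicycle"
    else:
        return "car"

def filter_cars(tracks):
    """Keep only tracks that look like car tracks - a crude form of the
    "invalidation trigger" idea: everything off-profile is excluded."""
    return [t for t in tracks if classify_track(t) == "car"]
```

With more data one could of course fit real speed profiles per mode instead of fixed cut-offs; this only illustrates the shape of the heuristic.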
Now let’s filter the tracks which were identified as “car” tracks, and apply some rules. In these rules, I use the term “trigger”, which means that a “configurable amount of probes” fulfils a particular criterion (which is supposed to be a configurable value as well):
- if the percentage of tracks which follow the same direction “fires” a trigger, the street in question is most likely a one-way street
- if a high percentage of tracks on the road do not show any stop, there are possibly not many parking spots in the street, or parking is forbidden
- if we encounter a significant percentage of stops (0 Km/h) at a certain position (with no adjacent junction node), there might be a zebra-crossing or a pedestrian light.
- if the average speed is above 70 Km/h (or “triggers” a specific limit), the road is most likely not a residential street
Or:
- a number of tracks was recorded for a particular route, and all of them show a speed below 5-9 km/h. It is a safe bet that this route is no highway; possibly it’s even unusable for cars.
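The “trigger” idea itself is easy to express generically; here is a minimal Python sketch in which both the criterion and the required percentage are configurable, as proposed above (the `min_speed` field is made up for the example):

```python
def trigger(probes, predicate, threshold=0.9):
    """A "trigger" in the sense above: fires when a configurable share of
    probes fulfils a configurable criterion."""
    hits = sum(1 for p in probes if predicate(p))
    return hits / len(probes) >= threshold

# Example rule: "parking is probably forbidden" if >= 90% of car tracks
# never stop on this road (min_speed is a hypothetical per-track field).
tracks = [{"min_speed": 12}, {"min_speed": 8}, {"min_speed": 0}]
no_parking = trigger(tracks, lambda t: t["min_speed"] > 0, threshold=0.9)
```

Here only two of three tracks are stop-free, so the trigger does not fire; with more probes the percentages become meaningful.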
Just some offhand ideas; I am quite sure many more things are possible.
But is all that really sufficient for accurate maps?
No.
But: it could be a step to push all OSM raw data to a higher level of quality. It could help to develop a procedure to check all manual work for plausibility.
If it is really possible to make all raw data more than just raw data by statistical methods, I could imagine the following scenario:
- raw data (tracks) is loaded up to OSM continuously
- a nightly “build” procedure extracts all the information from the various formats (gpx etc.) and loads everything into a database
- the statistical analysis is run and extracts all the information which can be calculated with a certain degree of probability and writes the results to an “alpha” stage database
- “alpha” data shall then serve as the basis for JOSM and other map editors. With that data available, such editors could offer some kind of “commit” facility, i.e. the “alpha” stage data is interpreted as a kind of alpha map, already in the format of the final map but still having “alpha” status. All “alpha” roads and other map content could then be turned into “beta” data by just clicking a “confirm” or “commit” button (and/or applying the actions already supported by the editor in question).
- “beta” or “unstable” data could then be the “pre-production” version of the OSM, and be accepted as “stable” or “final” after additional attributes (street-names etc) have been added or the data was reviewed.
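The staging chain itself could be modelled very simply; a minimal Python sketch of the proposed alpha → beta → stable promotion (the names are mine and do not correspond to any existing OSM API):

```python
from enum import Enum

class Stage(Enum):
    ALPHA = "alpha"    # derived automatically from raw tracks
    BETA = "beta"      # confirmed by an editor ("commit" click)
    STABLE = "stable"  # reviewed, names and attributes added

def promote(stage):
    """One-step promotion along the alpha -> beta -> stable chain."""
    order = [Stage.ALPHA, Stage.BETA, Stage.STABLE]
    i = order.index(stage)
    if i == len(order) - 1:
        raise ValueError("already stable")
    return order[i + 1]
```

The important property is that data only ever moves forward through the stages, so automatic analysis can never silently overwrite reviewed work.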
It’s just an idea to get the most out of all raw data automatically with the aid of statistical methods. There are of course a lot of possible improvements, e.g. comparing raw data against existing beta and/or final data (for plausibility checks), or deciding whether and how newer data is weighted more strongly than old data (because roads and traffic rules change permanently), but those are all details. The principle is statistics: a kind of “profiling” for raw data, with the intent to simplify further processing.
That’s the idea. What’s your opinion?
Discussions are welcome!
emax.