OSM PBF data model for address extraction

scampion · April 13, 2021, 6:46am

Dear all,
I am working on OpenGeoCode[.0], a geocoding project similar to Nominatim but in-memory and with fuzzy matching capabilities.

With the script opengeocode/extract.py[.1], I extract the addresses from OSM PBF files to produce a CSV of the form


@lon; @lat; addr:postcode; addr:city; addr:street; addr:housenumber

If you know the OSM data model well, I would appreciate tips on improving the way I parse these files.
As you will see, the script relies on if/then/else checks on nodes and attributes, with a double pass over the file … laborious and not clean at all, from my point of view.
I think the data model must allow a more exhaustive and cleaner approach, so any advice is welcome.

Thanks in advance,
Sebastien

luisforte · April 13, 2021, 3:31pm

Your work is quite interesting!

As far as I know, all files generated from the OSM database list their objects in the same order: nodes first, then ways, and finally relations.
This holds for all extracts from the main database, whether they contain the current state of the data, the history of objects, or the changes within a given timeframe.
I believe that any data source that lets you extract all objects in one step (Overpass and others) was itself built by reading the files generated from the OSM database, just as you are doing, so it does not seem to make much sense to look for an alternative source.
Anyway, you usually have to take several steps (or several if’s): nodes, ways and relations have different attributes and are conceptually different. Consider the geometry of those elements: a node has a single coordinate pair, while a way (a building, which may have an address, is an OSM way) has several nodes. Here you should consider computing something like its centroid if you only want one geometry per element. Relations don’t have geometries of their own; they are made up of nodes and/or ways or even other relations, so their geometric reference is harder to determine.
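To illustrate the centroid idea, here is a minimal Python sketch. It assumes you have already resolved a way's node references into (lon, lat) pairs; the function name way_centroid is hypothetical and not part of the script under discussion. It treats lon/lat as planar coordinates, which is a simplification that should be acceptable for building-sized shapes but not for large areas.

```python
from typing import Sequence, Tuple

Coord = Tuple[float, float]  # (lon, lat), treated as planar x/y


def way_centroid(coords: Sequence[Coord]) -> Coord:
    """Return one representative (lon, lat) point for a way.

    Closed ways (first node == last node) use the area-weighted
    polygon centroid (shoelace formula). Open or degenerate ways
    fall back to the plain average of their nodes.
    """
    if not coords:
        raise ValueError("way has no node coordinates")
    pts = list(coords)
    closed = len(pts) >= 4 and pts[0] == pts[-1]
    if closed:
        area2 = 0.0  # twice the signed area
        cx = cy = 0.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            cross = x0 * y1 - x1 * y0
            area2 += cross
            cx += (x0 + x1) * cross
            cy += (y0 + y1) * cross
        if area2 != 0.0:
            # Centroid = sum / (6 * area) = sum / (3 * area2)
            return (cx / (3.0 * area2), cy / (3.0 * area2))
        pts = pts[:-1]  # zero-area ring: drop the duplicated closing node
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)
```

Note that the centroid of a concave building (an L-shape, for example) can fall outside the outline, so depending on the use case another representative point may be preferable.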
By the way, relations, which do not seem to be handled in the script you refer to, can also carry addresses (namely multipolygon-type relations), so handling them would be a third step to add.
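One way to reduce the number of if's: the addr:* tags look the same on nodes, ways and relations, so the tag handling can be shared and only the geometry needs per-type code. A rough sketch follows, assuming tags are available as a plain dict. The names ADDRESS_KEYS, extract_address and is_addressed_multipolygon are hypothetical, and the rule "at least one of the four keys is present" is my assumption about what counts as an address.

```python
from typing import Dict, Mapping, Optional

# The four tag keys from the CSV layout in the first post.
ADDRESS_KEYS = ("addr:postcode", "addr:city", "addr:street", "addr:housenumber")


def extract_address(tags: Mapping[str, str]) -> Optional[Dict[str, str]]:
    """Return the address fields of any element, or None if it has none.

    Works unchanged for node, way and relation tags. Missing fields
    are returned as empty strings so every row has the same columns.
    """
    if not any(key in tags for key in ADDRESS_KEYS):
        return None
    return {key: tags.get(key, "") for key in ADDRESS_KEYS}


def is_addressed_multipolygon(tags: Mapping[str, str]) -> bool:
    """True for relations tagged type=multipolygon that carry an address."""
    return tags.get("type") == "multipolygon" and extract_address(tags) is not None
```

With something like this, the per-type branches only have to answer "which coordinate do I attach to this address?".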
Unfortunately, I don’t know of any tool that does this job in a cleaner and more logical way.
I’m sorry if I just made things look worse.

GerdP · April 13, 2021, 3:52pm

Your code doesn’t seem to handle the boundary=postal_code tag yet? See https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dpostal_code

GerdP · April 13, 2021, 4:00pm

Regarding the order of data: typically you have to parse relations first to be able to collect the needed child ways and nodes, next you parse the ways to collect the needed nodes, and finally the nodes. Regarding performance: the PBF format doesn’t allow you to jump directly to the first way or relation, but it is organized in blocks, and each block contains meta info about the data it contains. When looking for relations you can just check the meta info to find out if the block contains any. If not, skip the block.
No idea if this is supported in the libs that you use.
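The relations, then ways, then nodes order can be sketched roughly as follows in Python. This is not tied to any PBF library: the elements are plain tuples that I made up for illustration, and the names has_address, plan_passes and load_coords are hypothetical. Relations nested inside other relations are ignored here for brevity.

```python
from typing import Dict, Iterable, List, Set, Tuple

Tags = Dict[str, str]
Member = Tuple[str, int]  # ("n" | "w" | "r", referenced id), assumed layout
Relation = Tuple[int, Tags, List[Member]]
Way = Tuple[int, Tags, List[int]]
Node = Tuple[int, float, float]  # (id, lon, lat)


def has_address(tags: Tags) -> bool:
    """Assumption: any addr:* key marks an element as address-bearing."""
    return any(key.startswith("addr:") for key in tags)


def plan_passes(
    relations: Iterable[Relation], ways: Iterable[Way]
) -> Tuple[Set[int], Set[int]]:
    """Passes 1 and 2: work out which ways and nodes are needed."""
    needed_ways: Set[int] = set()
    needed_nodes: Set[int] = set()
    # Pass 1: relations tell us which member ways and nodes to keep.
    for _rel_id, tags, members in relations:
        if not has_address(tags):
            continue
        for kind, ref in members:
            if kind == "w":
                needed_ways.add(ref)
            elif kind == "n":
                needed_nodes.add(ref)
    # Pass 2: ways (needed by a relation, or addressed themselves)
    # tell us which nodes to keep.
    for way_id, tags, node_refs in ways:
        if way_id in needed_ways or has_address(tags):
            needed_nodes.update(node_refs)
    return needed_ways, needed_nodes


def load_coords(
    nodes: Iterable[Node], needed_nodes: Set[int]
) -> Dict[int, Tuple[float, float]]:
    """Pass 3: keep coordinates only for the nodes collected above."""
    return {nid: (lon, lat) for nid, lon, lat in nodes if nid in needed_nodes}
```

The point of this order is memory: only the node coordinates that some address-bearing way or relation actually references need to be kept, instead of every node in the file.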