Unscrambling Nominatim hierarchy in Southern California

At first I was going to open a Trac ticket “Nominatim incorrectly lists Redondo Beach as being in Ventura County”.

Then, after a little digging, I was going to write “Nominatim lists way too many cities as being in Ventura County that are really in Los Angeles and other neighboring counties”.

Then, after a little more digging, I decided to first ask about some Nominatim basics.

One of the old-timers prefaced his answer to one of my earlier questions by saying “The Nominatim details.php is more or less a dump of the internal data Nominatim uses, so much of it is probably not very understandable unless you know how Nominatim works internally.”

Fair enough, but in reality, whatever the details.php page shows ends up being reflected in the search results. So when the details page for Ventura County (as a node tagged place=county) http://open.mapquestapi.com/nominatim/v1/details.php?place_id=1619288 shows it as a “parent of” dozens of cities, including Beverly Hills, Santa Monica and, um, Los Angeles, you can be assured that searching for those cities on the OSM main page (or anywhere else that uses Nominatim explicitly or under the hood) will list them, incorrectly, as being in Ventura County.

I can see that tickets along the lines of “Nominatim incorrectly lists place X as being in place Y” have been created, and resolved, in Trac, but I first want to understand what’s going on.

Let me first list what I already know:

  1. In addition to a node tagged name=Ventura, place=county (see link above), there is also a relation tagged name=Ventura County, boundary=administrative, admin_level=6, etc. http://open.mapquestapi.com/nominatim/v1/details.php?place_id=79488864 Unlike the node, this object is “parent of” the exact ten cities found in Ventura County.

  2. This is a fairly typical situation in Southern California (I have not checked the rest of the US or the world), where a county or a city gets both a node and a polygon (a relation if necessary). I think the node is added for the sake of having a nice-looking label on the rendered map. I can’t see much other use, since the node’s “parent of” list is usually badly out of whack, while the polygon’s is usually spot on.

  3. The nodes for the three neighboring counties that I checked (Santa Barbara, Ventura and Los Angeles) are placed far, far from the geographic center of each respective county and very near its southern border. As a result, each one lists many cities from neighboring counties in the “parent of” list and few of its own. Why should it be this way? Who put them there? Were they placed to coincide with each county’s seat? Doesn’t look to be the case. Were they placed near the statistical center of each county’s population? Were they just arbitrarily placed near county borders to help identify where one ends and another begins?

I thought I was beginning to understand the method to this madness until I checked the node for Orange County and found it to be more or less smack in the middle of the county and still listing almost all cities incorrectly under “parent of”. It’s missing the cities that are right near it, but includes far removed cities from Los Angeles County.

So I am left with guesses and questions, which are these:

  1. Am I right in assuming that a polygon lists as “parent of” those objects that are wholly contained in it?

  2. What does “parent of” then mean for a node that’s tagged as place=*? Wild guess: it’s objects that are located within X miles of the node, X being different for every admin_level: higher X for lower admin_level. Why then does it include faraway objects and leave out nearby ones, almost, it seems, at random?

  3. I am guessing that Nominatim developers and data consumers may tell me to ignore the inaccurate parent-child relations involving nodes and concentrate instead on the accurate relations involving polygons. Fine, but shouldn’t the effects of the “bogus” relations then be removed from the search results?

  4. If a node’s parentage is not bogus by design, is the situation in Southern California the result of poorly placed county nodes? Is there a way to determine how they were placed, and should someone like me (who has the enthusiasm but not the background) move them to the geographic centers of the respective polygons, or should it be done by someone wiser?
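To make the guess in question 2 concrete, here is a toy sketch in Python. It is purely an illustration of that guess, not Nominatim's actual algorithm, and every name and coordinate in it is made up: two hypothetical counties share a border at y = 0, and the node for the northern county "A" sits just above that border. A simple "everything within X of the node" rule then picks up cities from the southern county and misses most of A's own.

```python
import math

# Hypothetical planar coordinates (arbitrary units), not real OSM data.
# County A is the area with y > 0, county B is the area with y < 0.
cities = {
    "a1": (50, 80),   # in county A, far from the border
    "a2": (20, 60),   # in county A
    "a3": (55, 10),   # in county A, near the border
    "b1": (50, -10),  # in county B, near the border
    "b2": (40, -20),  # in county B
    "b3": (60, -60),  # in county B, far from the border
}

# The place node for county A, placed near its southern border.
node_a = (50, 5)


def children_by_radius(center, places, radius):
    """Return the names of all places within `radius` of `center`."""
    cx, cy = center
    return sorted(
        name
        for name, (x, y) in places.items()
        if math.hypot(x - cx, y - cy) <= radius
    )


def children_by_containment(places):
    """Return the names of all places inside county A (here simply y > 0)."""
    return sorted(name for name, (_, y) in places.items() if y > 0)


print(children_by_radius(node_a, cities, 30))  # ['a3', 'b1', 'b2']
print(children_by_containment(cities))         # ['a1', 'a2', 'a3']
```

If something like the radius rule were in play, a badly placed node would explain the symptoms for Ventura and Los Angeles, though not what I see for Orange County.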

Well, I guess that’s enough questions for one post. Thanks for reading!

The US county borders and county nodes were created directly from whatever source was available at the time - county node placement is also whatever was used at the time. As you have discovered, Nominatim makes use of these because that’s all that is available, but the results are not correct.

The correct solution is to convert the county boundary to a relation:

type=boundary
border_type=county
admin_level=6
Optionally, copy nist:fips_code and nist:state_fips from the county boundaries.

Add the county border ways, with role=outer.
If the governmental center location is known, add a node with role=admin_centre.
Remove the county node.
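For anyone who prefers to see the end result as data, the steps above can be sketched in Python roughly as follows. This only builds the XML for such a relation in memory; the way and node ids are hypothetical placeholders, the helper name is made up for this example, and the tag set is just the one listed above plus a name. In practice you would do this in an editor such as JOSM.

```python
import xml.etree.ElementTree as ET


def build_county_relation(name, outer_way_ids, admin_centre_node_id=None,
                          extra_tags=None):
    """Build an OSM-style <relation> element for a county boundary.

    All ids are supplied by the caller; the ones used below are made up.
    """
    # A negative id is the usual placeholder for an object not yet uploaded.
    relation = ET.Element("relation", id="-1")

    # County border ways go in with role=outer.
    for way_id in outer_way_ids:
        ET.SubElement(relation, "member", type="way", ref=str(way_id),
                      role="outer")

    # Optional governmental center node with role=admin_centre.
    if admin_centre_node_id is not None:
        ET.SubElement(relation, "member", type="node",
                      ref=str(admin_centre_node_id), role="admin_centre")

    # Tags from the list above, plus the name and any optional extras
    # (for example nist:fips_code copied from the old boundary ways).
    tags = {
        "type": "boundary",
        "border_type": "county",
        "admin_level": "6",
        "name": name,
    }
    tags.update(extra_tags or {})
    for key, value in tags.items():
        ET.SubElement(relation, "tag", k=key, v=value)

    return relation


# Hypothetical ids, for illustration only.
relation = build_county_relation("Ventura County", [101, 102, 103],
                                 admin_centre_node_id=201)
print(ET.tostring(relation, encoding="unicode"))
```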

While you’re in there working, you can also combine overlapping ways into a single way and merge any duplicate nodes. If you are working on a state or US border and know where the true boundary lies, you can combine them also. In my case, I had no idea where the state border was, so I had to leave them separate for now.

That has solved Nominatim results for me.

Mike, thanks. I am sure that removing the county node (even without the rest of the work you outlined) will resolve the Nominatim issues.

However, I believe there is a compelling enough reason for having county nodes (or city nodes, or suburb nodes) in addition to outlines. That reason is labeling on rendered maps. Having the node allows us to 1) move it to a meaningful location (someone called it the “perceived center” - as opposed to the mathematical centroid, which, given an irregular enough polygon, can end up in a very interesting spot; I had a 7-shaped suburb tagged landuse=residential and its name rendered outside the polygon in Mapnik); 2) have the label read differently than the name of the polygon. For example, the name of the polygon (or relation as it would be in your schema) is “Ventura County”, but the name of the node is “Ventura”, which sort of makes sense, especially when you see it rendered in ALL CAPS.
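The centroid remark is easy to check with a small Python sketch. The 7-shaped polygon below is made up (it is not my actual suburb), and I am not claiming this is how Mapnik picks label positions. It only shows that the area centroid of such a shape, computed with the standard shoelace formula, can land outside the shape.

```python
def polygon_centroid(vertices):
    """Area centroid of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # 6 * signed area == 3 * twice_area; the sign cancels out.
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


def point_in_polygon(point, vertices):
    """Ray-casting test: True if the point lies inside the polygon."""
    px, py = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y0 > py) != (y1 > py):
            x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if px < x_cross:
                inside = not inside
    return inside


# Hypothetical "7" shape: a horizontal bar on top, a vertical stroke on the right.
seven = [(0, 4), (5, 4), (5, 0), (4, 0), (4, 3), (0, 3)]

centroid = polygon_centroid(seven)
print(centroid)                           # (3.25, 2.75)
print(point_in_polygon(centroid, seven))  # False: the centroid is outside
```

A hand-placed label node avoids this, which is the whole point of keeping it.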

Not to mention that without a place node, there is no rendered label for the county, period. Nor for the city, or a suburb (which is what I am currently working on). I tested all sorts of tagging scenarios (out in the Nevada desert to be safe) and the only way I could get the place labels rendered was with a place node.

I know we are not tagging for the renderers, but are we tagging for Nominatim? Your solution solves the problem for Nominatim and screws the renderers. Ideally, I am looking for a solution that fixes Nominatim and keeps rendered maps pretty. If I knew I could tag a relation the way you described and add a node with a name only (no place=*) as role=label, and Mapnik, MapQuest and others would render that label - then I could proceed with your suggestion with a clear conscience. But right now that’s not the case.

The bigger question (going back to Nominatim) is why a node has children at all. I think that could be the root of the problem.

The best solution would be to add another node to the county relation with the ‘label’ role. http://wiki.openstreetmap.org/wiki/Relation:boundary I don’t know if the renderers have implemented it yet, but if not, that would be a good candidate for a Mapnik trac ticket.

To me, the county relation with objects in the appropriate roles is the most accurate way to represent the situation in OSM data. It just happens to work with Nominatim, which was written to consume OSM data.

I haven’t worked with Nominatim enough to know, but I would guess that they wanted it to work with parts of the world that have minimal mapping - perhaps only having nodes for cities and places in the beginning.

I think you’re right on most points. My problem is that for any problem I tend to look to OSM data for a solution. Not entirely unreasonable, since that’s the only part of the system that’s exposed and easily accessible to me. Even though Nominatim and, presumably, Mapnik are open source, in realistic terms, I am going to keep trying to tag and retag data until Nominatim and Mapnik show the right results (as well as CloudMade and skobbler, to name two more users of OSM data I have encountered), rather than modify the code. I am, of course, aware that tickets are being opened, and acted upon by Nominatim and Mapnik semi-gods, but as far as getting my own suggestion implemented, I tend to assume that it is either obvious and already being worked on, or stupid and going to get rejected.

I do see mentions, in existing tickets, of Mapnik not supporting role=label rendering. Maybe if I write a compelling enough ticket, citing the Nominatim mess that redundant place nodes create, someone will get on it.