Processing the planet PBF from https://planet.openstreetmap.org

I am working on importing the planet data on a machine with 6 cores, 32 GB of RAM, and 2 TB of disk space.

The command I am using is:

osm2pgsql -d gis --create --slim --drop -C 25000 --flat-nodes /home/rendersa/nodes.cache -G --hstore --tag-transform-script ~/src/openstreetmap-carto/openstreetmap-carto.lua -S ~/src/openstreetmap-carto/openstreetmap-carto.style ~/planet-190422.osm.pbf

Without -C 25000, it was only using ~4 GB of RAM; with -C 25000, it is using ~27 GB.

It is chugging away and has completed processing the nodes at 776.6k/s, and it is currently working on the ways. The issue is that the ways are processing at 0.16k/s, which seems really slow. It could take several weeks to finish if this rate holds constant.

What I am wondering is whether this rate will hold constant or increase as time passes. I have seen way rates a couple of orders of magnitude faster when processing smaller datasets.

I had missed the --flat-nodes option before, without which ~1–2 TB of RAM would be required to process the planet data. So perhaps there is some other flag that might allow the ways to be processed faster…?

I am also seeing that the cores are barely being used while processing this data (watching the load with htop).

A couple of things here:

  • Yes, processing usually speeds up quite a bit towards the end, but it shouldn’t be that slow on ways!
  • Are you sure you’ve set up “postgresql.conf” properly, so that PostgreSQL can both benefit from parallel processing of the ways and relations and has enough RAM for all spawned workers? You may still be seeing inefficient processing and high IO from PostgreSQL itself if it doesn’t get enough work memory to do the job; a quick way to check is shown after this list.
  • You’ve set -C 25000. In my last - successful - attempt to import the whole of Europe (a roughly 20 GB PBF) on a 16 GB RAM 4-core i5, with PostgreSQL running in a VirtualBox instance with just 11.5 GB of RAM assigned (but plenty - a few dozen GB - of Ubuntu swap space on an SSD), I set a maximum cache size of just -C 1500, which didn’t really seem to slow down the processing. I guess on your 32 GB RAM machine, setting it to something like -C 5000 should be fine, and it will at least leave the rest of the RAM for all other processes and PostgreSQL itself.
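
To verify what the running server is actually using, you can ask it directly (a quick sketch, assuming you connect as the postgres superuser as elsewhere in this thread; max_parallel_workers only exists from PostgreSQL 10 on):

psql -U postgres -c 'SHOW shared_buffers'
psql -U postgres -c 'SHOW work_mem'
psql -U postgres -c 'SHOW max_parallel_workers'

If these still show the installation defaults, osm2pgsql is not getting the benefit of your tuning.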

This post by Sarah Hoffmann, one of the osm2pgsql maintainers, may be interesting for you, although I have not yet followed all of her advice there (specifically, setting “-C 10” did not seem to have any beneficial effect on processing speed the one time I tried it, and even seemed to slow things down if I remember correctly; I have also not yet tried “fsync off” and the other settings in the same code block there, although these are likely beneficial):

https://github.com/openstreetmap/osm2pgsql/issues/883#issuecomment-443539333

By the way, the successful Europe extract import of an almost 20 GB PBF that I mentioned, on the 11.5 GB RAM VirtualBox Ubuntu 18.04 instance, was with the following “postgresql.conf” settings:

  • shared_buffers = 4000MB
  • work_mem = 800MB
  • maintenance_work_mem = 2000MB
  • effective_io_concurrency = 200 (as this is on SSD!)
  • max_worker_processes / max_parallel_workers_per_gather / max_parallel_maintenance_workers / max_parallel_workers = 4

Of course, since you have almost triple the amount of RAM available at 32 GB, doubling or even tripling the memory-related figures should be relatively safe. You must, however, also factor in your larger number of cores (6 versus the 4 I had available), since settings like work_mem apply per operation and per worker; doubling the figures is probably the safer choice if you set the parallel-worker settings to 6.
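
Doubled, that would look roughly like the sketch below (an assumption for your 32 GB / 6-core machine, simply scaling my figures above, not something I have tested; max_parallel_maintenance_workers additionally requires PostgreSQL 11 or newer):

shared_buffers = 8000MB
work_mem = 1600MB                    # applies per sort/hash operation and per worker, so watch total usage
maintenance_work_mem = 4000MB
effective_io_concurrency = 200       # only worth raising on SSD
max_worker_processes = 6
max_parallel_workers_per_gather = 6
max_parallel_maintenance_workers = 6
max_parallel_workers = 6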

And, as said, dozens of GB of swap space were configured in Ubuntu itself (although I must admit I haven’t seen swap usage rise above 6 GB or so the few times I checked during the import). For planet, though, you should definitely provide an appropriate amount of swap space on SSD on low-RAM machines.

And for the command line, as mentioned above, I used “-C 1500” together with --slim, --drop and --flat-nodes, pretty much as in the first post here by EricG97477.

I tried the various PostgreSQL tuning parameters, but with no effect on speed. I will keep poking around to see what I might be able to do.

Did you actually restart the PostgreSQL service after making changes to the configuration file?

sudo service postgresql restart

Unless you restart the PostgreSQL service, the changes in “postgresql.conf” will not be applied. That would definitely explain why you see no difference.
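
You can also ask PostgreSQL directly whether a changed setting is still waiting for a restart (a small sketch; the pending_restart column exists from PostgreSQL 9.5 on):

psql -U postgres -c "SELECT name, setting, pending_restart FROM pg_settings WHERE name = 'shared_buffers'"

If pending_restart shows t, the new value has been read but is not yet in effect.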

The fact that you initially only saw some 4 GB of RAM usage was at least an indication that you likely hadn’t yet made any changes to “postgresql.conf”: the default memory settings of a fresh PostgreSQL installation are really low and, based on multiple websites I have read, woefully inadequate for running a spatial database.

Anyway, if none of this indicates a problem, then do leave it running a bit longer. The loading of ways definitely speeds up considerably as that stage progresses.

Yes, I did restart the PostgreSQL service.

I did leave it running over the weekend, but the speed has not increased… it is still running around two orders of magnitude slower than it should be.

  • What 2 TB drive are you using? If that is a traditional mechanical hard drive instead of an SSD, it would surely explain the slow rate at the way and relation processing stage, as that stage requires fast random access, something an HDD is bad at. Your drive might then still be pounded at 100% usage while not delivering good speeds for building ways and relations.

If that is the case, by all means switch to an SSD. An HDD is no longer a really viable option considering the size of planet. At the very least, ensure the flat nodes file resides on an SSD.
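
To confirm whether the drive is the bottleneck, you can watch its utilization while the import runs (a sketch; iostat comes with the sysstat package, which you may need to install first):

sudo apt install sysstat
iostat -x 5

If %util sits near 100% while the actual throughput (rkB/s / wkB/s) stays low, random access on the drive is indeed what is holding everything up.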

  • And what is your current RAM usage during processing, after having tweaked and reloaded the “postgresql.conf” file? Do you actually see a change now? (It should definitely be more than the 4 GB you mentioned.)

  • And what does this return:

$ psql -U postgres -c 'SHOW config_file'

Is that the location of the configuration file you changed?

I do believe the limiting factor here is physical disk access; it is not an SSD. I am working on getting some server hardware with SSD storage, but it may be a while. Once I have that, I believe things will run a lot faster.

The current RAM usage is ~8.5 GB.

Yes, I did confirm with SHOW config_file that I was editing the right file.

It definitely will.

Even my Samsung EVO 2 TB SATA SSD, which only sustains about 90 MB/s steady-state once it has been running continuously at high IO for a longer time (which is actually not that bad for a SATA drive), as is the case with this type of import, shows considerably better performance than an HDD.
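
If you want to gauge what your own drive sustains, a simple sequential write test is sketched below (assuming GNU dd; oflag=direct bypasses the page cache so the figure reflects the disk itself, and the test file path is just a placeholder):

dd if=/dev/zero of=/path/to/testfile bs=1M count=4096 oflag=direct status=progress
rm /path/to/testfile

Keep in mind this measures sequential writes only; the way and relation stage is dominated by random access, where an HDD fares far worse than its sequential figures suggest.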

And you won’t reach that IO rate all the time during such an import anyway, as it will also be limited by other factors, such as the processor, depending on the stage of the import process.