Change minutely hstore to daily (or hourly)

Hi,

I managed to set up an OSM-Sever, updating itself per minute-replicate by following this tutorial: http://wiki.openstreetmap.org/wiki/DE:HowTo_minutely_hstore

Now I would like to change the updates from minutely to daily (or hourly), because the load of using minutely diffs is a bit heavy on my server. I would prefer to automatically update the database in the evening with a daily diff, so that the load keeps small during the day. Could someone give me a hint, how to change the “load-next” script, currently used for minutely updates to load and import daily updates. If this is not possible, would it be possible to change it to the hour-replicate?

configuration.txt



# The URL of the directory containing change files.
baseUrl=http://planet.openstreetmap.org/minute-replicate/

# Defines the maximum time interval in seconds to download in a single invocation.
# Setting to 0 disables this feature.

# 6 hours
maxInterval = 21600


load-next script to initiate the update



#!/bin/bash
# loads the diffs for the interval read from configuration.txt
# may be executed every 5 minutes or so

PIDFILE=`basename $0`.pid

OSMOSIS=/opt/osmosis/bin/osmosis
OSM2PGSQL=/usr/local/bin/osm2pgsql
STYLE=/usr/local/share/osm2pgsql/default.style

# java proxy settings for osmosis
# JAVACMD_OPTIONS="-Dhttp.proxyHost=ha-proxy.esi -Dhttp.proxyPort=8080"
# export JAVACMD_OPTIONS

OSMOSISLOG=logs/osmosis.log
PSQLLOG=logs/osm2pgsql.log
EXPIRYLOG=logs/expiry.log
RUNLOG=logs/run.log

HOST=/var/run/postgresql
DB=gis
PREFIX=planet
USER=gis

CURRENT=/tmp/osm-load-next.$$.osc
EXPIRE=0

m_info()
{
        echo "[`date +"%Y-%m-%d %H:%M:%S"`] $$ $1" >> "$RUNLOG"
}

m_error()
{
        echo "[`date +"%Y-%m-%d %H:%M:%S"`] $$ [error] $1" >> "$RUNLOG"

        m_info "resetting state"
        /bin/cp last.state.txt state.txt

        rm "$PIDFILE"
        exit 1
}

m_ok()
{
        echo "[`date +"%Y-%m-%d %H:%M:%S"`] $$ $1" >> "$RUNLOG"
}

getlock()
{
        if [ -s $PIDFILE ]; then
                if [ "$(ps -p `cat $PIDFILE` | wc -l)" -gt 1 ]; then
                        return 1 #false
                fi
        fi

        echo $$ >"$PIDFILE"
        return 0 #true
}

freelock()
{
        rm "$PIDFILE"
}

WDIR=`dirname $0`
pushd $WDIR >/dev/null
#m_info "Workingdir $WDIR"

if ! getlock; then
        m_info "pid `cat $PIDFILE` still running"
        exit 3
fi

if [ -e stop -o -e stop.txt ]; then
        m_info "stopped"
        exit 2
fi


m_ok "start import"
echo $$ >"$PIDFILE"

/bin/cp state.txt last.state.txt
m_ok "downloading diff"
if ! $OSMOSIS --read-replication-interval --simplify-change --write-xml-change "$CURRENT" 1>&2 2> "$OSMOSISLOG"; then
        m_error "osmosis error"
fi

NODES=`grep '<node' < "$CURRENT" |wc -l`
WAYS=`grep '<way' < "$CURRENT" |wc -l`
RELS=`grep '<rel' < "$CURRENT" |wc -l`

m_info "expecting Node("$((NODES / 1000))"k) Way("$((WAYS / 1000))"k) Relation("$((RELS / 1000))"k)"

m_ok "importing diff"
if ! $OSM2PGSQL --append --slim --cache 3000 --merc --prefix $PREFIX --style $STYLE --host $HOST --database $DB --username $USER --hstore --verbose "$CURRENT" 1>&2 2> "$PSQLLOG"; then
        m_error "osm2pgsql error"
fi

m_ok "import done"
freelock

if [ $EXPIRE -gt 0 ]; then
        m_ok "expiring tiles"
        if ! sudo -u tirex tile_expiry/do-expire.rb "$CURRENT" 2>&1 > "$EXPIRYLOG"; then
                m_error "tile_expiry error"
        fi
        m_ok "expire done"
fi

/bin/rm -f "$CURRENT"

if [ `wc -l < "$RUNLOG"` -gt 1000 ]; then
        m_ok "truncating runlog"
        tail -n500 "$RUNLOG" > "$RUNLOG".tmp
        mv "$RUNLOG".tmp "$RUNLOG"
fi

popd >/dev/null


I would be very thankful for any help or hint, that helps me solve this matter.

Best regards,
Maroder

I haven’t tried this, but I guess that if you push the configuration.txt maxInterval value up to 86400 (1 day) and run the import script at midnight instead of every 5 minutes, you should have your import running during the night instead of every 5 minutes

Thank you sly.

I managed to switch to the hour-replicate by changing the baseURL in “configuration.txt” and downloading a new hour-replicate “state.txt” reflecting the current state of the database, as the minute-replicate “state.txt” I was using didn’t work anymore. After using the new “state.txt” the hourly-imports seem to be running fine.

The bad thing is however, the hour-replicate is not really any faster than the minute-replicate and the server needs between 6-8 hours for12 hours worth of updates. This is even worse as mapnik/tirex gets hit by major slowdowns, while imports are running. I think choosing a virtualized server wasn’t such a good idea after all :wink:

Here the configuration of the server:

  • 4 x 1.5 GHz (up to 2.5 GHz dynamically, depending on the load of the physical server)
  • 8 GB Ram (up to 16 GB dynamically, depending on the load of the physical server)
  • 500 GB HD (Raid 10)
  • Postgresql 8.3-Settings:
    shared_buffers = 256MB
    temp_buffers = 128MB
    work_mem = 256MB
    max_fsm_pages = 4000000 # This seems weird, but posgresql wanted this amount (log)
    checkpoint_segments = 40
    effective_cache_size = 4096MB
    fsync = off

Is there anything I could do to get a better performance on the imports? I suppose the IO-performacne of the raid is causing the slowdowns, as the vacuum process after the import takes ages?

Thank you very much in advance and best regards,
Maroder

Here are my postgres 8.3 settings, and my minutely are importing much faster than yours :

$ grep MB postgresql.conf
shared_buffers = 100MB # min 128kB or max_connections*16kB
temp_buffers = 5MB # min 800kB
#work_mem = 1MB # min 64kB
maintenance_work_mem = 32MB # min 1MB
#max_stack_depth = 2MB # min 100kB
checkpoint_segments = 30 # in logfile segments, min 1, 16MB each
effective_cache_size = 100MB
#log_rotation_size = 10MB # Automatic rotation of logfiles will

Since yours looks good has well, I’ll suppose there might be some disk IO limitations with your setup.

Have you tried some benchmarks on your RAID 10 ? (sequential write/reads and random write/reads) ?

I’m not an expert in disk benchmarks, but using a simple hdparm -Tt /dev/drive gives an overall idea of the IO sequential speed.

Mine gives :
$ hdparm -tT /dev/md3

/dev/md3:
Timing cached reads: 8398 MB in 2.00 seconds = 4200.81 MB/sec
Timing buffered disk reads: 992 MB in 3.00 seconds = 330.52 MB/sec

I suppose you’re right with the IO limitation, but the problem is that I can’t really benchmark my disk as access to the device is not allowed (it’s a hosted solution).

sudo hdparm -tT /dev/vzfs
/dev/vzfs: Permission denied

Probably this has to do with the virtualized filesystem. Is there any other way to benchmark the IO performance, without accessing the device directly?

Virtualised file systems carry with them the penalty of contention. Other users might be running IO operations as well, and you neither have a way to know that, nor a way to avoid it. You’re also going through additional layers, both an extra virtualisation layer, as probably an iSCSI or FibreChannel layer.

So it’s not surprising your performance differs so much from others that run on bare metal.