MapDB is GREAT! Planet.OSM in 3.5 hours


MJ CZ

Jan 30, 2015, 2:58:36 AM
to ma...@googlegroups.com
Hi all,

  MapDB is a great product. Today I finished importing the planet.osm.pbf file from the OpenStreetMap project, which is ~27GB. After import, all the data with a spatial index and a simple search index on names takes only ~60GB. The best thing is that the import took only 3.5 hours!

  Configuration: pre-release MapDB 2.0 alpha 1, Java 8 x86_64, Xmx ~5GB, SSD, up to 4 CPU cores

  This is only the first run with this data, but it looks very promising. It's a shame that I now have to work on another project, so I can only continue with MapDB and OSM data in my spare time :-(. Later on I will test the latest MapDB 2.0.

  Thank you very much

   Martin
  
PS: the latest OSM statistics are here: http://taginfo.openstreetmap.org/reports/database_statistics

Jan Torben Heuer

Jan 30, 2015, 3:09:00 AM
to ma...@googlegroups.com
Hi Martin,

That is great to hear! Do you have something to share with us? I'm also interested in importing OSM data and would like to know how you configured MapDB, how you insert the data, and what kind of spatial index you added.

Cheers,

Jan


MJ CZ

Jan 30, 2015, 4:39:20 AM
to ma...@googlegroups.com
1) Nodes
  - I use a simple GeoHash (long type) as the key for a BTreeMap (see the sketch after this list)
  - MapDB is great for interval queries (faster than getting every single value)
  - tags are stored in a custom Record together with the coordinates

2) Ways, processed Relations
  - I have 2 spatial indexes, for both lines and polygons. The 1st covers all data; the 2nd, smaller one covers OSM zoom levels 0 - 9.
  - I use a custom R-Tree structure inspired by the HatBox project: http://hatbox.sourceforge.net/
  - records (tags only) are stored in a separate BTreeMap
  - lines and polygons are stored in a separate binary file. I'm using a modified WKB format that packs coordinates in a similar way to the PBF format

3) Tags
  - I strip tags that are not necessary for rendering (source, created_by, note, ...) and I strip translations into foreign languages (keeping the tags: name, int_name, name:cs, name:en)
  - all other tags are kept to allow rendering in 2D and 3D
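
A minimal sketch of the GeoHash-keyed node index from point 1 (simplified; geohashLong() below is just a placeholder encoder and the whole class is an illustration, not my production code):

import java.io.File;
import java.util.concurrent.ConcurrentNavigableMap;
import org.mapdb.BTreeKeySerializer;
import org.mapdb.BTreeMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class GeoHashNodeIndex {

    // Placeholder encoder: interleaves 30 bits of lon and 30 bits of lat into a
    // 60-bit hash. A real GeoHash implementation would go here.
    static long geohashLong(double lat, double lon) {
        long x = (long) ((lon + 180.0) / 360.0 * (1L << 30));
        long y = (long) ((lat + 90.0) / 180.0 * (1L << 30));
        long hash = 0;
        for (int i = 29; i >= 0; i--) {
            hash = (hash << 1) | ((x >>> i) & 1L);
            hash = (hash << 1) | ((y >>> i) & 1L);
        }
        return hash;
    }

    public static void main(String[] args) {
        DB db = DBMaker.newFileDB(new File("nodes.db"))
                .transactionDisable()
                .mmapFileEnableIfSupported()
                .make();

        // key = GeoHash, value = packed record (tags + coordinates)
        BTreeMap<Long, byte[]> nodes = db.createTreeMap("nodes")
                .keySerializer(BTreeKeySerializer.POSITIVE_LONG)
                .valueSerializer(Serializer.BYTE_ARRAY)
                .makeOrGet();

        // Interval query: because the hash interleaves lon/lat bits, one spatial
        // cell corresponds to one contiguous key range [lo, hi).
        long hash = geohashLong(50.08, 14.43);
        int prefixBits = 20;                              // cell = top 20 bits of the hash
        long lo = hash & ~((1L << (60 - prefixBits)) - 1);
        long hi = lo + (1L << (60 - prefixBits));
        ConcurrentNavigableMap<Long, byte[]> cell = nodes.subMap(lo, true, hi, false);
        for (byte[] record : cell.values()) {
            // decode the record (tags + coordinates) here
        }

        db.close();
    }
}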

   Martin

Jan Torben Heuer

Jan 30, 2015, 4:50:21 AM
to ma...@googlegroups.com
Wow, thanks for the summary. I have some more questions if you don’t mind, very MapDB <-> OSM specific.

I was doing a similar thing with rocksdb-java (it needed around 10 hours for the import). I also tried MapDB 1.x before, but even in the testing phase the performance was orders of magnitude slower. Maybe I should give the pre-alpha a try.


On 30.01.2015 at 10:39, MJ CZ <aml...@gmail.com> wrote:

> 1) Nodes
> - I use simple GeoHash (long type) as key for BTreeMap
> - MapDB is great for interval queries (faster than getting every single value)
> - tags are stored in custom Record with coordinates

So you just put each node as (geohash)(osm_id) -> node? Or how do you store all nodes that belong to one geohash?


>
> 2) Ways, processed Relations
> - I have 2 spatial indexes both for lines and polygons. 1st is for all data. 2nd smaller is for OSM levels 0 - 9.
> - I use custom R-Tree structure inspired by project HatBox: http://hatbox.sourceforge.net/
> - records (only tags) are stored in separate BTreeMap
> - lines and polygons are stored in separate binary file. I'm using modified WKB format that packs coordinates in similar way like PBF format

For the ways, you need to fetch each node by ID to create their envelope. How do you find the nodes by ID if you stored them under the geohash as the key?


Thanks,

Jan

Jan Kotek

Jan 30, 2015, 11:27:25 AM
to ma...@googlegroups.com

Hi,

 

First thanks. It is nice to hear stories like this.

 

Would you be willing to share your code? Even confidentially would do. Or perhaps describe your configuration and algorithms.

 

I just started optimizing last week, and there is still a lot of room for improvement. It would be nice to have a real-world use case. I think an in-memory import buffer and an index-less store could boost performance a bit more.

 

We already talked a bit about your use case in Czech a while ago.

 

Thanks

Jan

Andrew Byrd

Feb 6, 2015, 10:22:12 AM
to ma...@googlegroups.com, j...@kotek.net
Hi everyone,

First, thanks for all your work on MapDB. It is becoming an essential part of our work on the OpenTripPlanner project [1] and our other geospatial/transport analysis work [2]. We are currently modifying OpenTripPlanner to use MapDB for its street and public transport schedule import step. I discovered this recent conversation about loading OSM data into MapDB, and coincidentally I was just experimenting with the same thing.

We need the ability to fetch up-to-date OSM data on demand for arbitrary, large zones anywhere in the world. Several tools exist for cloning the full OSM database or performing geographic extracts, but they are coupled to traditional relational databases, can take days to initialize, stumble on country-sized extracts, or don't support the more compact binary formats we need. They don't quite fit our use case.

Therefore I recently wrote some software to parse an OSM planet.pbf into disk-backed storage and perform very quick rectangular extracts of all data necessary for modeling a transportation network [3]. It is still rough around the edges but does the job. The storage and extraction backend is written in C; I did some initial tests with MapDB but wasn't convinced I could meet our speed/space goals in Java. Well, I shouldn't have jumped to conclusions. By modifying some of my DBMaker parameters and using the appropriate kinds of maps for the job, I am now able to achieve planet.pbf load times and on-disk sizes that are comparable to our C code. Loading the 36GB Planet PBF to an SSD takes just under 2 hours, and the files consume 58 GB total. I haven't added the spatial index yet but based on back-of-the-envelope calculations I don't expect it to inflate the size enough to cause problems.

Very impressive! Rather than maintaining a special-purpose codebase it looks like I may be able to move over to a general-purpose storage backend and gain the flexibility of working with Java collections and libraries. The OSM model and PBF loader I am using is here: https://github.com/opentripplanner/OpenTripPlanner/tree/vexFormat/src/main/java/org/opentripplanner/osm

It's just a prototype at this point, but if the performance is as good as it looks, I expect to split this out into a library that we use in several products. For those who care, this also includes an OSM exchange format that is about 20-25% smaller than PBF, much simpler, and in my measurements about 2x faster to write [4]. Writing the planet back out from MapDB in this format took 1 hour 17 minutes, and the resulting file is about 18GB (though that's with metadata stripped).

-Andrew Byrd



peter.b...@yahoo.com

Feb 26, 2015, 8:37:14 AM
to ma...@googlegroups.com, j...@kotek.net

Interesting thread. I stumbled across MapDB just the other day, for the sole purpose of processing OSM data. I am blown away by the results that Martin and Andrew are reporting.

Last night I crammed all the nodes from the OSM planet into a simple TreeMap. It took 18 hours to generate a map of 2.7 billion nodes on my Windows 7 box (6 cores, 8GB RAM, x64, 7200 RPM SATA drive).

Here's the code I used to create the node map:

nodeDB = DBMaker.newFileDB(file)
        .transactionDisable()
        .mmapFileEnableIfSupported()
        .closeOnJvmShutdown()
        .make();

java.util.Map<Long, Double[]> nodes = nodeDB.getTreeMap("nodes");

Here's how I'm populating the map:

nodes.put(node.getID(), new Double[]{node.getLat(), node.getLon()});

 

I'm totally new to MapDB, but my first impression was that 18 hours was not bad. I'm sure that if I threw in an SSD, added more RAM, and ran Linux I would get better results. Barring that, perhaps there's something I should do in my code to make things faster? For example, use a different map? Or store strings instead of longs and doubles? Or maybe a different configuration setting?

Regardless of whether the write performance can be improved, my real concern is retrieving values from the map.

It takes my computer approximately 30 minutes to parse 274 million ways. For each of these ways, I need to get a coordinate from the node map (MapDB).

Unfortunately, looking up specific nodes in the map is painfully slow. The lookup is simple:

for (long nodeID : way.getNodes()){
    Double[] coord = nodes.get(nodeID);
}

I have tried to find nodes using a single thread and multiple threads. I have also experimented with various cache settings to no avail:

DBMaker maker = DBMaker.newFileDB(file);
maker.transactionDisable();
maker.mmapFileEnableIfSupported();
maker.cacheLRUEnable();
maker.cacheSize(1000000);
if (readOnly) maker.readOnly();
maker.closeOnJvmShutdown();

After running for 6 hours my app processed less than 1% of the 274 million ways.

Any suggestions on how to perform these lookups faster?

Jan Torben Heuer

Feb 26, 2015, 8:50:21 AM
to ma...@googlegroups.com
Hi Peter,

Glad to hear that there are more people interested in reading OSM data with Java/MapDB. My results are in a similar range to yours (for writing): around 24h for the full 2.7 billion nodes. It would be very interesting for me as well to know what the important factors are (apart from disk speed)! There was an earlier post claiming to import everything in 3.5 hours; it would be interesting to know whether the difference is "only" the disk. At the moment I haven't been able to outperform RocksDB, so I'm sticking with that for now. Let's see what MapDB 2 brings us!

Jan



MJ CZ

Feb 27, 2015, 12:07:03 PM
to ma...@googlegroups.com
Hi,
  the difference is not only the disk.
OSM data has VERY poor locality: you can have two nodes right next to each other (e.g. parts of the same building) whose IDs are 1 and 1 000 000 000.
Hints:
1) nodes can be imported with the data pump method, which is really fast (so far there is no builder for the pump, so you have to make your own workaround)
2) the overhead of storing a single node is huge; it's better to group nodes into chunks
3) I process ways in batches limited by the number of nodes: I read ways containing at most 100 000 nodes in total, then sort the way nodes by their IDs, get the coordinate for each node, and finally finalize each way with the coordinates that were read (rough sketch below)

  It's not the full algorithm I use, just some hints.
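
To make hint 3 concrete, here is a rough sketch of the batch lookup (heavily simplified, with a hypothetical Way class; not my actual code):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Sketch: resolve node coordinates for one batch of ways. Sorting the node IDs
// first turns random BTree lookups into an (almost) sequential scan, which is
// what matters on a slow disk.
public class WayBatchResolver {

    // hypothetical way representation: id + referenced node IDs
    static class Way {
        long id;
        long[] nodeIds;
        double[][] coords; // filled in by the resolver
    }

    static void resolveBatch(List<Way> batch, Map<Long, double[]> nodeStore) {
        // 1) collect all node IDs of the batch, deduplicated and sorted
        TreeSet<Long> wanted = new TreeSet<Long>();
        for (Way w : batch)
            for (long id : w.nodeIds) wanted.add(id);

        // 2) fetch coordinates in ascending ID order (better locality in the node store)
        Map<Long, double[]> resolved = new HashMap<Long, double[]>();
        for (long id : wanted) {
            double[] coord = nodeStore.get(id);
            if (coord != null) resolved.put(id, coord);
        }

        // 3) finalize each way with the coordinates that were read
        for (Way w : batch) {
            w.coords = new double[w.nodeIds.length][];
            for (int i = 0; i < w.nodeIds.length; i++)
                w.coords[i] = resolved.get(w.nodeIds[i]);
        }
    }
}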

M

Eric Snellman

Feb 27, 2015, 4:42:15 PM
to ma...@googlegroups.com

Peter,

You could try:
maker.createHashMap("test").keySerializer(Serializer.LONG).valueSerializer(Serializer.DOUBLE_ARRAY).makeOrGet();
maker.createTreeMap("test2").keySerializer(Serializer.LONG).valueSerializer(Serializer.DOUBLE_ARRAY).makeOrGet();
Also, since every array is the same length, you can hand-write a fixed-size valueSerializer, which should be even better; see the sketch below.
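
For illustration, a fixed-size value serializer could look roughly like this, assuming the MapDB 1.x Serializer interface (serialize / deserialize / fixedSize); LatLonSerializer is just a made-up name:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.Serializable;
import org.mapdb.Serializer;

// Hypothetical fixed-size serializer for a {lat, lon} pair.
public class LatLonSerializer implements Serializer<double[]>, Serializable {

    @Override
    public void serialize(DataOutput out, double[] value) throws IOException {
        out.writeDouble(value[0]); // lat
        out.writeDouble(value[1]); // lon
    }

    @Override
    public double[] deserialize(DataInput in, int available) throws IOException {
        return new double[]{ in.readDouble(), in.readDouble() };
    }

    @Override
    public int fixedSize() {
        return 16; // two 8-byte doubles; lets MapDB skip per-record size headers
    }
}

It would then be attached with .valueSerializer(new LatLonSerializer()) instead of Serializer.DOUBLE_ARRAY.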

Is this what you use to read the pbf file?: https://github.com/scrosby/OSM-binary

Peter Borissow

Feb 27, 2015, 7:35:40 PM
to ma...@googlegroups.com
Thanks Eric-
     I'll try the HTreeMap as you suggest. Also, I was going to try splitting the nodes up into smaller maps to see if I can get better performance looking up coordinates for the ways. I don't know if it makes sense, but it's worth a try.

As for the PBF parser, I rolled my own using the Google Protocol Buffers API.

Thanks,
Peter



andrew byrd

Mar 2, 2015, 5:08:34 AM
to ma...@googlegroups.com
On Thu, Feb 26, 2015, at 14:37, peter.borissow via MapDB wrote:
> Interesting Thread. I stumbled across MapBD just the other day for the
> sole purpose of processing OSM data. I am blown away by the results that
> Martin and Andrew are reporting.

Hello,

It's good to see so many people working on this problem.

In my experience the results do seem to vary quite a bit depending on
the underlying operating system or other machine characteristics. I
loaded the North America PBF (6.4GB) recently using the same software on
a recent Macbook and it took around two hours, so there's a serious
difference in speed.

The figures I reported before were for a dedicated server machine (not
virtual server). It has 32GB of memory, 4 cores, and is running Ubuntu
14.04LTS, and perhaps most importantly it has two separate SSDs. The PBF
is being read from one SSD, and the MapDB is being written to the other.
It could fit half the resulting MapDB file in its free memory, so I
suspect that when my loading phase is finished the page cache has not
been completely flushed out to disk.

It is also worth noting that I'm discarding all the OSM metadata (author
etc.) and many tags that we don't use such as source=* (not discarding
the OSM entities, but just these tags on the entities). Skipping the
buildings (over half of OSM data in many locations) would also help a
lot.

I'm not sure which of these factors is having the most effect, but speed
does vary a lot from one machine to another.

-Andrew

Jan Kotek

Mar 2, 2015, 9:09:55 AM
to peter.b...@yahoo.com, ma...@googlegroups.com

Hi Peter,

 

MapDB 1.0 has a read-amplification performance bug that is probably making your code slow. It is fixed (or improved a lot) in 2.0.

 

> new Double[]{

 

Use a primitive array; that makes a lot of difference. Construct the BTreeMap with something like:

 

nodeDB.createTreeMap("nodes")
        .keySerializer(BTreeKeySerializer.POSITIVE_LONG)
        .valueSerializer(Serializer.LONG_ARRAY)
        .makeOrGet()

Also use the data pump to create the BTreeMap. The resulting BTree is already compacted and probably performs better. Another option is to call db.compact() before running reads.
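
From memory the pump usage looks roughly like this; treat the exact builder methods (pumpSource and its descending-order requirement) as an assumption to check against the javadoc, not a reference:

import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.mapdb.BTreeKeySerializer;
import org.mapdb.BTreeMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Fun;
import org.mapdb.Serializer;

public class PumpSketch {
    public static void main(String[] args) {
        DB db = DBMaker.newFileDB(new File("pumped.db"))
                .transactionDisable()
                .make();

        // The pump wants entries in REVERSE (descending) key order.
        List<Fun.Tuple2<Long, long[]>> entries = new ArrayList<Fun.Tuple2<Long, long[]>>();
        for (long id = 1; id <= 1000; id++)
            entries.add(Fun.t2(id, new long[]{id * 10, id * 20})); // fake packed coords
        Collections.reverse(entries);
        Iterator<Fun.Tuple2<Long, long[]>> source = entries.iterator();

        BTreeMap<Long, long[]> nodes = db.createTreeMap("nodes")
                .keySerializer(BTreeKeySerializer.POSITIVE_LONG)
                .valueSerializer(Serializer.LONG_ARRAY)
                .pumpSource(source)
                .make();

        System.out.println(nodes.size());
        db.close();
    }
}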

 

Also, drop the LRU cache. It is not designed for this kind of workload and probably slows things down.

 

Jan

peter.b...@yahoo.com

Mar 3, 2015, 9:48:51 AM
to ma...@googlegroups.com

Thanks to Eric's suggestion, I was able to create an HTreeMap with 2.8 billion nodes in under 7 hours. That's 2x faster than my first attempt using a simple TreeMap. Also, the overall file size is almost 2x smaller than with the original TreeMap. Very impressive!

Unfortunately, retrieving nodes from the map is still too slow. With 2.8 billion nodes in one HTreeMap, I estimate that it's going to take weeks to look up coordinates for all the ways.

I suspect smaller maps might be better than 1 large HTreeMap.

To test this theory, I created 28 HTreeMaps with 100 million nodes per map. It took almost 10 hours to create the 28 HTreeMaps holding 2.8 billion nodes, about 3 hours more than creating a single HTreeMap. The reason is that closing each 100-million-node HTreeMap takes on the order of 10 minutes, and across 28 maps that adds up to the extra hours.

Next, I created an HTreeMap for all the way members.

HTreeMap<Long, double[]> wayMembers = wayMembersDB.createHashMap("way_members")
        .keySerializer(Serializer.LONG)
        .valueSerializer(Serializer.DOUBLE_ARRAY)
        .makeOrGet();

 

The keys in the map represent Way IDs and values represent nodes. Each node is represented with 3 doubles (Node ID, Lat, Lon).

Way way = (Way) element;
long wayID = way.getID();
Long[] nodeIDs = way.getNodes();

double[] arr = new double[nodeIDs.length*3];
int idx = 0;
for (int i=0; i<arr.length; i++){
    arr[i]   = nodeIDs[idx]; // node ID
    arr[i+1] = nullVal;      // lat, filled in later
    arr[i+2] = nullVal;      // lon, filled in later
    i = i+2;
    idx++;
}

wayMembers.put(wayID, arr);

 

Generating the way-members map (with null coordinates) for 274 million ways took 2 hours.

The final step is to populate the coordinates in the Way Members map. I kicked off that process last night but I had to kill it this morning because I have to do some work today :-)

Anyway, at the rate it was going last night, I estimate that it's going to take approximately 5 days to look up all the coordinates for all the ways. I'm going to run a full test later this week on a dedicated production server and find out for sure. My goal is to get this process down to less than 30 hours.

Why 30 hours? Well, for smaller OSM extracts the process I outlined above works great. I was able to ingest 100 million nodes and look up coordinates for 10 million ways in <1 hour on my machine using the Africa OSM dataset (africa-latest.osm.pbf) from geofabrik. The entire global dataset is 30x larger so I'm hoping my workflow will scale linearly with the data. The initial test I ran last night isn't promising but the only way to know for sure is to do a full run.

In summary, HTreeMaps are great for inserting/storing large data. Inserts are fast. Look-ups are painfully slow, but manageable with smaller tables (e.g. 100 million records/map).
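
For reference, the sharding idea boils down to something like this sketch (illustrative only; the shard count, file names, and serializers are placeholders, not my exact code):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

// Sketch: spread the nodes over several smaller HTreeMaps so each individual
// map stays in the "manageable" size range.
public class ShardedNodeStore {

    static final int SHARDS = 28;

    private final List<DB> dbs = new ArrayList<DB>();
    private final List<HTreeMap<Long, double[]>> maps = new ArrayList<HTreeMap<Long, double[]>>();

    public ShardedNodeStore(File dir) {
        for (int i = 0; i < SHARDS; i++) {
            DB db = DBMaker.newFileDB(new File(dir, "nodes-" + i + ".db"))
                    .transactionDisable()
                    .mmapFileEnableIfSupported()
                    .make();
            HTreeMap<Long, double[]> map = db.createHashMap("nodes")
                    .keySerializer(Serializer.LONG)
                    .valueSerializer(Serializer.DOUBLE_ARRAY)
                    .makeOrGet();
            dbs.add(db);
            maps.add(map);
        }
    }

    private int shard(long nodeID) { return (int) (nodeID % SHARDS); }

    public void put(long nodeID, double lat, double lon) {
        maps.get(shard(nodeID)).put(nodeID, new double[]{lat, lon});
    }

    public double[] get(long nodeID) {
        return maps.get(shard(nodeID)).get(nodeID);
    }

    public void close() {
        for (DB db : dbs) db.close();
    }
}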

I'll post more updates when I can.

 

Thanks,

Peter

peter.b...@yahoo.com

Mar 3, 2015, 10:04:28 AM
to ma...@googlegroups.com, peter.b...@yahoo.com, j...@kotek.net
Jan-
    Thanks for your help! I ended up using HTreeMaps. Do you think BTreeMap look-ups will be faster than HTreeMap? The HTreeMap is pretty fast with smaller maps (e.g. 100M nodes).

I'll try the BTreeMap next and drop the LRU cache and run db.compact() as you suggest.

Thanks,
Peter

andrew byrd

Mar 3, 2015, 10:30:23 AM
to ma...@googlegroups.com
Hi Martin,

It's strange that we are seeing comparable performance / load times yet
I'm just naively loading nodes and ways one by one into a BTree -- I'm
not using any optimized strategy involving large blocks of nodes or
anything. The entities in bulk-export PBF files are often sorted by ID
(but in increasing order), so that may have an effect on performance.
I'll need to pull out the algorithms textbooks to think through that
one. I considered using the data pump method, but it requires the input
data to be presorted in reverse order and I made the (perhaps unfounded)
assumption that applying this sorting process to the whole planet's
worth of OSM data was not realistic.

I get the impression you are not at liberty to give away too much
information, but I do wonder what you mean by "the overhead for storing
an individual node is huge". I suppose you mean the time overhead from
storing nodes one by one instead of bulk importing presorted data.

It is true that a series of nodes in a single building or road will have
very poor locality in terms of IDs (but very good spatial locality of
course). But I don't see how this would have an effect on the load
process since all nodes are generally sent in ascending sorted order
before any ways, and the ways only contain references to the nodes by
key (node ID). Storing a way is just storing an array of longs.

I suspect some of the differences we are seeing may be due to different
use cases and perhaps different indexes we are building. In my case
everything is designed for large rectangular geographic extracts and
uses a tile/grid based spatial index.
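
To be concrete about "tile/grid based": the idea is roughly the toy sketch below (an illustration of the approach, not the actual OpenTripPlanner code). Each lon/lat maps to a fixed-zoom web-mercator-style tile and the (x, y) pair is packed into a single long key, so a rectangular extract becomes a set of key lookups or small range scans:

// Toy fixed-zoom tile index.
public class TileIndex {

    static final int Z = 12;          // fixed zoom level of the grid
    static final int N = 1 << Z;      // number of tiles per axis

    static long tileKey(double lat, double lon) {
        int x = (int) Math.floor((lon + 180.0) / 360.0 * N);
        double latRad = Math.toRadians(lat);
        int y = (int) Math.floor(
                (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * N);
        // clamp to the valid range near the poles and the antimeridian
        x = Math.min(Math.max(x, 0), N - 1);
        y = Math.min(Math.max(y, 0), N - 1);
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        System.out.println(tileKey(45.52, -122.68)); // tile key for one example coordinate
    }
}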

-Andrew

MJ CZ

Mar 4, 2015, 5:25:52 AM
to ma...@googlegroups.com
Hi,

  I'm sorry, but I don't have permission from my boss to publish the code :-(.
And I don't have much time to talk about the strategy behind my approach. I currently work on a different project, and maps are not the main business of our company.
The map indexing has to work in a very constrained environment (~1 GB of RAM and a slow 7200 rpm HDD with ~15ms seek and 30-50MB/s reads), and it's a small part of a huge desktop application.

So I have to process all the data on a slow disk. The PBF format is very compressed. For example, storing a node naively needs 24 bytes: 8 bytes for the key and 16 bytes for lat/lon. PBF needs only about 25% of this size because of its block and delta compression.
Another example is the address tags: it takes about 2GB to store all 'addr:...' string keys without packing, so it takes 40s just to read the keys (without values) from a file on a slow disk (50MB/s). That is not acceptable for online rendering.
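
To illustrate the kind of packing I mean, here is a simplified sketch of delta + varint coordinate encoding in the spirit of PBF (an illustration only, not my actual format):

import java.io.ByteArrayOutputStream;

// Sketch: delta-encode fixed-point coordinates and write each delta as a
// zig-zag varint, the same trick PBF uses. Consecutive nodes of a way are
// close together, so most deltas fit in 1-2 bytes instead of 8.
public class CoordPacker {

    static void writeVarLong(ByteArrayOutputStream out, long v) {
        long zigzag = (v << 1) ^ (v >> 63);     // zig-zag: small negative deltas stay small
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80));
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
    }

    // lats/lons in degrees, stored as 1e-7 fixed-point deltas
    static byte[] pack(double[] lats, double[] lons) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prevLat = 0, prevLon = 0;
        writeVarLong(out, lats.length);
        for (int i = 0; i < lats.length; i++) {
            long lat = Math.round(lats[i] * 1e7);
            long lon = Math.round(lons[i] * 1e7);
            writeVarLong(out, lat - prevLat);
            writeVarLong(out, lon - prevLon);
            prevLat = lat;
            prevLon = lon;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = pack(new double[]{50.0870000, 50.0870450},
                             new double[]{14.4208000, 14.4208310});
        System.out.println(packed.length + " bytes for 2 nodes (vs 32 bytes unpacked)");
    }
}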

I would like to publish more information, but as I wrote, I don't have time to do a proper analysis and compare different approaches. I'm trying to give you some hints for the problems I was facing.

There are other approaches that index/import map data in several minutes, but they use >8-core CPUs and >32GB of RAM with SSDs. We have such machines too, but I want to give our customers the freedom to work with the map parts they need.

 M

Jan Kotek

Mar 6, 2015, 4:17:04 AM
to peter.b...@yahoo.com, ma...@googlegroups.com

> Do you think BTreeMap look ups will be faster HTreeMap?

 

BTreeMap is faster in 2.0, not sure about 1.0.

 

jan

MJ CZ

Oct 8, 2015, 4:58:17 AM
to MapDB, peter.b...@yahoo.com, j...@kotek.net
Hi,

  perf update: OSM planet.pbf import in 2h 45min, ~43GB indexed size. Same HW as mentioned above. MapDB 2.0 beta (not sure which snapshot; before the mapped byte buffer SYNC change) with small customizations. Archive format not used.
Czech Republic (680MB) imports in < 3min, 1.5GB indexed size.

Rendering the Czech Republic takes ~3 hours: 10 layers (scales 1:2 000 000 - 1:3 906), 256x256 PNG tiles, OSM Carto v2.35.0.

  Martin

Peter Borissow

Oct 8, 2015, 5:12:01 AM
to MJ CZ, MapDB, j...@kotek.net
Are you creating geometries during ingest (e.g. looking up coordinates for each point in a way to create a linestring), or are you simply dumping the contents of a PBF into MapDB?

How much faster is MapDB 2.x (beta) vs 1.x?

Thanks,
Peter



MJ CZ

Oct 9, 2015, 3:41:31 AM
to MapDB, aml...@gmail.com, j...@kotek.net, peter.b...@yahoo.com
Hi,

  I create geometries during data import, so I can't use OSM diffs. The spatial index, tags, and geometries are stored and cached separately. For geometries I use a customized WKB format (varbyte encoding similar to PBF).

  M

Jan Kotek

Oct 9, 2015, 8:23:07 AM
to ma...@googlegroups.com

Hi,

 

Just a small note: one of my customers is using MapDB for graphs and map data, so I expect there will be some improvements coming into MapDB from that direction soon.

 

Jan
