Data - What's the best format for you?


Grant Ritchie (Locationary.com)

Apr 20, 2010, 9:10:16 AM
to GeoPlaceMatch
We're now preparing to put our place data in an exportable form and I
wanted to get feedback from the tech community on what would be the
most preferred format. The issue is that it's fairly sizable: currently
in the 100GB range, and growing. By the nature of what we're doing,
we'll have real-time updates in the future, but we wanted to start a
little simpler right now. I was thinking a weekly full dump, and then
ongoing daily deltas. Also complicating the matter in terms of
exporting is that we're always adding to the types of structured data
we collect so we don't want our changing schema to disrupt our
partners.
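
One way a consumer might combine a weekly full dump with daily deltas is sketched below. The record shape, the delta encoding (a place id paired with either changed fields or None for a removal), and the function name are all assumptions made for illustration; nothing here describes an actual Locationary export format.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record shape: a place is a dict of field name -> value.
Place = Dict[str, str]
# Hypothetical delta entry: (place_id, fields) upserts, (place_id, None) removes.
Delta = Tuple[str, Optional[Place]]


def apply_delta(snapshot: Dict[str, Place], delta: Iterable[Delta]) -> Dict[str, Place]:
    """Return a new snapshot with one day's delta applied."""
    result = dict(snapshot)
    for place_id, fields in delta:
        if fields is None:
            # The place closed or was removed upstream.
            result.pop(place_id, None)
        else:
            # Merge field by field, so newly introduced fields are carried along
            # without the consumer needing to know about them in advance.
            result[place_id] = {**result.get(place_id, {}), **fields}
    return result


weekly = {"p1": {"name": "Cafe A"}, "p2": {"name": "Diner B"}}
monday = [("p2", None), ("p1", {"phone": "555-0100"}), ("p3", {"name": "Bar C"})]
current = apply_delta(weekly, monday)
```

Merging per field rather than replacing whole records is one way to keep a growing schema from disrupting consumers, since unknown fields pass through untouched.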

Formats we were considering include:
1) CSV - always a good standby
2) XML - flexible but you pay the cost in added size
3) Native DB formats - but which one?
4) AWS snapshots - we use AWS so always a possibility
5) Others?

I would really appreciate any feedback if you have a preference...
Thank you!

-Grant

+++++++++++++++++++++++
A bit about our data:
+ We'll start publishing about 20 million places in over 90 countries,
we see this growing to around 100 million. We'll have complete
coverage of US and Canada and we're now working on UK, Australia,
Scandinavia and India.
+ We're language agnostic and can publish data in our supported
languages (currently, we support English, Spanish, French, Finnish,
Norwegian, Dutch and Polish) -- and this will grow as more users
translate.
+ We collect the basics like place name, address, contact information,
website, lat / long
+ We collect very place-specific data depending on the place type.
(for a restaurant, this includes type of cuisine, payment types
accepted, etc.)
+ We're in the process of verifying the places and adding more
information. Some places are mere stubs, and others are well built
out.

Our mission is to have the cleanest data on every physical place in
the world. We're building an army of users who compete against each
other to add places as soon as they open, remove places that have gone
out of business, and to add more data to places as we need.




jebarnard

Apr 20, 2010, 3:40:33 PM
to GeoPlaceMatch
As a developer I would like to see a structured output such as XML,
since it's very easy to parse in most systems.

Just because you use XML, doesn't mean you have to be very detailed
with the names (although detailed names make it more readable). To
keep the overhead down you could use simplified names such as <n> for
name etc...and provide a reference table which gives detailed
descriptions of what each tag really means.
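
A minimal sketch of the short-tag idea, using Python's standard xml.etree.ElementTree. The tag names, the reference table, and the <pl> wrapper element are made up for illustration; they are not a proposed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical reference table: short tag -> descriptive field name.
TAG_NAMES = {"n": "name", "a": "address", "p": "phone"}


def expand_place(xml_text: str) -> dict:
    """Parse one compact place element and return it with descriptive keys."""
    elem = ET.fromstring(xml_text)
    # Tags missing from the table are kept under their short name, so a tag
    # added later does not break an older consumer.
    return {TAG_NAMES.get(child.tag, child.tag): (child.text or "") for child in elem}


record = expand_place("<pl><n>Cafe A</n><a>1 Main St</a><x>valet</x></pl>")
```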

I also would suggest letting users decide what specific data they
want; not every user will want or need the entire database. Some
people don't care that there is valet parking at a certain location,
all they want is the name/address/phone number. I don't want to have
to download 100GB when I only NEED 5GB worth of data.

I know you're leaning away from an API approach where users can query,
but this is something I'd really like to see. It gives me the freshest
data, and only the specific data I need. I want to be able to pass in
search values (Location, Name etc.), pass in the number of results,
and pass in what data I want out (Name, Address, Phone)

From a bandwidth perspective the break-even point for something that
needs limited data (Name, Address, Phone) is something like 1 billion
queries....that's a lot of queries that need to be performed for
someone like me to justify going with a 100GB download vs API
approach.
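
The break-even estimate can be reproduced with simple arithmetic. The 100 bytes per response used here is an assumption (roughly one compact name/address/phone result), chosen because it is the figure that yields about 1 billion queries; the post does not state a response size.

```python
def break_even_queries(dump_bytes: int, bytes_per_response: int) -> int:
    """Number of API responses whose total size equals one full download."""
    return dump_bytes // bytes_per_response


DUMP_BYTES = 100 * 10**9   # the 100GB figure from the thread
RESPONSE_BYTES = 100       # assumed size of one name/address/phone result
queries = break_even_queries(DUMP_BYTES, RESPONSE_BYTES)
```

With a larger response, say 1,000 bytes, the same calculation gives 100 million queries, so the conclusion depends heavily on the assumed payload size.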

Joel Barnard

Andrew McCall

Apr 20, 2010, 4:22:37 PM
to geopla...@googlegroups.com
For data this size I'd want to process it as a stream (SAX vs DOM, for
example).
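
As a rough illustration of stream processing, here is a sketch using iterparse from Python's standard xml.etree.ElementTree. It is an incremental pull-style parser rather than SAX proper, but it shows the same idea: handle one record at a time instead of loading the whole tree. The <places>/<place> element names are assumptions.

```python
import io
import xml.etree.ElementTree as ET


def stream_places(source, tag="place"):
    """Yield one dict per <place> element without building the full document."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Drop the element's children once consumed to limit memory growth.
            elem.clear()


sample = io.BytesIO(
    b"<places><place><n>Cafe A</n></place><place><n>Bar C</n></place></places>"
)
names = [place["n"] for place in stream_places(sample)]
```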

In terms of size, XML compresses fairly well and would be my preferred
format. But then I'm biased: I absolutely loathe CSV and would import
this sort of data into HBase or similar, so every other format would
require a bit of intermediate processing.

Andrew

Webb Sprague

Apr 20, 2010, 8:27:22 PM
to geopla...@googlegroups.com
Let me vote for CSV, in addition to whatever other formats, since it
is easy to produce and consume, so you might as well. You need to
document each CSV file really well, though that can go in
separate files. Think about how successful the HTTP and Email formats
have been....

If you want to have a single-file database, SQLite is the only way to
go that I know of.
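
For a sense of what a single-file export could look like, here is a sketch with Python's built-in sqlite3 module. The table name and columns are hypothetical, and ":memory:" stands in for a real file path such as a weekly dump file.

```python
import sqlite3


def build_place_db(path, places):
    """Create (or update) a one-table SQLite database of places."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS place "
        "(id TEXT PRIMARY KEY, name TEXT, lat REAL, lon REAL)"
    )
    # INSERT OR REPLACE lets the same code load a full dump or a later delta.
    conn.executemany("INSERT OR REPLACE INTO place VALUES (?, ?, ?, ?)", places)
    conn.commit()
    return conn


conn = build_place_db(":memory:", [("p1", "Cafe A", 43.65, -79.38)])
row = conn.execute("SELECT name FROM place WHERE id = ?", ("p1",)).fetchone()
```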

Steven Echtman

Apr 20, 2010, 8:48:03 PM
to geopla...@googlegroups.com
Native PostgreSQL would be a popular DB output mode due to great geo extensions.