[Haskell-cafe] Storing big datasets

Mikhail Volkhov

May 6, 2016, 4:29:04 PM
to haskel...@haskell.org
Hi!

I'm using the ACID package as my main database -- it's simple and... ACID
(which is cool, of course).

So now I need to store up to ~30GB of data in a single data structure
(!) that will be constantly updated (some kind of a huge tree set).
Here's the question -- how do I operate on such a big structure?
1. It doesn't even fit in RAM
2. It should be updated atomically and frequently (every 10 seconds, up
to 500 elements out of 10^7).
3. What structures should I use? I'd like to store some 10^6~10^7
simple elements there too, which will be gigabytes of data. So it seems
to me I can't use Data.Set.

Thanks for any ideas!
--
Volkhov Mikhail
M3#38 ITMO study group 17'
Computer Technologies Department
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

Bardur Arantsson

May 6, 2016, 4:42:22 PM
to haskel...@haskell.org
On 05/06/2016 10:28 PM, Mikhail Volkhov wrote:
> Hi!
>
> I'm using the ACID package as my main database -- it's simple and... ACID
> (which is cool, of course).
>

Just for clarification: Do you mean acid-state?

> So now I need to store up to ~30GB of data in a single data structure
> (!) that will be constantly updated (some kind of a huge tree set).
> Here's the question -- how do I operate on such a big structure?
> 1. It doesn't even fit in RAM

Yes it does. My desktop machine has 32GB RAM.

I'm not trying to brag or anything, and *you* may not have that amount
of memory, but it *should* be cheap enough to buy enough RAM if it
avoids radically redesigning an existing system (with all the associated
risk, developer time, etc.).

If this is a one-off or rare occurrence, you could get 160GB RAM on an
Amazon M4 instance (m4.10xlarge) if you wanted to. (Amazon's pricing
structure is incredibly opaque and I didn't do a thorough investigation,
so I apologize if this is wrong.)

> 2. It should be updated atomically and frequently (every 10 seconds, up
> to 500 elements out of 10^7).

This is a bit vague. Do you mean that an *average* of 500 (out of 10^7)
will be updated every 10 seconds?

> 3. What structures should I use? I'd like to store some 10^6~10^7
> simple elements there too, which will be gigabytes of data. So it seems
> to me I can't use Data.Set.
>

Well, if storing things in RAM is out of the question, so is using Data.Set.

Regards,

Mikhail Volkhov

May 6, 2016, 4:55:18 PM
to haskel...@haskell.org
Thank you for the answer!

1. acid-state, yes.
2. The application is supposed to work on a regular PC, without 32GB
of RAM. I agree that it doesn't cost much to buy 32GB of RAM, but that's
not an option here.
3. Come to think of it, I think it would be more like 10k elements
per 10-20 seconds, yes. On average.
--
Volkhov Mikhail
M3#38 ITMO study group 17'
Computer Technologies Department

Lana Black

May 6, 2016, 8:28:21 PM
to Mikhail Volkhov, haskel...@haskell.org
Hi Mikhail,

Have you considered external database engines? I suppose you could benefit from using Redis in your case.
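
For instance, a minimal sketch with the hedis package could look like
the following (the key name and the idea of keeping elements in a
server-side Redis set are just placeholders for illustration, not a
schema recommendation):

  {-# LANGUAGE OverloadedStrings #-}

  import Control.Monad.IO.Class (liftIO)
  import Database.Redis

  main :: IO ()
  main = do
    conn <- connect defaultConnectInfo
    runRedis conn $ do
      -- add a batch of (already serialised) elements to a server-side set
      _ <- sadd "bigset" ["elem-1", "elem-2"]
      -- membership test without pulling the whole set into the client
      reply <- sismember "bigset" "elem-1"
      liftIO (print reply)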


Joachim Durchholz

May 7, 2016, 5:17:13 AM
to haskel...@haskell.org
On 07.05.2016 at 02:27, Lana Black wrote:
> Hi Mikhail,
>
> Have you considered external database engines? I suppose you could benefit from using Redis in your case.

Wikipedia says that while Redis can use the disk as "virtual memory",
that feature is deprecated, so it definitely expects to keep the whole
dataset in memory. Which kind of defeats the whole point for Mikhail.

sqlite comes to mind. HDBC claims to have a driver for it, and it's
linked into the application so there would be no additional setup required.
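
Roughly along these lines, for example (untested, and the table layout
is made up purely for illustration):

  import Database.HDBC
  import Database.HDBC.Sqlite3 (connectSqlite3)

  main :: IO ()
  main = do
    conn <- connectSqlite3 "dataset.db"
    _ <- run conn
           "CREATE TABLE IF NOT EXISTS elems (key INTEGER PRIMARY KEY, val BLOB)"
           []
    commit conn
    -- one transaction per batch of updates, so each batch applies atomically
    withTransaction conn $ \c -> do
      stmt <- prepare c "INSERT OR REPLACE INTO elems (key, val) VALUES (?, ?)"
      executeMany stmt [ [toSql (1 :: Int), toSql ("payload-1" :: String)]
                       , [toSql (2 :: Int), toSql ("payload-2" :: String)] ]
      finish stmt
    disconnect conn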

If Mikhail already has another database in his application, the setup
cost is already paid for, and he might want to check whether that's fast
enough for his purposes.
I'd expect postgresql or mysql to be too slow, but H2 to work just fine.
Of course that's just expectations, so testing would be needed.

Regards,
Jo

David Turner

May 7, 2016, 6:48:36 AM
to Mikhail Volkhov, haskel...@haskell.org

Btrees are good for storing data on disk. And something like postgres is an extremely efficient implementation of a btree supporting atomic updates and the like. I'd use that!
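
For example, a rough sketch with postgresql-simple (connection string,
table name and value encoding are all placeholders, not a
recommendation):

  {-# LANGUAGE OverloadedStrings #-}

  import Database.PostgreSQL.Simple

  main :: IO ()
  main = do
    conn <- connectPostgreSQL "dbname=bigset"
    _ <- execute_ conn
           "CREATE TABLE IF NOT EXISTS elems (key bigint PRIMARY KEY, val text)"
    -- the primary key is a btree index, so lookups never load the whole set
    _ <- executeMany conn "INSERT INTO elems (key, val) VALUES (?, ?)"
           [ (1 :: Int, "payload-1" :: String)
           , (2, "payload-2") ]
    [Only n] <- query_ conn "SELECT count(*) :: int FROM elems"
    print (n :: Int)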

Mikhail Volkhov

May 7, 2016, 6:58:33 AM
to haskel...@haskell.org
First, about external database engines -- I don't really understand how
I should store my own set datatype (let's say it's a red-black tree with
some extra information in the nodes, something like that) in a key-value
DB. No idea.
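
(Just to show what I mean -- would I have to flatten it into per-node
records like this, keyed by some node id? All the names below are made
up.)

  -- one key-value entry per tree node; children referenced by node id
  data NodeRecord k = NodeRecord
    { nrElem  :: k             -- the element stored in this node
    , nrExtra :: Int           -- the extra per-node information
    , nrLeft  :: Maybe Integer -- id of the left child, if any
    , nrRight :: Maybe Integer -- id of the right child, if any
    } deriving (Show, Read)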

And yes, as for Redis -- it's not supposed to work with datasets larger
than RAM can hold.

Currently it's not a problem to switch the application's DB. I'll check
your suggestions (SQLite/HDBC/Postgres/H2). Thanks!
--
Volkhov Mikhail
M3#38 ITMO study group 17'
Computer Technologies Department

Imants Cekusins

May 7, 2016, 7:13:12 AM
to Mikhail Volkhov, haskell Cafe
Neo4j may be a good fit for trees.

It may be necessary to store large blobs of data in another DB.

Joachim Durchholz

May 7, 2016, 8:44:56 AM
to haskel...@haskell.org
On 07.05.2016 at 12:48, David Turner wrote:
> Btrees are good for storing data on disk. And something like postgres is an
> extremely efficient implementation of a btree supporting atomic updates and
> the like. I'd use that!

The original question was about standard hardware (i.e. still including
rotating rust) and ~50 updates/second.
I'd assume that that's doable with an ACID-compliant DB on standard
hardware, though it does not leave room for inefficiencies so you need
to know what you're doing in SQL.

On a later update, he corrected the specs to 1,000-2,000 updates/second,
and I believe that it's impossible to do that on a standard single-HDD.
I don't know whether Mikhail considers SSDs standard configuration.

Now transaction rates aren't the same as write rates. If he can batch
multiple writes in one transaction, Postgresql or any other RDBMS might
actually work.
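
For example, something roughly like this (postgresql-simple, assuming
PostgreSQL 9.5+ for the upsert; all names purely illustrative) would
turn each 10-second batch of updates into a single transaction:

  {-# LANGUAGE OverloadedStrings #-}

  import Database.PostgreSQL.Simple

  -- apply a whole batch of element updates as one atomic transaction
  flushBatch :: Connection -> [(Int, String)] -> IO ()
  flushBatch conn batch =
    withTransaction conn $ do
      _ <- executeMany conn
             "INSERT INTO elems (key, val) VALUES (?, ?) \
             \ON CONFLICT (key) DO UPDATE SET val = EXCLUDED.val"
             batch
      return ()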

Mikhail Volkhov

May 7, 2016, 9:08:50 AM
to haskel...@haskell.org
> On a later update, he corrected the specs to 1,000-2,000 updates/second,
> and I believe that it's impossible to do that on a standard single-HDD.
> I don't know whether Mikhail considers SSDs standard configuration.

Yes, I suppose an SSD is standard configuration -- it costs less
than >32GB of RAM, and it can be installed in any device, while extra
RAM is a problem on laptops.

> Now transaction rates aren't the same as write rates. If he can batch
> multiple writes in one transaction, Postgresql or any other RDBMS might
> actually work.

Thank you, will look into these approaches.

--
Volkhov Mikhail
M3#38 ITMO study group 17', CT Department

Christopher King

May 7, 2016, 9:51:12 AM
to Haskell-cafe, haskel...@haskell.org, volho...@gmail.com
Is it lazy? If so, just using Binary (https://hackage.haskell.org/package/binary) might work. It wouldn't be ACID anymore though.
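
Something like this, for example (the Set of Ints is just a stand-in
for your real structure):

  import Data.Binary (decodeFile, encodeFile)
  import qualified Data.Set as Set

  main :: IO ()
  main = do
    let s = Set.fromList [1 .. 1000000 :: Int]
    -- dump the whole structure to disk ...
    encodeFile "dataset.bin" s
    -- ... and read it back; once forced it still lives entirely in RAM,
    -- so this only helps if the structure is consumed lazily
    s' <- decodeFile "dataset.bin" :: IO (Set.Set Int)
    print (Set.size s')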

If that doesn't work, you likely need to change the data structure a lot.

David Turner

May 7, 2016, 10:42:31 AM
to Joachim Durchholz, haskel...@haskell.org


On 7 May 2016 13:45, "Joachim Durchholz" <j...@durchholz.org> wrote:
>
> On 07.05.2016 at 12:48, David Turner wrote:
>>
>> Btrees are good for storing data on disk. And something like postgres is an
>> extremely efficient implementation of a btree supporting atomic updates and
>> the like. I'd use that!
>
>
> The original question was about standard hardware (i.e. still including rotating rust) and ~50 updates/second.
> I'd assume that that's doable with an ACID-compliant DB on standard hardware, though it does not leave room for inefficiencies so you need to know what you're doing in SQL.
>
> On a later update, he corrected the specs to 1,000-2,000 updates/second, and I believe that it's impossible to do that on a standard single-HDD. I don't know whether Mikhail considers SSDs standard configuration.
>
> Now transaction rates aren't the same as write rates. If he can batch multiple writes in one transaction, Postgresql or any other RDBMS might actually work.

Thousands of transactions per second is getting into needs-clever-optimisation territory if it can't be done in RAM, but it's not that tricky. Postgres will batch transactions for you: see for instance http://pgeoghegan.blogspot.co.uk/2012/06/towards-14000-write-transactions-on-my.html?m=1

Cheers,

Joachim Durchholz

May 7, 2016, 5:12:06 PM
to haskel...@haskell.org
On 07.05.2016 at 16:42, David Turner wrote:
> Thousands of transactions per second is getting into
> needs-clever-optimisation territory if it can't be done in RAM, but it's
> not that tricky. Postgres will batch transactions for you: see for instance
> http://pgeoghegan.blogspot.co.uk/2012/06/towards-14000-write-transactions-on-my.html?m=1

Well, actually it is - that particular guy maintains 14,000 transactions
per second, but with benchmarks that may or may not transfer to
Mikhail's scenario, and with a quite high-end 7,200 RPM HDD which may or
may not fit his definition of standard hardware.

That said, these results are still pretty impressive.

Regards,
Jo

BTW please "reply to list" -- replying directly and CC'ing the list
means I have to manually copy/paste the list address to answer.

Lana Black

May 7, 2016, 9:34:43 PM
to haskel...@haskell.org
On 11:17 Sat 07 May, Joachim Durchholz wrote:
> On 07.05.2016 at 02:27, Lana Black wrote:
> > Hi Mikhail,
> >
> > Have you considered external database engines? I suppose you could benefit from using Redis in your case.
>
> Wikipedia says that while Redis can use the disk as "virtual memory",
> that feature is deprecated, so it definitely expects to keep the whole
> dataset in memory. Which kind of defeats the whole point for Mikhail.

I believe we're talking about different things here. This page [1] says
that disk persistence is a standard feature in redis. Furthermore, my
employer uses redis as a storage backend for raw video, and it works
just fine. But yes, redis might not be the best choice depending on what
data Mikhail works with.

> sqlite comes to mind. HDBC claims to have a driver for it, and it's
> linked into the application so there would be no additional setup required.
>
> If Mikhail already has another database in his application, the setup
> cost is already paid for, and he might want to check whether that's fast
> enough for his purposes.
> I'd expect postgresql or mysql to be too slow, but H2 to work just fine.
> Of course that's just expectations, so testing would be needed.

It is certainly possible to get 2k requests per second with postgresql [2]
on fairly limited hardware if network latency is taken out of the picture.
Again, this heavily depends on the kind of data and other conditions.

[1] http://redis.io/topics/persistence
[2] https://gist.github.com/chanks/7585810

Joachim Durchholz

May 8, 2016, 3:41:57 AM
to haskel...@haskell.org
On 08.05.2016 at 03:34, Lana Black wrote:
> On 11:17 Sat 07 May, Joachim Durchholz wrote:
>> On 07.05.2016 at 02:27, Lana Black wrote:
>>> Hi Mikhail,
>>>
>>> Have you considered external database engines? I suppose you could benefit from using Redis in your case.
>>
>> Wikipedia says that while Redis can use the disk as "virtual memory",
>> that feature is deprecated, so it definitely expects to keep the whole
>> dataset in memory. Which kind of defeats the whole point for Mikhail.
>
> I believe we're talking about different things here. This page [1] says
> that disk persistence is a standard feature in redis.
> [1] http://redis.io/topics/persistence

I see. Seems like WP got something wrong here.
Actually the claims look quite nice, even though I'd want to verify
them before actually relying on them, as usual with claims :-)

> It is certainly possible to get 2k requests per second with postgresql [2]
> on fairly limited hardware if network latency is taken out of the picture.
> Again, this heavily depends on the kind of data and other conditions.
>
> [2] https://gist.github.com/chanks/7585810

Ah, that's about scaling up under concurrent load. Mikhail's use case
does not have multiple processes per data instance.

It's an interesting data point though. I've been hearing figures of "a
few hundred transactions per second, max, unless you have dedicated
hardware", but now I'm wondering how things might work out in a
single-threading scenario. It could easily be faster, though it could
also easily be slower because the engine can't parallelize work.
I have to admit I have no idea how this would turn out.

Regards,
Jo