Loading/Dumping large data from/to disk in Go


lobo...@gmail.com

unread,
Jul 17, 2015, 10:56:54 AM7/17/15
to golan...@googlegroups.com

        Hello everyone,
 
        Is there any way of loading/dumping very large data consisting of key-value pairs (in GBs) to a file, in such a way that the keys in the file are unique and deletion of older keys is feasible?

        Any suggestions or ideas would be highly appreciated.



         Thanking you,
         Elita Lobo

Egon

unread,
Jul 17, 2015, 11:54:44 AM7/17/15
to golan...@googlegroups.com, lobo...@gmail.com
On Friday, 17 July 2015 17:56:54 UTC+3, lobo...@gmail.com wrote:

        Hello everyone,
 
        Is there any way of loading/dumping very large data consisting of key-value pairs (in GBs) to a file, in such a way that the keys in the file are unique and deletion of older keys is feasible?

How large is each key and each value?
What are the performance requirements?
How often is it read vs. written?

You could just use a database - e.g. bolt or another KV store (https://github.com/golang/go/wiki/projects#key-value-stores) or just sqlite. It might be overkill, but it might solve your problem easily.
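
For illustration, a minimal sketch of what bolt usage could look like (assuming the github.com/boltdb/bolt package; the file and bucket names are arbitrary). Keys within a bucket are unique, and a repeated Put simply overwrites the old value:

package main

import (
    "fmt"
    "log"

    "github.com/boltdb/bolt"
)

func main() {
    // Open (or create) the database file.
    db, err := bolt.Open("data.db", 0600, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Put: a second Put with the same key overwrites the old value.
    err = db.Update(func(tx *bolt.Tx) error {
        b, err := tx.CreateBucketIfNotExists([]byte("kv"))
        if err != nil {
            return err
        }
        return b.Put([]byte("some-key"), []byte("some value"))
    })
    if err != nil {
        log.Fatal(err)
    }

    // Get: read-only view.
    err = db.View(func(tx *bolt.Tx) error {
        v := tx.Bucket([]byte("kv")).Get([]byte("some-key"))
        fmt.Printf("%s\n", v)
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}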

+ Egon

Roberto Zanotto

unread,
Jul 17, 2015, 5:07:36 PM7/17/15
to golan...@googlegroups.com, lobo...@gmail.com
You could also use a filesystem directory as a hash table. For every key-value pair you create a file with the key as the file name and the value as the file's contents. It may or may not be a good solution, depending on the filesystem and how many key-value pairs you have.
If you _must_ store everything in a single file, you'll have to look for a library that handles that for you. Implementing it yourself efficiently is not as easy as it might seem. Are you familiar with the external memory model?
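
As a rough sketch of the directory-as-hash-table idea (assuming keys are safe to use as file names; otherwise hex-encode them first):

package kvfiles

import (
    "io/ioutil"
    "os"
    "path/filepath"
)

// Put writes the value to a file named after the key inside dir.
func Put(dir, key string, value []byte) error {
    return ioutil.WriteFile(filepath.Join(dir, key), value, 0644)
}

// Get reads the value back; os.IsNotExist(err) reports a missing key.
func Get(dir, key string) ([]byte, error) {
    return ioutil.ReadFile(filepath.Join(dir, key))
}

// Delete removes an old key.
func Delete(dir, key string) error {
    return os.Remove(filepath.Join(dir, key))
}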

Elita Lobo

unread,
Jul 17, 2015, 10:27:21 PM7/17/15
to Egon, golan...@googlegroups.com
Hello Egon,

Thanks a lot for the suggestions. The key can be as large as 20 bytes and the size of a value can go up to a few KBs. Reads are more frequent than writes. I don't need to fetch the value of a key from the database; I just need to ensure that each key is unique and that I am able to get a map of all the key-value pairs present in the database at any point in time.

Currently I have implemented the filesystem approach, wherein each key is a unique file. However, I am afraid that opening and closing thousands of files for each read request would affect performance.
I think boltdb and sqlite do not have a time-to-live feature.

Thanks and Regards,
Elita Lobo

Elita Lobo

unread,
Jul 17, 2015, 10:56:45 PM7/17/15
to Roberto Zanotto, golan...@googlegroups.com
Hi Roberto,

Thanks for the ideas. I am currently using a filesystem approach, but I am afraid that opening thousands of files and reading from them might affect performance, unless I do so concurrently, push the results into a channel and evaluate each file one by one. I am not very familiar with the external memory model. I have read that if I need to store everything in a single file then I would have to implement a radix tree or B+ tree and serialize and save it on disk. While this seems feasible, I am not sure how the tree would be processed once its size becomes larger than the RAM of the machine.

You are right about its implementation being difficult; however, I would love to learn and implement it. It would be great if you could briefly explain how very large files consisting of B+ trees/radix trees are processed, or point me to some resources which would help me understand this concept.

Thanks and Regards,
Elita Lobo

Roberto Zanotto

unread,
Jul 18, 2015, 12:13:32 AM7/18/15
to golan...@googlegroups.com, lobo...@gmail.com
You normally don't implement the B+ tree in RAM and then serialize it and put it in a file; you want each node to contain a certain number of records, and each node is stored in a different file. That's how the tree can grow bigger than RAM. But I only know the theory from a university course, I've never implemented it in practice. Before deciding to spend time implementing such a thing, try out one of the libraries suggested by Egon and compare the performance vs. the current code you have, so you can quickly get an idea of the potential improvements (if any!).
Regarding the filesystem approach, you could save more than one record in a single file. For each pair, compute a hash function on the key and the result will tell you in which file to put it. That should be effective in reducing the number of files needed and is quite straightforward to implement.
Are the records accessed in a certain order/pattern that could be exploited? If you could keep records that are normally accessed together in the same file, it could be a win.
Also, it could be a good idea to keep a map in memory to act as a cache, so that previously accessed items can be read from there without hitting the filesystem again.
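
A sketch of the hash-to-bucket-file idea plus a naive read-through cache (the 256-bucket layout and the use of SHA-1 are arbitrary choices of this sketch):

package kvbuckets

import (
    "crypto/sha1"
    "fmt"
    "sync"
)

// bucketFile maps a key to one of 256 bucket files via the first byte of its SHA-1.
func bucketFile(dir, key string) string {
    sum := sha1.Sum([]byte(key))
    return fmt.Sprintf("%s/bucket-%02x.dat", dir, sum[0])
}

// Cache is a naive read-through cache in front of a slower loader (e.g. the filesystem).
type Cache struct {
    mu   sync.RWMutex
    data map[string][]byte
    load func(key string) ([]byte, error)
}

func NewCache(load func(string) ([]byte, error)) *Cache {
    return &Cache{data: map[string][]byte{}, load: load}
}

func (c *Cache) Get(key string) ([]byte, error) {
    c.mu.RLock()
    v, ok := c.data[key]
    c.mu.RUnlock()
    if ok {
        return v, nil
    }
    v, err := c.load(key) // hit the filesystem only on a cache miss
    if err != nil {
        return nil, err
    }
    c.mu.Lock()
    c.data[key] = v
    c.mu.Unlock()
    return v, nil
}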

Regards.

Egon

unread,
Jul 18, 2015, 3:56:51 AM7/18/15
to golan...@googlegroups.com, lobo...@gmail.com


On Saturday, 18 July 2015 05:27:21 UTC+3, Elita Lobo wrote:
Hello Egon,

Thanks a lot for the suggestions. The key can be as large as 20 bytes and the size of a value can go up to a few KBs. Reads are more frequent than writes. I don't need to fetch the value of a key from the database; I just need to ensure that each key is unique and that I am able to get a map of all the key-value pairs present in the database at any point in time.

Currently I have implemented the filesystem approach, wherein each key is a unique file. However, I am afraid that opening and closing thousands of files for each read request would affect performance.
I think boltdb and sqlite do not have a time-to-live feature.

Neither does the filesystem. You can implement it on top of boltdb/sqlite in a similar way.

Here's how I solve such problems:

First lay out the facts:

* ~1KB per entry; for 1GB this results in about 1e6 entries.
* Reads have to be fast and writes can be slow (if I had exact stats, I would use them as well).
* Each key has to be unique

I need reads to be fast, hence the index is most important. I need 20 bytes * 1e6 ~= 20MB for keys, which I can easily and quickly load into RAM. Since writes can be slow, adding to/removing from the index is the best time to ensure uniqueness and re-sort/reorganize so that reads stay fast.

For the data it wouldn't be great to have 1e6 files, so I should try to fit several entries into a single file. That means I also need some kind of "file pointer" (filename + offset + length) for each of them.

Coming back to the keys, that's all I need to know to structure them:

type Key [20]byte
type FileID byte

type Entry struct {
    Key    Key
    File   FileID // ???
    Offset uint64
    Length uint64
}

Now for the index I have several choices

1. type Index []Entry // and use binary/linear search
2. type Index map[hash of the key][]Entry // use map lookup

1. can be read/written to the disk very fast. I could memory map the index file into main memory.
2. provides faster (probably) lookup speed, but writing/reading the index file becomes more complicated and slower.
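
For illustration, a sketch of option 1, reusing the Key and Entry types above; the slice is kept sorted by key and binary-searched:

import (
    "bytes"
    "sort"
)

// Index is kept sorted by Key so lookups can use binary search.
type Index []Entry

func (ix Index) find(k Key) (int, bool) {
    i := sort.Search(len(ix), func(i int) bool {
        return bytes.Compare(ix[i].Key[:], k[:]) >= 0
    })
    return i, i < len(ix) && ix[i].Key == k
}

// Put inserts or overwrites an entry, keeping keys unique and sorted.
func (ix *Index) Put(e Entry) {
    i, found := ix.find(e.Key)
    if found {
        (*ix)[i] = e
        return
    }
    *ix = append(*ix, Entry{})
    copy((*ix)[i+1:], (*ix)[i:])
    (*ix)[i] = e
}

// Get returns the entry for k, if present.
func (ix Index) Get(k Key) (Entry, bool) {
    i, found := ix.find(k)
    if found {
        return ix[i], true
    }
    return Entry{}, false
}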

Regarding Entry.File, I'm not sure whether a single-file or multiple-file approach would be better, but I'll leave it in just in case... 255 files is probably sufficient, and if I don't need them they can be removed later.

Now, putting data into the database would mean first writing the value, i.e. appending the data to a file, and then writing that information along with the key to the index.

In pseudocode:

AddEntry(key Key, value []byte)

1. find the appropriate file
2. acquire lock for that file
3.   append key header to the file (key and the length of value)
4.   append value to the file
5.   flush data to disk
6.   acquire lock for the index
7.     add key to index
8.     resort/organize index
9.   release index lock
10. release file lock
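
A rough Go translation of that pseudocode, reusing the Key, FileID, Entry and Index types from the sketches above; the file naming, the per-file locking scheme and the 28-byte header layout are assumptions made only to keep the sketch concrete:

import (
    "crypto/sha1"
    "encoding/binary"
    "fmt"
    "io"
    "os"
    "sync"
)

type Store struct {
    dir     string
    fileMu  [256]sync.Mutex // one lock per data file
    indexMu sync.Mutex
    index   Index // the sorted Index from the earlier sketch
}

// AddEntry appends the value to the data file chosen by the key
// and records its location in the index.
func (s *Store) AddEntry(key Key, value []byte) error {
    // 1. find the appropriate file
    sum := sha1.Sum(key[:])
    id := FileID(sum[0])
    name := fmt.Sprintf("%s/data-%02x.dat", s.dir, id)

    // 2. acquire lock for that file
    s.fileMu[id].Lock()
    defer s.fileMu[id].Unlock()

    f, err := os.OpenFile(name, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
    if err != nil {
        return err
    }
    defer f.Close()

    offset, err := f.Seek(0, io.SeekEnd)
    if err != nil {
        return err
    }

    // 3.-4. append key header (key + value length), then the value itself
    var header [28]byte
    copy(header[:20], key[:])
    binary.BigEndian.PutUint64(header[20:], uint64(len(value)))
    if _, err := f.Write(header[:]); err != nil {
        return err
    }
    if _, err := f.Write(value); err != nil {
        return err
    }

    // 5. flush data to disk
    if err := f.Sync(); err != nil {
        return err
    }

    // 6.-9. update the index under its own lock
    s.indexMu.Lock()
    s.index.Put(Entry{
        Key:    key,
        File:   id,
        Offset: uint64(offset) + uint64(len(header)), // value starts after the header
        Length: uint64(len(value)),
    })
    s.indexMu.Unlock()
    return nil
}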

For determining the file, I could use sha1(key) and use the first byte as the filename. Alternatively, you could append to a file and, when it reaches some limit, e.g. 32MB, pick a new file to write to. Or it can always be a single file. The sha1 approach would work better in a concurrent situation because you don't have multiple threads/processes trying to lock a single file. I'm not sure what would be best, but it's something that can be tested and changed later.

Now how to delete data:

1. acquire index lock
2.   delete entry from index
3. release index lock
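
Continuing the same sketch, deletion only touches the in-memory index:

// DeleteEntry removes the key from the index; the value bytes stay
// in the data file until the next compaction reclaims them.
func (s *Store) DeleteEntry(key Key) {
    s.indexMu.Lock()
    defer s.indexMu.Unlock()
    if i, found := s.index.find(key); found {
        s.index = append(s.index[:i], s.index[i+1:]...)
    }
}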

Each deletion will leak around 1KB of disk space, so deleting 1e5 entries will leak ~100MB. So you need to figure out when to garbage collect all the wasted disk space. To compact the data, you essentially create a new database:

How it works:

1. create new DB
2. for each key/value in the old DB
3.   add key/value to the new DB
4. delete the old DB
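
Continuing the sketch, a naive compaction just re-adds every live entry into a fresh store (ReadEntry is a hypothetical helper that looks up the index and reads Length bytes at Offset):

// Compact copies all live entries into a new store; afterwards the
// old directory can be deleted.
func Compact(old *Store, newDir string) (*Store, error) {
    fresh := &Store{dir: newDir}

    old.indexMu.Lock()
    entries := append(Index(nil), old.index...) // snapshot of the index
    old.indexMu.Unlock()

    for _, e := range entries {
        value, err := old.ReadEntry(e.Key) // hypothetical read helper
        if err != nil {
            return nil, err
        }
        if err := fresh.AddEntry(e.Key, value); err != nil {
            return nil, err
        }
    }
    return fresh, nil
}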

This can of course be optimized by reusing the old files and doing it in place... etc.

And there you have it... a design for a brain-dead key-value store. No B+ or radix trees needed. And it should actually perform quite well.

+ Egon

Klaus Post

unread,
Jul 19, 2015, 5:24:22 AM7/19/15
to golan...@googlegroups.com, egon...@gmail.com, lobo...@gmail.com
On Saturday, 18 July 2015 04:27:21 UTC+2, Elita Lobo wrote:
Thanks a lot for the suggestions. The key can be as large as 20 bytes and the size of a value can go up to a few KBs. Reads are more frequent than writes. I don't need to fetch the value of a key from the database; I just need to ensure that each key is unique and that I am able to get a map of all the key-value pairs present in the database at any point in time.

Currently I have implemented the filesystem approach, wherein each key is a unique file. However, I am afraid that opening and closing thousands of files for each read request would affect performance.
I think boltdb and sqlite do not have a time-to-live feature.

This sounds exactly like what Bolt is good at. I would look at that, and find some other way of handling time-to-live, something like inserting a timestamp before your data and checking it on read. In Bolt you dump a byte array as the value, so there is nothing preventing you from inserting an 8-byte timestamp before the data. With Bolt you get persistent views and transactional writes, and all that good stuff.
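
For example, a small sketch of that timestamp-prefix idea (the 8-byte big-endian expiry in Unix nanoseconds is just one possible encoding):

package ttlval

import (
    "encoding/binary"
    "errors"
    "time"
)

// wrap prepends an 8-byte expiry timestamp (Unix nanoseconds) to the value.
func wrap(value []byte, ttl time.Duration) []byte {
    buf := make([]byte, 8+len(value))
    binary.BigEndian.PutUint64(buf, uint64(time.Now().Add(ttl).UnixNano()))
    copy(buf[8:], value)
    return buf
}

// unwrap checks the expiry on read and strips the prefix.
func unwrap(stored []byte) ([]byte, error) {
    if len(stored) < 8 {
        return nil, errors.New("value too short")
    }
    expiry := int64(binary.BigEndian.Uint64(stored))
    if time.Now().UnixNano() > expiry {
        return nil, errors.New("entry expired")
    }
    return stored[8:], nil
}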

Bolt is really fast on reads, and writes are reasonable. When you insert your initial data, try to do it in key-sorted order; that makes insertion speed jump from 2k to 200k inserts/sec.

Egon's solution is really nice, but I cannot really see why you would implement it yourself, unless you want the headaches of maintaining/debugging it.


/Klaus

Egon

unread,
Jul 19, 2015, 6:57:10 AM7/19/15
to golan...@googlegroups.com, egon...@gmail.com, lobo...@gmail.com
I can see two reasons:

1. You are in a situation where you need very predictable performance behavior. Writing it yourself is the best way of understanding where each ns goes.
2. You want to learn.

And as a note... in production I would use bolt instead of writing it myself - I rarely need such high performance. :)

+ Egon

Egon

unread,
Jul 19, 2015, 8:06:29 AM7/19/15
to golan...@googlegroups.com, egon...@gmail.com


On Sunday, 19 July 2015 13:57:10 UTC+3, Egon wrote:


On Sunday, 19 July 2015 12:24:22 UTC+3, Klaus Post wrote:
On Saturday, 18 July 2015 04:27:21 UTC+2, Elita Lobo wrote:
Thanks a lot for the suggestions. The key can be as large as 20 bytes and the size of a value can go up to a few KBs. Reads are more frequent than writes. I don't need to fetch the value of a key from the database; I just need to ensure that each key is unique and that I am able to get a map of all the key-value pairs present in the database at any point in time.

Currently I have implemented the filesystem approach, wherein each key is a unique file. However, I am afraid that opening and closing thousands of files for each read request would affect performance.
I think boltdb and sqlite do not have a time-to-live feature.

This sounds exactly like what Bolt is good at. I would look at that, and find some other way of handling time-to-live, something like inserting a timestamp before your data and checking it on read. In Bolt you dump a byte array as the value, so there is nothing preventing you from inserting an 8-byte timestamp before the data. With Bolt you get persistent views and transactional writes, and all that good stuff.

Bolt is really fast on reads, and writes are reasonable. When you insert your initial data, try to do it in key-sorted order; that makes insertion speed jump from 2k to 200k inserts/sec.
Egon's solution is really nice, but I cannot really see why you would implement it yourself, unless you want the headaches of maintaining/debugging it.


I was interested to see what the ideal performance for my approach would be in Go.

Write 511.058MB in 103.0059ms
      4961.445MB/s
      246404 entries
Flush 3.4121952s
Read  511.058MB in 425.0243ms
      1202.421MB/s
      246404 entries

This is pretty much the maximum you would be able to squeeze out of it. Obviously I'm taking a lot of shortcuts - e.g. ignoring the index.

The tests were:
Write random chunks between 256B and 4KB.
Flush everything to disk.
Read the chunks in written order and sum the bytes. (I had to do something with the data.)

+ Egon

ca...@doxsey.net

unread,
Jul 19, 2015, 10:04:00 AM7/19/15
to golan...@googlegroups.com
Bolt is a good solution. Having worked on a similar problem before, here's another strategy:

For a unique set that you can't fit entirely in memory, you can use sorted files.

Start with a sorted, in-memory unique set. Periodically (perhaps every 10MB or 100MB) write this set to a file and clear it. After you've processed all the records you will have a directory of many files, each of which is sorted and unique, but may contain keys found in other files.
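
A sketch of that periodic flush (the key-count threshold and file naming are arbitrary; only keys are stored, since the goal is a unique set):

package sortedruns

import (
    "bufio"
    "fmt"
    "os"
    "sort"
)

// Run accumulates unique keys in memory and spills them to a sorted file
// whenever the set grows past limit.
type Run struct {
    keys  map[string]struct{}
    limit int
    n     int // number of run files written so far
    dir   string
}

func NewRun(dir string, limit int) *Run {
    return &Run{keys: map[string]struct{}{}, limit: limit, dir: dir}
}

func (r *Run) Add(key string) error {
    r.keys[key] = struct{}{}
    if len(r.keys) >= r.limit {
        return r.Flush()
    }
    return nil
}

// Flush writes the current set as a sorted file and clears it.
func (r *Run) Flush() error {
    sorted := make([]string, 0, len(r.keys))
    for k := range r.keys {
        sorted = append(sorted, k)
    }
    sort.Strings(sorted)

    f, err := os.Create(fmt.Sprintf("%s/run-%06d.txt", r.dir, r.n))
    if err != nil {
        return err
    }
    w := bufio.NewWriter(f)
    for _, k := range sorted {
        fmt.Fprintln(w, k)
    }
    if err := w.Flush(); err != nil {
        f.Close()
        return err
    }
    r.n++
    r.keys = map[string]struct{}{}
    return f.Close()
}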

You can then efficiently merge the sorted files into a single unique file.

This approach works well for daily, one-time processing. Although it's not the most efficient, it is very easy to implement and also very easy to debug. You can even use standard Unix command-line tools (`sort -m` with `uniq`).

It would not work well for on-the-fly checks though. Bolt would make more sense for that.

Another thing worth checking out is a Bloom filter, which you could put in front of Bolt to improve performance.
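
For illustration, a tiny hand-rolled Bloom filter; real packages exist, and the double-hashing scheme here is just to show the idea of skipping Bolt lookups for keys that are definitely absent:

package bloom

import "hash/fnv"

// Filter is a minimal Bloom filter: Test may return false positives,
// never false negatives, so a "no" can skip the Bolt lookup entirely.
type Filter struct {
    bits []uint64
    k    uint64 // number of hash functions
}

func New(mBits, k uint64) *Filter {
    return &Filter{bits: make([]uint64, (mBits+63)/64), k: k}
}

// two FNV-derived hashes, combined as h1 + i*h2 (Kirsch-Mitzenmacher style).
func (f *Filter) hashes(key []byte) (uint64, uint64) {
    h := fnv.New64a()
    h.Write(key)
    h1 := h.Sum64()
    h.Write([]byte{0xff})
    return h1, h.Sum64()
}

func (f *Filter) set(i uint64)      { f.bits[i/64] |= 1 << (i % 64) }
func (f *Filter) get(i uint64) bool { return f.bits[i/64]&(1<<(i%64)) != 0 }

func (f *Filter) Add(key []byte) {
    h1, h2 := f.hashes(key)
    m := uint64(len(f.bits)) * 64
    for i := uint64(0); i < f.k; i++ {
        f.set((h1 + i*h2) % m)
    }
}

func (f *Filter) Test(key []byte) bool {
    h1, h2 := f.hashes(key)
    m := uint64(len(f.bits)) * 64
    for i := uint64(0); i < f.k; i++ {
        if !f.get((h1 + i*h2) % m) {
            return false
        }
    }
    return true
}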

Aliaksandr Valialkin

unread,
Jul 19, 2015, 2:40:40 PM7/19/15
to golan...@googlegroups.com
You just described the LSM tree algorithm. There is a good implementation of this algorithm from Google: LevelDB (https://en.m.wikipedia.org/wiki/LevelDB). There are various bindings and implementations of this library in Go - just google for "LevelDB golang".
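
For example, a minimal sketch with the pure-Go goleveldb port (assuming the github.com/syndtr/goleveldb/leveldb package):

package main

import (
    "fmt"
    "log"

    "github.com/syndtr/goleveldb/leveldb"
)

func main() {
    // OpenFile creates the database directory if it does not exist.
    db, err := leveldb.OpenFile("path/to/db", nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Keys are unique: a second Put with the same key overwrites the value.
    if err := db.Put([]byte("some-key"), []byte("some value"), nil); err != nil {
        log.Fatal(err)
    }

    value, err := db.Get([]byte("some-key"), nil)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%s\n", value)
}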

antoine...@gmail.com

unread,
Jul 19, 2015, 5:54:46 PM7/19/15
to golan...@googlegroups.com, lobo...@gmail.com
The TTL feature you seem to be looking for means you'll have a very large amount of writes, at least some factor of your read load, since every deletion will involve a write.

If you want something like that, I'd just use a DB that supports TTLs out of the box; depending on your use case, there are many choices with different tradeoffs. If you just need an embeddable database, you could cook something up with Bolt by encoding a TTL in the values you serialize, and manually manage and restart the TTLs each time you load the DB.

For uniqueness of keys, an easy way is to use a hash function over your values to define the key. Say you take the SHA1 sum of your value and make that your key. Then your keys will be (with very high probability) unique. Depending on the hash function, collisions are less probable than many apocalyptic events, so you can pretty much assume that all keys are truly unique. A good compromise between speed and hard-to-construct collisions is the BLAKE2 hashing algorithm. There are a few implementations lying around. Datastores deriving keys from values are called 'content addressable', meaning that if you know the content, you can retrieve it. This provides a bunch of nice features, such as automatic deduplication of values.
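
A tiny sketch of deriving a content-addressed key (SHA1 here, matching the example above; a BLAKE2 implementation would be a drop-in replacement):

package contentaddr

import (
    "crypto/sha1"
    "encoding/hex"
)

// KeyFor derives the key from the value itself, so identical values
// always map to the same key and duplicates collapse automatically.
func KeyFor(value []byte) string {
    sum := sha1.Sum(value)
    return hex.EncodeToString(sum[:])
}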

Elita Lobo

unread,
Jul 20, 2015, 2:47:10 AM7/20/15
to Egon, golan...@googlegroups.com
Hi,

Thanks Egon, Roberto, Klaus and everyone else for the suggestions and ideas, I really appreciate your help. :) I'll implement the filesystem model using the approach suggested by Egon and others, as well as try goleveldb (LedisDB), which has a TTL option and is quite fast too, and then choose the one which gives the best performance!