Bcolz performance tuning on AWS


Carst Vaartjes

Feb 5, 2017, 8:41:30 AM
to bcolz
Hi,

We recently introduced a new, ZeroMQ-based distributed system for bquery (see https://github.com/visualfabriq/bqueryd). In short: we chunk large bcolz files into smaller 'shards' (normally 1 million records) and divide those over servers, creating a massively parallel querying system. So we have tables with over 3 billion records, leading to over 3,000 shards spread across a series of servers, where all the work is done in parallel.
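To make that concrete, here is a minimal sketch of what the sharding could look like with plain bcolz; the helper name, directory layout and compression settings are illustrative assumptions, not the actual bqueryd code:

import os
import bcolz

SHARD_LEN = 1000000  # roughly 1 million records per shard, as described above

def write_shards(ct, out_dir, shard_len=SHARD_LEN, clevel=5, cname='lz4'):
    # hypothetical helper: split a (possibly on-disk) ctable into fixed-size on-disk shards
    cparams = bcolz.cparams(clevel=clevel, cname=cname)
    n_shards = (len(ct) + shard_len - 1) // shard_len
    for i in range(n_shards):
        start = i * shard_len
        stop = min(start + shard_len, len(ct))
        rootdir = os.path.join(out_dir, 'shard_%05d' % i)
        # slicing a ctable returns a numpy structured array; persist it as its own ctable
        shard = bcolz.ctable(ct[start:stop], rootdir=rootdir, mode='w', cparams=cparams)
        shard.flush()

Each shard directory can then be handed to a worker, which is what lets the queries run in parallel across servers.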
We are putting the finishing touches on it, but in general it has gone live and is working well. (We entered a CfP for PyData Amsterdam with this; let's see if they are interested.)
We run on AWS, on c4.2xlarge instances, which means 8 CPUs and 8,000 IOPS with a 256 kB IO block size. Where we previously were CPU-limited, we are now running into IO limitations (meaning we are a lot faster than before, but can still improve this bottleneck a lot).
So I think we have come to the point where we should not just rely on "expected_len" any more, but should start fine-tuning with chunklen and the compression settings. This is a classical "communicating vessels" problem though, so I wondered if anyone else has experience here? I will try to share ours in some follow-up posts if people are interested.
In short:
  • We want to make maximum use of both CPU and disk, i.e. push back both bottlenecks
  • The latest bquery can push everything except the aggregation result itself out-of-core, so we run with a very low memory profile -> memory is not an issue
  • CPU is currently underused, with disk IO being the main bottleneck
  • After compression we want to end up with a chunk size that is ideally just under 256 kB. It's more important to be under the limit than just above it (because exceeding the limit doubles the IO count per chunk)
  • Ideally we would have a variable chunklen that is set by testing during writing/compression, but 1) that is not supported now and 2) it would make writing slower (not really a problem normally, as we write WORM-style: write once, read many times)
  • So we need to adjust the chunklen so that the compressed chunk file is usually near, yet under, 256 kB (say, under the limit 90% of the time); see the sketch after this list
    • We have mixed int64 and float64 ctables, which might also complicate matters even more because they might have different disk lengths and compression rates
  • After this, we might still be IO limited, but it should be a lot less than before (I will do some research into the current average on-disk size of our chunks)
  • If we are still IO limited, we can turn up compression so we offload more to the cpu and less to the file system
    • That would also mean that we need to change the chunklen again though to stay near 256kB chunks
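
As a first stab at that tuning, something like the sketch below could help pick a chunklen empirically: write a representative sample of a column to a temporary carray with a candidate chunklen and measure the resulting .blp chunk files on disk. The candidate chunklens and the 90%-under-the-limit check are just assumptions for illustration:

import fnmatch
import os
import shutil
import tempfile

import numpy as np
import bcolz

TARGET = 256 * 1024  # stay just under the 256 kB IO block size

def chunk_file_sizes(sample, chunklen, clevel=5, cname='lz4'):
    # write the sample with the candidate chunklen and collect the on-disk .blp sizes
    tmp = tempfile.mkdtemp()
    try:
        rootdir = os.path.join(tmp, 'sample_carray')
        carr = bcolz.carray(sample, chunklen=chunklen, rootdir=rootdir, mode='w',
                            cparams=bcolz.cparams(clevel=clevel, cname=cname))
        carr.flush()
        sizes = []
        for root, dirnames, filenames in os.walk(rootdir):
            for filename in fnmatch.filter(filenames, '*.blp'):
                sizes.append(os.path.getsize(os.path.join(root, filename)))
        return np.array(sizes)
    finally:
        shutil.rmtree(tmp)

# pick the largest candidate whose chunk files still stay under 256 kB ~90% of the time
sample = np.random.randn(2000000)  # replace with a sample of a real column
for chunklen in (16384, 32768, 65536, 131072):
    sizes = chunk_file_sizes(sample, chunklen)
    print(chunklen, int(np.percentile(sizes, 90)), round((sizes < TARGET).mean(), 2))
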
Anyone have experience with this? Or else, interested in the results?

BR,

Carst

Carst Vaartjes

Feb 5, 2017, 1:08:51 PM
to bcolz
Some more investigation into the average sizing:

import fnmatch
import os

import pandas as pd

# collect the on-disk size of every Blosc chunk file (*.blp) under the bcolz root;
# in our layout the column directories of the float64 measures start with 'm',
# everything else holds int64 attribute references
sizes = []
for root, dirnames, filenames in os.walk('/srv/bcolz'):
    for filename in fnmatch.filter(filenames, '*.blp'):
        file_path = os.path.join(root, filename)
        # the .blp files live in <column>/data/, so [-2] is the column name
        if root.split('/')[-2][0:1] == 'm':
            dtype = 'float'
        else:
            dtype = 'int'
        sizes.append((dtype, os.path.getsize(file_path)))

df = pd.DataFrame(sizes)  # column 0: dtype label, column 1: file size in bytes

In [17]: df[df[0]=='int'][1].describe()
Out[17]: 
count    6.952276e+06
mean     5.161004e+04
std      2.174342e+05
min      4.000000e+01
25%      1.984000e+03
50%      4.012000e+03
75%      2.774000e+04
max      3.157280e+06
Name: 1, dtype: float64

About 7 million int64 chunk files, with a median size of around 4 kB (the mean, ~52 kB, is skewed by a few large files)

In [18]: df[df[0]=='float'][1].describe()
Out[18]: 
count    1.069894e+06
mean     1.757018e+05
std      4.178645e+05
min      4.000000e+01
25%      4.064000e+03
50%      1.001450e+05
75%      2.029160e+05
max      4.194336e+06
Name: 1, dtype: float64

About 1 million float64 chunk files, with a median size of around 100 kB (mean ~176 kB)

It's not strange that the integer columns are smaller: we fill them with attribute references, so they contain a lot more repetition than the floats, which are unique measure values.
Hmmm
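
For what it's worth, the effect is easy to reproduce in isolation; a quick comparison (with made-up data, not our actual columns) of the compression ratio bcolz reports for repetitive int64 references versus essentially unique float64 measures:

import numpy as np
import bcolz

# highly repetitive attribute references vs. effectively unique measure values
ints = np.random.randint(0, 1000, size=5000000).astype(np.int64)
floats = np.random.randn(5000000)

for label, data in (('int64 references', ints), ('float64 measures', floats)):
    carr = bcolz.carray(data, cparams=bcolz.cparams(clevel=5, cname='lz4'))
    print(label, 'compression ratio: %.1fx' % (carr.nbytes / float(carr.cbytes)))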

Carst Vaartjes

Feb 5, 2017, 5:50:11 PM
to bcolz


[two charts attached in the original post: chunklen vs. file size and compression ratio]
Left axis is the size (in bytes), the right axis is the compression ratio. The left chart is float64, the right is int64 (which has a lot smaller files, as you can see).

So larger chunklens make the compression ratio drop quite a bit, leading to an increase in chunk size but not in the total file size. NB: the compression shown is the compression of the entire ctable, not just the individual columns, but I tried to make it a real-life example.

It's rather unfortunate that the compression ratio drops so much, as that will not really give a performance increase (unlike the ideal of "high compression with 256 kB chunk files").
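
For anyone who wants to reproduce this, a rough sketch of the measurement behind the charts; the column data and chunklen candidates here are placeholders:

import numpy as np
import bcolz

sample = np.random.randn(2000000)  # stand-in for a real float64 measure column

for chunklen in (4096, 16384, 65536, 262144):
    carr = bcolz.carray(sample, chunklen=chunklen,
                        cparams=bcolz.cparams(clevel=5, cname='lz4'))
    ratio = carr.nbytes / float(carr.cbytes)
    # approximate average compressed chunk size (ignores the leftover buffer)
    avg_chunk = carr.cbytes / float(max(carr.nchunks, 1))
    print(chunklen, 'ratio %.2fx' % ratio, 'avg compressed chunk %d bytes' % int(avg_chunk))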

Alistair Miles

Feb 8, 2017, 2:45:05 PM
to bc...@googlegroups.com
Not sure I fully understand all the details here, but FWIW I'm surprised that compression ratio drops with increased chunklen. Chunk size should in theory not have much impact on compression ratio for 1D data (e.g., a column in a table). Block size should make a difference, with larger block sizes giving higher compression, as the compressor is given larger blocks of data to compress so can find more repetition/correlation. 
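
To separate the two knobs, here is a hedged sketch using python-blosc directly (assuming the python-blosc package is installed; bcolz normally lets Blosc pick the block size automatically) that shows the effect of the block size on compression ratio, independent of chunklen:

import numpy as np
import blosc

# repetitive int64 data, similar in spirit to the attribute reference columns
data = np.random.randint(0, 1000, size=4000000).astype(np.int64).tobytes()

for blocksize in (0, 32 * 1024, 256 * 1024, 1024 * 1024):  # 0 = automatic
    blosc.set_blocksize(blocksize)
    packed = blosc.compress(data, typesize=8, clevel=5, cname='lz4')
    print(blocksize, 'ratio %.2fx' % (len(data) / float(len(packed))))

blosc.set_blocksize(0)  # restore automatic block size selection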


--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Email: alim...@googlemail.com
Web: http://purl.org/net/aliman
Twitter: https://twitter.com/alimanfoo
Tel: +44 (0)1865 287721
