efficiency of query by using python thrift api


Xun

Dec 8, 2009, 2:36:30 PM
to Hypertable User
Hi guys!
Consider the following scenario:

$create table test(a);
# insert 20000 qualifiers for column 'a'
for i in range(20000):
    client.set_cell(mutator, Cell("py-k1", "a", str(time.time()), "py-v1"))
client.close_mutator(mutator, 1)

When I query all the qualifiers of column 'a', it takes about 30
seconds to return, no matter which Python Thrift function I use (get_cells()
or iterating with next_cells()).

$uname -a
Linux hadoop1 2.6.29.6-server-2mnb #1 SMP Sun Aug 16 23:47:22 EDT 2009 i686 Genuine Intel(R) CPU T2500 @ 2.00GHz GNU/Linux

Is this reasonable performance for Hypertable?
Thanks!

Luke

Dec 8, 2009, 2:58:05 PM
to hyperta...@googlegroups.com
No, it's not normal. What does your query look like? Have you tried
get_cells_as_arrays?

Xun

Dec 8, 2009, 8:41:57 PM
to hyperta...@googlegroups.com
Hi Luke,
Thanks for your suggestion!
Here is the test script:
import time
from hypertable.thriftclient import *
from hyperthrift.gen.ttypes import *

client = ThriftClient('192.168.6.246', 38080)
mutator = client.open_mutator("test", 0, 0)
for i in range(20000):
    client.set_cell(mutator, Cell("py-k1", "a", str(time.time()), "py-v1"))
client.flush_mutator(mutator)

# time a scan restricted to row 'py-k1', column 'a'
st = time.time()
ss = ScanSpec()
ci = CellInterval(start_row='py-k1', end_row='py-k1',
                  start_column='a', end_column='a')
ss.cell_intervals = [ci]
cells = client.get_cells_as_arrays('test', ss)
print len(cells)
et = time.time()
print 'time:%s' % str(et - st)

When I use get_cells_as_arrays(), the performance is better:

get_cells_as_arrays():
20000 cells
time:9.70299983025
40000 cells
time:19.4529998302

get_cells():
20000 cells
time:23.9219999313
40000 cells
time:47.4529998302
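
As a rough check on these numbers (a sketch, not part of the original script), dividing cells by elapsed time gives the throughput of each call. The dictionary keys and the `throughput` helper are just labels chosen for illustration.

```python
# Timings reported above: (cells, seconds) per Thrift call.
timings = {
    "get_cells_as_arrays": [(20000, 9.70299983025), (40000, 19.4529998302)],
    "get_cells": [(20000, 23.9219999313), (40000, 47.4529998302)],
}

def throughput(cells, seconds):
    """Cells returned per second of wall-clock time."""
    return cells / seconds

for name, runs in timings.items():
    for cells, seconds in runs:
        print("%-20s %6d cells  %8.1f cells/s"
              % (name, cells, throughput(cells, seconds)))
```

Both calls scale roughly linearly with the number of cells, and get_cells_as_arrays() is about 2.4 to 2.5 times faster than get_cells() in these runs.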

What do you think of the results this time?
Thanks!


Luke

Dec 8, 2009, 11:54:28 PM
to hyperta...@googlegroups.com
Yeah, the slowdown is mostly Python's object creation overhead.
Ruby has similar problems. That's why the *_as_arrays methods were
introduced. It still feels slow to me. What does 'top' say during the
query? Have you tried the hypertable shell to see the HQL
select speed? Is Hypertable compiled in Release mode?
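
To illustrate the kind of overhead being described, here is a minimal, self-contained sketch that does not touch Hypertable at all. It compares building one Python object per cell with building plain lists. `FakeCell` and the sample data are hypothetical stand-ins; real Thrift deserialization does more work than this, so the absolute numbers are not comparable to the timings in this thread.

```python
import timeit

class FakeCell(object):
    """Hypothetical stand-in for a Thrift-generated cell class."""
    def __init__(self, row, column_family, column_qualifier, value):
        self.row = row
        self.column_family = column_family
        self.column_qualifier = column_qualifier
        self.value = value

# 40000 fake records shaped like the test data in this thread.
RAW = [("py-k1", "a", str(i), "py-v1") for i in range(40000)]

def as_objects(raw):
    # One class instance (with attribute storage) per cell.
    return [FakeCell(*fields) for fields in raw]

def as_arrays(raw):
    # One plain list per cell, no per-cell instance construction.
    return [list(fields) for fields in raw]

if __name__ == "__main__":
    for fn in (as_objects, as_arrays):
        seconds = timeit.timeit(lambda: fn(RAW), number=10)
        print("%-12s %.3f s for 10 runs" % (fn.__name__, seconds))
```

How large the gap is depends on the Python version and machine, so it is worth running locally rather than assuming a particular ratio.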

Xun

Dec 9, 2009, 12:39:23 AM
to hyperta...@googlegroups.com
The HQL speed in the shell is 3653.10 cells/s, which is faster than the script query
(about 2050 cells/s).
"select * ... into file" is the fastest way (232173.43 cells/s), but it is not
what I want.

Luke, what do you suggest? Where do you think the bottleneck is:
the network, or the limits of the server's hardware?
Thank you :)

Here is the output:
hypertable> select * from test where row = "py-k1" display_timestamps into file "dump.tsv";

Elapsed time: 0.17 s
Avg value size: 5.00 bytes
Avg key size: 18.91 bytes
Throughput: 5550326.49 bytes/s
Total cells: 40000
Throughput: 232173.43 cells/s

hypertable> select * from test where row = "py-k1" display_timestamps;
Elapsed time: 10.95 s
Avg value size: 5.00 bytes
Avg key size: 18.91 bytes
Throughput: 87330.90 bytes/s
Total cells: 40000
Throughput: 3653.10 cells/s

My Hypertable is the binary package hypertable-0.9.2.7-linux-i386, running in
local mode.
While the query script runs, top on the server shows:

Tasks: 100 total, 1 running, 99 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.5%sy, 0.0%ni, 99.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4063796k total, 4042852k used, 20944k free, 24184k buffers
Swap: 4192924k total, 16908k used, 4176016k free, 1155340k cached

  PID USER      PR  NI  RES   VIRT  SHR S %CPU %MEM     TIME+ COMMAND
 3822 zhanghon  20   0  2.6g 3020m 1944 S    0 65.9 274:33.18 Hypertable.Rang
 3686 zhanghon  20   0   16m  226m  720 S    0  0.4   9:17.92 localBroker
 3685 zhanghon  20   0  4148  195m 1264 S    0  0.1   4:15.08 Hyperspace.Mast
 3769 zhanghon  20   0  7044  223m 1296 S    0  0.2   0:29.39 Hypertable.Mast
21344 zhanghon  20   0   19m 70124 1796 S    0  0.5   0:25.06 ThriftBroker
14084 zhanghon  20   0  1188 12624  896 R    0  0.0   0:00.37 top
 8412 zhanghon  20   0  1412 49792  824 S    0  0.0   0:00.16 sshd
 8413 zhanghon  20   0  2116 12960 1212 S    0  0.1   0:00.04 bash


Sanjit Jhala

Dec 9, 2009, 1:54:37 AM
to hyperta...@googlegroups.com
Hi Xun,

Can you try running the HQL query in your script via the hql_query API?
Something like:
res = client.hql_query('select * from test where row = "py-k1" display_timestamps into file "dump.tsv"')

That should help in estimating the Thrift overhead irrespective of any
other object creation inefficiencies.
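
One way to make such comparisons repeatable is a small timing wrapper around each client call. This is only a sketch: `timed` is a hypothetical helper, and `fake_query` stands in for a real call such as client.hql_query(...) so the example runs without a server.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start

def fake_query(hql):
    # Hypothetical stand-in for a Thrift client call; returns dummy rows.
    return [["py-k1", "a", str(i), "py-v1"] for i in range(1000)]

result, elapsed = timed(fake_query, 'select * from test where row = "py-k1"')
print("cells: %d  time: %.6f s" % (len(result), elapsed))
```

With a real connection, the same wrapper could presumably be used as timed(client.hql_query, hql), assuming a client object like the one in the script earlier in the thread.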

-Sanjit

Xun

Dec 9, 2009, 3:59:52 AM
to hyperta...@googlegroups.com
Hi Sanjit Jhala,
Here is the output of my client side script:

HqlResult(mutator=None, cells=None, results=None, scanner=None)
time:0.155999898911

I think this is no different from running the HQL in the shell, because in both cases the tsv
file is generated on the server side, with no network communication between server
and client.

I ran the following script too:
res = client.hql_query('select * from test where row = "py-k1" display_timestamps')

time:48.6399998665
This is actually slower than get_cells_as_arrays(), because it creates
40000 Cell objects.

Sorry for my poor English :), and thanks to all of you for your
patient help!


Luke

Dec 9, 2009, 1:14:11 PM
to hyperta...@googlegroups.com
Can you also try hql_query2, which returns an HqlResult2 that contains
the cells as arrays?

Xun

Dec 9, 2009, 8:24:44 PM
to hyperta...@googlegroups.com
Luke, here is the result:

res = client.hql_exec2(r'select * from test where row = "py-k1" display_timestamps', 0, 0)
print len(res.cells)
40000
time:19.7969999313
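
Putting the 40000-cell measurements from the whole thread side by side gives a rough picture of where each call stands. This is a sketch: the labels are chosen here for illustration, and the two shell times are the rounded "Elapsed time" values, so the derived rates differ slightly from the throughput the shell printed.

```python
CELLS = 40000

# Elapsed seconds for returning 40000 cells, as reported in this thread.
elapsed = {
    "shell select into file": 0.17,
    "shell select": 10.95,
    "get_cells_as_arrays": 19.4529998302,
    "hql_exec2": 19.7969999313,
    "get_cells": 47.4529998302,
    "hql_query": 48.6399998665,
}

def rate(name):
    """Cells per second for the named measurement."""
    return CELLS / elapsed[name]

def slowdown(name, baseline="shell select"):
    """How many times longer than the baseline the named call took."""
    return elapsed[name] / elapsed[baseline]

for name in sorted(elapsed, key=elapsed.get):
    print("%-24s %10.1f cells/s  %6.2fx shell select"
          % (name, rate(name), slowdown(name)))
```

In these runs the array-returning calls (get_cells_as_arrays, hql_exec2) take a little under twice the shell's time, while the Cell-object calls (get_cells, hql_query) take over four times as long.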

