Re: Very slow PyStarDict

Serge Matveenko

Dec 18, 2008, 3:37:27 AM

to cocobear, pysta...@googlegroups.com

On Thu, Dec 18, 2008 at 10:54 AM, cocobear <cocob...@gmail.com> wrote:
> Hi,
> I found that it took more then 13 seconds to load a
> dictionary:
> -rw-r--r-- 1 root root 10651674 2003-11-14 langdao-ec-gb.idx

Hello!

Thank you for your message.

And yes, a huge uncompressed dictionary may well take a very long time
to load at the moment, because this version loads it into memory record
by record.
Please try gzipping your .idx file and loading the same dictionary again.

I'm adding .idx loading optimization to my v0.4 to-do list.
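
For readers following along, here is a minimal sketch of what loading an .idx file "record by record" can look like. It assumes the standard StarDict .idx layout (a NUL-terminated UTF-8 word followed by two 32-bit big-endian integers, the data offset and the data size) and Python 3. The function names read_idx_bytes and iter_idx_records are hypothetical and are not PyStarDict's API.

```python
import gzip
import struct


def read_idx_bytes(path):
    """Read a whole .idx file into memory, gzip-compressed or not."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    # gzip streams start with the two bytes 0x1f 0x8b
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as fh:
        return fh.read()


def iter_idx_records(buf):
    """Yield (word, offset, size) for each record in an .idx byte buffer."""
    pos = 0
    while pos < len(buf):
        end = buf.index(b"\x00", pos)  # position of the word terminator
        word = buf[pos:end].decode("utf-8")
        # Assumed layout: two 32-bit big-endian integers follow the NUL.
        offset, size = struct.unpack_from(">II", buf, end + 1)
        yield word, offset, size
        pos = end + 1 + 8
```

Because the whole file is read in one call either way, gzip mainly changes how many bytes come off the disk; the per-record parsing cost stays the same.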

--
Serge Matveenko
mailto:se...@matveenko.ru
http://serge.matveenko.ru/

cocobear

Dec 18, 2008, 4:21:51 AM

to s...@matveenko.ru, se...@matveenko.ru, pysta...@googlegroups.com

On Thu, 18 Dec 2008 11:37:27 +0300
"Serge Matveenko" <se...@matveenko.ru> wrote:

> On Thu, Dec 18, 2008 at 10:54 AM, cocobear <cocob...@gmail.com>
> wrote:
> > Hi,
> > I found that it took more then 13 seconds to load a
> > dictionary:
> > -rw-r--r-- 1 root root 10651674 2003-11-14 langdao-ec-gb.idx
>
> Hello!
>
> Thank you for answer.
>
> And yes, it might be that huge uncompressed dictionary is loading very
> long time now, because of it is loading in memory record by record in
> this version.
> Please, try to gzip your .idx file and load the same dictionary again.
>

I tried this, but it takes the same time as with the uncompressed file.

> I'm adding .idx loading optimization in my v0.4 to do list.
>
>

I think it's very IMPORTANT; no one wants to wait 12 seconds to look up
a word.

It took only 0.017s in sdcv (http://sdcv.sourceforge.net).
I think we should make "lookup" finish within 1 second.

Serge Matveenko

Dec 18, 2008, 5:03:54 AM

to pysta...@googlegroups.com

On Thu, Dec 18, 2008 at 12:21 PM, cocobear <cocob...@gmail.com> wrote:
> I think it's very IMPORTANT, no one want to look up a word in 12
> seconds.

I agree.

But it is important to understand that loading the dictionary index and
looking up a word are two different operations.
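
A minimal sketch of timing the two operations separately, so that a one-time load cost is not mistaken for per-lookup latency. The toy build_index function stands in for the real index loader and is purely hypothetical; only time.perf_counter is a real API here (Python 3).

```python
import time


def timed(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def build_index(words):
    """Stand-in for the expensive one-time load step (hypothetical)."""
    return {word: position for position, word in enumerate(words)}


words = ["word%d" % i for i in range(100000)]
index, load_seconds = timed(build_index, words)
position, lookup_seconds = timed(index.get, "word99999")
print("load:   %.6f s" % load_seconds)
print("lookup: %.6f s" % lookup_seconds)
```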

It would be great if you could modify demo.py from the examples to run it
with your dictionary and then post the results here.

Thank you for your help!

Serge Matveenko

Dec 18, 2008, 6:06:23 AM

to pysta...@googlegroups.com

On Thu, Dec 18, 2008 at 11:37 AM, Serge Matveenko <se...@matveenko.ru> wrote:
> I'm adding .idx loading optimization in my v0.4 to do list.

OK, after profiling I've found that most of the time is spent
unpacking data from records.

I'm going to rewrite some code to make one unpack call per record instead
of three. This rewrite will also make it possible to drop some lists.
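
To illustrate the idea, here is a sketch of replacing three struct calls with one. The thread does not say which three fields were unpacked separately, so the split into word, offset and size below is an assumption based on the StarDict .idx record layout, and both function names are hypothetical (Python 3).

```python
import struct


def unpack_three(buf, pos, word_len):
    """Three separate calls: word, then offset, then size."""
    (word,) = struct.unpack_from(">%ds" % word_len, buf, pos)
    (offset,) = struct.unpack_from(">I", buf, pos + word_len + 1)
    (size,) = struct.unpack_from(">I", buf, pos + word_len + 5)
    return word, offset, size


def unpack_one(buf, pos, word_len):
    """One call for the whole record; 'x' skips the terminating NUL byte."""
    return struct.unpack_from(">%dsxII" % word_len, buf, pos)
```

Both functions return the same tuple; the second simply crosses the Python-to-C boundary once per record instead of three times.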

Then we could use NumPy's array interface
http://numpy.scipy.org/array_interface.shtml instead of the unpack method
from the struct module, as Alexey Smirnov advised me.

Serge Matveenko

Dec 19, 2008, 6:54:58 PM

to pysta...@googlegroups.com

On Thu, Dec 18, 2008 at 2:06 PM, Serge Matveenko <se...@matveenko.ru> wrote:
> On Thu, Dec 18, 2008 at 11:37 AM, Serge Matveenko <se...@matveenko.ru> wrote:
>> I'm adding .idx loading optimization in my v0.4 to do list.
>
> Ok, i've found after profiling that the longest time is needed for
> unpacking data from records
>
> I'm going to rewrite some code to making 1 unpack for record instead
> of three. Also this rewrite will affect dropping some lists.

Rewritten. Anyone can check out the 'speedup' tag.
We now have one big unpack instead of three small ones.
I got a speedup from 4 seconds to 2.7 seconds on my PC.

> Than we could use NumPy's array interface
> http://numpy.scipy.org/array_interface.shtml instead of unpack method
> from struct module as i was advised by Alexey Smirnov.

I will look at it later.

cocobear

Dec 21, 2008, 8:43:03 PM

to pysta...@googlegroups.com, se...@matveenko.ru

On Sat, 20 Dec 2008 02:54:58 +0300
"Serge Matveenko" <se...@matveenko.ru> wrote:

>

> On Thu, Dec 18, 2008 at 2:06 PM, Serge Matveenko <se...@matveenko.ru>
> wrote:
> > On Thu, Dec 18, 2008 at 11:37 AM, Serge Matveenko
> > <se...@matveenko.ru> wrote:
> >> I'm adding .idx loading optimization in my v0.4 to do list.
> >
> > Ok, i've found after profiling that the longest time is needed for
> > unpacking data from records
> >
> > I'm going to rewrite some code to making 1 unpack for record instead
> > of three. Also this rewrite will affect dropping some lists.
>
> rewrited. anyone can checkout 'speedup' tag.
> we have now one big unpack instead of three small
> i've got speedup from 4 seconds to 2.7 seconds on my PC
>

1 dicts load: 0:00:10.023205
(5887265, 74)
1 cords getters: 0:00:00.000241
*[hi:]
pron. 他
n. 男孩, 男人, 雄性动物
【医】氦(2号元素)
1 direct data getters (w'out cache): 0:00:00.114292
*[hi:]
pron. 他
n. 男孩, 男人, 雄性动物
【医】氦(2号元素)
1 high level data getters (not cached): 0:00:00.113353
*[hi:]
pron. 他
n. 男孩, 男人, 雄性动物
【医】氦(2号元素)
1 high level data getters (cached): 0:00:00.000213

About 2 seconds on my PC.

Serge Matveenko

Dec 22, 2008, 3:36:39 AM

to pysta...@googlegroups.com

2008/12/22 cocobear <cocob...@gmail.com>:

> On Sat, 20 Dec 2008 02:54:58 +0300
> "Serge Matveenko" <se...@matveenko.ru> wrote:

>> rewrited. anyone can checkout 'speedup' tag.
>> we have now one big unpack instead of three small
>> i've got speedup from 4 seconds to 2.7 seconds on my PC
>>
>
> 1 dicts load: 0:00:10.023205
> (5887265, 74)
> 1 cords getters: 0:00:00.000241
> *[hi:]
> pron. 他
> n. 男孩, 男人, 雄性动物
> 【医】氦(2号元素)
> 1 direct data getters (w'out cache): 0:00:00.114292
> *[hi:]
> pron. 他
> n. 男孩, 男人, 雄性动物
> 【医】氦(2号元素)
> 1 high level data getters (not cached): 0:00:00.113353
> *[hi:]
> pron. 他
> n. 男孩, 男人, 雄性动物
> 【医】氦(2号元素)
> 1 high level data getters (cached): 0:00:00.000213
>
> About 2 seconds on my PC.

Thank you for another test.
It looks unbelievably slow, especially on your really fast configuration.
I will write a test script that runs under a profiler to look deeper into
the problem.
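
Such a test script could be as small as the sketch below, which uses the standard cProfile and pstats modules. The helper name profile_call is hypothetical, and the function to profile (for example, whatever loads the dictionary) is left to the caller (Python 3).

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

Printing the returned report after wrapping the dictionary load in profile_call shows which functions dominate the load time.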

However, 10 seconds is a fair result for loading such a large amount of
data with variable-size fields into memory.

cocobear

Dec 24, 2008, 12:58:42 AM

to pysta...@googlegroups.com, se...@matveenko.ru

I think reading one byte at a time is wrong; even if I do nothing else,
this takes 5 seconds to finish when reading a large dictionary.

There is probably a better way.

Serge Matveenko

Dec 24, 2008, 4:31:09 AM

to pysta...@googlegroups.com

On Wed, Dec 24, 2008 at 8:58 AM, cocobear <cocob...@gmail.com> wrote:
> I think read one byte by a time is wrong, even I did nothing, this will
> take 5seconds to finished when reading a large dictionary.

There is no byte-by-byte file reading.
There is only byte-by-byte parsing, directly in memory, of the whole file,
which is read at once into a byte buffer.

> Probably there is a better way.

I will be glad to know it.
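
One candidate, offered as a sketch rather than as what either poster went on to implement: let a single C-level search find each word terminator instead of testing every byte in a Python loop. Both function names are hypothetical, and the NUL terminator is an assumption based on the StarDict .idx layout (Python 3).

```python
def word_end_bytewise(buf, pos):
    """Find the NUL terminator by examining one byte per loop iteration."""
    while buf[pos] != 0:
        pos += 1
    return pos


def word_end_find(buf, pos):
    """Find the same terminator with one call to bytes.find."""
    return buf.find(b"\x00", pos)
```

Both return the same index for well-formed input, so timing the two variants on a real .idx buffer (for instance with the timeit module) shows how much of the load time the per-byte loop accounts for.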

cocobear

Dec 24, 2008, 4:46:36 AM

to pysta...@googlegroups.com, se...@matveenko.ru

On Wed, 24 Dec 2008 12:31:09 +0300
"Serge Matveenko" <se...@matveenko.ru> wrote:

>
> On Wed, Dec 24, 2008 at 8:58 AM, cocobear <cocob...@gmail.com>
> wrote:
> > I think read one byte by a time is wrong, even I did nothing, this
> > will take 5seconds to finished when reading a large dictionary.
>
> there is no byte by byte file reading
> there is only byte by byte parsing directly in memory of whole file
> read at once into byte buffer
>

That's what I meant, but I didn't express it precisely.

> > Probably there is a better way.
>
> i willl be glad to know it
>

I'm working on it.

>
