Sampling elements from a vector space

Gerli Viikmaa

May 11, 2014, 2:36:54 PM
to sage-s...@googlegroups.com
Hi,

I am trying to analyse sets of random vectors from the vector space GF(4)^36. For that I need to sample rather large sets (up to 10^6 vectors) many times (ideally 1000 times, depending on how fast I can analyse them).

I first thought my analysis was slow on such large sets but apparently just generating the sets takes an incredibly long time!

What is the preferred method for efficiently generating sets of random vectors?

Setup:
space = VectorSpace(GF(4, 'a'), 36)
n = 10^6


I have tried the following methods:

sample(space, n)
gives me
OverflowError: Python int too large to convert to C long

An attempt to sample indices and then ask for space[i] instead:
sample(range(4^36), n)
also results in
OverflowError: range() result has too many items

Trying to use space.random_element():
First I tried to get unique samples:
sequences = []
while len(sequences) < n:
    elem = space.random_element()
    if elem not in sequences:
        sequences.append(elem)
but this takes forever (and is impossible to interrupt). I let it run for several minutes and realized this was not going to work. I don't know how long it would actually take. I cannot use a set since the vectors are mutable (although I must admit I haven't tried making them immutable to see whether a set works better).

Then I decided not to care about the uniqueness:
%time sequences=[space.random_element() for __ in range(n)]
Best so far (in that it actually gives me a result). This takes about 60 seconds on my computer (based on a couple of runs). Using xrange didn't affect the time.

Is it possible to improve this time? And I would prefer it if the set didn't contain any duplicates. If generating the dataset takes a whole minute, it would take me over two weeks just to generate 1000 datasets of this size...

Thanks in advance,
Gerli

Dima Pasechnik

May 11, 2014, 3:23:01 PM
to sage-s...@googlegroups.com
Yes, you should make them immutable; see below for how.

>
> Then I decided not to care about the uniqueness:
> %time sequences=[space.random_element() for __ in range(n)]
> Best so far (in that it actually gives me a result). This takes about *60
> seconds* on my computer (based on a couple runs). Using xrange didn't
> affect the time.
>
> Is it possible to improve this time? And I would prefer it if the set
> didn't contain any duplicates.
just do the following:

sequences = [space.random_element() for __ in range(n)]
for i in sequences:
    i.set_immutable()
seq_noreps = set(sequences)  # now there are no repetitions
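For illustration, here is the same hashability trick in plain Python, with tuples standing in for immutable Sage vectors (random.randrange(4) is just a stand-in for drawing a GF(4) coordinate; the names here are hypothetical):

```python
import random

n = 1000
length = 36
# stand-in for space.random_element(): a random length-36 tuple over {0,1,2,3}
sequences = [tuple(random.randrange(4) for _ in range(length)) for _ in range(n)]
# tuples are hashable, so deduplication via a set is fast
seq_noreps = set(sequences)
print(len(seq_noreps) <= n)  # True
```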

HTH,
Dima

Gerli Viikmaa

May 11, 2014, 3:29:56 PM
to sage-s...@googlegroups.com
Hi,

Thank you for your reply.

This doesn't improve the time at all (which is my main concern). I am
still looking for a different way of generating this data, something
that would be at least 5-10 times faster than my proposed way.

Gerli

Dima Pasechnik

May 11, 2014, 4:34:58 PM
to sage-s...@googlegroups.com
On 2014-05-11, Gerli Viikmaa <gerliv...@gmail.com> wrote:
> Hi,
>
> Thank you for your reply.
>
> This doesn't improve the time at all (which is my main concern). I am
> still looking for a different way of generating this data, something
> that would be at least 5-10 times faster than my proposed way.

the problem is apparently the slow creation of elements of the space,
not the process of generating random elements.
Here are results of profiling:

sage: %prun sequences=[space.random_element() for __ in range(50000)]
6050003 function calls in 393.762 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)
50000 386.048 0.008 386.185 0.008 free_module.py:1903(zero_vector)
50000 4.533 0.000 393.574 0.008 free_module.py:4609(random_element)
1800000 1.092 0.000 1.755 0.000 finite_field_givaro.py:208(random_element)
1800000 0.663 0.000 0.663 0.000 {method 'random_element' of 'sage.rings.finite_rings.element_givaro.Cache_gi}
1800000 0.391 0.000 0.391 0.000 {method 'random' of '_random.Random' objects}
50000 0.380 0.000 386.742 0.008 free_module.py:5012(__call__)
...... some entries, not taking much time at all, deleted...

If you look at the random_element() code in sage/modules/free_module.py
(starting at line 4609, see e.g.
https://github.com/sagemath/sage/blob/master/src/sage/modules/free_module.py)

then you see that it calls self(0), i.e. zero_vector(),
and this is what eats up almost all the time.

It will be much faster to allocate a zero matrix
(zero_matrix(GF(4, 'a'), 36, 10^6))
and fill it in, basically using the random_element() code,
adapted appropriately.

HTH,
Dima

Gerli Viikmaa

May 12, 2014, 4:56:14 AM
to sage-s...@googlegroups.com
Thank you - this has sped up the dataset creation by a factor of 2.

This is the resulting function:

def generate_data(field, length, n):
    data = zero_matrix(field, n, length)
    for i in range(n):
        for j in range(length):
            data[i,j] = field.random_element()
    return data

Here the vectors are rows, so I can call sequences[i] to get the ith
vector.

%time sequences = generate_data(GF(4,'a'), 36, 10^6)
CPU times: user 26.28 s, sys: 0.56 s, total: 26.84 s
Wall time: 26.93 s

compared to

space = VectorSpace(GF(4,'a'), 36)
%time sequences=[space.random_element() for __ in range(10^6)]
CPU times: user 58.41 s, sys: 0.20 s, total: 58.60 s
Wall time: 58.75 s

I do wonder if the filling up can be done even better...

Gerli

John Cremona

May 12, 2014, 5:01:34 AM
to SAGE support
On 12 May 2014 09:56, Gerli Viikmaa <gerliv...@gmail.com> wrote:
> Thank you - this has sped up the dataset creation 2 times.
>
> This is the resulting function:
>
> def generate_data(field, length, n):
>     data = zero_matrix(field, n, length)
>     for i in range(n):
>         for j in range(length):
>             data[i,j] = field.random_element()
>     return data
>
> Here the vectors are rows, so I can call sequences[i] to get the ith vector.
>
> %time sequences = generate_data(GF(4,'a'), 36, 10^6)
> CPU times: user 26.28 s, sys: 0.56 s, total: 26.84 s
> Wall time: 26.93 s
>
> compared to
>
> space = VectorSpace(GF(4,'a'), 36)
> %time sequences=[space.random_element() for __ in range(10^6)]
> CPU times: user 58.41 s, sys: 0.20 s, total: 58.60 s
> Wall time: 58.75 s
>
> I do wonder if the filling up can be done even better...

To save memory, though not necessarily time, do not create the whole n-row matrix.  Instead, create and return one random vector using "yield", with a count so you can stop the iteration when n vectors have been returned.  In other words, make this into an iterator function.
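A minimal pure-Python sketch of that iterator pattern (random.randrange stands in for Sage's field.random_element(); in Sage you would yield vectors over GF(4) instead of tuples):

```python
import random

def random_vectors(length, n, q=4):
    """Yield n random length-`length` tuples over {0, ..., q-1}, one at a time."""
    for _ in range(n):
        yield tuple(random.randrange(q) for _ in range(length))

# Only one vector needs to be in memory at a time:
count = sum(1 for v in random_vectors(36, 1000))
print(count)  # 1000
```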

John Cremona
 

Gerli
--
You received this message because you are subscribed to the Google Groups "sage-support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sage-support+unsubscribe@googlegroups.com.
To post to this group, send email to sage-s...@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-support.
For more options, visit https://groups.google.com/d/optout.

Martin Albrecht

May 12, 2014, 5:12:53 AM
to sage-s...@googlegroups.com
Why not:

%time A = random_matrix(GF(4,'a'), 36, 10^6)
CPU times: user 568 ms, sys: 12 ms, total: 580 ms
Wall time: 578 ms

However, getting the rows out takes ages. The reason is that vectors over
GF(4) are generic, i.e. no one has sat down and written a simple class which
implements these vectors as matrices with one row:

sage: A = matrix(GF(4,'a'),10, 10)
sage: type(A)
<type 'sage.matrix.matrix_mod2e_dense.Matrix_mod2e_dense'>

sage: v = vector(GF(4,'a'),10)
sage: type(v) # generic type
<type 'sage.modules.free_module_element.FreeModuleElement_generic_dense'>


Compare that with GF(2):

sage: %time A = random_matrix(GF(2), 36, 10^6)
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 16 ms
sage: %time V = A.rows()
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.55 ms

sage: A = matrix(GF(2),10, 10)
sage: type(A)
<type 'sage.matrix.matrix_mod2_dense.Matrix_mod2_dense'>

sage: v = vector(GF(2),10)
sage: type(v) # specialised type
<type 'sage.modules.vector_mod2_dense.Vector_mod2_dense'>

So I'm afraid the answer is: "implement it and send a patch" :)

Gerli Viikmaa

May 12, 2014, 5:41:11 AM
to sage-s...@googlegroups.com
Hmm, I didn't know about the random_matrix command.

So far it works very well for me:
%time sequences = random_matrix(GF(4,'a'), 10^6, 36) # notice the order
CPU times: user 0.79 s, sys: 0.00 s, total: 0.79 s
Wall time: 0.77 s

%time sequences[0]
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
(0, a + 1, a, a, a, a, 0, a, a, a, 0, a, 1, a, a + 1, a, 1, a, 0, a, 0, a, a + 1, 1, a, 1, a + 1, 0, 0, a, 0, a + 1, a, 1, a + 1, 0)

This part is slow, though, yes:
%time V = sequences.rows()
CPU times: user 38.08 s, sys: 0.42 s, total: 38.50 s
Wall time: 38.63 s

And I did manage to fill up my memory when creating the transposed matrix (as you suggested) and printing out the first row (containing a million elements).

Too bad I'm not working in GF(2)...

I believe I can work with this now. Thank you all for your help.

Gerli

Dima Pasechnik

May 12, 2014, 7:45:38 AM
to sage-s...@googlegroups.com
On 2014-05-12, Gerli Viikmaa <gerliv...@gmail.com> wrote:
> Thank you - this has sped up the dataset creation 2 times.
>
> This is the resulting function:
>
> def generate_data(field, length, n):
>     data = zero_matrix(field, n, length)
>     for i in range(n):
>         for j in range(length):
>             data[i,j] = field.random_element()
>     return data
>
> Here the vectors are rows, so I can call sequences[i] to get the ith
> vector.
>
> %time sequences = generate_data(GF(4,'a'), 36, 10^6)
> CPU times: user 26.28 s, sys: 0.56 s, total: 26.84 s
> Wall time: 26.93 s
>
> compared to
>
> space = VectorSpace(GF(4,'a'), 36)
> %time sequences=[space.random_element() for __ in range(10^6)]
> CPU times: user 58.41 s, sys: 0.20 s, total: 58.60 s
> Wall time: 58.75 s
>
> I do wonder if the filling up can be done even better...

you can get another 30-50% or so of speedup if you replace
field.random_element() with field._cache.random_element()
in your code.
This is because finite_field_givaro.py:208(random_element)
is a Python function which just calls _cache.random_element()
and does nothing else.
(again, use %prun to see where the bottleneck in your code is).
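To see why bypassing a thin wrapper helps, here is a pure-Python illustration of the overhead described above (Field and Cache are stand-ins for illustration, not the actual Sage classes):

```python
import timeit

class Cache:
    def random_element(self):
        return 0

class Field:
    def __init__(self):
        self._cache = Cache()

    def random_element(self):
        # thin wrapper: one extra Python call per element, pure overhead
        return self._cache.random_element()

f = Field()
# the two calls return the same thing; the direct one skips a stack frame
assert f.random_element() == f._cache.random_element()
t_wrapped = timeit.timeit(f.random_element, number=100_000)
t_direct = timeit.timeit(f._cache.random_element, number=100_000)
print(t_direct < t_wrapped)  # usually True: the direct call is cheaper
```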

You might then also try using Cython.
Finally, I don't know how these random elements are found; it could be a
generic procedure that does not do too well for small fields.
It is done by Givaro (cf http://givaro.forge.imag.fr/givaro-html/)
so you might want to dig into it a bit.
But probably you'd then need to code in Cython or in C/C++ to
get an improvement...