On Thursday, September 19, 2013 04:45:39 AM
program...@gmail.com wrote:
> data on the order of 10^6 rows by 10^3 columns (float), stored as
> matrices. Algebra, covariance, statistical tests with your own code. Sums,
> sqrt, and conditional moments of the input...
>
> Export a summary table (for example, a 10^3 x 10^3 matrix)
> Paul
I'm not exactly sure what you're asking, but in general a memory-mapped array
acts just like a regular array; any algorithm written for an Array should (in
principle) work. So I recommend just starting to try to do whatever it is
you're interested in doing, and see how it goes. One recommendation would be
to start with files that are not huge, so that each test run doesn't take too
long to complete.
There are two likely gotchas:
1. By and large, if the outputs are also big, you'll want to use algorithms
with "pre-allocated outputs," where the output is allocated with mmap_array.
That way you don't need the memory to compute the entire thing in one shot.
For example, you should be able to do
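    C = A*B
this way:
    # inputs: existing binary files holding Float64 data
    sA = open("matrixA.bin")
    A = mmap_array(Float64, (m,k), sA)
    sB = open("matrixB.bin")
    B = mmap_array(Float64, (k,n), sB)
    # output: a new file, mapped so the result goes to disk rather than RAM
    sC = open("result.bin", "w+")
    C = mmap_array(Float64, (m,n), sC)
    A_mul_B(C, A, B)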
The key part is that A_mul_B is a function, built into Julia, that computes
A*B using a pre-allocated output C. It should automatically write the disk file
as it goes; you don't need to do any real work to achieve this.
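For anyone trying this on a current Julia release, here is a small self-contained sketch of the same pattern. It assumes the present-day names rather than the ones used above: Mmap.mmap from the Mmap standard library, mul! from LinearAlgebra, and Mmap.sync! to flush the result. The tiny sizes, the temporary directory, and the random test inputs are only there so the experiment runs quickly; they are not part of the original suggestion.

```julia
using Mmap, LinearAlgebra

m, k, n = 200, 50, 30              # deliberately small (illustrative sizes)
dir = mktempdir()                  # scratch directory for the test files

# Create two input files holding raw Float64 data in column-major order.
A0, B0 = rand(m, k), rand(k, n)
write(joinpath(dir, "matrixA.bin"), A0)
write(joinpath(dir, "matrixB.bin"), B0)

# Map the inputs read-only and the output read-write.
sA = open(joinpath(dir, "matrixA.bin"))
A = Mmap.mmap(sA, Matrix{Float64}, (m, k))
sB = open(joinpath(dir, "matrixB.bin"))
B = Mmap.mmap(sB, Matrix{Float64}, (k, n))
sC = open(joinpath(dir, "result.bin"), "w+")
C = Mmap.mmap(sC, Matrix{Float64}, (m, n))   # the file is grown to fit m*n Float64s

mul!(C, A, B)      # in-place product, written straight into the mapped output
Mmap.sync!(C)      # make sure the result has been flushed to result.bin
close(sA); close(sB); close(sC)
```

Once this works at small sizes, the only things that should need to change for the real data are the file names and the dimensions.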
2. When using mmapped arrays, you need to pay particular attention to "cache-
efficiency," because disk<->RAM is even slower than RAM<->cache. Many of the
algorithms in Julia are cache-efficient already, but there are also many that
are not. You'll have to experiment to find the bottlenecks, and when you find
them you're encouraged to contribute improvements to Julia and various
packages.
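To make the cache-efficiency point concrete, here is a rough sketch of two ways to compute column sums. Julia arrays (mapped or not) are stored column-major, so the first index is the one that is contiguous in memory or on disk. The function names colsums! and colsums_strided! are made up for this illustration; they are not from Julia or any package.

```julia
# Sum each column of X into the pre-allocated vector s.
# The inner loop runs over the first (row) index, so the data are read
# sequentially, one column at a time.
function colsums!(s::AbstractVector, X::AbstractMatrix)
    length(s) == size(X, 2) || throw(DimensionMismatch("length(s) must equal size(X, 2)"))
    for j in 1:size(X, 2)
        acc = zero(eltype(s))
        for i in 1:size(X, 1)
            acc += X[i, j]
        end
        s[j] = acc
    end
    return s
end

# Same result, but the inner loop runs over the column index, so consecutive
# accesses are size(X, 1) elements apart. For a large mapped array that means
# jumping to a different part of the file at almost every step.
function colsums_strided!(s::AbstractVector, X::AbstractMatrix)
    length(s) == size(X, 2) || throw(DimensionMismatch("length(s) must equal size(X, 2)"))
    fill!(s, zero(eltype(s)))
    for i in 1:size(X, 1), j in 1:size(X, 2)   # j is the inner loop here
        s[j] += X[i, j]
    end
    return s
end

X = rand(1000, 50)                       # small stand-in for a mapped matrix
s = colsums!(zeros(size(X, 2)), X)
```

Both versions give the same answer; timing them on a matrix that does not fit in RAM is a quick way to see how much the access order matters for your data.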
For the kinds of things you're describing, Dahua's excellent
https://github.com/lindahua/NumericExtensions.jl
is a collection of some cache-efficient algorithms. There are many others in
other packages (e.g., just in the last 24 hours I rewrote the Gaussian
filtering algorithm in Images to be more cache-efficient, with a 5x speed
improvement in real-world tests even when working from RAM).
But that's about all there is to it: memory-mapping makes everything much
easier when you're dealing with big data sets. So just start writing your
algorithms and see how it goes.
Best,
--Tim
>
> On Thursday, September 19, 2013 12:20:15 UTC+2, Tim Holy wrote:
> > On Wednesday, September 18, 2013 01:26:41 PM program...@gmail.com wrote: