big matrices in Julia


Charles Novaes de Santana

Aug 13, 2015, 6:26:55 AM
to julia...@googlegroups.com
Hi all,

Do you recommend a way to work with big matrices in Julia? By "big" I mean a 65600 x 65600 symmetric matrix (the upper triangular part is equal to the lower triangular one).

I am studying the distances between sites in maps of different resolutions. For low resolution we have few sites, and for high resolution we have more sites (S).

For a small number of sites (small matrices) I was doing something like this:

S = 100;#number of sites
M = zeros(S,S);
for i in 1:(S-1)
   for j in (i+1):S
      M[i,j] = dist(i,j);#where dist(i,j) is the distance between sites i and j
   end
end

However, for big matrices I get the following message:

S=65600;
M = zeros(S,S);
ERROR: OutOfMemoryError()
 in call at essentials.jl:201
 in zeros at array.jl:233

I am using Julia Version 0.4.0-dev+5920 on Ubuntu 14.04.

Thanks for any tip!

Best,

Charles
--
Um axé! :)

--
Charles Novaes de Santana, PhD
http://www.imedea.uib-csic.es/~charles

Andreas Lobinger

Aug 13, 2015, 7:27:17 AM
to julia-users
If I read your example correctly, you are asking for a 65600x65600 matrix of Float64, which is roughly 32GB. My first-order assumption is that Julia asks the underlying OS for this memory (btw: as a single block). It would be more interesting to know what your OS actually supports as contiguous memory allocation (read as: Julia cannot do magic on your computer). You could think about using something shorter than Float64, or build some infrastructure to put your whole upper triangle into a vector and deal with the indices yourself.
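For what it's worth, a rough sketch of the "upper triangle in a vector" idea, reusing S and dist(i, j) from the original post (the helper name upidx and the choice of Float32 are mine):

S = 65600
n = div(S * (S - 1), 2)            # number of strictly upper-triangular entries (~2.15 billion)
U = zeros(Float32, n)              # Float32 instead of Float64 halves the memory again (~8.6 GB)

# position of entry (i, j), with i < j, in the packed vector U
upidx(i, j, S) = (i - 1) * S - div(i * (i - 1), 2) + (j - i)

for i in 1:(S - 1), j in (i + 1):S
    U[upidx(i, j, S)] = dist(i, j)
end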


Alex Ames

Aug 13, 2015, 10:01:02 AM
to julia-users
If you're willing to pay the performance cost associated with disk I/O, a memory-mapped array (mmap_array) might be what you're looking for.
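A minimal sketch of what that could look like on a 0.4-era Julia (on 0.3 the call was mmap_array; on current Julia you need `using Mmap` first; the file name is made up and dist is the one from the original post):

S = 65600
io = open("distances.bin", "w+")              # the backing file grows to ~34 GB
M = Mmap.mmap(io, Matrix{Float64}, (S, S))    # disk-backed; on 0.3: mmap_array(Float64, (S, S), io)
for i in 1:(S - 1), j in (i + 1):S
    M[i, j] = dist(i, j)
end
Mmap.sync!(M)                                 # flush dirty pages to disk
close(io)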

Stefan Karpinski

Aug 13, 2015, 10:25:46 AM
to Julia Users
Memory-mapped arrays are going to be a performance nightmare if you actually need to write to the entire array. If you want a dense 65600 x 65600 matrix, you need roughly 4.3 billion entries times your element size, which by default is 8 bytes for Float64, which means you need a machine with at least 32GB of RAM, which shouldn't be too hard to find these days. The fact that you only need the upper triangle can save you half of that, but you'd need to implement custom upper triangular storage for it. The index computations are a bit tricky, but it shouldn't be that bad. If you can figure out some way to make the matrix sparse, then you could reduce this considerably, but you'd need to do something clever, since most points are not at distance zero from each other. The Distances package may be of interest to you as well, but it won't help with the memory issues.
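For reference, the back-of-the-envelope arithmetic (Julia as a calculator):

65600^2              # 4_303_360_000 entries
65600^2 * 8 / 2^30   # about 32 GiB for Float64; Float32 would halve that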

Marcio Sales

Aug 13, 2015, 6:18:38 PM
to julia-users
Charles,
If your work ends with the matrix, you could write it to a binary file on disk: have your loop calculate and hold a batch of values at a time and append them to the file periodically, until all values are saved. Try to make the loop hold as many values as possible, so you don't need to write as often; that improves performance.
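Something along these lines, perhaps (a rough sketch; the file name, the chunk size and dist are placeholders/assumptions):

S = 65600
rows_per_chunk = 512                         # tune to the memory you can spare (~270 MB per block here)
open("distances.bin", "w") do io
    for i0 in 1:rows_per_chunk:S
        i1 = min(i0 + rows_per_chunk - 1, S)
        block = zeros(Float64, i1 - i0 + 1, S)
        for i in i0:i1, j in (i + 1):S
            block[i - i0 + 1, j] = dist(i, j)
        end
        write(io, block)                     # append rows i0:i1 (raw bytes, column-major within the block)
    end
end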

But if you then need to use this matrix for other operations, then you have a problem... unless you buy enough memory, or you can somehow do the operation in parts.
Marcio

Charles Novaes de Santana

Aug 13, 2015, 7:03:32 PM
to julia...@googlegroups.com
Dear all,

Thank you very much for your help and suggestions! I have learned a lot from you. Indeed, I want to use this matrix, not only read and store the data.

I should have written in my email that I was wondering if Julia had something similar to bigmemory and bigalgebra in R (https://sites.google.com/site/bigmemoryorg/home/bigalgebra). I know they make the allocation of huge amounts of data easier, but I haven't tried them on my data yet (my whole project is in Julia, so I want to push Julia as far as I can before resorting to other languages).

As it seems to be a problem related to my computational resources, my first shot will be to run it in Julia on a cluster with more memory. If that doesn't work I will take some time to think of a different way to solve my problem (in the worst case I will have to avoid this high-resolution dataset for now).

Thanks a lot!

Charles

John Gibson

Aug 13, 2015, 9:48:56 PM
to julia-users
What do you mean by "distances between sites at different resolution maps"? The word "resolution" suggests that the N x N size of the matrix results from discretizing a continuous function into N data points and computing the N^2 distances between those data points. If that's the case, there's almost certainly a more compact representation of the distance function than the N^2 matrix. For example, you can probably represent those N data points with an expansion over m << N continuous expansion functions, and the distance function with an expansion over the m x m tensor product of those functions.

John

Charles Novaes de Santana

Aug 14, 2015, 3:20:12 AM
to julia...@googlegroups.com
Hi John,

Thanks for writing and for your suggestion. Sorry if my email was not clear.

I am working with global discrete maps at resolutions varying from 0.5x0.5 to 2.5x2.5 degrees per grid point. Those maps are discrete and represent the presence or absence of suitable habitat, so many of the points are actually 0 (the authors of the maps considered some environmental conditions to decide whether the sites were suitable or unsuitable habitats for a given species).

I would like to know the distance between any pair of suitable habitats to perform a task. My first and simplest try was to use a matrix to record the distances between any pair of points, and to access this matrix whenever I needed such a distance in my model. This approach works fine with lower-resolution data, but I understand now that it is not that good when I have so many sites.

I am thinking of different possibilities now. I see at least three (for sure there are many more :)):

1) to use only the subset of suitable habitats to build the matrix of distances (and then to use sparse matrix as suggested by Stefan)
2) to use a machine with more memory and try to run my models using the matrices with all the sites
3) to try another language/library that might work better with such a big amount of data (like Python or R).

Thank you all for your feedbacks and your time!

Best,

Charles

Mauro

Aug 14, 2015, 5:29:00 AM
to julia...@googlegroups.com
Maybe you could use a ragged array, which would only store the half of
the matrix you need: https://bitbucket.org/maurow/ragged.jl

Tim Holy

Aug 14, 2015, 5:39:51 AM
to julia...@googlegroups.com
bigmemory is presumably the same thing as mmap_array.

--Tim

John Gibson

Aug 14, 2015, 10:48:47 AM
to julia-users
Charles: 

Thanks for your response. I understand now that your application is ecological, but the mathematical nature and the discretization of your system are still unclear to me. You might consider a 'lazy' approach, in which you compute a distance when you need it, rather than computing them all and storing them in a matrix. Or perhaps you should consult an applied mathematician at your university and get some help on how to represent your data set efficiently. I suspect there's a better way than just increasing the number of grid points.

John

Stefan Karpinski

Aug 14, 2015, 11:03:06 AM
to julia...@googlegroups.com
On Friday, August 14, 2015, Charles Novaes de Santana <charles...@gmail.com> wrote:

1) to use only the subset of suitable habitats to build the matrix of distances (and then to use sparse matrix as suggested by Stefan)

Distance matrices are not usually sparse – the farthest-apart pairs of points have large distances and are the most common and least interesting. However, you could store only the distances for close points in a sparse matrix, and use zero to represent the distance between pairs of points that are not close enough to be of interest. Either that, or you could store 1/d instead of d, so that closer points have higher weights, and threshold 1/distance so that far-apart points have zero entries.
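For the thresholded variant, something like this (a sketch; the cutoff value is made up, and S and dist are the ones from the original post):

cutoff = 500.0                       # "close enough to care about", in whatever units dist uses
rows = Int[]; cols = Int[]; vals = Float64[]
for i in 1:(S - 1), j in (i + 1):S
    d = dist(i, j)
    if d < cutoff
        push!(rows, i); push!(cols, j); push!(vals, d)
    end
end
D = sparse(rows, cols, vals, S, S)   # only the near pairs are actually stored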
 
2) to use a machine with more memory and try to run my models using the matrices with all the sites

This is probably the easiest thing to do since your data set is not of a truly unreasonable size, just largish. However, you may be much happier if you can make your problem smaller than O(n^2).
 
3) to try another language/library that might work better with such big amount of data (like python, or R).

This problem isn't going to be fundamentally different no matter what language you use: you have more data than fits in memory. Spilling memory to disk is going to be *much* slower than just recomputing distances – orders of magnitude slower. As John suggested, is there any particular reason you need to materialize all of these values in a matrix? What computation are you going to perform over that matrix?

Matt Bauman

Aug 14, 2015, 12:08:59 PM
to julia-users
You could create a "phony" 2-dimensional array that computes the distances on the fly… but you won't be able to pass this matrix to, e.g., BLAS.

immutable DistanceMatrix <: AbstractArray{Float64, 2}
    locs::Matrix{Float64} # a 2xN or 3xN matrix containing the location coordinates
end
Base.size(A::DistanceMatrix) = (size(A.locs, 2), size(A.locs, 2))
Base.getindex(A::DistanceMatrix, i::Int, j::Int) = dist(A.locs[:, i], A.locs[:, j]) # dist = your distance function; could be further optimized

(Untested)
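A quick usage sketch of the idea, assuming a plain Euclidean distance and made-up coordinates:

dist(a, b) = norm(a - b)       # assumed distance function between coordinate vectors

locs = rand(2, 5)              # 5 made-up sites in 2-D
D = DistanceMatrix(locs)
size(D)                        # (5, 5)
D[2, 4]                        # computed on the fly; nothing is stored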