Big data, open stream, A = mmap_array(Int64, (25,30000), s) data type


program...@gmail.com

Sep 17, 2013, 3:33:03 PM
to julia...@googlegroups.com
Help ;)
data.txt:
1,0
0,1

(linux64, UTF8)

julia> s=open("data.txt","r")
IOStream(<file data.txt>)

is ok,
but:

julia> (A = mmap_array(Int, (2,2), s))
2x2 Array{Int64,2}:

 950589581243198513  0
                  0  0
:/
 
Can the data file be text, or does it have to be binary? How do I load the matrix A = mmap_array(...) with data from the file data.txt?
I would like to process several gigabytes this way; these four digits are just an example.
Paul

Stefan Karpinski

Sep 17, 2013, 3:36:44 PM
to Julia Users
When you mmap a file to an array, the exact binary content of the file is the memory that represents the array. You can't take a text representation of numbers and "mmap" it to a float array. If the data is text, you have to parse it and construct an array.
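To make the point concrete, here's a minimal sketch (data.bin is a made-up filename, and mmap_array is the API name used in this thread): the file must hold the raw bytes of the values, which is what write produces for an array.

```julia
# Minimal sketch: mmap needs the file to hold the raw bytes of the values,
# not their text form. data.bin is a made-up filename.
vals = Int64[1 0; 0 1]
open("data.bin", "w") do io
    write(io, vals)              # 32 raw bytes, column-major order
end

A = zeros(Int64, 2, 2)
open("data.bin", "r") do io
    read!(io, A)                 # mmap_array(Int64, (2,2), io) would map these same bytes
end
```

Reading the bytes back reproduces the matrix exactly; memory-mapping simply gives you the same bytes without copying them into RAM first.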

program...@gmail.com

Sep 18, 2013, 9:25:15 AM
to julia...@googlegroups.com
Many thanks for the quick response. The documentation example open(readall, "file.txt") suggests a text file. Maybe it would be better for the documentation to use file.bin...?
Paul

Ivar Nesje

Sep 18, 2013, 9:57:34 AM
to julia...@googlegroups.com
Where does it suggest that mmap_array should get a stream from a "file.txt"? I cannot find a high-level description of how to use mmap in the documentation. open() can definitely be used with text files, but then you have to read() from it, and if you want to do calculations you first have to convert the number strings to machine numbers.

mmap_array is convenient for dealing with arrays that are too big to fit in memory, but you must have a file on disk that mirrors the internal memory representation.

Stefan Karpinski

Sep 18, 2013, 12:42:11 PM
to Julia Users
Try readcsv:

julia> readcsv("tmp/data.csv")
2x2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0

If you'd prefer the data as Ints, you can do this:

julia> readcsv("tmp/data.csv",Int)
2x2 Array{Int64,2}:
 1  0
 0  1

Hope that helps.

program...@gmail.com

Sep 18, 2013, 4:22:04 PM
to julia...@googlegroups.com
ok, but I have an 8 GB data.csv file and only 8 GB of RAM :)

program...@gmail.com

Sep 18, 2013, 4:26:41 PM
to julia...@googlegroups.com
In this place: http://docs.julialang.org/en/release-0.1-0/stdlib/base/?highlight=open#Base.dlopen

Example: open(readall, "file.txt"), but it is only a suggestion ;)

Paul

program...@gmail.com

Sep 18, 2013, 4:32:57 PM
to julia...@googlegroups.com
Working on large files is very attractive for new Julia enthusiasts. It would be nice to have a concise guide on how to do it... After R, it should be very simple ;)
Paul

Stefan Karpinski

Sep 18, 2013, 4:33:32 PM
to Julia Users
There are many options, but mmap is not one of them. You could write a program that streams over the data instead of loading it all into memory at once. If there is a limited range of the values, such as 0-255, then you could load them into memory as Uint8s and that would save memory. It's also possible that the text representation of the data is larger than representing them as numbers in memory would be but that is by no means guaranteed. The best option is probably buying more RAM.
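A sketch of the streaming approach, with a tiny made-up file standing in for the real data; only the running accumulators ever live in memory:

```julia
# Streaming sketch: scan the CSV line by line and keep only running
# accumulators in memory. data.csv is a made-up filename.
function stream_sum(path)
    total, nrows = 0, 0
    open(path, "r") do io
        for line in eachline(io)
            # parse each comma-separated field and add it to the running total
            total += sum(parse(Int64, f) for f in split(strip(line), ','))
            nrows += 1
        end
    end
    total, nrows
end

open("data.csv", "w") do io          # tiny stand-in for the 8 GB file
    write(io, "1,0\n0,1\n")
end
total, nrows = stream_sum("data.csv")
```

Any statistic that decomposes into per-row updates (sums, counts, moments) can be computed this way without ever holding the whole file in RAM.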

John Myles White

Sep 18, 2013, 6:15:43 PM
to julia...@googlegroups.com
There is basic support for streaming in the DataFrames package. I need to finish some updates that give this a nicer interface, but it is already possible to open up an IO stream and pull in a small number of rows in a loop that will eventually work through the whole file. The current implementation is a little slow, but I have 95% of the work done for a much faster stream processing tool.

 -- John

program...@gmail.com

Sep 19, 2013, 2:53:17 AM
to julia...@googlegroups.com
I would like to process a collection of more than 100 GB. Do you think this is possible with the DataFrames package? As you guessed, I need a fast tool that reads from and writes to files.
Paul

Ivar Nesje

Sep 19, 2013, 4:22:16 AM
to julia...@googlegroups.com
As the Americans are probably asleep now, I'll try to give my guess at your question.

It depends on what kind of processing you want to do with the data, and what performance you need. If you don't have access to a computer with enough memory, you have to process it chunk by chunk. Just reading and writing 100 GB (as in copying a file) takes time, so it is important to reduce the number of full scans. There are lots of databases specially made to do simple operations on huge on-disk data that might be easier to use and optimize than the current set of packages for Julia. Can you try to explain what the data looks like and what kind of processing you want to do?

Ivar

Tim Holy

Sep 19, 2013, 6:20:15 AM
to julia...@googlegroups.com
On Wednesday, September 18, 2013 01:26:41 PM program...@gmail.com wrote:
> in this place:
> http://docs.julialang.org/en/release-0.1-0/stdlib/base/?highlight=open#Base.dlopen
>
> Example: open(readall, "file.txt"), but it is only a suggestion ;)

First to explain: mmap_array() takes a stream input, and indeed open() returns
a stream. But open() is more generic, and can open both binary and ASCII files.
The example for open() happens to illustrate it for an ASCII file.

If you're new to memory mapping then indeed there are pitfalls, of which this
is one. So, I expanded the documentation on mmap_array considerably, including
a complete example. Hopefully that gives you something close to the "synthetic
guide" you were asking about in a different message.

Note there are two versions of the documentation; you'll want to access the
latest (hover over the icon at the lower right of the page). This is
especially true if you're running 0.2pre.

Best,
--Tim

Tim Holy

Sep 19, 2013, 6:22:04 AM
to julia...@googlegroups.com
For big data, unless each data point will be accessed only once, there's going
to be a huge benefit to having the data stored in binary format---string
conversion is slow, and you don't want to have to do it repeatedly each time
you access the same value. In such cases you're best off writing a converter
program to write a separate file, in binary format. Then you can use mmap
easily.

--Tim
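A converter along these lines might be sketched as follows (filenames are made up; the exact string-to-number parsing call has varied across Julia versions):

```julia
# Sketch of a text -> binary converter (hypothetical filenames). It streams
# the text file line by line, so it never holds the whole dataset in RAM.
open("data.txt", "w") do io          # tiny stand-in for the real text file
    write(io, "1,0\n0,1\n")
end

fin  = open("data.txt", "r")
fout = open("data.bin", "w")
for line in eachline(fin)
    for field in split(strip(line), ',')
        write(fout, parse(Int64, field))   # 8 raw bytes per value
    end
end
close(fin); close(fout)
# Caveat: the values land in the file in row order, while Julia arrays are
# column-major, so mmapping this file yields the transpose unless you
# account for that (e.g. by swapping the dimensions you pass to mmap).
```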


program...@gmail.com

Sep 19, 2013, 7:45:39 AM
to julia...@googlegroups.com
Nice, I don't sleep either, GMT time :)

Data: something like 10^6 rows, 10^3 columns (float), as data matrices. Algebra, covariance, statistical tests with my own code. Sums, sqrt, and conditional moments of the input...

Export: a summary table (e.g. a 10^3 x 10^3 matrix)

Paul

program...@gmail.com

Sep 19, 2013, 7:49:14 AM
to julia...@googlegroups.com
P.S. Unfortunately, covariance analysis needs quick access to all pairs of variables, pair after pair...
Paul
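That said, covariances can be accumulated in one pass over row blocks, since the sample covariance only needs X'X and the column means; here is a sketch with a tiny in-memory matrix standing in for the on-disk data (blocked_cov and the block size are made up for illustration):

```julia
# Blocked covariance sketch: accumulate X'X and the column sums over row
# blocks, so only one block is ever in memory at a time. In real use each
# block would be read from disk or an mmapped file.
function blocked_cov(X, blk)
    n, p = size(X)
    S  = zeros(p, p)                    # running X'X
    mu = zeros(p)                       # running column sums
    for r0 in 1:blk:n
        B = X[r0:min(r0 + blk - 1, n), :]   # one block (would come from disk)
        S  += B' * B
        mu += B' * ones(size(B, 1))         # add this block's column sums
    end
    mu /= n
    (S - n * (mu * mu')) / (n - 1)      # sample covariance matrix
end

C = blocked_cov([1.0 2.0; 2.0 4.0; 3.0 6.0; 4.0 8.0], 2)
```

Each pass over the file updates all p^2 pairwise accumulators at once, so the pairs never have to be revisited individually.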

Tim Holy

Sep 19, 2013, 10:15:22 AM
to julia...@googlegroups.com
On Thursday, September 19, 2013 04:45:39 AM program...@gmail.com wrote:
> data such as 10^6 rows, 10^3 columns (float), as data matrices. Algebra,
> covariance, statistical tests with my own code. Sums, sqrt, and conditional
> moments of the input...
>
> Export: a summary table (e.g. a 10^3 x 10^3 matrix)
> Paul

I'm not exactly sure what you're asking, but in general a memory-mapped array
acts just like a regular array; any algorithm written for an Array should (in
principle) work. So I recommend just starting to try to do whatever it is
you're interested in doing, and see how it goes. One recommendation would be
to start with files that are not huge, so that it doesn't take too long to
complete.

There are two likely gotchas:
1. By and large, if the outputs are also big, you'll want to use algorithms
with "pre-allocated outputs," where the output is allocated with mmap_array.
That way you don't need the memory to compute the entire thing in one shot.
For example, you should be able to do

C = A*B

this way:

sA = open("matrixA.bin")
A = mmap_array(Float64, (m,k), sA)
sB = open("matrixB.bin")
B = mmap_array(Float64, (k,n), sB)
sC = open("result.bin", "w+")
C = mmap_array(Float64, (m,n), sC)
A_mul_B(C, A, B)

The key part being that A_mul_B is a function, built into Julia, to compute
A*B using a pre-allocated output C. It should automatically write the disk file
as it goes; you don't need to do any real work to achieve this.

2. When using mmapped arrays, you need to pay particular attention to
"cache-efficiency," because disk<->RAM is even slower than RAM<->cache. Many of
the algorithms in Julia are cache-efficient already, but there are also many
that are not. You'll have to experiment to find the bottlenecks, and when you
find them you're encouraged to contribute improvements to Julia and various
packages.
For the kinds of things you're describing, Dahua's excellent
https://github.com/lindahua/NumericExtensions.jl
is a collection of some cache-efficient algorithms. There are many others in
other packages (e.g., just in the last 24 hours I rewrote Image's gaussian
filtering algorithm to be more cache-efficient, with a 5x speed improvement in
real-world tests even when working from RAM).


But that's about all there is to it---memory-mapping makes everything much
easier when you're dealing with big data sets. So just start writing your
algorithms and see how it goes.

Best,
--Tim

programista wpf

Sep 19, 2013, 2:18:28 PM
to julia...@googlegroups.com
Tim, many thanks for the fast, dense summary. I'm an analyst, not a
programmer, but I'll try :) I'd welcome suggestions from those who have
already done this.
Paul

On 2013-09-19 16:15, Tim Holy wrote:

program...@gmail.com

Sep 28, 2013, 11:13:09 AM
to julia...@googlegroups.com
How do I parse the txt file to binary?
Paul

Ivar Nesje

Sep 28, 2013, 2:10:46 PM
to julia...@googlegroups.com
You write or use an existing parser.

readcsv or readdlm in the standard library might be good options. I'm not certain how well they handle a large dataset, because it does not look like they can read into a memory-mapped array. That would be a feature request, but I don't know whether it is reasonable.

Ivar

Kevin Squire

Sep 28, 2013, 2:14:33 PM
to julia...@googlegroups.com
Those actually use mmap by default unless you turn it off with "mmap=false".  I'd be curious how well they work with large files.

Kevin

programista wpf

Sep 28, 2013, 2:23:21 PM
to julia...@googlegroups.com
Hi Ivar,
My file (more than 10 GB) does not fit in RAM, so "readcsv" is not an option.
I have been looking for an hour for a converter of numeric txt files to bin, but I have not found one.
I'm an analyst, not a programmer, so I can't write one myself ;) Can you point out something clever for Windows or Linux?
Paul

On 2013-09-28 20:10, Ivar Nesje wrote:

Stefan Karpinski

Sep 28, 2013, 2:33:25 PM
to Julia Users
I'm afraid this is pretty off-topic at this point, since this problem is in no way Julia-specific. If you don't have enough RAM to load your data into memory, you're going to have to do some programming to work around that issue. This is true of any system you would want to use (R, Python, Matlab, Excel). 10GB is not a terribly large amount of data these days; I would recommend finding a machine with more RAM as by far the easiest route, especially if you don't have the ability to write programs yourself that can analyze the data without loading it all into memory.

Ivar Nesje

Sep 28, 2013, 2:46:50 PM
to julia...@googlegroups.com
There are also lots of database tools (SQL, NoSQL): finished products with preprogrammed functions that work on large datasets on disk.

programista wpf

Sep 28, 2013, 2:47:02 PM
to julia...@googlegroups.com
By doing:
s = open("data.bin", "r")
A = mmap_array(Int64, (size, size), s)
can't I read files (data.bin) larger than RAM?
Paul


On 2013-09-28 20:33, Stefan Karpinski wrote:

programista wpf

Sep 28, 2013, 2:50:07 PM
to julia...@googlegroups.com
But those do not do matrix algebra ...
Paul
On 2013-09-28 20:46, Ivar Nesje wrote:

program...@gmail.com

Sep 28, 2013, 3:12:09 PM
to julia...@googlegroups.com
By doing:
s = open("data.bin", "r")
A = mmap_array(Int64, (size, size), s)
can't I read files (data.bin) larger than RAM?
Paul


Stefan Karpinski

Sep 28, 2013, 3:37:08 PM
to julia...@googlegroups.com
No, that will not work.

Andre R

Sep 28, 2013, 7:56:16 PM
to julia...@googlegroups.com
The right tool for working with big data is the HDF file format. It's made for super fast IO and allows partial reads. Once my data is >10k rows in CSV, I always first use h5tools to convert the CSV to HDF:

(One of my projects used CSV for a while, which took ~10s to parse; after the conversion it now takes <0.5s.)

Use h5import from the tools, or h5fromtxt:

You don't lose anything with this step; you can dump it back to ASCII any time.

From there, you can then easily access slices:

or maybe now your entire data fits into RAM?

HTH

Stefan Karpinski

Sep 28, 2013, 8:08:45 PM
to julia...@googlegroups.com
Good tool. And the author is none other than our very own Steve Johnson.

Tim Holy

Sep 29, 2013, 6:51:56 AM
to julia...@googlegroups.com
> s = open("data.bin", "r")
> A = mmap_array(Int64, (size, size), s)
> I can't read files (data.bin) larger than RAM?

You can read files much bigger than RAM this way. You just can't process them
all at once. But many algorithms, like sum(A), will work just fine, because
they access the data one item at a time, and the OS will swap things in and
out for you.

On Saturday, September 28, 2013 04:56:16 PM Andre R wrote:
> The right tool for working with big data is the HDF file format. It's made
> for super fast IO and allows partial reads.

Agreed this is the right way to store the binary data.

One point to make is that the HDF5 slicing operation is fantastic as long as
you're retrieving sizable chunks of the array at once. Asking HDF5 to give you
data item-by-item, e.g.,

dset = g["mydata"]
s = 0.0
for j = 1:size(dset, 2), i = 1:size(dset, 1)
    s += dset[i, j]
end

would be quite slow compared to mmap. However, Simon Kornblith recently
implemented support for accessing "simple" HDF5 datasets (no compression, not
chunked) via mmap:

https://github.com/timholy/HDF5.jl/blob/master/doc/hdf5.md#memory-mapping

With JLD you can set this to be the default for a whole file, see
https://github.com/timholy/HDF5.jl/blob/master/doc/jld.md#usage

So in the end you have available the best of both worlds: the portable,
self-documenting format of HDF5, and the speed of mmap.

--Tim

program...@gmail.com

Sep 29, 2013, 9:03:07 AM
to julia...@googlegroups.com

Thanks a lot for a clear set of tips! I'll let you know how it goes.
Paul

On Tuesday, September 17, 2013 at 21:33:03 UTC+2, program...@gmail.com wrote:

Stefan Karpinski

Sep 29, 2013, 2:21:48 PM
to julia...@googlegroups.com
I missed the fact that you were asking about mmapping binary data, not text. In that case mmap is an option but still is no silver bullet.

sig...@kcl.ac.uk

Oct 25, 2013, 9:47:58 AM
to julia...@googlegroups.com
Apologies if some of this sounds naive - I am new to Julia and just getting my head around it.

Some of the features being suggested here have been implemented by me in MATLAB using the built-in memmapfile function, which seems similar to Julia's mmap, and they have proven useful for passing memory-mapped data to standard MATLAB m-code, including e.g. MATLAB's own libraries, without the need to modify them.

I wrote a MATLAB OOP class that wraps a memmapfile object - the 'nakhur' class. It overloads standard MATLAB operations like subsref, size, etc. and effectively lies to MATLAB about what the object is. A use case might illustrate this best:

1. Memory-map a huge 16-bit integer array, perhaps created by sampling with an ADC.
2. Overload the subsref method to scale and offset the data, returning double results in real-world units, so that e.g. X(:,100) returns the 100th column as a double vector.
Accessing the mapped data now returns double data, and the MATLAB isfloat method is overloaded to return true. size is overloaded to return the dimensions of the mapped data. As far as MATLAB is concerned, the nakhur object is a MATLAB double matrix, so I can pass it e.g. to most of the Signal Processing Toolbox functions (not to 'mex' functions, of course). As long as those functions access only part of the data, e.g. when filtering across columns of a matrix, the mapped data is accessed via subsref, so only the data required at any one time will be loaded into the MATLAB memory space.

This proves to be both fast and effectively lets me use a memory-mapped object as though it were a native matrix. Whether such a scheme is compatible with Julia, I don't know.




Tim Holy

Oct 25, 2013, 10:26:25 AM
to julia...@googlegroups.com
mmap_array already does this.

--Tim

Stefan Karpinski

Oct 25, 2013, 11:54:23 AM
to Julia Users
As Tim says, there's already very good support for mmap in Base Julia:

http://docs.julialang.org/en/latest/search/?q=mmap

mmap is my favorite system call :-)

sig...@kcl.ac.uk

Oct 25, 2013, 12:06:22 PM
to julia...@googlegroups.com
@Tim and @Stefan

Many thanks. It seems I have misunderstood some previous comments here so I'll take a more careful look at mmap.

I agree that memory mapping can be extremely powerful.

Malcolm
