reading compressed csv file?


ivo welch

Jan 4, 2015, 6:51:18 PM
to julia...@googlegroups.com

dear julia users:  beginner's question (apologies, more will be coming).  it's probably obvious.

I am storing files in compressed csv form.  I want to use the built-in julia readcsv() function.  but I also need to pipe through a decompressor first.  so, I tried a variety of forms, like

   d= readcsv("/usr/bin/gzcat ./myfile.csv.gz |")
   d= readcsv("`/usr/bin/gzcat ./myfile.csv.gz`")

I can type the file with run(`/usr/bin/gzcat ./crsp90.csv.gz`), but wrapping a readcsv around it does not capture it.  how does one do this?

regards,

/iaw

ele...@gmail.com

Jan 4, 2015, 7:55:08 PM
to julia...@googlegroups.com
Can you run the command with open() http://docs.julialang.org/en/latest/stdlib/base/?highlight=spawn#Base.open and pass the stream it returns to readcsv?

Cheers
Lex

 


ivo welch

Jan 4, 2015, 8:12:13 PM
to julia...@googlegroups.com
still not obvious. readcsv does have a dispatch for a stream (good),
but I really need a popen function.
x=readcsv(open(`gzcat myfile.csv.gz`, "r"))
is wrong. x=run(`gzcat myfiles.csv.gz`) doesn't send the output to x
for further piping as far as I can see, so readcsv(x) doesn't do it.

/iaw

----
Ivo Welch (ivo....@gmail.com)
http://www.ivo-welch.info/
J. Fred Weston Distinguished Professor of Finance
Anderson School at UCLA, C519
Director, UCLA Anderson Fink Center for Finance and Investments
Free Finance Textbook, http://book.ivo-welch.info/
Exec Editor, Critical Finance Review, http://www.critical-finance-review.org/
Editor and Publisher, FAMe, http://www.fame-jagazine.com/

Tim Holy

Jan 4, 2015, 8:56:17 PM
to julia...@googlegroups.com
I wonder if the GZip.jl package would help?

--Tim

ele...@gmail.com

Jan 4, 2015, 8:56:47 PM
to julia...@googlegroups.com, ivo....@gmail.com


On Monday, January 5, 2015 11:12:13 AM UTC+10, ivo welch wrote:
> still not obvious.  readcsv does have a dispatch for a stream (good),
> but I really need a popen function.
>   x=readcsv(open(`gzcat myfile.csv.gz`, "r"))
> is wrong.  x=run(`gzcat myfiles.csv.gz`) doesn't send the output to x
> for further piping as far as I can see, so readcsv(x) doesn't do it.

The documentation I linked said:

open(command, mode::AbstractString="r", stdio=DevNull)

Start running command asynchronously, and return a tuple (stream,process)

You need to pass the stream element of the tuple to readcsv().

Cheers
Lex

Todd Leo

Jan 4, 2015, 9:29:18 PM
to julia...@googlegroups.com
An intuitive thought is, uncompress your csv file via bash utility zcat, pipe it to STDIN and use readline(STDIN) in julia.
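
For reference, the STDIN route could look roughly like the sketch below. It rests on assumptions that are not from this thread: it targets Julia 1.x, where `stdin` is lowercase and `readdlm` from the DelimitedFiles standard library stands in for `readcsv`. The helper name `read_csv_stream` and the script name in the comment are made up for illustration.

```julia
using DelimitedFiles  # standard library in Julia 1.x; provides readdlm

# Hypothetical helper: parse comma-separated values from any readable stream.
read_csv_stream(io::IO) = readdlm(io, ',')

# Read from standard input only when this file is run as a script, e.g.
#   zcat myfile.csv.gz | julia readstdin.jl     (readstdin.jl is a made-up name)
if abspath(PROGRAM_FILE) == @__FILE__
    d = read_csv_stream(stdin)
    println(size(d))
end
```

The decompressor stays outside Julia here, so any input filter can be swapped in on the shell side without touching the script.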

Jiahao Chen

Jan 5, 2015, 12:43:16 AM
to julia...@googlegroups.com
This is how I used GZip.jl in the tests for the MatrixMarket package


Perhaps it might be useful for you.

Thanks,

Jiahao Chen
Staff Research Scientist
MIT Computer Science and Artificial Intelligence Laboratory

ivo welch

Jan 5, 2015, 1:46:15 AM
to julia...@googlegroups.com
dear tim, lex, todd (&others): thanks for responding. I really want
to learn how to preprocess input from somewhere else into the
readcsv() function. it's a good starting exercise for me to learn how
to accomplish tasks in general. there is so much to learn. [I did
not experiment with GZip.jl --- modules are new to me, and this one is
not included. I could make too many errors in this process. It will
probably make the specific task easier.]

now, the first mistake which tripped me up for a while is that I did
not grasp the difference between a string and a command. that is, I
should not have used " for my command. I had needed to use `. this
is why open("echo hi") did not work, but open(`echo hi`) does.

x=open(`gzcat myfile.csv.gz`)

is a good start. I see it contains a tuple of a Pipe and a Process.
this is printed by default on the command line. I learned I can make
this work with

d=readcsv( x[1] )

but I have a whole bunch of new questions, beyond question now.
first, try this:

julia> x1=open(`gzcat d.csv.gz`)
(Pipe(closed, 35 bytes waiting),Process(`gzcat d.csv.gz`, ProcessExited(0)))

julia> x2=open(`gzcat d.csv.gz`)
(Pipe(active, 0 bytes waiting),Process(`gzcat d.csv.gz`, ProcessRunning))

how strange---the claims are different. even stranger, the first
readcsv(x2[1]) is very slow now (I am talking 3 seconds on a 3 by 4
data file!); but following it with readcsv(x1[1]) is fast. I can't
imagine readcsv has intelligence built-in to cache past specific
conversions.

another strange definition from a novice perspective: close(x1) is
not defined. close(x1[1]) is. julia is the first language I have
seen where a close(open("file")) is wrong. this is esp surprising
because julia has the dispatch ability to understand what it could do
with a close(Pipe,Process) tuple. the same holds true for other
functions that expect a part of open. julia should be smart enough to
know this.



ele...@gmail.com

Jan 5, 2015, 2:47:28 AM
to julia...@googlegroups.com, ivo....@gmail.com


On Monday, January 5, 2015 4:46:15 PM UTC+10, ivo welch wrote:
> dear tim, lex, todd (&others):  thanks for responding.  I really want
> to learn how to preprocess input from somewhere else into the
> readcsv() function.  it's a good starting exercise for me to learn how
> to accomplish tasks in general.  there is so much to learn.  [I did
> not experiment with GZip.jl --- modules are new to me, and this one is
> not included.  I could make too many errors in this process.  It will
> probably make the specific task easier.]
>
> now, the first mistake which tripped me up for a while is that I did
> not grasp the difference between a string and a command.  that is, I
> should not have used " for my command.  I had needed to use `.  this
> is why open("echo hi") did not work, but open(`echo hi`) does.

Yep correct.
 

>     x=open(`gzcat myfile.csv.gz`)
>
> is a good start.  I see it contains a tuple of a Pipe and a Process.
> this is printed by default on the command line.  I learned I can make
> this work with
>
>    d=readcsv( x[1] )

Yes
 

> but I have a whole bunch of new questions, beyond question now.
> first, try this:
>
> julia> x1=open(`gzcat d.csv.gz`)
> (Pipe(closed, 35 bytes waiting),Process(`gzcat d.csv.gz`, ProcessExited(0)))
>
> julia> x2=open(`gzcat d.csv.gz`)
> (Pipe(active, 0 bytes waiting),Process(`gzcat d.csv.gz`, ProcessRunning))
>
> how strange---the claims are different.

That may just be a sampling effect: gzcat runs in a separate process, concurrently with the Julia process. Also see below for why the first call to open(command) may have been slower than the second: the first open may not have returned until after gzcat had already finished, while the second ran much faster and returned while gzcat was still running.
 
> even stranger, the first
> readcsv(x2[1]) is very slow now (I am talking 3 seconds on a 3 by 4
> data file!); but following it with readcsv(x1[1]) is fast.  I can't
> imagine readcsv has intelligence built-in to cache past specific
> conversions.

No, but the first time you do anything it's possible that you are hitting compile delays from the JIT (for open, readcsv, and everything they depend on); subsequent runs are faster.
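
A small timing sketch makes the compile delay visible. The function `sumsq` is a made-up example, not anything from Base, and the actual timings will vary by machine and Julia version; the only claim is that the first call pays for compilation and the second does not.

```julia
# Hypothetical function, defined only so there is something new to compile.
sumsq(v) = sum(abs2, v)

data = rand(1000)

t_first  = @elapsed sumsq(data)   # includes compiling sumsq for Vector{Float64}
t_second = @elapsed sumsq(data)   # reuses the already compiled code

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

The same effect would apply to the first readcsv call in a session, regardless of which pipe it reads from.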
 

> another strange definition from a novice perspective:  close(x1) is
> not defined.  close(x1[1]) is.

close() is defined for a stream, not a tuple (stream, process).
 
> julia is the first language I have
> seen where a close(open("file")) is wrong.

close(open("filenamestring")) is fine; close(open(command)) is not, because open(command) returns a tuple of two things, not just the stream.  This is Julia's primary paradigm: multiple dispatch means that the same function name can have several methods that do different things depending on the *type* of the arguments to the call, here a string or a command.
 
> this is esp surprising
> because julia has the dispatch ability to understand what it could do
> with a close(Pipe,Process) tuple.

But only if such a close() method is defined, which it is not.  Maybe it should be, but open(command) is significantly less used than open(file).
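
To illustrate the dispatch point with something small: the function `describe_source` below is hypothetical (it is not part of Julia) and has one method for strings and one for commands. A tuple argument matches neither, which is the same situation as close() on the tuple.

```julia
# Hypothetical function with two methods, chosen by the type of the argument.
describe_source(src::AbstractString) = :file      # a "..." string is treated as a file name
describe_source(src::Cmd)            = :command   # a `...` literal is a command object

println(describe_source("echo hi"))   # string method
println(describe_source(`echo hi`))   # command method

# No method accepts a tuple, so a call with one would raise a MethodError.
println(applicable(describe_source, ("echo hi", 1)))
```

Adding a third method for the tuple type would make that call work, which is what defining close() for the (stream, process) tuple would amount to.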

Cheers
Lex

Kevin Squire

Jan 5, 2015, 9:09:41 AM
to julia...@googlegroups.com


>> another strange definition from a novice perspective:  close(x1) is
>> not defined.  close(x1[1]) is.
>
> close() is defined for a stream, not a tuple (stream, process).
>
>> julia is the first language I have
>> seen where a close(open("file")) is wrong.

FWIW, I believe that there was concern that the behavior of open(process) might cause confusion when it was defined in this way. (A quick search didn't locate the issue.)

The goal was to minimize the number of methods, but it might be worth exploring an alternative interface. A simple change would be to create and return a typed object (say, Handle) instead of a tuple, which would both allow closing it directly and give access to the opened process.
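
A rough sketch of what such a typed object might look like, purely as an illustration: the `Handle` type, its field names, and the forwarded methods are all hypothetical, written in Julia 1.x syntax, and not an actual or proposed Base API. An in-memory buffer stands in for a real process stream so the sketch runs on its own.

```julia
# Hypothetical wrapper type; Julia does not define a `Handle` like this.
struct Handle{S<:IO,P}
    stream::S     # where the command's output is read from
    process::P    # the process object, kept so its status can be inspected
end

# Forward the operations a reader needs to the wrapped stream.
Base.close(h::Handle) = close(h.stream)
Base.read(h::Handle, ::Type{String}) = read(h.stream, String)

# Stand-in demo: an IOBuffer instead of a pipe, `nothing` instead of a process.
h = Handle(IOBuffer("1,2,3\n"), nothing)
text = read(h, String)
close(h)          # works directly on the handle, no indexing needed
println(text)
```

The trade-off is the one mentioned above: one new type and a handful of forwarding methods, in exchange for close(open(command)) behaving the way close(open(file)) does.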

Ivo, would you be willing to open an issue regarding your confusion here (and point back to this thread)?

Cheers,
   Kevin 

Steven G. Johnson

Jan 5, 2015, 10:01:07 AM
to julia...@googlegroups.com


On Monday, January 5, 2015 12:43:16 AM UTC-5, Jiahao Chen wrote:
> This is how I used GZip.jl in the tests for the MatrixMarket package

In the present case, seems like it would be easier to do:

using GZip

data = GZip.open(fname) do g
    readcsv(g)
end
 

ivo welch

Jan 5, 2015, 2:23:02 PM
to julia...@googlegroups.com
hi kevin---I would be happy to open an issue, but I would prefer if
the "honor" was left to someone (you?) who can articulate it better.
I am a true novice here.

if I understand it right, the fix is easy. is a "Handle" change
complex and/or needed? just overload all functions that expect a Pipe
to work also with the (Pipe,Process) tuple. otoh, maybe doing this
with a Handle simply automates this everywhere?! not sure. I can't
weigh in on a discussion. I just don't know enough.

regards,

/iaw



Steven G. Johnson

Jan 5, 2015, 2:32:56 PM
to julia...@googlegroups.com
On Monday, January 5, 2015 9:09:41 AM UTC-5, Kevin Squire wrote:
> FWIW, I believe that there was concern that the behavior of open(process) might cause confusion when it was defined in this way. (A quick search didn't locate the issue.)

Jameson Nash

Jan 5, 2015, 2:39:13 PM
to julia...@googlegroups.com
It seems perhaps that each Process instance should remember its IO streams, so that it could be used directly as an IO object.

ivo welch

Feb 27, 2016, 11:14:41 PM
to julia-users

apologies for bothering everyone again.  is an easier solution planned for the following R-equivalent construct?

d <- read.csv(pipe("gzcat mygzippedfile.gz"))

where gzcat could be an arbitrary alternative decompressor or input filter, or is this likely to remain difficult for starters?

Tony Kelman

Feb 28, 2016, 2:50:03 AM
to julia-users
Ivo,

It looks like https://github.com/JuliaLang/julia/pull/12807 would implement the suggestion to change open(command) to return just the process instead of a tuple, so indexing into the return from open(`gzcat myzippedfile.gz`) would no longer be necessary.
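
As a sketch of how the R one-liner might then translate, assuming a Julia version (1.x) in which open(command) hands back a single readable process object and readdlm from the DelimitedFiles standard library takes the place of readcsv. The helper name `read_piped_csv` is made up, and the demo assumes a `gzip` executable on the PATH.

```julia
using DelimitedFiles  # standard library in Julia 1.x; provides readdlm

# Hypothetical helper (not part of Julia): run `cmd` and parse its standard
# output as comma-separated values.
function read_piped_csv(cmd::Cmd)
    # open(f, cmd) starts the command, passes its output stream to f,
    # then waits for the command and raises an error if it failed.
    open(cmd) do io
        readdlm(io, ',')
    end
end

# Demo: write a small CSV, compress it with an external gzip, read it back.
mktempdir() do dir
    path = joinpath(dir, "demo.csv")
    write(path, "1,2,3\n4,5,6\n")
    run(`gzip $path`)             # replaces demo.csv with demo.csv.gz
    gz = path * ".gz"
    d = read_piped_csv(`gzip -dc $gz`)
    println(size(d))
end
```

Any other decompressor or input filter can be passed in the same way, since the helper only sees the command's output stream.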

-Tony

ivo welch

Feb 28, 2016, 10:53:14 AM
to julia-users

hi tony---thanks.  I will keep an eye on the docs (presumably streams).  from a novice end-user (not developer) perspective, solving the specific snippet that I noted would be great in the documentation pages.  regards, /iaw

