zipped csv reader for kona


Neeraj Rai

May 9, 2013, 8:32:43 PM
to kona...@googlegroups.com
Hi,

I need to load lots of csv files into kona for data processing.
The files could be zip, gz, bz2, xz, etc. Some arrive already zipped, while others are zipped by me.
In some cases, an input file in one format (zip) is converted to another format for storage.

To handle this, I customized the csv reader to replace mmap with a popen of the right tool, streaming the file in.
If there is any interest in this code, I can make it available.
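
A minimal sketch of the popen approach described above, in C. It is not the code from the patch: the helper name count_records_from_cmd is hypothetical, and it only counts newline-terminated records to show the streaming loop, assuming a POSIX system with a shell.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes popen/pclose under strict -std modes */
#include <stdio.h>

/* Hypothetical helper: run `cmd` through the shell and count the
 * newline-terminated records it writes to stdout. Returns -1 if the
 * command cannot be started or exits with a non-zero status. */
long count_records_from_cmd(const char *cmd)
{
    FILE *fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;

    char buf[4096];
    long records = 0;
    size_t n;

    /* Read fixed-size chunks so the whole file is never held in memory. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                records++;
    }

    /* pclose returns the child's wait status; 0 means a clean exit. */
    if (pclose(fp) != 0)
        return -1;
    return records;
}
```

A real reader would parse fields inside the loop instead of counting newlines, but the open/read/pclose shape stays the same.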

thanks
Neeraj

Tom Szczesny

May 10, 2013, 9:22:02 AM
to kona...@googlegroups.com
Yes, I would be interested.  I also have lots of csv files.  
They are currently not zipped, but it may make sense to zip them for storage.

Kevin Lawler

May 10, 2013, 4:34:51 PM
to kona...@googlegroups.com
Reading from zipped files is a common use case.

I think you can handle zip files by detecting them. This means you can support zip files transparently without adding a configuration option.
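
One way such detection could work is by looking at the first few bytes of the file rather than its name. The sketch below is an assumption about how that might look, not code from kona; detect_format and the file_format enum are hypothetical names. The signatures are the standard leading bytes for gzip, bzip2, xz, and the zip local file header.

```c
#include <stddef.h>
#include <string.h>

typedef enum { FMT_PLAIN, FMT_GZIP, FMT_BZIP2, FMT_XZ, FMT_ZIP } file_format;

/* Hypothetical helper: classify a file from its leading bytes.
 * Anything that matches no known signature is treated as plain text. */
file_format detect_format(const unsigned char *head, size_t len)
{
    static const unsigned char gz[]  = { 0x1f, 0x8b };
    static const unsigned char bz2[] = { 'B', 'Z', 'h' };
    static const unsigned char xz[]  = { 0xfd, '7', 'z', 'X', 'Z', 0x00 };
    static const unsigned char zip[] = { 'P', 'K', 0x03, 0x04 };

    if (len >= sizeof xz  && memcmp(head, xz,  sizeof xz)  == 0) return FMT_XZ;
    if (len >= sizeof zip && memcmp(head, zip, sizeof zip) == 0) return FMT_ZIP;
    if (len >= sizeof bz2 && memcmp(head, bz2, sizeof bz2) == 0) return FMT_BZIP2;
    if (len >= sizeof gz  && memcmp(head, gz,  sizeof gz)  == 0) return FMT_GZIP;
    return FMT_PLAIN;
}
```

Reading six bytes from the start of the file is enough to feed this function, so no configuration option or file-name convention is needed.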

Stream versus mmap is not important.

What I worry about is adding library dependencies. This is against the K way. Is there a short BSD style zip reader in the same fashion as our nice and short PRNG code? 

Otherwise it might not be restricted to being an extension. Zlib is too much.

Bakul Shah

May 10, 2013, 5:49:11 PM
to kona...@googlegroups.com
popening an external zipper/unzipper (as Neeraj seems to have done) is perhaps the best choice.

Neeraj Rai

May 10, 2013, 8:30:48 PM
to kona...@googlegroups.com
Hi,

Here's the branch with the changes. I would like to get feedback on the code (conventions, efficiency, everything).
https://github.com/rneeraj/kona/commit/5878b078ce7cbd5a1cbe86c96628e4493ef13b40

It doesn't pull in any libs. For example, for zip, it issues "popen /bin/zcat filename" and streams in the results.
The processing creates a struct similar to the one built by the existing csv reader.
The code is Linux-specific. Maybe the paths can be made configurable, but I am not familiar with how to do that.
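
One possible way around hard-coded paths is to name the tool without a directory and let the shell started by popen resolve it through PATH. The sketch below is an illustration under that assumption, not the fileReadCmd from the branch; build_read_cmd and ends_with are hypothetical names, and the tool choices (gzip -dc, bzip2 -dc, xz -dc, unzip -p) are my own picks.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `s` ends with `suffix`, 0 otherwise. */
static int ends_with(const char *s, const char *suffix)
{
    size_t ls = strlen(s), lx = strlen(suffix);
    return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

/* Hypothetical helper: write into `out` a shell command that prints the
 * decompressed contents of `path` to stdout. Returns 0 on success, -1 if
 * the suffix is unknown, the path contains a single quote (which would
 * break the quoting used here), or `out` is too small. */
int build_read_cmd(const char *path, char *out, size_t outlen)
{
    const char *tool;
    if      (ends_with(path, ".gz"))  tool = "gzip -dc";
    else if (ends_with(path, ".bz2")) tool = "bzip2 -dc";
    else if (ends_with(path, ".xz"))  tool = "xz -dc";
    else if (ends_with(path, ".zip")) tool = "unzip -p";
    else return -1;

    if (strchr(path, '\'') != NULL)
        return -1;

    int n = snprintf(out, outlen, "%s '%s'", tool, path);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The resulting string can be handed straight to popen(cmd, "r"). Rejecting paths with a single quote is a crude guard; a path that starts with "-" would still need separate handling.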

thanks
Neeraj

Neeraj Rai

May 10, 2013, 8:31:52 PM
to kona...@googlegroups.com
The function that handles the different cases is fileReadCmd.

Neeraj Rai

May 10, 2013, 8:33:47 PM
to kona...@googlegroups.com
Where can I see the PRNG code?

Kevin Lawler

May 10, 2013, 10:35:43 PM
to kona...@googlegroups.com
https://github.com/kevinlawler/kona/blob/master/mt.c

Neeraj Rai

May 10, 2013, 11:30:10 PM
to kona...@googlegroups.com
I am not sure I follow this thought: what is the relation between the random number code and reading zip files?
Is popen an unacceptable way to stream in a zip file?

thanks
Neeraj

On Friday, 10 May 2013 22:35:43 UTC-4, Kevin wrote:
https://github.com/kevinlawler/kona/blob/master/mt.c

Neeraj Rai

May 12, 2013, 6:23:20 PM
to kona...@googlegroups.com
In the current code (which is only for review), the Linux-specific portion is in fileReadCmd.
That part can be called based on platform. We can add Linux and OSX calls and extend to other platforms as they become known.
For unknown ones, NYI is the fallback.
It is not pulling in any libs.

Bakul Shah

May 13, 2013, 4:11:58 PM
to kona...@googlegroups.com
I have a more general idea. Why not fold piping into regular IO ops? If the file "name" starts with a !, the data gets piped. For example

x: 0:"!ls" / output of ls goes to x
"!gzip > file" 0:x / x is piped into gzip to a compressed file
c: 0: "!wc" 0:x / feed x to wc and store its result in c
If you don't like a magic "!" prefix, maybe another kind of IO mode can be added? For instance, 8 & 9 (for text and data).
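
On the C side, the "!" prefix idea might reduce to a small dispatch at open time. This is only a sketch of the read direction under that assumption; the source struct and the source_open / source_close names are hypothetical and do not exist in kona.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes popen/pclose under strict -std modes */
#include <stdio.h>

typedef struct {
    FILE *fp;
    int   is_pipe;  /* 1 if fp came from popen and must be closed with pclose */
} source;

/* Hypothetical helper: open `name` for reading. A leading '!' means "run
 * the rest as a shell command and read its output"; anything else is
 * treated as an ordinary file. Returns 0 on success, -1 on failure. */
int source_open(source *s, const char *name)
{
    s->is_pipe = (name[0] == '!');
    s->fp = s->is_pipe ? popen(name + 1, "r") : fopen(name, "r");
    return s->fp ? 0 : -1;
}

/* Close with the call that matches how the stream was opened.
 * Must only be called after a successful source_open. */
int source_close(source *s)
{
    int rc = s->is_pipe ? pclose(s->fp) : fclose(s->fp);
    s->fp = NULL;
    return rc;
}
```

With this shape, the rest of the reader only sees a FILE * and does not need to know whether the data came from a file or from a command such as "!gzip -dc file.gz".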

On May 10, 2013, at 1:34 PM, Kevin Lawler wrote:

Neeraj Rai

May 13, 2013, 9:06:12 PM
to kona...@googlegroups.com
That looks more K-like.
Do you think Kevin has objections to the way it is being done, i.e. the popen part?
For now, I am using the code as a local patch while a more acceptable implementation is being sought.