Idea for daru-io: partial requires

Sameer Deshmukh

May 22, 2017, 8:02:05 AM

to Athitya Kumar, SciRuby Mailing List, Giuseppe Cuccu

Dear Athitya,

I was talking to a daru committer today, and it struck me that your daru-io gem's architecture can easily be modified to also support partial requiring of IO libraries that are specific to a particular environment or interpreter.

For example, we can have a wrapper over the `fastest_csv` gem, which is written in C, and have the importer/exporter inside a class FastestCsv. The user can load this functionality with `require 'daru/io/fastest_csv'`.

This kind of functionality will be required when doing `require 'daru/io'` so that dependencies can be reduced and normal workflows can be followed.

Think about this further and add it to your proposal.

Regards,

Sameer Deshmukh

May 22, 2017, 8:24:28 AM

to Athitya Kumar, SciRuby Mailing List, Giuseppe Cuccu

Also have a look at this link from Giuse: http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing

Regards,

Sameer Deshmukh

Pjotr Prins

May 22, 2017, 9:36:34 AM

to sciru...@googlegroups.com, Athitya Kumar, Giuseppe Cuccu

On Mon, May 22, 2017 at 05:54:26PM +0530, Sameer Deshmukh wrote:
> Also have a look at this link from Giuse:
> http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing

The problem with that approach is that it does not stream that well.
What you really want is for data to be farmed out to a thread/process
the moment it comes in.
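As a rough illustration of the streaming idea (this is not pcows, just a minimal sketch under assumptions): the reader below hands each chunk of lines to a worker as soon as it has been read, through a bounded queue, instead of loading everything and then scattering it. The helper name `stream_in_chunks` and its parameters are hypothetical. It uses threads for brevity; on MRI, CPU-bound parsing would need processes to run truly in parallel.

```ruby
require 'stringio'

# Hypothetical helper: stream lines from an IO and pass each chunk to a worker
# thread as soon as it is read, instead of reading the whole input first.
def stream_in_chunks(io, chunk_size: 2, workers: 2, &block)
  queue   = SizedQueue.new(workers * 2) # bounded, so the reader cannot run far ahead
  results = Queue.new

  pool = Array.new(workers) do
    Thread.new do
      # SizedQueue#pop returns nil once the queue is closed and drained.
      while (job = queue.pop)
        index, lines = job
        results << [index, block.call(lines)]
      end
    end
  end

  io.each_line.each_slice(chunk_size).with_index do |lines, index|
    queue << [index, lines] # blocks while all workers are busy
  end
  queue.close
  pool.each(&:join)

  # Chunks may finish out of order, so restore the input order at the end.
  Array.new(results.size) { results.pop }.sort_by(&:first).flat_map(&:last)
end

input = StringIO.new("a,1\nb,2\nc,3\n")
rows  = stream_in_chunks(input) { |lines| lines.map { |line| line.chomp.split(',') } }
```

The bounded queue is the point of the sketch: reading and processing overlap, and memory use stays proportional to the queue size rather than the file size.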

I created pcows (parallel copy-on-write) for bioruby-vcf, which gave great
speed improvements compared to using the parallel gem on GB-size files
(I tried that first).

https://github.com/pjotrp/bioruby-vcf/blob/master/lib/bio-vcf/pcows.rb

We can turn pcows into a separate gem if anyone is interested.

Pj.

Giuseppe Cuccu

May 22, 2017, 9:52:13 AM

to Pjotr Prins, sciru...@googlegroups.com, Athitya Kumar

(@Pjotr: I don't want to step in, this is just my opinion, always refer to Sameer!!)

Sure, my initial suggestion was just for inspiration, there's many ways it could be improved -- personally, I was considering a `fastest_csv`-based approach.

But what is most important: since all the code is out there, before starting splitting gems or working out new ones you could simply benchmark this `parallel+smarter_csv` approach against your `pcows`. Numbers tell the story ;)
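A minimal sketch of such a comparison, using only the standard library. The two candidates are stand-ins (stdlib `CSV.parse` versus a naive `split`), not the `parallel+smarter_csv` or `pcows` code, and the data is synthetic; the same harness shape would apply to the real implementations and real files.

```ruby
require 'benchmark'
require 'csv'

# Synthetic data so the run is reproducible; real measurements need real files.
data = Array.new(2_000) { |i| "#{i},name#{i},#{i * 1.5}" }.join("\n")

# Candidate A: the standard library parser.
stdlib_parse = ->(text) { CSV.parse(text) }

# Candidate B: stand-in for an alternative implementation (naive split, no quoting support).
naive_parse = ->(text) { text.each_line.map { |line| line.chomp.split(',') } }

timings = {
  stdlib: Benchmark.realtime { stdlib_parse.call(data) },
  naive:  Benchmark.realtime { naive_parse.call(data) }
}

timings.each { |name, seconds| puts format('%-8s %.4fs', name, seconds) }
```

Checking that both candidates return the same rows before comparing timings keeps the benchmark honest.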

cheers,

- Giuse

Pjotr Prins

May 22, 2017, 10:02:05 AM

to Giuseppe Cuccu, Pjotr Prins, sciru...@googlegroups.com, Athitya Kumar

Sure. Feel free! Note that I did both myself.


Pjotr Prins

May 22, 2017, 10:14:37 AM

to Pjotr Prins, Giuseppe Cuccu, sciru...@googlegroups.com, Athitya Kumar

On Mon, May 22, 2017 at 04:01:40PM +0200, Pjotr Prins wrote:
> Sure. Feel free! Note that I did both myself.

bio-vcf was actually a fun story. I wrote it to process TBs of VCF
files. The first version was a naive pcows. Then I decided to use the
parallel gem and was quite pleased, until the researchers started
complaining! It took forever to process data. That is when pcows2 came
back. The speed difference is actually orders of magnitude, and when you have to
wait 14 days instead of 3 it makes a huge difference! The actual
explanation is that most parallel implementations do a scatter-gather.
Between those phases is where you can lose a lot of time, especially
when reading the file there. I actually played with having parallel do
read/write on separate threads, but still the scatter/gather kicked
in. It is typical for 'map' implementations.

Anyway, one thing to keep in mind is that pcows is for MRI on
Linux/Mac. You may have trouble with JRuby or non-shell Windows. With
JRuby I would use JVM threading - that is probably the best approach,
but then it leaves MRI in the cold.

Pj.

Giuseppe Cuccu

May 22, 2017, 10:21:42 AM

to Pjotr Prins, sciru...@googlegroups.com, Athitya Kumar

Cool! :D Personally I have no experience with such large files; I usually switch to a DBMS as soon as it gets over 1GB.

(ps incidentally, the "@pjotr" of my last email was meant to be an "@athitya", apologies for any consequent misreading)

- Giuse

Athitya Kumar

May 22, 2017, 12:02:51 PM

to Giuseppe Cuccu, Pjotr Prins, SciRuby Mailing List

@Sameer - Sure, the suggested architecture supports partial requiring of IO modules in the format `require 'daru/io/importers/fastest_csv'` (due to the separation of from_ and to_ modules).

However, I'm not sure if wrapping C libraries directly in the daru-io gem itself would be the right way to go. I mean, wouldn't it be better to have the C wrappers as separate gems (say, the fastest_csv gem) and use these gems from the IO modules?

Also, instead of using partial requires like `require 'daru/io/importers/fastest_csv'`, would an option that takes the library to use as an argument be easier for the user? I mean, something like this for the user,

```
require 'daru/io/importers/csv' #! format-based require rather than library-based require

Daru::DataFrame.from_csv path, col_sep: ..., lib: :fastest_csv
#! This redirects to Daru::IO::Importers::CSV#load
```

Meanwhile, the below redirect happens inside the daru-io gem,

```
# daru/io/importers/csv.rb

module Daru
  module IO
    module Importers
      module CSVHelper
        class << self
          def fastest_csv(path, opts = {})
            require 'fastest_csv'
            # Use fastest_csv gem here
          end

          def csv(path, opts = {})
            require 'csv'
            # Use standard library csv here
          end
        end
      end

      module CSV
        class << self
          # Dispatch to the helper that matches the requested library.
          def load(path, opts = {})
            return CSVHelper.csv(path, opts) if opts[:lib].nil?
            return CSVHelper.fastest_csv(path, opts) if opts[:lib] == :fastest_csv
            # ...
          end
        end
      end
    end
  end
end
```

Even in this case, only the fastest_csv gem / module is required. So, would format-based partial requiring be preferred or gem-based partial requiring?

@Giuseppe: Interesting read. Incidentally, I read your article just yesterday while trying to find something faster at reading / writing CSV files than the Rcsv gem. (The Rcsv gem came out to be 5-6 times faster than the standard library csv at reading CSVs, according to this benchmark of mine.)

Regards,

Athitya Kumar

May 24, 2017, 1:32:09 PM

to Sameer Deshmukh, Victor Shepelev, SciRuby Mailing List, Giuseppe Cuccu, Pjotr Prins

Hey all. Sorry Sameer & Victor, I had overlooked this issue regarding the partial-requires discussion during today's video conference session and it struck me just now.

This is regarding multiple gems (say, faster_csv, csv, fastest_csv) that are used for the same format (say, csv) in both partial require (`require 'daru/io/importers/faster_csv'`) as well as full require (`require 'daru/io'`).

Now, according to today's video conference and the proposed architecture, we discussed having them as separate files like daru/io/importers/{csv, faster_csv, etc.}.rb, with the Daru::DataFrame#from_csv linkage, importer and helper functions in each file. This definitely works in the case of partial require, but poses a problem of these Daru::DataFrame#from_csv linkages overriding one another in the case of full require.

That is, while requiring daru/io, all the same-format different-gem files like csv.rb, faster_csv.rb, fastest_csv.rb will separately override Daru::DataFrame#from_csv. And as they're required by the main importers.rb file in a certain order, two (all but one) of these overrides will themselves be overridden, and only one Daru::DataFrame#from_csv linkage will survive.
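The overriding can be reproduced in a few lines. This is only a sketch: `Frame` is a hypothetical stand-in for Daru::DataFrame, and the two reopenings stand for what two separately required files would each do.

```ruby
# Hypothetical stand-in for Daru::DataFrame, to keep the example self-contained.
class Frame; end

# Imagine each reopening below living in its own file (csv.rb, faster_csv.rb, ...).
class Frame
  def self.from_csv(_path)
    :csv
  end
end

class Frame
  def self.from_csv(_path) # replaces the definition above
    :faster_csv
  end
end

Frame.from_csv('data.csv') # => :faster_csv
```

Whichever file is required last wins, so the surviving linkage depends purely on require order.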

This was what I had in mind while posing the idea that all different gems should be in the same csv.rb file, but as different modules (if required), or just different functions within the core CSV module. I'm imagining something like this -

```
# daru/io/importers/csv.rb

module Daru
  class DataFrame
    class << self
      def from_csv(path, opts = {}, &block)
        if opts[:gem].nil? || opts[:gem] == :csv # Default gem
          Daru::IO::Importers::CSV.load path, opts, &block
        elsif opts[:gem] == :faster_csv
          Daru::IO::Importers::FasterCSV.load path, opts, &block
        # ...
        end
      end
    end
  end
end

# Followed by Daru::IO::Importers::CSV#load, Daru::IO::Importers::CSVHelper
# Followed by Daru::IO::Importers::FasterCSV#load, Daru::IO::Importers::FasterCSVHelper (no specific order)
```

Please let me know if you have a better architecture in mind that works in both cases - partial requires as well as full require.

Regards,

Athitya Kumar

Pjotr Prins

May 24, 2017, 1:57:27 PM

to Athitya Kumar, Sameer Deshmukh, Victor Shepelev, SciRuby Mailing List, Giuseppe Cuccu

Hi Athitya,

I would not be too fixated on organisation of files etc. That is what
refactoring is for. Start simple is always my advice. People tend to
overengineer anyway ;)

Pj.


Giuseppe Cuccu

May 24, 2017, 5:03:33 PM

to Pjotr Prins, Athitya Kumar, Sameer Deshmukh, Victor Shepelev, SciRuby Mailing List

Hi Athitya,

- faster_csv is now integrated into the standard library ('csv')

- another alternative is 'smarter_csv', but for its features rather than performance. I would consider speed a priority ( => fastest_csv)

- 100% with Pjotr on underengineering :) get one thing to work (say `fastest_csv`), in one file, and refactor + expand later. You need to stay flexible, and ready to improve the architecture as you find new needs.

One last point: both the focus on file structure and the amount of indentation almost suggest a Python background :) consider something simpler for legibility, e.g.:

    class Daru::DataFrame
      def self.from_csv path, opts={}, &block
        importers = Daru::IO::Importers # local alias; a constant cannot be assigned inside a method body
        case opts[:gem]
        when nil, :csv
          importers::CSV.load path, opts, &block
        when :faster_csv
          importers::FasterCSV.load path, opts, &block
        # ...
        end
      end
    end

Remember: KISS is just as important as DRY (if not more so) :)

Have fun!

- Giuse

-- Giuseppe Cuccu
exascale.info/giuse

Victor Shepelev

May 26, 2017, 7:34:14 AM

to Giuseppe Cuccu, Pjotr Prins, Athitya Kumar, Sameer Deshmukh, SciRuby Mailing List

Giuseppe, we discussed the approach in yesterday's video meeting, and specifically voted against a "which library to use" option for `from_csv`.

The reason is: typically, in complicated data processing scripts, you don't think of it in terms of "oh, now I am reading CSV! which library should I use?", but rather in terms of "I will read a lot of complicated CSV files, so I include daru and configure it once to use some fancy CSV processor".

So, possible synopses of "configure Daru to use fastest_csv" could be:

1.

require 'daru/io'

require 'daru/io/fastest_csv' # not auto-required; this require checks `fastest_csv` gem availability

2.

require 'daru/io'

Daru::IO::CSV.use :fastest # does the same as (1) underneath: requires implementation, checks gem availability

Option (1) is in fact KISS. The only problem with this approach is that the gem should not just do `require 'daru/io/*'`, because some of the files stay optional and should not be required. I believe that's the problem Athitya is asking about.

But I, too, propose not to overthink it. For starters, just create a daru/io/fastest_csv file, and let daru/io.rb require one by one all the "default" implementations only, without the optional ones.
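A minimal sketch of the availability check that such an opt-in file could perform when it is required. The module name `OptionalBackend` and the method `load!` are hypothetical, not daru-io's code; the stdlib `csv` and a deliberately nonexistent feature name stand in for an installed and a missing gem.

```ruby
# Hypothetical helper an opt-in file (e.g. daru/io/fastest_csv.rb) might call at the top.
module OptionalBackend
  # Require a third-party library, turning a missing gem into a readable error.
  def self.load!(feature)
    require feature
    true
  rescue LoadError
    raise LoadError, "the optional '#{feature}' backend needs `gem install #{feature}`"
  end
end

OptionalBackend.load!('csv') # available, so this succeeds

begin
  OptionalBackend.load!('no_such_csv_backend') # stand-in for a gem that is not installed
rescue LoadError => e
  puts e.message
end
```

Because the check lives in the optional file, a plain `require 'daru/io'` would never touch the third-party gem.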

V.

Giuseppe Cuccu

May 26, 2017, 7:56:05 AM

to Victor Shepelev, Pjotr Prins, Athitya Kumar, Sameer Deshmukh, SciRuby Mailing List

Hi Victor,

I love option 1. And I agree it's the minimum prototype to go for, everything else would go into refactoring anyway.

It seems you and Sameer (and Athitya of course!) are running a tight ship. Consider anything I write as IMO, though I appreciate you keeping me in the loop. And I'm available, e.g. for pesky code reviews and such.

The only thing I'd like to contribute is my use case: I want to base my ML work on Daru, but FastestCSV is actually ridiculously faster, so the only way I can use it is if I have an option to tell Daru to use it. I sure don't mind adding `require 'daru/io/fastest_csv'` to my code :) it actually reminds me of `nmatrix/atlas`, I can easily consider it a SciRuby standard.

Thank you all so much!

- Giuse


Athitya Kumar

May 26, 2017, 11:37:47 AM

to Victor Shepelev, Sameer Deshmukh, Pjotr Prins, SciRuby Mailing List, Giuseppe Cuccu

Thanks for sharing your opinions, Pjotr, Victor & Giuseppe. I didn't mean to overemphasize the gem architecture, but rather wanted to discuss how the (partial) requiring should / will be used by the user.

I especially liked one thing about option (2), even though option (1) follows KISS: the ability of the user to switch the gem being used at ANY point in their codebase without worrying about the require order.

For example, in (1), if a user requires daru/io/smarter_csv and later requires daru/io/fastest_csv, the user can no longer use the smarter_csv and default csv gems for the rest of their code in that "session". This might be undesirable behavior, especially when using daru / daru-io with Rails, where simultaneous usage of different gems in a single "session" matters. This is something I wanted to point out before settling on whether to continue with (1) or (2).
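To make the switching behaviour of option (2) concrete, here is a minimal sketch. `CsvSwitch`, its `use` and `parse` methods, and the two backends are all hypothetical illustrations, not the proposed `Daru::IO::CSV.use` implementation; a naive split-based parser stands in for a second gem.

```ruby
require 'csv'

# Hypothetical sketch of a `use`-style switch; names are illustrative only.
module CsvSwitch
  BACKENDS = {
    default: ->(text) { CSV.parse(text) },
    naive:   ->(text) { text.each_line.map { |line| line.chomp.split(',') } }
  }.freeze

  @current = :default

  class << self
    attr_reader :current

    # Select the backend used by later calls to `parse`.
    def use(name)
      raise ArgumentError, "unknown CSV backend: #{name.inspect}" unless BACKENDS.key?(name)

      @current = name
    end

    def parse(text)
      BACKENDS.fetch(@current).call(text)
    end
  end
end

CsvSwitch.parse("a,1\n") # parsed by the stdlib backend
CsvSwitch.use :naive
CsvSwitch.parse("a,1\n") # parsed by the naive backend, same session
CsvSwitch.use :default   # and back again, independent of require order
```

One trade-off worth noting: a process-wide switch like this is shared by every thread, so in a multi-threaded server a `use` call in one request would affect the others.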

Athitya Kumar

May 30, 2017, 5:05:21 AM

to Victor Shepelev, Sameer Deshmukh, Pjotr Prins, SciRuby Mailing List, Giuseppe Cuccu

Dear Victor & Sameer,

I just wanted to bring the above mail to your attention. Please read through the mail and share your opinions on whether we can go ahead with (1) or (2). I've currently set up the repository here as per (2) (with dummy methods for now), but we can certainly refactor the codebase if option (1) is preferred.

Regards,
Athitya Kumar

Giuseppe Cuccu

May 30, 2017, 5:48:17 AM

to Athitya Kumar, Victor Shepelev, Sameer Deshmukh, Pjotr Prins, SciRuby Mailing List

Dear Athitya,

Option (2) offers 2 features: (a). library loading, and (b). hot swapping. I have not seen feature (b) being requested yet.

Placeholder files are unnecessary (and shackling) in Ruby. You might consider populating this file with feature (a) as a solid starting point.

Cheers,

- Giuse


Sameer Deshmukh

May 30, 2017, 11:40:55 AM

to Giuseppe Cuccu, Athitya Kumar, Victor Shepelev, Pjotr Prins, SciRuby Mailing List

Athitya,

Does that answer your question?

Regards,

Sameer Deshmukh

Athitya Kumar

May 30, 2017, 11:53:36 AM

to Sameer Deshmukh, Victor Shepelev, Giuseppe Cuccu, Pjotr Prins, SciRuby Mailing List

Yes Sameer, thanks for asking.

As per Giuseppe's previous reply, I'd like to continue with option (2), also keeping in mind the issues that option (1) might create with tests (as the sequence of requiring non-default libraries matters).
