My Wish List

Eric Springer

Apr 8, 2013, 9:00:21 PM

to scoob...@googlegroups.com

I've been doing a fair bit of Scoobi lately, and it's been extremely useful. I've been able to implement map-reduce versions of algorithms fast enough that people don't believe I did it (and think I must've found some off-the-shelf software to do it). So major props.

But in terms of my wish list:

-- Speed --

1. Faster in-memory mode. I love the in-memory mode, but it's a couple of orders of magnitude slower than I'd like.

2. Snapshotting. [I remember seeing some commits on this, but I didn't see anything in the user guide or examples.] Anyway, that'd be totally awesome.

-- Awesomeness --

3. An implementation of .flatMap would be kick-ass, especially if you could use it for joins and filters when working on multiple DLists. [Although maybe that isn't really possible without doing cartesian joins, which is obviously not realistic.]

-- API --

4. The single-pass Iterable gets me every time.

5. Grouping[K] is a bit of a mess, and I'd like to see it removed entirely. (See my suggestion in that bug, which involves requiring only an ordering, plus a special function for secondary sorting.)

6. .groupByKey should return a DMap[K,V] imo, both to state the uniqueness constraint explicitly and because constantly writing .map { case (k,v) => .... } gets tedious.
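
[Editor's sketch for item 6] The convenience being asked for can be illustrated with plain Scala collections. This is not Scoobi code: Grouped, groupByKey and mapValues below are hypothetical names invented for the illustration, and nothing here says how a real DMap would be implemented. The idea is only that a map-like result type states "one entry per key" and lets you transform values without unpacking (k, v) pairs by hand.

```scala
object GroupedSketch {
  // Hypothetical stand-in for a grouped result: exactly one entry per distinct key.
  final case class Grouped[K, V](entries: Map[K, Vector[V]]) {
    // Transform the values of every key; the caller never pattern matches on (k, v).
    def mapValues[W](f: Vector[V] => W): Map[K, W] =
      entries.map { case (k, vs) => k -> f(vs) }
  }

  // Hypothetical in-memory group-by-key over plain pairs (not a Scoobi call).
  def groupByKey[K, V](pairs: Seq[(K, V)]): Grouped[K, V] =
    Grouped(pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).toVector })

  // Sum the values for each key.
  val counts: Map[String, Int] =
    groupByKey(Seq("a" -> 1, "b" -> 2, "a" -> 3)).mapValues(_.sum)
}
```

With a plain list of pairs, the same step would need .map { case (k, vs) => (k, vs.sum) } at every call site.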

Eric Torreborre

Apr 8, 2013, 10:14:18 PM

to scoob...@googlegroups.com

Hi Eric,

1. Speed

My understanding is that you have some ideas on how to use Streams instead of Vector as a backend. Are you thinking of just a collection substitution or a more elaborate change?

2. Snapshotting

This is indeed implemented and documented here. It is definitely not battle-tested so please give it a go and report any issues you might find.

3. FlatMap

Interesting point. I recently had the idea that there is some kind of affinity between flatMaps and joins. The reason is that "join" in things like Slick is implemented with flatMap and filter. So I'm wondering if we could go the other way, flatMap -> join + filter, and stay efficient.
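
[Editor's sketch] To make the flatMap/join affinity concrete, here is a minimal example on plain Scala collections. It uses neither Scoobi DLists nor Slick; the User and Order types and the sample data are made up for the illustration. It shows a join written as a for-comprehension, and the same join spelled out with flatMap and filter.

```scala
object JoinSketch {
  // Hypothetical record types, for illustration only.
  final case class User(id: Int, name: String)
  final case class Order(userId: Int, item: String)

  val users: List[User]   = List(User(1, "ann"), User(2, "bob"))
  val orders: List[Order] = List(Order(1, "book"), Order(1, "pen"), Order(3, "cup"))

  // A join as a for-comprehension; the compiler rewrites this into
  // flatMap / withFilter / map calls on the collections.
  val joined: List[(String, String)] =
    for {
      u <- users
      o <- orders
      if o.userId == u.id
    } yield (u.name, o.item)

  // The same join written out by hand with flatMap and filter.
  val joinedExplicit: List[(String, String)] =
    users.flatMap { u =>
      orders.filter(_.userId == u.id).map(o => (u.name, o.item))
    }
}
```

Note that this naive form still examines every (user, order) pair. Going from flatMap back to an efficient join would need the join condition to be visible to the framework rather than hidden inside an arbitrary function.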

4. Single pass

We're still thinking about this. A solution might be to provide a more elaborate API where we statically enforce that one pass will be done OR that you might load all elements in memory.

5. Grouping

We also had this discussion along the same lines, but we don't have a clear plan yet.

6. GroupByKey return type

Tony is working on this.

In terms of release, the objective for now is to get a robust 0.7.0 out (hopefully we'll have an RC release soon), then to deal with those items for 0.8.0.

Cheers,

Eric.

Eric Springer

Apr 8, 2013, 10:27:08 PM

to scoob...@googlegroups.com

On Mon, Apr 8, 2013 at 7:14 PM, Eric Torreborre <etorr...@gmail.com> wrote:

> Hi Eric,

> 1. Speed

> My understanding is that you have some ideas on how to use Streams instead of Vector as a backend. Are you thinking of just a collection substitution or a more elaborate change?

No ideas really. I just prototype using Scala streams before I port it to Scoobi, and that is generally a lot faster. If I get time, I'll look into what is causing Scoobi's in-memory mode to be slow, and that should give us a better idea.

> 2. Snapshotting

> This is indeed implemented and documented here. It is definitely not battle-tested so please give it a go and report any issues you might find.

Awesome! I look forward to giving that a go.

> 3. FlatMap

> Interesting point. I recently had the idea that there is some kind of affinity between flatMaps and joins. The reason is that "join" in things like Slick is implemented with flatMap and filter. So I'm wondering if we could go the other way, flatMap -> join + filter, and stay efficient.

I'm not sure how it would work, but it would be awesome. I suspect you will need some very fancy filter that doesn't take arbitrary functions (so you can look at how they are filtering, and not pointlessly do the cross product).

> 4. Single pass

> We're still thinking about this. A solution might be to provide a more elaborate API where we statically enforce that one pass will be done OR that you might load all elements in memory.

Yeah, that would be awesome. It could even be a feature of the DMap xD. The big problem now is that the API (like Hadoop's) lies and gives you the idea that it's multiple-pass, and that has some really annoying consequences. Is pattern matching on the head safe? (Nope.) Is .toSet safe? (I think so, lol.)
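
[Editor's sketch] The single-pass trap can be reproduced with plain Scala, without Scoobi or Hadoop. The singlePass helper below is hypothetical: it builds an Iterable that hands out the same underlying Iterator on every traversal, which is an assumption about how such a value might behave, not a description of Scoobi's actual implementation.

```scala
object SinglePassSketch {
  // Hypothetical helper: looks like an ordinary Iterable, but every traversal
  // shares one underlying Iterator, so the elements can only be seen once.
  def singlePass[A](elems: A*): Iterable[A] = {
    val underlying = elems.iterator
    new Iterable[A] { def iterator: Iterator[A] = underlying }
  }

  // Unsafe: head consumes the first element, so the fold only sees the rest.
  def headThenSum(values: Iterable[Int]): (Int, Int) = {
    val first = values.head
    val total = values.foldLeft(0)(_ + _)
    (first, total)
  }

  // Safe, at the cost of memory: materialise once, then traverse freely.
  def headThenSumBuffered(values: Iterable[Int]): (Int, Int) = {
    val buffered = values.toVector
    (buffered.head, buffered.foldLeft(0)(_ + _))
  }
}
```

On singlePass(1, 2, 3) the first version silently loses the head in its total, while the buffered version sees all three elements. A single call such as .toSet or .toVector is fine because it traverses exactly once.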

> 5. Grouping

> We also had this discussion along the same lines, but we don't have a clear plan yet.

> 6. GroupByKey return type

> Tony is working on this.

> In terms of release, the objective for now is to get a robust 0.7.0 out (hopefully we'll have an RC release soon), then to deal with those items for 0.8.0.

All sounds very promising! BTW, one feature I miss in 0.7.0 (unless it's getting drowned in the logs and I haven't noticed) is the "Running job N of M" message. It's useful if only for telling people how far through you are, and how awesome you are for creating complex map-reduce job chains so easily :d

And I have been known, on occasion, to do some really stupid things that ended up creating a thousand map-reduce jobs (like looping through a bunch of things, thinking they'd all get fused together). So it'd be nice to get a bit of an early heads-up.

Cheers,

Eric.


Eric Torreborre

Apr 8, 2013, 11:00:45 PM

to scoob...@googlegroups.com

> "Running job N of M"

Did you try the DEBUG logs? Maybe I should level this up to INFO but it displays things like:

====== layers of computation in the graph ======

[DEBUG] HadoopMode - Executing layers

Layer(1

ParallelDo (189)[(String,Traversable[Int]),(Int,Int),(((Unit,Unit),Unit),Unit)] (bridge 6968c))

Layer(2

ParallelDo (192)[(Int,Int),Int,Unit] (bridge 72d65))

Layer(3

ParallelDo (241)[(String,Int),((String,Boolean),Either[au.com.cba.omnia.cheetah.feature.flattenCheeta$$anon$1@1badd463,Float]),(((Int,Unit),Unit),Unit)] (bridge f7936))

===== Mscrs for each layer ======

[DEBUG] HadoopMode - executing layer 1

[DEBUG] HadoopMode - Executing layer

Layer(1

ParallelDo (189)[(String,Traversable[Int]),(Int,Int),(((Unit,Unit),Unit),Unit)] (bridge 6968c))

[DEBUG] HadoopMode - executing mscrs

Mscr(1

inputs: GbkInputChannel(Load (1)[String] (TextSource(1)

/user/petterja/CHEETAH_FEAT_DATA_CON

))

mappers

ParallelDo (112)[au.com.cba.omnia.cheetah.feature.flattenCheeta$$anon$1@3f3a0212,(String,Int),(Unit,Unit)] (bridge ed8f5)

ParallelDo (76)[String,au.com.cba.omnia.cheetah.feature.flattenCheeta$$anon$1@3f3a0212,(Unit,Unit)] (bridge ac0ff)

last mappers

ParallelDo (112)[au.com.cba.omnia.cheetah.feature.flattenCheeta$$anon$1@3f3a0212,(String,Int),(Unit,Unit)] (bridge ed8f5)

outputs: GbkOutputChannel(GroupByKey (113)[String,Int] (bridge 7fbf4), reducer = ParallelDo (189)[(String,Traversable[Int]),(Int,Int),(((Unit,Unit),Unit),Unit)] (bridge 6968c)))

===== Execution of each Mscr ======

[DEBUG] MapReduceJob - executing MSCR 1 on layer 1

And so on.

E.

Eric Springer

Apr 8, 2013, 11:44:02 PM

to scoob...@googlegroups.com

On Mon, Apr 8, 2013 at 8:00 PM, Eric Torreborre <etorr...@gmail.com> wrote:

> Did you try the DEBUG logs? Maybe I should level this up to INFO but it displays things like:

+1 on moving it up to INFO. DEBUG is way too noisy for anything but debugging. I'd normally say that progress should be at the DEBUG level, but since you can't check the job tracker for how far through you are, it's very useful. Lots of times I've started jobs and had no idea whether it was going to be 2 more hours or 20 more hours :D

Eric Torreborre

Apr 9, 2013, 2:34:08 AM

to scoob...@googlegroups.com

I just moved some of the execution information to the INFO level (available in the latest SNAPSHOT).

E.

Eric Torreborre

Apr 15, 2013, 4:43:21 AM

to scoob...@googlegroups.com

Re: in-memory mode speed, I just realised that way too much data was being saved to disk, so with the latest snapshot (pending Jenkins) performance should be better.