Map Reduce Question

Valeriob

Apr 10, 2012, 9:16:28 AM
to rav...@googlegroups.com
Given the following model and indexes, the store contains 6M Parents with about 2 Children each.
Map_Reduce_Parent runs quite fast for the amount of data stored, while Map_Reduce_Children has been running for days and I do not know when it will finish :D
I guess the SelectMany inside the map/reduce is the real killer. My question is: how does mapping work? What is Raven actually doing in that case?

Thanks
Valerio
 
    public class Parent
    {
        public List<Child> Children { get; set; }
        public string Attribute_P { get; set; }
    }

    public class Child
    {
        public string Attribute_C { get; set; }
        public double Amount { get; set; }
    }

    public class Result
    {
        public string Attribute { get; set; }
        public double Amount { get; set; }
        public int Count { get; set; }
    }

    public class Map_Reduce_Children : AbstractIndexCreationTask<Parent, Result>
    {
        public Map_Reduce_Children()
        {
            Map = docs => from parent in docs.SelectMany(a => a.Children)
                           select new { Attribute = parent.Attribute_C, Count = 1 };
 
            Reduce = results => from result in results
                                group result by new { result.Attribute }
                                into g
                                select new { g.Key.Attribute, Amount = g.Sum(r => r.Amount), Count = g.Sum(r => r.Count) };
        }
    }
 
    public class Map_Reduce_Parent : AbstractIndexCreationTask<Parent, Result>
    {
        public Map_Reduce_Parent()
        {
            Map = docs => from parent in docs
                          select new { Attribute = parent.Attribute_P, Count = 1 };
 
            Reduce = results => from result in results
                                group result by new { result.Attribute }
                                into g
                                select new { g.Key.Attribute, Amount = g.Sum(r => r.Amount), Count = g.Sum(r => r.Count) };
        }
    }

Matt Warren

Apr 10, 2012, 9:35:12 AM
to rav...@googlegroups.com
What build number are you using? Is it an older build?

Later builds brought a lot of speed improvements for indexing.

Also, you are now forced to have the output of your Map statement match the output of your Reduce, so it should now look like this:

    Map = docs => from parent in docs
                  select new { Attribute = parent.Attribute_P, Count = 1, Amount = 0 };
 
    Reduce = results => from result in results
                  group result by new { result.Attribute }
                  into g
                  select new { g.Key.Attribute, Amount = g.Sum(r => r.Amount), Count = g.Sum(r => r.Count) };
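
The same shape rule can be applied to the children index from the first post. Below is a minimal in-memory sketch using plain LINQ to Objects, not the RavenDB index API. It assumes the Parent, Child, and Result classes from the original post; the `ShapeSketch`, `MapChildren`, and `Reduce` names are hypothetical, and emitting `Amount = child.Amount` in the map is an assumption about what the index is meant to sum.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Child
{
    public string Attribute_C { get; set; }
    public double Amount { get; set; }
}

public class Parent
{
    public List<Child> Children { get; set; }
    public string Attribute_P { get; set; }
}

public class Result
{
    public string Attribute { get; set; }
    public double Amount { get; set; }
    public int Count { get; set; }
}

// Hypothetical helper, plain LINQ to Objects (not a RavenDB index definition).
public static class ShapeSketch
{
    // Map: one entry per child, carrying every field that Reduce outputs.
    // Assumption: the child's Amount is the value the index should sum.
    public static IEnumerable<Result> MapChildren(IEnumerable<Parent> docs) =>
        from child in docs.SelectMany(p => p.Children)
        select new Result { Attribute = child.Attribute_C, Amount = child.Amount, Count = 1 };

    // Reduce: input and output have the same shape, so it can run over its own output.
    public static IEnumerable<Result> Reduce(IEnumerable<Result> results) =>
        from result in results
        group result by result.Attribute into g
        select new Result
        {
            Attribute = g.Key,
            Amount = g.Sum(r => r.Amount),
            Count = g.Sum(r => r.Count)
        };
}
```

Because `MapChildren` and `Reduce` both produce `Result`, calling `ShapeSketch.Reduce(ShapeSketch.Reduce(ShapeSketch.MapChildren(docs)))` gives the same totals as a single `Reduce` pass.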

Matt Warren

Apr 10, 2012, 9:37:34 AM
to rav...@googlegroups.com
> Given the following model and indexes, the store contains 6M Parents with about 2 Children each.
> Map_Reduce_Parent runs quite fast for the amount of data stored, while Map_Reduce_Children has been running for days and I do not know when it will finish :D
> I guess the SelectMany inside the map/reduce is the real killer. My question is: how does mapping work? What is Raven actually doing in that case?

RavenDB runs the Map statement over all the docs in the data store and then writes the intermediate results back into the data store. Then it takes batches of those results and runs the Reduce statement over them.

So the SelectMany will make a difference, but it shouldn't take days where the index without it takes hours. If you look at the "/stats" HTTP endpoint, what does it tell you? Are there any errors, and how many docs has it indexed?
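
The batch step described above can be sketched in memory. This is a rough illustration with plain LINQ to Objects, not RavenDB's actual implementation: the `BatchedReduceSketch` class, its method names, and the `batchSize` parameter are all hypothetical, and the mapped entries are modelled as simple tuples.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory illustration; not how RavenDB stores or schedules its work.
public static class BatchedReduceSketch
{
    // Reduce over (Attribute, Amount, Count) entries; input and output share one shape.
    public static List<(string Attribute, double Amount, int Count)> Reduce(
        IEnumerable<(string Attribute, double Amount, int Count)> entries) =>
        entries
            .GroupBy(e => e.Attribute)
            .Select(g => (Attribute: g.Key, Amount: g.Sum(e => e.Amount), Count: g.Sum(e => e.Count)))
            .ToList();

    // Reduce each batch of mapped results on its own, then reduce the partial results again.
    public static List<(string Attribute, double Amount, int Count)> ReduceInBatches(
        IReadOnlyList<(string Attribute, double Amount, int Count)> mapped, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var partials = new List<(string Attribute, double Amount, int Count)>();
        for (int start = 0; start < mapped.Count; start += batchSize)
        {
            partials.AddRange(Reduce(mapped.Skip(start).Take(batchSize)));
        }
        return Reduce(partials);
    }
}
```

Reducing in batches only gives the same totals as a single pass because the Reduce input and output have the same fields, which is why the Map output has to match the Reduce output.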

Valeriob

Apr 10, 2012, 10:32:02 AM
to rav...@googlegroups.com
I'm on build 888, and yes, the Amount in the mapping got lost while translating from my domain to the generic model,
but that is not the problem :D

The error log is empty. This is a piece of the log that I considered significant; does it ring any bells?

Consider Articolo = Attribute in the previous example.
It takes 4 minutes between two "Found 512 mapped results for keys" entries.
It uses up to 1.5 GB of memory (maybe that's good).
On every shutdown it has to fix the Lucene indexes: "Unclean shutdown detected on Peso/Per/Articolo, checking the index for errors. This may take a while."
I hope this info helps!

Thanks
Valerio

Matt Warren

Apr 10, 2012, 11:12:59 AM
to rav...@googlegroups.com
How are you shutting down your app? It seems strange that it doesn't shut down cleanly every time and so the indexes have to be fixed.

You might be running into the problem in this thread, but from what I remember that was the opposite scenario, i.e. the indexes not being checked when they should be.

In the Lucene index directory, is there a file called "write.lock" after a shutdown?

Valeriob

Apr 10, 2012, 12:13:20 PM
to rav...@googlegroups.com
Nope, there is not, but I get this on restart: http://img404.imageshack.us/img404/7848/startupi.png
I don't think this is related to the poor performance.

Thanks
Valerio

Itamar Syn-Hershko

Apr 10, 2012, 8:12:16 PM
to rav...@googlegroups.com
So how _do_ you shut it down?

Valeriob

Apr 11, 2012, 2:50:54 AM
to rav...@googlegroups.com
I just press q.

Valerio

Oren Eini (Ayende Rahien)

Apr 11, 2012, 4:09:04 AM
to rav...@googlegroups.com
Hm,
That should do an orderly shutdown.
Is this reproducible for you?

Valeriob

Apr 11, 2012, 4:46:55 AM
to rav...@googlegroups.com
Hi Oren,
I'll try to build a test solution to reproduce the behavior and let you know!

Valerio

Oren Eini (Ayende Rahien)

Apr 11, 2012, 4:59:53 AM
to rav...@googlegroups.com
Thanks