reducing - bug or mistake?


Roland Gude

Jan 10, 2012, 10:23:03 AM
to peregrine...@googlegroups.com
Hi,

I am still trying to do some simple evaluations of peregrine.

While doing so I encountered some strange behaviour.

Given a mapper that does this for a set of numbers:

    private void doEmit(long a, long b) {
        if (a != b) {
            long min = Math.min(a, b);
            long max = Math.max(a, b);
            emit(new StructWriter(24).writeHashcode(min).writeLong(min).writeLong(max).toStructReader(), StructReaders.wrap(1));
        }
    }
        
and a reducer that reduces them like this:

    @Override
    public void reduce(StructReader key, List<StructReader> values) {
        final long hashcode = key.readLong();

        final long a = key.readLong();
        final long b = key.readLong();

        // Sum the per-pair counts emitted by the mapper.
        long sum = 0L;
        for (StructReader val : values) {
            sum += val.readInt();
        }
        log.info("reducing key of length %d hash %d min %d max %d, numValues was %d.", key.length(), hashcode, a, b, values.size());
        // Emit the pair in both directions together with its count.
        emit(new StructWriter(16).writeHashcode(a).writeLong(a).toStructReader(), new StructWriter(16).writeLong(b).writeLong(sum).toStructReader());
        emit(new StructWriter(16).writeHashcode(b).writeLong(b).toStructReader(), new StructWriter(16).writeLong(a).writeLong(sum).toStructReader());
    }

I can see in the logs that the reduce function is called multiple times for the same key, each time with a subset of the values for that key, instead of once with all of its values.
Of course this completely screws up the expected results.

I am not sure whether this is due to an error in my code or a bug in peregrine.

Kevin Burton

Jan 10, 2012, 1:24:09 PM
to peregrine...@googlegroups.com
Oh … this is a simple fix.

Make your key a unique hash code representing the record.

Then put the rest in the value.

Right now keys must be exactly 8 bytes.

I am going to add support for variable width keys which shouldn't be too hard but the initial algorithm was easier to code as fixed width.

That should fix the problem.
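
For readers following along, here is a minimal standalone sketch of that layout: an 8-byte hash as the key and everything else in the value. It deliberately does not use Peregrine's classes. The names `PairKeySketch`, `hashKey` and `value` are hypothetical, and taking the first 8 bytes of an MD5 digest is only an assumption for illustration; how Peregrine's own `writeHashcode` derives its hash is not stated in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Standalone sketch; it does not use Peregrine's classes.
final class PairKeySketch {

    // Hypothetical helper: derive a fixed 8-byte key from the ordered pair
    // by taking the first 8 bytes of an MD5 digest of "min/max".
    static long hashKey(long a, long b) {
        long min = Math.min(a, b);
        long max = Math.max(a, b);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest((min + "/" + max).getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Value layout (an assumption): min (8 bytes), max (8 bytes), count (4 bytes).
    static byte[] value(long a, long b, int count) {
        return ByteBuffer.allocate(20)
                .putLong(Math.min(a, b))
                .putLong(Math.max(a, b))
                .putInt(count)
                .array();
    }
}
```

Because the pair is ordered before hashing, (3, 7) and (7, 3) map to the same key, and the pair itself travels in the value so the reducer can still read it.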

Kevin
--

Founder/CEO Spinn3r.com

Location: San Francisco, CA
Skype: burtonator

Skype-in: (415) 871-0687


Roland Gude

Jan 10, 2012, 2:40:16 PM
to peregrine...@googlegroups.com
I guess this should be in the javadocs, and the logger should give a warning.

Thanks die the gelöst.

Roland Gude

Jan 10, 2012, 2:41:58 PM
to peregrine...@googlegroups.com
Damn autocorrection.

Last sentence should've been
Thanks for the help

Kevin Burton

Jan 10, 2012, 3:27:58 PM
to peregrine...@googlegroups.com
It's documented in the source … I also just committed a branch that won't let you do this any more.  

I'll back it out once we add support for variable width keys.

Using a URL as a key for example I think would be reasonable… then we would compute the hash code from the URL at runtime.

Kevin



Roland Gude

Jan 12, 2012, 6:07:54 PM
to peregrine...@googlegroups.com
Everything works now, but I kind of don't like the limitation.

Kevin Burton

Jan 12, 2012, 6:31:53 PM
to peregrine...@googlegroups.com
Is it biting you now?  I think it should be pretty easy to back out / fix now… 

But honestly I wasn't expecting it to bite anyone for typical peregrine optimal jobs.

URLs as keys would be nice though…

Kevin




Roland Gude

Jan 13, 2012, 4:24:31 PM
to peregrine...@googlegroups.com
No, it's not a real problem.
It just strikes me as counterintuitive for most things I tried right now.

For instance, I want to build some correlations between numbers which appear together in a context.
So when I find a correlation I emit it like this:
emit(hash(a+"/"+b), [a,b,1])

This leads to reduce code like this:

check if "a" is still the same
check if "b" is still the same

add 1 to their correlation


If I could emit like this:
emit([a,b],1)

the reducer would be much less code.

But it's OK I guess.
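
The extra reducer work described above can be sketched roughly as follows. This is a standalone illustration, not Peregrine code: `HashedKeyReduceSketch` and its `reduce` method are hypothetical, and the 20-byte value layout (a, b, count) is an assumption. Since different pairs could in principle share one hashed key, the values are grouped by their (a, b) pair before summing.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch; it does not use Peregrine's classes.
final class HashedKeyReduceSketch {

    // Each value is assumed to hold: a (8 bytes), b (8 bytes), count (4 bytes).
    // Returns the summed count per (a, b) pair seen under one hashed key.
    static Map<List<Long>, Long> reduce(List<byte[]> values) {
        Map<List<Long>, Long> sums = new LinkedHashMap<>();
        for (byte[] raw : values) {
            ByteBuffer buf = ByteBuffer.wrap(raw);
            long a = buf.getLong();
            long b = buf.getLong();
            int count = buf.getInt();
            // Grouping by the pair replaces the manual "is a/b still the same" checks.
            sums.merge(Arrays.asList(a, b), (long) count, Long::sum);
        }
        return sums;
    }
}
```

With a composite `[a,b]` key the grouping step would disappear and the reducer would only need the summing loop.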

Kevin Burton

Jan 13, 2012, 5:45:45 PM
to peregrine...@googlegroups.com
Sure… but I think your job will be slower… I guess it depends on the size of a+b though… 

computing hashcodes is pretty fast… writing 2x more data is pretty expensive. :-( 

BTW… I did an audit and I think there is just one place where the fixed width requirement is in place.

I might just play with removing it in a branch and see if it is easy.

Should be…

This is one of the main reasons I wanted peregrine to be small and tight. I can keep the entire stack in my head at this point.

Well, almost :)

Roland Gude

Jan 16, 2012, 4:51:06 PM
to peregrine...@googlegroups.com
Hmm, why would this mean that more data is written?
In my case I think I would write less data.

Currently:
key: 8 bytes for the hashcode
value: 8 bytes for A, 8 bytes for B and 4 bytes for an integer 1

sum: 28 bytes

If a and b could go into the key:

key: 8 bytes for A, 8 bytes for B
value: 4 bytes for the int 1

sum: 20 bytes

Sure, this could be optimized by better serialization, but still I don't see why it is more data.
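
The same arithmetic as a small Java sketch, using the JDK's own size constants. The class and method names are hypothetical, and the figures only count the raw field widths listed above; any framing or per-record overhead Peregrine may add is not known from this thread and is left out.

```java
// Standalone sketch of the per-record byte counts discussed above.
final class RecordSizeSketch {

    // Hashed key: 8-byte hashcode as key; A, B and the int count in the value.
    static int hashedKeyRecordBytes() {
        int key = Long.BYTES;
        int value = Long.BYTES + Long.BYTES + Integer.BYTES;
        return key + value;
    }

    // Composite key: A and B in the key; only the int count in the value.
    static int compositeKeyRecordBytes() {
        int key = Long.BYTES + Long.BYTES;
        int value = Integer.BYTES;
        return key + value;
    }
}
```

Under these assumptions the hashed-key record is 8 bytes larger per emitted pair, which is exactly the width of the hashcode.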


Kevin Burton

Jan 17, 2012, 6:54:19 PM
to peregrine...@googlegroups.com
hm… ok.  Well I will work on getting variable width keys here shortly anyway.