parallel speedup more than the number of cores!!

arijit

Apr 22, 2011, 4:16:29 PM
to UCSB Computer Science 240A Spring 2011
Hi,
In question 3 of HW 3, we set the number of cores to 4; however, the
parallel speedup is more than 4 for certain coarseness values. For
example, when the input vector size is 10^6 and the coarseness is 500,
the speedup for rec_cilkified is about 7.8. This is how we are running
the code:

qsub -I -l walltime=00:30:00 -l nodes=1:ppn=4
./innerproduct 1000000
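
For context, rec_cilkified follows the usual divide-and-conquer
pattern. A minimal sketch, assuming Cilk Plus keywords (the names and
signature here are illustrative, not the exact assignment code):

#include <cilk/cilk.h>

/* Recursive inner product: split the range in half until it is at
   most `coarseness` elements, then fall back to a serial loop.
   Sketch only; the real assignment code may differ. */
double rec_cilkified(const double *a, const double *b,
                     int n, int coarseness)
{
    if (n <= coarseness) {          /* base case: serial ddot */
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }
    int half = n / 2;
    double left = cilk_spawn rec_cilkified(a, b, half, coarseness);
    double right = rec_cilkified(a + half, b + half,
                                 n - half, coarseness);
    cilk_sync;                      /* wait for the spawned half */
    return left + right;
}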

Are we doing something wrong, or is a speedup of more than 4 expected
for certain coarseness values?

Thanks,
Arijit

Lijie Ren

Apr 23, 2011, 12:18:33 AM
to UCSB Computer Science 240A Spring 2011
I get speedup>3 on 1 core(!) and speedup=23 on 4 cores.

-Lijie

Lijie Ren

Apr 23, 2011, 12:29:01 AM
to UCSB Computer Science 240A Spring 2011
Never mind; my code was wrong...

-Lijie

Matt Weiden (Reader)

Apr 23, 2011, 12:31:46 AM
to UCSB Computer Science 240A Spring 2011
I would guess that this has something to do with simultaneous
multithreading in the Nehalem processors. Actually Nehalem is
mentioned in the following article:

http://en.wikipedia.org/wiki/Simultaneous_multithreading

Intel rebranded this functionality as "Hyper-Threading". Professor
Sherwood calls this relabeling "Marketecture".
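
One quick sanity check is to print how many workers the Cilk runtime
actually starts: if the worker count isn't pinned explicitly, the
runtime typically defaults to one worker per hardware thread the node
exposes, which with Hyper-Threading can be more than the 4 cores
requested from qsub. A sketch, assuming the Intel Cilk Plus runtime
API (with the older Cilk++ the call is cilk::current_worker_count()
instead):

#include <stdio.h>
#include <cilk/cilk_api.h>

int main(void)
{
    /* Number of workers the runtime will use; compare this
       against the ppn=4 requested in the qsub line. */
    printf("Cilk workers: %d\n", __cilkrts_get_nworkers());
    return 0;
}

If this prints more than 4, the parallel runs are using extra
hardware threads and the measured speedups will be inflated.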

Matt

Kyle Klein

Apr 24, 2011, 1:10:04 PM
to ucsb-computer-scien...@googlegroups.com
Has anyone been able to get reasonable speedups on the 32-core PDAF nodes? Whenever I run on them the performance is extremely poor, even when set to 4 cores, compared to the single 8-core nodes, and little changes at 32 cores.

best,
Kyle
--
Kyle Klein
Ph.D. Student
Department of Computer Science, UC Santa Barbara

Kevin Francis

Apr 24, 2011, 1:17:11 PM
to ucsb-computer-scien...@googlegroups.com
Hi Kyle,

I saw the same problem with the PDAF nodes.  The regular nodes showed better results.

Regards,
Kevin.

Kyle Klein

Apr 24, 2011, 1:28:01 PM
to ucsb-computer-scien...@googlegroups.com
Glad it's not just me. Matt / Professor, unless we are doing something wrong (someone please speak up if you aren't having this issue), should we even bother reporting these results? 

John Gilbert

Apr 24, 2011, 4:27:20 PM
to ucsb-computer-scien...@googlegroups.com
The PDAF nodes have different processors (AMD Shanghai vs. Intel Nehalem), and probably different memory subsystems too; that might be part of the reason for what you're seeing.

In all cases ddot is presumably limited by bandwidth to memory, not by processor speed.
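
(Back of the envelope, with assumed numbers: a ddot on two vectors of 10^6 doubles moves 16 MB of data for 2*10^6 flops, i.e. 1/8 flop per byte, so a node sustaining, say, 10 GB/s of memory bandwidth tops out around 1.25 Gflop/s no matter how many cores are working.)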

You can still scale up to larger problems on PDAF, right, since there's more memory?  Does the flop rate of ddot stop going up beyond 4 or 8 cores even when you fill all the memory you can get with the vectors?

Please go ahead and submit both regular-node and PDAF-node results if you have them; it will be interesting to try to figure out what's really going on, and it might be useful data for people who are thinking of buying or using large-memory nodes (which includes both CNSI and my lab here).  However, we'll count that as extra credit; it's OK if you just submit regular-node runs.

Cheers,

- John