The PDAF nodes have different processors (AMD Shanghai vs. Intel Nehalem), and probably different memory subsystems too; that might be part of the reason for what you're seeing.
In all cases ddot is presumably limited by bandwidth to memory, not by processor speed.
You can still scale up to larger problems on PDAF, right, since there's more memory? Does the flop rate of ddot stop going up beyond 4 or 8 cores even when you fill all the available memory with the vectors?
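To make the bandwidth argument concrete, here is a rough back-of-envelope sketch (mine, not a measurement). It assumes double-precision vectors that are streamed from main memory with no cache reuse, so each loop iteration moves 16 bytes and does 2 flops. The function names and the bandwidth figures in the loop are hypothetical placeholders, not numbers for either node type.

```python
def ddot_bytes_per_flop(word_bytes=8):
    """Memory traffic per flop for ddot, assuming no cache reuse.

    Each iteration loads x[i] and y[i] (2 words) and performs
    one multiply and one add (2 flops).
    """
    bytes_per_iteration = 2 * word_bytes
    flops_per_iteration = 2
    return bytes_per_iteration / flops_per_iteration


def ddot_gflops_bound(bandwidth_gb_per_s, word_bytes=8):
    """Upper bound on the ddot rate (GFlop/s) for a given memory bandwidth (GB/s)."""
    return bandwidth_gb_per_s / ddot_bytes_per_flop(word_bytes)


if __name__ == "__main__":
    # Hypothetical sustained bandwidths in GB/s; substitute measured values.
    for bandwidth in (5.0, 10.0, 20.0):
        bound = ddot_gflops_bound(bandwidth)
        print(f"{bandwidth:5.1f} GB/s -> at most {bound:.3f} GFlop/s")
```

If the measured rate flattens out near this bound as you add cores, then the memory system rather than the processors is the limit, and adding cores that share the same memory bandwidth won't help.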
Please go ahead and submit both regular-node and PDAF-node results if you have them; it will be interesting to try to figure out what's really going on, and it might be useful data for people who are thinking of buying or using large-memory nodes (which includes both CNSI and my lab here). However, we'll count that as extra credit; it's OK if you just submit regular-node runs.
Cheers,
- John