For parallel processing, it would be useful to have a package, or support in Base itself, that can:
- aggregate the CPU time used across all workers for a code block executing in parallel
- calculate the communication/serialization overhead when executing across different nodes
- list CPU hardware counters such as L1/L2 cache misses

This information would help quickly pinpoint the main cause of poor scaling across multiple cores for certain workloads.
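
As a rough illustration of the first two items, here is a minimal sketch of what can already be approximated by hand with the `Distributed` and `Serialization` standard libraries, assuming Julia 1.5 or later (where `@timed` returns a named tuple). The function `timed_work`, the worker count, and the payload size are hypothetical choices made for this example. Note that `@timed` reports wall-clock time on each worker, which is only a proxy for CPU time, and that first-call timings include compilation. Hardware counters are not covered here.

```julia
using Distributed
using Serialization

# Start two local workers if none exist yet (the count is arbitrary).
nprocs() == 1 && addprocs(2)

# Hypothetical workload: returns its result plus the time spent on the worker.
@everywhere function timed_work(n)
    st = @timed sum(sqrt, 1:n)
    return (result = st.value, seconds = st.time)
end

# Wall-clock time seen by the master for the whole parallel call.
run = @timed pmap(timed_work, fill(10^6, nworkers()))
results = run.value

# Time summed over all workers (wall-clock on each worker, not true CPU time).
aggregate_seconds = sum(r.seconds for r in results)

# Crude estimate of scheduling + communication overhead:
# total wall time minus the slowest worker's compute time.
overhead_seconds = run.time - maximum(r.seconds for r in results)

# Cost of serializing a payload, measured locally without sending it.
payload = rand(10^6)
buf = IOBuffer()
ser = @timed serialize(buf, payload)
nbytes = length(take!(buf))

println("aggregate worker time: ", aggregate_seconds, " s")
println("estimated overhead:    ", overhead_seconds, " s")
println("serialized ", nbytes, " bytes in ", ser.time, " s")
```

Collecting these numbers manually for every parallel call is tedious and easy to get wrong, which is the gap a dedicated package or Base support would fill.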