IREE back-end 3-10 times slower than TF benchmark_model on CPU?

Do Po

Jan 17, 2022, 8:15:12 AM
to iree-discuss

Hello,

While benchmarking some Keras models for prediction (among them ResNet50), we have found that IREE-based dylib execution (compiled with the iree-transformation-pipeline option) is 3-10 times slower than execution with TensorFlow's benchmark_model tool, depending on the architecture.

Is this normal? Are there IREE compilation or VM options that would reduce the gap?

Regards,
Dumitru

Stella Laurenzo

Jan 17, 2022, 11:33:56 AM
to Do Po, iree-discuss
Without getting into details, it is hard to be precise, but that roughly matches my current expectations for some classes of models on x86. At head, we do quite a bit better, beating TF by a lot and approaching best in class on A100 and IceLake for transformer models (huggingface bert-l has been the focus of our most recent performance sprints). Even here, our kernel code generation is up to ~2x slower than it can/should be, but other optimizations offset that. Work continues to improve that on several fronts.

Further performance work on CNNs is underway in our upstream codegen sandbox; it will trickle down to IREE in the coming months and will trigger a similar level of performance work and tracking as we have done for the transformer-based models. We are also doing more work on x86 broadly that will help; previous performance work had been more focused on GPUs. There are also a number of CPU-architecture-specific tuning flags that are not yet on by default but help x86 performance (but I don't have any details of your setup and am not going to speculate).

To be honest, we don't track performance against TF much anymore because it is not competitive outside of Google Cloud TPUs on any workload we have measured. While it is a higher bar, we prefer to baseline against implementations that are closer to best in class. PyTorch tends to make good showings, and ONNX Runtime often has quite good performance. I expect these comparisons will be even less favorable to us on the CNN cases we are still heavily working on, but we prefer to have the most aggressive targets when comparing ourselves. If you have details on your experimental setups, it would be great if you could share them, as they will influence our work in the coming months.
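When sharing setup details, it helps if both runtimes are timed the same way. The following is a minimal sketch of one way to do that, assuming each backend has been wrapped as a zero-argument Python callable that runs one inference on a fixed input. The wrappers `fake_backend_a` and `fake_backend_b` are hypothetical stand-ins; no IREE or TensorFlow API is shown or implied. The sketch discards warm-up runs and reports the median latency. It does not control thread count, batch size, or CPU frequency scaling, which also need to match for a fair comparison.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(run_once: Callable[[], object], warmup: int = 5, iters: int = 50) -> float:
    """Return the median wall-clock latency of run_once in milliseconds.

    Warm-up calls run first and are discarded so that one-time costs
    (lazy loading, cache population) do not skew the result.
    """
    if iters < 1:
        raise ValueError("iters must be >= 1")
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def slowdown(candidate_ms: float, baseline_ms: float) -> float:
    """How many times slower the candidate is than the baseline (>1 means slower)."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return candidate_ms / baseline_ms


if __name__ == "__main__":
    # Hypothetical stand-ins: replace with calls into the two runtimes being
    # compared, each running the same model on the same input.
    def fake_backend_a():
        sum(range(200_000))

    def fake_backend_b():
        sum(range(50_000))

    a = median_latency_ms(fake_backend_a)
    b = median_latency_ms(fake_backend_b)
    print(f"A: {a:.3f} ms, B: {b:.3f} ms, A is {slowdown(a, b):.2f}x slower than B")
```

The median is used rather than the mean so that an occasional outlier (for example, a scheduler hiccup) does not dominate a "3-10x" style ratio.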
