[llvm-dev] Performance benefits shown in [RFC: CSSPGO with Pseudo-Instrumentation] can't be reproduced.


徐青青 via llvm-dev

Oct 28, 2021, 4:26:21 AM
to via llvm-dev
Hi All,

I am using CSSPGO with pseudo-instrumentation, but I cannot reproduce the performance benefits shown in [RFC: CSSPGO with Pseudo-Instrumentation] on SPEC CPU 2017 with llvm-12. In the RFC, the results show CSSPGO with pseudo-instrumentation outperforming AutoFDO.

I have two questions:
  1. Why did you choose SPEC CPU 2006 instead of SPEC CPU 2017? Do you have results on SPEC CPU 2017?
  2. Please point out any errors in my usage of CSSPGO. The steps are as follows:
Suppose that my program is test.cpp.
Step 1: clang -O3 -g3 -fno-omit-frame-pointer -fdebug-info-for-profiling -fpseudo-probe-for-profiling test.cpp -o test
Step 2: perf record -g --call-graph fp -e br_inst_retired.near_taken:uppp -c 16009 -b -o test.perf.data ./test
Step 3: perf script -F ip,brstack -i test.perf.data --show-mmap-event &> test.perf.script
Step 4: llvm_install/bin/llvm-profgen --perfscript=test.perf.script --binary=./test --output=test.spgo.profraw --format=text
Step 5: llvm_install/bin/llvm-profdata merge --text --sample -output=test.spgo.prof test.profraw ...
Step 6: clang -O3 -g3 -fpseudo-probe-for-profiling -fprofile-sample-use=test.spgo.prof test.cpp -o cs_test
Step 7: ./cs_test

Thanks,
Qingqing Xu

Wenlei He via llvm-dev

Oct 29, 2021, 3:49:48 PM
to 徐青青, via llvm-dev, Lei Wang, Hongtao Yu

For SPEC CPU 2017, we’ve seen 1%+ CPU improvements on Broadwell hosts in the past. We use SPEC only for bringing up new technologies, and we no longer track SPEC results now that we have moved towards production workloads. Also note that the measurements were done on our internal fork, with some internal patches. We’re still working on upstreaming some of them.

 

For the setup, -fdebug-info-for-profiling needs to be removed.

 

Thanks,

Wenlei

Hongtao Yu via llvm-dev

Oct 29, 2021, 3:57:42 PM
to Wenlei He, 徐青青, via llvm-dev, Lei Wang

Please also note that LTO mode is recommended in order to maximize the benefit from CSSPGO and its improved inlining. I suggest trying -flto.

 

Thanks,

Hongtao
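The two suggestions above (drop -fdebug-info-for-profiling, add -flto) can be sketched as a revised version of Steps 1 and 6 from the original post. This is only a sketch under stated assumptions: the RUN variable and the function names are hypothetical, the script defaults to printing the commands instead of running them, and depending on the toolchain -flto may also need an LTO-capable linker (for example -fuse-ld=lld).

```shell
#!/bin/sh
# Sketch only. File names follow the original post; RUN and the function
# names are assumptions made for this example.
# RUN defaults to "echo" (dry run); invoke with RUN= to execute for real.
RUN="${RUN-echo}"

# Flags shared by both builds: pseudo probes plus LTO.
COMMON_FLAGS="-O3 -g3 -fpseudo-probe-for-profiling -flto"

# Step 1 (profiling build): -fdebug-info-for-profiling removed, -flto added.
build_profiling_binary() {
  $RUN clang $COMMON_FLAGS -fno-omit-frame-pointer test.cpp -o test
}

# Step 6 (optimized build): consumes the profile produced by Steps 2-5.
build_optimized_binary() {
  $RUN clang $COMMON_FLAGS -fprofile-sample-use=test.spgo.prof test.cpp -o cs_test
}

build_profiling_binary
# Steps 2-5 (perf record, perf script, llvm-profgen, llvm-profdata) stay as posted.
build_optimized_binary
```

Running the script as is only prints the two clang command lines, which makes it easy to check the flags before spending time on a real profiling run.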


Lei Wang via llvm-dev

Oct 29, 2021, 11:58:57 PM
to 徐青青, via llvm-dev, Hongtao Yu

BTW, regarding the issue in https://groups.google.com/g/llvm-dev/c/QJFIzk6bP1Y/m/8YlhrhXDAQAJ (sorry, I overlooked that message):

We have a fix in https://reviews.llvm.org/D110081 that filters out negative LineOffset values. Please give the latest llvm-profgen a try.

 

Thanks.

Lei

徐青青 via llvm-dev

Nov 2, 2021, 2:42:21 AM
to Lei Wang, via llvm-dev, compiler, Hongtao Yu
As you suggested, I removed -fdebug-info-for-profiling from the first compilation and added -flto to the second compilation for CSSPGO. -flto brings a large improvement.

To be fair, I also added -flto to the second compilation for AutoFDO. The results show that AutoFDO brings more performance benefit than CSSPGO (about 20% on SPEC CPU 2017's 523.xalancbmk_r).

The version of LLVM I used is llvm-12, and judging by its date, your RFC is also based on llvm-12. Have I missed anything in my usage of CSSPGO? Is there any option for CSSPGO that I need to enable manually? Could you please test the release/12.x branch and confirm the results, to help me get a performance benefit over AutoFDO?

Thanks,
Qingqing

徐青青 via llvm-dev

Nov 2, 2021, 5:16:44 AM
to Lei Wang, via llvm-dev, compiler, Hongtao Yu
Can you send the detailed results for each benchmark in SPEC CPU 2006 instead of only the geometric mean? I can compare your results with SPEC CPU 2017 because the two suites share some benchmarks.

I also have some questions:
  1. Have you seen compilation errors when using CSSPGO on SPEC CPU 2006? I saw an error in SPEC CPU 2017's 502.gcc_r, a benchmark that also exists in SPEC CPU 2006.
  2. About the fix: we may have a better choice. Regarding the question "Can we potentially lose contexts when an invalid line offset is one of the frames? Like A:-1 @ B:2 @ C:3, without this change, we could still have samples for B:2 @ C:3. But I think that's rare.":
  • You said that you haven't seen those cases, that it seems to happen only for leaf frames (and even for a leaf call, no samples hit the callee), and that a warning could be added for a non-leaf frame with an invalid line offset.
  • In my case, I have seen it happen for non-leaf frames. If such call stacks are filtered out, the number of samples decreases sharply. I haven't managed to use LLVM main successfully yet; once I do, I expect to show you.

If possible, I would welcome a voice call with you whenever it is convenient for you.

Hongtao Yu via llvm-dev

Nov 2, 2021, 1:14:43 PM
to 徐青青, Lei Wang, via llvm-dev, compiler
Replied inline.


From: 徐青青 <xuqingq...@bytedance.com>
Sent: Tuesday, November 2, 2021 2:16 AM
To: Lei Wang <wl...@fb.com>
Cc: via llvm-dev <llvm...@lists.llvm.org>; Hongtao Yu <h...@fb.com>; Wenlei He <wen...@fb.com>; compiler <comp...@bytedance.com>
Subject: Re: [External] Re: [llvm-dev] Performance benefits shown in [RFC: CSSPGO with Pseudo-Instrumentation] can't be reproduced.
 
Can you send the detailed results for each benchmark in SPEC CPU 2006 instead of only the geometric mean? I can compare your results with SPEC CPU 2017 because the two suites share some benchmarks.

I also have some questions:
  1. Have you seen compilation errors when using CSSPGO on SPEC CPU 2006? I saw an error in SPEC CPU 2017's 502.gcc_r, a benchmark that also exists in SPEC CPU 2006.

What compilation error did you see?

  2. About the fix: we may have a better choice. Regarding the question "Can we potentially lose contexts when an invalid line offset is one of the frames? Like A:-1 @ B:2 @ C:3, without this change, we could still have samples for B:2 @ C:3. But I think that's rare.":
  • You said that you haven't seen those cases, that it seems to happen only for leaf frames (and even for a leaf call, no samples hit the callee), and that a warning could be added for a non-leaf frame with an invalid line offset.
  • In my case, I have seen it happen for non-leaf frames. If such call stacks are filtered out, the number of samples decreases sharply. I haven't managed to use LLVM main successfully yet; once I do, I expect to show you.

For a case like A:-1 @ B:2 @ C:3, the call stack will be truncated to B:2 @ C:3. As a result, the compiler will no longer be able to inline every function into A. The number of samples, in terms of LBR samples, will still be preserved.

There is a warning emitted for that. Please see https://github.com/llvm/llvm-project/blob/main/llvm/tools/llvm-profgen/PerfReader.cpp#L436-L437. It may not be in the 12.x branch.
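To make the truncation concrete, here is a small illustration that operates on the textual context string only. It is not llvm-profgen code: truncate_context is a hypothetical helper, and the rule it applies (drop an invalid frame together with every caller to its left) is an assumption generalized from the single A:-1 @ B:2 @ C:3 example above.

```shell
#!/bin/sh
# Illustration only; truncate_context is a hypothetical helper, not part of
# llvm-profgen. Assumed rule: a frame with a negative line offset is dropped
# together with all frames to its left (its callers).
truncate_context() {
  rest="$1"
  kept=""
  while [ -n "$rest" ]; do
    # Peel off the leftmost frame; frames are separated by " @ ".
    case "$rest" in
      *" @ "*) frame="${rest%% @ *}"; rest="${rest#* @ }" ;;
      *)       frame="$rest"; rest="" ;;
    esac
    # The line offset is the text after the last ':' in the frame.
    case "${frame##*:}" in
      -*) kept="" ;;                          # invalid offset: restart the context
      *)  kept="${kept:+$kept @ }$frame" ;;   # valid frame: append it
    esac
  done
  printf '%s\n' "$kept"
}

truncate_context "A:-1 @ B:2 @ C:3"
```

For the example from this thread the helper prints B:2 @ C:3: the samples are kept, but they are no longer attributed to a context rooted at A.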

We noticed that a compiler optimization, tail merging, can cause such truncated stacks. You may want to try turning it off with -enable-tail-merge=0.
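Since -enable-tail-merge is an LLVM backend option and not a clang driver flag, it normally has to be forwarded with -mllvm. A minimal sketch follows, assuming the file names from the original post; the RUN variable and the function names are hypothetical. The LTO variant additionally assumes lld as the linker, because with -flto code generation happens at link time and the option has to reach the linker.

```shell
#!/bin/sh
# Sketch only; RUN and the function names are assumptions for this example.
# RUN defaults to "echo" (dry run); invoke with RUN= to execute for real.
RUN="${RUN-echo}"

# Non-LTO build: forward the backend option through clang with -mllvm.
build_without_tail_merge() {
  $RUN clang -O3 -g3 -fpseudo-probe-for-profiling \
    -mllvm -enable-tail-merge=0 test.cpp -o test
}

# LTO build (assumes lld): forward the option to the link-time code generator.
build_without_tail_merge_lto() {
  $RUN clang -O3 -g3 -fpseudo-probe-for-profiling -flto -fuse-ld=lld \
    -Wl,-mllvm,-enable-tail-merge=0 test.cpp -o test
}

build_without_tail_merge
build_without_tail_merge_lto
```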


If possible, I would welcome a voice call with you whenever it is convenient for you.

On Tue, Nov 2, 2021, 14:42 <xuqingq...@bytedance.com> wrote:
As you suggested, I removed -fdebug-info-for-profiling from the first compilation and added -flto to the second compilation for CSSPGO. -flto brings a large improvement.

It would be better to apply LTO to both pass 1 and pass 2.

To be fair, I also added -flto to the second compilation for AutoFDO. The results show that AutoFDO brings more performance benefit than CSSPGO (about 20% on SPEC CPU 2017's 523.xalancbmk_r).

How about the other benchmarks? xalancbmk can be unstable at times.


The version of LLVM I used is llvm-12, and judging by its date, your RFC is also based on llvm-12. Have I missed anything in my usage of CSSPGO? Is there any option for CSSPGO that I need to enable manually? Could you please test the release/12.x branch and confirm the results, to help me get a performance benefit over AutoFDO?

Yes, there are other switches that can help boost the performance, such as -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -fno-optimize-sibling-calls -funique-internal-linkage-names. Note that with the main branch, -funique-internal-linkage-names is automatically turned on when -fpseudo-probe-for-profiling is on. 

We've made a lot of post-12.x improvements to CSSPGO, and they are all in the main branch. Please give them a shot.
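As a rough sketch, the additional switches listed above could be added to the profiling build (pass 1) like this. The thread does not say which pass each switch belongs to, so applying them to pass 1 is an assumption, as are the RUN variable and the function name. Because -funique-internal-linkage-names changes symbol names, it presumably has to be used consistently in the optimized build as well.

```shell
#!/bin/sh
# Sketch only; RUN and the function name are assumptions for this example.
# RUN defaults to "echo" (dry run); invoke with RUN= to execute for real.
RUN="${RUN-echo}"

# Switches named in the reply above, grouped for readability.
EXTRA_SWITCHES="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer"
EXTRA_SWITCHES="$EXTRA_SWITCHES -fno-optimize-sibling-calls"
EXTRA_SWITCHES="$EXTRA_SWITCHES -funique-internal-linkage-names"

# Pass 1 with pseudo probes, LTO, and the extra switches.
build_pass1_with_extra_switches() {
  $RUN clang -O3 -g3 -fpseudo-probe-for-profiling -flto \
    $EXTRA_SWITCHES test.cpp -o test
}

build_pass1_with_extra_switches
```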