unnecessary use of lea to increment an induction variable #13692
Comments
I can reproduce the difference on my Sandy Bridge MacBookPro8,2 2.3 GHz:
/private/tmp# time ./a.out.gcc
real 0m0.714s
real 0m0.735s
FWIW modern gcc produces this: subq $0x01,%rdx which is also faster than clang.

I suspect the following paragraph from http://www.intel.com/technology/itj/2007/v11i4/1-inside/7-code-gen.htm explains it:

The Intel Core micro-architecture can combine an integer compare (cmp) or test (test) instruction and a subsequent conditional jump instruction (jCC) into a single micro-operation through a process called macro-fusion. For macro-fusion to occur between cmp and jCC, the jump condition must test only the carry and/or zero flags, which is typically the case for unsigned integer compare and jump operations. The Intel Fortran/C++ compiler takes advantages of the macro-fusion feature by generating code that is likely to expose macro-fusion opportunities by detecting compare and jump instructions that are candidates for fusion. During scheduling, it forces these compare and jump instructions to be adjacent. Note that this strategy conflicts with a traditional latency-based strategy, which tends to separate producers (the compare in this case) from consumers (the conditional jump).
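To make the adjacency requirement concrete, here is a minimal sketch of a toy model that counts compare-and-jump pairs in an instruction sequence. The opcode names and the helper functions are hypothetical, invented for illustration, and the model ignores the further conditions real CPUs place on fusion (such as which flags the jump tests).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy opcode set; these names are illustrative only and do not come
 * from LLVM or from any assembler. */
enum toy_op { OP_CMP, OP_TEST, OP_INC, OP_LEA, OP_LOAD, OP_JCC };

/* In this toy model a flag-setting compare fuses only with a
 * conditional jump that immediately follows it. */
static bool toy_is_fusible_pair(enum toy_op first, enum toy_op second)
{
    return (first == OP_CMP || first == OP_TEST) && second == OP_JCC;
}

/* Count adjacent compare+jump pairs in a sequence of n instructions. */
size_t toy_count_fusible(const enum toy_op *seq, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        if (toy_is_fusible_pair(seq[i], seq[i + 1]))
            count++;
    }
    return count;
}
```

Under this model, a sequence such as load, inc, cmp, jcc contains one fusible pair, while load, cmp, lea, jcc contains none, because the lea sits between the compare and the jump.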
What seems to be happening here is the following: The TwoAddressInstructionPass tries to transform two-address instructions (e.g. INC) into three-address instructions (e.g. LEA) whenever it considers this transformation profitable. However, the profitability check has a very narrow view: the transformation is profitable if it may help relieve register pressure. Two other factors are not taken into account: (a) the direct performance cost of the replacement instruction, and (b) the effect on macro-fusion.

In this case, we fail on both counts. On modern x86 CPUs, lea is slower than inc, and sinking the lea instruction to the appropriate (from the register allocator's perspective) place breaks macro-fusion.

Issue (a) is easy. If we decide we never want to promote an INC to a LEA because of the direct performance penalty, this transformation can be trivially disabled. However, I'm not sure this is a good idea.
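As a rough illustration of the difference between the narrow check and one that also weighs the other two factors, here is a sketch in C. The struct, its fields, and both functions are hypothetical and do not correspond to LLVM's actual TwoAddressInstructionPass code; the costs are assumed inputs rather than measured values.

```c
#include <stdbool.h>

/* Hypothetical summary of one candidate conversion. None of these
 * fields correspond to real LLVM data structures. */
struct toy_candidate {
    bool relieves_register_pressure; /* the factor the narrow check looks at */
    int  two_addr_cost;              /* assumed cost of the original, e.g. inc */
    int  three_addr_cost;            /* assumed cost of the replacement, e.g. lea */
    bool breaks_cmp_jcc_adjacency;   /* would the result land between cmp and jcc? */
};

/* Narrow check: register pressure only. */
bool toy_profitable_narrow(const struct toy_candidate *c)
{
    return c->relieves_register_pressure;
}

/* Wider check: also reject conversions that are slower or that
 * would split a fusible compare+jump pair. */
bool toy_profitable_wide(const struct toy_candidate *c)
{
    if (!c->relieves_register_pressure)
        return false;
    if (c->three_addr_cost > c->two_addr_cost)
        return false;
    if (c->breaks_cmp_jcc_adjacency)
        return false;
    return true;
}
```

Rejecting outright is a simplification. A real heuristic would more likely weigh the register-pressure benefit against the other two costs, which is consistent with the doubt above about disabling the transformation entirely.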
MI scheduler doesn't fix this. The 2-address pass still does the wrong thing. It would be nice to improve those heuristics. Also, the MI scheduler could correct the problem if it could create copies and split live ranges. That was part of the original MI scheduler plan but was never implemented. The implementation would look like this:
I looked at this again because it was related to 18598. This bug is more about scheduling the cmp/jne together, and less about the choice of inc/lea/add. My previous comment suggested recovering from the problem within the scheduler, where we know we want to merge the cmp/jmp. It would be nice if the scheduler could do things like that, but that's making it into a harder problem than it needs to be. We could introduce code in a couple places to handle this sort of idiom and just prevent the problem before it happens.
if.end: ; preds = %for.cond
Not much has changed in 5 years: https://godbolt.org/z/P4raiv
Extended Description
gcc-4.2 compiles
to
Clang produces
The use of an extra register (rcx) is probably not a big issue, as it is not used in the loop. The strange part is that clang inverts the order of the increment of 'i' and the load, and then uses a leaq instead of an inc to avoid modifying the flags.
To test this I wrapped it with
Whether this is a perf problem seems to be CPU dependent. On my MacBookPro6,2 with a 2.66 GHz Core i7 I get 1.243s for the clang version and 1.173s for the gcc version.
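The wrapper used for these timings is not preserved on this page. For anyone wanting to repeat the comparison, a minimal sketch of such a harness might look like the following. Both `count_nonzero` and `time_loop` are hypothetical names, and `count_nonzero` is only a stand-in loop, not the function from this report.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the function under test; the original
 * source is not shown in this thread. */
size_t count_nonzero(const int *p, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (p[i] != 0)
            hits++;
    }
    return hits;
}

/* Run the loop `reps` times and return the elapsed CPU time in seconds.
 * Writing the total through a volatile pointer keeps the result live.
 * An optimizer may still merge the repeated calls, so the function
 * under test is best kept in a separate translation unit. */
double time_loop(const int *p, size_t n, int reps, volatile size_t *sink)
{
    clock_t start = clock();
    size_t total = 0;
    for (int r = 0; r < reps; r++)
        total += count_nonzero(p, n);
    *sink = total;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Building the same harness with each compiler and comparing the returned times on one machine gives the kind of comparison quoted above. Differences of a few percent need several runs to separate from noise.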