unnecessary use of lea to increment an induction variable #13692

Open
llvmbot opened this issue Jul 10, 2012 · 6 comments
Labels: backend:X86, bugzilla (Issues migrated from bugzilla)

Comments

@llvmbot
Collaborator

llvmbot commented Jul 10, 2012

Bugzilla Link 13320
Version trunk
OS All
Reporter LLVM Bugzilla Contributor
CC @adibiagio,@atrick,@legrosbuffle,@topperc,@ehsan,@jrmuizel,@LebedevRI,@RKSimon,@rotateright

Extended Description

gcc-4.2 compiles

long IndexOfChild(long* children, long d, long count)  {
  long i = -1;
  for (; ;) {
    if (i >= (count -1))
      break;
    ++i;
    if (children[i] == d) {
      return i;
    }
  }
  return -1;
}

to

0000000000000000	decq	%rdx
0000000000000003	movq	$0xffffffff,%rax
000000000000000a	nopw	0x00(%rax,%rax)
0000000000000010	cmpq	%rdx,%rax
0000000000000013	jge	0x00000020
0000000000000015	incq	%rax
0000000000000018	cmpq	%rsi,(%rdi,%rax,8)
000000000000001c	jne	0x00000010
000000000000001e	repz/ret
0000000000000020	movq	$0xffffffff,%rax
0000000000000027	ret

Clang produces

0000000000000000	decq	%rdx
0000000000000003	movq	$0xffffffff,%rcx
000000000000000a	movq	%rcx,%rax
000000000000000d	nopl	(%rax)
0000000000000010	cmpq	%rdx,%rax
0000000000000013	jge	0x00000021
0000000000000015	cmpq	%rsi,0x08(%rdi,%rax,8)
000000000000001a	leaq	0x01(%rax),%rax
000000000000001e	jne	0x00000010
0000000000000020	ret
0000000000000021	movq	%rcx,%rax
0000000000000024	ret

The use of an extra register (rcx) is probably not a big issue, as it is not used in the loop. The strange part is that clang swaps the order of the increment of 'i' and the load, and then uses a leaq instead of an inc to avoid modifying the flags.
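For reference, the key difference between the two increments is flag behavior: inc (like add) writes EFLAGS, while lea computes the same sum through address generation and leaves the flags untouched, which is what allows clang to place the increment between the cmp and the jne. A hand-written sketch, not taken from either compiler's output:

incq %rax                # rax = rax + 1, clobbers ZF/SF/OF, so it cannot sit
                         # between a cmp and the jcc that consumes the cmp's flags
leaq 0x01(%rax),%rax     # rax = rax + 1, flags untouched, so it can legally be
                         # scheduled after the cmp and before the jne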

To test this I wrapped it with

#include <string.h>
#include <stdio.h>
long IndexOfChild(long* children, long d, long count);
int main() {
  long n = 1000000;
  long *v = new long[n];
  memset(v, 0, n * sizeof(long));
  v[n - 1] = 42;  // match only at the last element, so each call scans the whole array
  int r;
  for (int i = 0; i < 1000; ++i)
    r = IndexOfChild(v, 42, n);
  printf("%d\n", r);
  return 0;
}

Whether or not this is a perf problem seems to be CPU dependent. On my MacBookPro6,2 with a 2.66 GHz Core i7 I get 1.243s for the clang version and 1.173s for the gcc version.

@jrmuizel

I can reproduce the difference on my Sandy Bridge MacBookPro8,2 (2.3 GHz):

/private/tmp# time ./a.out.gcc
999999

real 0m0.714s
user 0m0.707s
sys 0m0.005s
/private/tmp# time ./a.out.clang
999999

real 0m0.735s
user 0m0.729s
sys 0m0.005s

@jrmuizel

FWIW modern gcc produces this:

subq $0x01,%rdx
movq $0xffffffff,%rax
nopl 0x00(%rax,%rax)
cmpq %rdx,%rax
jge 0x100000e90
addq $0x01,%rax
cmpq %rsi,(%rdi,%rax,8)
jne 0x100000e70
repz/ret
nopl 0x00000000(%rax)
nopl 0x00000000(%rax,%rax)
movq $0xffffffff,%rax
ret

which is also faster than the clang version.

I suspect the following paragraph from http://www.intel.com/technology/itj/2007/v11i4/1-inside/7-code-gen.htm explains it:

The Intel Core micro-architecture can combine an integer compare (cmp) or test (test) instruction and a subsequent conditional jump instruction (jCC) into a single micro-operation through a process called macro-fusion. For macro-fusion to occur between cmp and jCC, the jump condition must test only the carry and/or zero flags, which is typically the case for unsigned integer compare and jump operations. The Intel Fortran/C++ compiler takes advantages of the macro-fusion feature by generating code that is likely to expose macro-fusion opportunities by detecting compare and jump instructions that are candidates for fusion. During scheduling, it forces these compare and jump instructions to be adjacent. Note that this strategy conflicts with a traditional latency-based strategy, which tends to separate producers (the compare in this case) from consumers (the conditional jump).
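Applied to the two listings above: in the gcc loop the cmpq %rsi,(%rdi,%rax,8) is immediately followed by the jne, so the pair is a candidate for macro-fusion, while in the clang loop the leaq sits between the equivalent cmpq and the jne and blocks the fusion. A simplified side-by-side, adapted from the listings in the description:

# gcc's shape: compare and branch adjacent, eligible for macro-fusion
incq %rax
cmpq %rsi,(%rdi,%rax,8)
jne  loop

# clang's shape: the leaq separates the compare from its branch
cmpq %rsi,0x08(%rdi,%rax,8)
leaq 0x01(%rax),%rax
jne  loop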

@llvmbot
Collaborator Author

llvmbot commented Jul 24, 2012

What seems to be happening here is the following:

The TwoAddressInstructionPass tries to transform two-address instructions (e.g. INC) into three-address instructions (e.g. LEA) whenever it considers this transformation profitable. However, the profitability check has a very narrow view: the transformation is profitable if it may help relieve register pressure. Two other factors are not taken into account:
a) Whether the new instruction is slower than the original one.
b) Whether the code motion that is required to relieve pressure is a good idea.

In this case, we fail on both counts. On modern x86 CPUs, lea is slower than inc, and sinking the lea instruction to the appropriate (from the register allocator's perspective) place breaks macro-fusion.

Issue (a) is easy. If we decide we never want to promote an INC to a LEA because of the direct performance penalty, this transformation can be trivially disabled. However, I'm not sure this is a good idea.
Issue (b) is trickier. Right now, the TwoAddressInstructionPass first transforms the 2-address instruction into a 3-address one, and only then attempts to sink it. To make the right decision, the new location would have to be known ahead of time. Furthermore, this is a target-dependent decision, and I don't see an appropriate hook.
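In instruction terms, the conversion the pass makes looks roughly like the following (a hand-written sketch, not actual pass output). The two-address form ties the destination to one of its sources, whereas the three-address lea lets the destination differ from the source, which is what can relieve register pressure when the old value is still needed:

incq %rax                # two-address: reads and writes %rax, the old value dies
leaq 0x01(%rax),%rcx     # three-address: %rcx = %rax + 1, %rax stays live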

@atrick
Contributor

atrick commented Oct 5, 2013

MI scheduler doesn't fix this. The 2-address pass still does the wrong thing. It would be nice to improve those heuristics. Also, the MI scheduler could correct the problem if it could create copies and split live ranges. That was part of the original MI scheduler plan but was never implemented. The implementation would look like this:

  • DAG builder creates weak edges for vreg antidependence.
  • Check for unresolved weak edge when scheduling the instruction.
  • Create the copy as a DAG node on-the-fly (not currently supported).
  • Create a new live interval for the copy and shrink the original.

@atrick
Contributor

atrick commented Jan 25, 2014

I looked at this again because it was related to 18598. This bug is more about scheduling the cmp/jne together, and less about the choice of inc/lea/add. My previous comment suggested recovering from the problem within the scheduler, where we know we want to merge the cmp/jmp. It would be nice if the scheduler could do things like that, but that's making it into a harder problem than it needs to be.

We could introduce code in a couple places to handle this sort of idiom and just prevent the problem before it happens.

  1. In LoopStrengthReduce, consider the load to be a "post-inc" user. We could use a target hook to see that the load will be folded into the compare.

if.end: ; preds = %for.cond
%inc = add nsw i64 %i.0, 1
%arrayidx = getelementptr inbounds i64* %children, i64 %inc
%0 = load i64* %arrayidx, align 8, !tbaa !1
%cmp1 = icmp eq i64 %0, %d
br i1 %cmp1, label %return, label %for.cond

  2. In ISEL, avoid folding the increment into the load's addressing mode. That way the load continues to use the post-inc form of %i. That probably means looking for patterns like this when we decide to fold. I'm not very familiar with that code.
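Concretely, the choice is between a load whose address encodes the not-yet-incremented induction variable plus a constant displacement, and a load that simply indexes with the already-incremented value; only the latter lets the increment be ordered before the compare so the compare can stay next to its branch. A hand-written sketch of the two selections (not compiler output):

# folded: children[i+1] addressed while %rax still holds i, so the increment
# must be emitted as a separate leaq somewhere after the cmpq
cmpq %rsi,0x08(%rdi,%rax,8)
leaq 0x01(%rax),%rax

# not folded: increment first, then address children[i] with the updated value,
# leaving the cmpq free to sit directly in front of the jne
incq %rax
cmpq %rsi,(%rdi,%rax,8)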

@RKSimon
Collaborator

RKSimon commented Mar 30, 2019

Not much has changed in 5 years: https://godbolt.org/z/P4raiv

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 3, 2021