suboptimal codegen for conditional float return? #38882

Open
Trass3r opened this issue Nov 2, 2018 · 14 comments

Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@Trass3r
Contributor

Trass3r commented Nov 2, 2018

Bugzilla Link 39534
Version trunk
OS All
CC @topperc, @RKSimon, @MatzeB, @rotateright

Extended Description

bool isNanOrInf(float x)
{
return (reinterpret_cast<unsigned&>(x) & 0x7F800000) == 0x7F800000;
}

float test(float x)
{
return isNanOrInf(x) ? 0 : x;
}

test(float):
vmovd eax, xmm0
not eax
vpxor xmm1, xmm1, xmm1
and eax, 2139095040
je .LBB1_2
vmovdqa xmm1, xmm0
.LBB1_2:
vmovdqa xmm0, xmm1
ret

In contrast, gcc does not use xmm1 at all:

test(float):
vmovd eax, xmm0
mov edx, 2139095040
and edx, eax
cmp edx, 2139095040
je .L5
ret
.L5:
vxorps xmm0, xmm0, xmm0
ret

https://gcc.godbolt.org/z/biNBFq
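
As an aside, the reinterpret_cast punning in the reproducer is formally a strict-aliasing violation. The variant below is a sketch of an equivalent, well-defined formulation using memcpy; the function names are made up here, and the assumption (not verified in this report) is that clang folds the small memcpy into the same bitcast/and/icmp IR, so it should show the same codegen issue.

#include <cstring>

// Aliasing-safe equivalent of isNanOrInf above.
bool isNanOrInfSafe(float x)
{
    unsigned bits;
    std::memcpy(&bits, &x, sizeof(bits));        // well-defined type punning
    return (bits & 0x7F800000u) == 0x7F800000u;  // exponent all ones => NaN or Inf
}

float testSafe(float x)
{
    return isNanOrInfSafe(x) ? 0 : x;
}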

@topperc
Collaborator

topperc commented Nov 2, 2018

Matthias, do you have any suggestions for how to improve this?

@topperc
Collaborator

topperc commented Nov 2, 2018

Given these two equivalent IR files where I've flipped the condition and select operands, we produce better results with one than the other.

define dso_local float @_Z4testf(float %x) local_unnamed_addr #0 {
entry:
%0 = bitcast float %x to i32
%and.i = and i32 %0, 2139095040
%cmp.i = icmp eq i32 %and.i, 2139095040
%cond = select i1 %cmp.i, float 0.000000e+00, float %x
ret float %cond
}

define dso_local float @_Z4testf2(float %x) local_unnamed_addr #0 {
entry:
%0 = bitcast float %x to i32
%and.i = and i32 %0, 2139095040
%cmp.i = icmp ne i32 %and.i, 2139095040
%cond = select i1 %cmp.i, float %x, float 0.000000e+00
ret float %cond
}
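
(For anyone comparing the two functions locally: an llc invocation along the following lines reproduces the difference. The exact triple and -mattr flags are assumptions, chosen to match the AVX output shown in the original report.)

$ llc -O2 -mtriple=x86_64-- -mattr=+avx -o - test.ll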

@rotateright
Contributor

> Given these two equivalent IR files where I've flipped the condition and select operands, we produce better results with one than the other.

For reference, instcombine is missing several potential select canonicalizations, but it does canonicalize this code already.

Unfortunately, it's to the 'eq' version, and that's the one that has the poor codegen.

@MatzeB
Contributor

MatzeB commented Nov 2, 2018

I assume with "bad codegen" you specifically mean the "vmovdqa xmm1, xmm0", "vmovdqa xmm0, xmm1" combination. The problem at the MI level here is that the code looks something like this after the CMOV has been replaced with a jump:

bb.0:
  ...
  %4:fr32 = FsFLD0SS
  JE_1 %bb.2

bb.1:
  %0:fr32 = COPY $xmm0

bb.2:
  %5:fr32 = PHI %0:fr32, %bb.1, %4:fr32, %bb.0

Unfortunately bb.0->bb.2 is a critical edge, and we are forced to keep the FsFLD0SS instruction in bb.0. If I force CodeGen to split all critical edges we end up with better code:

$ llc -o - -phi-elim-split-all-critical-edges=1 -no-phi-elim-live-out-early-exit=1 test.ll
...
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	movd	%xmm0, %eax
	andl	$2139095040, %eax       ## imm = 0x7F800000
	cmpl	$2139095040, %eax       ## imm = 0x7F800000
	jne	LBB0_2
## %bb.1:
	pxor	%xmm0, %xmm0
LBB0_2:
	popq	%rbp
	retq

I'm not up to date with the discussion of when we do or don't split critical edges, and it definitely isn't universally good to split them. This example, however, would clearly benefit from splitting, so maybe we can come up with a heuristic to catch this situation...

@MatzeB
Contributor

MatzeB commented Nov 2, 2018

Or maybe we should change that CMOV expansion code to not produce critical edges. Given that the choice of edge is arbitrary, I'd rather not introduce any critical edge here to have more consistent codegen in that case...

@topperc
Collaborator

topperc commented Nov 2, 2018

I was just about to ask if we should do that.

@topperc
Collaborator

topperc commented Nov 5, 2018

Patch to avoid creating critical edges: https://reviews.llvm.org/D54119

Unfortunately it doesn't seem to help this case by itself.

@MatzeB
Contributor

MatzeB commented Nov 5, 2018

Looks like the tail duplication code is making a bad decision because the instructions haven't been sunk yet... If I move the tail duplicator to run after machine sinking in TargetPassConfig, the example works as expected...

Not sure, though, what the interactions of tail duplication with the other passes are, or whether moving it is a good idea in general.

Short of that, maybe we can do an extra round of machine sinking once we've lowered any CMOVs? (Or implement some simple version of sinking inside the CMOV lowering.)
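
For illustration only, the reordering described above would roughly amount to swapping two addPass calls in TargetPassConfig::addMachineSSAOptimization (llvm/lib/CodeGen/TargetPassConfig.cpp). This is a sketch of the experiment, not a reviewed patch; surrounding passes are elided and the exact neighbors may differ by LLVM revision.

void TargetPassConfig::addMachineSSAOptimization() {
  // ... earlier SSA optimizations elided ...

  // Current order: pre-RA tail duplication runs before machine sinking.
  //   addPass(&EarlyTailDuplicateID);
  //   ...
  //   addPass(&MachineSinkingID);

  // Experiment: sink instructions (e.g. the FsFLD0SS above) into their use
  // blocks first, so the tail duplicator sees their final placement instead
  // of picking an edge more or less arbitrarily.
  addPass(&MachineSinkingID);
  addPass(&EarlyTailDuplicateID);

  // ... remaining passes elided ...
}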

@MatzeB
Contributor

MatzeB commented Nov 5, 2018

And BTW: for me the example works with your patch regardless. Tail duplication makes the other edge critical, so the FsFLD0SS can be sunk.

@topperc
Collaborator

topperc commented Nov 5, 2018

I was checking the output of the two test cases from comment 2. With my patch they switch behavior, so I guess it fixes the original case but regresses the opposite case.

@MatzeB
Contributor

MatzeB commented Nov 5, 2018

Yeah, as I explained in my other comment: unless the instruction has actually been sunk into the blocks, it's probably random which edge the tail duplicator picks. That probably also explains some of the regressions you saw in the tests.

I CCed Kyle; maybe he can comment on the effects of moving early tail duplication later in the pipeline (at least after pre-RA machine sinking)...

@topperc
Collaborator

topperc commented Dec 10, 2018

Kyle, have you had a chance to look at this?

@llvmbot
Collaborator

llvmbot commented Dec 10, 2018

I haven't been working on LLVM in a while, but I can't think of any reason why moving early tail duplication after sinking would cause problems. If it improves benchmarks and doesn't cause regressions, go for it.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@Trass3r
Contributor Author

Trass3r commented Apr 27, 2022

Clang still generates the same code.
