suboptimal codegen for conditional float return? #38882

Open
Trass3r opened this issue Nov 2, 2018 · 14 comments

Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@Trass3r
Contributor

Trass3r commented Nov 2, 2018

Bugzilla Link 39534
Version trunk
OS All
CC @topperc, @RKSimon, @MatzeB, @rotateright

Extended Description

bool isNanOrInf(float x)
{
return (reinterpret_cast<unsigned&>(x) & 0x7F800000) == 0x7F800000;
}

float test(float x)
{
return isNanOrInf(x) ? 0 : x;
}

test(float):
vmovd eax, xmm0
not eax
vpxor xmm1, xmm1, xmm1
and eax, 2139095040
je .LBB1_2
vmovdqa xmm1, xmm0
.LBB1_2:
vmovdqa xmm0, xmm1
ret

In contrast, gcc does not use xmm1 at all:

test(float):
vmovd eax, xmm0
mov edx, 2139095040
and edx, eax
cmp edx, 2139095040
je .L5
ret
.L5:
vxorps xmm0, xmm0, xmm0
ret

https://gcc.godbolt.org/z/biNBFq
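
As an aside, the reinterpret_cast punning in the reproducer is formally a strict-aliasing violation. The variant below is a sketch of an equivalent, well-defined formulation using memcpy; the function names are made up here, and the assumption (not verified in this report) is that clang folds the small memcpy into the same bitcast/and/icmp IR, so it should show the same codegen issue.

#include <cstring>

// Aliasing-safe equivalent of isNanOrInf above.
bool isNanOrInfSafe(float x)
{
    unsigned bits;
    std::memcpy(&bits, &x, sizeof(bits));        // well-defined type punning
    return (bits & 0x7F800000u) == 0x7F800000u;  // exponent all ones => NaN or Inf
}

float testSafe(float x)
{
    return isNanOrInfSafe(x) ? 0 : x;
}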

@topperc
Collaborator

topperc commented Nov 2, 2018

Matthias, do you have any suggestions for how to improve this?

@topperc
Collaborator

topperc commented Nov 2, 2018

Given these two equivalent IR files where I've flipped the condition and select operands, we produce better results with one than the other.

define dso_local float @_Z4testf(float %x) local_unnamed_addr #0 {
entry:
%0 = bitcast float %x to i32
%and.i = and i32 %0, 2139095040
%cmp.i = icmp eq i32 %and.i, 2139095040
%cond = select i1 %cmp.i, float 0.000000e+00, float %x
ret float %cond
}

define dso_local float @_Z4testf2(float %x) local_unnamed_addr #0 {
entry:
%0 = bitcast float %x to i32
%and.i = and i32 %0, 2139095040
%cmp.i = icmp ne i32 %and.i, 2139095040
%cond = select i1 %cmp.i, float %x, float 0.000000e+00
ret float %cond
}
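
(For anyone comparing the two functions locally: an llc invocation along the following lines reproduces the difference. The exact triple and -mattr flags are assumptions, chosen to match the AVX output shown in the original report.)

$ llc -O2 -mtriple=x86_64-- -mattr=+avx -o - test.ll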

@rotateright
Contributor

> Given these two equivalent IR files where I've flipped the condition and select operands, we produce better results with one than the other.

For reference, instcombine is missing several potential select canonicalizations, but it does canonicalize this code already.

Unfortunately, it's to the 'eq' version, and that's the one that has the poor codegen.

@MatzeB
Contributor

MatzeB commented Nov 2, 2018

I assume with "bad codegen" you specifically mean the "vmovdqa xmm1, xmm0", "vmovdqa xmm0, xmm1" combination. The problem at the MI level here is that the code looks something like this after the CMOV has been replaced with a jump:

bb.0:
  ...
  %4:fr32 = FsFLD0SS
  JE_1 %bb.2

bb.1:
  %0:fr32 = COPY $xmm0

bb.2:
  %5:fr32 = PHI %0:fr32, %bb.1, %4:fr32, %bb.0

Unfortunately bb.0->bb.2 is a critical edge, and we are forced to keep the FsFLD0SS instruction in bb.0. If I force CodeGen to split all critical edges we end up with better code:

$ llc -o - -phi-elim-split-all-critical-edges=1 -no-phi-elim-live-out-early-exit=1 test.ll
...
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	movd	%xmm0, %eax
	andl	$2139095040, %eax       ## imm = 0x7F800000
	cmpl	$2139095040, %eax       ## imm = 0x7F800000
	jne	LBB0_2
## %bb.1:
	pxor	%xmm0, %xmm0
LBB0_2:
	popq	%rbp
	retq

I'm not up to date with the discussion of when we do or don't split critical edges, and it definitely isn't universally good to split them. This example, however, would clearly benefit from splitting, so maybe we can come up with a heuristic to catch this situation...

@MatzeB
Contributor

MatzeB commented Nov 2, 2018

Or maybe we should change that CMOV expansion code to not produce critical edges. Given that the choice of edge is arbitrary, I'd rather not introduce any critical edge here to have more consistent codegen in that case...

@topperc
Collaborator

topperc commented Nov 2, 2018

I was just about to ask if we should do that.

@topperc
Collaborator

topperc commented Nov 5, 2018

Patch to avoid creating critical edges: https://reviews.llvm.org/D54119

Unfortunately it doesn't seem to help this case by itself.

@MatzeB
Contributor

MatzeB commented Nov 5, 2018

Looks like the tail duplication code is making a bad decision because the instructions haven't been sunk yet... If I move the tail duplicator to run after machine sinking in TargetPassConfig, the example works as expected...

Not sure, though, what the interactions of tail duplication with the other passes are, or whether moving it is a good idea in general.

Short of that, maybe we can do an extra round of machine sinking once we've lowered any CMOVs? (Or implement some simple version of sinking inside the CMOV lowering.)
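
For illustration only, the reordering described above would roughly amount to swapping two addPass calls in TargetPassConfig::addMachineSSAOptimization (llvm/lib/CodeGen/TargetPassConfig.cpp). This is a sketch of the experiment, not a reviewed patch; surrounding passes are elided and the exact neighbors may differ by LLVM revision.

void TargetPassConfig::addMachineSSAOptimization() {
  // ... earlier SSA optimizations elided ...

  // Current order: pre-RA tail duplication runs before machine sinking.
  //   addPass(&EarlyTailDuplicateID);
  //   ...
  //   addPass(&MachineSinkingID);

  // Experiment: sink instructions (e.g. the FsFLD0SS above) into their use
  // blocks first, so the tail duplicator sees their final placement instead
  // of picking an edge more or less arbitrarily.
  addPass(&MachineSinkingID);
  addPass(&EarlyTailDuplicateID);

  // ... remaining passes elided ...
}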

@MatzeB
Contributor

MatzeB commented Nov 5, 2018

And BTW: for me the example works with your patch regardless. Tail duplication makes the other edge critical, so the FsFLD0SS can be sunk.

@topperc
Collaborator

topperc commented Nov 5, 2018

I was checking the output of the two test cases from comment 2. With my patch they switch behavior, so I guess it fixes the original case but regresses the opposite case.

@MatzeB
Contributor

MatzeB commented Nov 5, 2018

Yeah, as I explained in my other comment: unless the instruction has actually been sunk into the blocks, it's probably random which edge the tail duplicator picks. That probably also explains some of the regressions you saw in the tests.

I CCed Kyle; maybe he can comment on the effects of moving early tail duplication later in the pipeline (at least after pre-RA machine sinking)...

@topperc
Collaborator

topperc commented Dec 10, 2018

Kyle, have you had a chance to look at this?

@llvmbot
Collaborator

llvmbot commented Dec 10, 2018

I haven't been working on LLVM in a while, but I can't think of any reason why moving early tail duplication after sinking would cause problems. If it improves benchmarks and doesn't cause regressions, go for it.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@Trass3r
Contributor Author

Trass3r commented Apr 27, 2022

Clang still generates the same code.
