```cpp
bool isNanOrInf(float x) { return (reinterpret_cast<unsigned&>(x) & 0x7F800000) == 0x7F800000; }
float test(float x) { return isNanOrInf(x) ? 0 : x; }
```

```asm
test(float):
        vmovd   eax, xmm0
        not     eax
        vpxor   xmm1, xmm1, xmm1
        and     eax, 2139095040
        je      .LBB1_2
        vmovdqa xmm1, xmm0
.LBB1_2:
        vmovdqa xmm0, xmm1
        ret
```

In contrast gcc does not use xmm1 at all:

```asm
test(float):
        vmovd   eax, xmm0
        mov     edx, 2139095040
        and     edx, eax
        cmp     edx, 2139095040
        je      .L5
        ret
.L5:
        vxorps  xmm0, xmm0, xmm0
        ret
```

https://gcc.godbolt.org/z/biNBFq
Matthias, do you have any suggestions for how to improve this?
Given these two equivalent IR files where I've flipped the condition and select operands, we produce better results with one than the other.

```llvm
define dso_local float @_Z4testf(float %x) local_unnamed_addr #0 {
entry:
  %0 = bitcast float %x to i32
  %and.i = and i32 %0, 2139095040
  %cmp.i = icmp eq i32 %and.i, 2139095040
  %cond = select i1 %cmp.i, float 0.000000e+00, float %x
  ret float %cond
}

define dso_local float @_Z4testf2(float %x) local_unnamed_addr #0 {
entry:
  %0 = bitcast float %x to i32
  %and.i = and i32 %0, 2139095040
  %cmp.i = icmp ne i32 %and.i, 2139095040
  %cond = select i1 %cmp.i, float %x, float 0.000000e+00
  ret float %cond
}
```
(In reply to Craig Topper from comment #2)
> Given these two equivalent IR files where I've flipped the condition and
> select operands, we produce better results with one than the other.
>
> define dso_local float @_Z4testf(float %x) local_unnamed_addr #0 {
> entry:
>   %0 = bitcast float %x to i32
>   %and.i = and i32 %0, 2139095040
>   %cmp.i = icmp eq i32 %and.i, 2139095040
>   %cond = select i1 %cmp.i, float 0.000000e+00, float %x
>   ret float %cond
> }
>
> define dso_local float @_Z4testf2(float %x) local_unnamed_addr #0 {
> entry:
>   %0 = bitcast float %x to i32
>   %and.i = and i32 %0, 2139095040
>   %cmp.i = icmp ne i32 %and.i, 2139095040
>   %cond = select i1 %cmp.i, float %x, float 0.000000e+00
>   ret float %cond
> }

For reference, instcombine is missing several potential select canonicalizations, but it does canonicalize this code already. Unfortunately, it's to the 'eq' version, and that's the one that has the poor codegen.
I assume with "bad codegen" you specifically mean the "vmovdqa xmm1, xmm0" / "vmovdqa xmm0, xmm1" combination. The problem at the MI level is that the code looks something like this after the CMOV got replaced with a jump:

```
bb.0:
  ...
  %4:fr32 = FsFLD0SS
  JE_1 %bb.2

bb.1:
  %0:fr32 = COPY $xmm0

bb.2:
  %5:fr32 = PHI %0:fr32, %bb.1, %4:fr32, %bb.0
```

Unfortunately bb.0->bb.2 is a critical edge, and we are forced to keep the FsFLD0SS instruction in bb.0. If I force CodeGen to split all critical edges we end up with better code:

```
$ llc -o - -phi-elim-split-all-critical-edges=1 -no-phi-elim-live-out-early-exit=1 test.ll
...
## %bb.0:
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movd    %xmm0, %eax
        andl    $2139095040, %eax       ## imm = 0x7F800000
        cmpl    $2139095040, %eax       ## imm = 0x7F800000
        jne     LBB0_2
## %bb.1:
        pxor    %xmm0, %xmm0
LBB0_2:
        popq    %rbp
        retq
```

I'm not up to date with the discussion of when we do or don't split critical edges, and it definitely isn't universally good to split them. This example, however, would clearly benefit from splitting, so maybe we can come up with a heuristic to catch this situation...
Or maybe we should change that CMOV expansion code to not produce critical edges. Given that the choice of edge is arbitrary, I'd rather not introduce any critical edge here to have more consistent codegen in that case...
I was just about to ask if we should do that.
Patch to avoid creating critical edges: https://reviews.llvm.org/D54119 Unfortunately it doesn't seem to help this case by itself.
Looks like the tail duplication code is making a bad decision because the instructions haven't been sunk yet... If I move the tail duplicator to run after machine sinking in TargetPassConfig then the example works as expected... I'm not sure though what the interactions of taildup with the other passes are and whether moving it is a good idea in general. Short of that, maybe we can do an extra round of machine sinking after we lower any cmovs? (Or implement some simple version of sinking inside the cmov lowering.)
And BTW: For me the example works with your patch regardless. Tail duplication makes the other edge critical so the `FsFLD0SS` can be sunk for me.
I was checking the output from the two test cases from comment 2. With my patch they switch behavior. So I guess it fixes the original case, but regresses some opposite of that case.
Yeah, as I explained in my other comment: short of the instruction actually being sunk into the blocks, it's probably random which edge the tail duplicator picks. That probably also explains some of the regressions in the tests you saw. I CCed Kyle; maybe he can make a statement as to the effects of moving early tail duplication later in the pipeline (at least after the pre-RA machine sinking)...
Kyle, have you had a chance to look at this?
I haven't been working on llvm in a while, but I can't think of any reason why moving early tail duplication after sinking would cause problems. If it improves benchmarks and it doesn't cause regressions, go for it.