[X86][SSE] Reduce FPU->GPR traffic by performing fcmp logic on FPU #50589

Closed
RKSimon opened this issue Jul 28, 2021 · 6 comments
Labels: backend:X86, bugzilla (issues migrated from bugzilla)

Comments

@RKSimon (Collaborator) commented Jul 28, 2021

Bugzilla Link: 51245
Resolution: FIXED
Resolved on: Sep 30, 2021 08:27
Version: trunk
OS: Windows NT
CC: @topperc, @LebedevRI, @RKSimon, @phoebewang, @rotateright
Fixed by commit(s): 09e71c3

Extended Description

https://godbolt.org/z/o7zzn7Gc6

bool f32cmp2(float x, float y, float z, float w) {
  return (x < y) != (z < w);
}

clang -g0 -O3 -march=btver2

f32cmp2:
        vucomiss %xmm0, %xmm1
        seta %cl
        vucomiss %xmm2, %xmm3
        seta %al
        xorb %cl, %al
        retq

We can reduce FPU->GPR traffic by using 2 x cmpss instead, performing the xor on the FPU and then transferring just the result:

f32cmp2:
        vcmpltss %xmm1, %xmm0, %xmm0
        vcmpltss %xmm3, %xmm2, %xmm2
        vxorps %xmm0, %xmm2, %xmm2
        vmovd %xmm2, %eax
        andb $1, %al
        retq

https://llvm.godbolt.org/z/xKWrPo88f
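
For context, the scalar IR for f32cmp2 is essentially two fcmps feeding an i1 xor (a hand-written sketch of the pattern, not verbatim clang output):

define i1 @f32cmp2(float %x, float %y, float %z, float %w) {
  %a = fcmp olt float %x, %y   ; x < y
  %b = fcmp olt float %z, %w   ; z < w
  %r = xor i1 %a, %b           ; (x < y) != (z < w)
  ret i1 %r
}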

@rotateright (Contributor)

The ideal IR would use insertelement, perform the cmp+logic as vector ops, and then extractelement, so I drafted that as a -vector-combine patch...but -instcombine has a general scalarizer that reduces that back down.

We don't want to do ad-hoc vectorization in codegen, but creating the fake vector ops late is the only way I see to do this.
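
A sketch of that "ideal" form for f32cmp2, assuming each scalar is widened into lane 0 of a 4-wide vector (the function name and vector width here are illustrative, not from the drafted patch):

define i1 @f32cmp2_vec(float %x, float %y, float %z, float %w) {
  ; Build single-lane vectors; lanes 1-3 are poison.
  %vx = insertelement <4 x float> poison, float %x, i32 0
  %vy = insertelement <4 x float> poison, float %y, i32 0
  %vz = insertelement <4 x float> poison, float %z, i32 0
  %vw = insertelement <4 x float> poison, float %w, i32 0
  ; The compares and the xor stay in the vector domain.
  %c1 = fcmp olt <4 x float> %vx, %vy
  %c2 = fcmp olt <4 x float> %vz, %vw
  %r = xor <4 x i1> %c1, %c2
  ; Pull the scalar answer back out of lane 0.
  %ret = extractelement <4 x i1> %r, i32 0
  ret i1 %ret
}

This is exactly the insert/compute/extract shape that instcombine's scalarizer folds back into scalar fcmp/xor.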

@LebedevRI (Member)

> The ideal IR would use insertelement, perform the cmp+logic as vector ops, and then extractelement, so I drafted that as a -vector-combine patch...but -instcombine has a general scalarizer that reduces that back down.
>
> We don't want to do ad-hoc vectorization in codegen, but creating the fake vector ops late is the only way I see to do this.

Especially now that we have costmodel-driven -vector-combine, perhaps we should re-evaluate vector scalarization in instcombine?

@rotateright (Contributor)

> Especially now that we have costmodel-driven -vector-combine, perhaps we should re-evaluate vector scalarization in instcombine?

That's likely going to cause regressions. There's a good argument that scalarization is the right direction for early/canonical IR.

This is another case (like with the late -simplifycfg options) where we'd like to distinguish between early/late IR.

Another option would be to move or add -vector-combine as a very late IR pass, but I think, at least for this case, it's a simple enough transform that we can just leave it to SDAG combining. X86 already has some folds to turn regular bitwise logic into the SSE equivalent nodes.

@rotateright (Contributor)

Not sure if we want to leave this open to track enhancements, but we get the requested XMM logic op on all of the examples in the godbolt link and description after:
https://reviews.llvm.org/rG09e71c367af3
https://reviews.llvm.org/D110342

@RKSimon (Collaborator, Author) commented Sep 28, 2021

Thanks for working on this!

Please can you ensure we have test coverage for more than 2 fcmps (with TODOs)?

Then we can close this ticket and address the remaining TODOs (3+ fcmps, mixed float/double fcmp etc.) in fcmp-logic.ll one by one.

e.g.

bool f32cmp3(float x, float y, float z, float w) {
  return ((x > 0) || (y > 0)) != (z < w);
}

define i1 @f32cmp3(float %0, float %1, float %2, float %3) {
  %5 = fcmp ogt float %0, 0.000000e+00
  %6 = fcmp ogt float %1, 0.000000e+00
  %7 = select i1 %5, i1 true, i1 %6
  %8 = fcmp olt float %2, %3
  %9 = xor i1 %7, %8
  ret i1 %9
}

f32cmp3:
        vxorps %xmm4, %xmm4, %xmm4
        vcmpltps %xmm1, %xmm4, %xmm1
        vcmpltps %xmm0, %xmm4, %xmm0
        vucomiss %xmm2, %xmm3
        vorps %xmm1, %xmm0, %xmm0
        vmovd %xmm0, %ecx
        seta %al
        xorb %cl, %al
        andb $1, %al
        retq
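
For the mixed float/double TODO, a minimal hypothetical test case (the function name is illustrative, not taken from fcmp-logic.ll) might look like:

define i1 @f32f64cmp(float %x, float %y, double %z, double %w) {
  %a = fcmp olt float %x, %y    ; float compare
  %b = fcmp olt double %z, %w   ; double compare
  %r = xor i1 %a, %b            ; logic op joining the two widths
  ret i1 %r
}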

@rotateright (Contributor)

> Please can you ensure we have test coverage for more than 2 fcmps (with TODOs)?
>
> Then we can close this ticket and address the remaining TODOs (3+ fcmps, mixed float/double fcmp etc.) in fcmp-logic.ll one by one.
>
> e.g.
>
> bool f32cmp3(float x, float y, float z, float w) {
>   return ((x > 0) || (y > 0)) != (z < w);
> }

Added a regression test corresponding to that example:
https://reviews.llvm.org/rG97948620b1ac

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 11, 2021.
This issue was closed.