[X86][SSE] Reduce FPU->GPR traffic by performing fcmp logic on FPU #50589

Closed
RKSimon opened this issue Jul 28, 2021 · 6 comments
Labels: backend:X86, bugzilla (issues migrated from bugzilla)

Comments

@RKSimon (Collaborator) commented Jul 28, 2021

Bugzilla Link: 51245
Resolution: FIXED
Resolved on: Sep 30, 2021 08:27
Version: trunk
OS: Windows NT
CC: @topperc, @LebedevRI, @RKSimon, @phoebewang, @rotateright
Fixed by commit(s): 09e71c3

Extended Description

https://godbolt.org/z/o7zzn7Gc6

bool f32cmp2(float x, float y, float z, float w) {
  return (x < y) != (z < w);
}

clang -g0 -O3 -march=btver2

f32cmp2:
        vucomiss %xmm0, %xmm1
        seta %cl
        vucomiss %xmm2, %xmm3
        seta %al
        xorb %cl, %al
        retq

We can reduce FPU->GPR traffic by using 2 x cmpss instead, performing the xor on the FPU and then transferring just the result:

f32cmp2:
        vcmpltss %xmm1, %xmm0, %xmm0
        vcmpltss %xmm3, %xmm2, %xmm2
        vxorps %xmm0, %xmm2, %xmm2
        vmovd %xmm2, %eax
        andb $1, %al
        retq

https://llvm.godbolt.org/z/xKWrPo88f
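
For context, the scalar IR for f32cmp2 is essentially two fcmps feeding an i1 xor (a hand-written sketch of the pattern, not verbatim clang output):

define i1 @f32cmp2(float %x, float %y, float %z, float %w) {
  %a = fcmp olt float %x, %y   ; x < y
  %b = fcmp olt float %z, %w   ; z < w
  %r = xor i1 %a, %b           ; (x < y) != (z < w)
  ret i1 %r
}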

@rotateright (Contributor)

The ideal IR would use insertelement, perform the cmp+logic as vector ops, and then extractelement, so I drafted that as a -vector-combine patch...but -instcombine has a general scalarizer that reduces that back down.

We don't want to do ad-hoc vectorization in codegen, but creating the fake vector ops late is the only way I see to do this.
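
A sketch of that "ideal" form for f32cmp2, assuming each scalar is widened into lane 0 of a 4-wide vector (the function name and vector width here are illustrative, not from the drafted patch):

define i1 @f32cmp2_vec(float %x, float %y, float %z, float %w) {
  ; Build single-lane vectors; lanes 1-3 are poison.
  %vx = insertelement <4 x float> poison, float %x, i32 0
  %vy = insertelement <4 x float> poison, float %y, i32 0
  %vz = insertelement <4 x float> poison, float %z, i32 0
  %vw = insertelement <4 x float> poison, float %w, i32 0
  ; The compares and the xor stay in the vector domain.
  %c1 = fcmp olt <4 x float> %vx, %vy
  %c2 = fcmp olt <4 x float> %vz, %vw
  %r = xor <4 x i1> %c1, %c2
  ; Pull the scalar answer back out of lane 0.
  %ret = extractelement <4 x i1> %r, i32 0
  ret i1 %ret
}

This is exactly the insert/compute/extract shape that instcombine's scalarizer folds back into scalar fcmp/xor.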

@LebedevRI (Member)

> The ideal IR would use insertelement, perform the cmp+logic as vector ops, and then extractelement, so I drafted that as a -vector-combine patch...but -instcombine has a general scalarizer that reduces that back down.
>
> We don't want to do ad-hoc vectorization in codegen, but creating the fake vector ops late is the only way I see to do this.

Especially now that we have costmodel-driven -vector-combine, perhaps we should re-evaluate vector scalarization in instcombine?

@rotateright (Contributor)

> Especially now that we have costmodel-driven -vector-combine, perhaps we should re-evaluate vector scalarization in instcombine?

That's likely going to cause regressions. There's a good argument that scalarization is the right direction for early/canonical IR.

This is another case (like with the late -simplifycfg options) where we'd like to distinguish between early/late IR.

Another option would be to move or add -vector-combine as a very late IR pass, but I think, at least for this case, it's a simple enough transform that we can just leave it to SDAG combining. X86 already has some folds to turn regular bitwise logic into the SSE equivalent nodes.

@rotateright (Contributor)

Not sure if we want to leave this open to track enhancements, but we get the requested XMM logic op on all of the examples in the godbolt link and description after:
https://reviews.llvm.org/rG09e71c367af3
https://reviews.llvm.org/D110342

@RKSimon (Collaborator, Author) commented Sep 28, 2021

Thanks for working on this!

Please can you ensure we have test coverage for more than 2 fcmps (with TODOs)?

Then we can close this ticket and address the remaining TODOs (3+ fcmps, mixed float/double fcmp etc.) in fcmp-logic.ll one by one.

e.g.

bool f32cmp3(float x, float y, float z, float w) {
  return ((x > 0) || (y > 0)) != (z < w);
}

define i1 @f32cmp3(float %0, float %1, float %2, float %3) {
  %5 = fcmp ogt float %0, 0.000000e+00
  %6 = fcmp ogt float %1, 0.000000e+00
  %7 = select i1 %5, i1 true, i1 %6
  %8 = fcmp olt float %2, %3
  %9 = xor i1 %7, %8
  ret i1 %9
}

f32cmp3:
        vxorps %xmm4, %xmm4, %xmm4
        vcmpltps %xmm1, %xmm4, %xmm1
        vcmpltps %xmm0, %xmm4, %xmm0
        vucomiss %xmm2, %xmm3
        vorps %xmm1, %xmm0, %xmm0
        vmovd %xmm0, %ecx
        seta %al
        xorb %cl, %al
        andb $1, %al
        retq
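
For the mixed float/double TODO, a minimal hypothetical test case (the function name is illustrative, not taken from fcmp-logic.ll) might look like:

define i1 @f32f64cmp(float %x, float %y, double %z, double %w) {
  %a = fcmp olt float %x, %y    ; float compare
  %b = fcmp olt double %z, %w   ; double compare
  %r = xor i1 %a, %b            ; logic op joining the two widths
  ret i1 %r
}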

@rotateright (Contributor)

> Please can you ensure we have test coverage for more than 2 fcmps (with TODOs)?
>
> Then we can close this ticket and address the remaining TODOs (3+ fcmps, mixed float/double fcmp etc.) in fcmp-logic.ll one by one.
>
> e.g.
>
> bool f32cmp3(float x, float y, float z, float w) {
>   return ((x > 0) || (y > 0)) != (z < w);
> }

Added a regression test corresponding to that example:
https://reviews.llvm.org/rG97948620b1ac

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 11, 2021.
This issue was closed.