[X86][SSE] Reduce FPU->GPR traffic by performing fcmp logic on FPU #50589
Comments
The ideal IR would insertelement, perform the cmp+logic as vector ops, and then extractelement, so I drafted that as a -vector-combine patch, but -instcombine has a general scalarizer that reduces that back down. We don't want to do ad-hoc vectorization in codegen, but creating the fake vector ops late is the only way I see to do this.
Especially now that we have costmodel-driven vectorcombine,
That's likely going to cause regressions. There's a good argument that scalarization is the right direction for early/canonical IR. This is another case (like with the late -simplifycfg options) where we'd like to distinguish between early/late IR. Another option would be to move/add -vector-combine as a really late IR pass, but I think, at least for this case, it's a simple enough transform that we can just leave it to SDAG combining. X86 already has some folds to turn regular bitwise logic into the SSE equivalent nodes.
Not sure if we want to leave this open to track enhancements, but we get the requested XMM logic op on all of the examples in the godbolt link and description after:
Thanks for working on this! Please can you ensure we have test coverage for more than 2 fcmps (with TODOs)? Then we can close this ticket and address the remaining TODOs (3+ fcmps, mixed float/double fcmp etc.) in fcmp-logic.ll one by one. e.g. bool f32cmp3(float x, float y, float z, float w) { define i1 @f32cmp3(float %0, float %1, float %2, float %3) { f32cmp3:
Added regression test corresponding to that example:
Extended Description
https://godbolt.org/z/o7zzn7Gc6
bool f32cmp2(float x, float y, float z, float w) {
return (x < y) != (z < w);
}
clang -g0 -O3 -march=btver2
f32cmp2:
vucomiss %xmm0, %xmm1
seta %cl
vucomiss %xmm2, %xmm3
seta %al
xorb %cl, %al
retq
We can reduce FPU->GPR traffic by using 2 x cmpss instead, performing the xor on the FPU, and then transferring just the result:
f32cmp2:
vcmpltss %xmm1, %xmm0, %xmm0
vcmpltss %xmm3, %xmm2, %xmm2
vxorps %xmm0, %xmm2, %xmm2
vmovd %xmm2, %eax
andb $1, %al
retq
https://llvm.godbolt.org/z/xKWrPo88f