https://godbolt.org/z/o7zzn7Gc6

bool f32cmp2(float x, float y, float z, float w) {
  return (x < y) != (z < w);
}

clang -g0 -O3 -march=btver2

f32cmp2:
        vucomiss        %xmm0, %xmm1
        seta    %cl
        vucomiss        %xmm2, %xmm3
        seta    %al
        xorb    %cl, %al
        retq

We can reduce fpu->gpr traffic by using 2 x cmpss instead, performing the
xor on the fpu and then just transferring the result:

f32cmp2:
        vcmpltss        %xmm1, %xmm0, %xmm0
        vcmpltss        %xmm3, %xmm2, %xmm2
        vxorps  %xmm0, %xmm2, %xmm2
        vmovd   %xmm2, %eax
        andb    $1, %al

https://llvm.godbolt.org/z/xKWrPo88f
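The rewrite is sound because a scalar compare produces an all-ones/all-zeros mask (as vcmpltss does in its low lane, including all-zeros for NaN under the ordered predicate), and XORing two such masks then keeping the low bit matches comparing the two i1 predicates for inequality. A minimal C sketch of that equivalence (cmplt_mask is a hypothetical helper modeling one cmpss lane, not a real intrinsic):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one vcmpltss lane: all-ones when a < b,
   all-zeros otherwise (also all-zeros for NaN, matching the ordered
   predicate). */
static uint32_t cmplt_mask(float a, float b) {
    return (a < b) ? 0xFFFFFFFFu : 0u;
}

/* Proposed lowering: xor the masks on the fpu side, then move the
   result to a GPR and keep the low bit. */
static int f32cmp2_masked(float x, float y, float z, float w) {
    return (int)((cmplt_mask(x, y) ^ cmplt_mask(z, w)) & 1u);
}

/* Reference scalar form from the report. */
static int f32cmp2_ref(float x, float y, float z, float w) {
    return (x < y) != (z < w);
}
```

Since each mask is either all-ones or all-zeros, any single bit of the XOR carries the full predicate inequality, which is why the final andb $1 suffices.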
The ideal IR would use insertelement to build the vectors, perform the
cmp+logic as vector ops, and then extractelement the result, so I drafted
that as a -vector-combine patch...but -instcombine has a general scalarizer
that reduces that back down.

We don't want to do ad-hoc vectorization in codegen, but creating the fake
vector ops late is the only way I see to do this.
(In reply to Sanjay Patel from comment #1)
> The ideal IR would insertelement, perform the cmp+logic as vector ops, and
> then extractelement, so I drafted that as a -vector-combine patch...but
> -instcombine has a general scalarizer that reduces that back down.
>
> We don't want to do ad-hoc vectorization in codegen, but creating the fake
> vector ops late is the only way I see how to do this.

Especially now that we have costmodel-driven vectorcombine,
perhaps we should reevaluate vector scalarization in instcombine?
(In reply to Roman Lebedev from comment #2)
> Especially now that we have costmodel-driven vectorcombine,
> perhaps we should reevaluate vector scalarization in instcombine?

That's likely going to cause regressions. There's a good argument that
scalarization is the right direction for early/canonical IR. This is another
case (like with the late -simplifycfg options) where we'd like to distinguish
between early and late IR.

Another option would be to move/add -vector-combine as a really late IR pass,
but I think at least for this case, it's a simple enough transform that we
can just leave it to SDAG combining. X86 already has some folds to turn
regular bitwise logic into the SSE equivalent nodes.
Not sure if we want to leave this open to track enhancements, but we get the
requested XMM logic op on all of the examples in the godbolt link and
description after:

https://reviews.llvm.org/rG09e71c367af3
https://reviews.llvm.org/D110342
Thanks for working on this!

Please can you ensure we have test coverage for more than 2 fcmps (with
TODOs)? Then we can close this ticket and address the remaining TODOs
(3+ fcmps, mixed float/double fcmp etc.) in fcmp-logic.ll one by one.

e.g.

bool f32cmp3(float x, float y, float z, float w) {
  return ((x > 0) || (y > 0)) != (z < w);
}

define i1 @f32cmp3(float %0, float %1, float %2, float %3) {
  %5 = fcmp ogt float %0, 0.000000e+00
  %6 = fcmp ogt float %1, 0.000000e+00
  %7 = select i1 %5, i1 true, i1 %6
  %8 = fcmp olt float %2, %3
  %9 = xor i1 %7, %8
  ret i1 %9
}

f32cmp3:
        vxorps  %xmm4, %xmm4, %xmm4
        vcmpltps        %xmm1, %xmm4, %xmm1
        vcmpltps        %xmm0, %xmm4, %xmm0
        vucomiss        %xmm2, %xmm3
        vorps   %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %ecx
        seta    %al
        xorb    %cl, %al
        andb    $1, %al
        retq
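The same mask-domain reasoning should extend to the 3-fcmp case: the select in the IR is just a logical OR of the first two predicates, so ORing the first two compare masks and XORing with the third yields the same low bit as the scalar expression. A C sketch of that identity (the *_mask helpers are hypothetical single-lane models, not real intrinsics):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-lane compare masks: all-ones on true, else zero. */
static uint32_t cmpgt_mask(float a, float b) {
    return (a > b) ? 0xFFFFFFFFu : 0u;
}

static uint32_t cmplt_mask(float a, float b) {
    return (a < b) ? 0xFFFFFFFFu : 0u;
}

/* Mask-domain form of f32cmp3: OR the first two masks (the select in
   the IR), XOR with the third, then keep the low bit. */
static int f32cmp3_masked(float x, float y, float z, float w) {
    uint32_t m = (cmpgt_mask(x, 0.0f) | cmpgt_mask(y, 0.0f))
                 ^ cmplt_mask(z, w);
    return (int)(m & 1u);
}

/* Reference scalar form from the example above. */
static int f32cmp3_ref(float x, float y, float z, float w) {
    return ((x > 0) || (y > 0)) != (z < w);
}
```

This is only an argument that the fold is legal; whether keeping all three compares in the XMM domain is profitable would still be a cost-model question per target.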
(In reply to Simon Pilgrim from comment #5)
> Please can you ensure we have test coverage for more than 2 fcmps (with
> TODOs)?
>
> Then we can close this ticket and address the remaining TODOs (3+ fcmps,
> mixed float/double fcmp etc.) in fcmp-logic.ll one by one.
>
> e.g.
>
> bool f32cmp3(float x, float y, float z, float w) {
>   return ((x > 0) || (y > 0)) != (z < w);
> }

Added a regression test corresponding to that example:
https://reviews.llvm.org/rG97948620b1ac