LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Bug 51245 - [X86][SSE] Reduce FPU->GPR traffic by performing fcmp logic on FPU
Summary: [X86][SSE] Reduce FPU->GPR traffic by performing fcmp logic on FPU
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Backend: X86
Version: trunk
Hardware: PC Windows NT
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-07-28 02:52 PDT by Simon Pilgrim
Modified: 2021-09-30 08:27 PDT
CC List: 6 users

See Also:
Fixed By Commit(s): 09e71c367af3



Description Simon Pilgrim 2021-07-28 02:52:29 PDT
https://godbolt.org/z/o7zzn7Gc6

bool f32cmp2(float x, float y, float z, float w) {
    return (x < y) != (z < w);
}

clang -g0 -O3 -march=btver2 

f32cmp2:
  vucomiss %xmm0, %xmm1
  seta %cl
  vucomiss %xmm2, %xmm3
  seta %al
  xorb %cl, %al
  retq
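
For reference, the scalar IR feeding the backend here looks roughly like this (a sketch of clang -O3 output for f32cmp2; value names are illustrative):

define i1 @f32cmp2(float %x, float %y, float %z, float %w) {
  ; two scalar fcmps whose i1 results are xor'ed, forcing FPU->GPR moves
  %a = fcmp olt float %x, %y
  %b = fcmp olt float %z, %w
  %r = xor i1 %a, %b
  ret i1 %r
}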

We can reduce FPU->GPR traffic by using 2 x cmpss instead, performing the xor on the FPU and then transferring just the result:

f32cmp2:
  vcmpltss %xmm1, %xmm0, %xmm0
  vcmpltss %xmm3, %xmm2, %xmm2
  vxorps %xmm0, %xmm2, %xmm2
  vmovd %xmm2, %eax
  andb $1, %al
  retq

https://llvm.godbolt.org/z/xKWrPo88f
Comment 1 Sanjay Patel 2021-08-13 10:41:51 PDT
The ideal IR would insertelement, perform the cmp+logic as vector ops, and then extractelement, so I drafted that as a -vector-combine patch...but -instcombine has a general scalarizer that reduces that back down.

We don't want to do ad-hoc vectorization in codegen, but creating the fake vector ops late is the only way I see how to do this.
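
For illustration, the "fake vector" shape would look something like this (a sketch only; the element count and value names are assumptions, and instcombine's scalarizer currently folds this back to the scalar form above):

define i1 @f32cmp2_vec(float %x, float %y, float %z, float %w) {
  ; widen each scalar into a vector with only lane 0 used...
  %vx = insertelement <4 x float> poison, float %x, i64 0
  %vy = insertelement <4 x float> poison, float %y, i64 0
  %vz = insertelement <4 x float> poison, float %z, i64 0
  %vw = insertelement <4 x float> poison, float %w, i64 0
  ; ...so the cmp+xor stay in the FP/vector domain...
  %c1 = fcmp olt <4 x float> %vx, %vy
  %c2 = fcmp olt <4 x float> %vz, %vw
  %x1 = xor <4 x i1> %c1, %c2
  ; ...and only the final i1 crosses back to a scalar register
  %r = extractelement <4 x i1> %x1, i64 0
  ret i1 %r
}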
Comment 2 Roman Lebedev 2021-08-13 10:49:03 PDT
(In reply to Sanjay Patel from comment #1)
> The ideal IR would insertelement, perform the cmp+logic as vector ops, and
> then extractelement, so I drafted that as a -vector-combine patch...but
> -instcombine has a general scalarizer that reduces that back down.
> 
> We don't want to do ad-hoc vectorization in codegen, but creating the fake
> vector ops late is the only way I see how to do this.

Especially now that we have costmodel-driven vectorcombine,
perhaps we should reevaluate vector scalarization in instcombine?
Comment 3 Sanjay Patel 2021-08-13 10:59:50 PDT
(In reply to Roman Lebedev from comment #2)
> Especially now that we have costmodel-driven vectorcombine,
> perhaps we should reevaluate vector scalarization in instcombine?

That's likely going to cause regressions. There's a good argument that scalarization is the right direction for early/canonical IR. 

This is another case (like with the late -simplifycfg options) where we'd like to distinguish between early/late IR.

Another option would be to move/add -vector-combine as a really late IR pass, but I think at least for this case, it's a simple enough transform that we just leave it to SDAG combining. X86 already has some folds to turn regular bitwise logic into the SSE equivalent nodes.
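
For example, integer logic on bitcast FP values is one such pattern; a minimal sketch of an input that gets the SSE-logic treatment (function name assumed):

define float @fxor(float %a, float %b) {
  ; xor on the raw bit patterns of two floats; the backend can select
  ; this as a single xorps instead of bouncing through a GPR
  %ia = bitcast float %a to i32
  %ib = bitcast float %b to i32
  %ix = xor i32 %ia, %ib
  %r = bitcast i32 %ix to float
  ret float %r
}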
Comment 4 Sanjay Patel 2021-09-28 10:39:39 PDT
Not sure if we want to leave this open to track enhancements, but we get the requested XMM logic op on all of the examples in the godbolt link and description after:
https://reviews.llvm.org/rG09e71c367af3
https://reviews.llvm.org/D110342
Comment 5 Simon Pilgrim 2021-09-28 11:58:32 PDT
Thanks for working on this! 

Please can you ensure we have test coverage for more than 2 fcmps (with TODOs)? 

Then we can close this ticket and address the remaining TODOs (3+ fcmps, mixed float/double fcmp etc.) in fcmp-logic.ll one by one.

e.g.

bool f32cmp3(float x, float y, float z, float w) {
    return ((x > 0) || (y > 0)) != (z < w);
}


define i1 @f32cmp3(float %0, float %1, float %2, float %3) {
  %5 = fcmp ogt float %0, 0.000000e+00
  %6 = fcmp ogt float %1, 0.000000e+00
  %7 = select i1 %5, i1 true, i1 %6
  %8 = fcmp olt float %2, %3
  %9 = xor i1 %7, %8
  ret i1 %9
}

f32cmp3:
  vxorps %xmm4, %xmm4, %xmm4
  vcmpltps %xmm1, %xmm4, %xmm1
  vcmpltps %xmm0, %xmm4, %xmm0
  vucomiss %xmm2, %xmm3
  vorps %xmm1, %xmm0, %xmm0
  vmovd %xmm0, %ecx
  seta %al
  xorb %cl, %al
  andb $1, %al
  retq
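
And for the mixed-type TODO, a hypothetical float/double case of the kind those tests would cover (names illustrative):

define i1 @f32f64cmp(float %x, float %y, double %z, double %w) {
  ; one float fcmp and one double fcmp feeding the same i1 xor
  %a = fcmp olt float %x, %y
  %b = fcmp olt double %z, %w
  %r = xor i1 %a, %b
  ret i1 %r
}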
Comment 6 Sanjay Patel 2021-09-30 08:27:54 PDT
(In reply to Simon Pilgrim from comment #5)
> Please can you ensure we have test coverage for more than 2 fcmps (with
> TODOs)? 
> 
> Then we can close this ticket and address the remaining TODOs (3+ fcmps,
> mixed float/double fcmp etc.) in fcmp-logic.ll one by one.
> 
> e.g.
> 
> bool f32cmp3(float x, float y, float z, float w) {
>     return ((x > 0) || (y > 0)) != (z < w);
> }

Added regression test corresponding to that example:
https://reviews.llvm.org/rG97948620b1ac