LLVM Bugzilla is read-only and is the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 31857 - Doubling a single complex float produces unnecessarily inefficient code without -ffast-math
Summary: Doubling a single complex float produces unnecessarily inefficient code without -ffast-math
Status: NEW
Alias: None
Product: new-bugs
Classification: Unclassified
Component: new bugs
Version: trunk
Hardware: PC Linux
Importance: P normal
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-03 07:31 PST by drraph
Modified: 2019-10-10 10:03 PDT
8 users

See Also:
Fixed By Commit(s):


Description drraph 2017-02-03 07:31:09 PST
Consider

#include <complex.h>
complex float f(complex float x[]) {
  complex float p = 1.0;
  for (int i = 0; i < 1; i++)
    p += 2*x[i];
  return p;
}

This code simply doubles one complex float and adds 1.

In clang trunk with -O3  -march=core-avx2 you get

f:                                      # @f
        vmovss  xmm3, dword ptr [rdi + 4] # xmm3 = mem[0],zero,zero,zero
        vbroadcastss    xmm0, xmm3
        vmulps  xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vmovss  xmm2, dword ptr [rdi]   # xmm2 = mem[0],zero,zero,zero
        vbroadcastss    xmm1, xmm2
        vmovss  xmm4, dword ptr [rip + .LCPI0_1] # xmm4 = mem[0],zero,zero,zero
        vmulps  xmm1, xmm1, xmm4
        vsubps  xmm4, xmm1, xmm0
        vaddps  xmm1, xmm1, xmm0
        vblendps        xmm0, xmm4, xmm1, 2 # xmm0 = xmm4[0],xmm1[1],xmm4[2,3]
        vucomiss        xmm4, xmm4
        jnp     .LBB0_3
        vmovshdup       xmm1, xmm1      # xmm1 = xmm1[1,1,3,3]
        vucomiss        xmm1, xmm1
        jp      .LBB0_2
.LBB0_3:
        vmovss  xmm1, dword ptr [rip + .LCPI0_2] # xmm1 = mem[0],zero,zero,zero
        vaddps  xmm0, xmm0, xmm1
        ret
.LBB0_2:
        push    rax
        vmovss  xmm0, dword ptr [rip + .LCPI0_1] # xmm0 = mem[0],zero,zero,zero
        vxorps  xmm1, xmm1, xmm1
        call    __mulsc3
        add     rsp, 8
        jmp     .LBB0_3



Using the Intel Compiler with -O3  -march=core-avx2  -fp-model strict you get:

f:
        vmovsd    xmm0, QWORD PTR [rdi]                         #5.12
        vmulps    xmm2, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #5.12
        vmovsd    xmm1, QWORD PTR p.152.0.0.1[rip]              #3.19
        vaddps    xmm0, xmm1, xmm2                              #5.5
        ret    

as expected.

-fp-model strict tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations, and it enables floating-point exception semantics. It also disables fused multiply-add (FMA), which might not be relevant here.

If you turn on -ffast-math in clang trunk you get much better, although still not ideal, code:

f:                                      # @f
        vmovss  xmm0, dword ptr [rdi]   # xmm0 = mem[0],zero,zero,zero
        vmovss  xmm1, dword ptr [rdi + 4] # xmm1 = mem[0],zero,zero,zero
        vaddss  xmm1, xmm1, xmm1
        vmovss  xmm2, dword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero,zero,zero
        vfmadd213ss     xmm2, xmm0, dword ptr [rip + .LCPI0_1]
        vinsertps       xmm0, xmm2, xmm1, 16 # xmm0 = xmm2[0],xmm1[0],xmm2[2,3]
        ret
Comment 1 drraph 2017-02-03 07:46:36 PST
A much better example is this:


#include <complex.h>
complex float f(complex float x) {
  return 2*x;
}

clang trunk gives (using the same options)

f:                                      # @f
        vmovaps xmm2, xmm0
        vaddps  xmm1, xmm2, xmm2
        vpermilps       xmm0, xmm2, 225 # xmm0 = xmm2[1,0,2,3]
        vxorps  xmm3, xmm3, xmm3
        vmulps  xmm3, xmm0, xmm3
        vsubps  xmm0, xmm1, xmm3
        vaddps  xmm1, xmm1, xmm3
        vucomiss        xmm0, xmm0
        jnp     .LBB0_2
        vmovshdup       xmm3, xmm1      # xmm3 = xmm1[1,1,3,3]
        vucomiss        xmm3, xmm3
        jp      .LBB0_3
.LBB0_2:
        vblendps        xmm0, xmm0, xmm1, 2 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
        ret
.LBB0_3:
        vmovshdup       xmm3, xmm2      # xmm3 = xmm2[1,1,3,3]
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        vxorps  xmm1, xmm1, xmm1
        jmp     __mulsc3                # TAILCALL

Much better would be:

f:
        vmulps    xmm0, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #3.12
        ret
Comment 2 drraph 2017-02-03 08:18:26 PST
A final comment. Adding -ffast-math to clang does give

f:                                      # @f
        vaddps  xmm0, xmm0, xmm0
        ret

as expected.

As far as I can tell, we *shouldn't* need -ffast-math to get this optimisation.
Comment 3 Mehdi Amini 2017-02-03 16:23:04 PST
Apparently we promote the int 2 to a complex and perform a cross product in the expression `2*c`; however, if the source is written as `2.f * c`, we don't perform any cross product.
The optimizer can't optimize the cross-product as efficiently without fast-math because it has to account for NaN.

Tim found that this refactoring commit may be the culprit that incorrectly promotes 2 to a complex in the multiplication:

 https://github.com/llvm-mirror/clang/commit/8289f49c0c0
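For comparison, the float-literal variant described above (our rewrite of the reporter's example, not code from the report) keeps the multiplication in the real domain, so no cross terms and no __mulsc3 call are generated:

```c
#include <complex.h>

/* With a float literal there is no int-to-complex promotion: per the
   usual arithmetic conversions, a real times a complex is computed
   componentwise, so clang emits a plain vector multiply. */
complex float double_f(complex float x) {
  return 2.f * x;
}
```

Numerically both forms give the same result; only the promotion path differs.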
Comment 4 Michael Kuperstein 2017-02-03 16:40:33 PST
That's only half the problem. 

1) 
Consider:

#include <complex.h>
complex float f(complex float x) {
  return (2.f + 0 * I) * x;
}

ICC is fine with this and generates a single mul - I guess it figures it can disregard NaNs.
GCC falls down just like clang, and calls __mulsc3.

2) When building the original code (with 2, not 2.f) with -ffinite-math-only we get:

f:
        vmovshdup       %xmm0, %xmm1
        vaddss  %xmm1, %xmm1, %xmm2
        vxorps  %xmm3, %xmm3, %xmm3
        vmulss  %xmm3, %xmm0, %xmm4
        vaddss  %xmm4, %xmm2, %xmm2
        vaddss  %xmm0, %xmm0, %xmm0
        vmulss  %xmm3, %xmm1, %xmm1
        vsubss  %xmm1, %xmm0, %xmm0
        vinsertps       $16, %xmm2, %xmm0, %xmm0
        retq

The IR looks like:
define <2 x float> @f(<2 x float>) local_unnamed_addr #0 {
  %2 = extractelement <2 x float> %0, i32 0
  %3 = extractelement <2 x float> %0, i32 1
  %4 = fmul nnan ninf float %3, 2.000000e+00
  %5 = fmul nnan ninf float %2, 0.000000e+00
  %6 = fadd nnan ninf float %4, %5
  %7 = fmul nnan ninf float %2, 2.000000e+00
  %8 = fmul nnan ninf float %3, 0.000000e+00
  %9 = fsub nnan ninf float %7, %8
  %10 = insertelement <2 x float> undef, float %9, i32 0
  %11 = insertelement <2 x float> %10, float %6, i32 1
  ret <2 x float> %11
}

Shouldn't we be able to optimize away `fmul nnan ninf float %2, 0.000000e+00`? Or do we really need `fast` for this?
Comment 5 Mehdi Amini 2017-02-03 16:42:55 PST
"No NaNs - Allow optimizations to assume the arguments and result are not NaN. Such optimizations are required to retain defined behavior over NaNs, but the value of the result is undefined."

So I'd say clearly yes :)

The fast-math flags are likely not checked individually everywhere in the optimizer (i.e., transforms tend to check only whether we have "fast").
Comment 6 Michael Kuperstein 2017-02-03 16:44:29 PST
What I meant was - is there a reason *except* nans and infs not to optimize this out? But I guess not, and you're right, we're just not testing the flags individually because we never bothered.
Comment 7 Tim Northover 2017-02-03 16:46:35 PST
Just to add chapter & verse from C99...

6.3.1.8 (the usual arithmetic conversions) says:

"Otherwise, if the corresponding real type of either operand is float, the other operand is converted, without change of type domain, to a type whose corresponding real type is float.[51]
[...]
[51] For example, addition of a double _Complex and a float entails just the conversion of the float operand to double (and yields a double _Complex result)."

The "type domain" referred to is precisely the complex/real division. So that's pretty clear that the int shouldn't be converted but still doesn't actually say how you multiply a real and a complex.

For that, the only reference seems to be the (informative rather than normative) Appendix G. G.5.1 says:

"If the operands are not both complex, then the result and floating-point exception behavior of the * operator is defined by the usual mathematical formula:
[...] x(u + iv) = (xu) + i(xv)"

which skips the cross-wise terms entirely.
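The G.5.1 formula can be written directly in C (the helper name and the use of C11's CMPLXF are ours, not from the standard or the report): a real times a complex is computed componentwise, with no cross terms and hence no 0*Inf or NaN bookkeeping.

```c
#include <complex.h>

/* Annex G G.5.1: x(u + iv) = (xu) + i(xv), componentwise, no cross terms. */
static complex float real_times_complex(float x, complex float z) {
  return CMPLXF(x * crealf(z), x * cimagf(z));
}
```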
Comment 8 Sanjay Patel 2017-02-05 16:50:13 PST
(In reply to comment #6)
> Shouldn't we be able to optimize away "fmul nnan ninf float %2, 0.000000e+00"? 
> What I meant was - is there a reason *except* nans and infs not to optimize
> this out? But I guess not, and you're right, we're just not testing the
> flags individually because we never bothered.

Yes, there is a reason: -0.0. We didn't say that it's ok to ignore signed zero (nsz), so -0.0 * 0.0 is not the same as 0.0 * 0.0.

In general, I think you're right: we don't test the individual flags as much as we should. But in this case, SimplifyFMulInst already does the right thing:
 // fmul nnan nsz X, 0 ==> 0
 if (FMF.noNaNs() && FMF.noSignedZeros() && match(Op1, m_AnyZero()))
   return Op1;