Consider:

```c
#include <complex.h>

complex float f(complex float x[]) {
    complex float p = 1.0;
    for (int i = 0; i < 1; i++)
        p += 2*x[i];
    return p;
}
```

This code simply doubles one complex float and adds 1. With clang trunk at -O3 -march=core-avx2 you get:

```asm
f:                                      # @f
        vmovss  xmm3, dword ptr [rdi + 4]        # xmm3 = mem[0],zero,zero,zero
        vbroadcastss xmm0, xmm3
        vmulps  xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vmovss  xmm2, dword ptr [rdi]            # xmm2 = mem[0],zero,zero,zero
        vbroadcastss xmm1, xmm2
        vmovss  xmm4, dword ptr [rip + .LCPI0_1] # xmm4 = mem[0],zero,zero,zero
        vmulps  xmm1, xmm1, xmm4
        vsubps  xmm4, xmm1, xmm0
        vaddps  xmm1, xmm1, xmm0
        vblendps xmm0, xmm4, xmm1, 2             # xmm0 = xmm4[0],xmm1[1],xmm4[2,3]
        vucomiss xmm4, xmm4
        jnp     .LBB0_3
        vmovshdup xmm1, xmm1                     # xmm1 = xmm1[1,1,3,3]
        vucomiss xmm1, xmm1
        jp      .LBB0_2
.LBB0_3:
        vmovss  xmm1, dword ptr [rip + .LCPI0_2] # xmm1 = mem[0],zero,zero,zero
        vaddps  xmm0, xmm0, xmm1
        ret
.LBB0_2:
        push    rax
        vmovss  xmm0, dword ptr [rip + .LCPI0_1] # xmm0 = mem[0],zero,zero,zero
        vxorps  xmm1, xmm1, xmm1
        call    __mulsc3
        add     rsp, 8
        jmp     .LBB0_3
```

Using the Intel Compiler with -O3 -march=core-avx2 -fp-model strict you get:

```asm
f:
        vmovsd  xmm0, QWORD PTR [rdi]                             #5.12
        vmulps  xmm2, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #5.12
        vmovsd  xmm1, QWORD PTR p.152.0.0.1[rip]                  #3.19
        vaddps  xmm0, xmm1, xmm2                                  #5.5
        ret
```

as expected. The -fp-model strict option tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations, and enables floating-point exception semantics. It also turns off fused multiply-add contraction, which might not be relevant here.
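To make the expected result concrete, here is a scalar reference version of f (a hypothetical helper, not from the report): the loop runs exactly once, so the function computes 1 + 2*x[0].

```c
#include <assert.h>
#include <complex.h>

/* Equivalent scalar form of f() above: the loop body executes once,
   so the result is simply 1 + 2*x[0]. */
static complex float f_ref(const complex float x[]) {
    return 1.0f + 2.0f * x[0];
}
```

For example, `f_ref` applied to `3 + 4i` yields `7 + 8i`, matching what both compilers compute.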
If you turn on -ffast-math in clang trunk, you do get much better, although still not ideal, code:

```asm
f:                                      # @f
        vmovss  xmm0, dword ptr [rdi]            # xmm0 = mem[0],zero,zero,zero
        vmovss  xmm1, dword ptr [rdi + 4]        # xmm1 = mem[0],zero,zero,zero
        vaddss  xmm1, xmm1, xmm1
        vmovss  xmm2, dword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero,zero,zero
        vfmadd213ss xmm2, xmm0, dword ptr [rip + .LCPI0_1]
        vinsertps xmm0, xmm2, xmm1, 16           # xmm0 = xmm2[0],xmm1[0],xmm2[2,3]
        ret
```
A much better example is this:

```c
#include <complex.h>

complex float f(complex float x) {
    return 2*x;
}
```

clang trunk gives (using the same options):

```asm
f:                                      # @f
        vmovaps xmm2, xmm0
        vaddps  xmm1, xmm2, xmm2
        vpermilps xmm0, xmm2, 225                # xmm0 = xmm2[1,0,2,3]
        vxorps  xmm3, xmm3, xmm3
        vmulps  xmm3, xmm0, xmm3
        vsubps  xmm0, xmm1, xmm3
        vaddps  xmm1, xmm1, xmm3
        vucomiss xmm0, xmm0
        jnp     .LBB0_2
        vmovshdup xmm3, xmm1                     # xmm3 = xmm1[1,1,3,3]
        vucomiss xmm3, xmm3
        jp      .LBB0_3
.LBB0_2:
        vblendps xmm0, xmm0, xmm1, 2             # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
        ret
.LBB0_3:
        vmovshdup xmm3, xmm2                     # xmm3 = xmm2[1,1,3,3]
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        vxorps  xmm1, xmm1, xmm1
        jmp     __mulsc3                         # TAILCALL
```

Much better would be:

```asm
f:
        vmulps  xmm0, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip]  #3.12
        ret
```
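The ideal code above is a single packed multiply of {re, im} by the constant {2.0f, 2.0f} (ICC's vmulps against .L_2il0floatpacket.1). A rough sketch of that operation with SSE intrinsics, assuming an x86 target (the helper name is mine, not from the report):

```c
#include <assert.h>
#include <immintrin.h>

/* Doubles a complex float held as {re, im} with one packed multiply,
   mirroring the single-vmulps codegen ICC produces. */
static void double_complex(const float in[2], float out[2]) {
    __m128 v = _mm_setr_ps(in[0], in[1], 0.0f, 0.0f);
    v = _mm_mul_ps(v, _mm_set1_ps(2.0f));            /* {2re, 2im, 0, 0} */
    out[0] = _mm_cvtss_f32(v);                       /* lane 0: real part */
    out[1] = _mm_cvtss_f32(_mm_shuffle_ps(v, v, 1)); /* lane 1: imag part */
}
```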
A final comment. Adding -ffast-math to clang does give

```asm
f:                                      # @f
        vaddps  xmm0, xmm0, xmm0
        ret
```

as expected. As far as I can tell, we *shouldn't* need -ffast-math to get this optimisation.
Apparently we promote the int 2 to a complex and perform a cross product in the expression `2*c`; however, if the source is written as `2.f * c`, we don't perform any cross product. The optimizer can't optimize the cross product as efficiently without fast-math because it has to account for NaNs. Tim found that this refactoring commit may be the culprit that incorrectly promotes 2 to a complex in the multiplication: https://github.com/llvm-mirror/clang/commit/8289f49c0c0
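The two spellings mentioned above can be compared directly (hypothetical helper names, written for illustration): with the int literal, clang promotes 2 to complex and emits the full cross-product multiply; with the float literal, the operand stays real. For finite inputs both compute the same value, which is what makes the extra codegen so unnecessary.

```c
#include <assert.h>
#include <complex.h>

/* Int literal: 2 gets promoted to complex, triggering the full
   complex-by-complex multiply with cross terms and NaN handling. */
static complex float twice_int(complex float c)   { return 2 * c; }

/* Float literal: a plain real-by-complex multiply, no cross terms. */
static complex float twice_float(complex float c) { return 2.f * c; }
```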
That's only half the problem.

1) Consider:

```c
#include <complex.h>

complex float f(complex float x) {
    return (2.f + 0 * I) * x;
}
```

ICC is fine with this and generates a single mul - I guess it figures it can disregard NaNs. GCC falls down just like clang, and calls __mulsc3.

2) When building the original code (with 2, not 2.f) with -ffinite-math-only we get:

```asm
f:
        vmovshdup %xmm0, %xmm1
        vaddss  %xmm1, %xmm1, %xmm2
        vxorps  %xmm3, %xmm3, %xmm3
        vmulss  %xmm3, %xmm0, %xmm4
        vaddss  %xmm4, %xmm2, %xmm2
        vaddss  %xmm0, %xmm0, %xmm0
        vmulss  %xmm3, %xmm1, %xmm1
        vsubss  %xmm1, %xmm0, %xmm0
        vinsertps $16, %xmm2, %xmm0, %xmm0
        retq
```

The IR looks like:

```llvm
define <2 x float> @f(<2 x float>) local_unnamed_addr #0 {
  %2 = extractelement <2 x float> %0, i32 0
  %3 = extractelement <2 x float> %0, i32 1
  %4 = fmul nnan ninf float %3, 2.000000e+00
  %5 = fmul nnan ninf float %2, 0.000000e+00
  %6 = fadd nnan ninf float %4, %5
  %7 = fmul nnan ninf float %2, 2.000000e+00
  %8 = fmul nnan ninf float %3, 0.000000e+00
  %9 = fsub nnan ninf float %7, %8
  %10 = insertelement <2 x float> undef, float %9, i32 0
  %11 = insertelement <2 x float> %10, float %6, i32 1
  ret <2 x float> %11
}
```

Shouldn't we be able to optimize away "fmul nnan ninf float %2, 0.000000e+00"? Or do we really need fast for this?
"No NaNs - Allow optimizations to assume the arguments and result are not NaN. Such optimizations are required to retain defined behavior over NaNs, but the value of the result is undefined." So I'd say clearly yes :) The fast-math flags are likely not used individually everywhere in the optimizer (i.e. transforms will check if we have "fast").
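To see why the nnan flag is required to fold `x * 0.0f` to `0.0f` in the first place: under IEEE 754 a NaN operand must propagate through the multiply, so the fold is only legal once the optimizer may assume no NaNs. A minimal illustration (the helper name is mine):

```c
#include <assert.h>
#include <math.h>

/* Without nnan, this cannot be folded to 0.0f: a NaN input must
   produce a NaN result under IEEE 754 semantics. */
static float mul_by_zero(float x) { return x * 0.0f; }
```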
What I meant was - is there a reason *except* nans and infs not to optimize this out? But I guess not, and you're right, we're just not testing the flags individually because we never bothered.
Just to add chapter & verse from C99...

6.3.1.8 (the usual arithmetic conversions) says:

"Otherwise, if the corresponding real type of either operand is float, the other operand is converted, without change of type domain, to a type whose corresponding real type is float.[51] [...]

[51] For example, addition of a double _Complex and a float entails just the conversion of the float operand to double (and yields a double _Complex result)."

The "type domain" referred to is precisely the complex/real division. So that's pretty clear that the int shouldn't be converted to complex, but it still doesn't actually say how you multiply a real and a complex. For that, the only reference seems to be the (informative rather than normative) Annex G. G.5.1 says:

"If the operands are not both complex, then the result and floating-point exception behavior of the * operator is defined by the usual mathematical formula: [...] x(u + iv) = (xu) + i(xv)"

which skips the cross-wise terms entirely.
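Writing out G.5.1's formula directly makes the point: the real-by-complex multiply has no cross-wise 0*u or 0*v terms at all, so there is nothing that could produce a spurious NaN. A sketch (the helper name is mine, not from the standard):

```c
#include <assert.h>
#include <complex.h>

/* Annex G.5.1's real-times-complex formula, x(u + iv) = (xu) + i(xv):
   just two scalar multiplies, no cross terms. */
static complex float real_times_complex(float x, complex float z) {
    return x * crealf(z) + x * cimagf(z) * I;
}
```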
(In reply to comment #6)
> Shouldn't we be able to optimize away "fmul nnan ninf float %2, 0.000000e+00"?
> What I meant was - is there a reason *except* nans and infs not to optimize
> this out? But I guess not, and you're right, we're just not testing the
> flags individually because we never bothered.

Yes, there is a reason: -0.0. We didn't say that it's ok to ignore signed zero (nsz), so -0.0 * 0.0 is not the same as 0.0 * 0.0.

In general, I think you're right: we don't test the individual flags as much as we should. But in this case, SimplifyFMulInst already does the right thing:

```cpp
// fmul nnan nsz X, 0 ==> 0
if (FMF.noNaNs() && FMF.noSignedZeros() && match(Op1, m_AnyZero()))
  return Op1;
```
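The signed-zero point is easy to demonstrate at runtime (hypothetical helper, for illustration): with a negative operand, `x * 0.0f` yields -0.0f, so folding it to +0.0f would flip the sign bit. That is exactly why the fold above also demands nsz.

```c
#include <assert.h>
#include <math.h>

/* Folding this to +0.0f would be wrong without nsz: a negative x
   (or -0.0f) makes the IEEE 754 product -0.0f, not +0.0f. */
static float mul_zero(float x) { return x * 0.0f; }
```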