LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 36617 - Missing float truncation rounding patterns
Summary: Missing float truncation rounding patterns
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Common Code Generator Code
Version: trunk
Hardware: PC Windows NT
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-03-06 07:16 PST by Simon Pilgrim
Modified: 2020-05-24 07:01 PDT
7 users

See Also:
Fixed By Commit(s): d04db4825a4d, 7f4ff782d406, cceb630a07cc


Attachments
testloop-subnormal-cvt.asm (2.74 KB, text/plain)
2020-04-01 09:13 PDT, Peter Cordes

Description Simon Pilgrim 2018-03-06 07:16:53 PST
Float/double rounding patterns could be converted to truncation rounding instructions (roundss on x86/SSE etc.). AFAICT these don't need to be -ffast-math only.

float rnd(float x) {
  return (float)((int)x);
}
__v4sf rnd(__v4sf x) {
    return __builtin_convertvector(__builtin_convertvector(x, __v4si), __v4sf);
}

define dso_local float @_Z3rndf(float %0) {
  %2 = fptosi float %0 to i32
  %3 = sitofp i32 %2 to float
  ret float %3
}
define dso_local <4 x float> @_Z3rndDv4_f(<4 x float> %0) {
  %2 = fptosi <4 x float> %0 to <4 x i32>
  %3 = sitofp <4 x i32> %2 to <4 x float>
  ret <4 x float> %3
}

_Z3rndf:
  cvttss2si %xmm0, %eax
  xorps %xmm0, %xmm0
  cvtsi2ssl %eax, %xmm0
  retq
_Z3rndDv4_f:
  cvttps2dq %xmm0, %xmm0
  cvtdq2ps %xmm0, %xmm0
  retq
Comment 1 Simon Pilgrim 2018-03-06 07:18:18 PST
Truncating to smaller float types should be investigated as well:

float rnd32(double x) {
  return (float)((int)x);
}

define float @_Z5rnd32d(double) {
  %2 = fptosi double %0 to i32
  %3 = sitofp i32 %2 to float
  ret float %3
}

_Z3rndDv4_f:
  cvttps2dq %xmm0, %xmm0
  cvtdq2ps %xmm0, %xmm0
  retq
Comment 2 Sanjay Patel 2018-03-26 14:41:40 PDT
Proposal for round-trip casting (to the same size):
https://reviews.llvm.org/D44909
Comment 3 Sanjay Patel 2018-03-31 11:06:21 PDT
The examples in the description should be fixed with:
https://reviews.llvm.org/rL328921
Comment 4 Sanjay Patel 2018-04-12 08:30:50 PDT
(In reply to Sanjay Patel from comment #3)
> The examples in the description should be fixed with:
> https://reviews.llvm.org/rL328921

Reverted because we broke Chrome:
https://reviews.llvm.org/rL329920
Comment 5 Sanjay Patel 2018-04-20 08:30:19 PDT
Trying the initial case again with more warnings about the potential danger:
https://reviews.llvm.org/rL328921
Comment 6 Sanjay Patel 2018-04-20 08:31:01 PDT
(In reply to Sanjay Patel from comment #5)
> Trying the initial case again with more warnings about the potential danger:
> https://reviews.llvm.org/rL328921

Oops - new link:
https://reviews.llvm.org/rL330437
Comment 7 Sanjay Patel 2018-04-25 15:18:51 PDT
We'll have to hide this behind a flag and default it off to start:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180423/545906.html
Comment 8 Sanjay Patel 2018-04-30 12:11:47 PDT
I was wrong about having to default this to 'off' (at least for now), but we do have flags to enable/disable the transform:
https://reviews.llvm.org/D46236
https://reviews.llvm.org/rL331209

@scanon suggested that we add a platform-independent clang builtin that would clamp using either a fixed method (probably the one used by PPC/ARM because they're more useful than x86?) or using the target's method. 

I haven't looked at the details yet, but that sounds good. I.e., it's safe to assume that we're going to get more complaints about breaking code with this optimization.


Also for the record, we are not doing this optimization because we think it's fun to break people's programs using a UB loophole. We added this because it can be a big perf improvement (which is why the operation is supported in HW on multiple platforms).
Comment 9 Sanjay Patel 2018-06-12 08:53:23 PDT
We missed an FMF constraint (latest comments here):
https://reviews.llvm.org/D44909

Miscompiling the -0.0 case...

The good news is that I expect less chance of the more common out-of-range pitfall once we limit this with FMF (I'm assuming most people don't compile with loosened FP).
Comment 10 Sanjay Patel 2018-06-12 09:21:51 PDT
This exposes a hole in IR FMF. We don't currently parse FMF on the conversion instructions even though they are FPMathOperators (the *itofp ones anyway); we only allow FMF on FP binops, calls, and fcmp.
Comment 11 Sanjay Patel 2018-06-12 10:38:31 PDT
https://reviews.llvm.org/D48085
Comment 12 Simon Pilgrim 2020-03-17 04:41:52 PDT
Current Codegen: https://godbolt.org/z/ZEVtp7
Comment 13 Simon Pilgrim 2020-03-17 04:46:25 PDT
@spatel This looks mostly fine to me (thanks for the work 2 years ago!), the only possible improvement is:

define dso_local double @_Z3rndd(double %0) local_unnamed_addr #0 {
  %2 = fptosi double %0 to i32
  %3 = sitofp i32 %2 to float
  %4 = fpext float %3 to double
  ret double %4
}

define dso_local float @_Z5rnd32d(double %0) local_unnamed_addr #0 {
  %2 = fptosi double %0 to i32
  %3 = sitofp i32 %2 to float
  ret float %3
}

_Z3rndd:                                # @_Z3rndd
        vcvttsd2si      %xmm0, %eax
        vcvtsi2ss       %eax, %xmm1, %xmm0
        vcvtss2sd       %xmm0, %xmm0, %xmm0
        retq

_Z5rnd32d:                              # @_Z5rnd32d
        vcvttsd2si      %xmm0, %eax
        vcvtsi2ss       %eax, %xmm1, %xmm0
        retq

We could avoid the XMM->GPR->XMM transfers by using vcvttpd2dq+vcvtdq2ps(+vcvtss2sd) - what do you think?
Comment 14 Sanjay Patel 2020-03-31 08:36:52 PDT
(In reply to Simon Pilgrim from comment #13)
> @spatel This looks mostly fine to me (thanks for 2 years ago!), the only
> possible improvement is:
> 
> define dso_local double @_Z3rndd(double %0) local_unnamed_addr #0 {
>   %2 = fptosi double %0 to i32
>   %3 = sitofp i32 %2 to float
>   %4 = fpext float %3 to double
>   ret double %4
> }
> 
> define dso_local float @_Z5rnd32d(double %0) local_unnamed_addr #0 {
>   %2 = fptosi double %0 to i32
>   %3 = sitofp i32 %2 to float
>   ret float %3
> }
> 
> _Z3rndd:                                # @_Z3rndd
>         vcvttsd2si      %xmm0, %eax
>         vcvtsi2ss       %eax, %xmm1, %xmm0
>         vcvtss2sd       %xmm0, %xmm0, %xmm0
>         retq
> 
> _Z5rnd32d:                              # @_Z5rnd32d
>         vcvttsd2si      %xmm0, %eax
>         vcvtsi2ss       %eax, %xmm1, %xmm0
>         retq
> 
> We could avoid the XMM->GPR->XMM transfers by using
> vcvttpd2dq+vcvtdq2ps(+vcvtss2sd) - what do you think?

Yes, a quick check of Agner timings for recent Intel and AMD says avoiding the GPR round-trip is always a win. We did something similar with:
https://reviews.llvm.org/D56864
Comment 15 Sanjay Patel 2020-03-31 09:56:14 PDT
(In reply to Sanjay Patel from comment #14)
> Yes, a quick check of Agner timings for recent Intel and AMD says avoiding
> the GPR round-trip is always a win. We did something similar with:
> https://reviews.llvm.org/D56864

There's a potential complication though: if we operate on garbage data in the 128-bit %xmm reg, we might induce some terrible x86 denorm-handling stall and kill perf. Can we rule out the potential denorm stall on a cvttps2dq or similar cast instructions?

If the stall concern is valid, then the way around that would be to zero out the high elements, so the asm diff looks something like this:

diff --git a/llvm/test/CodeGen/X86/ftrunc.ll b/llvm/test/CodeGen/X86/ftrunc.ll
index 0a1c1e2a851..d4042b8ceca 100644
--- a/llvm/test/CodeGen/X86/ftrunc.ll
+++ b/llvm/test/CodeGen/X86/ftrunc.ll
@@ -223,9 +223,10 @@ define <4 x double> @trunc_unsigned_v4f64(<4 x double> %x) #0 {
 define float @trunc_signed_f32(float %x) #0 {
 ; SSE2-LABEL: trunc_signed_f32:
 ; SSE2:       # %bb.0:
-; SSE2-NEXT:    cvttss2si %xmm0, %eax
-; SSE2-NEXT:    xorps %xmm0, %xmm0
-; SSE2-NEXT:    cvtsi2ss %eax, %xmm0
+; SSE2-NEXT:    xorps %xmm1, %xmm1
+; SSE2-NEXT:    movss {{.*#+}} xmm1 = xmm0[0],xmm1[1,2,3]
+; SSE2-NEXT:    cvttps2dq %xmm1, %xmm0
+; SSE2-NEXT:    cvtdq2ps %xmm0, %xmm0
 ; SSE2-NEXT:    retq


Still worth doing?
Comment 16 Peter Cordes 2020-04-01 09:13:52 PDT
Created attachment 23302 [details]
testloop-subnormal-cvt.asm

> Can we rule out the potential denorm stall on a cvttps2dq or similar cast instructions?

I *think* this is probably safe for tune=generic but it needs testing on Silvermont and/or Goldmont.  Agner Fog's microarch PDF says:

>   Operations that have subnormal numbers as input or output
>   or generate underflow take approximately 160 clock cycles
>   unless the flush-to-zero mode and denormals-are-zero
>   mode are both used.

But this might not include conversions.  There aren't 2 inputs that need to be aligned wrt. each other to match up the place values of mantissas.  Anything with an exponent part below 2^-1 simply rounds to zero, including subnormals.  But if conversion shares HW with other FPU operations, maybe it would be subject to the same subnormal penalties?  I'm hopeful but I wouldn't bet on it.

I tested on Core 2 (Conroe) and Skylake and found no penalties for subnormal cvt.  Even though Core 2 has ~150x penalties for basically anything else involving subnormals.

cvtps2dq and pd2dq, and their truncating versions, are both fast with a mix of normal and subnormal inputs.  (Not sure if cvtpd2dq is a win over scalar round trip; it has a shuffle uop both ways.  But I tested anyway just for the record.)

In a loop in a static executable, NASM source:

    times 8 cvtps2dq  xmm0, xmm1    ; fine with subnormal
    times 8 cvtpd2dq  xmm0, xmm2    ; fine with subnormal
    times 8 cvttps2dq  xmm0, xmm1   ; fine with subnormal
    times 8 cvttpd2dq  xmm0, xmm2   ; fine with subnormal


section .rodata
f1:     dd 0.5, 0x12345, 0.123, 123         ; normal, subnormal, normal, subnormal
d2:     dq 5.1234567, 0x800000000000ff12    ; normal, -subnormal

Full source attached, in case anyone wants to check on a Bulldozer-family or Zen.  Agner Fog says Bulldozer/Piledriver have penalties for subnormal or underflow *results*, no mention of inputs.  If it runs in 

Agner also doesn't mention input penalties for Ryzen.

Using the same test program with different instructions in the loop, I was able to see huge (factor of > 150) slowdowns for compares with subnormal inputs on Core 2, so I'm sure I had the right values in registers and that I would have detected a problem if it existed.  (cmpltps, or ucomiss with a memory source)


nasm -felf64 subnormal-cvt.asm && ld subnormal-cvt.o && time ./a.out

If the run time is something under 1 or 2 seconds (e.g. 390 ms on a 4.1 GHz Skylake where cvt is 2/clock) it's fast.  It takes 1.3 seconds on my 2.4GHz E6600 Core 2 Conroe/Merom.

If it was slow, it would take ~150 times that.  Or possibly (but unlikely) on Ryzen, just a bit longer than you'd expect if there are penalties for this; time with perf stat to see if you get the expected ~1 instruction / cycle on Zen1 or Zen2
Comment 17 Simon Pilgrim 2020-04-02 03:00:55 PDT
(In reply to Simon Pilgrim from comment #13)
> We could avoid the XMM->GPR->XMM transfers by using
> vcvttpd2dq+vcvtdq2ps(+vcvtss2sd) - what do you think?

To avoid denormal issues, a pre-splat might do the trick - it's still going to be a lot quicker than a GPR transfer.

vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0)))

or

vcvtss2sd(vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0))))

I don't think SimplifyDemandedVectorElts will do anything to remove the splat but we'd have to confirm.
Comment 18 Peter Cordes 2020-04-03 15:37:48 PDT
(In reply to Simon Pilgrim from comment #17)
> (In reply to Simon Pilgrim from comment #13)
> > We could avoid the XMM->GPR->XMM transfers by using
> > vcvttpd2dq+vcvtdq2ps(+vcvtss2sd) - what do you think?
> 
> To avoid denormal issues, a pre-splat might do the trick - its still going
> to be a lot quicker than a GPR transfer.
>
> vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0)))

That defeats the throughput advantage, at least on Intel CPUs, but could still be a latency win for tune=generic *if* there are any CPUs that do have penalties for subnormal FP->int.

We should only consider this if we find evidence that some CPU we care about for tune=generic really does have a penalty for subnormal inputs to FP->int conversion.  I haven't found any evidence of that on Intel, including Core 2 which *does* have subnormal-input penalties for everything else, including compares and float->double.  As I said, I'm hopeful that there aren't any such CPUs, and Agner Fog's guide says AMD CPU subnormal penalties are only ever for output, not input.

It's bad for -mtune=sandybridge or tune=core2 where denormals definitely don't hurt FP->integer conversion.

That extra shuffle uop defeats all of the throughput benefit on Sandybridge-family (On Skylake it's nearly break-even for throughput vs. GPR and back).  The port breakdown is p5 + p01+p5 + p01 on SKL.  (packed conversion to/from double costs a shuffle uop as well as the conversion, because unlike float the element sizes don't match int32.  Scalar is also 2: convert and domain transfer in one direction or the other.)

It does still have a latency advantage on Skylake over ss2si / si2sd to a GPR and back.  (And a big latency advantage on Bulldozer-family).

SKL and Core 2 latency and throughput results for this loop body.  Note that scalar and back wins on Core 2 for both throughput *and* latency.

    xorps   xmm0, xmm0   ;; break the dep chain, comment to test latency
%rep 4
%if 1
;    movaps    xmm0, xmm2
    unpcklpd  xmm0, xmm0        ; 2 p5 + 2p01 on SKL
    cvtpd2dq  xmm0, xmm0        ; rep4: Core2 tput 0.42s  SKL tput: 0.192s @ ~4.15 GHz.  800.5 Mcycles  4 front-end uops / round trip
    cvtdq2ps  xmm0, xmm0        ; rep4: Core2 lat: 1.34s  SKL lat:  0.96s
%else
    ;; SKL ports: p0+p01,  p01+p5
    cvtsd2si  eax, xmm0        ; rep4: Core2 tput 0.33s  SKL tput: 0.194s @ ~4.15 GHz.  807.75 Mcycles  4 front-end uops / round trip
    cvtsi2ss  xmm0, eax        ; rep4: Core2 lat  1.00s  SKL lat:  1.15s
    ; note that this has a false dependency on the destination.  Thanks, Intel.
%endif


And BTW, you want unpcklpd or movlhps to broadcast the low 64 bits: more compact machine code than vpermilpd, which needs a 3-byte VEX and an immediate (and requires AVX).  Or maybe just movq to copy and zero-extend the low qword: 0.33c throughput on Intel, downside is possible bypass delay.  Otherwise, without AVX we have to choose between broadcast in place (putting the shuffle in the critical path for other readers of the FP value), or copy and broadcast which probably costs a movaps plus a shuffle.


I think it's likely that FP->integer won't be slowed by subnormals anywhere, because Core 2 has penalties for comparing subnormals or converting ps -> pd, but still has no penalty for ps -> int.
Comment 19 Peter Cordes 2020-04-03 15:49:12 PDT
(In reply to Simon Pilgrim from comment #17)
> 
> vcvtss2sd(vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0))))

Note that if the source FP value was float, we could skip the int->float step and go int->double directly instead of int->float->double.  That would guarantee that the integer value either overflowed or was an integer that float can represent exactly.  Thus the int->float step is always exact.

But that's not the case for a double input.  If we want to preserve the behaviour of rounding to integer and then to the nearest representable float, we need to keep that step.

Maybe everyone else already noticed that, too, but I thought it was neat.

If we had AVX512 for VRNDSCALESD, that wouldn't help any more than SSE4.1 ROUNDSD; it can only specify a non-negative number of binary fraction bits to keep, not rounding to a coarser boundary than 2^0.
Comment 20 Steve Canon 2020-04-06 06:56:37 PDT
- You won't end up in this situation without doing something weird with vector intrinsics or a reinterpret cast; normal scalar code will not find itself in this scenario. If you're doing something weird, you are opting in to dealing with the hazards that result.

- Subnormals in the unused lanes don't actually cause an issue on anything recent.

In light of these two factors, it's nuts to slow down the common path at all to try to smooth over a hypothetical subnormal penalty that can only happen under very contrived circumstances on decade-plus-old CPUs. If there's a sequence that avoids a possible subnormal with zero penalty for the normal case, use it, but we shouldn't accept any slowdown to the normal case for this.
Comment 21 Peter Cordes 2020-04-06 09:13:10 PDT
Steve's comment reminded me of a previous discussion I had about high garbage being a problem in XMM registers for FP-exception correctness.  cvtdq2ps can raise Precision (inexact).

The x86-64 SysV calling convention allows it.  While it's rare, compilers will sometimes call scalar functions with XMM0's high elements still holding whatever's left over from some auto-vectorization they did.  Or return a scalar float after a horizontal sum that used shuffles that *don't* result in all elements holding the same value.

Clearly code must not rely on high elements being zero in function args or return values for correctness, and LLVM doesn't.  But does correctness include FP exception semantics?  For clang/LLVM it already doesn't.

Without -ffast-math, return ((a+b)+(c+d))+((e+f)+(g+h)); for float a..h args vectorizes with unpcklps + addps leaving the high elements = 2nd element of the XMM inputs.  https://gcc.godbolt.org/z/oRZuN9.  (*with* -ffast-math, clang just uses 7x addss in a way that keeps some but different ILP.)

That *will* have performance problems for subnormals on P6-family, and maybe sometimes on Sandybridge-family.  But it's not a correctness problem and might still be worth doing if we don't mind raising FP exceptions that the C source wouldn't have.

Not everyone is as cavalier with FP exception semantics, although it's really hard (and bad for performance) to get right if you care about FP exceptions as a visible side-effect (i.e. running an exception handler once per exception).  GCC enables `-ftrapping-math` by default, but it's broken and has apparently never fully worked.  (It does allow some optimizations that could change the number and IIRC type of FP exceptions that could be raised, as well as missing some optimizations that are still safe.)

For https://stackoverflow.com/questions/36706721/is-a-sign-or-zero-extension-required-when-adding-a-32bit-offset-to-a-pointer-for/36760539#36760539 I emailed one of the x86-64 System V ABI authors (Michael Matz); he confirmed that XMM scalar function args can have high garbage and added:

> There once was a glibc math function (custom AMD implementation) that used
> the packed variants and indeed produced FP exception when called from normal
> C code (which itself left stuff in the upper bits from unrelated code).
> The fix was to not used the packed variant.  So, the above clang code seems
> buggy without other guards.

(I had asked if that clang3.7 codegen was a sign that the ABI guaranteed something.  But no, it's just clang/LLVM being aggressive.)

Apparently cvtdq2ps can raise a Precision exception (but no others).  I think if an element is not already an exact integer-valued float, and/or is out of range for int32_t.

----

Side note: in the unlikely event we ever want to emit that shuffle for some non-default tuning:
  we can still leave it out with -ffast-math without -fPIC - that means we're compiling code for an executable which will presumably be linked with -ffast-math, and run with FTZ and DAZ (denormals are zero), the purpose of which is to avoid these penalties.


(In reply to Steve Canon from comment #20)
> - You won't end up in this situation without doing something weird without
> vector intrinsics or a reinterpret cast; normal scalar code will not find
> itself in this scenario. If you're doing something weird, you are opting in
> to dealing with the hazards that result.

Auto-vectorized code that calls a scalar function could cause this, e.g. after a horizontal sum that ends up with the right value in the low lane and leftover other values in the high lanes.

Although most likely it already caused some subnormal penalties, in which case one more is probably not a big deal for the rare case where that happens.  Especially not if we're only concerned with performance, not FP-exception correctness.
Comment 22 Sanjay Patel 2020-04-06 09:51:22 PDT
(In reply to Peter Cordes from comment #21)
> (In reply to Steve Canon from comment #20)
> > - You won't end up in this situation without doing something weird without
> > vector intrinsics or a reinterpret cast; normal scalar code will not find
> > itself in this scenario. If you're doing something weird, you are opting in
> > to dealing with the hazards that result.
> 
> Auto-vectorized code that calls a scalar function could cause this, e.g.
> after a horizontal sum that ends up with the right value in the low lane and
> leftover other values in the high lanes.

There's some misunderstanding here. We're talking about changing the codegen for this most basic C code (no weird stuff necessary):

float rnd(float x) {
  return (float)((int)x);
}
Comment 23 Steve Canon 2020-04-06 10:03:58 PDT
> There's some misunderstanding here. We're talking about changing the codegen for this most basic C code (no weird stuff necessary)

Subnormal representations in the high-order lanes have to come from somewhere. They don't just magically appear.

In particular:

- scalar loads zero the high lanes
- moves from GPR zero the high lanes

There are a few ways to get a scalar value that preserve the high-order lanes. The most common culprits are the scalar conversions and unary scalar operations coming from memory. In those cases, the correct fix for the resulting performance bug will be to zero before *that* operation (or convert to separate load + op) to also break the dependency.

> Without -ffast-math, return ((a+b)+(c+d))+((e+f)+(g+h)); for float a..h args vectorizes with unpcklps + addps leaving the high elements = 2nd element of the XMM inputs.

I'm totally fine with eating a subnormal stall in a case like this in order to avoid pessimizing "normal code", because if a stall is happening after a reduction like this, you're *already* eating subnormal stalls in the reduction itself. The ship sailed. You're already not going fast.

w.r.t. modeling floating-point exceptions, a constrained intrinsic used when fenv_access is enabled would of course not be able to take part in this optimization, so that shouldn't be an issue.
Comment 24 Sanjay Patel 2020-04-06 11:16:00 PDT
(In reply to Steve Canon from comment #23)
> Subnormal representations in the high-order lanes have to come from
> somewhere. They don't just magically appear.

Ok, I agree it's far-fetched that garbage will be up there and has not already caused perf problems. And it's moot if there's no denorm penalty on the convert insts on recent HW. 

We don't have to worry about exceptions because programs that enabled exceptions should use strict/constrained ops, and we're dealing with the base opcodes here.

As a final safety check, can someone try Peter's attached asm code (see comment 16) on an AMD CPU to see if that can trigger a denorm penalty there?
Comment 25 Roman Lebedev 2020-04-06 11:28:14 PDT
$ cat /proc/cpuinfo | grep model | sort | uniq
model           : 2
model name      : AMD FX(tm)-8350 Eight-Core Processor
$ taskset -c 3 perf stat -etask-clock:u,context-switches:u,cpu-migrations:u,page-faults:u,cycles:u,branches:u,instructions:u -r1 ./subnormal-cvt 

 Performance counter stats for './subnormal-cvt':

            799.31 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                 2      page-faults:u             #    0.003 K/sec                  
        3200715964      cycles:u                  #    4.004 GHz                    
         100000415      branches:u                #  125.109 M/sec                  
        3400000428      instructions:u            #    1.06  insn per cycle         
                                                  #    0.00  stalled cycles per insn

       0.799595756 seconds time elapsed

       0.799619000 seconds user
       0.000000000 seconds sys

"uops_issued.any:u,uops_executed.thread:u,idq.dsb_uops:u" counters aren't supported here, so i had to drop them.
Comment 26 Peter Cordes 2020-04-06 11:47:56 PDT
(In reply to Sanjay Patel from comment #22)
> > There's some misunderstanding here. We're talking about changing the codegen for this most basic C code (no weird stuff necessary)

Yes exactly.  If that function doesn't inline, *its caller* could have left high garbage in XMM0.

Normal scalar loads zero the upper elements, which is why I proposed an auto-vectorized caller as a simple example of
when we could in practice see high garbage across function call boundaries.

(In reply to Steve Canon from comment #23)
> Subnormal representations in the high-order lanes have to come from
> somewhere. They don't just magically appear.
> 
> In particular:
> 
> - scalar loads zero the high lanes
> - moves from GPR zero the high lanes
> 
> There are a few ways to get a scalar value that preserve the high-order
> lanes. The most common culprits are the scalar conversions and unary scalar
> operations coming from memory. In those cases, the correct fix for the
> resulting performance bug will be to zero before *that* operation (or
> convert to separate load + op) to also break the dependency.

Yup, GCC does dep-breaking PXOR-zeroing to work around Intel's short-sighted design (for PIII)
for scalar conversions that merge into the destination.  With AVX we can sometimes do the zeroing once and non-destructively merge into the same register.  (It seems braindead that the AVX versions didn't just zero-extend, i.e. implicitly use an internal zeroed reg as the merge target.  So Intel missed two more chances to fix their ISA with AVX VEX and AVX512 EVEX.)
IIRC, MSVC does scalar load + packed-conversion, which is good for float.

But yes, these merge ops could have leftover *integer* vectors.  Small integers are the bit patterns for subnormal floats.  So that's probably a more likely source of subnormal FP bit patterns than actual subnormal floats from FP ops.


> > Without -ffast-math, return ((a+b)+(c+d))+((e+f)+(g+h)); for float a..h args vectorizes with unpcklps + addps leaving the high elements = 2nd element of the XMM inputs.
> 
> I'm totally fine with eating a subnormal stall in a case like this in order
> to avoid pessimizing "normal code", because if a stall is happening after a
> reduction like this, you're *already* eating subnormal stalls in the
> reduction itself. The ship sailed. You're already not going fast.

Yes, that was my thought, too.
If real code generates subnormal floats as part of its real work, that's something for the programmer to worry about even if we introduce an occasional extra penalty in the rare cases they're left lying around and we trip over them.

> w.r.t. modeling floating-point exceptions, a constrained intrinsic used when
> fenv_access is enabled would of course not be able to take part in this
> optimization, so that shouldn't be an issue.

Oh, you're right, clang supports -ftrapping-math (but unlike GCC it's not on by default).
That does result in evaluation in source bracket-nesting order without shuffling, just 7x addss. 
 https://gcc.godbolt.org/z/a44S43

(Incidentally, that's probably optimal for most cases of surrounding code,
especially on Skylake with 2/clock FP add but 1/clock FP shuffle.
Fewer total instructions and shorter critical-path latency even on Haswell with 1/clock FP add.)
Comment 27 Peter Cordes 2020-04-06 11:57:19 PDT
(In reply to Sanjay Patel from comment #24)
> As a final safety check, can someone try Peter's attached asm code (see
> comment 16) on an AMD CPU to see if that can trigger a denorm penalty there?

The only CPUs I'm at all worried about are Atom / Silvermont-family.

Low power Intel have more performance glass jaws than the big-core chips in other areas (e.g. much more picky about too many prefixes).  I don't know how much we care about Silvermont for tune=generic, especially for FP code, if it does turn out to be a problem there.

Roman's 1.06  insn per cycle result on Bulldozer-family confirms it's fine.  Probably safe to assume Zen is also fine.  Agner says AMD only has subnormal penalties on outputs, no mention of inputs, and in this case the output is integer.  So that's certainly the result I was expecting, but definitely good to be sure.
Comment 28 Sanjay Patel 2020-04-17 08:48:23 PDT
Still need to handle more type variations, but here are two patterns:
https://reviews.llvm.org/D77895
https://reviews.llvm.org/D78362
Comment 29 Sanjay Patel 2020-04-30 07:54:36 PDT
One more x86-specialization:
https://reviews.llvm.org/D78362

And a generic transform to remove the last cast:
https://reviews.llvm.org/D79116
Comment 30 Sanjay Patel 2020-04-30 07:55:44 PDT
(In reply to Sanjay Patel from comment #29)
> One more x86-specialization:
> https://reviews.llvm.org/D78362

Pasted wrong link:
https://reviews.llvm.org/D78758
Comment 31 Simon Pilgrim 2020-05-04 03:48:46 PDT
All cases now covered - thank you!
Comment 32 Sanjay Patel 2020-05-04 04:35:51 PDT
Not shown in the examples here, but there's another FP cast fold proposed in:
https://reviews.llvm.org/D79187
Comment 33 Sanjay Patel 2020-05-24 07:01:39 PDT
Added IR folds for int -> FP -> ext/trunc FP:
https://reviews.llvm.org/rGbfd512160fe0
https://reviews.llvm.org/rGc048a02b5b26