52259 – Fast Math Flags not propagated to phis/select instructions

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 52259 - Fast Math Flags not propagated to phis/select instructions

Summary: Fast Math Flags not propagated to phis/select instructions

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Common Code Generator Code (show other bugs)
Version:	trunk
Hardware:	PC All

Importance:	P enhancement
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-10-21 16:43 PDT by Usman Nadeem
Modified:	2021-11-19 15:50 PST (History)
CC List:	8 users (show)

See Also:	35607 38086
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Usman Nadeem 2021-10-21 16:43:34 PDT

Consider the following code compiled with -Ofast.

double floatingAbs(double x) {
    if (x < 0)
        return -x;
    else
        return x;
}


This should produce a "fabs" but the compiler is unable to do produce one because the select instruction in the IR does not have any fast flags.

From my analysis:
SROA          -> creates a Phi node without any fast flags.
SimplifyCFG   -> converts the Phi to a select and copies the Phi's fast flags,
                 unfortunately the phi does not have any fast flags to begin with.

Comment 1 James Y Knight 2021-11-03 07:24:55 PDT

It wasn't initially clear to me why there was a problem here (although I'm sure it was clear to those who've been working on this for a while). So, going through it again:

The optimized IR that comes out from this example now is:

```
define dso_local double @foo(double %x) local_unnamed_addr #0 {
entry:
  %cmp = fcmp fast olt double %x, 0.000000e+00
  %fneg = fneg fast double %x
  %retval.0 = select i1 %cmp, double %fneg, double %x
  ret double %retval.0
}
attributes #0 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="cortex-a76" "unsafe-fp-math"="true" }
```

While this looks reasonable, we'd like it to just turn into llvm.fabs. However, that transform would change the return value when called with -0.0 from the original -0.0 to instead be +0.0.

And, sadly, putting nsz flags just on the math ops, as we have here, is insufficient to allow for the desired optimization.

That's because:
  %cmp = fcmp fast olt double %x, 0.000000e+00
-> "if %x is less than zero". The nsz flag here lets us treat +/- zero as equivalent, but that's meaningless, because +/- zero are _already_ treated as equivalent in fcmp. The semantics of nsz don't permit us to have the comparison return "true", given %x = -0.0 (and we don't want to change them in such a way, either).

  %fneg = fneg fast double %x
-> The value here only ends up being used when the comparison was true, which means we're not dealing with any zeros, so the "nsz" here is effectively irrelevant.

  %retval.0 = select i1 %cmp, double %fneg, double %x
-> So here is the first opportunity to actually use the property of not caring about the sign. But we can't do so.


Similarly, looking at the original C code, it's quite clear that none of the  operations in the function are affected by allowing +/- zero to be treated equivalently.

Basically, only fi we define the nsz semantics as affecting _values_, rather than _operations_, can we get the transform we want.

And that would seem to imply placing nsz flags on a lot more kinds of things: every FP load, store, ret, bitcast, incoming-parameter, maybe more.

(It seems like it's kinda really expected to just be a different IR _type_ (e.g. "float-nsz" not "float").

Comment 2 Sanjay Patel 2021-11-05 12:10:35 PDT

Thanks for writing that summary! Parts of that comment are scattered around in the related bugs, patch reviews, and commit messages, but this is the best example/description of the problem that I have read.

We've mostly been able to cheat our way around the current limitations of FMF by reasoning (possibly wrongly) via FMF on sequences of instructions (back/forward propagating as needed). Sometimes we've relied on the legacy function-level attributes too (mostly in the backend/codegen).

But there are a few examples like this one where it seems we are stuck. Note that "ninf" and "nnan" have similar constraints.

Comment 3 Andy Kaylor 2021-11-18 16:19:59 PST

This is interesting. Below are my notes on working through what James said about needing fast-math flags on things like loads and stores. I'm recording it here in full verbosity for the benefit of anyone else who might not see why this is so.

If instead of the if-else, you write the code like this:

```
double floatingAbs(double x) {
  return (x < 0) ? -x : x;
}
```

The code will get optimized to llvm.fabs(x) as desired. The front end produces this:

```
define dso_local double @floatingAbs(double %0) #0 {
  %2 = alloca double, align 8
  store double %0, double* %2, align 8, !tbaa !3
  %3 = load double, double* %2, align 8, !tbaa !3
  %4 = fcmp fast olt double %3, 0.000000e+00
  br i1 %4, label %5, label %8

5:                                                ; preds = %1
  %6 = load double, double* %2, align 8, !tbaa !3
  %7 = fneg fast double %6
  br label %10

8:                                                ; preds = %1
  %9 = load double, double* %2, align 8, !tbaa !3
  br label %10

10:                                               ; preds = %8, %5
  %11 = phi fast double [ %7, %5 ], [ %9, %8 ]
  ret double %11
}
```

Which SROA turns into this:

```
define dso_local double @floatingAbs(double %0) #0 {
  %2 = fcmp fast olt double %0, 0.000000e+00
  br i1 %2, label %3, label %5

  %4 = fneg fast double %0
  br label %6

  br label %6

  %7 = phi fast double [ %4, %3 ], [ %0, %5 ]
  ret double %7
}
```

Which SimplifyCFG turns into this:

```
define dso_local double @floatingAbs(double %0) #0 {
  %2 = fcmp fast olt double %0, 0.000000e+00
  %3 = fneg fast double %0
  %4 = select fast i1 %2, double %3, double %0
  ret double %4
}
```

Which InstCombine turns into llvm.fabs().

Note that in the above example, the phi and select nodes do get the 'fast' flag and that seems to be why we did the transformation.

In the original if-else case, the front end gives us this:

```
define dso_local double @floatingAbs(double %0) #0 {
  %2 = alloca double, align 8
  %3 = alloca double, align 8
  store double %0, double* %3, align 8, !tbaa !3
  %4 = load double, double* %3, align 8, !tbaa !3
  %5 = fcmp fast olt double %4, 0.000000e+00
  br i1 %5, label %6, label %9

6:                                                ; preds = %1
  %7 = load double, double* %3, align 8, !tbaa !3
  %8 = fneg fast double %7
  store double %8, double* %2, align 8
  br label %11

9:                                                ; preds = %1
  %10 = load double, double* %3, align 8, !tbaa !3
  store double %10, double* %2, align 8
  br label %11

11:                                               ; preds = %9, %6
  %12 = load double, double* %2, align 8
  ret double %12
}
```

Here clang stored the value fneg value in the if branch and reloaded it in BB#11, so there was phi to put the select on. SROA gets rid of the loads and stores, but it doesn't set fast-math flags on the phi.

```
define dso_local double @floatingAbs(double %0) #0 {
  %2 = fcmp fast olt double %0, 0.000000e+00
  br i1 %2, label %3, label %5

3:                                                ; preds = %1
  %4 = fneg fast double %0
  br label %6

5:                                                ; preds = %1
  br label %6

6:                                                ; preds = %5, %3
  %.0 = phi double [ %4, %3 ], [ %0, %5 ]
  ret double %.0
}
```

I tried to think of some hypothetical rule by which SROA could deduce the 'fast' flag for the phi, but since one of the loaded values is just an argument I can't see any way we could do it (apart from the function attribute, which I want to get rid of).

Comment 4 Sanjay Patel 2021-11-19 05:49:06 PST

Thanks for stepping through the passes and posting the analysis, Andy. 

The problem is complicated, so having this example along with your and James' explanations will definitely help illustrate that as we search for a solution.

I'll add another example based on bug 35607 to try to make the scope of the problem clearer:

```
#include <algorithm>

float test3(float a, float b) {
    return std::max(std::min(a,b), std::max(a,b));
}

```

We should be able to recognize that a 'fast' FP max-of-max is just FP max, but that does not work currently:
https://godbolt.org/z/dsjh7o6Ko

That shows the missed optimization as x86 asm and optimized IR. The last pane shows the unoptimized IR coming from clang (front-end). 

Note that there are no phi, select, or FP ops - just a sequence of load, store, compare, branch, and calls. The min/max calls take/return pointers, not FP values. So "fcmp" is the only FPMathOperator (and we have been trying to move away from that definition).

I think this example would require FMF on loaded FP values to get the expected optimization.

Comment 5 Andy Kaylor 2021-11-19 14:49:22 PST

I might be starting to warm up to function attributes as a potential solution to this problem. One of my concerns with function attributes was how they would work when pragmas or inlining led to functions with mixed fast-math behaviors. It seems that clang (mostly) handles this in a way that alleviates my concerns.

Using this code as a test:

```
#include <math.h>

float f(float x) {
#pragma float_control(precise, on)
  if (isnan(x)) {
    return 0;
  } else {
#pragma float_control(precise, off)
    if (x < 0)
      return -x;
    else
      return x;
  }
}

#pragma float_control(precise, on)
int foo(float x) {
  return isnan(x);
}

#pragma float_control(precise, off)
float bar(float x) {
  if (x < 0)
    return -x;
  else
    return x;
}

float fubar(float x) {
  if (foo(x))
    return 0;
  else
    return bar(x);
}
```

I see that clang creates the following function attribute sets:

attributes #0 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "denormal-fp-math"="preserve-sign,preserve-sign" "denormal-fp-math-f32"="ieee,ieee" "frame-pointer"="none" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" }
attributes #1 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "denormal-fp-math"="preserve-sign,preserve-sign" "denormal-fp-math-f32"="ieee,ieee" "frame-pointer"="none" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="true" }

https://godbolt.org/z/aMf3axqjj

Essentially, #0 has attributes that enable fast-math, and #1 has attributes that do not enable fast-math. For functions with mixed states, clang uses the attributes that do not enable fast-math. It seems to be getting "approx-func-fp-math" wrong, but that may be an issue with the pragma handling. Also, "unsafe-fp-math" seems to encompass 'arcp', 'recip', and 'reassoc' with no way to represent them separately. There also seems to be disagreement between clang and SelectionDAG over whether "unsafe-fp-math" implies 'contract'.

If we had some well-defined relationship between the function attributes and the instruction-level fast math flags and some way of enforcing it, I might be OK with using the function attributes in IR optimization passes. Based on a suggestion Renato Golin made on the mailing list, I think the rules would be:

1. If any instruction has a fast-math flag set, the corresponding function attribute must be omitted or set to false
2. The instruction-level fast-math flags, when set, take precedence over the function attribute
3. If fast-math flags are not set for an instruction but an optimization is enabled by a corresponding function attribute, the optimization is permitted
4. If fast-math flags are not set for an instruction and the corresponding function attribute is not found or is set to false, the optimization is not permitted

Comment 6 Andy Kaylor 2021-11-19 15:50:00 PST

One other thing I wanted to make note of with regard to the function attributes is that the use of function attributes when fast-math is enabled throughout a function has a potential benefit to bitcode size. The way clang generates IR, setting both instruction-level FMF and the function attributes, wouldn't get this benefit, but we could potentially add it as an optimization just before the BitcodeWriter is invoked.

When an instruction has fast-math flags set, the BitcodeWriter emits an extra byte for those flags. If no flags are set, a byte is saved for each instruction. If anyone cared about bitcode size (and I'm pretty sure there are people who do), they could sweep over the IR just before writing the bitcode and remove all instruction level fast-math flags and set the function attribute for functions where all eligible instructions had the same fast-math flags set. There may even be front ends that are already taking advantage of this behavior.