Fast Math Flags not propagated to phis/select instructions #51601

Open
llvmbot opened this issue Oct 21, 2021 · 7 comments
Labels: bugzilla (Issues migrated from bugzilla), confirmed (Verified by a second party), floating-point (Floating-point math), llvm:codegen, llvm:optimizations


@llvmbot (Collaborator) commented Oct 21, 2021

Bugzilla Link: 52259
Version: trunk
OS: All
Reporter: LLVM Bugzilla Contributor
CC: @andykaylor, @davidbolvansky, @jyknight, @RKSimon, @rotateright

Extended Description

Consider the following code compiled with -Ofast.

double floatingAbs(double x) {
    if (x < 0)
        return -x;
    else
        return x;
}

This should produce a fabs, but the compiler is unable to produce one because the select instruction in the IR does not have any fast-math flags.
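For reference, a sketch of the IR we would hope to end up with (assuming the usual llvm.fabs.f64 intrinsic; function attributes omitted):

define double @floatingAbs(double %x) {
entry:
  ; legal only if the sign of zero may be ignored (nsz) at this point
  %0 = call fast double @llvm.fabs.f64(double %x)
  ret double %0
}

declare double @llvm.fabs.f64(double)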

From my analysis:
SROA -> creates a phi node without any fast-math flags.
SimplifyCFG -> converts the phi to a select and copies the phi's fast-math flags; unfortunately, the phi does not have any fast-math flags to begin with.

@jyknight (Member) commented Nov 3, 2021

It wasn't initially clear to me why there was a problem here (although I'm sure it was clear to those who've been working on this for a while). So, going through it again:

The optimized IR that comes out of this example now is:

define dso_local double @foo(double %x) local_unnamed_addr #0 {
entry:
  %cmp = fcmp fast olt double %x, 0.000000e+00
  %fneg = fneg fast double %x
  %retval.0 = select i1 %cmp, double %fneg, double %x
  ret double %retval.0
}
attributes #0 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="cortex-a76" "unsafe-fp-math"="true" }

While this looks reasonable, we'd like it to just turn into llvm.fabs. However, that transform would change the return value when called with -0.0: the original code returns -0.0, while fabs would return +0.0.

And, sadly, putting nsz flags just on the math ops, as we have here, is insufficient to allow for the desired optimization.

That's because:
%cmp = fcmp fast olt double %x, 0.000000e+00
-> "if %x is less than zero". The nsz flag here lets us treat +/- zero as equivalent, but that's meaningless, because +/- zero are already treated as equivalent in fcmp. The semantics of nsz don't permit us to have the comparison return "true", given %x = -0.0 (and we don't want to change them in such a way, either).

%fneg = fneg fast double %x
-> The value here only ends up being used when the comparison was true, which means we're not dealing with any zeros, so the "nsz" here is effectively irrelevant.

%retval.0 = select i1 %cmp, double %fneg, double %x
-> So here is the first opportunity to actually use the property of not caring about the sign. But we can't do so.

Similarly, looking at the original C code, it's quite clear that none of the operations in the function are affected by allowing +/- zero to be treated equivalently.

Basically, only if we define the nsz semantics as affecting values, rather than operations, can we get the transform we want.

And that would seem to imply placing nsz flags on a lot more kinds of things: every FP load, store, ret, bitcast, incoming-parameter, maybe more.

(It seems as though this really wants to be a different IR type, e.g. "float-nsz" rather than "float".)
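To make that concrete, here is a purely hypothetical sketch (not valid LLVM IR today) of what flags-on-values could look like for this example, with nsz attached to the incoming argument and the returned value rather than to the operations:

; hypothetical syntax: 'nsz' annotating the parameter and the returned value,
; saying that the sign of zero in those *values* may be ignored
define double @foo(double nsz %x) {
entry:
  %cmp = fcmp fast olt double %x, 0.000000e+00
  %fneg = fneg fast double %x
  %retval.0 = select i1 %cmp, double %fneg, double %x
  ret nsz double %retval.0
}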

@rotateright (Contributor) commented:

Thanks for writing that summary! Parts of that comment are scattered around in the related bugs, patch reviews, and commit messages, but this is the best example/description of the problem that I have read.

We've mostly been able to cheat our way around the current limitations of FMF by reasoning (possibly wrongly) via FMF on sequences of instructions (back/forward propagating as needed). Sometimes we've relied on the legacy function-level attributes too (mostly in the backend/codegen).

But there are a few examples like this one where it seems we are stuck. Note that "ninf" and "nnan" have similar constraints.

@andykaylor (Contributor) commented:

This is interesting. Below are my notes on working through what James said about needing fast-math flags on things like loads and stores. I'm recording it here in full verbosity for the benefit of anyone else who might not see why this is so.

If instead of the if-else, you write the code like this:

double floatingAbs(double x) {
  return (x < 0) ? -x : x;
}

The code will get optimized to llvm.fabs(x) as desired. The front end produces this:

define dso_local double @floatingAbs(double %0) #0 {
  %2 = alloca double, align 8
  store double %0, double* %2, align 8, !tbaa !3
  %3 = load double, double* %2, align 8, !tbaa !3
  %4 = fcmp fast olt double %3, 0.000000e+00
  br i1 %4, label %5, label %8

5:                                                ; preds = %1
  %6 = load double, double* %2, align 8, !tbaa !3
  %7 = fneg fast double %6
  br label %10

8:                                                ; preds = %1
  %9 = load double, double* %2, align 8, !tbaa !3
  br label %10

10:                                               ; preds = %8, %5
  %11 = phi fast double [ %7, %5 ], [ %9, %8 ]
  ret double %11
}

Which SROA turns into this:

define dso_local double @floatingAbs(double %0) #0 {
  %2 = fcmp fast olt double %0, 0.000000e+00
  br i1 %2, label %3, label %5

3:                                                ; preds = %1
  %4 = fneg fast double %0
  br label %6

5:                                                ; preds = %1
  br label %6

6:                                                ; preds = %5, %3
  %7 = phi fast double [ %4, %3 ], [ %0, %5 ]
  ret double %7
}

Which SimplifyCFG turns into this:

define dso_local double @floatingAbs(double %0) #0 {
  %2 = fcmp fast olt double %0, 0.000000e+00
  %3 = fneg fast double %0
  %4 = select fast i1 %2, double %3, double %0
  ret double %4
}

Which InstCombine turns into llvm.fabs().

Note that in the above example, the phi and select nodes do get the 'fast' flag and that seems to be why we did the transformation.

In the original if-else case, the front end gives us this:

define dso_local double @floatingAbs(double %0) #0 {
  %2 = alloca double, align 8
  %3 = alloca double, align 8
  store double %0, double* %3, align 8, !tbaa !3
  %4 = load double, double* %3, align 8, !tbaa !3
  %5 = fcmp fast olt double %4, 0.000000e+00
  br i1 %5, label %6, label %9

6:                                                ; preds = %1
  %7 = load double, double* %3, align 8, !tbaa !3
  %8 = fneg fast double %7
  store double %8, double* %2, align 8
  br label %11

9:                                                ; preds = %1
  %10 = load double, double* %3, align 8, !tbaa !3
  store double %10, double* %2, align 8
  br label %11

11:                                               ; preds = %9, %6
  %12 = load double, double* %2, align 8
  ret double %12
}

Here clang stored the fneg value in the if branch and reloaded it in block 11, so there was no phi on which fast-math flags could have been set. SROA gets rid of the loads and stores, but it doesn't set fast-math flags on the phi it creates:

define dso_local double @floatingAbs(double %0) #0 {
  %2 = fcmp fast olt double %0, 0.000000e+00
  br i1 %2, label %3, label %5

3:                                                ; preds = %1
  %4 = fneg fast double %0
  br label %6

5:                                                ; preds = %1
  br label %6

6:                                                ; preds = %5, %3
  %.0 = phi double [ %4, %3 ], [ %0, %5 ]
  ret double %.0
}

I tried to think of some hypothetical rule by which SROA could deduce the 'fast' flag for the phi, but since one of the loaded values is just an argument I can't see any way we could do it (apart from the function attribute, which I want to get rid of).

@rotateright (Contributor) commented:

Thanks for stepping through the passes and posting the analysis, Andy.

The problem is complicated, so having this example along with your and James' explanations will definitely help illustrate that as we search for a solution.

I'll add another example based on bug 35607 to try to make the scope of the problem clearer:

#include <algorithm>

float test3(float a, float b) {
    return std::max(std::min(a,b), std::max(a,b));
}

We should be able to recognize that a 'fast' FP max-of-max is just FP max, but that does not work currently:
https://godbolt.org/z/dsjh7o6Ko

That shows the missed optimization as x86 asm and optimized IR. The last pane shows the unoptimized IR coming from clang (front-end).

Note that there are no phi, select, or FP ops - just a sequence of load, store, compare, branch, and calls. The min/max calls take/return pointers, not FP values. So "fcmp" is the only FPMathOperator (and we have been trying to move away from that definition).

I think this example would require FMF on loaded FP values to get the expected optimization.
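For reference, a sketch of the kind of result we would hope for here (one plausible final form; max(min(a,b), max(a,b)) is always max(a,b), so under 'fast' the whole expression should collapse to a single max):

define float @test3(float %a, float %b) {
entry:
  ; nnan/nsz from 'fast' are what allow the compare/select form of std::max
  ; to be treated as maxnum
  %r = call fast float @llvm.maxnum.f32(float %a, float %b)
  ret float %r
}

declare float @llvm.maxnum.f32(float, float)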

@andykaylor (Contributor) commented:

I might be starting to warm up to function attributes as a potential solution to this problem. One of my concerns with function attributes was how they would work when pragmas or inlining led to functions with mixed fast-math behaviors. It seems that clang (mostly) handles this in a way that alleviates my concerns.

Using this code as a test:

#include <math.h>

float f(float x) {
#pragma float_control(precise, on)
  if (isnan(x)) {
    return 0;
  } else {
#pragma float_control(precise, off)
    if (x < 0)
      return -x;
    else
      return x;
  }
}

#pragma float_control(precise, on)
int foo(float x) {
  return isnan(x);
}

#pragma float_control(precise, off)
float bar(float x) {
  if (x < 0)
    return -x;
  else
    return x;
}

float fubar(float x) {
  if (foo(x))
    return 0;
  else
    return bar(x);
}

I see that clang creates the following function attribute sets:

attributes #0 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "denormal-fp-math"="preserve-sign,preserve-sign" "denormal-fp-math-f32"="ieee,ieee" "frame-pointer"="none" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" }
attributes #1 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "denormal-fp-math"="preserve-sign,preserve-sign" "denormal-fp-math-f32"="ieee,ieee" "frame-pointer"="none" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="true" }

https://godbolt.org/z/aMf3axqjj

Essentially, #1 has attributes that enable fast-math, and #0 has attributes that do not. For functions with mixed states, clang uses the attributes that do not enable fast-math. It seems to be getting "approx-func-fp-math" wrong, but that may be an issue with the pragma handling. Also, "unsafe-fp-math" seems to encompass 'arcp', 'recip', and 'reassoc' with no way to represent them separately. There also seems to be disagreement between clang and SelectionDAG over whether "unsafe-fp-math" implies 'contract'.

If we had some well-defined relationship between the function attributes and the instruction-level fast math flags and some way of enforcing it, I might be OK with using the function attributes in IR optimization passes. Based on a suggestion Renato Golin made on the mailing list, I think the rules would be:

  1. If any instruction has a fast-math flag set, the corresponding function attribute must be omitted or set to false
  2. The instruction-level fast-math flags, when set, take precedence over the function attribute
  3. If fast-math flags are not set for an instruction but an optimization is enabled by a corresponding function attribute, the optimization is permitted
  4. If fast-math flags are not set for an instruction and the corresponding function attribute is not found or is set to false, the optimization is not permitted
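As a minimal sketch of how rules 3 and 4 would play out (hypothetical function names; attribute lists trimmed to the one relevant entry):

; Rule 3: no per-instruction fast-math flags, so the function attribute governs.
; With "no-signed-zeros-fp-math"="true", folding the select into @llvm.fabs.f64
; would be permitted.
define double @abs_fast_attr(double %x) #1 {
  %cmp = fcmp olt double %x, 0.000000e+00
  %neg = fneg double %x
  %sel = select i1 %cmp, double %neg, double %x
  ret double %sel
}

; Rule 4: same body, but the attribute is false, so the fold is not permitted.
define double @abs_strict_attr(double %x) #0 {
  %cmp = fcmp olt double %x, 0.000000e+00
  %neg = fneg double %x
  %sel = select i1 %cmp, double %neg, double %x
  ret double %sel
}

attributes #0 = { "no-signed-zeros-fp-math"="false" }
attributes #1 = { "no-signed-zeros-fp-math"="true" }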

@andykaylor (Contributor) commented:

One other thing worth noting about the function attributes: when fast-math is enabled throughout a function, expressing that with attributes has a potential benefit for bitcode size. The way clang generates IR, setting both instruction-level FMF and the function attributes, wouldn't get this benefit, but we could potentially add it as an optimization just before the BitcodeWriter is invoked.

When an instruction has fast-math flags set, the BitcodeWriter emits an extra byte for those flags. If no flags are set, a byte is saved for each instruction. If anyone cared about bitcode size (and I'm pretty sure there are people who do), they could sweep over the IR just before writing the bitcode and remove all instruction level fast-math flags and set the function attribute for functions where all eligible instructions had the same fast-math flags set. There may even be front ends that are already taking advantage of this behavior.

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
@yashssh (Contributor) commented Feb 22, 2024

I would like to reignite this discussion; the issue described is still valid, and I recently encountered it only to find out that it is already being tracked here. Pasting the summary of the discussion we had in the 'floating point workgroup' meeting (thank you @andykaylor!).

Yashwant Singh brought up a specific example where the lack of fast-math flags on a select instruction was blocking a desirable optimization of an fabs idiom. This problem was previously discussed on the LLVM mailing list and is archived here: #51601. As discussed there, the only way to enable the optimization using current IR constructs is to rely on function-level attributes. We discussed various problems with function-level attributes, particularly related to inlining, where it may be necessary to drop the attribute. The most obvious alternative was to allow front ends to set fast-math flags on load instructions, but that isn't currently allowed and would likely expose other problems.

P.S. We are moving away from any solution based on the function attributes. I'm going to run some experiments around the proposed solution (setting fast-math flags on load instructions) and see what comes of it. If there is a better or more intuitive fix, or any comments on the discussion, kindly suggest :)
