Fast Math Flags not propagated to phis/select instructions #51601

llvmbot · 2021-10-21T23:43:34Z


Bugzilla Link	52259
Version	trunk
OS	All
Reporter	LLVM Bugzilla Contributor
CC	@andykaylor,@davidbolvansky,@jyknight,@RKSimon,@rotateright

Extended Description

Consider the following code compiled with -Ofast.

double floatingAbs(double x) {
    if (x < 0)
        return -x;
    else
        return x;
}

This should produce a fabs but the compiler is unable to do produce one because the select instruction in the IR does not have any fast flags.

From my analysis:
SROA -> creates a Phi node without any fast flags.
SimplifyCFG -> converts the Phi to a select and copies the Phi's fast flags,
unfortunately the phi does not have any fast flags to begin with.

The text was updated successfully, but these errors were encountered:

jyknight · 2021-11-03T14:24:55Z

It wasn't initially clear to me why there was a problem here (although I'm sure it was clear to those who've been working on this for a while). So, going through it again:

The optimized IR that comes out from this example now is:

define dso_local double @&#8203;foo(double %x) local_unnamed_addr #&#8203;0 {
entry:
  %cmp = fcmp fast olt double %x, 0.000000e+00
  %fneg = fneg fast double %x
  %retval.0 = select i1 %cmp, double %fneg, double %x
  ret double %retval.0
}
attributes #&#8203;0 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="cortex-a76" "unsafe-fp-math"="true" }

While this looks reasonable, we'd like it to just turn into llvm.fabs. However, that transform would change the return value when called with -0.0 from the original -0.0 to instead be +0.0.

And, sadly, putting nsz flags just on the math ops, as we have here, is insufficient to allow for the desired optimization.

That's because:
%cmp = fcmp fast olt double %x, 0.000000e+00
-> "if %x is less than zero". The nsz flag here lets us treat +/- zero as equivalent, but that's meaningless, because +/- zero are already treated as equivalent in fcmp. The semantics of nsz don't permit us to have the comparison return "true", given %x = -0.0 (and we don't want to change them in such a way, either).

%fneg = fneg fast double %x
-> The value here only ends up being used when the comparison was true, which means we're not dealing with any zeros, so the "nsz" here is effectively irrelevant.

%retval.0 = select i1 %cmp, double %fneg, double %x
-> So here is the first opportunity to actually use the property of not caring about the sign. But we can't do so.

Similarly, looking at the original C code, it's quite clear that none of the operations in the function are affected by allowing +/- zero to be treated equivalently.

Basically, only fi we define the nsz semantics as affecting values, rather than operations, can we get the transform we want.

And that would seem to imply placing nsz flags on a lot more kinds of things: every FP load, store, ret, bitcast, incoming-parameter, maybe more.

(It seems like it's kinda really expected to just be a different IR type (e.g. "float-nsz" not "float").

rotateright · 2021-11-05T19:10:35Z

Thanks for writing that summary! Parts of that comment are scattered around in the related bugs, patch reviews, and commit messages, but this is the best example/description of the problem that I have read.

We've mostly been able to cheat our way around the current limitations of FMF by reasoning (possibly wrongly) via FMF on sequences of instructions (back/forward propagating as needed). Sometimes we've relied on the legacy function-level attributes too (mostly in the backend/codegen).

But there are a few examples like this one where it seems we are stuck. Note that "ninf" and "nnan" have similar constraints.

andykaylor · 2021-11-19T00:19:59Z

This is interesting. Below are my notes on working through what James said about needing fast-math flags on things like loads and stores. I'm recording it here in full verbosity for the benefit of anyone else who might not see why this is so.

If instead of the if-else, you write the code like this:

double floatingAbs(double x) {
  return (x < 0) ? -x : x;
}

The code will get optimized to llvm.fabs(x) as desired. The front end produces this:

define dso_local double @&#8203;floatingAbs(double %0) #&#8203;0 {
  %2 = alloca double, align 8
  store double %0, double* %2, align 8, !tbaa !&#8203;3
  %3 = load double, double* %2, align 8, !tbaa !&#8203;3
  %4 = fcmp fast olt double %3, 0.000000e+00
  br i1 %4, label %5, label %8

5:                                                ; preds = %1
  %6 = load double, double* %2, align 8, !tbaa !&#8203;3
  %7 = fneg fast double %6
  br label %10

8:                                                ; preds = %1
  %9 = load double, double* %2, align 8, !tbaa !&#8203;3
  br label %10

10:                                               ; preds = %8, %5
  %11 = phi fast double [ %7, %5 ], [ %9, %8 ]
  ret double %11
}

Which SROA turns into this:

define dso_local double @&#8203;floatingAbs(double %0) #&#8203;0 {
  %2 = fcmp fast olt double %0, 0.000000e+00
  br i1 %2, label %3, label %5

  %4 = fneg fast double %0
  br label %6

  br label %6

  %7 = phi fast double [ %4, %3 ], [ %0, %5 ]
  ret double %7
}

Which SimplifyCFG turns into this:

define dso_local double @&#8203;floatingAbs(double %0) #&#8203;0 {
  %2 = fcmp fast olt double %0, 0.000000e+00
  %3 = fneg fast double %0
  %4 = select fast i1 %2, double %3, double %0
  ret double %4
}

Which InstCombine turns into llvm.fabs().

Note that in the above example, the phi and select nodes do get the 'fast' flag and that seems to be why we did the transformation.

In the original if-else case, the front end gives us this:

define dso_local double @&#8203;floatingAbs(double %0) #&#8203;0 {
  %2 = alloca double, align 8
  %3 = alloca double, align 8
  store double %0, double* %3, align 8, !tbaa !&#8203;3
  %4 = load double, double* %3, align 8, !tbaa !&#8203;3
  %5 = fcmp fast olt double %4, 0.000000e+00
  br i1 %5, label %6, label %9

6:                                                ; preds = %1
  %7 = load double, double* %3, align 8, !tbaa !&#8203;3
  %8 = fneg fast double %7
  store double %8, double* %2, align 8
  br label %11

9:                                                ; preds = %1
  %10 = load double, double* %3, align 8, !tbaa !&#8203;3
  store double %10, double* %2, align 8
  br label %11

11:                                               ; preds = %9, %6
  %12 = load double, double* %2, align 8
  ret double %12
}

Here clang stored the value fneg value in the if branch and reloaded it in BB#11, so there was phi to put the select on. SROA gets rid of the loads and stores, but it doesn't set fast-math flags on the phi.

define dso_local double @&#8203;floatingAbs(double %0) #&#8203;0 {
  %2 = fcmp fast olt double %0, 0.000000e+00
  br i1 %2, label %3, label %5

3:                                                ; preds = %1
  %4 = fneg fast double %0
  br label %6

5:                                                ; preds = %1
  br label %6

6:                                                ; preds = %5, %3
  %.0 = phi double [ %4, %3 ], [ %0, %5 ]
  ret double %.0
}

I tried to think of some hypothetical rule by which SROA could deduce the 'fast' flag for the phi, but since one of the loaded values is just an argument I can't see any way we could do it (apart from the function attribute, which I want to get rid of).

rotateright · 2021-11-19T13:49:06Z

Thanks for stepping through the passes and posting the analysis, Andy.

The problem is complicated, so having this example along with your and James' explanations will definitely help illustrate that as we search for a solution.

I'll add another example based on bug 35607 to try to make the scope of the problem clearer:

#include <algorithm>

float test3(float a, float b) {
    return std::max(std::min(a,b), std::max(a,b));
}

We should be able to recognize that a 'fast' FP max-of-max is just FP max, but that does not work currently:
https://godbolt.org/z/dsjh7o6Ko

That shows the missed optimization as x86 asm and optimized IR. The last pane shows the unoptimized IR coming from clang (front-end).

Note that there are no phi, select, or FP ops - just a sequence of load, store, compare, branch, and calls. The min/max calls take/return pointers, not FP values. So "fcmp" is the only FPMathOperator (and we have been trying to move away from that definition).

I think this example would require FMF on loaded FP values to get the expected optimization.

andykaylor · 2021-11-19T22:49:22Z

I might be starting to warm up to function attributes as a potential solution to this problem. One of my concerns with function attributes was how they would work when pragmas or inlining led to functions with mixed fast-math behaviors. It seems that clang (mostly) handles this in a way that alleviates my concerns.

Using this code as a test:

#include <math.h>

float f(float x) {
#pragma float_control(precise, on)
  if (isnan(x)) {
    return 0;
  } else {
#pragma float_control(precise, off)
    if (x < 0)
      return -x;
    else
      return x;
  }
}

#pragma float_control(precise, on)
int foo(float x) {
  return isnan(x);
}

#pragma float_control(precise, off)
float bar(float x) {
  if (x < 0)
    return -x;
  else
    return x;
}

float fubar(float x) {
  if (foo(x))
    return 0;
  else
    return bar(x);
}

I see that clang creates the following function attribute sets:

attributes #0 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "denormal-fp-math"="preserve-sign,preserve-sign" "denormal-fp-math-f32"="ieee,ieee" "frame-pointer"="none" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" }
attributes #1 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "denormal-fp-math"="preserve-sign,preserve-sign" "denormal-fp-math-f32"="ieee,ieee" "frame-pointer"="none" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="true" }

https://godbolt.org/z/aMf3axqjj

Essentially, #0 has attributes that enable fast-math, and #1 has attributes that do not enable fast-math. For functions with mixed states, clang uses the attributes that do not enable fast-math. It seems to be getting "approx-func-fp-math" wrong, but that may be an issue with the pragma handling. Also, "unsafe-fp-math" seems to encompass 'arcp', 'recip', and 'reassoc' with no way to represent them separately. There also seems to be disagreement between clang and SelectionDAG over whether "unsafe-fp-math" implies 'contract'.

If we had some well-defined relationship between the function attributes and the instruction-level fast math flags and some way of enforcing it, I might be OK with using the function attributes in IR optimization passes. Based on a suggestion Renato Golin made on the mailing list, I think the rules would be:

If any instruction has a fast-math flag set, the corresponding function attribute must be omitted or set to false
The instruction-level fast-math flags, when set, take precedence over the function attribute
If fast-math flags are not set for an instruction but an optimization is enabled by a corresponding function attribute, the optimization is permitted
If fast-math flags are not set for an instruction and the corresponding function attribute is not found or is set to false, the optimization is not permitted

andykaylor · 2021-11-19T23:50:00Z

One other thing I wanted to make note of with regard to the function attributes is that the use of function attributes when fast-math is enabled throughout a function has a potential benefit to bitcode size. The way clang generates IR, setting both instruction-level FMF and the function attributes, wouldn't get this benefit, but we could potentially add it as an optimization just before the BitcodeWriter is invoked.

When an instruction has fast-math flags set, the BitcodeWriter emits an extra byte for those flags. If no flags are set, a byte is saved for each instruction. If anyone cared about bitcode size (and I'm pretty sure there are people who do), they could sweep over the IR just before writing the bitcode and remove all instruction level fast-math flags and set the function attribute for functions where all eligible instructions had the same fast-math flags set. There may even be front ends that are already taking advantage of this behavior.

yashssh · 2024-02-22T05:40:13Z

I would like to reignite this discussion, the issue described is still valid and I recently encountered it only to find out that it's already being tracked here. Pasting the summary of the discussion we had in the 'floating point workgroup' meeting(Thank you @andykaylor!).

Yashwant Singh brought up a specific example where the lack of fast-math flags on a select instruction was blocking a desirable optimization of an fabs idiom. This problem was previously discussed on the LLVM mailing list and is archived here: #51601 As discussed there, the only way to enable the optimization using current IR constructs is to rely on function level attributes. We discussed various problems with function-level attributes, particularly related to inlining where it may be necessary to drop the attribute. The most obvious alternative was to allow front ends to set fast-math flags on load instructions, but that isn’t currently allowed and would likely expose other problems.

ps We are moving away from any solution based on using the function attributes, I'm gonna perform some experiments around a proposed solution (set fast-math flags on load instructions) and see what comes. If there might be a better/intuitive fix or any comments on the discussion kindly suggest :)

arsenm · 2024-06-13T11:55:42Z

  %cmp = fcmp nsz ogt double %X, 0.000000e+00
  %fneg = fneg nsz double %X
  %retval.0 = select i1 %cmp, double %X, double %fneg
  ret double %retval.0

Can we pull nsz from the compare into the select when both select value operands are derived from the compared value? Looking through any sign-only operators?

andykaylor · 2024-06-14T00:35:00Z

  %cmp = fcmp nsz ogt double %X, 0.000000e+00
  %fneg = fneg nsz double %X
  %retval.0 = select i1 %cmp, double %X, double %fneg
  ret double %retval.0
Can we pull nsz from the compare into the select when both select value operands are derived from the compared value? Looking through any sign-only operators?

I don't think so. The nsz flag doesn't guarantee that %X isn't -0.0. It just says we can behave as though that doesn't matter for the instruction it's on for the purposes of optimization.

Suppose the IR looks like this:

  %X = fadd double %A, %B
  %cmp = fcmp nsz ogt double %X, 0.000000e+00
  %fneg = fneg nsz double %X
  %retval.0 = select i1 %cmp, double %X, double %fneg
  ret double %retval.0

Here the instruction that produces %X does not have the nsz flag set, which clearly says we cannot treat disregard the sign of zero in the result. Passing that value to an fcmp instruction does not change the value. So the select has to honor the sign of zero.

…using function attribute (llvm#83381) Its expected that the sequence `return X > 0.0 ? X : -X`, compiled with -Ofast, produces fabs intrinsic. However, at this point, LLVM is unable to do so. The above sequence goes through the following transformation during the pass pipeline: 1) SROA pass generates the phi node. Here, it does not infer the fast-math flags on the phi node unlike clang frontend typically does. 2) Phi node eventually gets translated into select instruction. Because of missing no-signed-zeros(nsz) fast-math flag on the select instruction, InstCombine pass fails to fold the sequence into fabs intrinsic. This patch, as a part of SROA, tries to propagate nsz fast-math flag on the phi node using function attribute enabling this folding. Closes llvm#51601 Co-authored-by: Sushant Gokhale <sgokhale@nvidia.com>

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021

arsenm added llvm:optimizations floating-point labels Aug 14, 2023

Endilll added the confirmed label Feb 22, 2024

yashssh mentioned this issue Feb 29, 2024

[SROA] Propagate no-signed-zeros(nsz) fast-math flag on the phi node using function attribute #83381

Merged

sushgokh closed this as completed in #83381 Jul 2, 2024

sushgokh closed this as completed in cd1e6a5 Jul 2, 2024

EugeneZelenko added llvm:transforms and removed llvm:codegen llvm:optimizations labels Jul 2, 2024

andykaylor mentioned this issue Oct 8, 2024

[clang] Allow ConditionalOperator fast-math flags to be overridden by pragma float_control #105912

Closed

vzakhari mentioned this issue Feb 4, 2025

[RFC][mlir] Conditional support for fast-math attributes. #125620

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fast Math Flags not propagated to phis/select instructions #51601

Fast Math Flags not propagated to phis/select instructions #51601

llvmbot commented Oct 21, 2021 •

edited by Endilll

Loading

jyknight commented Nov 3, 2021 •

edited

Loading

rotateright commented Nov 5, 2021

andykaylor commented Nov 19, 2021

rotateright commented Nov 19, 2021

andykaylor commented Nov 19, 2021

andykaylor commented Nov 19, 2021

yashssh commented Feb 22, 2024 •

edited

Loading

arsenm commented Jun 13, 2024

andykaylor commented Jun 14, 2024

Fast Math Flags not propagated to phis/select instructions #51601

Fast Math Flags not propagated to phis/select instructions #51601

Comments

llvmbot commented Oct 21, 2021 • edited by Endilll Loading

Extended Description

jyknight commented Nov 3, 2021 • edited Loading

rotateright commented Nov 5, 2021

andykaylor commented Nov 19, 2021

rotateright commented Nov 19, 2021

andykaylor commented Nov 19, 2021

andykaylor commented Nov 19, 2021

yashssh commented Feb 22, 2024 • edited Loading

arsenm commented Jun 13, 2024

andykaylor commented Jun 14, 2024

llvmbot commented Oct 21, 2021 •

edited by Endilll

Loading

jyknight commented Nov 3, 2021 •

edited

Loading

yashssh commented Feb 22, 2024 •

edited

Loading