Fast Math Flags not propagated to phis/select instructions #51601
It wasn't initially clear to me why there was a problem here (although I'm sure it was clear to those who've been working on this for a while). So, going through it again: The optimized IR that comes out from this example now is:
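A sketch of the kind of IR being described (names and exact flags are illustrative, not the original listing):

```llvm
define float @f(float %x) {
entry:
  %cmp = fcmp nsz ogt float %x, 0.000000e+00
  %neg = fneg nsz float %x
  ; the select itself carries no fast-math flags
  %sel = select i1 %cmp, float %x, float %neg
  ret float %sel
}
```

Without nsz on the select, replacing the whole sequence with llvm.fabs.f32 would be observable at %x = -0.0.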
While this looks reasonable, we'd like it to just turn into llvm.fabs. However, that transform would change the return value when called with -0.0: the original returns -0.0, while fabs would return +0.0. And, sadly, putting nsz flags just on the math ops, as we have here, is insufficient to allow the desired optimization. That's because the nsz flags only license sign-of-zero changes in the results of the fcmp and the fneg; the select that actually chooses between %x and its negation carries no flag, so folding it to fabs is not justified.
Similarly, looking at the original C code, it's quite clear that none of the operations in the function are affected by allowing +/- zero to be treated equivalently. Basically, only if we define the nsz semantics as affecting values, rather than operations, can we get the transform we want. And that would seem to imply placing nsz flags on many more kinds of things: every FP load, store, ret, bitcast, incoming parameter, and maybe more. (It almost seems like what is really wanted is a different IR type, e.g. "float-nsz" rather than "float".)
Thanks for writing that summary! Parts of that comment are scattered around in the related bugs, patch reviews, and commit messages, but this is the best example/description of the problem that I have read. We've mostly been able to cheat our way around the current limitations of FMF by reasoning (possibly wrongly) via FMF on sequences of instructions (back/forward propagating as needed). Sometimes we've relied on the legacy function-level attributes too (mostly in the backend/codegen). But there are a few examples like this one where it seems we are stuck. Note that "ninf" and "nnan" have similar constraints.
This is interesting. Below are my notes on working through what James said about needing fast-math flags on things like loads and stores. I'm recording it here in full verbosity for the benefit of anyone else who might not see why this is so. If instead of the if-else, you write the code like this:
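The rewritten source is presumably the ternary form of the same function; a hypothetical reconstruction (the function name is mine):

```cpp
// Absolute value written as a ternary expression instead of an if-else
// statement; compiled with -Ofast this does fold to fabs.
float myabs(float x) {
    return x > 0.0f ? x : -x;
}
```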
The code will get optimized to llvm.fabs(x) as desired. The front end produces this:
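Presumably something of this shape — a hand-written sketch rather than actual clang output; with -ffast-math the frontend attaches 'fast' to the FP operations and to the phi it creates for the ternary:

```llvm
define float @myabs(float %x) {
entry:
  %x.addr = alloca float, align 4
  store float %x, ptr %x.addr, align 4
  %0 = load float, ptr %x.addr, align 4
  %cmp = fcmp fast ogt float %0, 0.000000e+00
  br i1 %cmp, label %cond.true, label %cond.false

cond.true:
  %1 = load float, ptr %x.addr, align 4
  br label %cond.end

cond.false:
  %2 = load float, ptr %x.addr, align 4
  %fneg = fneg fast float %2
  br label %cond.end

cond.end:
  %cond = phi fast float [ %1, %cond.true ], [ %fneg, %cond.false ]
  ret float %cond
}
```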
Which SROA turns into this:
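Roughly (sketch, not the actual pass output):

```llvm
define float @myabs(float %x) {
entry:
  %cmp = fcmp fast ogt float %x, 0.000000e+00
  br i1 %cmp, label %cond.true, label %cond.false

cond.true:
  br label %cond.end

cond.false:
  %fneg = fneg fast float %x
  br label %cond.end

cond.end:
  %cond = phi fast float [ %x, %cond.true ], [ %fneg, %cond.false ]
  ret float %cond
}
```

The 'fast' on the phi survives because it was already there in the frontend IR; SROA only removed the allocas, loads, and stores.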
Which SimplifyCFG turns into this:
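Approximately (sketch):

```llvm
define float @myabs(float %x) {
entry:
  %cmp = fcmp fast ogt float %x, 0.000000e+00
  %fneg = fneg fast float %x
  ; 'fast' was copied from the phi, and 'fast' implies nsz
  %cond = select fast i1 %cmp, float %x, float %fneg
  ret float %cond
}
```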
Which InstCombine turns into llvm.fabs(). Note that in the above example, the phi and select nodes do get the 'fast' flag and that seems to be why we did the transformation. In the original if-else case, the front end gives us this:
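Again as a sketch (labels and value names are illustrative; the point is the store/reload of the negated value, with no phi at all):

```llvm
define float @myabs(float %x) {
entry:
  %x.addr = alloca float, align 4
  %r = alloca float, align 4
  store float %x, ptr %x.addr, align 4
  %0 = load float, ptr %x.addr, align 4
  %cmp = fcmp fast ogt float %0, 0.000000e+00
  br i1 %cmp, label %if.then, label %if.else

if.then:
  %1 = load float, ptr %x.addr, align 4
  store float %1, ptr %r, align 4
  br label %if.end

if.else:
  %2 = load float, ptr %x.addr, align 4
  %fneg = fneg fast float %2
  store float %fneg, ptr %r, align 4
  br label %if.end

if.end:
  %3 = load float, ptr %r, align 4
  ret float %3
}
```

The merge happens through memory, so there is nothing in the frontend IR for the flags to live on.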
Here clang stored the fneg value in the if branch and reloaded it in BB#11, so there was no phi to carry the flags. SROA gets rid of the loads and stores and creates the phi, but it doesn't set fast-math flags on it.
I tried to think of some hypothetical rule by which SROA could deduce the 'fast' flag for the phi, but since one of the loaded values is just an argument I can't see any way we could do it (apart from the function attribute, which I want to get rid of).
Thanks for stepping through the passes and posting the analysis, Andy. The problem is complicated, so having this example along with your and James' explanations will definitely help illustrate that as we search for a solution. I'll add another example based on bug 35607 to try to make the scope of the problem clearer:
We should be able to recognize that a 'fast' FP max-of-max is just FP max, but that does not work currently. The example shows the missed optimization as x86 asm and optimized IR; the last pane shows the unoptimized IR coming from clang (front-end). Note that there are no phi, select, or FP ops — just a sequence of load, store, compare, branch, and calls. The min/max calls take/return pointers, not FP values. So "fcmp" is the only FPMathOperator (and we have been trying to move away from that definition). I think this example would require FMF on loaded FP values to get the expected optimization.
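A hypothetical reduction of the kind of code being described — the essential property is that std::max takes and returns references, so the FP values only ever flow through pointers and the unoptimized IR contains just loads, stores, fcmp, branches, and calls:

```cpp
#include <algorithm>

// With fast-math, max(max(x, y), x) should simplify to max(x, y),
// but no FP-typed select or phi ever appears in the frontend IR.
float max_of_max(float x, float y) {
    return std::max(std::max(x, y), x);
}
```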
I might be starting to warm up to function attributes as a potential solution to this problem. One of my concerns with function attributes was how they would work when pragmas or inlining led to functions with mixed fast-math behaviors. It seems that clang (mostly) handles this in a way that alleviates my concerns. Using this code as a test:
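The test code is not reproduced here; a hypothetical version of the same shape could use clang's float_control pragma to mix fast and precise functions in one translation unit compiled with -ffast-math:

```cpp
// Compile with: clang++ -ffast-math -c ...
// f should get the attribute set that enables fast-math.
float f(float x) { return x + 1.0f; }

#pragma float_control(precise, on)
// g should get the attribute set that does not enable fast-math.
float g(float x) { return x + 1.0f; }
```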
I see that clang creates the following function attribute sets:

```
attributes #0 = { mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn "approx-func-fp-math"="true" "denormal-fp-math"="preserve-sign,preserve-sign" "denormal-fp-math-f32"="ieee,ieee" "frame-pointer"="none" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" }
```

https://godbolt.org/z/aMf3axqjj

Essentially, #0 has attributes that enable fast-math, and #1 has attributes that do not enable fast-math. For functions with mixed states, clang uses the attributes that do not enable fast-math. It seems to be getting "approx-func-fp-math" wrong, but that may be an issue with the pragma handling. Also, "unsafe-fp-math" seems to encompass 'arcp', 'recip', and 'reassoc' with no way to represent them separately, and there seems to be disagreement between clang and SelectionDAG over whether "unsafe-fp-math" implies 'contract'. If we had some well-defined relationship between the function attributes and the instruction-level fast-math flags, and some way of enforcing it, I might be OK with using the function attributes in IR optimization passes. Based on a suggestion Renato Golin made on the mailing list, I think the rules would be:
One other thing I wanted to note about the function attributes: using them when fast-math is enabled throughout a function has a potential benefit for bitcode size. When an instruction has fast-math flags set, the BitcodeWriter emits an extra byte for those flags; if no flags are set, that byte is saved for each instruction. The way clang generates IR today, setting both instruction-level FMF and the function attributes, wouldn't get this benefit, but we could potentially add it as an optimization just before the BitcodeWriter is invoked. Anyone who cares about bitcode size (and I'm pretty sure there are people who do) could sweep over the IR just before writing the bitcode and, for functions where all eligible instructions have the same fast-math flags, strip the instruction-level flags and set the function attribute instead. There may even be front ends that are already taking advantage of this behavior.
I would like to reignite this discussion: the issue described is still valid, and I recently encountered it only to find out that it's already being tracked here. Pasting the summary of the discussion we had in the 'floating point workgroup' meeting (thank you, @andykaylor!).
P.S. We are moving away from any solution based on the function attributes. I'm going to run some experiments around a proposed solution (setting fast-math flags on load instructions) and see what comes of it. If there is a better or more intuitive fix, or any comments on the discussion, please suggest it :)
Can we pull nsz from the compare into the select when both select value operands are derived from the compared value, looking through any sign-only operators?
I don't think so. Suppose the IR looks like this:
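The IR in question was presumably of roughly this shape (illustrative sketch):

```llvm
  %cmp = fcmp nsz ogt float %x, 0.000000e+00
  %neg = fneg float %x                        ; sign-only operation
  %sel = select i1 %cmp, float %x, float %neg
```

The nsz on the fcmp only says the comparison may treat -0.0 and +0.0 interchangeably; it says nothing about whether the sign of zero in the value the select produces is observable.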
Here the instruction that produces
…using function attribute (llvm#83381)

It's expected that the sequence `return X > 0.0 ? X : -X`, compiled with -Ofast, produces a fabs intrinsic. However, at this point, LLVM is unable to do so. The above sequence goes through the following transformations during the pass pipeline:

1) The SROA pass generates the phi node. Here, it does not infer fast-math flags on the phi node, unlike what the clang frontend typically does.
2) The phi node eventually gets translated into a select instruction. Because the no-signed-zeros (nsz) fast-math flag is missing on the select instruction, the InstCombine pass fails to fold the sequence into a fabs intrinsic.

This patch, as a part of SROA, propagates the nsz fast-math flag to the phi node using the function attribute, enabling this fold.

Closes llvm#51601

Co-authored-by: Sushant Gokhale <sgokhale@nvidia.com>
Extended Description
Consider the following code compiled with `-Ofast`. This should produce a `fabs`, but the compiler is unable to produce one because the select instruction in the IR does not have any fast flags. From my analysis:
SROA -> creates a Phi node without any fast flags.
SimplifyCFG -> converts the Phi to a select and copies the Phi's fast flags; unfortunately, the phi does not have any fast flags to begin with.
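The code itself is not shown above; per the commit message it is the sequence `return X > 0.0 ? X : -X`, e.g. (hypothetical complete function):

```cpp
// Compiled with -Ofast, this should fold to a fabs intrinsic.
float f(float X) {
    return X > 0.0f ? X : -X;
}
```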