InstCombine changes sibling non-fast instruction to fast #45671
The transform in question is http://reviews.llvm.org/rL358404

$ ./bin/opt -march=x86-64 -mcpu=sandybridge -instcombine -debug /tmp/test.ll
INSTCOMBINE ITERATION #1 on julia_coeff_α_1367
INSTCOMBINE ITERATION #2 on julia_coeff_α_1367
This is an open question about the semantics of fast-math flags in IR and codegen. In most cases so far, we have assumed that if the last value in a calculation is relaxed in some way ("fast" means all possible FP shortcuts are allowed -- http://llvm.org/docs/LangRef.html#fast-math-flags ), then we are free to apply those semantics to the intermediate calculations leading to that final value. It's not clear to me from this example what difference we get in the final output. Can you explain what you are/were expecting? The case where IR instructions have different FMF doesn't come up much with C/C++ source code AFAIK, so we haven't gotten much feedback on it. The "@fastmath" annotation that you show in the Julia source makes it easy to create these patterns with mixed FMF.
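To make the mixed-flags situation concrete, here is a minimal sketch (my own illustration, not taken from the report): the division carries no flags and so nominally requires strict IEEE semantics, while the multiply that consumes it is fully relaxed.

%t = fdiv double %a, %b        ; no flags: strict IEEE semantics
%r = fmul fast double %t, %c   ; fast: all FP shortcuts allowed for this op

The open question is whether the fast on %r also licenses relaxing the computation of %t.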
What if there was code like:

mask = a == a                  ; false only when a is NaN
b = select mask, a, 0.0
d = fadd fast b, c

Would this mean that it'd be legal for LLVM to optimize the above into:

d = fadd fast a, c

because, if you assume the fast semantics extend to the intermediate calculations feeding d, then nnan lets you fold a == a to true and the select away?
FWIW, Julia will likely remove the […]
I was expecting numerically identical results across versions of Julia. I've been trying to carefully craft a numerical function, and while working on improving its numerical accuracy, I came across this discrepancy between Julia v1.4 and v1.5, where the underlying LLVM version changed. I [incorrectly] thought the fast-math optimizations would permit instruction reordering/transformations among the fast-flagged instructions themselves, with the effect restricted to that subset. For this specific function, an example is calling […]
Yes, I initially tried in vain to create a C example which showed the effect as well. That's also why I was motivated to file a bug and give feedback: it is so much easier with Julia's "@fastmath" to make code combinations that would be unusual in C/C++.
I agree that seems like a stretch -- reasoning about 'a' based on the fact that 'd' is not NaN. Let's take a minimal example of the transform in question in this bug report: (x / y) * 42.0. Currently, we will turn this into (x * 42.0) / y, because the multiply can be reassociated and there's no other use of the division value. This report is suggesting that's too liberal -- i.e., the division requires precise IEEE semantics, so we can't touch it. That means "reassoc" on any single value alone is worthless: we can't reassociate unless 2 or more values agree to be reassociable. I'm not opposed to this interpretation, but we will want to specify it in the LLVM LangRef (and probably should have done so already).
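In IR, the shape of that transform is roughly the following (a sketch of the pattern, not verbatim instcombine input/output):

; before: strict division feeding a relaxed multiply
%div = fdiv double %x, %y
%mul = fmul fast double %div, 4.200000e+01

; after: the division is sunk below the multiply and becomes fast
%mul = fmul fast double %x, 4.200000e+01
%div = fdiv fast double %mul, %y

The question raised here is whether rewriting the unflagged fdiv this way is justified by the fast flag on its lone user.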
Nice! I know almost nothing about Julia, but it's interesting to see FP optimizations specified at the line/operation level in source.
In general, this is impossible once any FP calculation is marked "fast". Without a formal spec, FP math can produce different results for any math op, and there may not be an obvious "correct" answer.

Take for example the trailing sqrt in the original code. If converted to an estimate sequence, that result can differ between various hardware implementations: any of the initial estimate, the refinement algorithm, or the availability of FMA will produce different results. I did some analysis of that for x86 in bug 21385. Even identical asm is not guaranteed to produce identical results; if we use an estimate instruction, that can vary depending on CPU model.

So again, I'm not opposed to tightening the semantics, but I'm not entirely sure what effects that might have. We may end up losing an optimization somewhere and getting the opposite bug report, where the stricter interpretation means we didn't produce as fast a sequence as someone else was expecting.
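As a concrete illustration of why estimate sequences diverge (a textbook refinement scheme, not necessarily what any particular backend emits): a fast sqrt(a) is often computed as a * rsqrt(a), where rsqrt starts from a low-precision hardware estimate y0 (x86's rsqrtss guarantees only about 12 bits of accuracy) and is improved with a Newton-Raphson step:

y1 = y0 * (1.5 - 0.5 * a * y0 * y0)

The accuracy of y0, the number of refinement steps, and whether the arithmetic is fused with FMA all change the low-order bits of the final result, so two correct lowerings can legitimately disagree.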
You're right, I was a bit imprecise about what I meant. I started investigating when I didn't get the same answer (on the same CPU), and then I didn't understand the IR I found. My assumed reading of the relatively sparse description of what fast-math means matched my experience with LLVM 8 but not with LLVM 9, so that's what motivated this bug report. (And in fact, the observations of what LLVM 8 and earlier empirically did probably informed my mental model.) I'm perfectly willing to be told my interpretation was too optimistic, though :-) And it's not actually a critical issue for me, since I can easily work around the differences across versions of Julia by just being a little less lazy: I was ultimately abusing Julia's @fastmath macro just to get a version of sqrt that skips the domain (x < 0) check, since I know by construction that case won't happen in my algorithm.
Kind of a side comment at this point, but I've been doing single-CPU tests thus far and plan to expand to a couple of different architectures soon. My goal for the moment is a best-reasonable-effort algorithm, wherein I make use of numerical best practices in designing the algorithm, but I'm not (yet) going to bend over backwards if one CPU has poorer accuracy of, say, its sqrt instruction than another. (I do still want the speed afforded by CPU instructions versus library calls.)
I think we're in agreement. It'd be nice to have some clarification of what the semantics are, just because there are reasonable yet disjoint interpretations possible. I don't need this to be "fixed" to make progress, and Julia admittedly sits in a unique space where single-instruction fast-math flags are easy to create.
Extended Description
Overview:
This is a specific case found in Julia code with more context found in [1], wherein it was suggested I file this as a bug with LLVM.
The unoptimized IR below contains two fast-math instructions just before returning: a multiplication followed by a square root. At an optimization level of -O1 or higher, LLVM v8 optimizes away a bit of the function preamble but otherwise maintains the mathematical calculation. Starting with LLVM v9, though, the fdiv (value %17 in the unoptimized form) is changed to an fdiv fast instruction and moved to the end of the function, just before the fmul fast. The Julia Discourse forum helped narrow the cause from the full -O1 pass set down to specifically the InstCombine optimization pass.
Godbolt link demonstrating this: [2]
This changes the numerical result of the function depending on the version of LLVM being used in Julia. I've noticed, though, that the llc-emitted assembly is the same in both LLVM v8 and v9 as viewed on Godbolt, but Julia's emitted assembly differs (in such a way that it's sensitive to the change in LLVM IR optimization). I've put together a gist [3] that shows the assembly for Julia v1.4.2 (LLVM v8), Julia v1.5-beta1 (LLVM v9), and LLVM llc run on the provided IR (same in both v8 and v9).
Steps to Reproduce:
; Original unoptimized IR, to be optimized with
; opt -march=x86-64 -mcpu=sandybridge -instcombine
; or replace '-instcombine' with {'-O1', '-O2', '-O3'}, for both LLVM v8 and v9.
; Declarations added so the module parses standalone in opt;
; the original excerpt omitted them.
%jl_value_t = type opaque

declare %jl_value_t*** @julia.ptls_states()
declare double @llvm.sqrt.f64(double)

define double @"julia_coeff_\CE\B1_1367"(i64, i64) {
top:
%2 = call %jl_value_t*** @julia.ptls_states()
%3 = bitcast %jl_value_t*** %2 to %jl_value_t**
%4 = getelementptr inbounds %jl_value_t*, %jl_value_t** %3, i64 4
%5 = bitcast %jl_value_t** %4 to i64**
%6 = load i64*, i64** %5
%7 = sitofp i64 %0 to double
%8 = sitofp i64 %1 to double
%9 = fmul double 2.000000e+00, %7
%10 = fadd double %9, 1.000000e+00
%11 = fmul double 2.000000e+00, %7
%12 = fsub double %11, 3.000000e+00
%13 = fmul double %7, %7
%14 = fmul double %8, %8
%15 = fsub double %13, %14
%16 = fmul double %12, %15
%17 = fdiv double %10, %16
%18 = fsub double %7, 1.000000e+00
%19 = fmul double %18, %18
%20 = fmul double 4.000000e+00, %19
%21 = fsub double %20, 1.000000e+00
%22 = fmul fast double %17, %21
%23 = call fast double @llvm.sqrt.f64(double %22)
ret double %23
}
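For reference, after InstCombine under LLVM v9 the tail of the function ends up shaped roughly like this (a hand-written sketch based on the behavior described above, with illustrative value names rather than verbatim opt output):

%t = fmul fast double %10, %21                   ; numerator combined with the other factor
%d = fdiv fast double %t, %16                    ; the formerly strict fdiv, now fast and sunk
%r = call fast double @llvm.sqrt.f64(double %d)
ret double %r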
Expected Results:
As I understand it (and correct me if I'm wrong), the fdiv should not be changed to an fdiv fast instruction.
Build Date & Hardware:
julia> versioninfo()
Julia Version 1.5.0-beta1.0
Commit 6443f6c95a (2020-05-28 17:42 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 3 2200G with Radeon Vega Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
Additional Information:
[1] https://discourse.julialang.org/t/investigating-numerical-change-in-function-return-value-between-v1-4-vs-v1-5/41332/5
[2] https://godbolt.org/z/r-cNuL
[3] https://gist.github.com/jmert/6aea12adb74ef8b7f25eba276d42911a