InstCombine changes sibling non-fast instruction to fast #45671

Open
jmert mannequin opened this issue Jun 15, 2020 · 7 comments
Labels
bugzilla (Issues migrated from bugzilla), floating-point (Floating-point math), llvm:optimizations

Comments

jmert mannequin commented Jun 15, 2020

Bugzilla Link: 46326
Version: 9.0
OS: Linux
CC: @chriselrod, @LebedevRI, @RKSimon, @nunoplopes, @rotateright

Extended Description

Overview:

This is a specific case found in Julia code; more context is available in [1], where it was suggested that I file this as an LLVM bug.

The unoptimized IR below contains two fast-math instructions just before returning: a multiplication followed by a square root. At an optimization level of -O1 or higher, LLVM v8 optimizes away a bit of the function preamble but otherwise leaves the mathematical calculation intact. Starting with LLVM v9, though, the fdiv (%17 in the unoptimized form) is changed to an fdiv fast instruction and sunk to the end of the function, next to the fmul fast. Discussion on the Julia Discourse forum helped narrow the cause from the whole -O1 pass set down to specifically the InstCombine optimization pass.

Godbolt link demonstrating this: [2]

This changes the numerical result of the function depending on the version of LLVM used by Julia. I've noticed, though, that the llc-emitted assembly is the same for both LLVM v8 and v9 as viewed on Godbolt, while the assembly Julia emits differs (Julia's pipeline is evidently sensitive to the change in the optimized IR). I've put together a gist [3] showing the assembly for Julia v1.4.2 (LLVM v8), Julia v1.5-beta1 (LLVM v9), and llc run on the provided IR (identical in v8 and v9).

Steps to Reproduce:

; Original unoptimized IR, to be optimized with
; opt -march=x86-64 -mcpu=sandybridge -instcombine
; or replace '-instcombine' with {'-O1', '-O2', '-O3'}, for both LLVM v8 and v9.
define double @"julia_coeff_\CE\B1_1367"(i64, i64) {
top:
%2 = call %jl_value_t*** @julia.ptls_states()
%3 = bitcast %jl_value_t*** %2 to %jl_value_t**
%4 = getelementptr inbounds %jl_value_t*, %jl_value_t** %3, i64 4
%5 = bitcast %jl_value_t** %4 to i64**
%6 = load i64*, i64** %5
%7 = sitofp i64 %0 to double
%8 = sitofp i64 %1 to double
%9 = fmul double 2.000000e+00, %7
%10 = fadd double %9, 1.000000e+00
%11 = fmul double 2.000000e+00, %7
%12 = fsub double %11, 3.000000e+00
%13 = fmul double %7, %7
%14 = fmul double %8, %8
%15 = fsub double %13, %14
%16 = fmul double %12, %15
%17 = fdiv double %10, %16
%18 = fsub double %7, 1.000000e+00
%19 = fmul double %18, %18
%20 = fmul double 4.000000e+00, %19
%21 = fsub double %20, 1.000000e+00
%22 = fmul fast double %17, %21
%23 = call fast double @llvm.sqrt.f64(double %22)
ret double %23
}
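
For reference, the tail of the function after LLVM v9's InstCombine pass ends up looking roughly like this (a sketch reconstructed from the -debug log quoted later in this thread; value numbering is illustrative). Note that the division has both acquired the fast flag and been reordered so that it now sits between the fmul fast and the sqrt, even though it carried no fast-math flags in the input:

; ...preceding arithmetic unchanged, still with no fast-math flags...
%17 = fmul fast double %6, %16                      ; (2x + 1) * (4(x - 1)^2 - 1)
%18 = fdiv fast double %17, %12                     ; division now happens last, and is marked fast
%19 = call fast double @llvm.sqrt.f64(double %18)
ret double %19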

Expected Results:

As I understand it (and correct me if I'm wrong), the fdiv should not be changed to an fdiv fast instruction.

Build Date & Hardware:

julia> versioninfo()
Julia Version 1.5.0-beta1.0
Commit 6443f6c95a (2020-05-28 17:42 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 3 2200G with Radeon Vega Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, znver1)

Additional Information:

[1] https://discourse.julialang.org/t/investigating-numerical-change-in-function-return-value-between-v1-4-vs-v1-5/41332/5
[2] https://godbolt.org/z/r-cNuL
[3] https://gist.github.com/jmert/6aea12adb74ef8b7f25eba276d42911a

@LebedevRI (Member) commented:

The transform in question is http://reviews.llvm.org/rL358404

$ ./bin/opt -march=x86-64 -mcpu=sandybridge -instcombine -debug /tmp/test.ll
Args: ./bin/opt -march=x86-64 -mcpu=sandybridge -instcombine -debug /tmp/test.ll
WARNING: You're attempting to print out a bitcode file.
This is inadvisable as it may cause display problems. If
you REALLY want to taste LLVM bitcode first-hand, you
can force output with the `-f' option.

INSTCOMBINE ITERATION #​1 on julia_coeff_α_1367
IC: ADD: ret double %23
IC: ADD: %23 = call fast double @​llvm.sqrt.f64(double %22)
IC: ADD: %22 = fmul fast double %17, %21
IC: ADD: %21 = fsub double %20, 1.000000e+00
IC: ADD: %20 = fmul double 4.000000e+00, %19
IC: ADD: %19 = fmul double %18, %18
IC: ADD: %18 = fsub double %7, 1.000000e+00
IC: ADD: %17 = fdiv double %10, %16
IC: ADD: %16 = fmul double %12, %15
IC: ADD: %15 = fsub double %13, %14
IC: ADD: %14 = fmul double %8, %8
IC: ADD: %13 = fmul double %7, %7
IC: ADD: %12 = fsub double %11, 3.000000e+00
IC: ADD: %11 = fmul double 2.000000e+00, %7
IC: ADD: %10 = fadd double %9, 1.000000e+00
IC: ADD: %9 = fmul double 2.000000e+00, %7
IC: ADD: %8 = sitofp i64 %1 to double
IC: ADD: %7 = sitofp i64 %0 to double
IC: DCE: %6 = load i64*, i64** %5, align 8
IC: DCE: %5 = bitcast %jl_value_t** %4 to i64**
IC: DCE: %4 = getelementptr inbounds %jl_value_t*, %jl_value_t** %3, i64 4
IC: DCE: %3 = bitcast %jl_value_t*** %2 to %jl_value_t**
IC: ADD: %2 = call %jl_value_t*** @​julia.ptls_states()
IC: Visiting: %2 = call %jl_value_t*** @​julia.ptls_states()
IC: Visiting: %3 = sitofp i64 %0 to double
IC: Visiting: %4 = sitofp i64 %1 to double
IC: Visiting: %5 = fmul double 2.000000e+00, %3
IC: Mod = %5 = fmul double 2.000000e+00, %3
New = %5 = fmul double %3, 2.000000e+00
IC: ADD: %5 = fmul double %3, 2.000000e+00
IC: Visiting: %5 = fmul double %3, 2.000000e+00
IC: Visiting: %6 = fadd double %5, 1.000000e+00
IC: Visiting: %7 = fmul double 2.000000e+00, %3
IC: Mod = %7 = fmul double 2.000000e+00, %3
New = %7 = fmul double %3, 2.000000e+00
IC: ADD: %7 = fmul double %3, 2.000000e+00
IC: Visiting: %7 = fmul double %3, 2.000000e+00
IC: Visiting: %8 = fsub double %7, 3.000000e+00
IC: Old = %8 = fsub double %7, 3.000000e+00
New = = fadd double %7, -3.000000e+00
IC: ADD: %8 = fadd double %7, -3.000000e+00
IC: ERASE %9 = fsub double %7, 3.000000e+00
IC: ADD DEFERRED: %7 = fmul double %3, 2.000000e+00
IC: ADD: %7 = fmul double %3, 2.000000e+00
IC: Visiting: %7 = fmul double %3, 2.000000e+00
IC: Visiting: %8 = fadd double %7, -3.000000e+00
IC: Visiting: %9 = fmul double %3, %3
IC: Visiting: %10 = fmul double %4, %4
IC: Visiting: %11 = fsub double %9, %10
IC: Visiting: %12 = fmul double %8, %11
IC: Visiting: %13 = fdiv double %6, %12
IC: Visiting: %14 = fsub double %3, 1.000000e+00
IC: Old = %14 = fsub double %3, 1.000000e+00
New = = fadd double %3, -1.000000e+00
IC: ADD: %14 = fadd double %3, -1.000000e+00
IC: ERASE %15 = fsub double %3, 1.000000e+00
IC: ADD DEFERRED: %3 = sitofp i64 %0 to double
IC: ADD: %3 = sitofp i64 %0 to double
IC: Visiting: %3 = sitofp i64 %0 to double
IC: Visiting: %14 = fadd double %3, -1.000000e+00
IC: Visiting: %15 = fmul double %14, %14
IC: Visiting: %16 = fmul double 4.000000e+00, %15
IC: Mod = %16 = fmul double 4.000000e+00, %15
New = %16 = fmul double %15, 4.000000e+00
IC: ADD: %16 = fmul double %15, 4.000000e+00
IC: Visiting: %16 = fmul double %15, 4.000000e+00
IC: Visiting: %17 = fsub double %16, 1.000000e+00
IC: Old = %17 = fsub double %16, 1.000000e+00
New = = fadd double %16, -1.000000e+00
IC: ADD: %17 = fadd double %16, -1.000000e+00
IC: ERASE %18 = fsub double %16, 1.000000e+00
IC: ADD DEFERRED: %16 = fmul double %15, 4.000000e+00
IC: ADD: %16 = fmul double %15, 4.000000e+00
IC: Visiting: %16 = fmul double %15, 4.000000e+00
IC: Visiting: %17 = fadd double %16, -1.000000e+00
IC: Visiting: %18 = fmul fast double %13, %17
IC: ADD DEFERRED: %18 = fmul fast double %6, %17
IC: Old = %19 = fmul fast double %13, %17
New = = fdiv fast double %18, %12 # <-
IC: ADD: %19 = fdiv fast double %18, %12
IC: ERASE %20 = fmul fast double %13, %17
IC: ADD DEFERRED: %13 = fdiv double %6, %12
IC: ADD DEFERRED: %17 = fadd double %16, -1.000000e+00
IC: ADD: %17 = fadd double %16, -1.000000e+00
IC: ERASE %13 = fdiv double %6, %12
IC: ADD DEFERRED: %6 = fadd double %5, 1.000000e+00
IC: ADD DEFERRED: %12 = fmul double %8, %11
IC: ADD: %12 = fmul double %8, %11
IC: ADD: %6 = fadd double %5, 1.000000e+00
IC: ADD: %17 = fmul fast double %6, %16
IC: Visiting: %17 = fmul fast double %6, %16
IC: Visiting: %6 = fadd double %5, 1.000000e+00
IC: Visiting: %12 = fmul double %8, %11
IC: Visiting: %16 = fadd double %15, -1.000000e+00
IC: Visiting: %18 = fdiv fast double %17, %12
IC: Visiting: %19 = call fast double @​llvm.sqrt.f64(double %18)
IC: Visiting: ret double %19

INSTCOMBINE ITERATION #​2 on julia_coeff_α_1367
IC: ADD: ret double %19
IC: ADD: %19 = call fast double @​llvm.sqrt.f64(double %18)
IC: ADD: %18 = fdiv fast double %17, %12
IC: ADD: %17 = fmul fast double %6, %16
IC: ADD: %16 = fadd double %15, -1.000000e+00
IC: ADD: %15 = fmul double %14, 4.000000e+00
IC: ADD: %14 = fmul double %13, %13
IC: ADD: %13 = fadd double %3, -1.000000e+00
IC: ADD: %12 = fmul double %8, %11
IC: ADD: %11 = fsub double %9, %10
IC: ADD: %10 = fmul double %4, %4
IC: ADD: %9 = fmul double %3, %3
IC: ADD: %8 = fadd double %7, -3.000000e+00
IC: ADD: %7 = fmul double %3, 2.000000e+00
IC: ADD: %6 = fadd double %5, 1.000000e+00
IC: ADD: %5 = fmul double %3, 2.000000e+00
IC: ADD: %4 = sitofp i64 %1 to double
IC: ADD: %3 = sitofp i64 %0 to double
IC: ADD: %2 = call %jl_value_t*** @​julia.ptls_states()
IC: Visiting: %2 = call %jl_value_t*** @​julia.ptls_states()
IC: Visiting: %3 = sitofp i64 %0 to double
IC: Visiting: %4 = sitofp i64 %1 to double
IC: Visiting: %5 = fmul double %3, 2.000000e+00
IC: Visiting: %6 = fadd double %5, 1.000000e+00
IC: Visiting: %7 = fmul double %3, 2.000000e+00
IC: Visiting: %8 = fadd double %7, -3.000000e+00
IC: Visiting: %9 = fmul double %3, %3
IC: Visiting: %10 = fmul double %4, %4
IC: Visiting: %11 = fsub double %9, %10
IC: Visiting: %12 = fmul double %8, %11
IC: Visiting: %13 = fadd double %3, -1.000000e+00
IC: Visiting: %14 = fmul double %13, %13
IC: Visiting: %15 = fmul double %14, 4.000000e+00
IC: Visiting: %16 = fadd double %15, -1.000000e+00
IC: Visiting: %17 = fmul fast double %6, %16
IC: Visiting: %18 = fdiv fast double %17, %12
IC: Visiting: %19 = call fast double @​llvm.sqrt.f64(double %18)
IC: Visiting: ret double %19

@rotateright (Contributor) commented:

As I understand it (and correct me if I'm wrong), the fdiv should not be
changed to an fdiv fast instruction.

This is an open question about the semantics of fast-math-flags in IR and codegen. In most cases so far, we have assumed that if the last value in a calculation is relaxed in some way ("fast" means all possible FP shortcuts are allowed -- http://llvm.org/docs/LangRef.html#fast-math-flags ), then we are free to apply those semantics to intermediate calcs leading to the final value.

It's not clear to me from this example what difference we get in the final output. Can you explain what you are/were expecting?

The case where IR instructions have different FMF doesn't come up much with C/C++ source code AFAIK, so we haven't gotten much feedback on that. The "@fastmath" annotation that you show in the Julia source makes it easy to create these patterns with different FMF.

@chriselrod commented:

then we are free to apply those semantics to intermediate calcs leading to the final value.

What if there was code like:

mask = a == a
b = select mask, a, zeroinitializer
d = fadd fast b, c

Would this mean that it'd be legal for LLVM to optimize the above into:

d = fadd fast a, c

because if you assume a's elements are not NaN, then a == a must all be true?
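
For concreteness, here is a rough LLVM IR rendering of that pattern (an illustrative sketch with made-up value names and vector width, not code from the original report):

%mask = fcmp oeq <2 x double> %a, %a                                           ; false only in lanes where %a is NaN
%b = select <2 x i1> %mask, <2 x double> %a, <2 x double> zeroinitializer      ; NaN lanes replaced with 0.0
%d = fadd fast <2 x double> %b, %c                                             ; only %d carries fast-math flags
; the question: may the nnan implied by 'fast' on %d be used to fold %b back to %a,
; i.e. %d = fadd fast <2 x double> %a, %c ?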

The "@fastmath" annotation that you show in the Julia source makes it easy to create these patterns with different FMF.

FWIW, Julia will likely remove the @fastmath annotation, possibly replacing it with more flag-specific annotations (like @nonans or @contract).

jmert mannequin commented Jun 15, 2020

It's not clear to me from this example what difference we get in the final
output. Can you explain what you are/were expecting?

I was expecting numerically identical results across versions of Julia. I've been trying to carefully craft a numerical function, and while working on improving the numerical accuracy, I came across this discrepancy between Julia v1.4 and v1.5 where the underlying LLVM version changed.

I [incorrectly] thought the fast-math optimizations would permit reordering/transformation among the fast-math instructions themselves, but that the effect would be restricted to that subset.

For this specific function, an example is calling coeff_α(6, 0) where the result shows a 1 ulp difference between the two versions of Julia. (Over a range of inputs, I calculated about 11% of the values differ by ±1 ulp across Julia versions.)

The case where IR instructions have different FMF doesn't come up much with
C/C++ source code AFAIK, so we haven't gotten much feedback on that. The
"@fastmath" annotation that you show in the Julia source makes it easy to
create these patterns with different FMF.

Yes, I initially tried in vain to create a C example that showed the effect as well. That's also why I was motivated to file this bug as feedback: Julia's "@fastmath" makes it much easier to produce instruction combinations that would be unusual coming from C/C++.

@rotateright (Contributor) commented:

mask = a == a
b = select mask, a, zeroinitializer
d = fadd fast b, c

Would this mean that it'd be legal for LLVM to optimize the above into:

d = fadd fast a, c

because if you assume a's elements are not NaN, then a == a must all be
true?

I agree that seems like a stretch to reason about 'a' based on the fact that 'd' is not NaN.

Let's take a minimal example of the transform in question with this bug report -- (x / y) * 42.0:
%div = fdiv double %x, %y
%mul = fmul reassoc double %div, 42.0

Currently, we will turn this into (x * 42.0) / y:
%t = fmul reassoc double %x, 42.0
%mul = fdiv reassoc double %t, %y

...because the multiply can be reassociated and there's no other use of the division value.

This report is suggesting that's too liberal - ie, the division requires precise IEEE semantics, so we can't touch it.

That means "reassoc" on any single value alone is worthless - we can't reassociate unless 2 or more values agree to be reassociable.
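
Under that stricter reading, the rewrite would presumably only be allowed when both instructions opt in, something like (a sketch, not an existing regression test):

%div = fdiv reassoc double %x, %y
%mul = fmul reassoc double %div, 42.0
; => both values agree to be reassociated, so (x / y) * 42.0 may become (x * 42.0) / y:
%t = fmul reassoc double %x, 42.0
%mul = fdiv reassoc double %t, %y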

I'm not opposed to this interpretation, but we will want to specify it in the LLVM LangRef (and probably should have done something about that already).

The "@fastmath" annotation that you show in the Julia source makes it easy to create these patterns with different FMF.

FWIW, Julia will likely remove the @fastmath annotation, possibly
replacing it with more flag-specific annotations (like @nonans or
@contract).

Nice! I know almost nothing about Julia, but it's interesting to see FP optimizations specified at line/operation level in source.

@rotateright (Contributor) commented:

I was expecting numerically identical results across versions of Julia.

In general, this is impossible once any FP calculation is marked "fast".
That's because the original definition of -ffast-math from gcc is so (intentionally?) vague:
https://gcc.gnu.org/wiki/FloatingPointMath

Without a formal spec, FP math can produce different results for any math op, and there may not be an obvious "correct" answer.

Take for example the trailing sqrt in the original code. If converted to an estimate sequence, that result can differ between various hardware implementations - any of the initial estimate, refinement algorithm, or availability of FMA will produce different results.

I did some analysis of that for x86 in bug 21385. Even identical asm is not guaranteed to produce identical results - if we use an estimate instruction, that can vary depending on CPU model.

So again, I'm not opposed to tightening the semantics, but I'm not entirely sure what effects that might have. We may end up losing an optimization somewhere and getting the opposite bug report where the stricter interpretation means we didn't produce as fast a sequence as someone else was expecting.

jmert mannequin commented Jun 17, 2020

I was expecting numerically identical results across versions of Julia.

In general, this is impossible once any FP calculation is marked "fast".
That's because the original definition of -ffast-math from gcc is so
(intentionally?) vague:
https://gcc.gnu.org/wiki/FloatingPointMath

Without a formal spec, FP math can produce different results for any math
op, and there may not be an obvious "correct" answer.

Take for example the trailing sqrt in the original code. If converted to an
estimate sequence, that result can differ between various hardware
implementations - any of the initial estimate, refinement algorithm, or
availability of FMA will produce different results.

You're right; I was a bit imprecise about what I meant. I started investigating when I didn't get the same answer (on the same CPU), and then found IR that I didn't understand. My reading of the relatively sparse description of fast-math matched my experience with LLVM 8 but not with LLVM 9, and that's what motivated this bug report. (In fact, my mental model was probably informed by observing what LLVM 8 and earlier empirically did.)

I'm perfectly willing to be told my interpretation was too optimistic, though :-) It's not actually a critical issue for me, since I can easily work around the differences across Julia versions by being a little less lazy: I was ultimately abusing Julia's @fastmath macro just to get a version of sqrt that skips the domain (x < 0) check, since I know by construction that case won't happen in my algorithm.

I did some analysis of that for x86 in bug 21385. Even identical asm is not
guaranteed to produce identical results - if we use an estimate instruction,
that can vary depending on CPU model.

Kind of a side comment at this point, but I've been testing on a single CPU thus far and plan to expand to a couple of different architectures soon. My goal for the moment is a best-reasonable-effort algorithm: I make use of numerical best practices in designing the algorithm, but I'm not (yet) going to bend over backwards if one CPU has poorer accuracy in, say, its sqrt instruction than another. (I do still want the speed afforded by CPU instructions versus library calls.)

So again, I'm not opposed to tightening the semantics, but I'm not entirely
sure what effects that might have. We may end up losing an optimization
somewhere and getting the opposite bug report where the stricter
interpretation means we didn't produce as fast a sequence as someone else
was expecting.

I think we're in agreement. It'd be nice to have some clarification on what the semantics are just because there are reasonable yet disjoint interpretations possible.

I don't need this to be "fixed" to make progress, and Julia admittedly sits in a unique space where single-instruction fast-math flags are easy to make.

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021