LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 46326 - InstCombine changes sibling non-fast instruction to fast
Summary: InstCombine changes sibling non-fast instruction to fast
Status: NEW
Alias: None
Product: libraries
Classification: Unclassified
Component: Scalar Optimizations
Version: 9.0
Hardware: PC Linux
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-15 07:14 PDT by Justin Willmert
Modified: 2020-06-17 09:43 PDT
CC: 7 users

See Also:
Fixed By Commit(s):


Attachments

Description Justin Willmert 2020-06-15 07:14:09 PDT
Overview:

This is a specific case found in Julia code; more context can be found in [1], where it was suggested I file this as an LLVM bug.

The unoptimized IR below contains two fast-math instructions just before returning --- a multiplication followed by a square root. At an optimization level of -O1 or higher, LLVM v8 optimizes away a bit of the function preamble but otherwise preserves the mathematical calculation. Starting with LLVM v9, though, the fdiv (value %17 in the unoptimized form) is changed to an fdiv fast instruction and moved toward the end of the function, next to the fmul fast. With help from the Julia Discourse forum, the trigger was narrowed down from the full -O1 pipeline to specifically the InstCombine optimization pass.

Godbolt link demonstrating this: [2]

This changes the numerical result of the function, depending on the version of LLVM being used in Julia. I've noticed, though, that the llc-emitted assembly is the same for both LLVM v8 and v9 as viewed on Godbolt, while Julia's emitted assembly differs (in a way that is sensitive to the change in LLVM IR optimization). I've put together a gist [3] that shows the assembly for Julia v1.4.2 (LLVM v8), Julia v1.5-beta1 (LLVM v9), and LLVM llc run on the provided IR (identical for v8 and v9).

Steps to Reproduce:

; Original unoptimized IR, to be optimized with
;   opt -march=x86-64 -mcpu=sandybridge -instcombine
; or replace '-instcombine' with {'-O1', '-O2', '-O3'}, for both LLVM v8 and v9.
define double @"julia_coeff_\CE\B1_1367"(i64, i64) {
top:
  %2 = call %jl_value_t*** @julia.ptls_states()
  %3 = bitcast %jl_value_t*** %2 to %jl_value_t**
  %4 = getelementptr inbounds %jl_value_t*, %jl_value_t** %3, i64 4
  %5 = bitcast %jl_value_t** %4 to i64**
  %6 = load i64*, i64** %5
  %7 = sitofp i64 %0 to double
  %8 = sitofp i64 %1 to double
  %9 = fmul double 2.000000e+00, %7
  %10 = fadd double %9, 1.000000e+00
  %11 = fmul double 2.000000e+00, %7
  %12 = fsub double %11, 3.000000e+00
  %13 = fmul double %7, %7
  %14 = fmul double %8, %8
  %15 = fsub double %13, %14
  %16 = fmul double %12, %15
  %17 = fdiv double %10, %16
  %18 = fsub double %7, 1.000000e+00
  %19 = fmul double %18, %18
  %20 = fmul double 4.000000e+00, %19
  %21 = fsub double %20, 1.000000e+00
  %22 = fmul fast double %17, %21
  %23 = call fast double @llvm.sqrt.f64(double %22)
  ret double %23
}
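
For comparison, under LLVM v9 the tail of the function after -instcombine ends up roughly as follows (a sketch reconstructed from the Godbolt output [2] and the -debug trace in comment 1; value numbering is illustrative, and x, y denote the two arguments converted to double):

  ; ...preceding arithmetic unchanged apart from canonicalizations...
  %17 = fmul fast double %6, %16    ; %6 = 2x+1, %16 = 4(x-1)^2 - 1
  %18 = fdiv fast double %17, %12   ; %12 = (2x-3)*(x^2 - y^2); the division is now fast
  %19 = call fast double @llvm.sqrt.f64(double %18)
  ret double %19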

Expected Results:

As I understand it (and correct me if I'm wrong), the fdiv should not be changed to an fdiv fast instruction.

Build Date & Hardware:

julia> versioninfo()
Julia Version 1.5.0-beta1.0
Commit 6443f6c95a (2020-05-28 17:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 3 2200G with Radeon Vega Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)

Additional Information:

[1] https://discourse.julialang.org/t/investigating-numerical-change-in-function-return-value-between-v1-4-vs-v1-5/41332/5
[2] https://godbolt.org/z/r-cNuL
[3] https://gist.github.com/jmert/6aea12adb74ef8b7f25eba276d42911a
Comment 1 Roman Lebedev 2020-06-15 07:57:48 PDT
The transform in question is http://reviews.llvm.org/rL358404


$ ./bin/opt -march=x86-64 -mcpu=sandybridge -instcombine -debug /tmp/test.ll 
Args: ./bin/opt -march=x86-64 -mcpu=sandybridge -instcombine -debug /tmp/test.ll 
WARNING: You're attempting to print out a bitcode file.
This is inadvisable as it may cause display problems. If
you REALLY want to taste LLVM bitcode first-hand, you
can force output with the `-f' option.



INSTCOMBINE ITERATION #1 on julia_coeff_α_1367
IC: ADD:   ret double %23
IC: ADD:   %23 = call fast double @llvm.sqrt.f64(double %22)
IC: ADD:   %22 = fmul fast double %17, %21
IC: ADD:   %21 = fsub double %20, 1.000000e+00
IC: ADD:   %20 = fmul double 4.000000e+00, %19
IC: ADD:   %19 = fmul double %18, %18
IC: ADD:   %18 = fsub double %7, 1.000000e+00
IC: ADD:   %17 = fdiv double %10, %16
IC: ADD:   %16 = fmul double %12, %15
IC: ADD:   %15 = fsub double %13, %14
IC: ADD:   %14 = fmul double %8, %8
IC: ADD:   %13 = fmul double %7, %7
IC: ADD:   %12 = fsub double %11, 3.000000e+00
IC: ADD:   %11 = fmul double 2.000000e+00, %7
IC: ADD:   %10 = fadd double %9, 1.000000e+00
IC: ADD:   %9 = fmul double 2.000000e+00, %7
IC: ADD:   %8 = sitofp i64 %1 to double
IC: ADD:   %7 = sitofp i64 %0 to double
IC: DCE:   %6 = load i64*, i64** %5, align 8
IC: DCE:   %5 = bitcast %jl_value_t** %4 to i64**
IC: DCE:   %4 = getelementptr inbounds %jl_value_t*, %jl_value_t** %3, i64 4
IC: DCE:   %3 = bitcast %jl_value_t*** %2 to %jl_value_t**
IC: ADD:   %2 = call %jl_value_t*** @julia.ptls_states()
IC: Visiting:   %2 = call %jl_value_t*** @julia.ptls_states()
IC: Visiting:   %3 = sitofp i64 %0 to double
IC: Visiting:   %4 = sitofp i64 %1 to double
IC: Visiting:   %5 = fmul double 2.000000e+00, %3
IC: Mod =   %5 = fmul double 2.000000e+00, %3
    New =   %5 = fmul double %3, 2.000000e+00
IC: ADD:   %5 = fmul double %3, 2.000000e+00
IC: Visiting:   %5 = fmul double %3, 2.000000e+00
IC: Visiting:   %6 = fadd double %5, 1.000000e+00
IC: Visiting:   %7 = fmul double 2.000000e+00, %3
IC: Mod =   %7 = fmul double 2.000000e+00, %3
    New =   %7 = fmul double %3, 2.000000e+00
IC: ADD:   %7 = fmul double %3, 2.000000e+00
IC: Visiting:   %7 = fmul double %3, 2.000000e+00
IC: Visiting:   %8 = fsub double %7, 3.000000e+00
IC: Old =   %8 = fsub double %7, 3.000000e+00
    New =   <badref> = fadd double %7, -3.000000e+00
IC: ADD:   %8 = fadd double %7, -3.000000e+00
IC: ERASE   %9 = fsub double %7, 3.000000e+00
IC: ADD DEFERRED:   %7 = fmul double %3, 2.000000e+00
IC: ADD:   %7 = fmul double %3, 2.000000e+00
IC: Visiting:   %7 = fmul double %3, 2.000000e+00
IC: Visiting:   %8 = fadd double %7, -3.000000e+00
IC: Visiting:   %9 = fmul double %3, %3
IC: Visiting:   %10 = fmul double %4, %4
IC: Visiting:   %11 = fsub double %9, %10
IC: Visiting:   %12 = fmul double %8, %11
IC: Visiting:   %13 = fdiv double %6, %12
IC: Visiting:   %14 = fsub double %3, 1.000000e+00
IC: Old =   %14 = fsub double %3, 1.000000e+00
    New =   <badref> = fadd double %3, -1.000000e+00
IC: ADD:   %14 = fadd double %3, -1.000000e+00
IC: ERASE   %15 = fsub double %3, 1.000000e+00
IC: ADD DEFERRED:   %3 = sitofp i64 %0 to double
IC: ADD:   %3 = sitofp i64 %0 to double
IC: Visiting:   %3 = sitofp i64 %0 to double
IC: Visiting:   %14 = fadd double %3, -1.000000e+00
IC: Visiting:   %15 = fmul double %14, %14
IC: Visiting:   %16 = fmul double 4.000000e+00, %15
IC: Mod =   %16 = fmul double 4.000000e+00, %15
    New =   %16 = fmul double %15, 4.000000e+00
IC: ADD:   %16 = fmul double %15, 4.000000e+00
IC: Visiting:   %16 = fmul double %15, 4.000000e+00
IC: Visiting:   %17 = fsub double %16, 1.000000e+00
IC: Old =   %17 = fsub double %16, 1.000000e+00
    New =   <badref> = fadd double %16, -1.000000e+00
IC: ADD:   %17 = fadd double %16, -1.000000e+00
IC: ERASE   %18 = fsub double %16, 1.000000e+00
IC: ADD DEFERRED:   %16 = fmul double %15, 4.000000e+00
IC: ADD:   %16 = fmul double %15, 4.000000e+00
IC: Visiting:   %16 = fmul double %15, 4.000000e+00
IC: Visiting:   %17 = fadd double %16, -1.000000e+00
IC: Visiting:   %18 = fmul fast double %13, %17
IC: ADD DEFERRED:   %18 = fmul fast double %6, %17
IC: Old =   %19 = fmul fast double %13, %17
    New =   <badref> = fdiv fast double %18, %12                  # <-
IC: ADD:   %19 = fdiv fast double %18, %12
IC: ERASE   %20 = fmul fast double %13, %17
IC: ADD DEFERRED:   %13 = fdiv double %6, %12
IC: ADD DEFERRED:   %17 = fadd double %16, -1.000000e+00
IC: ADD:   %17 = fadd double %16, -1.000000e+00
IC: ERASE   %13 = fdiv double %6, %12
IC: ADD DEFERRED:   %6 = fadd double %5, 1.000000e+00
IC: ADD DEFERRED:   %12 = fmul double %8, %11
IC: ADD:   %12 = fmul double %8, %11
IC: ADD:   %6 = fadd double %5, 1.000000e+00
IC: ADD:   %17 = fmul fast double %6, %16
IC: Visiting:   %17 = fmul fast double %6, %16
IC: Visiting:   %6 = fadd double %5, 1.000000e+00
IC: Visiting:   %12 = fmul double %8, %11
IC: Visiting:   %16 = fadd double %15, -1.000000e+00
IC: Visiting:   %18 = fdiv fast double %17, %12
IC: Visiting:   %19 = call fast double @llvm.sqrt.f64(double %18)
IC: Visiting:   ret double %19


INSTCOMBINE ITERATION #2 on julia_coeff_α_1367
IC: ADD:   ret double %19
IC: ADD:   %19 = call fast double @llvm.sqrt.f64(double %18)
IC: ADD:   %18 = fdiv fast double %17, %12
IC: ADD:   %17 = fmul fast double %6, %16
IC: ADD:   %16 = fadd double %15, -1.000000e+00
IC: ADD:   %15 = fmul double %14, 4.000000e+00
IC: ADD:   %14 = fmul double %13, %13
IC: ADD:   %13 = fadd double %3, -1.000000e+00
IC: ADD:   %12 = fmul double %8, %11
IC: ADD:   %11 = fsub double %9, %10
IC: ADD:   %10 = fmul double %4, %4
IC: ADD:   %9 = fmul double %3, %3
IC: ADD:   %8 = fadd double %7, -3.000000e+00
IC: ADD:   %7 = fmul double %3, 2.000000e+00
IC: ADD:   %6 = fadd double %5, 1.000000e+00
IC: ADD:   %5 = fmul double %3, 2.000000e+00
IC: ADD:   %4 = sitofp i64 %1 to double
IC: ADD:   %3 = sitofp i64 %0 to double
IC: ADD:   %2 = call %jl_value_t*** @julia.ptls_states()
IC: Visiting:   %2 = call %jl_value_t*** @julia.ptls_states()
IC: Visiting:   %3 = sitofp i64 %0 to double
IC: Visiting:   %4 = sitofp i64 %1 to double
IC: Visiting:   %5 = fmul double %3, 2.000000e+00
IC: Visiting:   %6 = fadd double %5, 1.000000e+00
IC: Visiting:   %7 = fmul double %3, 2.000000e+00
IC: Visiting:   %8 = fadd double %7, -3.000000e+00
IC: Visiting:   %9 = fmul double %3, %3
IC: Visiting:   %10 = fmul double %4, %4
IC: Visiting:   %11 = fsub double %9, %10
IC: Visiting:   %12 = fmul double %8, %11
IC: Visiting:   %13 = fadd double %3, -1.000000e+00
IC: Visiting:   %14 = fmul double %13, %13
IC: Visiting:   %15 = fmul double %14, 4.000000e+00
IC: Visiting:   %16 = fadd double %15, -1.000000e+00
IC: Visiting:   %17 = fmul fast double %6, %16
IC: Visiting:   %18 = fdiv fast double %17, %12
IC: Visiting:   %19 = call fast double @llvm.sqrt.f64(double %18)
IC: Visiting:   ret double %19
Comment 2 Sanjay Patel 2020-06-15 08:46:48 PDT
(In reply to Justin Willmert from comment #0) 
> As I understand it (and correct me if I'm wrong), the fdiv should not be
> changed to an fdiv fast instruction.

This is an open question about the semantics of fast-math-flags in IR and codegen. In most cases so far, we have assumed that if the last value in a calculation is relaxed in some way ("fast" means all possible FP shortcuts are allowed -- http://llvm.org/docs/LangRef.html#fast-math-flags ), then we are free to apply those semantics to intermediate calcs leading to the final value.
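
Stripped to its essentials, the pattern at issue is a strict operation feeding a relaxed one (an illustrative sketch with fresh value names, not copied verbatim from the report):

  %div = fdiv double %num, %den          ; no FMF: strict IEEE division
  %mul = fmul fast double %div, %scale   ; fully relaxed multiply
  %res = call fast double @llvm.sqrt.f64(double %mul)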

It's not clear to me from this example what difference we get in the final output. Can you explain what you are/were expecting?

The case where IR instructions have different FMF doesn't come up much with C/C++ source code AFAIK, so we haven't gotten much feedback on that. The "@fastmath" annotation that you show in the Julia source makes it easy to create these patterns with different FMF.
Comment 3 Chris Elrod 2020-06-15 09:05:22 PDT
> then we are free to apply those semantics to intermediate calcs leading to the final value.

What if there was code like:

mask = a == a
b = select mask, a, zeroinitializer
d = fadd fast b, c

Would this mean that it'd be legal for LLVM to optimize the above into:

d = fadd fast a, c

because if you assume `a`'s elements are not NaN, then a == a must all be true?
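
In actual IR that hypothetical would look something like the following (a sketch; the vector type is chosen arbitrarily for illustration):

  %mask = fcmp oeq <4 x double> %a, %a
  %b = select <4 x i1> %mask, <4 x double> %a, <4 x double> zeroinitializer
  %d = fadd fast <4 x double> %b, %c
  ; question: may this be folded to "%d = fadd fast <4 x double> %a, %c" on the
  ; assumption that the fast flag implies %a contains no NaNs?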

> The "@fastmath" annotation that you show in the Julia source makes it easy to create these patterns with different FMF.

FWIW, Julia will likely remove the `@fastmath` annotation, possibly replacing it with more flag-specific annotations (like `@nonans` or `@contract`).
Comment 4 Justin Willmert 2020-06-15 09:14:40 PDT
(In reply to Sanjay Patel from comment #2)
> It's not clear to me from this example what difference we get in the final
> output. Can you explain what you are/were expecting?

I was expecting numerically identical results across versions of Julia. I've been trying to carefully craft a numerical function, and while working on improving the numerical accuracy, I came across this discrepancy between Julia v1.4 and v1.5 where the underlying LLVM version changed.

I [incorrectly] thought the fast-math flags would permit reordering/transformations among the flagged instructions themselves, with that effect restricted to that subset.

For this specific function, an example is calling `coeff_α(6, 0)` where the result shows a 1 ulp difference between the two versions of Julia. (Over a range of inputs, I calculated about 11% of the values differ by ±1 ulp across Julia versions.)


> The case where IR instructions have different FMF doesn't come up much with
> C/C++ source code AFAIK, so we haven't gotten much feedback on that. The
> "@fastmath" annotation that you show in the Julia source makes it easy to
> create these patterns with different FMF.

Yes, I initially tried in vain to create a C example that showed the effect as well. That's also why I was motivated to file this bug as feedback: Julia's "@fastmath" makes it so much easier to produce code combinations that would be unusual in C/C++.
Comment 5 Sanjay Patel 2020-06-15 09:52:12 PDT
(In reply to Chris Elrod from comment #3)
> mask = a == a
> b = select mask, a, zeroinitializer
> d = fadd fast b, c
> 
> Would this mean that it'd be legal for LLVM to optimize the above into:
> 
> d = fadd fast a, c
> 
> because if you assume `a`'s elements are not NaN, then a == a must all be
> true?

I agree - it seems like a stretch to reason about 'a' based on the assumption that 'd' is not NaN.

Let's take a minimal example of the transform in question with this bug report -- (x / y) * 42.0:
  %div = fdiv double %x, %y
  %mul = fmul reassoc double %div, 42.0

Currently, we will turn this into (x * 42.0) / y:
  %t = fmul reassoc double %x, 42.0
  %mul = fdiv reassoc double %t, %y

...because the multiply can be reassociated and there's no other use of the division value. 
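
A self-contained reproducer for just this minimal case might look like the following (a sketch, assuming it is run through "opt -instcombine -S"; the function name is arbitrary):

  define double @sink_div(double %x, double %y) {
    %div = fdiv double %x, %y                      ; strict division, no FMF
    %mul = fmul reassoc double %div, 4.200000e+01  ; reassociable multiply
    ret double %mul
  }
  ; expected output today: %mul = (x * 42.0) / y, with the new fdiv carrying "reassoc"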

This report is suggesting that's too liberal - i.e., the division requires precise IEEE semantics, so we can't touch it.

That means "reassoc" on any single value alone is worthless - we can't reassociate unless 2 or more values agree to be reassociable.

I'm not opposed to this interpretation, but we will want to specify it in the LLVM LangRef (and probably should have done something about that already).

 
> > The "@fastmath" annotation that you show in the Julia source makes it easy to create these patterns with different FMF.
> 
> FWIW, Julia will likely remove the `@fastmath` annotation, possibly
> replacing it with more flag-specific annotations (like `@nonans` or
> `@contract`).

Nice! I know almost nothing about Julia, but it's interesting to see FP optimizations specified at line/operation level in source.
Comment 6 Sanjay Patel 2020-06-15 13:00:52 PDT
(In reply to Justin Willmert from comment #4)
> I was expecting numerically identical results across versions of Julia. 

In general, this is impossible once *any* FP calculation is marked "fast".
That's because the original definition of -ffast-math from gcc is so (intentionally?) vague:
https://gcc.gnu.org/wiki/FloatingPointMath

Without a formal spec, FP math can produce different results for any math op, and there may not be an obvious "correct" answer.

Take for example the trailing sqrt in the original code. If converted to an estimate sequence, that result can differ between various hardware implementations - any of the initial estimate, refinement algorithm, or availability of FMA will produce different results. 

I did some analysis of that for x86 in bug 21385. Even identical asm is not guaranteed to produce identical results - if we use an estimate instruction, that can vary depending on CPU model.

So again, I'm not opposed to tightening the semantics, but I'm not entirely sure what effects that might have. We may end up losing an optimization somewhere and getting the opposite bug report where the stricter interpretation means we didn't produce as fast a sequence as someone else was expecting.
Comment 7 Justin Willmert 2020-06-17 09:43:18 PDT
(In reply to Sanjay Patel from comment #6)
> (In reply to Justin Willmert from comment #4)
> > I was expecting numerically identical results across versions of Julia. 
> 
> In general, this is impossible once *any* FP calculation is marked "fast".
> That's because the original definition of -ffast-math from gcc is so
> (intentionally?) vague:
> https://gcc.gnu.org/wiki/FloatingPointMath
> 
> Without a formal spec, FP math can produce different results for any math
> op, and there may not be an obvious "correct" answer.
> 
> Take for example the trailing sqrt in the original code. If converted to an
> estimate sequence, that result can differ between various hardware
> implementations - any of the initial estimate, refinement algorithm, or
> availability of FMA will produce different results. 

You're right — I was a bit imprecise about what I meant. I started investigating when I didn't get the same answer (on the same CPU), and then I didn't understand the IR I found. My assumed reading of the relatively sparse description of what fast-math does matched my experience with LLVM 8 but not with LLVM 9, so that's what motivated this bug report. (And in fact, my mental model was probably informed by observing what LLVM 8 and earlier empirically did.)

I'm perfectly willing to be told my interpretation was too optimistic, though :-) And it's not actually a critical issue for me, since I can easily work around the differences across versions of Julia by being a little less lazy: I was ultimately abusing Julia's @fastmath macro just to get a version of sqrt that skips the domain (x < 0) check, since I know by construction that case won't happen in my algorithm.


> I did some analysis of that for x86 in bug 21385. Even identical asm is not
> guaranteed to produce identical results - if we use an estimate instruction,
> that can vary depending on CPU model.

Kind of a side comment at this point, but I've been doing single-CPU tests thus far and plan to expand to a couple of different architectures soon. My goal for the moment is a best-reasonable-effort algorithm: I make use of numerical best practices in designing the algorithm, but I'm not (yet) going to bend over backwards if one CPU has poorer accuracy in, say, its sqrt instruction than another. (I do still want the speed afforded by CPU instructions versus library calls.)


> So again, I'm not opposed to tightening the semantics, but I'm not entirely
> sure what effects that might have. We may end up losing an optimization
> somewhere and getting the opposite bug report where the stricter
> interpretation means we didn't produce as fast a sequence as someone else
> was expecting.

I think we're in agreement. It'd be nice to have some clarification on what the semantics are just because there are reasonable yet disjoint interpretations possible.

I don't _need_ this to be "fixed" to make progress, and Julia admittedly sits in a unique space where single-instruction fast-math flags are easy to make.