Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiSource/Applications/siod fails on X86-64 with clang 7.0.0-rc1 #37996

Closed
tstellar opened this issue Aug 20, 2018 · 21 comments
Closed

MultiSource/Applications/siod fails on X86-64 with clang 7.0.0-rc1 #37996

tstellar opened this issue Aug 20, 2018 · 21 comments
Assignees
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@tstellar
Copy link
Collaborator

Bugzilla Link 38648
Resolution FIXED
Resolved on Jun 06, 2019 14:22
Version 7.0
OS Linux
Blocks #40566
CC @topperc,@francisvm,@zmodem,@rotateright
Fixed by commit(s) r357986 r362747 r362751

Extended Description

This is a regression, this test executes correctly with clang 6.0.1. Running the command:

${TEST_SUITE_BUILDDIR}/MultiSource/Applications/siod/siod -v1 ${TEST_SUITE_SRCDIR}/test-suite/MultiSource/Applications/siod/test.scm

Produces an error:

Content-type: text/plain

10
9
8
7
6
5
4
3
2
1
-42 is negative
0 is zero
42 is positive
the value of (get 'answer 'value) is 42
the value of (get 'answer 'value) is xyzzy
the 33rd Fibonacci number is ERROR: damaged frame (see errobj)

I am going to try to bisect this.

@tstellar
Copy link
Collaborator Author

assigned to @jyknight

@tstellar
Copy link
Collaborator Author

This failure was introduced by: r322957.

@rotateright
Copy link
Contributor

Can you attach pre-processed source or IR to repro the problem?

@tstellar
Copy link
Collaborator Author

Pre-processed source
Here is the pre-processed source. Here are the compile arguments that are used:

clang -DNDEBUG -O2 -DNDEBUG -w -Werror=date-time -D__USE_MISC -D__USE_GNU -D__USE_SVID -D__USE_XOPEN_EXTENDED -D__USE_XOPEN -Dunix -o slib.c.o -c slib.i

@rotateright
Copy link
Contributor

Created attachment 20757 [details]
Pre-processed source

Here is the pre-processed source. Here are the compile arguments that are
used:

clang -DNDEBUG -O2 -DNDEBUG -w -Werror=date-time -D__USE_MISC -D__USE_GNU
-D__USE_SVID -D__USE_XOPEN_EXTENDED -D__USE_XOPEN -Dunix -o slib.c.o -c
slib.i

Thanks!

To make it easier to A/B test this, I added a flag that disables the optimization that was added with r322957 here:
https://reviews.llvm.org/rL340540

At first glance, that optimization is doing what we expected in _c_mark_and_sweep():
movabsq $9223372036854775800, %r14 ## imm = 0x7FFFFFFFFFFFFFF8
shrq %rdx
andq %r14, %rdx

becomes:
andq $-8, %rdx ; avoided a giant constant and saved a register

But there's another diff later:
movabsq $-6148914691236517205, %r10 ## imm = 0xAAAAAAAAAAAAAAAB

becomes:
movq _heaps@GOTPCREL(%rip), %r10

That seems...odd. cc'ing Craig for ideas about how to reduce/debug this.

@topperc
Copy link
Collaborator

topperc commented Aug 26, 2018

I'm not able to reproduce this error. I downloaded the 7.0-rc1 source and built it on linux, but this is the output i got from the test

Content-type: text/plain

10
9
8
7
6
5
4
3
2
1
-42 is negative
0 is zero
42 is positive
the value of (get 'answer 'value) is 42
the value of (get 'answer 'value) is xyzzy
the 33rd Fibonacci number is 3.52458e+06

@rotateright
Copy link
Contributor

Let's make sure we're compiling the same way - the "-O2 NDEBUG" appears to be necessary (and I'm not sure if that's how test-suite would build this by default).

I set up test-suite on Linux x86-64, and I can reproduce the problem, but I'm not seeing that same suspicious asm that I saw before (that was on macOS using the attached source file).

Here's how I built/ran the siod benchmark:

~/mysandbox $ ./bin/lnt runtest test-suite --sandbox ~/mysandbox/ --cc ~/myllvm/release/bin/clang --test-suite ~/llvm-test-suite --only-test MultiSource/Applications/siod --use-lit ~/myllvm/release/bin/llvm-lit --use-cmake cmake --build-threads 4 --threads 4 --cflags="-O2 -DNDEBUG -mllvm -x86-and-imm-shrink=0 -mllvm -stats -save-temps"

This passes as shown in comment 5.

Now, allow mask shrinking:

~/mysandbox $ ./bin/lnt runtest test-suite --sandbox ~/mysandbox/ --cc ~/myllvm/release/bin/clang --test-suite ~/llvm-test-suite --only-test MultiSource/Applications/siod --use-lit ~/myllvm/release/bin/llvm-lit --use-cmake cmake --build-threads 4 --threads 4 --cflags="-O2 -DNDEBUG -mllvm -x86-and-imm-shrink=1 -mllvm -stats -save-temps"

This fails as shown in the description.

@topperc
Copy link
Collaborator

topperc commented Aug 28, 2018

I managed to get it to fail with and shrinking enabled, but I got "ERROR: too few arguements"

If you modify test.scm to calculate a different fibonacci number, do you find that it stops failing for smaller values? For me the failure doesn't happen calculating the 20th fibonacci number, but fails for 21.

@tstellar
Copy link
Collaborator Author

I managed to get it to fail with and shrinking enabled, but I got "ERROR:
too few arguements"

If you modify test.scm to calculate a different fibonacci number, do you
find that it stops failing for smaller values? For me the failure doesn't
happen calculating the 20th fibonacci number, but fails for 21.

For me the lowest number that fails for me is 23.

@tstellar
Copy link
Collaborator Author

Reproducer script
Here is the script I've been using to reproduce. You may have to adjust some of the paths if you haven't run cmake.

@rotateright
Copy link
Contributor

I managed to get it to fail with and shrinking enabled, but I got "ERROR:
too few arguements"

If you modify test.scm to calculate a different fibonacci number, do you
find that it stops failing for smaller values? For me the failure doesn't
happen calculating the 20th fibonacci number, but fails for 21.

For me the lowest number that fails for me is 23.

That's what I'm seeing too.

There were some unnecessary params in my earlier test-suite invocation. It can be reduced to:

~/mysandbox$ ./bin/lnt runtest test-suite --sandbox ~/mysandbox/ --cc ~/myllvm/release/bin/clang --test-suite ~/llvm-test-suite --only-test MultiSource/Applications/siod --use-lit ~/myllvm/release/bin/llvm-lit --use-cmake cmake --cflags="-O2 -DNDEBUG -mllvm -x86-and-imm-shrink=1"

@zmodem
Copy link
Collaborator

zmodem commented Sep 4, 2018

Do you think we can fix this for the release? It seems bad if it's a regression from 6.0.1.

@topperc
Copy link
Collaborator

topperc commented Sep 6, 2018

I've looked a little closer and the ands that got optimized changed their mask from 0x7fff'ffff'ffff'fff8 in a register created by a movabsq that was hoisted outside of loop to using an and with a -8 immediate. The instruction proceeding all of the ands was a right shift by 1. So this seems correct to me.

But this freed up a register and changed the stack spills and reloads created by the register allocator.

The program has some questionable stack memory access behavior. Valgrind reports many conditional operations using uninitialized memory. I haven't checked it with address sanitizer yet. I wonder if the change in spills/reloads changed the behavior.

@topperc
Copy link
Collaborator

topperc commented Sep 6, 2018

The gc_mark_and_sweep function creates a variable named "end_of_stack", but doesn't initialize it. It then passes the address of this variable as an end pointer to a function called mark_locations. The "start" address passed comes from a global variable that points to another uninitialized object in an earlier callers stack frame. The mark_locations code will note that start and end are out of order due to stack growing down and will swap them. Making start then point to the beginning of the end_of_stack object. Then it calls mark_locations_array which will read from start to end looking for pointers. So this will read the uninitialized value from end_of_stack and it will read the return addresses and other spills, etc. along the way.

@rotateright
Copy link
Contributor

The gc_mark_and_sweep function creates a variable named "end_of_stack", but
doesn't initialize it. It then passes the address of this variable as an end
pointer to a function called mark_locations. The "start" address passed
comes from a global variable that points to another uninitialized object in
an earlier callers stack frame. The mark_locations code will note that start
and end are out of order due to stack growing down and will swap them.
Making start then point to the beginning of the end_of_stack object. Then it
calls mark_locations_array which will read from start to end looking for
pointers. So this will read the uninitialized value from end_of_stack and it
will read the return addresses and other spills, etc. along the way.

Nice detective work!

Not sure what the policy is for this sort of thing. Do we resolve this bug as invalid? Are we supposed to flag the test as potentially misbehaving or fix the test itself?

@zmodem
Copy link
Collaborator

zmodem commented Sep 7, 2018

The gc_mark_and_sweep function creates a variable named "end_of_stack", but
doesn't initialize it. It then passes the address of this variable as an end
pointer to a function called mark_locations. The "start" address passed
comes from a global variable that points to another uninitialized object in
an earlier callers stack frame. The mark_locations code will note that start
and end are out of order due to stack growing down and will swap them.
Making start then point to the beginning of the end_of_stack object. Then it
calls mark_locations_array which will read from start to end looking for
pointers. So this will read the uninitialized value from end_of_stack and it
will read the return addresses and other spills, etc. along the way.

Nice detective work!

+1, thanks for digging into this!

Not sure what the policy is for this sort of thing. Do we resolve this bug
as invalid? Are we supposed to flag the test as potentially misbehaving or
fix the test itself?

At least it means I won't worry about it for 7.0.0, so unblocking :-)

I think for the test-suite, it's common to fix the test to not misbehave if possible. If that's not possible / too inconvenient, I suppose we should flag or delete it since it's not adding value.

@topperc
Copy link
Collaborator

topperc commented Apr 2, 2019

One our engineers finally figured out the real problem here https://reviews.llvm.org/D60039

If I remember right from my analysis, Sanjay's patch did change some spill behavior here. I just didn't understand at the time how that would effect things.

@tstellar
Copy link
Collaborator Author

Hi James,

Do you think this is OK to merge into 8.0.1?

@jyknight
Copy link
Member

Yes, should be fine.

@tstellar
Copy link
Collaborator Author

tstellar commented Jun 6, 2019

Merged: r362747 and r362751.

@tstellar
Copy link
Collaborator Author

mentioned in issue #40566

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

5 participants