SciMark performance regression since r246985 #25695

Closed
llvmbot opened this issue Oct 26, 2015 · 27 comments
Labels
bugzilla (Issues migrated from bugzilla), loopoptim

Comments

@llvmbot
Collaborator

llvmbot commented Oct 26, 2015

Bugzilla Link 25321
Resolution FIXED
Resolved on Jan 09, 2017 14:44
Version 3.8
OS Windows NT
Reporter LLVM Bugzilla Contributor
CC @dschuff,@emaste,@zmodem,@hfinkel,@ismail,@jfbastien,@RKSimon,@slacka,@MatzeB,@rjmccall,@sanjoy,@rotateright

Extended Description

SciMark2 MonteCarlo performance regressed to less than half of its previous level in r246985.

To reproduce, download http://math.nist.gov/scimark2/scimark2_1c.zip and extract it somewhere. Compile and run the benchmark with

clang -o scimark2 -O2 *.c
./scimark2
and look at the MonteCarlo results.

Comparing gcc 5.1, clang r246984, clang r246985 and r251299 (current svn):
gcc 5.1 and clang r246984 MonteCarlo performance are similar, although gcc takes the lead.
clang r246985 performance was cut to less than half, and it has stayed there through current svn. This may be a result of the code generated to access Random_struct: after replacing

typedef struct
{
    int m[17];
    int seed;
    int i; /* originally = 4 */

with

typedef struct
{
    int m[18];
    int seed;
    int i; /* originally = 4 */

the performance regression is gone.
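A layout sketch may explain why the array size matters (my reconstruction, assuming x86-64 SysV struct layout and the full field list from the reduced test case quoted later in this thread): with m[17], the i/j pair begins at an 8-byte-aligned offset, so the two 4-byte loads can be merged into one 8-byte access; with m[18], the pair straddles an 8-byte boundary and no merging happens.

#include <stdio.h>
#include <stddef.h>

/* Field list taken from the reduced Random_struct later in the thread;
   the double members give the whole struct 8-byte alignment. */
typedef struct {
    int m[17];     /* 68 bytes; with m[18] everything from seed on shifts by 4 */
    int seed;      /* offset 68 */
    int i;         /* offset 72: 8-byte aligned, so the i/j pair can be fused */
    int j;         /* offset 76 */
    int haveRange; /* offset 80, then 4 bytes of padding */
    double left;   /* offset 88: matches the 88(%rdi) access in the asm later in the thread */
    double right;  /* offset 96 */
    double width;  /* offset 104 */
} Random_struct;

int main(void) {
    printf("i=%zu j=%zu left=%zu\n",
           offsetof(Random_struct, i),
           offsetof(Random_struct, j),
           offsetof(Random_struct, left));
    /* Prints "i=72 j=76 left=88"; with m[18] it prints "i=76 j=80 left=88",
       and the i/j pair no longer fits in one aligned 8-byte slot. */
    return 0;
}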

uname -a
Linux PC 3.16.0-51-generic #69~14.04.1-Ubuntu SMP Wed Oct 7 15:32:41 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

~/bug$ gcc -v 2>&1 | grep version
gcc version 5.1.0 (Ubuntu 5.1.0-0ubuntu11~14.04.1)
~/bug$ gcc -O2 scimark2/*.c -o scimark -lm
~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 672.76

~/bug$ llvm/bin/clang -v 2>&1 | grep version
clang version 3.8.0 (trunk 246984)
~/bug$ llvm/bin/clang -O2 scimark2/*.c -o scimark -lm
~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 631.23

~/bug$ llvm/fast/bin/clang -v 2>&1 | grep version
clang version 3.8.0 (trunk 246985)
~/bug$ llvm/fast/bin/clang -O2 scimark2/*.c -o scimark -lm
~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 270.40

~/bug$ llvm/fast/bin/clang -v 2>&1 | grep version
clang version 3.8.0 (trunk 251299)
~/bug$ llvm/fast/bin/clang -O2 scimark2/*.c -o scimark -lm
~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 266.42

@rjmccall
Contributor

Well, the positive side is that it should be straightforward to track this down by diffing the generated assembly. My guess is that we're merging a pair of accesses to adjacent int fields (i/j?) into a single 8-byte access, based on a completely proper assumption that the first is 8-byte-aligned, but somehow it isn't in practice.
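To illustrate the transform being described (a hypothetical C rendering of the widened access, not actual clang output; struct S and load_pair are made-up names, and little-endian x86-64 is assumed):

#include <stdint.h>
#include <string.h>

struct S { int i; int j; };  /* stand-ins for the adjacent int fields */

/* One 8-byte load replacing two adjacent 4-byte loads; legal only if
   &s->i is known to be 8-byte aligned. */
void load_pair(const struct S *s, int *i, int *j) {
    uint64_t both;
    memcpy(&both, &s->i, sizeof both);  /* the single widened load */
    *i = (int)(uint32_t)both;           /* low 32 bits: first field */
    *j = (int)(both >> 32);             /* high 32 bits: second field */
}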

@rjmccall
Contributor

Pre-patch -O2

@rjmccall
Contributor

Post-patch -O2

@rjmccall
Contributor

Pre-patch -O2 asm output

@rjmccall
Contributor

Attached pre/post IRGen output (-O2 -S -emit-llvm -Xclang -disable-llvm-optzns), .ll output (-O2 -S -emit-llvm), and asm output (-O2 -S). This is all just looking at Random.c; the other files seem to treat this as a black box, so if changing the layout of Random_struct has a performance impact, it should be reflected in this file.

As expected, the IRGen output difference is purely that we're annotating a couple of accesses with 8-byte alignment instead of 4-byte. (Ignore all the accesses to %retval; those allocas will be SSA'ed into oblivion anyway.)

The optimizer output seems to differ in the following ways:

  • The loads of i and j at the start of Random_nextDouble have been fused.
  • The loads of i and j at the start of the loop body in RandomVector have been fused. This is apparently preventing i and j from being kept in registers across the loop boundary; phis tracking their current value have been eliminated and they are now reloaded at the start of the loop body.
  • Approximately the same thing happened to RandomMatrix.

Those changes in RandomVector and RandomMatrix seem like obvious culprits here.
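To make the RandomVector observation concrete, here is a hypothetical reduction (the loop body and index updates are invented for illustration; only the load/store pattern matters):

typedef struct { int m[17]; int seed; int i; int j; } Rnd;  /* trimmed */

/* With separate 4-byte accesses, the optimizer can forward the stores of
   r->i and r->j at the bottom of the loop to the loads at the top, keeping
   both values in registers across iterations as loop phis. Once the two
   loads are widened into a single 8-byte access, that access matches
   neither 4-byte store, forwarding fails, and both fields are reloaded
   from memory on every iteration. */
void fill(Rnd *r, double *x, int n) {
    for (int k = 0; k < n; k++) {
        int i = r->i, j = r->j;              /* the loads that were fused */
        x[k] = (double)(r->m[i] + r->m[j]);  /* invented body */
        r->i = i ? i - 1 : 16;               /* invented index updates */
        r->j = j ? j - 1 : 16;
    }
}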

@llvmbot
Collaborator Author

llvmbot commented Nov 19, 2015

Ping?

@rjmccall
Contributor

Unfortunately, I can't actually fix this myself. Moving to LLVM.

@llvmbot
Collaborator Author

llvmbot commented Feb 15, 2016

This performance regression is present in both the trunk and 3.8 branch compilers.

@llvmbot
Collaborator Author

llvmbot commented Feb 16, 2016

@llvmbot
Collaborator Author

llvmbot commented Feb 16, 2016

Opened rdar://24676657 as this regression now exists in the Xcode 7.3 beta 3 seed.

@slacka
Mannequin

slacka mannequin commented Feb 19, 2016

This affects clang 3.8. Should the priority be bumped up?

@zmodem
Collaborator

zmodem commented Feb 20, 2016

> This affects clang 3.8. Should the priority be bumped up?

Unfortunate as it is, I'm afraid we're too late in the 3.8 process to wait for this, especially since no fix has emerged.

@slacka
Mannequin

slacka mannequin commented Apr 6, 2016

John McCall,
Are you working with the LLVM team to get this resolved?

@rjmccall
Contributor

rjmccall commented Apr 6, 2016

I believe that there are people on the LLVM team who are aware of this regression, yes.

@RKSimon
Collaborator

RKSimon commented Sep 30, 2016

Slightly reduced version to show an interesting bit:

typedef struct
{
    int m[17];
    int seed;
    int i;                     /* originally = 4 */
    int j;                     /* originally = 16 */
    int /*boolean*/ haveRange; /* = false; */
    double left;               /* = 0.0; */
    double right;              /* = 1.0; */
    double width;              /* = 1.0; */
} Random_struct, *Random;

double Random_nextDouble(Random R, int *sel, double left, double right, double width, int k, int I, int J)
{
    if (*sel)
        return R->left + left * (double) k * width;
    else
        return left * (double) k;
}

Random_nextDouble:
        cmpl    $0, (%rsi)
        je      .LBB0_2
        vcvtsi2sdl %edx, %xmm0, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
        retq
.LBB0_2:
        vcvtsi2sdl %edx, %xmm0, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        retq

But if we commute the R->left addition, we manage to pull out the common code:

double Random_nextDouble(Random R, int *sel, double left, double right, double width, int k, int I, int J)
{
    if (*sel)
        return left * (double) k * width + R->left;
    else
        return left * (double) k;
}

Random_nextDouble:
        vcvtsi2sdl %edx, %xmm0, %xmm1
        cmpl    $0, (%rsi)
        vmulsd  %xmm0, %xmm1, %xmm0
        je      .LBB0_2
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
.LBB0_2:
        retq

@RKSimon
Collaborator

RKSimon commented Nov 13, 2016

The hope is that NewGVN will solve this by performing the commutation in Random_nextDouble.

@llvmbot
Collaborator Author

llvmbot commented Nov 19, 2016

> The hope is that NewGVN will solve this by performing the commutation in Random_nextDouble.

Any chance that someone can mock up a patch on Phabricator to enable the proposed NewGVN by default in clang, to make testing easier for end users who want to experiment with it? I noticed that the current clang compiler doesn't even seem to enable the old GVN pass by default, as far as I can tell.

@llvmbot
Collaborator Author

llvmbot commented Nov 19, 2016

Not sure if the initial NewGVN patch will fix this, but I can add a patch to enable NewGVN if you want to give it a shot on Phab.

@llvmbot
Collaborator Author

llvmbot commented Nov 19, 2016

> Not sure if the initial NewGVN patch will fix this, but I can add a patch to enable NewGVN if you want to give it a shot on Phab.

Thanks. I wanted to test this since Comment 19 suggested that the NewGVN compiler pass might help with this performance regression.

@RKSimon
Collaborator

RKSimon commented Nov 20, 2016

That's my fault. Comment 19 should've read: "The hope is that NewGVN will eventually solve this by performing the commutation in Random_nextDouble"

@hfinkel
Collaborator

hfinkel commented Nov 20, 2016

> ...
> I noticed that the current clang compiler doesn't even seem to enable the old GVN pass by default, as far as I can tell.

We certainly use GVN in the default pipeline at -O2 and -O3, and have for a long time.
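One way to check this yourself (a suggestion, assuming the legacy pass manager of this era; -debug-pass=Structure prints the pipeline that actually runs): compile any file at -O2 and look for a "Global Value Numbering" entry in the output.

~/bug$ llvm/bin/clang -O2 -mllvm -debug-pass=Structure -c scimark2/Random.c 2>&1 | grep -i "value numbering"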

@slacka
Mannequin

slacka mannequin commented Dec 28, 2016

From a purely performance standpoint, it appears that this regression has been fixed.

gcc 6.2:                                    357.05 Mflops
clang 4.0 r290647:                          379.92 Mflops
clang 4.0 r290647 (-mllvm -enable-newgvn):  396.98 Mflops

Could someone who understands this issue analyze the binary again and verify that the fused-load issue in Random_nextDouble has actually been fixed?
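For anyone who wants to rerun the comparison, the two configurations above were presumably built along these lines (the -mllvm -enable-newgvn spelling is taken from the table above; the rest follows the original report's commands):

~/bug$ clang -O2 scimark2/*.c -o scimark -lm
~/bug$ ./scimark | grep MonteCarlo
~/bug$ clang -O2 -mllvm -enable-newgvn scimark2/*.c -o scimark-newgvn -lm
~/bug$ ./scimark-newgvn | grep MonteCarlo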

@RKSimon
Collaborator

RKSimon commented Jan 7, 2017

The vcvtsi2sdl/vmulsd hoisting seems to be fixed.

Current trunk output (with/without newgvn):

Random_nextDouble:
        vcvtsi2sdl %edx, %xmm1, %xmm1
        cmpl    $0, (%rsi)
        vmulsd  %xmm0, %xmm1, %xmm0
        je      .LBB0_2
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
.LBB0_2:
        retq

@MatzeB
Contributor

MatzeB commented Jan 9, 2017

Just for the record: this was most likely fixed by disabling load widening in the old GVN: https://reviews.llvm.org/D24096
See also the discussion here: http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.