SciMark2 MonteCarlo performance regressed to less than half in r246985.

To reproduce, download http://math.nist.gov/scimark2/scimark2_1c.zip and unpack it somewhere. Compile and run the benchmark with

>clang -o scimark2 -O2 *.c
>./scimark2

and look at the MonteCarlo results.

Comparing gcc 5.1, clang r246984, clang r246985, and r251299 (current svn): gcc 5.1 and clang r246984 MonteCarlo performance are similar, although gcc takes the lead. clang r246985 performance was cut to less than half and has stayed the same in current svn.

This may be a result of the code generated to access Random_struct. When the 17 in

  typedef struct
  {
    int m[17];
    int seed;
    int i;    /* originally = 4 */

is changed to 18:

  typedef struct
  {
    int m[18];
    int seed;
    int i;    /* originally = 4 */

the performance regression is gone.

>uname -a
Linux PC 3.16.0-51-generic #69~14.04.1-Ubuntu SMP Wed Oct 7 15:32:41 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

>~/bug$ gcc -v 2>&1 | grep version
gcc version 5.1.0 (Ubuntu 5.1.0-0ubuntu11~14.04.1)
>~/bug$ gcc -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 672.76

>~/bug$ llvm/bin/clang -v 2>&1 | grep version
clang version 3.8.0 (trunk 246984)
>~/bug$ llvm/bin/clang -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 631.23

>~/bug$ llvm/fast/bin/clang -v 2>&1 | grep version
clang version 3.8.0 (trunk 246985)
>~/bug$ llvm/fast/bin/clang -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 270.40

>~/bug$ llvm/fast/bin/clang -v 2>&1 | grep version
clang version 3.8.0 (trunk 251299)
>~/bug$ llvm/fast/bin/clang -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 266.42
Well, the positive side is that it should be straightforward to track this down by diff'ing the generated assembly. My guess is that we're merging a pair of accesses to adjacent int fields (i/j?) into a single 8-byte access, based on the completely proper assumption that the first access is 8-byte-aligned, yet somehow it isn't in practice.
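The alignment arithmetic behind that guess can be checked directly. A minimal sketch, with the two layouts copied from the report and offsets assuming the usual x86-64 ABI (the exact numbers are an assumption of that ABI, not stated in the report):

    #include <assert.h>
    #include <stddef.h>

    /* The two layouts from the report: m[17] (regressing) vs m[18] (fine). */
    typedef struct {
        int m[17];
        int seed;
        int i, j;
        int haveRange;
        double left, right, width;
    } R17;

    typedef struct {
        int m[18];
        int seed;
        int i, j;
        int haveRange;
        double left, right, width;
    } R18;

    int main(void) {
        /* Both structs contain doubles, so both are 8-byte aligned overall. */
        assert(_Alignof(R17) == 8 && _Alignof(R18) == 8);

        /* With m[17], i lands at offset 72, a multiple of 8: the i/j pair
           occupies one naturally aligned 8-byte slot, so the two 4-byte
           loads can legally be merged into one 8-byte load. */
        assert(offsetof(R17, i) == 72 && offsetof(R17, j) == 76);

        /* With m[18], i lands at offset 76, not a multiple of 8, so the
           pair no longer forms an aligned 8-byte slot and the merge is
           blocked; this matches the reported workaround. */
        assert(offsetof(R18, i) == 76 && offsetof(R18, j) == 80);
        return 0;
    }

So the 8-byte-alignment assumption does hold for the m[17] layout; the question is why the merged access is slower.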
Created attachment 15167 [details] Pre-patch -O2 -disable-llvm-optzns
Created attachment 15168 [details] Post-patch -O2 -disable-llvm-optzns
Created attachment 15169 [details] Pre-patch -O2
Created attachment 15170 [details] Post-patch -O2
Created attachment 15171 [details] Pre-patch -O2 asm output
Created attachment 15172 [details] Post-patch -O2 asm output
Attached pre/post IRGen output (-O2 -S -emit-llvm -Xclang -disable-llvm-optzns), .ll output (-O2 -S -emit-llvm), and asm output (-O2 -S).

This is all just looking at Random.c; the other files seem to treat this as a black box, so if changing the layout of Random_struct has a performance impact, it should be reflected in this file.

As expected, the IRGen output difference is purely that we're annotating a couple of accesses with 8-byte alignment instead of 4-byte. (Ignore all the accesses to %retval; those allocas will be SSA'ed into oblivion anyway.)

The optimizer output seems to differ in the following ways:
- The loads of i and j at the start of Random_nextDouble have been fused.
- The loads of i and j at the start of the loop body in RandomVector have been fused. This is apparently preventing i and j from being kept in registers across the loop boundary; phis tracking their current value have been eliminated and they are now reloaded at the start of the loop body.
- Approximately the same thing happened to RandomMatrix.

Those changes in RandomVector and RandomMatrix seem like obvious culprits here.
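To see why keeping i and j in registers matters, here is a simplified sketch of the access pattern in the generator's hot loop. This is illustrative only, not the actual SciMark2 source; the update logic and constants are stand-ins. Every iteration reads and writes both index fields through the struct pointer, so if the first i/j load is widened to one 8-byte access, the fields can no longer be promoted to registers across iterations and get reloaded from memory each time:

    #include <assert.h>

    /* Simplified stand-in for Random_struct. */
    typedef struct { int m[17]; int seed; int i; int j; } Rnd;

    /* Hypothetical lagged-subtractive step in the style of the benchmark:
       R->i and R->j are loaded, used, and stored back on every call. */
    static int step(Rnd *R) {
        int k = R->m[R->i] - R->m[R->j];
        if (k < 0) k += 1 << 30;
        R->m[R->j] = k;
        R->i = (R->i == 0) ? 16 : R->i - 1;
        R->j = (R->j == 0) ? 16 : R->j - 1;
        return k;
    }

    int main(void) {
        Rnd R = {{1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610,
                  987, 1597, 2584}, 0, 4, 16};
        for (int n = 0; n < 1000; ++n) {
            int k = step(&R);
            assert(k >= 0);
            assert(R.i >= 0 && R.i <= 16 && R.j >= 0 && R.j <= 16);
        }
        return 0;
    }

In a loop over step(), good codegen carries R->i and R->j in registers and only stores them once; the fused 8-byte load defeats that promotion.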
Ping?
Unfortunately, I can't actually fix this myself. Moving to LLVM.
This performance regression is present in both the trunk and 3.8 branch compilers.
The offending commit is at http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150907/137868.html
Opened radar://24676657 as this regression now exists in the Xcode 7.3 beta 3 seed.
This affects clang 3.8. Should the priority be bumped up?
(In reply to comment #14)
> This affects clang 3.8. Should the priority be bumped up?

Unfortunate as it is, I'm afraid we're too late in the 3.8 process to wait for this, especially since no fix has emerged.
John McCall, Are you working with the LLVM team to get this resolved?
I believe that there are people on the LLVM team who are aware of this regression, yes.
Slightly reduced version to show an interesting bit:

typedef struct
{
  int m[17];
  int seed;
  int i;                        /* originally = 4 */
  int j;                        /* originally = 16 */
  int /*boolean*/ haveRange;    /* = false; */
  double left;                  /* = 0.0; */
  double right;                 /* = 1.0; */
  double width;                 /* = 1.0; */
} Random_struct, *Random;

double Random_nextDouble(Random R, int *sel, double left, double right,
                         double width, int k, int I, int J)
{
    if (*sel)
        return R->left + left * (double) k * width;
    else
        return left * (double) k;
}

Random_nextDouble:
        cmpl    $0, (%rsi)
        je      .LBB0_2
        vcvtsi2sdl      %edx, %xmm0, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
        retq
.LBB0_2:
        vcvtsi2sdl      %edx, %xmm0, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        retq

But if we commute the R->left addition, we manage to pull out the common code:

double Random_nextDouble(Random R, int *sel, double left, double right,
                         double width, int k, int I, int J)
{
    if (*sel)
        return left * (double) k * width + R->left;
    else
        return left * (double) k;
}

Random_nextDouble:
        vcvtsi2sdl      %edx, %xmm0, %xmm1
        cmpl    $0, (%rsi)
        vmulsd  %xmm0, %xmm1, %xmm0
        je      .LBB0_2
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
.LBB0_2:
        retq
The hope is that NewGVN will solve this by performing the commutation in Random_nextDouble.
(In reply to comment #19)
> The hope is that NewGVN will solve this by performing the commutation in
> Random_nextDouble

Any chance that someone can mock up a patch on Phabricator to enable the proposed NewGVN by default in the clang compiler, to make testing easier for end users who want to experiment with it? I noticed that the current clang compiler doesn't even seem to enable the old GVN pass by default, as far as I can tell.
Not sure if the initial NewGVN patch will fix this, but I can add a patch to enable NewGVN if you can give it a shot on phab.
(In reply to comment #21)
> Not sure if the initial NewGVN patch will fix this, but I can add a patch to
> enable NewGVN if you can give it a shot on phab.

Thanks. I wanted to test this since comment 19 suggested that the NewGVN pass might help with this performance regression.
That's my fault. Comment 19 should've read: "The hope is that NewGVN will eventually solve this by performing the commutation in Random_nextDouble."
(In reply to comment #20)
> (In reply to comment #19)
> ...
> I noticed that the current clang compiler doesn't even seem to default on
> the old gvn pass as far as I can tell.

We certainly use GVN in the default pipeline at -O2 and -O3, and have for a long time.
From purely a performance standpoint, it appears that this regression has been fixed:

gcc 6.2:                                    357.05 Mflops
clang 4.0 290647:                           379.92 Mflops
clang 4.0 290647 (-mllvm -enable-newgvn):   396.98 Mflops

Could someone who understands this issue analyze the binary again to verify that the issue with the loads in Random_nextDouble being fused has actually been fixed?
The vcvtsi2sdl/vmulsd hoisting seems to be fixed. Current trunk output (with/without newgvn):

Random_nextDouble:
        vcvtsi2sdl      %edx, %xmm1, %xmm1
        cmpl    $0, (%rsi)
        vmulsd  %xmm0, %xmm1, %xmm0
        je      .LBB0_2
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
.LBB0_2:
        retq
Just for the record: this was most likely fixed by disabling load widening in the old GVN: https://reviews.llvm.org/D24096

See also the discussion here: http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html
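For readers unfamiliar with the transformation that was disabled: load widening replaced two adjacent 4-byte loads with a single 8-byte load. A minimal source-level sketch of the equivalent rewrite (the struct is a stand-in, and the low/high split assumes a little-endian target like the x86-64 in the transcripts above):

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* Stand-in struct: i sits at offset 72, j at offset 76. */
    typedef struct { int m[17]; int seed; int i; int j; } Rnd;

    int main(void) {
        Rnd R = {{0}, 0, 4, 16};

        /* Narrow form: two independent 4-byte loads. */
        int i = R.i;
        int j = R.j;

        /* Widened form: one 8-byte load covering both fields, then a
           split.  This is only legal when &R.i is 8-byte aligned, which
           is what the m[17] layout guarantees. */
        uint64_t pair;
        memcpy(&pair, &R.i, sizeof pair);
        int wi = (int)(pair & 0xffffffffu);   /* low half  -> i (LE) */
        int wj = (int)(pair >> 32);           /* high half -> j (LE) */

        assert(wi == i && wj == j);
        return 0;
    }

The widened form is semantically equivalent, but as comment 9 observed, it ties the two fields together and blocks promoting each to its own register across a loop, which is why disabling the widening restored the MonteCarlo score.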