LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 25321 - SciMark performance regression since r246985
Summary: SciMark performance regression since r246985
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Loop Optimizer
Version: 3.8
Hardware: PC Windows NT
Importance: P normal
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-10-26 11:29 PDT by Yaron Keren
Modified: 2017-01-09 14:44 PST
CC List: 17 users

See Also:
Fixed By Commit(s):


Attachments
Pre-patch -O2 -disable-llvm-optzns (26.20 KB, application/octet-stream), 2015-10-27 02:02 PDT, John McCall
Post-patch -O2 -disable-llvm-optzns (26.31 KB, application/octet-stream), 2015-10-27 02:03 PDT, John McCall
Pre-patch -O2 (18.96 KB, application/octet-stream), 2015-10-27 02:03 PDT, John McCall
Post-patch -O2 (18.89 KB, application/octet-stream), 2015-10-27 02:04 PDT, John McCall
Pre-patch -O2 asm output (11.39 KB, application/octet-stream), 2015-10-27 02:04 PDT, John McCall
Post-patch -O2 asm output (11.44 KB, application/octet-stream), 2015-10-27 02:05 PDT, John McCall

Description Yaron Keren 2015-10-26 11:29:04 PDT
SciMark2 MonteCarlo performance regressed to less than half of its previous level starting with r246985.

To reproduce, download http://math.nist.gov/scimark2/scimark2_1c.zip and unpack it somewhere. Compile and run the benchmark with
>clang -o scimark2 -O2 *.c
>./scimark2
and look at the MonteCarlo results.
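(Depending on the platform you may also need to link the math library, i.e. add -lm, as in the transcripts below.)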

Comparing gcc 5.1, clang r246984, clang r246985 and r251299 (current svn):
gcc 5.1 and clang r246984 MonteCarlo performance are similar, with gcc slightly in the lead.
clang r246985 performance was cut to less than half and has stayed there through current svn. This may be a result of the code generated to access Random_struct: when the array size 17 is changed to 18, replacing

typedef struct
{
  int m[17];                        
  int seed;                             
  int i;                                /* originally = 4 */ 

with

typedef struct
{
  int m[18];                        
  int seed;                             
  int i;                                /* originally = 4 */ 

the performance regression is gone.
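
A quick way to see why the array size matters is to print the field offsets. With m[17] the i/j pair starts at offset 72, which is 8-byte aligned, so the two adjacent 4-byte loads may legally be merged into a single aligned 8-byte load; with m[18] the pair starts at 76 and no aligned merge is possible. A minimal sketch (field names taken from Random.c; offsets assume the usual x86-64 layout):

#include <stdio.h>
#include <stddef.h>

/* The two variants of Random_struct, differing only in the array size.
   The trailing doubles give the struct 8-byte alignment, as in the
   original. */
typedef struct {
    int m[17]; int seed; int i; int j; int haveRange;
    double left, right, width;
} R17;
typedef struct {
    int m[18]; int seed; int i; int j; int haveRange;
    double left, right, width;
} R18;

int main(void) {
    printf("m[17]: i at %zu, j at %zu\n",
           offsetof(R17, i), offsetof(R17, j));   /* 72, 76 */
    printf("m[18]: i at %zu, j at %zu\n",
           offsetof(R18, i), offsetof(R18, j));   /* 76, 80 */
    return 0;
}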

>uname -a
Linux PC 3.16.0-51-generic #69~14.04.1-Ubuntu SMP Wed Oct 7 15:32:41 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

>~/bug$ gcc -v  2>&1 | grep version
gcc version 5.1.0 (Ubuntu 5.1.0-0ubuntu11~14.04.1)
>~/bug$ gcc -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo:     Mflops:   672.76

>~/bug$ llvm/bin/clang -v  2>&1 | grep version
clang version 3.8.0 (trunk 246984)
>~/bug$ llvm/bin/clang -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo:     Mflops:   631.23

>~/bug$ llvm/fast/bin/clang -v  2>&1 | grep version
clang version 3.8.0 (trunk 246985)
>~/bug$ llvm/fast/bin/clang -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo:     Mflops:   270.40

>~/bug$ llvm/fast/bin/clang -v  2>&1 | grep version
clang version 3.8.0 (trunk 251299)
>~/bug$ llvm/fast/bin/clang -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo:     Mflops:   266.42
Comment 1 John McCall 2015-10-26 23:40:55 PDT
Well, the positive side is that it should be straightforward to track this down by diff'ing the generated assembly.  My guess is that we're merging a pair of accesses to adjacent int fields (i/j?) into a single 8-byte access, based on a completely proper assumption that the first is 8-byte-aligned, but somehow they aren't in practice.
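To make that guess concrete, the widened access would be the moral equivalent of the following (a hand-written illustration of the IR-level transform, not actual compiler output; the pointer cast would violate strict aliasing in portable C and is for illustration only):

typedef struct { int m[17]; int seed; int i; int j; double left; } S;

/* As written in the source: two separate 4-byte loads. */
static int pair_separate(const S *R) { return R->i + R->j; }

/* After widening: one 8-byte load covering both fields, split in
   registers (little-endian: low half is i, high half is j). Legal
   only because field i sits at an 8-byte-aligned offset. */
static int pair_widened(const S *R) {
    unsigned long long pair = *(const unsigned long long *)&R->i;
    return (int)pair + (int)(pair >> 32);
}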
Comment 2 John McCall 2015-10-27 02:02:34 PDT
Created attachment 15167 [details]
Pre-patch -O2 -disable-llvm-optzns
Comment 3 John McCall 2015-10-27 02:03:15 PDT
Created attachment 15168 [details]
Post-patch -O2 -disable-llvm-optzns
Comment 4 John McCall 2015-10-27 02:03:52 PDT
Created attachment 15169 [details]
Pre-patch -O2
Comment 5 John McCall 2015-10-27 02:04:15 PDT
Created attachment 15170 [details]
Post-patch -O2
Comment 6 John McCall 2015-10-27 02:04:43 PDT
Created attachment 15171 [details]
Pre-patch -O2 asm output
Comment 7 John McCall 2015-10-27 02:05:19 PDT
Created attachment 15172 [details]
Post-patch -O2 asm output
Comment 8 John McCall 2015-10-27 02:50:39 PDT
Attached pre/post IRGen output (-O2 -S -emit-llvm -Xclang -disable-llvm-optzns), .ll output (-O2 -S -emit-llvm), and asm output (-O2 -S).  This is all just looking at Random.c; the other files seem to treat this as a black box, so if changing the layout of Random_struct has a performance impact, it should be reflected in this file.

As expected, the IRGen output difference is purely that we're annotating a couple of accesses with 8-byte alignment instead of 4-byte.  (Ignore all the accesses to %retval; those allocas will be SSA'ed into oblivion anyway.)

The optimizer output seems to differ in the following ways:
  - The loads of i and j at the start of Random_nextDouble have been fused.
  - The loads of i and j at the start of the loop body in RandomVector have been fused.  This is apparently preventing i and j from being kept in registers across the loop boundary; phis tracking their current value have been eliminated and they are now reloaded at the start of the loop body.
  - Approximately the same thing happened to RandomMatrix.

Those changes in RandomVector and RandomMatrix seem like obvious culprits here.
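For reference, the loop in question has roughly this shape (reconstructed from the pattern in SciMark2's Random.c, simplified; the constants are stand-ins, not the real generator's values). A plausible mechanism is that the two separate 4-byte stores at the bottom of the iteration cannot be forwarded into a single widened 8-byte load at the top, so the phis die and the pair is re-read from memory on every iteration:

typedef struct {
    int m[17]; int seed; int i; int j; int haveRange;
    double left, right, width;
} Random_struct, *Random;

void RandomVector_like(int N, double *x, Random R) {
    for (int n = 0; n < N; n++) {
        int I = R->i;                   /* adjacent 4-byte loads: the */
        int J = R->j;                   /* candidates for widening    */
        int k = R->m[I] - R->m[J];
        if (k < 0) k += 1 << 30;        /* stand-in modulus */
        R->m[J] = k;
        R->i = (I == 0) ? 16 : I - 1;   /* separate 4-byte stores that */
        R->j = (J == 0) ? 16 : J - 1;   /* can't feed a widened load   */
        x[n] = k * (1.0 / (1 << 30));   /* stand-in scaling */
    }
}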
Comment 9 Yaron Keren 2015-11-19 12:52:08 PST
Ping?
Comment 10 John McCall 2015-11-19 13:31:15 PST
Unfortunately, I can't actually fix this myself.  Moving to LLVM.
Comment 11 Jack Howarth 2016-02-15 15:58:35 PST
This performance regression is present in both the trunk and 3.8 branch compilers.
Comment 12 Jack Howarth 2016-02-16 10:39:14 PST
The offending commit is at http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150907/137868.html
Comment 13 Jack Howarth 2016-02-16 11:04:54 PST
Opened rdar://24676657 as this regression now exists in the Xcode 7.3 beta 3 seed.
Comment 14 Luke 2016-02-18 21:54:01 PST
This affects clang 3.8. Should the priority be bumped up?
Comment 15 Hans Wennborg 2016-02-19 19:06:32 PST
(In reply to comment #14)
> This affects clang 3.8. Should the priority be bumped up?

Unfortunate as it is, I'm afraid we're too late in the 3.8 process to wait for this, especially since no fix has emerged.
Comment 16 Luke 2016-04-06 16:01:35 PDT
John McCall,
Are you working with the LLVM team to get this resolved?
Comment 17 John McCall 2016-04-06 16:11:00 PDT
I believe that there are people on the LLVM team who are aware of this regression, yes.
Comment 18 Simon Pilgrim 2016-09-30 06:25:01 PDT
Slightly reduced version to show an interesting bit:

typedef struct
{
  int m[17];                        
  int seed;                             
  int i;                                /* originally = 4 */
  int j;                                /* originally =  16 */
  int /*boolean*/ haveRange;            /* = false; */
  double left;                          /*= 0.0; */
  double right;                         /* = 1.0; */
  double width;                         /* = 1.0; */
}
Random_struct, *Random;

double Random_nextDouble(Random R, int *sel, double left, double right, double width, int k, int I, int J) 
{
    if (*sel) 
        return R->left + left * (double) k * width;
    else
        return left * (double) k;
}

Random_nextDouble:
        cmpl    $0, (%rsi)
        je      .LBB0_2
        vcvtsi2sdl      %edx, %xmm0, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
        retq
.LBB0_2:
        vcvtsi2sdl      %edx, %xmm0, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        retq

But if we commute the R->left addition, we manage to pull out the common code:

double Random_nextDouble(Random R, int *sel, double left, double right, double width, int k, int I, int J) 
{
    if (*sel) 
        return left * (double) k * width + R->left;
    else
        return left * (double) k;
}

Random_nextDouble:
        vcvtsi2sdl      %edx, %xmm0, %xmm1
        cmpl    $0, (%rsi)
        vmulsd  %xmm0, %xmm1, %xmm0
        je      .LBB0_2
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
.LBB0_2:
        retq
Comment 19 Simon Pilgrim 2016-11-13 05:35:49 PST
The hope is that NewGVN will solve this by performing the commutation in Random_nextDouble
Comment 20 Jack Howarth 2016-11-19 09:01:11 PST
(In reply to comment #19)
> The hope is that NewGVN will solve this by performing the commutation in
> Random_nextDouble

Any chance that someone can mock up a patch on Phabricator to enable the proposed NewGVN by default in clang, to make testing easier for end users who want to experiment with it? I noticed that current clang doesn't even seem to enable the old GVN pass by default, as far as I can tell.
Comment 21 Davide Italiano 2016-11-19 14:17:42 PST
Not sure if the initial NewGVN patch will fix this, but I can add a patch to enable NewGVN if you can give it a shot on phab.
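(For testing: once the initial patch is in a trunk build, the pass can be toggled straight from the clang driver, with no separate clang patch needed; this is the same flag used in comment 25 below:)

>clang -O2 -mllvm -enable-newgvn scimark2/*.c -o scimark -lm
>./scimark | grep MonteCarlo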
Comment 22 Jack Howarth 2016-11-19 14:19:22 PST
(In reply to comment #21)
> Not sure if the initial NewGVN patch will fix this, but I can add a patch to
> enable NewGVN if you can give it a shot on phab.

Thanks. I wanted to test this since Comment 19 suggested that the newgvn compiler pass might help with this performance regression.
Comment 23 Simon Pilgrim 2016-11-19 16:49:51 PST
That's my fault. Comment 19 should've read: "The hope is that NewGVN will eventually solve this by performing the commutation in Random_nextDouble"
Comment 24 Hal Finkel 2016-11-19 20:15:41 PST
(In reply to comment #20)
> (In reply to comment #19)
...
> I noticed that current clang doesn't even seem to enable the old GVN pass
> by default, as far as I can tell.

We certainly use GVN in the default pipeline at -O2 and -O3, and have for a long time.
Comment 25 Luke 2016-12-28 00:56:40 PST
From a purely performance standpoint, it appears that this regression has been fixed.

gcc 6.2:                                    357.05 Mflops
clang 4.0 r290647:                          379.92 Mflops
clang 4.0 r290647 (-mllvm -enable-newgvn):  396.98 Mflops

Could someone who understands this issue analyze the binary again to verify that the fused loads in Random_nextDouble have actually been fixed?
Comment 26 Simon Pilgrim 2017-01-07 10:19:26 PST
The vcvtsi2sdl/vmulsd hoisting seems to be fixed.

Current trunk output (with/without newgvn):

Random_nextDouble:
        vcvtsi2sdl      %edx, %xmm1, %xmm1
        cmpl    $0, (%rsi)
        vmulsd  %xmm0, %xmm1, %xmm0
        je      .LBB0_2
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
.LBB0_2:
        retq
Comment 27 Matthias Braun 2017-01-09 10:57:53 PST
Just for the record: this was most likely fixed by disabling load widening in the old GVN (https://reviews.llvm.org/D24096).
See also the discussion here: http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html