SciMark2 MonteCarlo performance regressed to less than half in r246985.

To reproduce, download http://math.nist.gov/scimark2/scimark2_1c.zip and unpack it somewhere. Compile and run the benchmark with

>clang -o scimark2 -O2 *.c
>./scimark2

and look at the MonteCarlo results.

Comparing gcc 5.1, clang r246984, clang r246985, and r251299 (current svn): gcc 5.1 and clang r246984 MonteCarlo performance are similar, although gcc takes the lead. clang r246985 performance was cut to less than half and has stayed the same in current svn.

This may be a result of the code generated to access Random_struct. When the 17 in

  typedef struct
  {
    int m[17];
    int seed;
    int i;    /* originally = 4 */

is changed to 18:

  typedef struct
  {
    int m[18];
    int seed;
    int i;    /* originally = 4 */

the performance regression is gone.

>uname -a
Linux PC 3.16.0-51-generic #69~14.04.1-Ubuntu SMP Wed Oct 7 15:32:41 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

>~/bug$ gcc -v 2>&1 | grep version
gcc version 5.1.0 (Ubuntu 5.1.0-0ubuntu11~14.04.1)
>~/bug$ gcc -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 672.76

>~/bug$ llvm/bin/clang -v 2>&1 | grep version
clang version 3.8.0 (trunk 246984)
>~/bug$ llvm/bin/clang -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 631.23

>~/bug$ llvm/fast/bin/clang -v 2>&1 | grep version
clang version 3.8.0 (trunk 246985)
>~/bug$ llvm/fast/bin/clang -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 270.40

>~/bug$ llvm/fast/bin/clang -v 2>&1 | grep version
clang version 3.8.0 (trunk 251299)
>~/bug$ llvm/fast/bin/clang -O2 scimark2/*.c -o scimark -lm
>~/bug$ ./scimark | grep MonteCarlo
MonteCarlo: Mflops: 266.42
Well, the positive side is that it should be straightforward to track this down by diff'ing the generated assembly. My guess is that we're merging a pair of accesses to adjacent int fields (i/j?) into a single 8-byte access, based on the completely proper assumption that the first access is 8-byte-aligned, yet somehow it isn't in practice.
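The alignment arithmetic behind that guess can be checked directly. A minimal sketch, with the two layouts copied from the report and offsets assuming the usual x86-64 ABI (the exact numbers are an assumption of that ABI, not stated in the report):

    #include <assert.h>
    #include <stddef.h>

    /* The two layouts from the report: m[17] (regressing) vs m[18] (fine). */
    typedef struct {
        int m[17];
        int seed;
        int i, j;
        int haveRange;
        double left, right, width;
    } R17;

    typedef struct {
        int m[18];
        int seed;
        int i, j;
        int haveRange;
        double left, right, width;
    } R18;

    int main(void) {
        /* Both structs contain doubles, so both are 8-byte aligned overall. */
        assert(_Alignof(R17) == 8 && _Alignof(R18) == 8);

        /* With m[17], i lands at offset 72, a multiple of 8: the i/j pair
           occupies one naturally aligned 8-byte slot, so the two 4-byte
           loads can legally be merged into one 8-byte load. */
        assert(offsetof(R17, i) == 72 && offsetof(R17, j) == 76);

        /* With m[18], i lands at offset 76, not a multiple of 8, so the
           pair no longer forms an aligned 8-byte slot and the merge is
           blocked; this matches the reported workaround. */
        assert(offsetof(R18, i) == 76 && offsetof(R18, j) == 80);
        return 0;
    }

So the 8-byte-alignment assumption does hold for the m[17] layout; the question is why the merged access is slower.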
Created attachment 15167 [details] Pre-patch -O2 -disable-llvm-optzns
Created attachment 15168 [details] Post-patch -O2 -disable-llvm-optzns
Created attachment 15169 [details] Pre-patch -O2
Created attachment 15170 [details] Post-patch -O2
Created attachment 15171 [details] Pre-patch -O2 asm output
Created attachment 15172 [details] Post-patch -O2 asm output
Attached pre/post IRGen output (-O2 -S -emit-llvm -Xclang -disable-llvm-optzns), .ll output (-O2 -S -emit-llvm), and asm output (-O2 -S).

This is all just looking at Random.c; the other files seem to treat this as a black box, so if changing the layout of Random_struct has a performance impact, it should be reflected in this file.

As expected, the IRGen output difference is purely that we're annotating a couple of accesses with 8-byte alignment instead of 4-byte. (Ignore all the accesses to %retval; those allocas will be SSA'ed into oblivion anyway.)

The optimizer output seems to differ in the following ways:
- The loads of i and j at the start of Random_nextDouble have been fused.
- The loads of i and j at the start of the loop body in RandomVector have been fused. This is apparently preventing i and j from being kept in registers across the loop boundary; phis tracking their current value have been eliminated and they are now reloaded at the start of the loop body.
- Approximately the same thing happened to RandomMatrix.

Those changes in RandomVector and RandomMatrix seem like obvious culprits here.
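To see why keeping i and j in registers matters, here is a simplified sketch of the access pattern in the generator's hot loop. This is illustrative only, not the actual SciMark2 source; the update logic and constants are stand-ins. Every iteration reads and writes both index fields through the struct pointer, so if the first i/j load is widened to one 8-byte access, the fields can no longer be promoted to registers across iterations and get reloaded from memory each time:

    #include <assert.h>

    /* Simplified stand-in for Random_struct. */
    typedef struct { int m[17]; int seed; int i; int j; } Rnd;

    /* Hypothetical lagged-subtractive step in the style of the benchmark:
       R->i and R->j are loaded, used, and stored back on every call. */
    static int step(Rnd *R) {
        int k = R->m[R->i] - R->m[R->j];
        if (k < 0) k += 1 << 30;
        R->m[R->j] = k;
        R->i = (R->i == 0) ? 16 : R->i - 1;
        R->j = (R->j == 0) ? 16 : R->j - 1;
        return k;
    }

    int main(void) {
        Rnd R = {{1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610,
                  987, 1597, 2584}, 0, 4, 16};
        for (int n = 0; n < 1000; ++n) {
            int k = step(&R);
            assert(k >= 0);
            assert(R.i >= 0 && R.i <= 16 && R.j >= 0 && R.j <= 16);
        }
        return 0;
    }

In a loop over step(), good codegen carries R->i and R->j in registers and only stores them once; the fused 8-byte load defeats that promotion.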
Ping?
Unfortunately, I can't actually fix this myself. Moving to LLVM.
This performance regression is present in both the trunk and 3.8 branch compilers.
The offending commit is at http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150907/137868.html
Opened radar://24676657 as this regression now exists in the Xcode 7.3 beta 3 seed.
This affects clang 3.8. Should the priority be bumped up?
(In reply to comment #14)
> This affects clang 3.8. Should the priority be bumped up?

Unfortunate as it is, I'm afraid we're too late in the 3.8 process to wait for this, especially since no fix has emerged.
John McCall, Are you working with the LLVM team to get this resolved?
I believe that there are people on the LLVM team who are aware of this regression, yes.
Slightly reduced version to show an interesting bit:

typedef struct
{
  int m[17];
  int seed;
  int i;                        /* originally = 4 */
  int j;                        /* originally = 16 */
  int /*boolean*/ haveRange;    /* = false; */
  double left;                  /* = 0.0; */
  double right;                 /* = 1.0; */
  double width;                 /* = 1.0; */
} Random_struct, *Random;

double Random_nextDouble(Random R, int *sel, double left, double right,
                         double width, int k, int I, int J)
{
    if (*sel)
        return R->left + left * (double) k * width;
    else
        return left * (double) k;
}

Random_nextDouble:
        cmpl    $0, (%rsi)
        je      .LBB0_2
        vcvtsi2sdl      %edx, %xmm0, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
        retq
.LBB0_2:
        vcvtsi2sdl      %edx, %xmm0, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        retq

But if we commute the R->left addition, we manage to pull out the common code:

double Random_nextDouble(Random R, int *sel, double left, double right,
                         double width, int k, int I, int J)
{
    if (*sel)
        return left * (double) k * width + R->left;
    else
        return left * (double) k;
}

Random_nextDouble:
        vcvtsi2sdl      %edx, %xmm0, %xmm1
        cmpl    $0, (%rsi)
        vmulsd  %xmm0, %xmm1, %xmm0
        je      .LBB0_2
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
.LBB0_2:
        retq
The hope is that NewGVN will solve this by performing the commutation in Random_nextDouble.
(In reply to comment #19)
> The hope is that NewGVN will solve this by performing the commutation in
> Random_nextDouble

Any chance that someone can mock up a patch on Phabricator to enable the proposed NewGVN by default in the clang compiler, to make testing easier for end users who want to experiment with it? I noticed that the current clang compiler doesn't even seem to enable the old GVN pass by default, as far as I can tell.
Not sure if the initial NewGVN patch will fix this, but I can add a patch to enable NewGVN if you can give it a shot on phab.
(In reply to comment #21)
> Not sure if the initial NewGVN patch will fix this, but I can add a patch to
> enable NewGVN if you can give it a shot on phab.

Thanks. I wanted to test this since comment 19 suggested that the NewGVN pass might help with this performance regression.
That's my fault. Comment 19 should've read: "The hope is that NewGVN will eventually solve this by performing the commutation in Random_nextDouble."
(In reply to comment #20)
> (In reply to comment #19)
> ...
> I noticed that the current clang compiler doesn't even seem to default on
> the old gvn pass as far as I can tell.

We certainly use GVN in the default pipeline at -O2 and -O3, and have for a long time.
From purely a performance standpoint, it appears that this regression has been fixed:

gcc 6.2:                                    357.05 Mflops
clang 4.0 290647:                           379.92 Mflops
clang 4.0 290647 (-mllvm -enable-newgvn):   396.98 Mflops

Could someone who understands this issue analyze the binary again to verify that the issue with the loads in Random_nextDouble being fused has actually been fixed?
The vcvtsi2sdl/vmulsd hoisting seems to be fixed. Current trunk output (with/without newgvn):

Random_nextDouble:
        vcvtsi2sdl      %edx, %xmm1, %xmm1
        cmpl    $0, (%rsi)
        vmulsd  %xmm0, %xmm1, %xmm0
        je      .LBB0_2
        vmulsd  %xmm2, %xmm0, %xmm0
        vaddsd  88(%rdi), %xmm0, %xmm0
.LBB0_2:
        retq
Just for the record: this was most likely fixed by disabling load widening in the old GVN: https://reviews.llvm.org/D24096

See also the discussion here: http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html
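For readers unfamiliar with the transformation that was disabled: load widening replaced two adjacent 4-byte loads with a single 8-byte load. A minimal source-level sketch of the equivalent rewrite (the struct is a stand-in, and the low/high split assumes a little-endian target like the x86-64 in the transcripts above):

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* Stand-in struct: i sits at offset 72, j at offset 76. */
    typedef struct { int m[17]; int seed; int i; int j; } Rnd;

    int main(void) {
        Rnd R = {{0}, 0, 4, 16};

        /* Narrow form: two independent 4-byte loads. */
        int i = R.i;
        int j = R.j;

        /* Widened form: one 8-byte load covering both fields, then a
           split.  This is only legal when &R.i is 8-byte aligned, which
           is what the m[17] layout guarantees. */
        uint64_t pair;
        memcpy(&pair, &R.i, sizeof pair);
        int wi = (int)(pair & 0xffffffffu);   /* low half  -> i (LE) */
        int wj = (int)(pair >> 32);           /* high half -> j (LE) */

        assert(wi == i && wj == j);
        return 0;
    }

The widened form is semantically equivalent, but as comment 9 observed, it ties the two fields together and blocks promoting each to its own register across a loop, which is why disabling the widening restored the MonteCarlo score.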