New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SciMark performance regression since r246985 #25695
Comments
Well, the positive side is that it should be straightforward to track this down by diff'ing the generated assembly. My guess is that we're merging a pair of accesses to adjacent int fields (i/j?) into a single 8-byte access, based on a completely proper assumption that the first is 8-byte-aligned, but somehow they aren't in practice. |
Attached pre/post IRGen output (-O2 -S -emit-llvm -Xclang -disable-llvm-optzns), .ll output (-O2 -S -emit-llvm), and asm output (-O2 -S). This is all just looking at Random.c; the other files seem to treat this as a black box, so if changing the layout of Random_struct has a performance impact, it should be reflected in this file. As expected, the IRGen output difference is purely that we're annotating a couple of accesses with 8-byte alignment instead of 4-byte. (Ignore all the accesses to %retval; those allocas will be SSA'ed into oblivion anyway.) The optimizer output seems to differ in the following ways:
Those changes in RandomVector and RandomMatrix seem like obvious culprits here. |
Ping? |
Unfortunately, I can't actually fix this myself. Moving to LLVM. |
This performance regression is present in both the trunk and 3.8 branch compilers. |
The offending commit is at http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150907/137868.html |
Opened radar:://24676657 as this regression now exists in the Xcode 7.3 beta 3 seed. |
This affects clang 3.8. Should the priority be bumped up? |
Unfortunate as it is, I'm afraid we're too late in the 3.8 process to wait for this, especially since no fix has emerged. |
John McCall, |
I believe that there are people on the LLVM team who are aware of this regression, yes. |
Slightly reduced version to show an interesting bit: typedef struct double Random_nextDouble(Random R, int *sel, double left, double right, double width, int k, int I, int J) Random_nextDouble: But if we commute the R->left addition, we manage to pull out the common code: double Random_nextDouble(Random R, int *sel, double left, double right, double width, int k, int I, int J) Random_nextDouble: |
The hope is that NewGVN will solve this by performing the commutation in Random_nextDouble |
Any chance that someone can mockup a patch on Phabricator to enable the proposed newgvn by default for the clang compiler to make testing easier on end-users who want to experiment with it? I noticed that the current clang compiler doesn't even seem to default on the old gvn pass as far as I can tell. |
Not sure if the initial NewGVN patch will fix this, but I can add a patch to enable NewGVN if you can give it a shot on phab. |
Thanks. I wanted to test this since Comment 19 suggested that the newgvn compiler pass might help with this performance regression. |
That's my fault.. Comment 19 should've read: "The hope is that NewGVN will eventually solve this by performing the commutation in Random_nextDouble" |
...
We certainly use GVN in the default pipeline at -O2 and -O3, and have for a long time. |
From purely a performance standpoint, it appears as this regression has been fixed. gcc 6.2: 357.05 Mflops Could someone that understands this issue, analyze the binary again to verify that the issue with Random_nextDouble being fused has actually been fixed? |
The vcvtsi2sdl/vmulsd hoisting seems to be fixed. Current trunk output (with/without newgvn): Random_nextDouble: |
Just for the record: This was most likely fixed by disabling load widening in the old GVN https://reviews.llvm.org/D24096 |
Extended Description
SciMark2 MonteCarlo performance had regressed to less than half in r246985.
To reproduce, download http://math.nist.gov/scimark2/scimark2_1c.zip and open it somewhere. Compile and run the benchmark with
Comparing gcc 5.1, clang r246984, clang r246985 and r251299 (current svn),
gcc 5.1 and clang r246984 MonteCarlo performance are similar, although gcc takes the lead.
clang r246985 performance was cut to less than half and stayed the same in current svn. This may be a result of the code generated to access Random_struct, when 17 is changed into 18
typedef struct
{
int m[17];
int seed;
int i; /* originally = 4 */
with
typedef struct
{
int m[18];
int seed;
int i; /* originally = 4 */
the performance regression is gone.
The text was updated successfully, but these errors were encountered: