Suboptimal materialization of += 1 as -= -1 #51374

Closed

davidbolvansky opened this issue Oct 1, 2021 · 2 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@davidbolvansky
Collaborator

Bugzilla Link 52032
Resolution FIXED
Resolved on Oct 29, 2021 13:20
Version trunk
OS Windows NT
CC @weiguozhi,@topperc,@jayfoad,@RKSimon,@phoebewang,@rotateright
Fixed by commit(s) 285b8ab

Extended Description

struct b
{
    int a[100000];
};

void
plus1(struct b *a)
{
    int i;
    for (i = 0; i < 64; i++)
        a->a[i] += 1;
}

Trunk -O3 -march=haswell:

plus1(b*): # @​plus1(b*)
vpcmpeqd %ymm0, %ymm0, %ymm0
vmovdqu (%rdi), %ymm1
vmovdqu 32(%rdi), %ymm2
vmovdqu 64(%rdi), %ymm3
vmovdqu 96(%rdi), %ymm4
vpsubd %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, (%rdi)
vpsubd %ymm0, %ymm2, %ymm1
vmovdqu %ymm1, 32(%rdi)
vpsubd %ymm0, %ymm3, %ymm1
vmovdqu %ymm1, 64(%rdi)
vpsubd %ymm0, %ymm4, %ymm1
vmovdqu %ymm1, 96(%rdi)
vmovdqu 128(%rdi), %ymm1
vpsubd %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 128(%rdi)
vmovdqu 160(%rdi), %ymm1
vpsubd %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 160(%rdi)
vmovdqu 192(%rdi), %ymm1
vpsubd %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 192(%rdi)
vmovdqu 224(%rdi), %ymm1
vpsubd %ymm0, %ymm1, %ymm0
vmovdqu %ymm0, 224(%rdi)
vzeroupper
retq

ICC/GCC produces:

plus1(b*):
vmovdqu .L_2il0floatpacket.0(%rip), %ymm7 #​12.14
vpaddd (%rdi), %ymm7, %ymm0 #​12.5
vpaddd 32(%rdi), %ymm7, %ymm1 #​12.5
vpaddd 64(%rdi), %ymm7, %ymm2 #​12.5
vpaddd 96(%rdi), %ymm7, %ymm3 #​12.5
vpaddd 128(%rdi), %ymm7, %ymm4 #​12.5
vpaddd 160(%rdi), %ymm7, %ymm5 #​12.5
vpaddd 192(%rdi), %ymm7, %ymm6 #​12.5
vpaddd 224(%rdi), %ymm7, %ymm8 #​12.5
vmovdqu %ymm0, (%rdi) #​12.5
vmovdqu %ymm1, 32(%rdi) #​12.5
vmovdqu %ymm2, 64(%rdi) #​12.5
vmovdqu %ymm3, 96(%rdi) #​12.5
vmovdqu %ymm4, 128(%rdi) #​12.5
vmovdqu %ymm5, 160(%rdi) #​12.5
vmovdqu %ymm6, 192(%rdi) #​12.5
vmovdqu %ymm8, 224(%rdi) #​12.5
vzeroupper #​13.1
ret #​13.1

LLVM could instead just broadcast a splat of 1 (and then add with the loads folded):
.LCPI0_0:
.long 1 # 0x1
plus1(b*): # @​plus1(b*)
vbroadcastss .LCPI0_0(%rip), %ymm0 # ymm0 = [1,1,1,1,1,1,1,1]

https://godbolt.org/z/vx4nGcor9
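To make the difference concrete, here is a rough intrinsics sketch of the two strategies (my own illustration, not part of the original report; plus1_sub and plus1_add are hypothetical names):

#include <immintrin.h>

/* The strategy trunk uses for the loop above: materialize all-ones with a
   compare, then compute x - (-1). vpsubd is not commutative, so the array
   load cannot be folded into the subtract's memory operand and each element
   needs a separate vmovdqu. */
__m256i plus1_sub(__m256i x) {
    __m256i all_ones = _mm256_cmpeq_epi32(x, x);  /* ymm = -1,-1,...,-1 */
    return _mm256_sub_epi32(x, all_ones);         /* x - (-1) == x + 1  */
}

/* The strategy ICC/GCC use: broadcast the constant 1 once, then add.
   vpaddd is commutative, so the array load can be folded into its memory
   operand. */
__m256i plus1_add(__m256i x) {
    __m256i ones = _mm256_set1_epi32(1);          /* splat of 1 */
    return _mm256_add_epi32(x, ones);             /* x + 1 */
}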

@rotateright
Contributor

Proposal:
https://reviews.llvm.org/D112464

@rotateright
Contributor

We should now use a broadcast and fold the loads after:
https://reviews.llvm.org/rG285b8abce483
...but as I mentioned in the commit message, I'm not sure this results in a perf win even if it looks better.

Using pcmpeq to create an all-ones vector still seems like the right trade-off for most cases.
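Roughly the trade-off being weighed (my illustration, not from the thread): the all-ones idiom needs no constant-pool entry, while the broadcast of 1 does:

vpcmpeqd %ymm0, %ymm0, %ymm0           # all-ones via a single ALU instruction, no memory access
vpbroadcastd .LCPI0_0(%rip), %ymm0     # splat of 1, but requires a constant-pool load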

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
This issue was closed.