Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

X86 SSE4.1 instruction problem #50713

Closed
kyriewang2 mannequin opened this issue Aug 6, 2021 · 12 comments
Closed

X86 SSE4.1 instruction problem #50713

kyriewang2 mannequin opened this issue Aug 6, 2021 · 12 comments
Assignees
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@kyriewang2
Copy link
Mannequin

kyriewang2 mannequin commented Aug 6, 2021

Bugzilla Link 51371
Resolution FIXED
Resolved on Aug 16, 2021 18:43
Version 7.0
OS MacOS X
CC @topperc,@DougGregor,@RKSimon,@Kyriewang2,@phoebewang,@zygoloid,@rotateright
Fixed by commit(s) 24dfba8

Extended Description

Hi,
I was build SSE performance work on mac intel. But I found the performance of my SSE4.1 version code using in xcode 12.4 is not as good as xcode 10.1, so I checked the assembly of my code. The one _mm_mul_epi() instruction was translated into three pmuludq, which is the SSE2 instruction, while this was normal when compiling on xcode 10.1 and _mm_mul_epi() was translated into pmuldq. So I checked the clang version, and I found this error occured when clang version larger than 7.0.0.
The simple case can be found: https://godbolt.org/z/Tf7qeocvz
I probably think this is a clang compiler bug. And I hope to get some advice on how to solve this issue.
Thanks.
Best,
Wade

@kyriewang2
Copy link
Mannequin Author

kyriewang2 mannequin commented Aug 6, 2021

assigned to @topperc

@topperc
Copy link
Collaborator

topperc commented Aug 6, 2021

There's something very strange going on in the IR out of the middle end

%37 = shl <2 x i64> %24, <i64 32, i64 32>, !dbg !​327
%38 = ashr exact <2 x i64> %37, <i64 32, i64 32>, !dbg !​327
%39 = shl <2 x i64> %36, <i64 32, i64 32>, !dbg !​327
%40 = ashr exact <2 x i64> %39, <i64 32, i64 32>, !dbg !​327
%41 = mul nsw <2 x i64> %40, %38, !dbg !​327

I assume InstCombine has gotten clever, but I'm not sure why yet.

@topperc
Copy link
Collaborator

topperc commented Aug 6, 2021

Disregard that last comment. That's our IR for pmuldq that we should pattern match. Will keep digging.

@topperc
Copy link
Collaborator

topperc commented Aug 6, 2021

I think because warping is loop invariant, it got pulled out of the loop along with the _mm_set1 and the 2xi64 shl+ashr we insert to sign extend the mul input for pmuldq pattern match.

Because we don't have a vsraq instruction we emit this

    pshufd  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]                                                                                                                                    
    movdqa  %xmm0, %xmm1                                                                                                                                                                              
    psllq   $32, %xmm1                                                                                                                                                                                
    psrad   $31, %xmm1                                                                                                                                                                                
    pblendw $51, %xmm0, %xmm1               # xmm1 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]  

And we aren't able to create an AssertSExt from that to tell the block inside the loop that the input is already sign extended to allow pmuldq to be match. And the sign extend outside the loop would have been unnecessary if we formed pmuldq.

@topperc
Copy link
Collaborator

topperc commented Aug 6, 2021

Maybe we need an X86CodeGenPrepare to bring them back into the same basic block? I assume the same thing can happen to pmuludq.

@kyriewang2
Copy link
Mannequin Author

kyriewang2 mannequin commented Aug 6, 2021

I think because warping is loop invariant, it got pulled out of the loop
along with the _mm_set1 and the 2xi64 shl+ashr we insert to sign extend the
mul input for pmuldq pattern match.

Because we don't have a vsraq instruction we emit this

    pshufd  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]      

    movdqa  %xmm0, %xmm1                                                

    psllq   $32, %xmm1                                                  

    psrad   $31, %xmm1                                                  

    pblendw $51, %xmm0, %xmm1               # xmm1 =

xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]

And we aren't able to create an AssertSExt from that to tell the block
inside the loop that the input is already sign extended to allow pmuldq to
be match. And the sign extend outside the loop would have been unnecessary
if we formed pmuldq.

Thanks for your analysis, and I also want to know is there any solution from code level to avoid this problem?

@topperc
Copy link
Collaborator

topperc commented Aug 7, 2021

To fix this in the compiler, I think we just need to add this to X86TargetLowering::shouldSinkOperands

I'm not sure how to work around it in code.

@topperc
Copy link
Collaborator

topperc commented Aug 7, 2021

Candidate patch https://reviews.llvm.org/D107689

@rotateright
Copy link
Contributor

Anything left to do here?

Not sure if this qualifies for backport to the release branch.

@RKSimon
Copy link
Collaborator

RKSimon commented Aug 16, 2021

Anything left to do here?

Not sure if this qualifies for backport to the release branch.

I don't think so - its a rather old perf regression.

Tentatively resolving this.

@kyriewang2
Copy link
Mannequin Author

kyriewang2 mannequin commented Aug 17, 2021

Anything left to do here?

Not sure if this qualifies for backport to the release branch.

I don't think so - its a rather old perf regression.

Tentatively resolving this.

Thank you for resolving this issue, so what's the release version of llvm which can avoid this issue?

@rotateright
Copy link
Contributor

Anything left to do here?

Not sure if this qualifies for backport to the release branch.

I don't think so - its a rather old perf regression.

Tentatively resolving this.

Thank you for resolving this issue, so what's the release version of llvm
which can avoid this issue?

It will be in 14.0 (about 6 months from now).

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

3 participants