51371 – X86 SSE4.1 instruction problem

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 51371 - X86 SSE4.1 instruction problem

Summary: X86 SSE4.1 instruction problem

Status:	RESOLVED FIXED

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Backend: X86 (show other bugs)
Version:	7.0
Hardware:	PC MacOS X

Importance:	P enhancement
Assignee:	Craig Topper

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-08-05 18:48 PDT by Wade Wang
Modified:	2021-08-16 18:43 PDT (History)
CC List:	10 users (show)

See Also:
Fixed By Commit(s):	24dfba8d507e383eeccfc0cfa30ad6340f71a377

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Wade Wang 2021-08-05 18:48:40 PDT

Hi,
  I was build SSE performance work on mac intel. But I found the performance of my SSE4.1 version code using in xcode 12.4 is not as good as xcode 10.1, so I checked the assembly of my code. The one _mm_mul_epi() instruction was translated into three pmuludq, which is the SSE2 instruction, while this was normal when compiling on xcode 10.1 and _mm_mul_epi() was translated into pmuldq. So I checked the clang version, and I found this error occured when clang version larger than 7.0.0.
  The simple case can be found: https://godbolt.org/z/Tf7qeocvz
  I probably think this is a clang compiler bug. And I hope to get some advice on how to solve this issue.
  Thanks.
Best,
  Wade

Comment 1 Craig Topper 2021-08-05 18:54:00 PDT

There's something very strange going on in the IR out of the middle end

  %37 = shl <2 x i64> %24, <i64 32, i64 32>, !dbg !327
  %38 = ashr exact <2 x i64> %37, <i64 32, i64 32>, !dbg !327
  %39 = shl <2 x i64> %36, <i64 32, i64 32>, !dbg !327
  %40 = ashr exact <2 x i64> %39, <i64 32, i64 32>, !dbg !327
  %41 = mul nsw <2 x i64> %40, %38, !dbg !327

I assume InstCombine has gotten clever, but I'm not sure why yet.

Comment 2 Craig Topper 2021-08-05 19:00:43 PDT

Disregard that last comment. That's our IR for pmuldq that we should pattern match. Will keep digging.

Comment 3 Craig Topper 2021-08-05 19:16:05 PDT

I think because warping is loop invariant, it got pulled out of the loop along with the _mm_set1 and the 2xi64 shl+ashr we insert to sign extend the mul input for pmuldq pattern match.

Because we don't have a vsraq instruction we emit this

        pshufd  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]                                                                                                                                    
        movdqa  %xmm0, %xmm1                                                                                                                                                                              
        psllq   $32, %xmm1                                                                                                                                                                                
        psrad   $31, %xmm1                                                                                                                                                                                
        pblendw $51, %xmm0, %xmm1               # xmm1 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]  

And we aren't able to create an AssertSExt from that to tell the block inside the loop that the input is already sign extended to allow pmuldq to be match. And the sign extend outside the loop would have been unnecessary if we formed pmuldq.

Comment 4 Craig Topper 2021-08-05 20:34:47 PDT

Maybe we need an X86CodeGenPrepare to bring them back into the same basic block? I assume the same thing can happen to pmuludq.

Comment 5 Wade Wang 2021-08-06 02:38:58 PDT

(In reply to Craig Topper from comment #3)
> I think because warping is loop invariant, it got pulled out of the loop
> along with the _mm_set1 and the 2xi64 shl+ashr we insert to sign extend the
> mul input for pmuldq pattern match.
> 
> Because we don't have a vsraq instruction we emit this
> 
>         pshufd  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]      
> 
>         movdqa  %xmm0, %xmm1                                                
> 
>         psllq   $32, %xmm1                                                  
> 
>         psrad   $31, %xmm1                                                  
> 
>         pblendw $51, %xmm0, %xmm1               # xmm1 =
> xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]  
> 
> And we aren't able to create an AssertSExt from that to tell the block
> inside the loop that the input is already sign extended to allow pmuldq to
> be match. And the sign extend outside the loop would have been unnecessary
> if we formed pmuldq.

Thanks for your analysis, and I also want to know is there any solution from code level to avoid this problem?

Comment 6 Craig Topper 2021-08-06 18:27:25 PDT

To fix this in the compiler, I think we just need to add this to X86TargetLowering::shouldSinkOperands

I'm not sure how to work around it in code.

Comment 7 Craig Topper 2021-08-06 23:13:42 PDT

Candidate patch https://reviews.llvm.org/D107689

Comment 8 Sanjay Patel 2021-08-16 09:50:51 PDT

Anything left to do here? 

Not sure if this qualifies for backport to the release branch.

Comment 9 Simon Pilgrim 2021-08-16 10:31:42 PDT

(In reply to Sanjay Patel from comment #8)
> Anything left to do here? 
> 
> Not sure if this qualifies for backport to the release branch.

I don't think so - its a rather old perf regression.

Tentatively resolving this.

Comment 10 Wade Wang 2021-08-16 18:25:41 PDT

(In reply to Simon Pilgrim from comment #9)
> (In reply to Sanjay Patel from comment #8)
> > Anything left to do here? 
> > 
> > Not sure if this qualifies for backport to the release branch.
> 
> I don't think so - its a rather old perf regression.
> 
> Tentatively resolving this.

Thank you for resolving this issue, so what's the release version of llvm which can avoid this issue?

Comment 11 Sanjay Patel 2021-08-16 18:43:59 PDT

(In reply to Wade Wang from comment #10)
> (In reply to Simon Pilgrim from comment #9)
> > (In reply to Sanjay Patel from comment #8)
> > > Anything left to do here? 
> > > 
> > > Not sure if this qualifies for backport to the release branch.
> > 
> > I don't think so - its a rather old perf regression.
> > 
> > Tentatively resolving this.
> 
> Thank you for resolving this issue, so what's the release version of llvm
> which can avoid this issue?

It will be in 14.0 (about 6 months from now).