X86 SSE4.1 instruction problem #50713

kyriewang2 · 2021-08-06T01:48:40Z


Bugzilla Link	51371
Resolution	FIXED
Resolved on	Aug 16, 2021 18:43
Version	7.0
OS	MacOS X
CC	@topperc,@DougGregor,@RKSimon,@Kyriewang2,@phoebewang,@zygoloid,@rotateright
Fixed by commit(s)	`24dfba8`

Extended Description

Hi,
I was build SSE performance work on mac intel. But I found the performance of my SSE4.1 version code using in xcode 12.4 is not as good as xcode 10.1, so I checked the assembly of my code. The one _mm_mul_epi() instruction was translated into three pmuludq, which is the SSE2 instruction, while this was normal when compiling on xcode 10.1 and _mm_mul_epi() was translated into pmuldq. So I checked the clang version, and I found this error occured when clang version larger than 7.0.0.
The simple case can be found: https://godbolt.org/z/Tf7qeocvz
I probably think this is a clang compiler bug. And I hope to get some advice on how to solve this issue.
Thanks.
Best,
Wade

The text was updated successfully, but these errors were encountered:

kyriewang2 · 2021-08-06T01:48:40Z

assigned to @topperc

topperc · 2021-08-06T01:54:00Z

There's something very strange going on in the IR out of the middle end

%37 = shl <2 x i64> %24, <i64 32, i64 32>, !dbg !327
%38 = ashr exact <2 x i64> %37, <i64 32, i64 32>, !dbg !327
%39 = shl <2 x i64> %36, <i64 32, i64 32>, !dbg !327
%40 = ashr exact <2 x i64> %39, <i64 32, i64 32>, !dbg !327
%41 = mul nsw <2 x i64> %40, %38, !dbg !327

I assume InstCombine has gotten clever, but I'm not sure why yet.

topperc · 2021-08-06T02:00:43Z

Disregard that last comment. That's our IR for pmuldq that we should pattern match. Will keep digging.

topperc · 2021-08-06T02:16:05Z

I think because warping is loop invariant, it got pulled out of the loop along with the _mm_set1 and the 2xi64 shl+ashr we insert to sign extend the mul input for pmuldq pattern match.

Because we don't have a vsraq instruction we emit this

    pshufd  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]                                                                                                                                    
    movdqa  %xmm0, %xmm1                                                                                                                                                                              
    psllq   $32, %xmm1                                                                                                                                                                                
    psrad   $31, %xmm1                                                                                                                                                                                
    pblendw $51, %xmm0, %xmm1               # xmm1 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]

And we aren't able to create an AssertSExt from that to tell the block inside the loop that the input is already sign extended to allow pmuldq to be match. And the sign extend outside the loop would have been unnecessary if we formed pmuldq.

topperc · 2021-08-06T03:34:47Z

Maybe we need an X86CodeGenPrepare to bring them back into the same basic block? I assume the same thing can happen to pmuludq.

kyriewang2 · 2021-08-06T09:38:58Z

I think because warping is loop invariant, it got pulled out of the loop
along with the _mm_set1 and the 2xi64 shl+ashr we insert to sign extend the
mul input for pmuldq pattern match.

Because we don't have a vsraq instruction we emit this
    pshufd  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]      

    movdqa  %xmm0, %xmm1                                                

    psllq   $32, %xmm1                                                  

    psrad   $31, %xmm1                                                  

    pblendw $51, %xmm0, %xmm1               # xmm1 =
xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]

And we aren't able to create an AssertSExt from that to tell the block
inside the loop that the input is already sign extended to allow pmuldq to
be match. And the sign extend outside the loop would have been unnecessary
if we formed pmuldq.

Thanks for your analysis, and I also want to know is there any solution from code level to avoid this problem?

topperc · 2021-08-07T01:27:25Z

To fix this in the compiler, I think we just need to add this to X86TargetLowering::shouldSinkOperands

I'm not sure how to work around it in code.

topperc · 2021-08-07T06:13:42Z

Candidate patch https://reviews.llvm.org/D107689

rotateright · 2021-08-16T16:50:51Z

Anything left to do here?

Not sure if this qualifies for backport to the release branch.

RKSimon · 2021-08-16T17:31:42Z

Anything left to do here?

Not sure if this qualifies for backport to the release branch.

I don't think so - its a rather old perf regression.

Tentatively resolving this.

kyriewang2 · 2021-08-17T01:25:41Z

Anything left to do here?

Not sure if this qualifies for backport to the release branch.

I don't think so - its a rather old perf regression.

Tentatively resolving this.

Thank you for resolving this issue, so what's the release version of llvm which can avoid this issue?

rotateright · 2021-08-17T01:43:59Z

Anything left to do here?

Not sure if this qualifies for backport to the release branch.

I don't think so - its a rather old perf regression.

Tentatively resolving this.

Thank you for resolving this issue, so what's the release version of llvm
which can avoid this issue?

It will be in 14.0 (about 6 months from now).

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

X86 SSE4.1 instruction problem #50713

X86 SSE4.1 instruction problem #50713

kyriewang2 mannequin commented Aug 6, 2021

kyriewang2 mannequin commented Aug 6, 2021

topperc commented Aug 6, 2021

topperc commented Aug 6, 2021

topperc commented Aug 6, 2021

topperc commented Aug 6, 2021

kyriewang2 mannequin commented Aug 6, 2021

topperc commented Aug 7, 2021

topperc commented Aug 7, 2021

rotateright commented Aug 16, 2021

RKSimon commented Aug 16, 2021

kyriewang2 mannequin commented Aug 17, 2021

rotateright commented Aug 17, 2021

X86 SSE4.1 instruction problem #50713

X86 SSE4.1 instruction problem #50713

Comments

kyriewang2 mannequin commented Aug 6, 2021

Extended Description

kyriewang2 mannequin commented Aug 6, 2021

topperc commented Aug 6, 2021

topperc commented Aug 6, 2021

topperc commented Aug 6, 2021

topperc commented Aug 6, 2021

kyriewang2 mannequin commented Aug 6, 2021

topperc commented Aug 7, 2021

topperc commented Aug 7, 2021

rotateright commented Aug 16, 2021

RKSimon commented Aug 16, 2021

kyriewang2 mannequin commented Aug 17, 2021

rotateright commented Aug 17, 2021