Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[x86] Failure to optimize vector shuffle in conversion #48425

Open
GabrielRavier opened this issue Feb 7, 2021 · 4 comments
Open

[x86] Failure to optimize vector shuffle in conversion #48425

GabrielRavier opened this issue Feb 7, 2021 · 4 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla confirmed Verified by a second party llvm:SLPVectorizer

Comments

@GabrielRavier
Copy link
Contributor

GabrielRavier commented Feb 7, 2021

Bugzilla Link 49081
Version trunk
OS Linux
Depends On #35080
CC @alexey-bataev,@anton-afanasyev,@topperc,@RKSimon,@phoebewang,@rotateright

Extended Description

typedef int v4si __attribute__((vector_size(16)));
typedef float v4sf __attribute__((vector_size(16)));

v4sf f(v4si f)
{
    return (v4sf){(float)f[1], (float)f[1], (float)f[2], (float)f[3]};
}

With -O3, GCC outputs this:

f(int __vector(4)):
  pshufd xmm0, xmm0, 229
  cvtdq2ps xmm0, xmm0
  ret

LLVM outputs this:

f(int __vector(4)):
  pshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
  cvtdq2ps xmm1, xmm1
  pshufd xmm2, xmm0, 238 # xmm2 = xmm0[2,3,2,3]
  cvtdq2ps xmm2, xmm2
  pshufd xmm0, xmm0, 255 # xmm0 = xmm0[3,3,3,3]
  cvtdq2ps xmm0, xmm0
  unpcklps xmm2, xmm0 # xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
  shufps xmm1, xmm2, 64 # xmm1 = xmm1[0,0],xmm2[0,1]
  movaps xmm0, xmm1
  ret
@RKSimon
Copy link
Collaborator

RKSimon commented Feb 9, 2021

Current Codegen: https://godbolt.org/z/oYn1df

Probably an SLP issue? [Bug #​35732] looks very similar

@anton-afanasyev
Copy link
Contributor

This issue is to be fixed by http://reviews.llvm.org/D57059 after committing (in process of review now). We have conversion for f[1], f[2] and f[3] here, for three fps, so I have checked patch "Initial support for the vectorization of the non-power-of-2 vectors" fits here:

./opt -slp-vectorizer -instcombine -S pr49081.ll
...
define dso_local <4 x float> @​foo(<4 x i32> %0) {
%shuffle = shufflevector <4 x i32> %0, <4 x i32> poison, <4 x i32> <i32 1, i32 2, i32 3, i32 undef>
%2 = sitofp <4 x i32> %shuffle to <4 x float>
%3 = shufflevector <4 x float> %2, <4 x float> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 2>
ret <4 x float> %3
}

@anton-afanasyev
Copy link
Contributor

Commited llvm/test/Transforms/SLPVectorizer/X86/pr49081.ll to track fixing.

@rotateright
Copy link
Contributor

This issue is to be fixed by http://reviews.llvm.org/D57059 after committing
(in process of review now). We have conversion for f[1], f[2] and f[3] here,
for three fps, so I have checked patch "Initial support for the
vectorization of the non-power-of-2 vectors" fits here:

./opt -slp-vectorizer -instcombine -S pr49081.ll
...
define dso_local <4 x float> @​foo(<4 x i32> %0) {
%shuffle = shufflevector <4 x i32> %0, <4 x i32> poison, <4 x i32> <i32 1,
i32 2, i32 3, i32 undef>
%2 = sitofp <4 x i32> %shuffle to <4 x float>
%3 = shufflevector <4 x float> %2, <4 x float> undef, <4 x i32> <i32 0,
i32 0, i32 1, i32 2>
ret <4 x float> %3
}

For reference, that sequence was not optimizing in IR or backend, so added an instcombine transform to make it easier for SDAG:
https://reviews.llvm.org/rG0bab0f616119

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
@llvmbot llvmbot added the confirmed Verified by a second party label Jan 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla confirmed Verified by a second party llvm:SLPVectorizer
Projects
None yet
Development

No branches or pull requests

5 participants