
Merge consecutive load insertions into a vector where possible #38821

Closed
RKSimon opened this issue Oct 29, 2018 · 5 comments
Labels
bugzilla Issues migrated from bugzilla llvm:optimizations

Comments

RKSimon (Collaborator) commented Oct 29, 2018

Bugzilla Link: 39473
Version: trunk
OS: Windows NT
CC: @adibiagio, @filcab, @rotateright

Extended Description

#include <x86intrin.h>

__m128i load_00123456(const unsigned short *data) {
  return _mm_setr_epi16(data[0], data[0], data[1], data[2], data[3], data[4], data[5], data[6]);
}

-O3 -march=btver2

_Z13load_00123456PKt: # @_Z13load_00123456PKt
  movzwl (%rdi), %eax
  vmovd %eax, %xmm0
  vpinsrw $1, %eax, %xmm0, %xmm0
  vpinsrw $2, 2(%rdi), %xmm0, %xmm0
  vpinsrw $3, 4(%rdi), %xmm0, %xmm0
  vpinsrw $4, 6(%rdi), %xmm0, %xmm0
  vpinsrw $5, 8(%rdi), %xmm0, %xmm0
  vpinsrw $6, 10(%rdi), %xmm0, %xmm0
  vpinsrw $7, 12(%rdi), %xmm0, %xmm0
  retq

Many of the loads/insertions could be merged into something like:

_Z19load_00123456_mergePKt: # @_Z19load_00123456_mergePKt
  movzwl (%rdi), %eax
  vmovd %eax, %xmm0
  vpshuflw $224, %xmm0, %xmm0 # xmm0 = xmm0[0,0,2,3,4,5,6,7]
  vpshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
  vpinsrd $1, 2(%rdi), %xmm0, %xmm0
  vpinsrq $1, 6(%rdi), %xmm0, %xmm0
  retq

https://godbolt.org/z/-HLpsE
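
For reference, a rough C-intrinsics equivalent of the suggested merged sequence (a sketch only, assuming SSE4.1 as provided by btver2; the helper name and the memcpy-based unaligned loads are illustrative, not taken from the original report):

#include <string.h>
#include <x86intrin.h>

__m128i load_00123456_merge_sketch(const unsigned short *data) {
  unsigned int lo32;        // data[1..2] as one 32-bit load
  unsigned long long hi64;  // data[3..6] as one 64-bit load
  memcpy(&lo32, data + 1, sizeof(lo32));
  memcpy(&hi64, data + 3, sizeof(hi64));
  __m128i v = _mm_set1_epi16((short)data[0]);  // splat data[0] into every 16-bit lane
  v = _mm_insert_epi32(v, (int)lo32, 1);       // lanes 2-3 <- data[1], data[2]
  v = _mm_insert_epi64(v, (long long)hi64, 1); // lanes 4-7 <- data[3..6]
  return v;
}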

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
rotateright (Contributor) commented Jul 4, 2022

We've made some progress on this example - the original code actually looks better on paper (llvm-mca) than the suggested asm:
https://godbolt.org/z/MvzM91hWo

rotateright (Contributor) commented:

The current IR for those examples looks like this:

define dso_local noundef <2 x i64> @_Z13load_00123456PKt(ptr nocapture noundef readonly %0) local_unnamed_addr #0 {
  %2 = load <2 x i16>, ptr %0, align 2, !tbaa !6
  %3 = getelementptr inbounds i16, ptr %0, i64 2
  %4 = load i16, ptr %3, align 2, !tbaa !6
  %5 = getelementptr inbounds i16, ptr %0, i64 3
  %6 = load <4 x i16>, ptr %5, align 2, !tbaa !6
  %7 = shufflevector <2 x i16> %2, <2 x i16> poison, <8 x i32> <i32 0, i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %8 = insertelement <8 x i16> %7, i16 %4, i64 3
  %9 = shufflevector <4 x i16> %6, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %10 = shufflevector <8 x i16> %8, <8 x i16> %9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
  %11 = bitcast <8 x i16> %10 to <2 x i64>
  ret <2 x i64> %11
}

define dso_local noundef <2 x i64> @_Z19load_00123456_mergePKt(ptr nocapture noundef readonly %0) local_unnamed_addr #0 {
  %2 = load i16, ptr %0, align 2, !tbaa !6
  %3 = insertelement <8 x i16> undef, i16 %2, i64 0
  %4 = shufflevector <8 x i16> %3, <8 x i16> poison, <8 x i32> <i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %5 = bitcast <8 x i16> %4 to <4 x i32>
  %6 = getelementptr inbounds i16, ptr %0, i64 1
  %7 = load i32, ptr %6, align 4, !tbaa !10
  %8 = insertelement <4 x i32> %5, i32 %7, i64 1
  %9 = bitcast <4 x i32> %8 to <2 x i64>
  %10 = getelementptr inbounds i16, ptr %0, i64 3
  %11 = load i64, ptr %10, align 8, !tbaa !12
  %12 = insertelement <2 x i64> %9, i64 %11, i64 1
  ret <2 x i64> %12
}

rotateright (Contributor) commented:

Given the source, I don't think there's a way to speculatively load the full vector - access to data[7] is not guaranteed.

Is there anything more we can do to improve this?
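
To illustrate the constraint above: if the full 16 bytes starting at data were known dereferenceable (which the source signature does not guarantee), the whole pattern would collapse to one vector load plus a shuffle - a sketch only, assuming SSSE3, with a hypothetical helper name:

#include <x86intrin.h>

__m128i load_00123456_full_sketch(const unsigned short *data) {
  // Speculative 128-bit load; reads data[7], which the original source never touches.
  __m128i v = _mm_loadu_si128((const __m128i *)data);
  // Byte shuffle producing d0,d0,d1,d2,d3,d4,d5,d6 in the eight 16-bit lanes.
  const __m128i idx = _mm_setr_epi8(0, 1, 0, 1, 2, 3, 4, 5,
                                    6, 7, 8, 9, 10, 11, 12, 13);
  return _mm_shuffle_epi8(v, idx);
}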

RKSimon (Collaborator, Author) commented Jul 4, 2022

Yes - without a better dereferenceable range, widening the scalar load insertions is the best we can do.

The codegen, at least for SSE4+, looks pretty good now - although plain SSE2 seems to have regressed again since a high point around 13.x, which I think is an SLP/cost issue.

We can probably close this, although we need at least some explicit SLP test coverage first.

rotateright (Contributor) commented:

Note that with AVX2, the suggested merged IR will actually look better because we can use a broadcast load of i16:
https://godbolt.org/z/rxvvWsG39

...but I've been staring at that for a while, and I do not see a way to coerce SLP or the cost model into producing the necessary re-arranged loads and shuffles. Once SLP is done, there's no way for the backend to undo it either AFAIK.
Test coverage added with 10ebaf7 in case someone finds a way.
