
Merge consecutive load insertions into a vector where possible #38821

Closed
RKSimon opened this issue Oct 29, 2018 · 5 comments
Labels
bugzilla Issues migrated from bugzilla llvm:optimizations

Comments

RKSimon (Collaborator) commented Oct 29, 2018

Bugzilla Link: 39473
Version: trunk
OS: Windows NT
CC: @adibiagio, @filcab, @rotateright

Extended Description

#include <x86intrin.h>

__m128i load_00123456(const unsigned short *data) {
  return _mm_setr_epi16(data[0], data[0], data[1], data[2], data[3], data[4], data[5], data[6]);
}

-O3 -march=btver2

_Z13load_00123456PKt: # @_Z13load_00123456PKt
  movzwl (%rdi), %eax
  vmovd %eax, %xmm0
  vpinsrw $1, %eax, %xmm0, %xmm0
  vpinsrw $2, 2(%rdi), %xmm0, %xmm0
  vpinsrw $3, 4(%rdi), %xmm0, %xmm0
  vpinsrw $4, 6(%rdi), %xmm0, %xmm0
  vpinsrw $5, 8(%rdi), %xmm0, %xmm0
  vpinsrw $6, 10(%rdi), %xmm0, %xmm0
  vpinsrw $7, 12(%rdi), %xmm0, %xmm0
  retq

Many of the loads/insertions could be merged into something like:

_Z19load_00123456_mergePKt: # @_Z19load_00123456_mergePKt
  movzwl (%rdi), %eax
  vmovd %eax, %xmm0
  vpshuflw $224, %xmm0, %xmm0 # xmm0 = xmm0[0,0,2,3,4,5,6,7]
  vpshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
  vpinsrd $1, 2(%rdi), %xmm0, %xmm0
  vpinsrq $1, 6(%rdi), %xmm0, %xmm0
  retq

https://godbolt.org/z/-HLpsE
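
For reference, a rough C-intrinsics equivalent of the suggested merged sequence (a sketch only, assuming SSE4.1 as provided by btver2; the helper name and the memcpy-based unaligned loads are illustrative, not taken from the original report):

#include <string.h>
#include <x86intrin.h>

__m128i load_00123456_merge_sketch(const unsigned short *data) {
  unsigned int lo32;        // data[1..2] as one 32-bit load
  unsigned long long hi64;  // data[3..6] as one 64-bit load
  memcpy(&lo32, data + 1, sizeof(lo32));
  memcpy(&hi64, data + 3, sizeof(hi64));
  __m128i v = _mm_set1_epi16((short)data[0]);  // splat data[0] into every 16-bit lane
  v = _mm_insert_epi32(v, (int)lo32, 1);       // lanes 2-3 <- data[1], data[2]
  v = _mm_insert_epi64(v, (long long)hi64, 1); // lanes 4-7 <- data[3..6]
  return v;
}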

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
rotateright (Contributor) commented Jul 4, 2022

We've made some progress on this example - the original code actually looks better on paper (llvm-mca) than the suggested asm:
https://godbolt.org/z/MvzM91hWo

rotateright (Contributor) commented:

The current IR for those examples looks like this:

define dso_local noundef <2 x i64> @_Z13load_00123456PKt(ptr nocapture noundef readonly %0) local_unnamed_addr #0 {
  %2 = load <2 x i16>, ptr %0, align 2, !tbaa !6
  %3 = getelementptr inbounds i16, ptr %0, i64 2
  %4 = load i16, ptr %3, align 2, !tbaa !6
  %5 = getelementptr inbounds i16, ptr %0, i64 3
  %6 = load <4 x i16>, ptr %5, align 2, !tbaa !6
  %7 = shufflevector <2 x i16> %2, <2 x i16> poison, <8 x i32> <i32 0, i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %8 = insertelement <8 x i16> %7, i16 %4, i64 3
  %9 = shufflevector <4 x i16> %6, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %10 = shufflevector <8 x i16> %8, <8 x i16> %9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
  %11 = bitcast <8 x i16> %10 to <2 x i64>
  ret <2 x i64> %11
}

define dso_local noundef <2 x i64> @_Z19load_00123456_mergePKt(ptr nocapture noundef readonly %0) local_unnamed_addr #0 {
  %2 = load i16, ptr %0, align 2, !tbaa !6
  %3 = insertelement <8 x i16> undef, i16 %2, i64 0
  %4 = shufflevector <8 x i16> %3, <8 x i16> poison, <8 x i32> <i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %5 = bitcast <8 x i16> %4 to <4 x i32>
  %6 = getelementptr inbounds i16, ptr %0, i64 1
  %7 = load i32, ptr %6, align 4, !tbaa !10
  %8 = insertelement <4 x i32> %5, i32 %7, i64 1
  %9 = bitcast <4 x i32> %8 to <2 x i64>
  %10 = getelementptr inbounds i16, ptr %0, i64 3
  %11 = load i64, ptr %10, align 8, !tbaa !12
  %12 = insertelement <2 x i64> %9, i64 %11, i64 1
  ret <2 x i64> %12
}

rotateright (Contributor) commented:

Given the source, I don't think there's a way to speculatively load the full vector - access to data[7] is not guaranteed.

Is there anything more we can do to improve this?
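
To illustrate the constraint above: if the full 16 bytes starting at data were known dereferenceable (which the source signature does not guarantee), the whole pattern would collapse to one vector load plus a shuffle - a sketch only, assuming SSSE3, with a hypothetical helper name:

#include <x86intrin.h>

__m128i load_00123456_full_sketch(const unsigned short *data) {
  // Speculative 128-bit load; reads data[7], which the original source never touches.
  __m128i v = _mm_loadu_si128((const __m128i *)data);
  // Byte shuffle producing d0,d0,d1,d2,d3,d4,d5,d6 in the eight 16-bit lanes.
  const __m128i idx = _mm_setr_epi8(0, 1, 0, 1, 2, 3, 4, 5,
                                    6, 7, 8, 9, 10, 11, 12, 13);
  return _mm_shuffle_epi8(v, idx);
}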

RKSimon (Collaborator, Author) commented Jul 4, 2022

Yes - without a better dereferenceable range, widening the scalar load insertions is the best we can do.

The codegen, at least for SSE4+, looks pretty good now - although plain SSE2 seems to have regressed again since a high point around 13.x, which I think is an SLP/cost issue.

We can probably close this, although we need at least some explicit SLP test coverage first.

rotateright (Contributor) commented:

Note that with AVX2, the suggested merged IR will actually look better because we can use a broadcast load of i16:
https://godbolt.org/z/rxvvWsG39

...but I've been staring at that for a while, and I do not see a way to coerce SLP or the cost model into producing the necessary re-arranged loads and shuffles. Once SLP is done, there's no way for the backend to undo it either AFAIK.
Test coverage added with 10ebaf7 in case someone finds a way.
