Follow-up to [Bug #21710] '[X86][AVX] suboptimal expansion of 256 bit vector loads.'

Merging of consecutive loads into a 256-bit ymm register now works well for simple cases, and the loads also fold nicely for bitwise ops (as well as basic float ops - fadd, fsub, etc.). Vector shuffle optimizations, however, attempt to selectively load individual lanes and in doing so prevent the load from being folded into the shuffle. e.g.

__m256d vsht_d4(__m256d foo) {
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 );
}

define <4 x double> @_Z7vsht_d4Dv4_d(<4 x double> %foo) #1 {
  %1 = shufflevector <4 x double> %foo, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %1
}

vpermilpd $0, %ymm0, %ymm0    # ymm0 = ymm0[0,0,2,2]
retq

__m256d vsht_d4_fold(const double* ptr) {
  __m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] };
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 );
}

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double* %ptr, align 8, !tbaa !1
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr inbounds double* %ptr, i64 2
  %4 = load double* %3, align 8, !tbaa !1
  %5 = insertelement <4 x double> %2, double %4, i32 2
  %6 = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %6
}

vmovsd (%rdi), %xmm0
vmovsd 16(%rdi), %xmm1
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vpermilpd $0, %ymm0, %ymm0    # ymm0 = ymm0[0,0,2,2]
retq

Manually editing the IR does permit the fold to occur:

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double* %ptr, align 8, !tbaa !1
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr inbounds double* %ptr, i64 1
  %4 = load double* %3, align 8, !tbaa !1
  %5 = insertelement <4 x double> %2, double %4, i32 1
  %6 = getelementptr inbounds double* %ptr, i64 2
  %7 = load double* %6, align 8, !tbaa !1
  %8 = insertelement <4 x double> %5, double %7, i32 2
  %9 = getelementptr inbounds double* %ptr, i64 3
  %10 = load double* %9, align 8, !tbaa !1
  %11 = insertelement <4 x double> %8, double %10, i32 3
  %12 = shufflevector <4 x double> %11, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %12
}

vpermilpd $0, (%rdi), %ymm0    # ymm0 = mem[0,0,2,2]
retq
As of r230021, the vpermilpd has become a vmovddup, but we're still not merging the loads:

$ ./llc 21780.ll -o - -mattr=avx
...
vmovsd (%rdi), %xmm0      ## xmm0 = mem[0],zero
vmovsd 16(%rdi), %xmm1    ## xmm1 = mem[0],zero
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vmovddup %ymm0, %ymm0     ## ymm0 = ymm0[0,0,2,2]
retq
I think the problem with doing a full vector load in this case is that we'd be reading unknown bytes of memory beyond the last scalar load. In general, this isn't safe (the extra bytes may be in unmapped memory). If we change the test case to load the scalar elements at both ends of the vector, then the vector load optimization does kick in:

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double, double* %ptr, align 8
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr double, double* %ptr, i64 3        ; Read last vector element
  %4 = load double, double* %3, align 8
  %5 = insertelement <4 x double> %2, double %4, i32 3  ; And insert into last position
  %6 = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 3, i32 3>
  ret <4 x double> %6
}

$ ./llc -mattr=avx 21780.ll -o -
...
vpermilpd $12, (%rdi), %ymm0    ## ymm0 = mem[0,0,3,3]
retq
Ah, now I see the source test case:

__m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] };
return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 );

So I'll stick with the assessment that the backend is doing all that it can given the IR. It looks like -instcombine is responsible for eliminating the unused loads in IR. To fix this, we'd need to teach instcombine to weigh a potential trade-off: bypass the elimination of some loads in exchange for reducing the total number of ops via a larger load.
Current codegen:

$ ./llc 21780.ll -o - -mattr=avx
vmovddup (%rdi), %xmm0      ## xmm0 = mem[0,0]
vmovddup 16(%rdi), %xmm1    ## xmm1 = mem[0,0]
vinsertf128 $1, %xmm1, %ymm0, %ymm0
retq
I was going back through our splat discussions in relation to bug 27141, and came back to this bug. Let's make this IR test case concrete:

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %arrayidx0 = getelementptr inbounds double, double* %ptr, i64 0
  %arrayidx1 = getelementptr inbounds double, double* %ptr, i64 1
  %arrayidx2 = getelementptr inbounds double, double* %ptr, i64 2
  %arrayidx3 = getelementptr inbounds double, double* %ptr, i64 3
  %ld0 = load double, double* %arrayidx0
  %ld1 = load double, double* %arrayidx1
  %ld2 = load double, double* %arrayidx2
  %ld3 = load double, double* %arrayidx3
  %ins0 = insertelement <4 x double> undef, double %ld0, i32 0
  %ins1 = insertelement <4 x double> %ins0, double %ld1, i32 1
  %ins2 = insertelement <4 x double> %ins1, double %ld2, i32 2
  %ins3 = insertelement <4 x double> %ins2, double %ld3, i32 3
  %shuffle = shufflevector <4 x double> %ins3, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

-----------------------------------------------------------------------------

Here's the current optimization - delete the unused scalar loads and inserts:

$ ./opt -instcombine 21780.ll -S
; ModuleID = '21780.ll'
source_filename = "21780.ll"

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %arrayidx2 = getelementptr inbounds double, double* %ptr, i64 2
  %ld0 = load double, double* %ptr, align 8
  %ld2 = load double, double* %arrayidx2, align 8
  %ins0 = insertelement <4 x double> undef, double %ld0, i32 0
  %ins2 = insertelement <4 x double> %ins0, double %ld2, i32 2
  %shuffle = shufflevector <4 x double> %ins2, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

-----------------------------------------------------------------------------

What we'd prefer to see for an AVX machine:

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %bc = bitcast double* %ptr to <4 x double>*
  %ldvec = load <4 x double>, <4 x double>* %bc
  %shuffle = shufflevector <4 x double> %ldvec, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

-----------------------------------------------------------------------------

To summarize earlier comments: after instcombine removes the loads, no other pass is allowed to recreate them. We've simply lost the information that it was safe to do those operations in the first place. This goes back to the recent llvm-dev thread on reading extra memory.

So this has to be an instcombine transform. Given that the vector IR has fewer instructions, I think that's a reasonable transform compared to what we do today. It should be a backend responsibility to split that vector load into scalar ops if that would be profitable.
Note that there is a "load-combine" pass for IR that is not on by default. It doesn't work on the example IR in comment 5 because it is currently limited to integer loads. If we change the example types to ints, it does this:

define <4 x i64> @load_four_scalars_but_use_two(i64* %ptr) {
  %1 = bitcast i64* %ptr to i256*
  %ld0.combined = load i256, i256* %1, align 8    <--- crazy wide integer type
  %combine.extract.trunc = trunc i256 %ld0.combined to i64
  %combine.extract.shift = lshr i256 %ld0.combined, 64
  %combine.extract.trunc1 = trunc i256 %combine.extract.shift to i64
  %combine.extract.shift2 = lshr i256 %ld0.combined, 128
  %combine.extract.trunc3 = trunc i256 %combine.extract.shift2 to i64
  %combine.extract.shift4 = lshr i256 %ld0.combined, 192
  %combine.extract.trunc5 = trunc i256 %combine.extract.shift4 to i64
  %ins0 = insertelement <4 x i64> undef, i64 %combine.extract.trunc, i32 0
  %ins1 = insertelement <4 x i64> %ins0, i64 %combine.extract.trunc1, i32 1
  %ins2 = insertelement <4 x i64> %ins1, i64 %combine.extract.trunc3, i32 2
  %ins3 = insertelement <4 x i64> %ins2, i64 %combine.extract.trunc5, i32 3
  ret <4 x i64> %ins3
}

Unfortunately, even if we fix that, I don't think this pass is a viable option for the transform we want, because we'd have to run it before instcombine has a chance to eliminate the unused loads. The load-combine pass does provide a model for what is required, though: alias analysis, etc. As we've seen with combining loads in the DAG, this isn't an easy transform in general because of all the potential for error when reordering memory operations.
(In reply to comment #5)
> To summarize earlier comments: after instcombine removes the loads, no other
> pass is allowed to recreate them. We've simply lost the information that it
> was safe to do those operations in the first place. This goes back to the
> recent llvm-dev thread on reading extra memory.
>
> So this has to be an instcombine transform. Given that the vector IR has
> less instructions, I think that's a reasonable transform compared to what we
> do today. It should be a backend responsibility to split that vector load
> into scalar ops if that would be profitable.

Let me refine that statement: instcombine must preserve the fact that it has removed a load of memory in order for subsequent passes to act on that information. It's probably easier to have the SLP vectorizer or some other pass act on that information because instcombine isn't currently equipped for memop combining AFAIK.

There does already appear to be a metadata type that would work for this use case:

"The optional !dereferenceable_or_null metadata must reference a single metadata name <deref_bytes_node> corresponding to a metadata node with one i64 entry. The existence of the !dereferenceable_or_null metadata on the instruction tells the optimizer that the value loaded is known to be either dereferenceable or null. The number of bytes known to be dereferenceable is specified by the integer value in the metadata node."

http://llvm.org/docs/LangRef.html#load-instruction
Current codegen: https://godbolt.org/z/sc6te_
Try to preserve the dereferenceable information before instcombine kills it: https://reviews.llvm.org/D64258