
[X86] Compiler wrongly emits temporal stores when vectorizing a scalar nontemporal memcpy loop. #40105

Closed
adibiagio opened this issue Feb 18, 2019 · 3 comments
Labels: backend:X86, bugzilla (Issues migrated from bugzilla)

Comments

@adibiagio (Collaborator)

Bugzilla Link: 40759
Resolution: FIXED
Resolved on: Jun 17, 2019 11:45
Version: trunk
OS: Windows NT
CC: @topperc, @francisvm, @RKSimon, @rotateright, @wjristow
Fixed by commit(s): 363581

Extended Description

Example:

void foo(unsigned *A, unsigned *B, unsigned Elts) {
  for (unsigned I = 0; I < Elts; ++I) {
    unsigned X = A[I];
    __builtin_nontemporal_store(X, &B[I]);
  }
}

clang -O2 -march=btver2 -S -o -

.LBB0_8:                                # %for.body
        movl    (%rdi,%rcx,4), %edx
        movntil %edx, (%rsi,%rcx,4)     # <<==  OK. Nontemporal store.
        incq    %rcx
        cmpq    %rcx, %rax
        jne     .LBB0_8
        retq
.LBB0_5:                                # %vector.ph
        movl    %eax, %ecx
        xorl    %edx, %edx
        andl    $-32, %ecx
        .p2align        4, 0x90
.LBB0_6:                                # %vector.body
        vmovups (%rdi,%rdx,4), %ymm0
        vmovups 32(%rdi,%rdx,4), %ymm1
        vmovups 64(%rdi,%rdx,4), %ymm2
        vmovups 96(%rdi,%rdx,4), %ymm3
        vmovups %ymm0, (%rsi,%rdx,4)    # <<== WRONG. Temporal vector store.
        vmovups %ymm1, 32(%rsi,%rdx,4)  # Same...
        vmovups %ymm2, 64(%rsi,%rdx,4)  # Same...
        vmovups %ymm3, 96(%rsi,%rdx,4)  # Same...
        addq    $32, %rdx
        cmpq    %rdx, %rcx
        jne     .LBB0_6
# %bb.7:                                # %middle.block
        cmpq    %rax, %rcx
        jne     .LBB0_8

On X86, (V)MOVNTPS can be used to perform nontemporal vector stores.
However, (V)MOVNTPS requires the destination memory operand to be 16-byte aligned for 128-bit stores, or 32-byte aligned for 256-bit stores.
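
As a quick illustration of that constraint, here is a minimal sketch using the corresponding intrinsics (the helper names are invented for this example; compile with -mavx for the 256-bit form). There is no unaligned variant of either instruction:

#include <immintrin.h>

// Sketch: aligned nontemporal vector stores. These lower to MOVNTPS /
// VMOVNTPS and require p to be 16-byte or 32-byte aligned respectively;
// a misaligned address faults (#GP) at run time.
void ntStore128(float *p, __m128 v) { _mm_stream_ps(p, v); }    // MOVNTPS
void ntStore256(float *p, __m256 v) { _mm256_stream_ps(p, v); } // VMOVNTPS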

In this example, the store instructions are marked as 4-byte aligned.
When the loop vectorizer kicks in, it generates a vector loop body in which all vector stores are correctly annotated with !nontemporal metadata and an alignment of 4.

However, x86 has no unaligned nontemporal vector store instruction.
So, ISel falls back to selecting normal (i.e. temporal) unaligned stores (see the VMOVUPS instructions in the assembly above).

When vectorizing a memcpy-like loop, we should probably check whether the target supports unaligned nontemporal vector stores before transforming the loop. Otherwise, we risk accidentally introducing temporal stores that pollute the caches.
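
A standalone sketch of the kind of legality check being suggested (the helper names are made up for illustration; the eventual patch, linked in the comments below, implements this as a target query in TargetTransformInfo):

#include <cstdint>

// Hypothetical query: can the target emit a nontemporal vector store of
// this width at this alignment? On x86 the nontemporal vector stores
// ((V)MOVNTPS, (V)MOVNTDQ) require alignment equal to the vector width.
static bool isLegalNTVectorStore(uint64_t VecBytes, uint64_t AlignBytes) {
  return AlignBytes >= VecBytes;
}

// Before widening a !nontemporal store by factor VF, the vectorizer
// would bail out (keeping the scalar MOVNTI loop) when the check fails.
// In the example above: VF = 8, ElemBytes = 4, ScalarAlign = 4, so a
// 32-byte YMM nontemporal store is rejected.
static bool mayVectorizeNTStore(unsigned VF, uint64_t ElemBytes,
                                uint64_t ScalarAlign) {
  return isLegalNTVectorStore(uint64_t(VF) * ElemBytes, ScalarAlign);
}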

@wjristow (Collaborator) commented May 9, 2019

Proposed patch at:
https://reviews.llvm.org/D61764

@RKSimon (Collaborator) commented May 10, 2019

Warren's patch solves the LV bug, which is the main issue to address.

In the backend we do have some scalarization options (as a fallback):

For SSE4A targets we can at least scalarize unaligned vectors to use MOVNTSD/MOVNTSS, shuffling/splitting the xmm/ymm data.

For SSE2 (non-SSE4A) targets we could use MOVNTI, although this would involve moving the vector over to GPRs one i32/i64 at a time.

Both are slow, but better than polluting the caches.
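
For a rough picture of the MOVNTI route, here is a sketch (the helper name is invented; it assumes x86-64, where _mm_stream_si64 lowers to a 64-bit MOVNTI, which, unlike the vector forms, has no alignment requirement):

#include <immintrin.h>

// Sketch: scalarize an unaligned 16-byte nontemporal store into two
// 64-bit MOVNTI stores. The vector is spilled to a temporary so the
// element extraction works with plain SSE2 (no SSE4.1 extracts needed).
static void ntStoreUnaligned128(long long *Dst, __m128i V) {
  long long Tmp[2];
  _mm_storeu_si128((__m128i *)Tmp, V); // ordinary unaligned store to stack
  _mm_stream_si64(Dst, Tmp[0]);        // MOVNTI
  _mm_stream_si64(Dst + 1, Tmp[1]);    // MOVNTI
}

The SSE4A route would instead store the halves directly from the xmm register with MOVNTSS/MOVNTSD, avoiding the round trip through GPRs.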

@wjristow (Collaborator)

Fixed in r363581.

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
This issue was closed.