Example:

```
void foo(unsigned *A, unsigned *B, unsigned Elts) {
  for (unsigned I = 0; I < Elts; ++I) {
    unsigned X = A[I];
    __builtin_nontemporal_store(X, &B[I]);
  }
}
```

> clang -O2 -march=btver2 -S -o -

```
.LBB0_8:                                # %for.body
	movl	(%rdi,%rcx,4), %edx
	movntil	%edx, (%rsi,%rcx,4)     # <<== OK. Nontemporal store.
	incq	%rcx
	cmpq	%rcx, %rax
	jne	.LBB0_8
	retq
.LBB0_5:                                # %vector.ph
	movl	%eax, %ecx
	xorl	%edx, %edx
	andl	$-32, %ecx
	.p2align	4, 0x90
.LBB0_6:                                # %vector.body
	vmovups	(%rdi,%rdx,4), %ymm0
	vmovups	32(%rdi,%rdx,4), %ymm1
	vmovups	64(%rdi,%rdx,4), %ymm2
	vmovups	96(%rdi,%rdx,4), %ymm3
	vmovups	%ymm0, (%rsi,%rdx,4)    # <<== WRONG. Temporal vector store.
	vmovups	%ymm1, 32(%rsi,%rdx,4)  # Same...
	vmovups	%ymm2, 64(%rsi,%rdx,4)  # Same...
	vmovups	%ymm3, 96(%rsi,%rdx,4)  # Same...
	addq	$32, %rdx
	cmpq	%rdx, %rcx
	jne	.LBB0_6
# %bb.7:                                # %middle.block
	cmpq	%rax, %rcx
	jne	.LBB0_8
```

On X86, (V)MOVNTPS can be used to perform nontemporal vector stores. However, VMOVNTPS requires that the destination memory operand be 16-byte aligned (for the 128-bit stores) or 32-byte aligned (for the 256-bit stores). In this example, the store instructions are marked as 4-byte aligned. When the loop vectorizer kicks in, it generates a vector loop body in which all vector stores are correctly annotated with the "!nontemporal" metadata flag and alignment 4. However, x86 has no support for unaligned nontemporal stores, so ISel falls back to selecting normal (i.e. "temporal") unaligned stores (see the VMOVUPS instructions in the assembly above).

When vectorizing a memcpy-like loop, we should check whether the target supports unaligned nontemporal vector stores before transforming the loop. Otherwise, we risk accidentally introducing temporal stores that pollute the caches.
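For illustration, here is a minimal sketch of the kind of legality check described above, written as LLVM-style C++. The TTI hook name `isLegalNTStore` and the helper around it are assumptions for illustration, not a quote from the actual patch:

```
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Hypothetical sketch: before vectorizing, reject a store if widening it
// to VecTy would produce a nontemporal store the target cannot lower.
// A 4-byte-aligned scalar NT store becomes a 4-byte-aligned vector NT
// store, which x86 cannot select as (V)MOVNTPS (needs 16/32-byte align).
static bool canVectorizeNTStore(StoreInst *SI, VectorType *VecTy,
                                const TargetTransformInfo &TTI) {
  if (!SI->getMetadata(LLVMContext::MD_nontemporal))
    return true; // Not nontemporal; no extra constraint.
  // Assumed TTI hook: is a nontemporal store of this type/alignment legal?
  return TTI.isLegalNTStore(VecTy, SI->getAlign());
}
```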
Proposed patch at: https://reviews.llvm.org/D61764
Warren's patch solves the LV bug, which is the main issue to address. In the backend we do have some scalarization options as a fallback:

- For SSE4A targets we can at least scalarize unaligned vectors to use MOVNTSD/MOVNTSS, shuffling/splitting the xmm/ymm data.
- For SSE2 (non-SSE4A) targets we could use MOVNTI, although this would involve moving the vector over to GPRs one i32/i64 at a time (see the sketch below).

Both are slow, but better than polluting the caches.
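As a rough illustration of the MOVNTI route, a memcpy-like loop can be written by hand with the SSE2 `_mm_stream_si32` intrinsic, which stores one i32 at a time with no alignment requirement. This is a sketch of the idea, not what the backend would actually emit:

```
#include <emmintrin.h> // SSE2: _mm_stream_si32 (MOVNTI), _mm_sfence

// Copy Elts 32-bit values with nontemporal scalar stores. MOVNTI has no
// alignment requirement, so this works even when B is only 4-byte
// aligned; it avoids cache pollution at the cost of scalar throughput.
void copy_nt_scalar(const unsigned *A, unsigned *B, unsigned Elts) {
  for (unsigned I = 0; I < Elts; ++I)
    _mm_stream_si32((int *)&B[I], (int)A[I]);
  _mm_sfence(); // Order the weakly-ordered NT stores with later accesses.
}
```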
Fixed in r363581.