LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 45808 - Suboptimal code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))
Summary: Suboptimal code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))
Status: NEW
Alias: None
Product: libraries
Classification: Unclassified
Component: Backend: X86
Version: trunk
Hardware: PC Linux
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-05 13:03 PDT by Nemo Publius
Modified: 2020-07-01 14:18 PDT
CC List: 7 users

See Also:
Fixed By Commit(s): fe6f5ba0bffd, b8a725274c22, 3521ecf1f8a3


Attachments

Description Nemo Publius 2020-05-05 13:03:37 PDT
Related: Bug #45806 and https://stackoverflow.com/q/61601902/

I am trying to produce an AVX2 mask with all-ones in the lower lane and all-zeroes in the upper lane of a YMM register. The code I am using is:

    __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));

This should produce a single instruction like `vpcmpeqd %xmm0,%xmm0,%xmm0`, but Clang insists on putting the value into memory and loading it.
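
For reference, here is a minimal standalone reproducer of just the mask construction (the helper name `lower_lane_mask` is my own, not from the report):

    #include <immintrin.h>

    // The low 128-bit lane should be all-ones (ideally a single
    // vpcmpeqd %xmm0,%xmm0,%xmm0); the zero upper lane comes for free, since any
    // VEX-encoded 128-bit instruction already zeroes the upper half of the YMM register.
    __m256i lower_lane_mask(void) {
        return _mm256_zextsi128_si256(_mm_set1_epi8(-1));
    }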

The behavior in context is even more odd:

    __m256i minmax(__m256i v1, __m256i v2)
    {
        __m256i comp = _mm256_cmpgt_epi64(v1, v2);
        __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));
        return _mm256_blendv_epi8(v2, v1, _mm256_xor_si256(comp, mask));
    }

This goes through a bunch of contortions with extracting, shifting, and expanding 128-bit registers, when the result I want seems pretty straightforward.

Godbolt example: https://gcc.godbolt.org/z/GPhJ6s
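
For clarity, a scalar model of what minmax is meant to compute (my own reading of the intrinsics, purely illustrative): the xor with the lower-lane-only all-ones mask inverts the comparison for elements 0 and 1, so the result is the element-wise min in the low 128-bit lane and the element-wise max in the high lane.

    #include <cstdint>

    // Assumed scalar semantics of minmax() above, for illustration only.
    void minmax_scalar(const int64_t v1[4], const int64_t v2[4], int64_t out[4]) {
        for (int i = 0; i < 4; ++i) {
            bool gt = v1[i] > v2[i];            // _mm256_cmpgt_epi64
            bool pick_v1 = (i < 2) ? !gt : gt;  // xor with the lower-lane mask
            out[i] = pick_v1 ? v1[i] : v2[i];   // _mm256_blendv_epi8(v2, v1, control)
        }
    }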
Comment 1 Simon Pilgrim 2020-05-05 13:06:29 PDT
define <4 x i64> @_Z6minmaxDv4_xS_(<4 x i64> %0, <4 x i64> %1) {
  %3 = icmp sgt <4 x i64> %0, %1
  %4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
  %5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
  ret <4 x i64> %5
}
Comment 2 Roman Lebedev 2020-05-05 13:21:15 PDT
(In reply to Simon Pilgrim from comment #1)
> define <4 x i64> @_Z6minmaxDv4_xS_(<4 x i64> %0, <4 x i64> %1) {
>   %3 = icmp sgt <4 x i64> %0, %1
>   %4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
So basically the code that handles materialization of all-ones constants
as pcmpeq needs to be taught that if the lower portion is all-ones and the rest
is zeros, it might still be profitable, I'm guessing?

>   %5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
>   ret <4 x i64> %5
> }
Comment 3 Simon Pilgrim 2020-05-05 14:04:49 PDT
We might need to improve PromoteMaskArithmetic to better handle selects.

CC'ing Florian who did a load of improvements in D72524.
Comment 4 Simon Pilgrim 2020-05-05 14:08:09 PDT
(In reply to Roman Lebedev from comment #2)
> So basically the code that is handling materialization of all-ones constants
> as pcmpeq needs to be taught that if lower portion is all-ones and the rest
> is zeros, it might still be profitable, i'm guessing?

[Bug #42653] discusses something similar for rematerializable lower 'allones' subvector masks, once we avoid the unnecessary packss/pmovsx.
Comment 5 Simon Pilgrim 2020-05-08 03:00:22 PDT
rGfe6f5ba0bffd - added test case
rGb8a725274c22 - fixed PACKSS promotion issues

Current AVX2 Codegen:

.LCPI0_0:
        .quad   1                       # 0x1
        .quad   1                       # 0x1
        .quad   0                       # 0x0
        .quad   0                       # 0x0
_Z6minmaxDv4_xS_:                       # @_Z6minmaxDv4_xS_
        vpcmpgtq        %ymm1, %ymm0, %ymm2
        vpxor   .LCPI0_0(%rip), %ymm2, %ymm2
        vpsllq  $63, %ymm2, %ymm2
        vblendvpd       %ymm2, %ymm0, %ymm1, %ymm0
        retq
Comment 6 Roman Lebedev 2020-05-08 04:37:58 PDT
(In reply to Simon Pilgrim from comment #5)
> rGfe6f5ba0bffd - added test case
> rGb8a725274c22 - fixed PACKSS promotion issues
> 
> Current AVX2 Codegen:
> 
> .LCPI0_0:
>         .quad   1                       # 0x1
>         .quad   1                       # 0x1
>         .quad   0                       # 0x0
>         .quad   0                       # 0x0
> _Z6minmaxDv4_xS_:                       # @_Z6minmaxDv4_xS_
>         vpcmpgtq        %ymm1, %ymm0, %ymm2
produces either -1 or 0
>         vpxor   .LCPI0_0(%rip), %ymm2, %ymm2
Inverts lowest bit only
>         vpsllq  $63, %ymm2, %ymm2
moves lowest bit into highest bit
>         vblendvpd       %ymm2, %ymm0, %ymm1, %ymm0
uses highest bit to control blending

Can't we get rid of the vpsllq by using -1 instead of 1 in xor?

>         retq
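
A small sketch to spell out that suggestion (assuming the usual 0 / ~0 lane values produced by vpcmpgtq):

    #include <cstdint>

    // XORing a 0 / ~0 lane with 1 flips only bit 0, which is why the extra
    // vpsllq $63 is needed to move that bit into the sign position for vblendvpd.
    // XORing with -1 inverts the whole lane, so the sign bit is already correct.
    static_assert((~UINT64_C(0) ^ UINT64_C(1)) == UINT64_C(0xFFFFFFFFFFFFFFFE),
                  "xor 1 only flips bit 0");
    static_assert((~UINT64_C(0) ^ ~UINT64_C(0)) == UINT64_C(0),
                  "xor -1 inverts the whole lane");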
Comment 7 Simon Pilgrim 2020-05-08 04:58:01 PDT
That's the next step - there is plenty of code that tries to do that kind of thing - and nearly all of it ignores vectors :-)
Comment 8 Simon Pilgrim 2020-06-22 07:03:06 PDT
Initial patch: https://reviews.llvm.org/D82257

This will make sure we're using -1/0 sign masks, but it doesn't yet materialize the constant with VPCMPEQ xmm (which implicitly zeroes the upper elements).
Comment 9 Simon Pilgrim 2020-07-01 14:18:55 PDT
Current AVX2 Codegen:

.LCPI0_0:
        .quad   -1                      # 0xffffffffffffffff
        .quad   -1                      # 0xffffffffffffffff
        .quad   0                       # 0x0
        .quad   0                       # 0x0
_Z6minmaxDv4_xS_:                       # @_Z6minmaxDv4_xS_
        vpcmpgtq        %ymm1, %ymm0, %ymm2
        vpxor   .LCPI0_0(%rip), %ymm2, %ymm2
        vblendvpd       %ymm2, %ymm0, %ymm1, %ymm0
        retq