49055 – [X86] or/add-like mismatch in uchar 'splat' to uint

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 49055 - [X86] or/add-like mismatch in uchar 'splat' to uint

Summary: [X86] or/add-like mismatch in uchar 'splat' to uint

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Scalar Optimizations (show other bugs)
Version:	trunk
Hardware:	PC Windows NT

Importance:	P enhancement
Assignee:	Sanjay Patel

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-02-05 04:12 PST by Simon Pilgrim
Modified:	2021-02-09 04:48 PST (History)
CC List:	4 users (show)

See Also:
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Simon Pilgrim 2021-02-05 04:12:56 PST

https://simd.godbolt.org/z/E15MrP

void loop_add(const unsigned char* __restrict pIn, unsigned int* __restrict pOut, int s) {
  for (int i = 0; i < s; i++) {
    unsigned int pixelChar = pIn[i];
    unsigned int pixel = pixelChar + (pixelChar << 8) + (pixelChar << 16) + (255 << 24);
    pOut[i] = pixel;
  }
}

This successfully combines to a zext+mul+add (I think technically it could be zext+mul+or)?

But this:

void loop_or(const unsigned char* __restrict pIn, unsigned int* __restrict pOut, int s) {
  for (int i = 0; i < s; i++) {
    unsigned int pixelChar = pIn[i];
    unsigned int pixel = pixelChar | (pixelChar << 8) | (pixelChar << 16) | (255 << 24);
    pOut[i] = pixel;
  }
}

completely fails and we end up with a long shift+or chain.

Comment 1 Sanjay Patel 2021-02-05 07:22:22 PST

The root question - which of these is canonical?

define i32 @src(i8 %x) {
  %zx = zext i8 %x to i32
  %s1 = shl nuw nsw i32 %zx, 8
  %s2 = shl nuw nsw i32 %zx, 16
  %or = or i32 %s1, %s2
  ret i32 %or
}

define i32 @tgt(i8 %x) {
  %zx = zext i8 %x to i32
  %or = mul i32 %zx, 65792
  ret i32 %or
}

https://alive2.llvm.org/ce/z/dqT39Y

The difference in instcombine is in InstCombinerImpl::SimplifyUsingDistributiveLaws() -> getBinOpsForFactorization().

We recognize add-of-shifts as a factorized multiply, but not or-of-shifts.

We're arguing for "mul" in this example, and we already do that partially, so the backend should be equipped to decompose back to shift+logic.

Interestingly, the -reassociate pass already has a reverse transform from -instcombine that turns the 'or' into 'add': "ShouldConvertOrWithNoCommonBitsToAdd"

If we allow shift opcodes there, that's probably the quickest fix.

Comment 2 Sanjay Patel 2021-02-05 10:59:34 PST

A little git digging turns up that the -reassociate logic was added relatively recently:
https://reviews.llvm.org/rG70472f3
https://reviews.llvm.org/rG93f3d7f

So it seems fine to add 'shl' to the opcode list. It's awkward that we have opposing canonicalizations in different passes, but I don't have any better ideas.

I'll add a reassociate test and a phase ordering test to prevent this from regressing.

Comment 3 Sanjay Patel 2021-02-09 04:48:09 PST

The small fix to -reassociate improves, but does not solve the larger example shown here:
https://reviews.llvm.org/rG6fd91be35444

We'll need to adjust reassociate and/or instcombine with at least 1 more change to get this to canonicalize as expected.