MSVC rotate intrinsics don't (just) generate rotates on x86-64 #36735

Closed
llvmbot opened this issue May 9, 2018 · 18 comments
Labels: bugzilla (Issues migrated from bugzilla), clang:codegen (IR generation bugs: mangling, exceptions, etc.)

Comments

@llvmbot
Member

llvmbot commented May 9, 2018

Bugzilla Link: 37387
Resolution: FIXED
Resolved on: Nov 26, 2018 06:35
Version: 6.0
OS: Windows NT
Reporter: LLVM Bugzilla Contributor
CC: @lattner, @topperc, @efriedma-quic, @gregbedwell, @RKSimon, @rnk, @nagisa, @rotateright, @Trass3r

Extended Description

This simple test:

// ---- begin
#include <intrin.h>

extern "C" unsigned long long f(unsigned long long a, int b)
{
return _rotl64(a, b);
}

extern "C" unsigned long long g(unsigned long long a, int b)
{
return (a << (b & 63)) | (a >> (-b & 63));
}
// ---- end

produces (on x86-64 using clang 6.0 release; only quoting the relevant bits):

---- begin

f: # @f

%bb.0:

movq	%rcx, %r8
andl	$63, %edx
movq	%r8, %rax
movl	%edx, %ecx
rolq	%cl, %rax
testl	%edx, %edx
cmoveq	%r8, %rax
retq

g: # @g

%bb.0:

movq	%rcx, %rax
movl	%edx, %ecx
rolq	%cl, %rax
retq

---- end

The corresponding IR is:

; ---- begin
; Function Attrs: norecurse nounwind readnone sspstrong uwtable
define i64 @f(i64, i32) local_unnamed_addr #0 {
%3 = and i32 %1, 63
%4 = zext i32 %3 to i64
%5 = sub nsw i64 64, %4
%6 = shl i64 %0, %4
%7 = lshr i64 %0, %5
%8 = or i64 %7, %6
%9 = icmp eq i32 %3, 0
%10 = select i1 %9, i64 %0, i64 %8
ret i64 %10
}

; Function Attrs: norecurse nounwind readnone sspstrong uwtable
define i64 @g(i64, i32) local_unnamed_addr #0 {
%3 = and i32 %1, 63
%4 = zext i32 %3 to i64
%5 = shl i64 %0, %4
%6 = sub nsw i32 0, %1
%7 = and i32 %6, 63
%8 = zext i32 %7 to i64
%9 = lshr i64 %0, %8
%10 = or i64 %5, %9
ret i64 %10
}

; ---- end

The problem is the expansion chosen for the rotr/rotl intrinsics in CGBuiltin.cpp (CodeGenFunction::EmitBuiltinExpr), presumably picked to avoid the undefined behavior of the right shift by (64 - b) when b is 0.

Note that the alternative expansion for rotate-left used in the code for g avoids the problematic select, is well-defined, and already gets matched to ROL (in the x86-64 backend, anyway), so it seems like a good replacement.
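
For reference, a C-level paraphrase of the IR emitted for f above (a reading of that IR, not a quote of CGBuiltin.cpp): the early return corresponds to the select, which the backend lowers to the test/cmove pair in the assembly.

// ---- begin (illustrative sketch, paraphrasing the IR for f, not the actual CGBuiltin.cpp code)
static unsigned long long rotl64_expanded(unsigned long long a, int b)
{
    unsigned n = b & 63;               // %3 in the IR: masked rotate amount
    if (n == 0)
        return a;                      // the select: avoids a right shift by 64
    return (a << n) | (a >> (64 - n)); // the shl/lshr/or sequence
}
// ---- end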

@llvmbot
Member Author

llvmbot commented May 9, 2018

Forgot to mention: this is when using clang-cl; the test file was compiled via "clang-cl -c -O2 -FA clang_cl_rotl.cpp".

@topperc
Collaborator

topperc commented May 9, 2018

Should be easy enough to fix the emission code in CGBuiltin.cpp

But I think we may need some more work to recognize rotl8/rotl16

This doesn't work.

extern "C" unsigned char g(unsigned char a, int b)
{
return (a << (b & 7)) | (a >> (-b & 7));
}

I think we need to teach InstCombiner::narrowRotate about this form.

@rotateright
Contributor

IMO, rotates are up there with min/max for: "Is it too late to just add an IR intrinsic for this?" :)

If we can solve the original problem in clang, that would be great.

@topperc
Collaborator

topperc commented May 9, 2018

Patch for the intrinsic issue is here https://reviews.llvm.org/D46656

We still have problems in the middle end with short and char rotates written directly in C. I may file a separate bug for that.

@topperc
Collaborator

topperc commented May 10, 2018

CGBuiltin.cpp changed in r331943

@rnk
Collaborator

rnk commented May 10, 2018

At this point, I'd favor an intrinsic for the rotates. The likelihood that someone actually cares whether SimplifyDemandedBits works on a rotate is close to zero. It's far more likely that they'll open more bugs when our "optimizations" break backend pattern matching.

@llvmbot
Member Author

llvmbot commented May 10, 2018

Speaking as a somewhat frustrated user: yes, an intrinsic would help.

Specifically, the problem with pattern-matching these relatively intricate trees deep in the back-end is that a lot happens between the code the user writes and said pattern-matching, and there are myriad ways for things to go awry along the way.

For a recent example derived from the exact same code that led to this bug report, see #34835.

I just looked at the generated code with the replacement _rotl64 (function "g" in the above snippet), and noticed that while g fixes my toy repro, I still get poor code in the actual use case.

Here's a slightly more complicated repro that illustrates the issue I'm seeing in my real use case:

// ---- begin
#include <stdint.h>
#include <string.h>

static uint64_t rotl64(uint64_t a, int b)
{
    return (a << (b & 63)) | (a >> (-b & 63));
}

uint64_t hash(const void *p, int rot_amount)
{
    static const uint64_t mult = 0x6C7F3EDD8298DF37; // random 64-bit prime
    uint32_t t;

    memcpy(&t, p, sizeof(t));
    uint64_t prod = (uint64_t)t * mult;
    return prod + rotl64(prod, rot_amount);
}

void bulk_insert(uint32_t *hash_table, uint32_t mask, int rot, const uint8_t *p, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        hash_table[hash(p + i, rot) & mask] = (uint32_t)i;
}
// ---- end

This is all for a simple multiplicative hash: grab 4 bytes (in this case), multiply by some random constant to mix the bits, then build the hash from both the top and bottom bits of the product. "rot_amount" is chosen by the caller to suit a (variable) power-of-2 hash table size.

Again with clang-cl 6.0 final, compiled via "clang-cl -c -O2 -FA hash_problem.cpp", I get (just the relevant bits):

---- begin

"?hash@@YA_KPEBXH@Z": # @"\01?hash@@YA_KPEBXH@Z"

%bb.0:

movl	(%rcx), %eax
movabsq	$7818036599238221623, %r8 # imm = 0x6C7F3EDD8298DF37
imulq	%rax, %r8
movq	%r8, %rax
movl	%edx, %ecx
rolq	%cl, %rax
addq	%r8, %rax
retq

---- end

So far, so good. But then if I look at bulk_insert (SEH annotations elided to reduce noise):

---- begin

"?bulk_insert@@YAXPEAIIHPEBE_K@Z": # @"\01?bulk_insert@@YAXPEAIIHPEBE_K@Z"

%bb.0:

pushq	%r15
pushq	%r14
pushq	%r12
pushq	%rsi
pushq	%rdi
pushq	%rbx
movl	%r8d, %r10d
movq	%rcx, %r8
movq	88(%rsp), %r15
testq	%r15, %r15
je	.LBB1_5

%bb.1:

movabsq	$7818036599238221623, %r11 # imm = 0x6C7F3EDD8298DF37
movl	%r10d, %eax
andl	$63, %eax
negl	%r10d
andl	$63, %r10d
movl	%edx, %r12d
movl	%r15d, %r14d
andl	$1, %r14d
cmpq	$1, %r15
jne	.LBB1_6

%bb.2:

xorl	%esi, %esi
testq	%r14, %r14
jne	.LBB1_4
jmp	.LBB1_5

.LBB1_6:
subq %r14, %r15
xorl %esi, %esi
.p2align 4, 0x90
.LBB1_7: # =>This Inner Loop Header: Depth=1
movl (%r9,%rsi), %ebx
imulq %r11, %rbx
movq %rbx, %rdi
movl %eax, %ecx
shlq %cl, %rdi
movq %rbx, %rdx
movl %r10d, %ecx
shrq %cl, %rdx
orl %edx, %edi
addl %edi, %ebx
andq %r12, %rbx
movl %esi, (%r8,%rbx,4)
movl 1(%r9,%rsi), %edx
imulq %r11, %rdx
movq %rdx, %rdi
movl %eax, %ecx
shlq %cl, %rdi
movq %rdx, %rbx
movl %r10d, %ecx
shrq %cl, %rbx
orl %ebx, %edi
addl %edi, %edx
andq %r12, %rdx
leal 1(%rsi), %ecx
movl %ecx, (%r8,%rdx,4)
addq $2, %rsi
cmpq %rsi, %r15
jne .LBB1_7

%bb.3:

testq	%r14, %r14
je	.LBB1_5

.LBB1_4:
movl (%r9,%rsi), %ebx
imulq %r11, %rbx
movq %rbx, %rdi
movl %eax, %ecx
shlq %cl, %rdi
movq %rbx, %rax
movl %r10d, %ecx
shrq %cl, %rax
orl %eax, %edi
addl %edi, %ebx
andq %r12, %rbx
movl %esi, (%r8,%rbx,4)
.LBB1_5:
popq %rbx
popq %rdi
popq %rsi
popq %r12
popq %r14
popq %r15
retq

---- end

The computations of "rot_amount & 63" and "-rot_amount & 63" are identified as loop-invariant and hoisted, and apparently this makes the back-end pattern matching fail to recognize the rotates again, so now I get

movq	%rbx, %rdi
movl	%eax, %ecx
shlq	%cl, %rdi
movq	%rbx, %rax
movl	%r10d, %ecx
shrq	%cl, %rax
orl	%eax, %edi

instead of the expected

    movq    %rbx, %rdi
    movl    %eax, %ecx
    rolq    %cl, %rdi

And this is still attempting to be a small test case; the actual code I care about is more complicated, and presumably contains even more latent ways for this to go wrong.

As I see it, there are basically two options here:

  1. Keep tweaking the frontend sequences and backend DAG pattern matching for every newly discovered instance of this, until eventually we run out of known failure modes and declare success? :)
  2. Add rotate intrinsics, so there's no risk of other optimizations reshaping the DAG into something the pattern rules don't recognize.

Furthermore, if you're going to rely on pattern matching to recognize rotates, I think the earlier in compilation (and the closer to the original source code) it happens, the better. If I know that the expression for "g" in my original report reliably turns into a rotate-left (when available), that's fine and I can work with that. But there's no way for me to write code in a source language such that I know the resulting DAG will have the rotate patterns I want.

@rotateright
Contributor

The current argument is that there's no compelling case for rotate IR intrinsics because the backend can always match the (or (shl), (shr (sub))) sequence in its various forms.

Here's the first search hit I got (2012) on this topic:
https://groups.google.com/forum/#!topic/llvm-dev/9ZreOEi2znI
(cc'ing Eli and Chris for thoughts because they replied in that thread)

But the example given in comment 7 - where the "opposite" shift amount is hoisted out of a loop as loop-invariant - seems very compelling to me. Given the block-at-a-time limit of the current DAG, there is no way to preserve that pattern besides moving/duplicating instructions in IR (CodeGenPrepare).

Let me minimize the example to reduce distractions:

unsigned rotateRight(unsigned a, int b) {
    return (a >> b) | (a << (32 - b)); // unsafe, but minimal logic
}

void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
    for (unsigned i = 0; i < N; ++i)
        x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is loop invariant
}


rotateRight is what we expect in IR and x86 asm:

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops -emit-llvm
...
%shr = lshr i32 %a, %b
%sub = sub nsw i32 32, %b
%shl = shl i32 %a, %sub
%or = or i32 %shl, %shr

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops |grep ror
rorl %cl, %edi


rotateInLoop can't be matched by DAGCombiner::MatchRotate():

for.loop.preheader:
%sub = sub nsw i32 32, %b <--- can't see this in the DAG
%wide.trip.count = zext i32 %N to i64
br label %for.body

for.loop:
...
%shr = lshr i32 %a, %b
%shl = shl i32 %a, %sub
%or = or i32 %shr, %shl <--- that's not a rotate w/o seeing the sub

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops | grep sh
shrq %cl, %r11
shll %cl, %esi

@lattner
Collaborator

lattner commented May 11, 2018

I have not been involved with core LLVM development for some time, but here are 2c:

  1. GlobalISel will eventually fix the "not being able to see across blocks" issue that you're referring to. In the meantime, if this is an important case, you can enhance CodeGenPrepare to handle it.

  2. Adding intrinsics for things like rotates and min/max has the benefit of allowing instcombine (and frontends) to form them aggressively and pattern-match a bunch of things in a target-independent way. It has the downside that lots of other optimizations have to know about them (e.g. SimplifyDemandedBits) to avoid losing performance in other cases.

On balance, it seems like LLVM is doing the right thing here, we need to fix isel for LOTS of reasons, and this is underway. Doing the wrong thing here in the interim doesn't seem warranted.

@rotateright
Contributor

> I have not been involved with core LLVM development for some time, but here are 2c:
>
>   1. GlobalISel will eventually fix the "not being able to see across blocks" issue that you're referring to. In the meantime, if this is an important case, you can enhance CodeGenPrepare to handle it.
>
>   2. Adding intrinsics for things like rotates and min/max has the benefit of allowing instcombine (and frontends) to form them aggressively and pattern-match a bunch of things in a target-independent way. It has the downside that lots of other optimizations have to know about them (e.g. SimplifyDemandedBits) to avoid losing performance in other cases.
>
> On balance, it seems like LLVM is doing the right thing here, we need to fix isel for LOTS of reasons, and this is underway. Doing the wrong thing here in the interim doesn't seem warranted.

Thanks, Chris! I haven't been following the GlobalISel progress, so I don't know how far away that is.

So if no intrinsics, we need at least the following to improve rotate matching:

  1. A CodeGenPrepare (CGP) transform to un-hoist the sub in comment 8.
  2. More logic in instcombine for the example in comment 2.

I think it's best if we open new bugs for those and link to other existing rotate bugs, so we have a better idea of how many variations of the patterns are out there.

@llvmbot
Member Author

llvmbot commented May 11, 2018

At the risk of beating a dead horse...

  1. There is a significant distinction between rotate-by-literal-constant and rotate-by-variable-amount here. Rotate-by-literal is easily turned into shifts and ORs, and less tricky to detect, because when the shift amounts are integer literals, there's no risk of them ending up in a different BB (or getting turned into a form the backend doesn't recognize by some other transform).

Rotate-by-variable-amount is not easy to express correctly using shifts, because the "obvious" (x << k) | (x >> (width - k)) runs into UB when k == 0. The non-constant expressions for the shift amounts also run a higher risk of getting transformed in some way that breaks the pattern the DAG rules look for. (A short example contrasting the two forms follows at the end of this comment.)

  2. There are not that many use cases for variable-amount rotates, but they're widely supported in CPUs and generally efficient, so I'd like to use them as a primitive when they're applicable.

Right now, if I want to actually get variable-amount rotates reliably in Clang, I have to use inline ASM wrapped in a bunch of platform #ifdefs. Inline ASM doesn't participate in other optimizations either, to put it mildly. :)

  3. Talking specifically about variable-amount shifts, how many optimizations actually touch these to begin with? SimplifyDemandedBits was mentioned, but the only transform I could find in there that does not require a constant shift amount is "the MSB of the result of an arithmetic shift is the same as the MSB of the original input".

It seems to me that the variable-amount form would be better served by an intrinsic, while constant-amount rotates benefit from the work done to optimize regular shifts and bitwise logic operations, and it's undesirable to have another class of shift operation to handle in a bunch of optimization passes.

Maybe the best of both worlds is to use an intrinsic for variable rotates, but turn rotates-by-constant into shifts and ORs, which other passes know how to work on?
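
To make the distinction in point 1 concrete, here is a small side-by-side sketch (illustrative names, not code from this report): the naive expansion has undefined behavior for a rotate amount of 0, while the masked form used throughout this report is defined for the amounts that matter and is the shape shown matching ROL earlier in the thread.

// ---- begin (illustrative sketch)
#include <stdint.h>

static uint64_t rotl64_naive(uint64_t x, int k)
{
    return (x << k) | (x >> (64 - k));         // UB when k == 0 (right shift by 64); assumes 0 < k < 64
}

static uint64_t rotl64_masked(uint64_t x, int k)
{
    return (x << (k & 63)) | (x >> (-k & 63)); // well-defined for amounts 0..63; the form matched to ROL above
}
// ---- end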

@lattner
Collaborator

lattner commented May 12, 2018

Fair enough; I haven't thought much about the variable-amount case. I agree that it is significantly different. If you'd like to push for it, please bring it up on llvm-dev. Thanks!

@Trass3r
Contributor

Trass3r commented Nov 11, 2018

Since the original issue has been fixed, should this be closed or turned into an "implement intrinsic" feature request?

@topperc
Collaborator

topperc commented Nov 11, 2018

We have already added the intrinsics. They are available in clang as:

__builtin_rotateleft8
__builtin_rotateleft16
__builtin_rotateleft32
__builtin_rotateleft64
__builtin_rotateright8
__builtin_rotateright16
__builtin_rotateright32
__builtin_rotateright64
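
A minimal usage sketch (the wrapper name, the __has_builtin guard, and the defensive masking are illustrative additions, not from this thread), showing how code like the reporter's rotl64 helper above could prefer the builtin where available and keep the masked shift/or idiom as a portable fallback:

// ---- begin (illustrative sketch)
#include <stdint.h>

#ifndef __has_builtin
#define __has_builtin(x) 0                     // for compilers without the feature-test macro
#endif

static inline uint64_t my_rotl64(uint64_t x, int k)
{
#if __has_builtin(__builtin_rotateleft64)
    return __builtin_rotateleft64(x, (uint64_t)(k & 63)); // clang's rotate builtin
#else
    return (x << (k & 63)) | (x >> (-k & 63));            // portable masked idiom
#endif
}
// ---- end

Both branches are intended to end up as a single rotate instruction on x86-64; the fallback is the same form shown matching ROL earlier in the thread.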

@nagisa
Member

nagisa commented Nov 11, 2018

It seems that Clang still codegens the old sequence for _rot{l,r} builtins rather than these new LLVM intrinsics, as can be seen in lib/CodeGen/CGBuiltin.cpp at lines 1822-1860.

@rotateright
Contributor

> It seems that Clang still codegens the old sequence for _rot{l,r} builtins rather than these new LLVM intrinsics, as can be seen in lib/CodeGen/CGBuiltin.cpp at lines 1822-1860.

Changing those was in the approved patch:
https://reviews.llvm.org/D50924

...but it caused a compiler crash on a Linux bot using gcc as the host compiler. I asked for help/policy on working around that, but I got no replies.

@rotateright
Contributor

> It seems that Clang still codegens the old sequence for _rot{l,r} builtins rather than these new LLVM intrinsics, as can be seen in lib/CodeGen/CGBuiltin.cpp at lines 1822-1860.

I don't see any bot failures caused by the clang change this time, so this should be fixed with:
https://reviews.llvm.org/rL347527

Note that there is still ongoing work to optimize/canonicalize funnel-shift and associated patterns. See related bugs to track those.

@llvmbot
Member Author

llvmbot commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#37421

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.