MSVC rotate intrinsics don't (just) generate rotates on x86-64 #36735
Comments
Forgot to mention: this is when using clang-cl; the test file was compiled via "clang-cl -c -O2 -FA clang_cl_rotl.cpp". |
Should be easy enough to fix the emission code in CGBuiltin.cpp. But I think we may need some more work to recognize rotl8/rotl16. This doesn't work:

extern "C" unsigned char g(unsigned char a, int b)

I think we need to teach InstCombiner::narrowRotate about this form. |
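The body of that snippet was not preserved above; presumably it is the 8-bit analogue of the masked-shift idiom used for g in the original report, something like this (an assumed reconstruction, not the exact code from the comment):

// ---- begin (hypothetical)
// 8-bit rotate-left written with masked shifts; the kind of narrow form
// InstCombiner::narrowRotate would need to recognize.
extern "C" unsigned char g(unsigned char a, int b)
{
    return (unsigned char)((a << (b & 7)) | (a >> (-b & 7)));
}
// ---- end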
IMO, rotates are up there with min/max for: "Is it too late to just add an IR intrinsic for this?" :) If we can solve the original problem in clang, that would be great. |
Patch for the intrinsic issue is here: https://reviews.llvm.org/D46656. We still have problems in the middle end with short and char rotates written directly in C; I may file a separate bug for that. |
CGBuiltin.cpp changed in r331943 |
At this point, I'd favor an intrinsic for the rotates. The likelihood that someone actually cares whether SimplifyDemandedBits works on rotates is close to zero. It's far more likely that they'll open more bugs when our "optimizations" break backend pattern matching. |
Speaking as a somewhat frustrated user: yes, an intrinsic would help. Specifically, the problem with pattern matching for these relatively intricate trees deep in the back end is that a lot of stuff happens between the code the user enters and said pattern matching, and there are myriad ways for things to go awry in the meantime. For a recent example derived from the exact same code that led to this bug report, see #34835.

I just looked at the generated code with the replacement _rotl64 (function "g" in the above snippet), and noticed that while g fixes my toy repro, I still get poor code in the actual use case. Here's a slightly more complicated repro that illustrates the issue I'm seeing in my real use case:

// ---- begin
static uint64_t rotl64(uint64_t a, int b)
uint64_t hash(const void *p, int rot_amount)
void bulk_insert(uint32_t *hash_table, uint32_t mask, int rot, const uint8_t *p, size_t count)
// ---- end

This is all for a simple multiplicative hash: grab 4 bytes (in this case), multiply by some random constant to mix the bits, then build the hash from both the top and bottom bits of the product. "rot_amount" here is chosen outside to suit a (variable) power-of-two hash table size. Again with clang-cl 6.0 FINAL, "clang-cl -c -O2 -FA hash_problem.cpp", I get (just the relevant bits):

---- begin
"?hash@@YA_KPEBXH@Z": # @"\01?hash@@YA_KPEBXH@Z"
%bb.0:
---- end

So far, so good. But then if I look at bulk_insert (SEH annotations elided to reduce noise):

---- begin
"?bulk_insert@@YAXPEAIIHPEBE_K@Z": # @"\01?bulk_insert@@YAXPEAIIHPEBE_K@Z"
%bb.0:
%bb.1:
%bb.2:
.LBB1_6:
%bb.3:
.LBB1_4:
---- end

The computations of "rot_amount & 63" and "-rot_amount & 63" are identified as loop-invariant and hoisted, and apparently this makes the back-end pattern matching fail to recognize the rotates again, so now I get separate shifts and ORs in the loop body instead of the expected rotate instructions.

And this is still an attempt at a small test case; the actual code I care about is more complicated, and presumably contains even more latent ways for this to go wrong. As far as I can see, there are basically two options here:
Furthermore, if you're going to rely on pattern matching to recognize rotates, I think the earlier in compilation (and the closer to the original source code) it happens, the better. If I know that the expression for "g" in my original report reliably turns into a rotate-left (when available), that's fine and I can work with that. But there's no way for me to write code in a source language such that I know the resulting DAG will have the rotate patterns I want. |
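For reference, here is a hypothetical reconstruction of the repro described in the comment above. The original function bodies were lost in formatting, so the mixing constant and the bulk_insert details are assumptions; only the overall shape (masked-shift rotate, multiplicative hash, loop with a variable rotate amount) follows the description:

// ---- begin (hypothetical)
#include <stdint.h>
#include <string.h>

static uint64_t rotl64(uint64_t a, int b)
{
    // UB-free variable rotate: both shift amounts are masked into 0..63.
    return (a << (b & 63)) | (a >> (-b & 63));
}

uint64_t hash(const void *p, int rot_amount)
{
    uint32_t bits;
    memcpy(&bits, p, sizeof(bits));                              // grab 4 bytes
    uint64_t product = (uint64_t)bits * 0x9E3779B97F4A7C15ull;   // multiply by an arbitrary odd constant (assumed)
    return rotl64(product, rot_amount);                          // combine high and low bits of the product
}

void bulk_insert(uint32_t *hash_table, uint32_t mask, int rot,
                 const uint8_t *p, size_t count)
{
    // The loop makes "rot & 63" and "-rot & 63" loop-invariant, which is
    // what triggers the hoisting problem described in the comment above.
    for (size_t i = 0; i < count; ++i)
        hash_table[hash(p + i, rot) & mask] = (uint32_t)i;
}
// ---- end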
The current argument is that there's no compelling case for rotate IR intrinsics because the backend can always match the (or (shl), (shr (sub))) sequence in its various forms. Here's the first search hit I got (2012) on this topic:

But the example given in comment 7 - where we constant hoist the opposite shift amount out of a loop - seems very compelling to me. Given the block-at-a-time limit of the current DAG, there is no way to preserve that pattern besides moving/duplicating instructions in IR (CodeGenPrepare). Let me minimize the example to reduce distractions:

unsigned rotateRight(unsigned a, int b) {
void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {

rotateRight is what we expect in IR and x86 asm:

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops -emit-llvm
$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops | grep ror

rotateInLoop can't be matched by DAGCombiner::MatchRotate():

for.loop.preheader:
for.loop:

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops | grep sh |
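A plausible form of the minimized example (the bodies were not preserved in this archive; this reconstruction is an assumption based on the signatures and the surrounding discussion):

// ---- begin (hypothetical)
unsigned rotateRight(unsigned a, int b) {
  // Masked shift amounts keep this well-defined for every b.
  return (a >> (b & 31)) | (a << (-b & 31));
}

void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
  // LICM hoists "b & 31" and "-b & 31" out of the loop, so the rotate
  // pattern no longer lives in a single basic block and
  // DAGCombiner::MatchRotate() cannot see it.
  for (unsigned i = 0; i < N; ++i)
    x[i] = rotateRight(a[i], b);
}
// ---- end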
I have not been involved with core LLVM development for some time, but here are 2c:
On balance, it seems like LLVM is doing the right thing here; we need to fix isel for LOTS of reasons, and that work is underway. Doing the wrong thing here in the interim doesn't seem warranted. |
Thanks, Chris! I haven't been following the GlobalISel progress, so I don't know how far away that is. So if no intrinsics, we need at least the following to improve rotate matching:
I think it's best if we open new bugs for those and link to other existing rotate bugs, so we have a better idea of how many variations of the patterns are out there. |
At the risk of beating a dead horse...
Rotate-by-variable-amount is not easy to express correctly using shifts because the "obvious" (x << k) | (x >> (width - k)) runs into UB when k==0. The non-constant expressions for the shift amounts run a higher risk of getting transformed in some way that breaks the pattern the DAG rules look for.
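Concretely (a minimal illustration of the two formulations, not code taken from this thread):

// ---- begin (illustration)
#include <stdint.h>

// "Obvious" rotate: undefined behavior when k == 0, because x >> 32 is UB for a 32-bit type.
uint32_t rotl_naive(uint32_t x, unsigned k) {
    return (x << k) | (x >> (32 - k));
}

// Masked form: well-defined for every k, but relies on the backend recognizing the pattern.
uint32_t rotl_masked(uint32_t x, unsigned k) {
    return (x << (k & 31)) | (x >> (-k & 31));
}
// ---- end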
Right now, if I want to actually get variable-amount rotates reliably in Clang, I have to use inline ASM wrapped in a bunch of platform #ifdefs. Inline ASM doesn't participate in other optimizations either, to put it mildly. :)
It seems to me that the variable-amount form would be better served by an intrinsic, while constant-amount rotates already benefit from the work done to optimize regular shifts and bitwise logic operations, and it's undesirable to add another class of shift operation that a bunch of optimization passes would have to handle. Maybe the best of both worlds is to use an intrinsic for variable rotates, but turn rotates-by-constant into shifts and ORs, which other passes already know how to work on? |
Fair enough; I haven't thought much about the variable-amount case. I agree that it is significantly different. If you'd like to push for it, please bring it up on llvm-dev, thanks! |
Since the original issue has been fixed, should this be closed or turned into an "implement intrinsic" feature request? |
We have already added the intrinsic. It's exposed in clang as __builtin_rotateleft8 |
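For anyone reading this later, a minimal usage example; this assumes a Clang recent enough to provide these builtins (they come in 8/16/32/64-bit rotate-left and rotate-right variants):

// ---- begin
// Variable-amount rotates with no UB and no reliance on pattern matching.
unsigned char rotl8(unsigned char a, unsigned char b) {
    return __builtin_rotateleft8(a, b);
}

unsigned long long rotl64_builtin(unsigned long long a, unsigned long long b) {
    return __builtin_rotateleft64(a, b);
}
// ---- end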
It seems that Clang still codegens the old sequence for |
Changing those was in the approved patch: ...but it caused a compiler crash on a Linux bot using gcc as the host compiler. I asked for help/policy on working around that, but I got no replies. |
I don't see any bot failures caused by the clang change this time, so this should be fixed with:

Note that there is still ongoing work to optimize/canonicalize funnel-shift and associated patterns. See related bugs to track those. |
mentioned in issue llvm/llvm-bugzilla-archive#37421 |
Extended Description
This simple test:
// ---- begin
#include <intrin.h>
extern "C" unsigned long long f(unsigned long long a, int b)
{
return _rotl64(a, b);
}
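// Alternative formulation: masking both shift amounts keeps the expression
// well-defined for all b and lets the backend match it to a rotate.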
extern "C" unsigned long long g(unsigned long long a, int b)
{
return (a << (b & 63)) | (a >> (-b & 63));
}
// ---- end
produces (on x86-64 using clang 6.0 release; only quoting the relevant bits):
---- begin
f: # @f
%bb.0:
g: # @g
%bb.0:
---- end
The corresponding IR is:
; ---- begin
; Function Attrs: norecurse nounwind readnone sspstrong uwtable
define i64 @f(i64, i32) local_unnamed_addr #0 {
%3 = and i32 %1, 63
%4 = zext i32 %3 to i64
%5 = sub nsw i64 64, %4
%6 = shl i64 %0, %4
%7 = lshr i64 %0, %5
%8 = or i64 %7, %6
%9 = icmp eq i32 %3, 0
%10 = select i1 %9, i64 %0, i64 %8
ret i64 %10
}
; Function Attrs: norecurse nounwind readnone sspstrong uwtable
define i64 @g(i64, i32) local_unnamed_addr #0 {
%3 = and i32 %1, 63
%4 = zext i32 %3 to i64
%5 = shl i64 %0, %4
%6 = sub nsw i32 0, %1
%7 = and i32 %6, 63
%8 = zext i32 %7 to i64
%9 = lshr i64 %0, %8
%10 = or i64 %5, %9
ret i64 %10
}
; ---- end
The problem is the expansion chosen for the rotr/rotl intrinsics in CGBuiltin.cpp (CodeGenFunction::EmitBuiltinExpr), presumably picked to avoid implementation-specific behavior from the right shift by 64 - b.
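In C terms, the emitted expansion for f is roughly the following (a paraphrase of the IR shown above, not the actual code in CGBuiltin.cpp):

// ---- begin (illustration)
#include <stdint.h>

uint64_t f_expansion(uint64_t a, int b) {
    unsigned c = (unsigned)b & 63;
    if (c == 0)
        return a;                       // the select in the IR above
    return (a << c) | (a >> (64 - c));  // the shift by 64 - c is only reached when c != 0
}
// ---- end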
Note that the alternative expansion for rotate-left given in the code for g avoids the problematic select, is well-defined, and already gets matched to ROL (in the x86-64 backend, anyway), so it seems like a good replacement.