1//===- AMDGPURegisterBankInfo.cpp -------------------------------*- C++ -*-==//
2//
3// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
4// See https://llvm.org/LICENSE.txt for license information.
5// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
6//
7//===----------------------------------------------------------------------===//
8/// \file
9/// This file implements the targeting of the RegisterBankInfo class for
10/// AMDGPU.
11///
12/// \par
13///
14/// AMDGPU has unique register bank constraints that require special high level
15/// strategies to deal with. There are two main true physical register banks
16/// VGPR (vector), and SGPR (scalar). Additionally the VCC register bank is a
17/// sort of pseudo-register bank needed to represent SGPRs used in a vector
18/// boolean context. There is also the AGPR bank, which is a special purpose
19/// physical register bank present on some subtargets.
20///
21/// Copying from VGPR to SGPR is generally illegal, unless the value is known to
22/// be uniform. It is generally not valid to legalize operands by inserting
23/// copies as on other targets. Operations which require uniform, SGPR operands
24/// generally require scalarization by repeatedly executing the instruction,
25/// activating each set of lanes using a unique set of input values. This is
26/// referred to as a waterfall loop.
27///
28/// \par Booleans
29///
30/// Booleans (s1 values) require special consideration. A vector compare result
31/// is naturally a bitmask with one bit per lane, in a 32 or 64-bit
32/// register. These are represented with the VCC bank. During selection, we need
33/// to be able to unambiguously go back from a register class to a register
34/// bank. To distinguish whether an SGPR should use the SGPR or VCC register
35/// bank, we need to know the use context type. An SGPR s1 value always means a
36/// VCC bank value, otherwise it will be the SGPR bank. A scalar compare sets
37/// SCC, which is a 1-bit unaddressable register. This will need to be copied to
38/// a 32-bit virtual register. Taken together, this means we need to adjust the
39/// type of boolean operations to be regbank legal. All SALU booleans need to be
40/// widened to 32-bits, and all VALU booleans need to be s1 values.
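///
/// As a rough illustration (invented register names, not output from an
/// actual compilation):
/// \code
///   ; Divergent boolean: an s1 value assigned to the VCC bank.
///   %vcond:vcc(s1) = G_ICMP intpred(eq), %a:vgpr(s32), %b:vgpr(s32)
///
///   ; Uniform boolean logic: widened to 32 bits on the SGPR bank.
///   %scond:sgpr(s32) = G_AND %x:sgpr(s32), %y:sgpr(s32)
/// \endcode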
41///
42/// A noteworthy exception to the s1-means-vcc rule is for legalization artifact
43/// casts. G_TRUNC s1 results, and G_SEXT/G_ZEXT/G_ANYEXT sources are never vcc
44/// bank. A non-boolean source (such as a truncate from a 1-bit load from
45/// memory) will require a copy to the VCC bank which will require clearing the
46/// high bits and inserting a compare.
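///
/// As a rough sketch (invented names), such a copy amounts to masking the low
/// bit and comparing against zero:
/// \code
///   %masked:vgpr(s32) = G_AND %nonbool:vgpr(s32), 1
///   %cond:vcc(s1) = G_ICMP intpred(ne), %masked:vgpr(s32), 0
/// \endcode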
47///
48/// \par Constant bus restriction
49///
50/// VALU instructions have a limitation known as the constant bus
51/// restriction. Most VALU instructions can use SGPR operands, but may read at
52/// most 1 SGPR or constant literal value (this was raised to 2 in gfx10 for most
53/// instructions). This is one unique SGPR, so the same SGPR may be used for
54/// multiple operands. From a register bank perspective, any combination of
55/// operands should be legal as an SGPR, but this is contextually dependent on
56/// the SGPR operands all being the same register. It is therefore optimal to
57/// choose the SGPR with the most uses to minimize the number of copies.
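///
/// As a rough illustration of the pre-gfx10 rule:
/// \code
///   v_fma_f32 v0, s1, s1, v2  ; OK: one unique SGPR read
///   v_fma_f32 v0, s1, s2, v2  ; invalid: two unique SGPRs on the constant bus
/// \endcode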
58///
59/// We avoid trying to solve this problem in RegBankSelect. Any VALU G_*
60/// operation should have its source operands all mapped to VGPRs (except for
61/// VCC), inserting copies from any SGPR operands. This is the most trivial legal
62/// mapping. Anything beyond the simplest 1:1 instruction selection would be too
63/// complicated to solve here. Every optimization pattern or instruction
64/// selected to multiple outputs would have to enforce this rule, and there
65/// would be additional complexity in tracking this rule for every G_*
66/// operation. By forcing all inputs to VGPRs, it also simplifies the task of
67/// picking the optimal operand combination from a post-isel optimization pass.
68///
69//===----------------------------------------------------------------------===//
70
72
73#include "AMDGPU.h"
75#include "AMDGPUInstrInfo.h"
76#include "GCNSubtarget.h"
78#include "SIRegisterInfo.h"
84#include "llvm/IR/IntrinsicsAMDGPU.h"
85
86#define GET_TARGET_REGBANK_IMPL
87#include "AMDGPUGenRegisterBank.inc"
88
89// This file will be TableGen'ed at some point.
90#include "AMDGPUGenRegisterBankInfo.def"
91
92using namespace llvm;
93using namespace MIPatternMatch;
94
95namespace {
96
97// Observer to apply a register bank to new registers created by LegalizerHelper.
98class ApplyRegBankMapping final : public GISelChangeObserver {
99private:
100 MachineIRBuilder &B;
101 const AMDGPURegisterBankInfo &RBI;
102 MachineRegisterInfo &MRI;
103 const RegisterBank *NewBank;
104 SmallVector<MachineInstr *, 4> NewInsts;
105
106public:
107 ApplyRegBankMapping(MachineIRBuilder &B, const AMDGPURegisterBankInfo &RBI_,
108 MachineRegisterInfo &MRI_, const RegisterBank *RB)
109 : B(B), RBI(RBI_), MRI(MRI_), NewBank(RB) {
110 assert(!B.isObservingChanges());
111 B.setChangeObserver(*this);
112 }
113
114 ~ApplyRegBankMapping() override {
115 for (MachineInstr *MI : NewInsts)
116 applyBank(*MI);
117
118 B.stopObservingChanges();
119 }
120
121 /// Set any registers that don't have a set register class or bank to SALU.
122 void applyBank(MachineInstr &MI) {
123 const unsigned Opc = MI.getOpcode();
124 if (Opc == AMDGPU::G_ANYEXT || Opc == AMDGPU::G_ZEXT ||
125 Opc == AMDGPU::G_SEXT) {
126 // LegalizerHelper wants to use the basic legalization artifacts when
127 // widening etc. We don't handle selection with vcc in artifact sources,
128 // so we need to use a select instead to handle these properly.
129 Register DstReg = MI.getOperand(0).getReg();
130 Register SrcReg = MI.getOperand(1).getReg();
131 const RegisterBank *SrcBank = RBI.getRegBank(SrcReg, MRI, *RBI.TRI);
132 if (SrcBank == &AMDGPU::VCCRegBank) {
133 const LLT S32 = LLT::scalar(32);
134 assert(MRI.getType(SrcReg) == LLT::scalar(1));
135 assert(MRI.getType(DstReg) == S32);
136 assert(NewBank == &AMDGPU::VGPRRegBank);
137
138 // Replace the extension with a select, which really uses the boolean
139 // source.
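 // For example, a vcc-sourced extension such as
 //   %dst:_(s32) = G_ZEXT %src:vcc(s1)
 // becomes, roughly,
 //   %dst:vgpr(s32) = G_SELECT %src:vcc(s1), 1, 0
 // (with -1 instead of 1 for G_SEXT).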
140 B.setInsertPt(*MI.getParent(), MI);
141
142 auto True = B.buildConstant(S32, Opc == AMDGPU::G_SEXT ? -1 : 1);
143 auto False = B.buildConstant(S32, 0);
144 B.buildSelect(DstReg, SrcReg, True, False);
145 MRI.setRegBank(True.getReg(0), *NewBank);
146 MRI.setRegBank(False.getReg(0), *NewBank);
147 MI.eraseFromParent();
148 }
149
150 assert(!MRI.getRegClassOrRegBank(DstReg));
151 MRI.setRegBank(DstReg, *NewBank);
152 return;
153 }
154
155#ifndef NDEBUG
156 if (Opc == AMDGPU::G_TRUNC) {
157 Register DstReg = MI.getOperand(0).getReg();
158 const RegisterBank *DstBank = RBI.getRegBank(DstReg, MRI, *RBI.TRI);
159 assert(DstBank != &AMDGPU::VCCRegBank);
160 }
161#endif
162
163 for (MachineOperand &Op : MI.operands()) {
164 if (!Op.isReg())
165 continue;
166
167 // We may see physical registers if building a real MI
168 Register Reg = Op.getReg();
169 if (Reg.isPhysical() || MRI.getRegClassOrRegBank(Reg))
170 continue;
171
172 const RegisterBank *RB = NewBank;
173 if (MRI.getType(Reg) == LLT::scalar(1)) {
174 assert(NewBank == &AMDGPU::VGPRRegBank &&
175 "s1 operands should only be used for vector bools");
176 assert((MI.getOpcode() != AMDGPU::G_TRUNC &&
177 MI.getOpcode() != AMDGPU::G_ANYEXT) &&
178 "not expecting legalization artifacts here");
179 RB = &AMDGPU::VCCRegBank;
180 }
181
182 MRI.setRegBank(Reg, *RB);
183 }
184 }
185
186 void erasingInstr(MachineInstr &MI) override {}
187
188 void createdInstr(MachineInstr &MI) override {
189 // At this point, the instruction was just inserted and has no operands.
190 NewInsts.push_back(&MI);
191 }
192
193 void changingInstr(MachineInstr &MI) override {}
194 void changedInstr(MachineInstr &MI) override {
195 // FIXME: In principle we should probably add the instruction to NewInsts,
196 // but the way the LegalizerHelper uses the observer, we will always see the
197 // registers we need to set the regbank on also referenced in a new
198 // instruction.
199 }
200};
201
202} // anonymous namespace
203
204AMDGPURegisterBankInfo::AMDGPURegisterBankInfo(const GCNSubtarget &ST)
205 : Subtarget(ST), TRI(Subtarget.getRegisterInfo()),
206 TII(Subtarget.getInstrInfo()) {
207
208 // HACK: Until this is fully tablegen'd.
209 static llvm::once_flag InitializeRegisterBankFlag;
210
211 static auto InitializeRegisterBankOnce = [this]() {
212 assert(&getRegBank(AMDGPU::SGPRRegBankID) == &AMDGPU::SGPRRegBank &&
213 &getRegBank(AMDGPU::VGPRRegBankID) == &AMDGPU::VGPRRegBank &&
214 &getRegBank(AMDGPU::AGPRRegBankID) == &AMDGPU::AGPRRegBank);
215 (void)this;
216 };
217
218 llvm::call_once(InitializeRegisterBankFlag, InitializeRegisterBankOnce);
219}
220
221static bool isVectorRegisterBank(const RegisterBank &Bank) {
222 unsigned BankID = Bank.getID();
223 return BankID == AMDGPU::VGPRRegBankID || BankID == AMDGPU::AGPRRegBankID;
224}
225
226bool AMDGPURegisterBankInfo::isDivergentRegBank(const RegisterBank *RB) const {
227 return RB != &AMDGPU::SGPRRegBank;
228}
229
230unsigned AMDGPURegisterBankInfo::copyCost(const RegisterBank &Dst,
231 const RegisterBank &Src,
232 TypeSize Size) const {
233 // TODO: Should there be a UniformVGPRRegBank which can use readfirstlane?
234 if (Dst.getID() == AMDGPU::SGPRRegBankID &&
235 (isVectorRegisterBank(Src) || Src.getID() == AMDGPU::VCCRegBankID)) {
236 return std::numeric_limits<unsigned>::max();
237 }
238
239 // Bool values are tricky, because the meaning is based on context. The SCC
240 // and VCC banks are for the natural scalar and vector conditions produced by
241 // a compare.
242 //
243 // Legalization doesn't know about the necessary context, so an s1 use may
244 // have been a truncate from an arbitrary value, in which case a copy (lowered
245 // as a compare with 0) needs to be inserted.
246 if (Size == 1 &&
247 (Dst.getID() == AMDGPU::SGPRRegBankID) &&
248 (isVectorRegisterBank(Src) ||
249 Src.getID() == AMDGPU::SGPRRegBankID ||
250 Src.getID() == AMDGPU::VCCRegBankID))
251 return std::numeric_limits<unsigned>::max();
252
253 // There is no direct copy between AGPRs.
254 if (Dst.getID() == AMDGPU::AGPRRegBankID &&
255 Src.getID() == AMDGPU::AGPRRegBankID)
256 return 4;
257
258 return RegisterBankInfo::copyCost(Dst, Src, Size);
259}
260
261unsigned AMDGPURegisterBankInfo::getBreakDownCost(
262 const ValueMapping &ValMapping,
263 const RegisterBank *CurBank) const {
264 // Check if this is a breakdown for G_LOAD to move the pointer from SGPR to
265 // VGPR.
266 // FIXME: Is there a better way to do this?
267 if (ValMapping.NumBreakDowns >= 2 || ValMapping.BreakDown[0].Length >= 64)
268 return 10; // This is expensive.
269
270 assert(ValMapping.NumBreakDowns == 2 &&
271 ValMapping.BreakDown[0].Length == 32 &&
272 ValMapping.BreakDown[0].StartIdx == 0 &&
273 ValMapping.BreakDown[1].Length == 32 &&
274 ValMapping.BreakDown[1].StartIdx == 32 &&
275 ValMapping.BreakDown[0].RegBank == ValMapping.BreakDown[1].RegBank);
276
277 // 32-bit extract of a 64-bit value is just access of a subregister, so free.
278 // TODO: Cost of 0 hits assert, though it's not clear it's what we really
279 // want.
280
281 // TODO: 32-bit insert to a 64-bit SGPR may incur a non-free copy due to SGPR
282 // alignment restrictions, but this probably isn't important.
283 return 1;
284}
285
286const RegisterBank &
287AMDGPURegisterBankInfo::getRegBankFromRegClass(const TargetRegisterClass &RC,
288 LLT Ty) const {
289 if (&RC == &AMDGPU::SReg_1RegClass)
290 return AMDGPU::VCCRegBank;
291
292 // We promote real scalar booleans to SReg_32. Any SGPR using s1 is really a
293 // VCC-like use.
294 if (TRI->isSGPRClass(&RC)) {
295 // FIXME: This probably came from a copy from a physical register, which
296 // should be inferable from the copied to-type. We don't have many boolean
297 // physical register constraints so just assume a normal SGPR for now.
298 if (!Ty.isValid())
299 return AMDGPU::SGPRRegBank;
300
301 return Ty == LLT::scalar(1) ? AMDGPU::VCCRegBank : AMDGPU::SGPRRegBank;
302 }
303
304 return TRI->isAGPRClass(&RC) ? AMDGPU::AGPRRegBank : AMDGPU::VGPRRegBank;
305}
306
307template <unsigned NumOps>
308RegisterBankInfo::InstructionMappings
309AMDGPURegisterBankInfo::addMappingFromTable(
310 const MachineInstr &MI, const MachineRegisterInfo &MRI,
311 const std::array<unsigned, NumOps> RegSrcOpIdx,
312 ArrayRef<OpRegBankEntry<NumOps>> Table) const {
313
314 InstructionMappings AltMappings;
315
317
318 unsigned Sizes[NumOps];
319 for (unsigned I = 0; I < NumOps; ++I) {
320 Register Reg = MI.getOperand(RegSrcOpIdx[I]).getReg();
321 Sizes[I] = getSizeInBits(Reg, MRI, *TRI);
322 }
323
324 for (unsigned I = 0, E = MI.getNumExplicitDefs(); I != E; ++I) {
325 unsigned SizeI = getSizeInBits(MI.getOperand(I).getReg(), MRI, *TRI);
326 Operands[I] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, SizeI);
327 }
328
329 // getInstrMapping's default mapping uses ID 1, so start at 2.
330 unsigned MappingID = 2;
331 for (const auto &Entry : Table) {
332 for (unsigned I = 0; I < NumOps; ++I) {
333 int OpIdx = RegSrcOpIdx[I];
334 Operands[OpIdx] = AMDGPU::getValueMapping(Entry.RegBanks[I], Sizes[I]);
335 }
336
337 AltMappings.push_back(&getInstructionMapping(MappingID++, Entry.Cost,
339 Operands.size()));
340 }
341
342 return AltMappings;
343}
344
345RegisterBankInfo::InstructionMappings
346AMDGPURegisterBankInfo::getInstrAlternativeMappingsIntrinsic(
347 const MachineInstr &MI, const MachineRegisterInfo &MRI) const {
348 switch (cast<GIntrinsic>(MI).getIntrinsicID()) {
349 case Intrinsic::amdgcn_readlane: {
350 static const OpRegBankEntry<3> Table[2] = {
351 // Perfectly legal.
352 { { AMDGPU::SGPRRegBankID, AMDGPU::VGPRRegBankID, AMDGPU::SGPRRegBankID }, 1 },
353
354 // Need a readfirstlane for the index.
355 { { AMDGPU::SGPRRegBankID, AMDGPU::VGPRRegBankID, AMDGPU::VGPRRegBankID }, 2 }
356 };
357
358 const std::array<unsigned, 3> RegSrcOpIdx = { { 0, 2, 3 } };
359 return addMappingFromTable<3>(MI, MRI, RegSrcOpIdx, Table);
360 }
361 case Intrinsic::amdgcn_writelane: {
362 static const OpRegBankEntry<4> Table[4] = {
363 // Perfectly legal.
364 { { AMDGPU::VGPRRegBankID, AMDGPU::SGPRRegBankID, AMDGPU::SGPRRegBankID, AMDGPU::VGPRRegBankID }, 1 },
365
366 // Need readfirstlane of first op
367 { { AMDGPU::VGPRRegBankID, AMDGPU::VGPRRegBankID, AMDGPU::SGPRRegBankID, AMDGPU::VGPRRegBankID }, 2 },
368
369 // Need readfirstlane of second op
370 { { AMDGPU::VGPRRegBankID, AMDGPU::SGPRRegBankID, AMDGPU::VGPRRegBankID, AMDGPU::VGPRRegBankID }, 2 },
371
372 // Need readfirstlane of both ops
373 { { AMDGPU::VGPRRegBankID, AMDGPU::VGPRRegBankID, AMDGPU::VGPRRegBankID, AMDGPU::VGPRRegBankID }, 3 }
374 };
375
376 // vdst, value to write, lane select, previous vdst
377 const std::array<unsigned, 4> RegSrcOpIdx = { { 0, 2, 3, 4 } };
378 return addMappingFromTable<4>(MI, MRI, RegSrcOpIdx, Table);
379 }
380 default:
381 return RegisterBankInfo::getInstrAlternativeMappings(MI);
382 }
383}
384
385RegisterBankInfo::InstructionMappings
386AMDGPURegisterBankInfo::getInstrAlternativeMappingsIntrinsicWSideEffects(
387 const MachineInstr &MI, const MachineRegisterInfo &MRI) const {
388
389 switch (cast<GIntrinsic>(MI).getIntrinsicID()) {
390 case Intrinsic::amdgcn_s_buffer_load: {
391 static const OpRegBankEntry<2> Table[4] = {
392 // Perfectly legal.
393 { { AMDGPU::SGPRRegBankID, AMDGPU::SGPRRegBankID }, 1 },
394
395 // Only need 1 register in loop
396 { { AMDGPU::SGPRRegBankID, AMDGPU::VGPRRegBankID }, 300 },
397
398 // Have to waterfall the resource.
399 { { AMDGPU::VGPRRegBankID, AMDGPU::SGPRRegBankID }, 1000 },
400
401 // Have to waterfall the resource, and the offset.
402 { { AMDGPU::VGPRRegBankID, AMDGPU::VGPRRegBankID }, 1500 }
403 };
404
405 // rsrc, offset
406 const std::array<unsigned, 2> RegSrcOpIdx = { { 2, 3 } };
407 return addMappingFromTable<2>(MI, MRI, RegSrcOpIdx, Table);
408 }
409 case Intrinsic::amdgcn_ds_ordered_add:
410 case Intrinsic::amdgcn_ds_ordered_swap: {
411 // VGPR = M0, VGPR
412 static const OpRegBankEntry<3> Table[2] = {
413 // Perfectly legal.
414 { { AMDGPU::VGPRRegBankID, AMDGPU::SGPRRegBankID, AMDGPU::VGPRRegBankID }, 1 },
415
416 // Need a readfirstlane for m0
417 { { AMDGPU::VGPRRegBankID, AMDGPU::VGPRRegBankID, AMDGPU::VGPRRegBankID }, 2 }
418 };
419
420 const std::array<unsigned, 3> RegSrcOpIdx = { { 0, 2, 3 } };
421 return addMappingFromTable<3>(MI, MRI, RegSrcOpIdx, Table);
422 }
423 case Intrinsic::amdgcn_s_sendmsg:
424 case Intrinsic::amdgcn_s_sendmsghalt: {
425 // FIXME: Should have no register for immediate
426 static const OpRegBankEntry<1> Table[2] = {
427 // Perfectly legal.
428 { { AMDGPU::SGPRRegBankID }, 1 },
429
430 // Need readlane
431 { { AMDGPU::VGPRRegBankID }, 3 }
432 };
433
434 const std::array<unsigned, 1> RegSrcOpIdx = { { 2 } };
435 return addMappingFromTable<1>(MI, MRI, RegSrcOpIdx, Table);
436 }
437 default:
438 return RegisterBankInfo::getInstrAlternativeMappings(MI);
439 }
440}
441
442// FIXME: Returns uniform if there's no source value information. This is
443// probably wrong.
445 if (!MI.hasOneMemOperand())
446 return false;
447
448 const MachineMemOperand *MMO = *MI.memoperands_begin();
449 const unsigned AS = MMO->getAddrSpace();
450 const bool IsConst = AS == AMDGPUAS::CONSTANT_ADDRESS ||
451 AS == AMDGPUAS::CONSTANT_ADDRESS_32BIT;
452 const unsigned MemSize = 8 * MMO->getSize().getValue();
453
454 // Require 4-byte alignment.
455 return (MMO->getAlign() >= Align(4) ||
457 ((MemSize == 16 && MMO->getAlign() >= Align(2)) ||
458 (MemSize == 8 && MMO->getAlign() >= Align(1))))) &&
459 // Can't do a scalar atomic load.
460 !MMO->isAtomic() &&
461 // Don't use scalar loads for volatile accesses to non-constant address
462 // spaces.
463 (IsConst || !MMO->isVolatile()) &&
464 // Memory must be known constant, or not written before this load.
465 (IsConst || MMO->isInvariant() || (MMO->getFlags() & MONoClobber)) &&
466 AMDGPUInstrInfo::isUniformMMO(MMO);
467}
468
471 const MachineInstr &MI) const {
472
473 const MachineFunction &MF = *MI.getParent()->getParent();
474 const MachineRegisterInfo &MRI = MF.getRegInfo();
475
476
477 InstructionMappings AltMappings;
478 switch (MI.getOpcode()) {
479 case TargetOpcode::G_CONSTANT:
480 case TargetOpcode::G_IMPLICIT_DEF: {
481 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
482 if (Size == 1) {
483 static const OpRegBankEntry<1> Table[3] = {
484 { { AMDGPU::VGPRRegBankID }, 1 },
485 { { AMDGPU::SGPRRegBankID }, 1 },
486 { { AMDGPU::VCCRegBankID }, 1 }
487 };
488
489 return addMappingFromTable<1>(MI, MRI, {{ 0 }}, Table);
490 }
491
492 [[fallthrough]];
493 }
494 case TargetOpcode::G_FCONSTANT:
495 case TargetOpcode::G_FRAME_INDEX:
496 case TargetOpcode::G_GLOBAL_VALUE: {
497 static const OpRegBankEntry<1> Table[2] = {
498 { { AMDGPU::VGPRRegBankID }, 1 },
499 { { AMDGPU::SGPRRegBankID }, 1 }
500 };
501
502 return addMappingFromTable<1>(MI, MRI, {{ 0 }}, Table);
503 }
504 case TargetOpcode::G_AND:
505 case TargetOpcode::G_OR:
506 case TargetOpcode::G_XOR: {
507 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
508
509 if (Size == 1) {
510 // s_{and|or|xor}_b32 set scc when the result of the 32-bit op is not 0.
511 const InstructionMapping &SCCMapping = getInstructionMapping(
512 1, 1, getOperandsMapping(
513 {AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 32),
514 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 32),
515 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 32)}),
516 3); // Num Operands
517 AltMappings.push_back(&SCCMapping);
518
519 const InstructionMapping &VCCMapping0 = getInstructionMapping(
520 2, 1, getOperandsMapping(
521 {AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, Size),
522 AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, Size),
523 AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, Size)}),
524 3); // Num Operands
525 AltMappings.push_back(&VCCMapping0);
526 return AltMappings;
527 }
528
529 if (Size != 64)
530 break;
531
532 const InstructionMapping &SSMapping = getInstructionMapping(
533 1, 1, getOperandsMapping(
534 {AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size),
535 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size),
536 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size)}),
537 3); // Num Operands
538 AltMappings.push_back(&SSMapping);
539
540 const InstructionMapping &VVMapping = getInstructionMapping(
541 2, 2, getOperandsMapping(
542 {AMDGPU::getValueMappingSGPR64Only(AMDGPU::VGPRRegBankID, Size),
543 AMDGPU::getValueMappingSGPR64Only(AMDGPU::VGPRRegBankID, Size),
544 AMDGPU::getValueMappingSGPR64Only(AMDGPU::VGPRRegBankID, Size)}),
545 3); // Num Operands
546 AltMappings.push_back(&VVMapping);
547 break;
548 }
549 case TargetOpcode::G_LOAD:
550 case TargetOpcode::G_ZEXTLOAD:
551 case TargetOpcode::G_SEXTLOAD: {
552 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
553 LLT PtrTy = MRI.getType(MI.getOperand(1).getReg());
554 unsigned PtrSize = PtrTy.getSizeInBits();
555 unsigned AS = PtrTy.getAddressSpace();
556
560 const InstructionMapping &SSMapping = getInstructionMapping(
561 1, 1, getOperandsMapping(
562 {AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size),
563 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, PtrSize)}),
564 2); // Num Operands
565 AltMappings.push_back(&SSMapping);
566 }
567
568 const InstructionMapping &VVMapping = getInstructionMapping(
569 2, 1,
570 getOperandsMapping(
571 {AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size),
572 AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, PtrSize)}),
573 2); // Num Operands
574 AltMappings.push_back(&VVMapping);
575
576 // It may be possible to have a vgpr = load sgpr mapping here, because
1577 // the mubuf instructions support this kind of load, but probably only for
578 // gfx7 and older. However, the addressing mode matching in the instruction
579 // selector should be able to do a better job of detecting and selecting
580 // these kinds of loads from the vgpr = load vgpr mapping.
581
582 return AltMappings;
583
584 }
585 case TargetOpcode::G_SELECT: {
586 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
587 const InstructionMapping &SSMapping = getInstructionMapping(1, 1,
588 getOperandsMapping({AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size),
589 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 1),
590 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size),
591 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size)}),
592 4); // Num Operands
593 AltMappings.push_back(&SSMapping);
594
595 const InstructionMapping &VVMapping = getInstructionMapping(2, 1,
596 getOperandsMapping({AMDGPU::getValueMappingSGPR64Only(AMDGPU::VGPRRegBankID, Size),
597 AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1),
598 AMDGPU::getValueMappingSGPR64Only(AMDGPU::VGPRRegBankID, Size),
599 AMDGPU::getValueMappingSGPR64Only(AMDGPU::VGPRRegBankID, Size)}),
600 4); // Num Operands
601 AltMappings.push_back(&VVMapping);
602
603 return AltMappings;
604 }
605 case TargetOpcode::G_UADDE:
606 case TargetOpcode::G_USUBE:
607 case TargetOpcode::G_SADDE:
608 case TargetOpcode::G_SSUBE: {
609 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
610 const InstructionMapping &SSMapping = getInstructionMapping(1, 1,
611 getOperandsMapping(
612 {AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size),
613 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 1),
614 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size),
615 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size),
616 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 1)}),
617 5); // Num Operands
618 AltMappings.push_back(&SSMapping);
619
620 const InstructionMapping &VVMapping = getInstructionMapping(2, 1,
621 getOperandsMapping({AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size),
622 AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1),
623 AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size),
624 AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size),
625 AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1)}),
626 5); // Num Operands
627 AltMappings.push_back(&VVMapping);
628 return AltMappings;
629 }
630 case AMDGPU::G_BRCOND: {
631 assert(MRI.getType(MI.getOperand(0).getReg()).getSizeInBits() == 1);
632
633 // TODO: Change type to 32 for scalar
634 const InstructionMapping &SMapping = getInstructionMapping(
635 1, 1, getOperandsMapping(
636 {AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 1), nullptr}),
637 2); // Num Operands
638 AltMappings.push_back(&SMapping);
639
640 const InstructionMapping &VMapping = getInstructionMapping(
641 1, 1, getOperandsMapping(
642 {AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1), nullptr }),
643 2); // Num Operands
644 AltMappings.push_back(&VMapping);
645 return AltMappings;
646 }
647 case AMDGPU::G_INTRINSIC:
648 case AMDGPU::G_INTRINSIC_CONVERGENT:
649 return getInstrAlternativeMappingsIntrinsic(MI, MRI);
650 case AMDGPU::G_INTRINSIC_W_SIDE_EFFECTS:
651 case AMDGPU::G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS:
652 return getInstrAlternativeMappingsIntrinsicWSideEffects(MI, MRI);
653 default:
654 break;
655 }
656 return RegisterBankInfo::getInstrAlternativeMappings(MI);
657}
658
662 LLT HalfTy,
663 Register Reg) const {
664 assert(HalfTy.getSizeInBits() == 32);
665 MachineRegisterInfo *MRI = B.getMRI();
666 Register LoLHS = MRI->createGenericVirtualRegister(HalfTy);
667 Register HiLHS = MRI->createGenericVirtualRegister(HalfTy);
668 const RegisterBank *Bank = getRegBank(Reg, *MRI, *TRI);
669 MRI->setRegBank(LoLHS, *Bank);
670 MRI->setRegBank(HiLHS, *Bank);
671
672 Regs.push_back(LoLHS);
673 Regs.push_back(HiLHS);
674
675 B.buildInstr(AMDGPU::G_UNMERGE_VALUES)
676 .addDef(LoLHS)
677 .addDef(HiLHS)
678 .addUse(Reg);
679}
680
681/// Replace the current type each register in \p Regs has with \p NewTy
683 LLT NewTy) {
684 for (Register Reg : Regs) {
685 assert(MRI.getType(Reg).getSizeInBits() == NewTy.getSizeInBits());
686 MRI.setType(Reg, NewTy);
687 }
688}
689
691 if (Ty.isVector()) {
694 Ty.getElementType());
695 }
696
697 assert(Ty.getScalarSizeInBits() % 2 == 0);
698 return LLT::scalar(Ty.getScalarSizeInBits() / 2);
699}
700
701// Build one or more V_READFIRSTLANE_B32 instructions to move the given vector
702// source value into a scalar register.
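// For a 64-bit VGPR source this produces, roughly (names invented):
//   %lo:vgpr(s32), %hi:vgpr(s32) = G_UNMERGE_VALUES %src:vgpr(s64)
//   %slo:sreg_32(s32) = V_READFIRSTLANE_B32 %lo
//   %shi:sreg_32(s32) = V_READFIRSTLANE_B32 %hi
//   %dst:sgpr(s64) = G_MERGE_VALUES %slo, %shi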
705 Register Src) const {
706 LLT Ty = MRI.getType(Src);
707 const RegisterBank *Bank = getRegBank(Src, MRI, *TRI);
708
709 if (Bank == &AMDGPU::SGPRRegBank)
710 return Src;
711
712 unsigned Bits = Ty.getSizeInBits();
713 assert(Bits % 32 == 0);
714
715 if (Bank != &AMDGPU::VGPRRegBank) {
716 // We need to copy from AGPR to VGPR
717 Src = B.buildCopy(Ty, Src).getReg(0);
718 MRI.setRegBank(Src, AMDGPU::VGPRRegBank);
719 }
720
721 LLT S32 = LLT::scalar(32);
722 unsigned NumParts = Bits / 32;
723 SmallVector<Register, 8> SrcParts;
724 SmallVector<Register, 8> DstParts;
725
726 if (Bits == 32) {
727 SrcParts.push_back(Src);
728 } else {
729 auto Unmerge = B.buildUnmerge(S32, Src);
730 for (unsigned i = 0; i < NumParts; ++i)
731 SrcParts.push_back(Unmerge.getReg(i));
732 }
733
734 for (unsigned i = 0; i < NumParts; ++i) {
735 Register SrcPart = SrcParts[i];
736 Register DstPart = MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass);
737 MRI.setType(DstPart, NumParts == 1 ? Ty : S32);
738
739 const TargetRegisterClass *Constrained =
740 constrainGenericRegister(SrcPart, AMDGPU::VGPR_32RegClass, MRI);
741 (void)Constrained;
742 assert(Constrained && "Failed to constrain readfirstlane src reg");
743
744 B.buildInstr(AMDGPU::V_READFIRSTLANE_B32, {DstPart}, {SrcPart});
745
746 DstParts.push_back(DstPart);
747 }
748
749 if (Bits == 32)
750 return DstParts[0];
751
752 Register Dst = B.buildMergeLikeInstr(Ty, DstParts).getReg(0);
753 MRI.setRegBank(Dst, AMDGPU::SGPRRegBank);
754 return Dst;
755}
756
757/// Legalize instruction \p MI where operands in \p OpIndices must be SGPRs. If
758/// any of the required SGPR operands are VGPRs, perform a waterfall loop to
759/// execute the instruction for each unique combination of values in all lanes
760/// in the wave. The block will be split such that rest of the instructions are
761/// moved to a new block.
762///
763/// Essentially performs this loop:
764//
765/// Save Execution Mask
766/// For (Lane : Wavefront) {
767/// Enable Lane, Disable all other lanes
768/// SGPR = read SGPR value for current lane from VGPR
769/// VGPRResult[Lane] = use_op SGPR
770/// }
771/// Restore Execution Mask
772///
773/// There is additional complexity in comparing the values to identify the
774/// unique values used.
775bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
776 MachineIRBuilder &B, iterator_range<MachineBasicBlock::iterator> Range,
777 SmallSet<Register, 4> &SGPROperandRegs) const {
778 // Track use registers which have already been expanded with a readfirstlane
779 // sequence. This may have multiple uses if moving a sequence.
780 DenseMap<Register, Register> WaterfalledRegMap;
781
782 MachineBasicBlock &MBB = B.getMBB();
783 MachineFunction *MF = &B.getMF();
784
786 const unsigned MovExecOpc =
787 Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
788 const unsigned MovExecTermOpc =
789 Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
790
791 const unsigned XorTermOpc = Subtarget.isWave32() ?
792 AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
793 const unsigned AndSaveExecOpc = Subtarget.isWave32() ?
794 AMDGPU::S_AND_SAVEEXEC_B32 : AMDGPU::S_AND_SAVEEXEC_B64;
795 const unsigned ExecReg = Subtarget.isWave32() ?
796 AMDGPU::EXEC_LO : AMDGPU::EXEC;
797
798#ifndef NDEBUG
799 const int OrigRangeSize = std::distance(Range.begin(), Range.end());
800#endif
801
802 MachineRegisterInfo &MRI = *B.getMRI();
803 Register SaveExecReg = MRI.createVirtualRegister(WaveRC);
804 Register InitSaveExecReg = MRI.createVirtualRegister(WaveRC);
805
806 // Don't bother using generic instructions/registers for the exec mask.
807 B.buildInstr(TargetOpcode::IMPLICIT_DEF)
808 .addDef(InitSaveExecReg);
809
810 Register PhiExec = MRI.createVirtualRegister(WaveRC);
811 Register NewExec = MRI.createVirtualRegister(WaveRC);
812
813 // To insert the loop we need to split the block. Move everything before this
814 // point to a new block, and insert a new empty block before this instruction.
815 MachineBasicBlock *LoopBB = MF->CreateMachineBasicBlock();
816 MachineBasicBlock *BodyBB = MF->CreateMachineBasicBlock();
817 MachineBasicBlock *RemainderBB = MF->CreateMachineBasicBlock();
818 MachineBasicBlock *RestoreExecBB = MF->CreateMachineBasicBlock();
819 MachineFunction::iterator MBBI(MBB);
820 ++MBBI;
821 MF->insert(MBBI, LoopBB);
822 MF->insert(MBBI, BodyBB);
823 MF->insert(MBBI, RestoreExecBB);
824 MF->insert(MBBI, RemainderBB);
825
826 LoopBB->addSuccessor(BodyBB);
827 BodyBB->addSuccessor(RestoreExecBB);
828 BodyBB->addSuccessor(LoopBB);
829
830 // Move the rest of the block into a new block.
831 RemainderBB->transferSuccessorsAndUpdatePHIs(&MBB);
832 RemainderBB->splice(RemainderBB->begin(), &MBB, Range.end(), MBB.end());
833
834 MBB.addSuccessor(LoopBB);
835 RestoreExecBB->addSuccessor(RemainderBB);
836
837 B.setInsertPt(*LoopBB, LoopBB->end());
838
839 B.buildInstr(TargetOpcode::PHI)
840 .addDef(PhiExec)
841 .addReg(InitSaveExecReg)
842 .addMBB(&MBB)
843 .addReg(NewExec)
844 .addMBB(BodyBB);
845
846 const DebugLoc &DL = B.getDL();
847
848 MachineInstr &FirstInst = *Range.begin();
849
850 // Move the instruction into the loop body. Note we moved everything after
851 // Range.end() already into a new block, so Range.end() is no longer valid.
852 BodyBB->splice(BodyBB->end(), &MBB, Range.begin(), MBB.end());
853
854 // Figure out the iterator range after splicing the instructions.
855 MachineBasicBlock::iterator NewBegin = FirstInst.getIterator();
856 auto NewEnd = BodyBB->end();
857
858 B.setMBB(*LoopBB);
859
860 LLT S1 = LLT::scalar(1);
861 Register CondReg;
862
863 assert(std::distance(NewBegin, NewEnd) == OrigRangeSize);
864
865 for (MachineInstr &MI : make_range(NewBegin, NewEnd)) {
866 for (MachineOperand &Op : MI.all_uses()) {
867 Register OldReg = Op.getReg();
868 if (!SGPROperandRegs.count(OldReg))
869 continue;
870
871 // See if we already processed this register in another instruction in the
872 // sequence.
873 auto OldVal = WaterfalledRegMap.find(OldReg);
874 if (OldVal != WaterfalledRegMap.end()) {
875 Op.setReg(OldVal->second);
876 continue;
877 }
878
879 Register OpReg = Op.getReg();
880 LLT OpTy = MRI.getType(OpReg);
881
882 const RegisterBank *OpBank = getRegBank(OpReg, MRI, *TRI);
883 if (OpBank != &AMDGPU::VGPRRegBank) {
884 // Insert copy from AGPR to VGPR before the loop.
885 B.setMBB(MBB);
886 OpReg = B.buildCopy(OpTy, OpReg).getReg(0);
887 MRI.setRegBank(OpReg, AMDGPU::VGPRRegBank);
888 B.setMBB(*LoopBB);
889 }
890
891 Register CurrentLaneReg = buildReadFirstLane(B, MRI, OpReg);
892
893 // Build the comparison(s).
894 unsigned OpSize = OpTy.getSizeInBits();
895 bool Is64 = OpSize % 64 == 0;
896 unsigned PartSize = Is64 ? 64 : 32;
897 LLT PartTy = LLT::scalar(PartSize);
898 unsigned NumParts = OpSize / PartSize;
899 SmallVector<Register, 8> OpParts;
900 SmallVector<Register, 8> CurrentLaneParts;
901
902 if (NumParts == 1) {
903 OpParts.push_back(OpReg);
904 CurrentLaneParts.push_back(CurrentLaneReg);
905 } else {
906 auto UnmergeOp = B.buildUnmerge(PartTy, OpReg);
907 auto UnmergeCurrentLane = B.buildUnmerge(PartTy, CurrentLaneReg);
908 for (unsigned i = 0; i < NumParts; ++i) {
909 OpParts.push_back(UnmergeOp.getReg(i));
910 CurrentLaneParts.push_back(UnmergeCurrentLane.getReg(i));
911 MRI.setRegBank(OpParts[i], AMDGPU::VGPRRegBank);
912 MRI.setRegBank(CurrentLaneParts[i], AMDGPU::SGPRRegBank);
913 }
914 }
915
916 for (unsigned i = 0; i < NumParts; ++i) {
917 auto CmpReg = B.buildICmp(CmpInst::ICMP_EQ, S1, CurrentLaneParts[i],
918 OpParts[i]).getReg(0);
919 MRI.setRegBank(CmpReg, AMDGPU::VCCRegBank);
920
921 if (!CondReg) {
922 CondReg = CmpReg;
923 } else {
924 CondReg = B.buildAnd(S1, CondReg, CmpReg).getReg(0);
925 MRI.setRegBank(CondReg, AMDGPU::VCCRegBank);
926 }
927 }
928
929 Op.setReg(CurrentLaneReg);
930
931 // Make sure we don't re-process this register again.
932 WaterfalledRegMap.insert(std::pair(OldReg, Op.getReg()));
933 }
934 }
935
936 // The ballot becomes a no-op during instruction selection.
937 CondReg = B.buildIntrinsic(Intrinsic::amdgcn_ballot,
938 {LLT::scalar(Subtarget.isWave32() ? 32 : 64)})
939 .addReg(CondReg)
940 .getReg(0);
941 MRI.setRegClass(CondReg, WaveRC);
942
943 // Update EXEC, save the original EXEC value to VCC.
944 B.buildInstr(AndSaveExecOpc)
945 .addDef(NewExec)
946 .addReg(CondReg, RegState::Kill);
947
948 MRI.setSimpleHint(NewExec, CondReg);
949
950 B.setInsertPt(*BodyBB, BodyBB->end());
951
952 // Update EXEC, switch all done bits to 0 and all todo bits to 1.
953 B.buildInstr(XorTermOpc)
954 .addDef(ExecReg)
955 .addReg(ExecReg)
956 .addReg(NewExec);
957
958 // XXX - s_xor_b64 sets scc to 1 if the result is nonzero, so can we use
959 // s_cbranch_scc0?
960
961 // Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
962 B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
963
964 // Save the EXEC mask before the loop.
965 BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
966 .addReg(ExecReg);
967
968 // Restore the EXEC mask after the loop.
969 B.setMBB(*RestoreExecBB);
970 B.buildInstr(MovExecTermOpc)
971 .addDef(ExecReg)
972 .addReg(SaveExecReg);
973
974 // Set the insert point after the original instruction, so any new
975 // instructions will be in the remainder.
976 B.setInsertPt(*RemainderBB, RemainderBB->begin());
977
978 return true;
979}
980
981// Return any unique registers used by \p MI at \p OpIndices that need to be
982// handled in a waterfall loop. Returns these registers in \p
983// SGPROperandRegs. Returns true if there are any operands to handle and a
984// waterfall loop is necessary.
985bool AMDGPURegisterBankInfo::collectWaterfallOperands(
986 SmallSet<Register, 4> &SGPROperandRegs, MachineInstr &MI,
987 MachineRegisterInfo &MRI, ArrayRef<unsigned> OpIndices) const {
988 for (unsigned Op : OpIndices) {
989 assert(MI.getOperand(Op).isUse());
990 Register Reg = MI.getOperand(Op).getReg();
991 const RegisterBank *OpBank = getRegBank(Reg, MRI, *TRI);
992 if (OpBank->getID() != AMDGPU::SGPRRegBankID)
993 SGPROperandRegs.insert(Reg);
994 }
995
996 // No operands need to be replaced, so no need to loop.
997 return !SGPROperandRegs.empty();
998}
999
1000bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
1001 MachineIRBuilder &B, MachineInstr &MI, ArrayRef<unsigned> OpIndices) const {
1002 // Use a set to avoid extra readfirstlanes in the case where multiple operands
1003 // are the same register.
1004 SmallSet<Register, 4> SGPROperandRegs;
1005
1006 if (!collectWaterfallOperands(SGPROperandRegs, MI, *B.getMRI(), OpIndices))
1007 return false;
1008
1009 MachineBasicBlock::iterator I = MI.getIterator();
1010 return executeInWaterfallLoop(B, make_range(I, std::next(I)),
1011 SGPROperandRegs);
1012}
1013
1014// Legalize an operand that must be an SGPR by inserting a readfirstlane.
1015void AMDGPURegisterBankInfo::constrainOpWithReadfirstlane(
1016 MachineIRBuilder &B, MachineInstr &MI, unsigned OpIdx) const {
1017 Register Reg = MI.getOperand(OpIdx).getReg();
1018 MachineRegisterInfo &MRI = *B.getMRI();
1019 const RegisterBank *Bank = getRegBank(Reg, MRI, *TRI);
1020 if (Bank == &AMDGPU::SGPRRegBank)
1021 return;
1022
1023 Reg = buildReadFirstLane(B, MRI, Reg);
1024 MI.getOperand(OpIdx).setReg(Reg);
1025}
1026
1027/// Split \p Ty into 2 pieces. The first will have \p FirstSize bits, and the
1028/// rest will be in the remainder.
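///
/// For example, splitUnequalType(s96, 64) gives {s64, s32} and
/// splitUnequalType(<3 x s32>, 64) gives {<2 x s32>, s32}.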
1029static std::pair<LLT, LLT> splitUnequalType(LLT Ty, unsigned FirstSize) {
1030 unsigned TotalSize = Ty.getSizeInBits();
1031 if (!Ty.isVector())
1032 return {LLT::scalar(FirstSize), LLT::scalar(TotalSize - FirstSize)};
1033
1034 LLT EltTy = Ty.getElementType();
1035 unsigned EltSize = EltTy.getSizeInBits();
1036 assert(FirstSize % EltSize == 0);
1037
1038 unsigned FirstPartNumElts = FirstSize / EltSize;
1039 unsigned RemainderElts = (TotalSize - FirstSize) / EltSize;
1040
1041 return {LLT::scalarOrVector(ElementCount::getFixed(FirstPartNumElts), EltTy),
1042 LLT::scalarOrVector(ElementCount::getFixed(RemainderElts), EltTy)};
1043}
1044
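// For example, widen96To128(s96) is s128 and widen96To128(<3 x s32>) is
// <4 x s32>.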
1045static LLT widen96To128(LLT Ty) {
1046 if (!Ty.isVector())
1047 return LLT::scalar(128);
1048
1049 LLT EltTy = Ty.getElementType();
1050 assert(128 % EltTy.getSizeInBits() == 0);
1051 return LLT::fixed_vector(128 / EltTy.getSizeInBits(), EltTy);
1052}
1053
1057 MachineInstr &MI) const {
1058 MachineRegisterInfo &MRI = *B.getMRI();
1059 Register DstReg = MI.getOperand(0).getReg();
1060 const LLT LoadTy = MRI.getType(DstReg);
1061 unsigned LoadSize = LoadTy.getSizeInBits();
1062 const unsigned MaxNonSmrdLoadSize = 128;
1063
1064 const RegisterBank *DstBank =
1065 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
1066 if (DstBank == &AMDGPU::SGPRRegBank) {
1067 // There are some special cases that we need to look at for 32 bit and 96
1068 // bit SGPR loads; otherwise we have nothing to do.
1069 if (LoadSize != 32 && (LoadSize != 96 || Subtarget.hasScalarDwordx3Loads()))
1070 return false;
1071
1072 MachineMemOperand *MMO = *MI.memoperands_begin();
1073 const unsigned MemSize = 8 * MMO->getSize().getValue();
1074 // Scalar loads of size 8 or 16 bit with proper alignment may be widened to
1075 // 32 bit. Check to see if we need to widen the memory access; 8 or 16 bit
1076 // scalar loads should have a load size of 32 but memory access size of less
1077 // than 32.
1078 if (LoadSize == 32 &&
1079 (MemSize == 32 || LoadTy.isVector() || !isScalarLoadLegal(MI)))
1080 return false;
1081
1082 if (LoadSize == 32 &&
1083 ((MemSize == 8 && MMO->getAlign() >= Align(1)) ||
1084 (MemSize == 16 && MMO->getAlign() >= Align(2))) &&
1087 return false;
1088
1089 Register PtrReg = MI.getOperand(1).getReg();
1090
1091 ApplyRegBankMapping ApplyBank(B, *this, MRI, DstBank);
1092
1093 if (LoadSize == 32) {
1094 // This is an extending load from a sub-dword size. Widen the memory
1095 // access size to 4 bytes and clear the extra high bits appropriately
1096 const LLT S32 = LLT::scalar(32);
1097 if (MI.getOpcode() == AMDGPU::G_SEXTLOAD) {
1098 // Must extend the sign bit into higher bits for a G_SEXTLOAD
1099 auto WideLoad = B.buildLoadFromOffset(S32, PtrReg, *MMO, 0);
1100 B.buildSExtInReg(MI.getOperand(0), WideLoad, MemSize);
1101 } else if (MI.getOpcode() == AMDGPU::G_ZEXTLOAD) {
1102 // Must extend zero into higher bits with an AND for a G_ZEXTLOAD
1103 auto WideLoad = B.buildLoadFromOffset(S32, PtrReg, *MMO, 0);
1104 B.buildZExtInReg(MI.getOperand(0), WideLoad, MemSize);
1105 } else
1106 // We do not need to touch the higher bits for regular loads.
1107 B.buildLoadFromOffset(MI.getOperand(0), PtrReg, *MMO, 0);
1108 } else {
1109 // 96-bit loads are only available for vector loads. We need to split this
1110 // into a 64-bit part and a 32-bit part (unless we can widen to a 128-bit load).
1111 if (MMO->getAlign() < Align(16)) {
1112 LegalizerHelper Helper(B.getMF(), ApplyBank, B);
1113 LLT Part64, Part32;
1114 std::tie(Part64, Part32) = splitUnequalType(LoadTy, 64);
1115 if (Helper.reduceLoadStoreWidth(cast<GAnyLoad>(MI), 0, Part64) !=
1117 return false;
1118 return true;
1119 }
1120 LLT WiderTy = widen96To128(LoadTy);
1121 auto WideLoad = B.buildLoadFromOffset(WiderTy, PtrReg, *MMO, 0);
1122 if (WiderTy.isScalar()) {
1123 B.buildTrunc(MI.getOperand(0), WideLoad);
1124 } else {
1125 B.buildDeleteTrailingVectorElements(MI.getOperand(0).getReg(),
1126 WideLoad);
1127 }
1128 }
1129
1130 MI.eraseFromParent();
1131 return true;
1132 }
1133
1134 // 128-bit loads are supported for all instruction types.
1135 if (LoadSize <= MaxNonSmrdLoadSize)
1136 return false;
1137
1138 SmallVector<Register, 16> DefRegs(OpdMapper.getVRegs(0));
1139 SmallVector<Register, 1> SrcRegs(OpdMapper.getVRegs(1));
1140
1141 if (SrcRegs.empty())
1142 SrcRegs.push_back(MI.getOperand(1).getReg());
1143
1144 assert(LoadSize % MaxNonSmrdLoadSize == 0);
1145
1146 // RegBankSelect only emits scalar types, so we need to reset the pointer
1147 // operand to a pointer type.
1148 Register BasePtrReg = SrcRegs[0];
1149 LLT PtrTy = MRI.getType(MI.getOperand(1).getReg());
1150 MRI.setType(BasePtrReg, PtrTy);
1151
1152 unsigned NumSplitParts = LoadTy.getSizeInBits() / MaxNonSmrdLoadSize;
1153 const LLT LoadSplitTy = LoadTy.divide(NumSplitParts);
1154 ApplyRegBankMapping O(B, *this, MRI, &AMDGPU::VGPRRegBank);
1155 LegalizerHelper Helper(B.getMF(), O, B);
1156
1157 if (LoadTy.isVector()) {
1158 if (Helper.fewerElementsVector(MI, 0, LoadSplitTy) != LegalizerHelper::Legalized)
1159 return false;
1160 } else {
1161 if (Helper.narrowScalar(MI, 0, LoadSplitTy) != LegalizerHelper::Legalized)
1162 return false;
1163 }
1164
1165 MRI.setRegBank(DstReg, AMDGPU::VGPRRegBank);
1166 return true;
1167}
1168
1172 MachineInstr &MI) const {
1173 MachineRegisterInfo &MRI = *B.getMRI();
1174 const MachineFunction &MF = B.getMF();
1175 const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
1176 const auto &TFI = *ST.getFrameLowering();
1177
1178 // Guard in case the stack growth direction ever changes with scratch
1179 // instructions.
1180 if (TFI.getStackGrowthDirection() == TargetFrameLowering::StackGrowsDown)
1181 return false;
1182
1183 Register Dst = MI.getOperand(0).getReg();
1184 Register AllocSize = MI.getOperand(1).getReg();
1185 Align Alignment = assumeAligned(MI.getOperand(2).getImm());
1186
1187 const RegisterBank *SizeBank = getRegBank(AllocSize, MRI, *TRI);
1188
1189 // TODO: Need to emit a wave reduction to get the maximum size.
1190 if (SizeBank != &AMDGPU::SGPRRegBank)
1191 return false;
1192
1193 LLT PtrTy = MRI.getType(Dst);
1194 LLT IntPtrTy = LLT::scalar(PtrTy.getSizeInBits());
1195
1197 Register SPReg = Info->getStackPtrOffsetReg();
1198 ApplyRegBankMapping ApplyBank(B, *this, MRI, &AMDGPU::SGPRRegBank);
1199
1200 auto WaveSize = B.buildConstant(LLT::scalar(32), ST.getWavefrontSizeLog2());
1201 auto ScaledSize = B.buildShl(IntPtrTy, AllocSize, WaveSize);
1202
1203 auto SPCopy = B.buildCopy(PtrTy, SPReg);
1204 if (Alignment > TFI.getStackAlign()) {
1205 auto PtrAdd = B.buildPtrAdd(PtrTy, SPCopy, ScaledSize);
1206 B.buildMaskLowPtrBits(Dst, PtrAdd,
1207 Log2(Alignment) + ST.getWavefrontSizeLog2());
1208 } else {
1209 B.buildPtrAdd(Dst, SPCopy, ScaledSize);
1210 }
1211
1212 MI.eraseFromParent();
1213 return true;
1214}
1215
1219 int RsrcIdx) const {
1220 const int NumDefs = MI.getNumExplicitDefs();
1221
1222 // The reported argument index is relative to the IR intrinsic call arguments,
1223 // so we need to shift by the number of defs and the intrinsic ID.
1224 RsrcIdx += NumDefs + 1;
1225
1226 // Insert copies to VGPR arguments.
1227 applyDefaultMapping(OpdMapper);
1228
1229 // Fixup any SGPR arguments.
1230 SmallVector<unsigned, 4> SGPRIndexes;
1231 for (int I = NumDefs, NumOps = MI.getNumOperands(); I != NumOps; ++I) {
1232 if (!MI.getOperand(I).isReg())
1233 continue;
1234
1235 // If this intrinsic has a sampler, it immediately follows rsrc.
1236 if (I == RsrcIdx || I == RsrcIdx + 1)
1237 SGPRIndexes.push_back(I);
1238 }
1239
1240 executeInWaterfallLoop(B, MI, SGPRIndexes);
1241 return true;
1242}
1243
1244// Analyze a combined offset from an llvm.amdgcn.s.buffer intrinsic and store
1245// the three offsets (voffset, soffset and instoffset)
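// For example (illustrative): a purely constant combined offset that
// splitMUBUFOffset accepts is lowered to a zero voffset, an soffset register
// materialized from whatever part of the constant does not fit in the
// immediate field, and the remainder as the immediate instoffset.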
1247 MachineIRBuilder &B, Register CombinedOffset, Register &VOffsetReg,
1248 Register &SOffsetReg, int64_t &InstOffsetVal, Align Alignment) const {
1249 const LLT S32 = LLT::scalar(32);
1250 MachineRegisterInfo *MRI = B.getMRI();
1251
1252 if (std::optional<int64_t> Imm =
1253 getIConstantVRegSExtVal(CombinedOffset, *MRI)) {
1254 uint32_t SOffset, ImmOffset;
1255 if (TII->splitMUBUFOffset(*Imm, SOffset, ImmOffset, Alignment)) {
1256 VOffsetReg = B.buildConstant(S32, 0).getReg(0);
1257 SOffsetReg = B.buildConstant(S32, SOffset).getReg(0);
1258 InstOffsetVal = ImmOffset;
1259
1260 B.getMRI()->setRegBank(VOffsetReg, AMDGPU::VGPRRegBank);
1261 B.getMRI()->setRegBank(SOffsetReg, AMDGPU::SGPRRegBank);
1262 return SOffset + ImmOffset;
1263 }
1264 }
1265
1266 Register Base;
1267 unsigned Offset;
1268
1269 std::tie(Base, Offset) =
1270 AMDGPU::getBaseWithConstantOffset(*MRI, CombinedOffset);
1271
1272 uint32_t SOffset, ImmOffset;
1273 if ((int)Offset > 0 &&
1274 TII->splitMUBUFOffset(Offset, SOffset, ImmOffset, Alignment)) {
1275 if (getRegBank(Base, *MRI, *TRI) == &AMDGPU::VGPRRegBank) {
1276 VOffsetReg = Base;
1277 SOffsetReg = B.buildConstant(S32, SOffset).getReg(0);
1278 B.getMRI()->setRegBank(SOffsetReg, AMDGPU::SGPRRegBank);
1279 InstOffsetVal = ImmOffset;
1280 return 0; // XXX - Why is this 0?
1281 }
1282
1283 // If we have SGPR base, we can use it for soffset.
1284 if (SOffset == 0) {
1285 VOffsetReg = B.buildConstant(S32, 0).getReg(0);
1286 B.getMRI()->setRegBank(VOffsetReg, AMDGPU::VGPRRegBank);
1287 SOffsetReg = Base;
1288 InstOffsetVal = ImmOffset;
1289 return 0; // XXX - Why is this 0?
1290 }
1291 }
1292
1293 // Handle the variable sgpr + vgpr case.
1294 MachineInstr *Add = getOpcodeDef(AMDGPU::G_ADD, CombinedOffset, *MRI);
1295 if (Add && (int)Offset >= 0) {
1296 Register Src0 = getSrcRegIgnoringCopies(Add->getOperand(1).getReg(), *MRI);
1297 Register Src1 = getSrcRegIgnoringCopies(Add->getOperand(2).getReg(), *MRI);
1298
1299 const RegisterBank *Src0Bank = getRegBank(Src0, *MRI, *TRI);
1300 const RegisterBank *Src1Bank = getRegBank(Src1, *MRI, *TRI);
1301
1302 if (Src0Bank == &AMDGPU::VGPRRegBank && Src1Bank == &AMDGPU::SGPRRegBank) {
1303 VOffsetReg = Src0;
1304 SOffsetReg = Src1;
1305 return 0;
1306 }
1307
1308 if (Src0Bank == &AMDGPU::SGPRRegBank && Src1Bank == &AMDGPU::VGPRRegBank) {
1309 VOffsetReg = Src1;
1310 SOffsetReg = Src0;
1311 return 0;
1312 }
1313 }
1314
1315 // Ensure we have a VGPR for the combined offset. This could be an issue if we
1316 // have an SGPR offset and a VGPR resource.
1317 if (getRegBank(CombinedOffset, *MRI, *TRI) == &AMDGPU::VGPRRegBank) {
1318 VOffsetReg = CombinedOffset;
1319 } else {
1320 VOffsetReg = B.buildCopy(S32, CombinedOffset).getReg(0);
1321 B.getMRI()->setRegBank(VOffsetReg, AMDGPU::VGPRRegBank);
1322 }
1323
1324 SOffsetReg = B.buildConstant(S32, 0).getReg(0);
1325 B.getMRI()->setRegBank(SOffsetReg, AMDGPU::SGPRRegBank);
1326 return 0;
1327}
1328
1330 MachineIRBuilder &B, const OperandsMapper &OpdMapper) const {
1331 MachineInstr &MI = OpdMapper.getMI();
1332 MachineRegisterInfo &MRI = OpdMapper.getMRI();
1333
1334 const LLT S32 = LLT::scalar(32);
1335 Register Dst = MI.getOperand(0).getReg();
1336 LLT Ty = MRI.getType(Dst);
1337
1338 const RegisterBank *RSrcBank =
1339 OpdMapper.getInstrMapping().getOperandMapping(1).BreakDown[0].RegBank;
1340 const RegisterBank *OffsetBank =
1341 OpdMapper.getInstrMapping().getOperandMapping(2).BreakDown[0].RegBank;
1342 if (RSrcBank == &AMDGPU::SGPRRegBank &&
1343 OffsetBank == &AMDGPU::SGPRRegBank)
1344 return true; // Legal mapping
1345
1346 // FIXME: 96-bit case was widened during legalize. We need to narrow it back
1347 // here but don't have an MMO.
1348
1349 unsigned LoadSize = Ty.getSizeInBits();
1350 int NumLoads = 1;
1351 if (LoadSize == 256 || LoadSize == 512) {
1352 NumLoads = LoadSize / 128;
1353 Ty = Ty.divide(NumLoads);
1354 }
1355
1356 // Use the alignment to ensure that the required offsets will fit into the
1357 // immediate offsets.
1358 const Align Alignment = NumLoads > 1 ? Align(16 * NumLoads) : Align(1);
1359
1360 MachineFunction &MF = B.getMF();
1361
1362 Register SOffset;
1363 Register VOffset;
1364 int64_t ImmOffset = 0;
1365
1366 unsigned MMOOffset = setBufferOffsets(B, MI.getOperand(2).getReg(), VOffset,
1367 SOffset, ImmOffset, Alignment);
1368
1369 // TODO: 96-bit loads were widened to 128-bit results. Shrink the result if we
1370 // can, but we need to track an MMO for that.
1371 const unsigned MemSize = (Ty.getSizeInBits() + 7) / 8;
1372 const Align MemAlign(4); // FIXME: ABI type alignment?
1377 MemSize, MemAlign);
1378 if (MMOOffset != 0)
1379 BaseMMO = MF.getMachineMemOperand(BaseMMO, MMOOffset, MemSize);
1380
1381 // If only the offset is divergent, emit a MUBUF buffer load instead. We can
1382 // assume that the buffer is unswizzled.
1383
1384 Register RSrc = MI.getOperand(1).getReg();
1385 Register VIndex = B.buildConstant(S32, 0).getReg(0);
1386 B.getMRI()->setRegBank(VIndex, AMDGPU::VGPRRegBank);
1387
1388 SmallVector<Register, 4> LoadParts(NumLoads);
1389
1390 MachineBasicBlock::iterator MII = MI.getIterator();
1391 MachineInstrSpan Span(MII, &B.getMBB());
1392
1393 for (int i = 0; i < NumLoads; ++i) {
1394 if (NumLoads == 1) {
1395 LoadParts[i] = Dst;
1396 } else {
1397 LoadParts[i] = MRI.createGenericVirtualRegister(Ty);
1398 MRI.setRegBank(LoadParts[i], AMDGPU::VGPRRegBank);
1399 }
1400
1401 MachineMemOperand *MMO = BaseMMO;
1402 if (i != 0)
1403 BaseMMO = MF.getMachineMemOperand(BaseMMO, MMOOffset + 16 * i, MemSize);
1404
1405 B.buildInstr(AMDGPU::G_AMDGPU_BUFFER_LOAD)
1406 .addDef(LoadParts[i]) // vdata
1407 .addUse(RSrc) // rsrc
1408 .addUse(VIndex) // vindex
1409 .addUse(VOffset) // voffset
1410 .addUse(SOffset) // soffset
1411 .addImm(ImmOffset + 16 * i) // offset(imm)
1412 .addImm(0) // cachepolicy, swizzled buffer(imm)
1413 .addImm(0) // idxen(imm)
1414 .addMemOperand(MMO);
1415 }
1416
1417 // TODO: If only the resource is a VGPR, it may be better to execute the
1418 // scalar load in the waterfall loop if the resource is expected to frequently
1419 // be dynamically uniform.
1420 if (RSrcBank != &AMDGPU::SGPRRegBank) {
1421 // Remove the original instruction to avoid potentially confusing the
1422 // waterfall loop logic.
1423 B.setInstr(*Span.begin());
1424 MI.eraseFromParent();
1425
1426 SmallSet<Register, 4> OpsToWaterfall;
1427
1428 OpsToWaterfall.insert(RSrc);
1429 executeInWaterfallLoop(B, make_range(Span.begin(), Span.end()),
1430 OpsToWaterfall);
1431 }
1432
1433 if (NumLoads != 1) {
1434 if (Ty.isVector())
1435 B.buildConcatVectors(Dst, LoadParts);
1436 else
1437 B.buildMergeLikeInstr(Dst, LoadParts);
1438 }
1439
1440 // We removed the instruction earlier with a waterfall loop.
1441 if (RSrcBank == &AMDGPU::SGPRRegBank)
1442 MI.eraseFromParent();
1443
1444 return true;
1445}
1446
1448 const OperandsMapper &OpdMapper,
1449 bool Signed) const {
1450 MachineInstr &MI = OpdMapper.getMI();
1451 MachineRegisterInfo &MRI = OpdMapper.getMRI();
1452
1453 // Insert basic copies
1454 applyDefaultMapping(OpdMapper);
1455
1456 Register DstReg = MI.getOperand(0).getReg();
1457 LLT Ty = MRI.getType(DstReg);
1458
1459 const LLT S32 = LLT::scalar(32);
1460
1461 unsigned FirstOpnd = isa<GIntrinsic>(MI) ? 2 : 1;
1462 Register SrcReg = MI.getOperand(FirstOpnd).getReg();
1463 Register OffsetReg = MI.getOperand(FirstOpnd + 1).getReg();
1464 Register WidthReg = MI.getOperand(FirstOpnd + 2).getReg();
1465
1466 const RegisterBank *DstBank =
1467 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
1468 if (DstBank == &AMDGPU::VGPRRegBank) {
1469 if (Ty == S32)
1470 return true;
1471
1472 // There are no 64-bit vgpr bitfield extract instructions, so the operation
1473 // is expanded to a sequence of instructions that implement the operation.
1474 ApplyRegBankMapping ApplyBank(B, *this, MRI, &AMDGPU::VGPRRegBank);
1475
1476 const LLT S64 = LLT::scalar(64);
1477 // Shift the source operand so that extracted bits start at bit 0.
1478 auto ShiftOffset = Signed ? B.buildAShr(S64, SrcReg, OffsetReg)
1479 : B.buildLShr(S64, SrcReg, OffsetReg);
1480 auto UnmergeSOffset = B.buildUnmerge({S32, S32}, ShiftOffset);
1481
1482 // A 64-bit bitfield extract uses the 32-bit bitfield extract instructions
1483 // if the width is a constant.
1484 if (auto ConstWidth = getIConstantVRegValWithLookThrough(WidthReg, MRI)) {
1485 // Use the 32-bit bitfield extract instruction if the width is a constant.
1486 // Depending on the width size, use either the low or high 32-bits.
1487 auto Zero = B.buildConstant(S32, 0);
1488 auto WidthImm = ConstWidth->Value.getZExtValue();
1489 if (WidthImm <= 32) {
1490 // Use bitfield extract on the lower 32-bit source, and then sign-extend
1491 // or clear the upper 32-bits.
1492 auto Extract =
1493 Signed ? B.buildSbfx(S32, UnmergeSOffset.getReg(0), Zero, WidthReg)
1494 : B.buildUbfx(S32, UnmergeSOffset.getReg(0), Zero, WidthReg);
1495 auto Extend =
1496 Signed ? B.buildAShr(S32, Extract, B.buildConstant(S32, 31)) : Zero;
1497 B.buildMergeLikeInstr(DstReg, {Extract, Extend});
1498 } else {
1499 // Use bitfield extract on upper 32-bit source, and combine with lower
1500 // 32-bit source.
1501 auto UpperWidth = B.buildConstant(S32, WidthImm - 32);
1502 auto Extract =
1503 Signed
1504 ? B.buildSbfx(S32, UnmergeSOffset.getReg(1), Zero, UpperWidth)
1505 : B.buildUbfx(S32, UnmergeSOffset.getReg(1), Zero, UpperWidth);
1506 B.buildMergeLikeInstr(DstReg, {UnmergeSOffset.getReg(0), Extract});
1507 }
1508 MI.eraseFromParent();
1509 return true;
1510 }
1511
1512 // Expand to Src >> Offset << (64 - Width) >> (64 - Width) using 64-bit
1513 // operations.
1514 auto ExtShift = B.buildSub(S32, B.buildConstant(S32, 64), WidthReg);
1515 auto SignBit = B.buildShl(S64, ShiftOffset, ExtShift);
1516 if (Signed)
1517 B.buildAShr(S64, SignBit, ExtShift);
1518 else
1519 B.buildLShr(S64, SignBit, ExtShift);
1520 MI.eraseFromParent();
1521 return true;
1522 }
1523
1524 // The scalar form packs the offset and width in a single operand.
1525
1526 ApplyRegBankMapping ApplyBank(B, *this, MRI, &AMDGPU::SGPRRegBank);
1527
1528 // Ensure the high bits are clear to insert the offset.
1529 auto OffsetMask = B.buildConstant(S32, maskTrailingOnes<unsigned>(6));
1530 auto ClampOffset = B.buildAnd(S32, OffsetReg, OffsetMask);
1531
1532 // The left shift zeros out the low bits, so don't bother clamping the input value.
1533 auto ShiftWidth = B.buildShl(S32, WidthReg, B.buildConstant(S32, 16));
1534
1535 // Pack the offset and width of the BFE into the format expected by the
1536 // S_BFE_I32 / S_BFE_U32 instructions. In the second
1537 // source, bits [5:0] contain the offset and bits [22:16] the width.
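 // For example, an offset of 8 and a width of 4 pack to
 // (4 << 16) | 8 = 0x00040008.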
1538 auto MergedInputs = B.buildOr(S32, ClampOffset, ShiftWidth);
1539
1540 // TODO: It might be worth using a pseudo here to avoid scc clobber and
1541 // register class constraints.
1542 unsigned Opc = Ty == S32 ? (Signed ? AMDGPU::S_BFE_I32 : AMDGPU::S_BFE_U32) :
1543 (Signed ? AMDGPU::S_BFE_I64 : AMDGPU::S_BFE_U64);
1544
1545 auto MIB = B.buildInstr(Opc, {DstReg}, {SrcReg, MergedInputs});
1546 if (!constrainSelectedInstRegOperands(*MIB, *TII, *TRI, *this))
1547 llvm_unreachable("failed to constrain BFE");
1548
1549 MI.eraseFromParent();
1550 return true;
1551}
1552
1554 MachineIRBuilder &B, const OperandsMapper &OpdMapper) const {
1555 MachineInstr &MI = OpdMapper.getMI();
1556 MachineRegisterInfo &MRI = OpdMapper.getMRI();
1557
1558 // Insert basic copies.
1559 applyDefaultMapping(OpdMapper);
1560
1561 Register Dst0 = MI.getOperand(0).getReg();
1562 Register Dst1 = MI.getOperand(1).getReg();
1563 Register Src0 = MI.getOperand(2).getReg();
1564 Register Src1 = MI.getOperand(3).getReg();
1565 Register Src2 = MI.getOperand(4).getReg();
1566
1567 if (MRI.getRegBankOrNull(Src0) == &AMDGPU::VGPRRegBank)
1568 return true;
1569
1570 bool IsUnsigned = MI.getOpcode() == AMDGPU::G_AMDGPU_MAD_U64_U32;
1571 LLT S1 = LLT::scalar(1);
1572 LLT S32 = LLT::scalar(32);
1573
1574 bool DstOnValu = MRI.getRegBankOrNull(Src2) == &AMDGPU::VGPRRegBank;
1575 bool Accumulate = true;
1576
1577 if (!DstOnValu) {
1578 if (mi_match(Src2, MRI, m_ZeroInt()))
1579 Accumulate = false;
1580 }
1581
1582 // Keep the multiplication on the SALU.
1583 Register DstHi;
1584 Register DstLo = B.buildMul(S32, Src0, Src1).getReg(0);
1585 bool MulHiInVgpr = false;
1586
1587 MRI.setRegBank(DstLo, AMDGPU::SGPRRegBank);
1588
1589 if (Subtarget.hasSMulHi()) {
1590 DstHi = IsUnsigned ? B.buildUMulH(S32, Src0, Src1).getReg(0)
1591 : B.buildSMulH(S32, Src0, Src1).getReg(0);
1592 MRI.setRegBank(DstHi, AMDGPU::SGPRRegBank);
1593 } else {
1594 Register VSrc0 = B.buildCopy(S32, Src0).getReg(0);
1595 Register VSrc1 = B.buildCopy(S32, Src1).getReg(0);
1596
1597 MRI.setRegBank(VSrc0, AMDGPU::VGPRRegBank);
1598 MRI.setRegBank(VSrc1, AMDGPU::VGPRRegBank);
1599
1600 DstHi = IsUnsigned ? B.buildUMulH(S32, VSrc0, VSrc1).getReg(0)
1601 : B.buildSMulH(S32, VSrc0, VSrc1).getReg(0);
1602 MRI.setRegBank(DstHi, AMDGPU::VGPRRegBank);
1603
1604 if (!DstOnValu) {
1605 DstHi = buildReadFirstLane(B, MRI, DstHi);
1606 } else {
1607 MulHiInVgpr = true;
1608 }
1609 }
1610
1611 // Accumulate and produce the "carry-out" bit.
1612 //
1613 // The "carry-out" is defined as bit 64 of the result when computed as a
1614 // big integer. For unsigned multiply-add, this matches the usual definition
1615 // of carry-out. For signed multiply-add, bit 64 is the sign bit of the
1616 // result, which is determined as:
1617 // sign(Src0 * Src1) + sign(Src2) + carry-out from unsigned 64-bit add
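  // Worked example (illustrative): in the unsigned case, Src0 = Src1 =
  // 0xFFFFFFFF gives Src0 * Src1 = 0xFFFFFFFE00000001; adding
  // Src2 = 0x0000000200000000 yields 0x1'0000000000000001, so bit 64 (the
  // carry-out) is 1.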
1618 LLT CarryType = DstOnValu ? S1 : S32;
1619 const RegisterBank &CarryBank =
1620 DstOnValu ? AMDGPU::VCCRegBank : AMDGPU::SGPRRegBank;
1621 const RegisterBank &DstBank =
1622 DstOnValu ? AMDGPU::VGPRRegBank : AMDGPU::SGPRRegBank;
1623 Register Carry;
1624 Register Zero;
1625
1626 if (!IsUnsigned) {
1627 Zero = B.buildConstant(S32, 0).getReg(0);
1628 MRI.setRegBank(Zero,
1629 MulHiInVgpr ? AMDGPU::VGPRRegBank : AMDGPU::SGPRRegBank);
1630
1631 Carry = B.buildICmp(CmpInst::ICMP_SLT, MulHiInVgpr ? S1 : S32, DstHi, Zero)
1632 .getReg(0);
1633 MRI.setRegBank(Carry, MulHiInVgpr ? AMDGPU::VCCRegBank
1634 : AMDGPU::SGPRRegBank);
1635
1636 if (DstOnValu && !MulHiInVgpr) {
1637 Carry = B.buildTrunc(S1, Carry).getReg(0);
1638 MRI.setRegBank(Carry, AMDGPU::VCCRegBank);
1639 }
1640 }
1641
1642 if (Accumulate) {
1643 if (DstOnValu) {
1644 DstLo = B.buildCopy(S32, DstLo).getReg(0);
1645 DstHi = B.buildCopy(S32, DstHi).getReg(0);
1646 MRI.setRegBank(DstLo, AMDGPU::VGPRRegBank);
1647 MRI.setRegBank(DstHi, AMDGPU::VGPRRegBank);
1648 }
1649
1650 auto Unmerge = B.buildUnmerge(S32, Src2);
1651 Register Src2Lo = Unmerge.getReg(0);
1652 Register Src2Hi = Unmerge.getReg(1);
1653 MRI.setRegBank(Src2Lo, DstBank);
1654 MRI.setRegBank(Src2Hi, DstBank);
1655
1656 if (!IsUnsigned) {
1657 auto Src2Sign = B.buildICmp(CmpInst::ICMP_SLT, CarryType, Src2Hi, Zero);
1658 MRI.setRegBank(Src2Sign.getReg(0), CarryBank);
1659
1660 Carry = B.buildXor(CarryType, Carry, Src2Sign).getReg(0);
1661 MRI.setRegBank(Carry, CarryBank);
1662 }
1663
1664 auto AddLo = B.buildUAddo(S32, CarryType, DstLo, Src2Lo);
1665 DstLo = AddLo.getReg(0);
1666 Register CarryLo = AddLo.getReg(1);
1667 MRI.setRegBank(DstLo, DstBank);
1668 MRI.setRegBank(CarryLo, CarryBank);
1669
1670 auto AddHi = B.buildUAdde(S32, CarryType, DstHi, Src2Hi, CarryLo);
1671 DstHi = AddHi.getReg(0);
1672 MRI.setRegBank(DstHi, DstBank);
1673
1674 Register CarryHi = AddHi.getReg(1);
1675 MRI.setRegBank(CarryHi, CarryBank);
1676
1677 if (IsUnsigned) {
1678 Carry = CarryHi;
1679 } else {
1680 Carry = B.buildXor(CarryType, Carry, CarryHi).getReg(0);
1681 MRI.setRegBank(Carry, CarryBank);
1682 }
1683 } else {
1684 if (IsUnsigned) {
1685 Carry = B.buildConstant(CarryType, 0).getReg(0);
1686 MRI.setRegBank(Carry, CarryBank);
1687 }
1688 }
1689
1690 B.buildMergeLikeInstr(Dst0, {DstLo, DstHi});
1691
1692 if (DstOnValu) {
1693 B.buildCopy(Dst1, Carry);
1694 } else {
1695 B.buildTrunc(Dst1, Carry);
1696 }
1697
1698 MI.eraseFromParent();
1699 return true;
1700}
1701
1702// Return a suitable opcode for extending the operands of Opc when widening.
1703static unsigned getExtendOp(unsigned Opc) {
1704 switch (Opc) {
1705 case TargetOpcode::G_ASHR:
1706 case TargetOpcode::G_SMIN:
1707 case TargetOpcode::G_SMAX:
1708 return TargetOpcode::G_SEXT;
1709 case TargetOpcode::G_LSHR:
1710 case TargetOpcode::G_UMIN:
1711 case TargetOpcode::G_UMAX:
1712 return TargetOpcode::G_ZEXT;
1713 default:
1714 return TargetOpcode::G_ANYEXT;
1715 }
1716}
1717
1718// Emit a legalized extension from <2 x s16> to 2 32-bit components, avoiding
1719// any illegal vector extend or unmerge operations.
1720static std::pair<Register, Register>
1721unpackV2S16ToS32(MachineIRBuilder &B, Register Src, unsigned ExtOpcode) {
1722 const LLT S32 = LLT::scalar(32);
1723 auto Bitcast = B.buildBitcast(S32, Src);
1724
1725 if (ExtOpcode == TargetOpcode::G_SEXT) {
1726 auto ExtLo = B.buildSExtInReg(S32, Bitcast, 16);
1727 auto ShiftHi = B.buildAShr(S32, Bitcast, B.buildConstant(S32, 16));
1728 return std::pair(ExtLo.getReg(0), ShiftHi.getReg(0));
1729 }
1730
1731 auto ShiftHi = B.buildLShr(S32, Bitcast, B.buildConstant(S32, 16));
1732 if (ExtOpcode == TargetOpcode::G_ZEXT) {
1733 auto ExtLo = B.buildAnd(S32, Bitcast, B.buildConstant(S32, 0xffff));
1734 return std::pair(ExtLo.getReg(0), ShiftHi.getReg(0));
1735 }
1736
1737 assert(ExtOpcode == TargetOpcode::G_ANYEXT);
1738 return std::pair(Bitcast.getReg(0), ShiftHi.getReg(0));
1739}
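// Worked example for the helper above (illustrative): if Src bitcasts to
// 0xFFFF8001, i.e. <2 x s16> elements {0x8001, 0xFFFF}, G_ZEXT unpacking
// yields {0x00008001, 0x0000FFFF} and G_SEXT unpacking yields
// {0xFFFF8001, 0xFFFFFFFF}.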
1740
1741// For cases where only a single copy is inserted for matching register banks,
1742// replace the register in the instruction operand.
1743static bool substituteSimpleCopyRegs(
1744 const AMDGPURegisterBankInfo::OperandsMapper &OpdMapper, unsigned OpIdx) {
1745 SmallVector<unsigned, 1> SrcReg(OpdMapper.getVRegs(OpIdx));
1746 if (!SrcReg.empty()) {
1747 assert(SrcReg.size() == 1);
1748 OpdMapper.getMI().getOperand(OpIdx).setReg(SrcReg[0]);
1749 return true;
1750 }
1751
1752 return false;
1753}
1754
1755/// Handle register layout difference for f16 images for some subtargets.
1756Register AMDGPURegisterBankInfo::handleD16VData(MachineIRBuilder &B,
1757 MachineRegisterInfo &MRI,
1758 Register Reg) const {
1759 if (!Subtarget.hasUnpackedD16VMem())
1760 return Reg;
1761
1762 const LLT S16 = LLT::scalar(16);
1763 LLT StoreVT = MRI.getType(Reg);
1764 if (!StoreVT.isVector() || StoreVT.getElementType() != S16)
1765 return Reg;
1766
1767 auto Unmerge = B.buildUnmerge(S16, Reg);
1768
1769
1770 SmallVector<Register, 4> WideRegs;
1771 for (int I = 0, E = Unmerge->getNumOperands() - 1; I != E; ++I)
1772 WideRegs.push_back(Unmerge.getReg(I));
1773
1774 const LLT S32 = LLT::scalar(32);
1775 int NumElts = StoreVT.getNumElements();
1776
1777 return B.buildMergeLikeInstr(LLT::fixed_vector(NumElts, S32), WideRegs)
1778 .getReg(0);
1779}
1780
1781static std::pair<Register, unsigned>
1782getBaseWithConstantOffset(MachineRegisterInfo &MRI, Register Reg) {
1783 int64_t Const;
1784 if (mi_match(Reg, MRI, m_ICst(Const)))
1785 return std::pair(Register(), Const);
1786
1787 Register Base;
1788 if (mi_match(Reg, MRI, m_GAdd(m_Reg(Base), m_ICst(Const))))
1789 return std::pair(Base, Const);
1790
1791 // TODO: Handle G_OR used for add case
1792 return std::pair(Reg, 0);
1793}
1794
1795std::pair<Register, unsigned>
1796AMDGPURegisterBankInfo::splitBufferOffsets(MachineIRBuilder &B,
1797 Register OrigOffset) const {
1798 const unsigned MaxImm = SIInstrInfo::getMaxMUBUFImmOffset(Subtarget);
1799 Register BaseReg;
1800 unsigned ImmOffset;
1801 const LLT S32 = LLT::scalar(32);
1802
1803 // TODO: Use AMDGPU::getBaseWithConstantOffset() instead.
1804 std::tie(BaseReg, ImmOffset) = getBaseWithConstantOffset(*B.getMRI(),
1805 OrigOffset);
1806
1807 unsigned C1 = 0;
1808 if (ImmOffset != 0) {
1809 // If the immediate value is too big for the immoffset field, put only bits
1810 // that would normally fit in the immoffset field. The remaining value that
1811 // is copied/added for the voffset field is a large power of 2, and it
1812 // stands more chance of being CSEd with the copy/add for another similar
1813 // load/store.
1814 // However, do not do that rounding down if that is a negative
1815 // number, as it appears to be illegal to have a negative offset in the
1816 // vgpr, even if adding the immediate offset makes it positive.
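    // Worked example (illustrative, assuming a 4095 maximum immediate):
    // ImmOffset = 5000 splits into Overflow = 4096, which is materialized or
    // added into the voffset base register, leaving an immoffset of 904.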
1817 unsigned Overflow = ImmOffset & ~MaxImm;
1818 ImmOffset -= Overflow;
1819 if ((int32_t)Overflow < 0) {
1820 Overflow += ImmOffset;
1821 ImmOffset = 0;
1822 }
1823
1824 C1 = ImmOffset;
1825 if (Overflow != 0) {
1826 if (!BaseReg)
1827 BaseReg = B.buildConstant(S32, Overflow).getReg(0);
1828 else {
1829 auto OverflowVal = B.buildConstant(S32, Overflow);
1830 BaseReg = B.buildAdd(S32, BaseReg, OverflowVal).getReg(0);
1831 }
1832 }
1833 }
1834
1835 if (!BaseReg)
1836 BaseReg = B.buildConstant(S32, 0).getReg(0);
1837
1838 return {BaseReg, C1};
1839}
1840
1841bool AMDGPURegisterBankInfo::buildVCopy(MachineIRBuilder &B, Register DstReg,
1842 Register SrcReg) const {
1843 MachineRegisterInfo &MRI = *B.getMRI();
1844 LLT SrcTy = MRI.getType(SrcReg);
1845 if (SrcTy.getSizeInBits() == 32) {
1846 // Use a v_mov_b32 here to make the exec dependency explicit.
1847 B.buildInstr(AMDGPU::V_MOV_B32_e32)
1848 .addDef(DstReg)
1849 .addUse(SrcReg);
1850 return constrainGenericRegister(DstReg, AMDGPU::VGPR_32RegClass, MRI) &&
1851 constrainGenericRegister(SrcReg, AMDGPU::SReg_32RegClass, MRI);
1852 }
1853
1854 Register TmpReg0 = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
1855 Register TmpReg1 = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
1856
1857 B.buildInstr(AMDGPU::V_MOV_B32_e32)
1858 .addDef(TmpReg0)
1859 .addUse(SrcReg, 0, AMDGPU::sub0);
1860 B.buildInstr(AMDGPU::V_MOV_B32_e32)
1861 .addDef(TmpReg1)
1862 .addUse(SrcReg, 0, AMDGPU::sub1);
1863 B.buildInstr(AMDGPU::REG_SEQUENCE)
1864 .addDef(DstReg)
1865 .addUse(TmpReg0)
1866 .addImm(AMDGPU::sub0)
1867 .addUse(TmpReg1)
1868 .addImm(AMDGPU::sub1);
1869
1870 return constrainGenericRegister(SrcReg, AMDGPU::SReg_64RegClass, MRI) &&
1871 constrainGenericRegister(DstReg, AMDGPU::VReg_64RegClass, MRI);
1872}
1873
1874/// Utility function for pushing dynamic vector indexes with a constant offset
1875/// into waterfall loops.
1876static void reinsertVectorIndexAdd(MachineIRBuilder &B,
1877 MachineInstr &IdxUseInstr,
1878 unsigned OpIdx,
1879 unsigned ConstOffset) {
1880 MachineRegisterInfo &MRI = *B.getMRI();
1881 const LLT S32 = LLT::scalar(32);
1882 Register WaterfallIdx = IdxUseInstr.getOperand(OpIdx).getReg();
1883 B.setInsertPt(*IdxUseInstr.getParent(), IdxUseInstr.getIterator());
1884
1885 auto MaterializedOffset = B.buildConstant(S32, ConstOffset);
1886
1887 auto Add = B.buildAdd(S32, WaterfallIdx, MaterializedOffset);
1888 MRI.setRegBank(MaterializedOffset.getReg(0), AMDGPU::SGPRRegBank);
1889 MRI.setRegBank(Add.getReg(0), AMDGPU::SGPRRegBank);
1890 IdxUseInstr.getOperand(OpIdx).setReg(Add.getReg(0));
1891}
1892
1893/// Implement extending a 32-bit value to a 64-bit value. \p Lo32Reg is the
1894/// original 32-bit source value (to be inserted in the low part of the combined
1895/// 64-bit result), and \p Hi32Reg is the high half of the combined 64-bit
1896/// value.
1897static void extendLow32IntoHigh32(MachineIRBuilder &B,
1898 Register Hi32Reg, Register Lo32Reg,
1899 unsigned ExtOpc,
1900 const RegisterBank &RegBank,
1901 bool IsBooleanSrc = false) {
1902 if (ExtOpc == AMDGPU::G_ZEXT) {
1903 B.buildConstant(Hi32Reg, 0);
1904 } else if (ExtOpc == AMDGPU::G_SEXT) {
1905 if (IsBooleanSrc) {
1906 // If we know the original source was an s1, the high half is the same as
1907 // the low.
1908 B.buildCopy(Hi32Reg, Lo32Reg);
1909 } else {
1910 // Replicate sign bit from 32-bit extended part.
1911 auto ShiftAmt = B.buildConstant(LLT::scalar(32), 31);
1912 B.getMRI()->setRegBank(ShiftAmt.getReg(0), RegBank);
1913 B.buildAShr(Hi32Reg, Lo32Reg, ShiftAmt);
1914 }
1915 } else {
1916 assert(ExtOpc == AMDGPU::G_ANYEXT && "not an integer extension");
1917 B.buildUndef(Hi32Reg);
1918 }
1919}
1920
1921bool AMDGPURegisterBankInfo::foldExtractEltToCmpSelect(
1922 MachineIRBuilder &B, MachineInstr &MI,
1923 const OperandsMapper &OpdMapper) const {
1924 MachineRegisterInfo &MRI = *B.getMRI();
1925
1926 Register VecReg = MI.getOperand(1).getReg();
1927 Register Idx = MI.getOperand(2).getReg();
1928
1929 const RegisterBank &IdxBank =
1930 *OpdMapper.getInstrMapping().getOperandMapping(2).BreakDown[0].RegBank;
1931
1932 bool IsDivergentIdx = IdxBank != AMDGPU::SGPRRegBank;
1933
1934 LLT VecTy = MRI.getType(VecReg);
1935 unsigned EltSize = VecTy.getScalarSizeInBits();
1936 unsigned NumElem = VecTy.getNumElements();
1937
1938 if (!SITargetLowering::shouldExpandVectorDynExt(EltSize, NumElem,
1939 IsDivergentIdx, &Subtarget))
1940 return false;
1941
1942 LLT S32 = LLT::scalar(32);
1943
1944 const RegisterBank &DstBank =
1945 *OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
1946 const RegisterBank &SrcBank =
1947 *OpdMapper.getInstrMapping().getOperandMapping(1).BreakDown[0].RegBank;
1948
1949 const RegisterBank &CCBank =
1950 (DstBank == AMDGPU::SGPRRegBank &&
1951 SrcBank == AMDGPU::SGPRRegBank &&
1952 IdxBank == AMDGPU::SGPRRegBank) ? AMDGPU::SGPRRegBank
1953 : AMDGPU::VCCRegBank;
1954 LLT CCTy = (CCBank == AMDGPU::SGPRRegBank) ? S32 : LLT::scalar(1);
1955
1956 if (CCBank == AMDGPU::VCCRegBank && IdxBank == AMDGPU::SGPRRegBank) {
1957 Idx = B.buildCopy(S32, Idx)->getOperand(0).getReg();
1958 MRI.setRegBank(Idx, AMDGPU::VGPRRegBank);
1959 }
1960
1961 LLT EltTy = VecTy.getScalarType();
1962 SmallVector<Register, 2> DstRegs(OpdMapper.getVRegs(0));
1963 unsigned NumLanes = DstRegs.size();
1964 if (!NumLanes)
1965 NumLanes = 1;
1966 else
1967 EltTy = MRI.getType(DstRegs[0]);
1968
1969 auto UnmergeToEltTy = B.buildUnmerge(EltTy, VecReg);
1970 SmallVector<Register, 2> Res(NumLanes);
1971 for (unsigned L = 0; L < NumLanes; ++L)
1972 Res[L] = UnmergeToEltTy.getReg(L);
1973
1974 for (unsigned I = 1; I < NumElem; ++I) {
1975 auto IC = B.buildConstant(S32, I);
1976 MRI.setRegBank(IC->getOperand(0).getReg(), AMDGPU::SGPRRegBank);
1977 auto Cmp = B.buildICmp(CmpInst::ICMP_EQ, CCTy, Idx, IC);
1978 MRI.setRegBank(Cmp->getOperand(0).getReg(), CCBank);
1979
1980 for (unsigned L = 0; L < NumLanes; ++L) {
1981 auto S = B.buildSelect(EltTy, Cmp,
1982 UnmergeToEltTy.getReg(I * NumLanes + L), Res[L]);
1983
1984 for (unsigned N : { 0, 2, 3 })
1985 MRI.setRegBank(S->getOperand(N).getReg(), DstBank);
1986
1987 Res[L] = S->getOperand(0).getReg();
1988 }
1989 }
1990
1991 for (unsigned L = 0; L < NumLanes; ++L) {
1992 Register DstReg = (NumLanes == 1) ? MI.getOperand(0).getReg() : DstRegs[L];
1993 B.buildCopy(DstReg, Res[L]);
1994 MRI.setRegBank(DstReg, DstBank);
1995 }
1996
1997 MRI.setRegBank(MI.getOperand(0).getReg(), DstBank);
1998 MI.eraseFromParent();
1999
2000 return true;
2001}
2002
2003// Insert a cross regbank copy for a register if it already has a bank that
2004// differs from the one we want to set.
2005static Register constrainRegToBank(MachineRegisterInfo &MRI,
2006 MachineIRBuilder &B, Register &Reg,
2007 const RegisterBank &Bank) {
2008 const RegisterBank *CurrBank = MRI.getRegBankOrNull(Reg);
2009 if (CurrBank && *CurrBank != Bank) {
2010 Register Copy = B.buildCopy(MRI.getType(Reg), Reg).getReg(0);
2011 MRI.setRegBank(Copy, Bank);
2012 return Copy;
2013 }
2014
2015 MRI.setRegBank(Reg, Bank);
2016 return Reg;
2017}
2018
2019bool AMDGPURegisterBankInfo::foldInsertEltToCmpSelect(
2020 MachineIRBuilder &B, MachineInstr &MI,
2021 const OperandsMapper &OpdMapper) const {
2022
2023 MachineRegisterInfo &MRI = *B.getMRI();
2024 Register VecReg = MI.getOperand(1).getReg();
2025 Register Idx = MI.getOperand(3).getReg();
2026
2027 const RegisterBank &IdxBank =
2028 *OpdMapper.getInstrMapping().getOperandMapping(3).BreakDown[0].RegBank;
2029
2030 bool IsDivergentIdx = IdxBank != AMDGPU::SGPRRegBank;
2031
2032 LLT VecTy = MRI.getType(VecReg);
2033 unsigned EltSize = VecTy.getScalarSizeInBits();
2034 unsigned NumElem = VecTy.getNumElements();
2035
2036 if (!SITargetLowering::shouldExpandVectorDynExt(EltSize, NumElem,
2037 IsDivergentIdx, &Subtarget))
2038 return false;
2039
2040 LLT S32 = LLT::scalar(32);
2041
2042 const RegisterBank &DstBank =
2043 *OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2044 const RegisterBank &SrcBank =
2045 *OpdMapper.getInstrMapping().getOperandMapping(1).BreakDown[0].RegBank;
2046 const RegisterBank &InsBank =
2047 *OpdMapper.getInstrMapping().getOperandMapping(2).BreakDown[0].RegBank;
2048
2049 const RegisterBank &CCBank =
2050 (DstBank == AMDGPU::SGPRRegBank &&
2051 SrcBank == AMDGPU::SGPRRegBank &&
2052 InsBank == AMDGPU::SGPRRegBank &&
2053 IdxBank == AMDGPU::SGPRRegBank) ? AMDGPU::SGPRRegBank
2054 : AMDGPU::VCCRegBank;
2055 LLT CCTy = (CCBank == AMDGPU::SGPRRegBank) ? S32 : LLT::scalar(1);
2056
2057 if (CCBank == AMDGPU::VCCRegBank && IdxBank == AMDGPU::SGPRRegBank) {
2058 Idx = B.buildCopy(S32, Idx)->getOperand(0).getReg();
2059 MRI.setRegBank(Idx, AMDGPU::VGPRRegBank);
2060 }
2061
2062 LLT EltTy = VecTy.getScalarType();
2063 SmallVector<Register, 2> InsRegs(OpdMapper.getVRegs(2));
2064 unsigned NumLanes = InsRegs.size();
2065 if (!NumLanes) {
2066 NumLanes = 1;
2067 InsRegs.push_back(MI.getOperand(2).getReg());
2068 } else {
2069 EltTy = MRI.getType(InsRegs[0]);
2070 }
2071
2072 auto UnmergeToEltTy = B.buildUnmerge(EltTy, VecReg);
2073 SmallVector<Register, 16> Ops(NumElem * NumLanes);
2074
2075 for (unsigned I = 0; I < NumElem; ++I) {
2076 auto IC = B.buildConstant(S32, I);
2077 MRI.setRegBank(IC->getOperand(0).getReg(), AMDGPU::SGPRRegBank);
2078 auto Cmp = B.buildICmp(CmpInst::ICMP_EQ, CCTy, Idx, IC);
2079 MRI.setRegBank(Cmp->getOperand(0).getReg(), CCBank);
2080
2081 for (unsigned L = 0; L < NumLanes; ++L) {
2082 Register Op0 = constrainRegToBank(MRI, B, InsRegs[L], DstBank);
2083 Register Op1 = UnmergeToEltTy.getReg(I * NumLanes + L);
2084 Op1 = constrainRegToBank(MRI, B, Op1, DstBank);
2085
2086 Register Select = B.buildSelect(EltTy, Cmp, Op0, Op1).getReg(0);
2087 MRI.setRegBank(Select, DstBank);
2088
2089 Ops[I * NumLanes + L] = Select;
2090 }
2091 }
2092
2093 LLT MergeTy = LLT::fixed_vector(Ops.size(), EltTy);
2094 if (MergeTy == MRI.getType(MI.getOperand(0).getReg())) {
2095 B.buildBuildVector(MI.getOperand(0), Ops);
2096 } else {
2097 auto Vec = B.buildBuildVector(MergeTy, Ops);
2098 MRI.setRegBank(Vec->getOperand(0).getReg(), DstBank);
2099 B.buildBitcast(MI.getOperand(0).getReg(), Vec);
2100 }
2101
2102 MRI.setRegBank(MI.getOperand(0).getReg(), DstBank);
2103 MI.eraseFromParent();
2104
2105 return true;
2106}
2107
2108// Break s_mul_u64 into 32-bit vector operations.
2109void AMDGPURegisterBankInfo::applyMappingSMULU64(
2110 MachineIRBuilder &B, const OperandsMapper &OpdMapper) const {
2111 SmallVector<Register, 2> DefRegs(OpdMapper.getVRegs(0));
2112 SmallVector<Register, 2> Src0Regs(OpdMapper.getVRegs(1));
2113 SmallVector<Register, 2> Src1Regs(OpdMapper.getVRegs(2));
2114
2115 // All inputs are SGPRs, nothing special to do.
2116 if (DefRegs.empty()) {
2117 assert(Src0Regs.empty() && Src1Regs.empty());
2118 applyDefaultMapping(OpdMapper);
2119 return;
2120 }
2121
2122 assert(DefRegs.size() == 2);
2123 assert(Src0Regs.size() == Src1Regs.size() &&
2124 (Src0Regs.empty() || Src0Regs.size() == 2));
2125
2126 MachineRegisterInfo &MRI = OpdMapper.getMRI();
2127 MachineInstr &MI = OpdMapper.getMI();
2128 Register DstReg = MI.getOperand(0).getReg();
2129 LLT HalfTy = LLT::scalar(32);
2130
2131 // Depending on where the source registers came from, the generic code may
2132 // have decided to split the inputs already or not. If not, we still need to
2133 // extract the values.
2134
2135 if (Src0Regs.empty())
2136 split64BitValueForMapping(B, Src0Regs, HalfTy, MI.getOperand(1).getReg());
2137 else
2138 setRegsToType(MRI, Src0Regs, HalfTy);
2139
2140 if (Src1Regs.empty())
2141 split64BitValueForMapping(B, Src1Regs, HalfTy, MI.getOperand(2).getReg());
2142 else
2143 setRegsToType(MRI, Src1Regs, HalfTy);
2144
2145 setRegsToType(MRI, DefRegs, HalfTy);
2146
2147 // The multiplication is done as follows:
2148 //
2149 // Op1H Op1L
2150 // * Op0H Op0L
2151 // --------------------
2152 // Op1H*Op0L Op1L*Op0L
2153 // + Op1H*Op0H Op1L*Op0H
2154 // -----------------------------------------
2155 // (Op1H*Op0L + Op1L*Op0H + carry) Op1L*Op0L
2156 //
2157 // We drop Op1H*Op0H because the result of the multiplication is a 64-bit
2158 // value and that would overflow.
2159 // The low 32-bit value is Op1L*Op0L.
2160 // The high 32-bit value is Op1H*Op0L + Op1L*Op0H + carry (from
2161 // Op1L*Op0L).
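  // A minimal scalar sketch of the same decomposition (illustrative only; the
  // helper name and the use of plain C types are assumptions, not part of this
  // lowering):
  //   uint64_t mul64(uint64_t A, uint64_t B) {
  //     uint32_t A0 = A, A1 = A >> 32, B0 = B, B1 = B >> 32;
  //     uint32_t Lo = A0 * B0;
  //     uint32_t Hi = (uint32_t)(((uint64_t)A0 * B0) >> 32) // carry of Lo
  //                   + A1 * B0 + A0 * B1;                  // A1*B1 dropped
  //     return ((uint64_t)Hi << 32) | Lo;
  //   }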
2162
2163 ApplyRegBankMapping ApplyBank(B, *this, MRI, &AMDGPU::VGPRRegBank);
2164
2165 Register Hi = B.buildUMulH(HalfTy, Src0Regs[0], Src1Regs[0]).getReg(0);
2166 Register MulLoHi = B.buildMul(HalfTy, Src0Regs[0], Src1Regs[1]).getReg(0);
2167 Register Add = B.buildAdd(HalfTy, Hi, MulLoHi).getReg(0);
2168 Register MulHiLo = B.buildMul(HalfTy, Src0Regs[1], Src1Regs[0]).getReg(0);
2169 B.buildAdd(DefRegs[1], Add, MulHiLo);
2170 B.buildMul(DefRegs[0], Src0Regs[0], Src1Regs[0]);
2171
2172 MRI.setRegBank(DstReg, AMDGPU::VGPRRegBank);
2173 MI.eraseFromParent();
2174}
2175
2176void AMDGPURegisterBankInfo::applyMappingImpl(
2177 MachineIRBuilder &B, const OperandsMapper &OpdMapper) const {
2178 MachineInstr &MI = OpdMapper.getMI();
2179 B.setInstrAndDebugLoc(MI);
2180 unsigned Opc = MI.getOpcode();
2181 MachineRegisterInfo &MRI = OpdMapper.getMRI();
2182 switch (Opc) {
2183 case AMDGPU::G_CONSTANT:
2184 case AMDGPU::G_IMPLICIT_DEF: {
2185 Register DstReg = MI.getOperand(0).getReg();
2186 LLT DstTy = MRI.getType(DstReg);
2187 if (DstTy != LLT::scalar(1))
2188 break;
2189
2190 const RegisterBank *DstBank =
2191 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2192 if (DstBank == &AMDGPU::VCCRegBank)
2193 break;
2194 SmallVector<Register, 1> DefRegs(OpdMapper.getVRegs(0));
2195 if (DefRegs.empty())
2196 DefRegs.push_back(DstReg);
2197
2198 B.setInsertPt(*MI.getParent(), ++MI.getIterator());
2199
2200 Register NewDstReg = MRI.createGenericVirtualRegister(LLT::scalar(32));
2201 LLVMContext &Ctx = B.getMF().getFunction().getContext();
2202
2203 MI.getOperand(0).setReg(NewDstReg);
2204 if (Opc != AMDGPU::G_IMPLICIT_DEF) {
2205 uint64_t ConstVal = MI.getOperand(1).getCImm()->getZExtValue();
2206 MI.getOperand(1).setCImm(
2207 ConstantInt::get(IntegerType::getInt32Ty(Ctx), ConstVal));
2208 }
2209
2210 MRI.setRegBank(NewDstReg, *DstBank);
2211 B.buildTrunc(DefRegs[0], NewDstReg);
2212 return;
2213 }
2214 case AMDGPU::G_PHI: {
2215 Register DstReg = MI.getOperand(0).getReg();
2216 LLT DstTy = MRI.getType(DstReg);
2217 if (DstTy != LLT::scalar(1))
2218 break;
2219
2220 const LLT S32 = LLT::scalar(32);
2221 const RegisterBank *DstBank =
2222 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2223 if (DstBank == &AMDGPU::VCCRegBank) {
2224 applyDefaultMapping(OpdMapper);
2225 // The standard handling only considers the result register bank for
2226 // phis. For VCC, blindly inserting a copy when the phi is lowered will
2227 // produce an invalid copy. We can only copy with some kind of compare to
2228 // get a vector boolean result. Insert a register bank copy that will be
2229 // correctly lowered to a compare.
2230 for (unsigned I = 1, E = MI.getNumOperands(); I != E; I += 2) {
2231 Register SrcReg = MI.getOperand(I).getReg();
2232 const RegisterBank *SrcBank = getRegBank(SrcReg, MRI, *TRI);
2233
2234 if (SrcBank != &AMDGPU::VCCRegBank) {
2235 MachineBasicBlock *SrcMBB = MI.getOperand(I + 1).getMBB();
2236 B.setInsertPt(*SrcMBB, SrcMBB->getFirstTerminator());
2237
2238 auto Copy = B.buildCopy(LLT::scalar(1), SrcReg);
2239 MRI.setRegBank(Copy.getReg(0), AMDGPU::VCCRegBank);
2240 MI.getOperand(I).setReg(Copy.getReg(0));
2241 }
2242 }
2243
2244 return;
2245 }
2246
2247 // Phi handling is strange and only considers the bank of the destination.
2248 substituteSimpleCopyRegs(OpdMapper, 0);
2249
2250 // Promote SGPR/VGPR booleans to s32
2251 ApplyRegBankMapping ApplyBank(B, *this, MRI, DstBank);
2252 B.setInsertPt(B.getMBB(), MI);
2253 LegalizerHelper Helper(B.getMF(), ApplyBank, B);
2254
2255 if (Helper.widenScalar(MI, 0, S32) != LegalizerHelper::Legalized)
2256 llvm_unreachable("widen scalar should have succeeded");
2257
2258 return;
2259 }
2260 case AMDGPU::G_FCMP:
2261 if (!Subtarget.hasSALUFloatInsts())
2262 break;
2263 [[fallthrough]];
2264 case AMDGPU::G_ICMP:
2265 case AMDGPU::G_UADDO:
2266 case AMDGPU::G_USUBO:
2267 case AMDGPU::G_UADDE:
2268 case AMDGPU::G_SADDE:
2269 case AMDGPU::G_USUBE:
2270 case AMDGPU::G_SSUBE: {
2271 unsigned BoolDstOp =
2272 (Opc == AMDGPU::G_ICMP || Opc == AMDGPU::G_FCMP) ? 0 : 1;
2273 Register DstReg = MI.getOperand(BoolDstOp).getReg();
2274
2275 const RegisterBank *DstBank =
2276 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2277 if (DstBank != &AMDGPU::SGPRRegBank)
2278 break;
2279
2280 const bool HasCarryIn = MI.getNumOperands() == 5;
2281
2282 // If this is a scalar compare, promote the result to s32, as the selection
2283 // will end up using a copy to a 32-bit vreg.
2284 const LLT S32 = LLT::scalar(32);
2285 Register NewDstReg = MRI.createGenericVirtualRegister(S32);
2286 MRI.setRegBank(NewDstReg, AMDGPU::SGPRRegBank);
2287 MI.getOperand(BoolDstOp).setReg(NewDstReg);
2288
2289 if (HasCarryIn) {
2290 Register NewSrcReg = MRI.createGenericVirtualRegister(S32);
2291 MRI.setRegBank(NewSrcReg, AMDGPU::SGPRRegBank);
2292 B.buildZExt(NewSrcReg, MI.getOperand(4).getReg());
2293 MI.getOperand(4).setReg(NewSrcReg);
2294 }
2295
2296 MachineBasicBlock *MBB = MI.getParent();
2297 B.setInsertPt(*MBB, std::next(MI.getIterator()));
2298
2299 // If we had a constrained VCC result register, a copy was inserted to VCC
2300 // from SGPR.
2301 SmallVector<Register, 1> DefRegs(OpdMapper.getVRegs(0));
2302 if (DefRegs.empty())
2303 DefRegs.push_back(DstReg);
2304 B.buildTrunc(DefRegs[0], NewDstReg);
2305 return;
2306 }
2307 case AMDGPU::G_SELECT: {
2308 Register DstReg = MI.getOperand(0).getReg();
2309 LLT DstTy = MRI.getType(DstReg);
2310
2311 SmallVector<Register, 1> CondRegs(OpdMapper.getVRegs(1));
2312 if (CondRegs.empty())
2313 CondRegs.push_back(MI.getOperand(1).getReg());
2314 else {
2315 assert(CondRegs.size() == 1);
2316 }
2317
2318 const RegisterBank *CondBank = getRegBank(CondRegs[0], MRI, *TRI);
2319 if (CondBank == &AMDGPU::SGPRRegBank) {
2320 const LLT S32 = LLT::scalar(32);
2321 Register NewCondReg = MRI.createGenericVirtualRegister(S32);
2322 MRI.setRegBank(NewCondReg, AMDGPU::SGPRRegBank);
2323
2324 MI.getOperand(1).setReg(NewCondReg);
2325 B.buildZExt(NewCondReg, CondRegs[0]);
2326 }
2327
2328 if (DstTy.getSizeInBits() != 64)
2329 break;
2330
2331 LLT HalfTy = getHalfSizedType(DstTy);
2332
2333 SmallVector<Register, 2> DefRegs(OpdMapper.getVRegs(0));
2334 SmallVector<Register, 2> Src1Regs(OpdMapper.getVRegs(2));
2335 SmallVector<Register, 2> Src2Regs(OpdMapper.getVRegs(3));
2336
2337 // All inputs are SGPRs, nothing special to do.
2338 if (DefRegs.empty()) {
2339 assert(Src1Regs.empty() && Src2Regs.empty());
2340 break;
2341 }
2342
2343 if (Src1Regs.empty())
2344 split64BitValueForMapping(B, Src1Regs, HalfTy, MI.getOperand(2).getReg());
2345 else {
2346 setRegsToType(MRI, Src1Regs, HalfTy);
2347 }
2348
2349 if (Src2Regs.empty())
2350 split64BitValueForMapping(B, Src2Regs, HalfTy, MI.getOperand(3).getReg());
2351 else
2352 setRegsToType(MRI, Src2Regs, HalfTy);
2353
2354 setRegsToType(MRI, DefRegs, HalfTy);
2355
2356 B.buildSelect(DefRegs[0], CondRegs[0], Src1Regs[0], Src2Regs[0]);
2357 B.buildSelect(DefRegs[1], CondRegs[0], Src1Regs[1], Src2Regs[1]);
2358
2359 MRI.setRegBank(DstReg, AMDGPU::VGPRRegBank);
2360 MI.eraseFromParent();
2361 return;
2362 }
2363 case AMDGPU::G_BRCOND: {
2364 Register CondReg = MI.getOperand(0).getReg();
2365 // FIXME: Should use legalizer helper, but should change bool ext type.
2366 const RegisterBank *CondBank =
2367 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2368
2369 if (CondBank == &AMDGPU::SGPRRegBank) {
2370 const LLT S32 = LLT::scalar(32);
2371 Register NewCondReg = MRI.createGenericVirtualRegister(S32);
2372 MRI.setRegBank(NewCondReg, AMDGPU::SGPRRegBank);
2373
2374 MI.getOperand(0).setReg(NewCondReg);
2375 B.buildZExt(NewCondReg, CondReg);
2376 return;
2377 }
2378
2379 break;
2380 }
2381 case AMDGPU::G_AND:
2382 case AMDGPU::G_OR:
2383 case AMDGPU::G_XOR: {
2384 // 64-bit and is only available on the SALU, so split into 2 32-bit ops if
2385 // there is a VGPR input.
2386 Register DstReg = MI.getOperand(0).getReg();
2387 LLT DstTy = MRI.getType(DstReg);
2388
2389 if (DstTy.getSizeInBits() == 1) {
2390 const RegisterBank *DstBank =
2391 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2392 if (DstBank == &AMDGPU::VCCRegBank)
2393 break;
2394
2395 MachineFunction *MF = MI.getParent()->getParent();
2396 ApplyRegBankMapping ApplyBank(B, *this, MRI, DstBank);
2397 LegalizerHelper Helper(*MF, ApplyBank, B);
2398
2399 if (Helper.widenScalar(MI, 0, LLT::scalar(32)) !=
2400 LegalizerHelper::Legalized)
2401 llvm_unreachable("widen scalar should have succeeded");
2402 return;
2403 }
2404
2405 if (DstTy.getSizeInBits() != 64)
2406 break;
2407
2408 LLT HalfTy = getHalfSizedType(DstTy);
2409 SmallVector<Register, 2> DefRegs(OpdMapper.getVRegs(0));
2410 SmallVector<Register, 2> Src0Regs(OpdMapper.getVRegs(1));
2411 SmallVector<Register, 2> Src1Regs(OpdMapper.getVRegs(2));
2412
2413 // All inputs are SGPRs, nothing special to do.
2414 if (DefRegs.empty()) {
2415 assert(Src0Regs.empty() && Src1Regs.empty());
2416 break;
2417 }
2418
2419 assert(DefRegs.size() == 2);
2420 assert(Src0Regs.size() == Src1Regs.size() &&
2421 (Src0Regs.empty() || Src0Regs.size() == 2));
2422
2423 // Depending on where the source registers came from, the generic code may
2424 // have decided to split the inputs already or not. If not, we still need to
2425 // extract the values.
2426
2427 if (Src0Regs.empty())
2428 split64BitValueForMapping(B, Src0Regs, HalfTy, MI.getOperand(1).getReg());
2429 else
2430 setRegsToType(MRI, Src0Regs, HalfTy);
2431
2432 if (Src1Regs.empty())
2433 split64BitValueForMapping(B, Src1Regs, HalfTy, MI.getOperand(2).getReg());
2434 else
2435 setRegsToType(MRI, Src1Regs, HalfTy);
2436
2437 setRegsToType(MRI, DefRegs, HalfTy);
2438
2439 B.buildInstr(Opc, {DefRegs[0]}, {Src0Regs[0], Src1Regs[0]});
2440 B.buildInstr(Opc, {DefRegs[1]}, {Src0Regs[1], Src1Regs[1]});
2441
2442 MRI.setRegBank(DstReg, AMDGPU::VGPRRegBank);
2443 MI.eraseFromParent();
2444 return;
2445 }
2446 case AMDGPU::G_ABS: {
2447 Register SrcReg = MI.getOperand(1).getReg();
2448 const RegisterBank *SrcBank = MRI.getRegBankOrNull(SrcReg);
2449
2450 // There is no VALU abs instruction so we need to replace it with a sub and
2451 // max combination.
2452 if (SrcBank && SrcBank == &AMDGPU::VGPRRegBank) {
2453 MachineFunction *MF = MI.getParent()->getParent();
2454 ApplyRegBankMapping Apply(B, *this, MRI, &AMDGPU::VGPRRegBank);
2455 LegalizerHelper Helper(*MF, Apply, B);
2456
2457 if (Helper.lowerAbsToMaxNeg(MI) != LegalizerHelper::Legalized)
2458 llvm_unreachable("lowerAbsToMaxNeg should have succeeded");
2459 return;
2460 }
2461 [[fallthrough]];
2462 }
2463 case AMDGPU::G_ADD:
2464 case AMDGPU::G_SUB:
2465 case AMDGPU::G_MUL:
2466 case AMDGPU::G_SHL:
2467 case AMDGPU::G_LSHR:
2468 case AMDGPU::G_ASHR:
2469 case AMDGPU::G_SMIN:
2470 case AMDGPU::G_SMAX:
2471 case AMDGPU::G_UMIN:
2472 case AMDGPU::G_UMAX: {
2473 Register DstReg = MI.getOperand(0).getReg();
2474 LLT DstTy = MRI.getType(DstReg);
2475
2476 // Special case for s_mul_u64. There is not a vector equivalent of
2477 // s_mul_u64. Hence, we have to break down s_mul_u64 into 32-bit vector
2478 // multiplications.
2479 if (Opc == AMDGPU::G_MUL && DstTy.getSizeInBits() == 64) {
2480 applyMappingSMULU64(B, OpdMapper);
2481 return;
2482 }
2483
2484 // 16-bit operations are VALU only, but can be promoted to 32-bit SALU.
2485 // Packed 16-bit operations need to be scalarized and promoted.
2486 if (DstTy != LLT::scalar(16) && DstTy != LLT::fixed_vector(2, 16))
2487 break;
2488
2489 const RegisterBank *DstBank =
2490 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2491 if (DstBank == &AMDGPU::VGPRRegBank)
2492 break;
2493
2494 const LLT S32 = LLT::scalar(32);
2495 MachineBasicBlock *MBB = MI.getParent();
2496 MachineFunction *MF = MBB->getParent();
2497 ApplyRegBankMapping ApplySALU(B, *this, MRI, &AMDGPU::SGPRRegBank);
2498
2499 if (DstTy.isVector() && Opc == AMDGPU::G_ABS) {
2500 Register WideSrcLo, WideSrcHi;
2501
2502 std::tie(WideSrcLo, WideSrcHi) =
2503 unpackV2S16ToS32(B, MI.getOperand(1).getReg(), TargetOpcode::G_SEXT);
2504 auto Lo = B.buildInstr(AMDGPU::G_ABS, {S32}, {WideSrcLo});
2505 auto Hi = B.buildInstr(AMDGPU::G_ABS, {S32}, {WideSrcHi});
2506 B.buildBuildVectorTrunc(DstReg, {Lo.getReg(0), Hi.getReg(0)});
2507 MI.eraseFromParent();
2508 return;
2509 }
2510
2511 if (DstTy.isVector()) {
2512 Register WideSrc0Lo, WideSrc0Hi;
2513 Register WideSrc1Lo, WideSrc1Hi;
2514
2515 unsigned ExtendOp = getExtendOp(MI.getOpcode());
2516 std::tie(WideSrc0Lo, WideSrc0Hi)
2517 = unpackV2S16ToS32(B, MI.getOperand(1).getReg(), ExtendOp);
2518 std::tie(WideSrc1Lo, WideSrc1Hi)
2519 = unpackV2S16ToS32(B, MI.getOperand(2).getReg(), ExtendOp);
2520 auto Lo = B.buildInstr(MI.getOpcode(), {S32}, {WideSrc0Lo, WideSrc1Lo});
2521 auto Hi = B.buildInstr(MI.getOpcode(), {S32}, {WideSrc0Hi, WideSrc1Hi});
2522 B.buildBuildVectorTrunc(DstReg, {Lo.getReg(0), Hi.getReg(0)});
2523 MI.eraseFromParent();
2524 } else {
2525 LegalizerHelper Helper(*MF, ApplySALU, B);
2526
2527 if (Helper.widenScalar(MI, 0, S32) != LegalizerHelper::Legalized)
2528 llvm_unreachable("widen scalar should have succeeded");
2529
2530 // FIXME: s16 shift amounts should be legal.
2531 if (Opc == AMDGPU::G_SHL || Opc == AMDGPU::G_LSHR ||
2532 Opc == AMDGPU::G_ASHR) {
2533 B.setInsertPt(*MBB, MI.getIterator());
2534 if (Helper.widenScalar(MI, 1, S32) != LegalizerHelper::Legalized)
2535 llvm_unreachable("widen scalar should have succeeded");
2536 }
2537 }
2538
2539 return;
2540 }
2541 case AMDGPU::G_AMDGPU_S_MUL_I64_I32:
2542 case AMDGPU::G_AMDGPU_S_MUL_U64_U32: {
2543 // This is a special case for s_mul_u64. We use the
2544 // G_AMDGPU_S_MUL_I64_I32 opcode to represent an s_mul_u64 operation
2545 // where the 33 higher bits are sign-extended and the
2546 // G_AMDGPU_S_MUL_U64_U32 opcode to represent an s_mul_u64 operation
2547 // where the 32 higher bits are zero-extended. If scalar registers are
2548 // selected, both opcodes are lowered as s_mul_u64. If vector registers
2549 // are selected, then G_AMDGPU_S_MUL_I64_I32 and
2550 // G_AMDGPU_S_MUL_U64_U32 are lowered with a vector mad instruction.
2551
2552 // Insert basic copies.
2553 applyDefaultMapping(OpdMapper);
2554
2555 Register DstReg = MI.getOperand(0).getReg();
2556 Register SrcReg0 = MI.getOperand(1).getReg();
2557 Register SrcReg1 = MI.getOperand(2).getReg();
2558 const LLT S32 = LLT::scalar(32);
2559 const LLT S64 = LLT::scalar(64);
2560 assert(MRI.getType(DstReg) == S64 && "This is a special case for s_mul_u64 "
2561 "that handles only 64-bit operands.");
2562 const RegisterBank *DstBank =
2563 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2564
2565 // Replace G_AMDGPU_S_MUL_I64_I32 and G_AMDGPU_S_MUL_U64_U32
2566 // with s_mul_u64 operation.
2567 if (DstBank == &AMDGPU::SGPRRegBank) {
2568 MI.setDesc(TII->get(AMDGPU::S_MUL_U64));
2569 MRI.setRegClass(DstReg, &AMDGPU::SGPR_64RegClass);
2570 MRI.setRegClass(SrcReg0, &AMDGPU::SGPR_64RegClass);
2571 MRI.setRegClass(SrcReg1, &AMDGPU::SGPR_64RegClass);
2572 return;
2573 }
2574
2575 // Replace G_AMDGPU_S_MUL_I64_I32 and G_AMDGPU_S_MUL_U64_U32
2576 // with a vector mad.
2577 assert(MRI.getRegBankOrNull(DstReg) == &AMDGPU::VGPRRegBank &&
2578 "The destination operand should be in vector registers.");
2579
2580 DebugLoc DL = MI.getDebugLoc();
2581
2582 // Extract the lower subregister from the first operand.
2583 Register Op0L = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
2584 MRI.setRegClass(Op0L, &AMDGPU::VGPR_32RegClass);
2585 MRI.setType(Op0L, S32);
2586 B.buildTrunc(Op0L, SrcReg0);
2587
2588 // Extract the lower subregister from the second operand.
2589 Register Op1L = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
2590 MRI.setRegClass(Op1L, &AMDGPU::VGPR_32RegClass);
2591 MRI.setType(Op1L, S32);
2592 B.buildTrunc(Op1L, SrcReg1);
2593
2594 unsigned NewOpc = Opc == AMDGPU::G_AMDGPU_S_MUL_U64_U32
2595 ? AMDGPU::G_AMDGPU_MAD_U64_U32
2596 : AMDGPU::G_AMDGPU_MAD_I64_I32;
2597
2599 Register Zero64 = B.buildConstant(S64, 0).getReg(0);
2600 MRI.setRegClass(Zero64, &AMDGPU::VReg_64RegClass);
2601 Register CarryOut = MRI.createVirtualRegister(&AMDGPU::VReg_64RegClass);
2602 MRI.setRegClass(CarryOut, &AMDGPU::VReg_64RegClass);
2603 B.buildInstr(NewOpc, {DstReg, CarryOut}, {Op0L, Op1L, Zero64});
2604 MI.eraseFromParent();
2605 return;
2606 }
2607 case AMDGPU::G_SEXT_INREG: {
2608 SmallVector<Register, 2> SrcRegs(OpdMapper.getVRegs(1));
2609 if (SrcRegs.empty())
2610 break; // Nothing to repair
2611
2612 const LLT S32 = LLT::scalar(32);
2613 ApplyRegBankMapping O(B, *this, MRI, &AMDGPU::VGPRRegBank);
2614
2615 // Don't use LegalizerHelper's narrowScalar. It produces unwanted G_SEXTs
2616 // we would need to further expand, and doesn't let us directly set the
2617 // result registers.
2618 SmallVector<Register, 2> DstRegs(OpdMapper.getVRegs(0));
2619
2620 int Amt = MI.getOperand(2).getImm();
2621 if (Amt <= 32) {
2622 // Downstream users have expectations for the high bit behavior, so freeze
2623 // incoming undefined bits.
2624 if (Amt == 32) {
2625 // The low bits are unchanged.
2626 B.buildFreeze(DstRegs[0], SrcRegs[0]);
2627 } else {
2628 auto Freeze = B.buildFreeze(S32, SrcRegs[0]);
2629 // Extend in the low bits and propagate the sign bit to the high half.
2630 B.buildSExtInReg(DstRegs[0], Freeze, Amt);
2631 }
2632
2633 B.buildAShr(DstRegs[1], DstRegs[0], B.buildConstant(S32, 31));
2634 } else {
2635 // The low bits are unchanged, and extend in the high bits.
2636 // No freeze required
2637 B.buildCopy(DstRegs[0], SrcRegs[0]);
2638 B.buildSExtInReg(DstRegs[1], SrcRegs[1], Amt - 32);
2639 }
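    // Worked examples (illustrative): for Amt = 8, DstRegs[0] =
    // sext_inreg(freeze(lo), 8) and DstRegs[1] = ashr(DstRegs[0], 31); for
    // Amt = 40, DstRegs[0] = lo and DstRegs[1] = sext_inreg(hi, 8).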
2640
2641 Register DstReg = MI.getOperand(0).getReg();
2642 MRI.setRegBank(DstReg, AMDGPU::VGPRRegBank);
2643 MI.eraseFromParent();
2644 return;
2645 }
2646 case AMDGPU::G_CTPOP:
2647 case AMDGPU::G_BITREVERSE: {
2648 const RegisterBank *DstBank =
2649 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2650 if (DstBank == &AMDGPU::SGPRRegBank)
2651 break;
2652
2653 Register SrcReg = MI.getOperand(1).getReg();
2654 const LLT S32 = LLT::scalar(32);
2655 LLT Ty = MRI.getType(SrcReg);
2656 if (Ty == S32)
2657 break;
2658
2659 ApplyRegBankMapping ApplyVALU(B, *this, MRI, &AMDGPU::VGPRRegBank);
2660
2661 MachineFunction &MF = B.getMF();
2662 LegalizerHelper Helper(MF, ApplyVALU, B);
2663
2664 if (Helper.narrowScalar(MI, 1, S32) != LegalizerHelper::Legalized)
2665 llvm_unreachable("narrowScalar should have succeeded");
2666 return;
2667 }
2668 case AMDGPU::G_AMDGPU_FFBH_U32:
2669 case AMDGPU::G_AMDGPU_FFBL_B32:
2670 case AMDGPU::G_CTLZ_ZERO_UNDEF:
2671 case AMDGPU::G_CTTZ_ZERO_UNDEF: {
2672 const RegisterBank *DstBank =
2673 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2674 if (DstBank == &AMDGPU::SGPRRegBank)
2675 break;
2676
2677 Register SrcReg = MI.getOperand(1).getReg();
2678 const LLT S32 = LLT::scalar(32);
2679 LLT Ty = MRI.getType(SrcReg);
2680 if (Ty == S32)
2681 break;
2682
2683 // We can narrow this more efficiently than Helper can by using ffbh/ffbl
2684 // which return -1 when the input is zero:
2685 // (ctlz_zero_undef hi:lo) -> (umin (ffbh hi), (add (ffbh lo), 32))
2686 // (cttz_zero_undef hi:lo) -> (umin (add (ffbl hi), 32), (ffbl lo))
2687 // (ffbh hi:lo) -> (umin (ffbh hi), (uaddsat (ffbh lo), 32))
2688 // (ffbl hi:lo) -> (umin (uaddsat (ffbl hi), 32), (ffbl lo))
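    // Worked example (illustrative): for hi = 0 and lo = 1, (ffbh hi) = -1 =
    // 0xFFFFFFFF and (ffbh lo) + 32 = 63, so the umin produces 63, the
    // correct 64-bit ctlz.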
2689 ApplyRegBankMapping ApplyVALU(B, *this, MRI, &AMDGPU::VGPRRegBank);
2690 SmallVector<Register, 2> SrcRegs(OpdMapper.getVRegs(1));
2691 unsigned NewOpc = Opc == AMDGPU::G_CTLZ_ZERO_UNDEF
2692 ? (unsigned)AMDGPU::G_AMDGPU_FFBH_U32
2693 : Opc == AMDGPU::G_CTTZ_ZERO_UNDEF
2694 ? (unsigned)AMDGPU::G_AMDGPU_FFBL_B32
2695 : Opc;
2696 unsigned Idx = NewOpc == AMDGPU::G_AMDGPU_FFBH_U32;
2697 auto X = B.buildInstr(NewOpc, {S32}, {SrcRegs[Idx]});
2698 auto Y = B.buildInstr(NewOpc, {S32}, {SrcRegs[Idx ^ 1]});
2699 unsigned AddOpc =
2700 Opc == AMDGPU::G_CTLZ_ZERO_UNDEF || Opc == AMDGPU::G_CTTZ_ZERO_UNDEF
2701 ? AMDGPU::G_ADD
2702 : AMDGPU::G_UADDSAT;
2703 Y = B.buildInstr(AddOpc, {S32}, {Y, B.buildConstant(S32, 32)});
2704 Register DstReg = MI.getOperand(0).getReg();
2705 B.buildUMin(DstReg, X, Y);
2706 MI.eraseFromParent();
2707 return;
2708 }
2709 case AMDGPU::G_SEXT:
2710 case AMDGPU::G_ZEXT:
2711 case AMDGPU::G_ANYEXT: {
2712 Register SrcReg = MI.getOperand(1).getReg();
2713 LLT SrcTy = MRI.getType(SrcReg);
2714 const bool Signed = Opc == AMDGPU::G_SEXT;
2715
2716 assert(OpdMapper.getVRegs(1).empty());
2717
2718 const RegisterBank *SrcBank =
2719 OpdMapper.getInstrMapping().getOperandMapping(1).BreakDown[0].RegBank;
2720
2721 Register DstReg = MI.getOperand(0).getReg();
2722 LLT DstTy = MRI.getType(DstReg);
2723 if (DstTy.isScalar() &&
2724 SrcBank != &AMDGPU::SGPRRegBank &&
2725 SrcBank != &AMDGPU::VCCRegBank &&
2726 // FIXME: Should handle any type that round to s64 when irregular
2727 // breakdowns supported.
2728 DstTy.getSizeInBits() == 64 &&
2729 SrcTy.getSizeInBits() <= 32) {
2730 SmallVector<Register, 2> DefRegs(OpdMapper.getVRegs(0));
2731
2732 // Extend to 32-bit, and then extend the low half.
2733 if (Signed) {
2734 // TODO: Should really be buildSExtOrCopy
2735 B.buildSExtOrTrunc(DefRegs[0], SrcReg);
2736 } else if (Opc == AMDGPU::G_ZEXT) {
2737 B.buildZExtOrTrunc(DefRegs[0], SrcReg);
2738 } else {
2739 B.buildAnyExtOrTrunc(DefRegs[0], SrcReg);
2740 }
2741
2742 extendLow32IntoHigh32(B, DefRegs[1], DefRegs[0], Opc, *SrcBank);
2743 MRI.setRegBank(DstReg, *SrcBank);
2744 MI.eraseFromParent();
2745 return;
2746 }
2747
2748 if (SrcTy != LLT::scalar(1))
2749 return;
2750
2751 // It is not legal to have a legalization artifact with a VCC source. Rather
2752 // than introducing a copy, insert the select we would have to select the
2753 // copy to.
2754 if (SrcBank == &AMDGPU::VCCRegBank) {
2755 SmallVector<Register, 2> DefRegs(OpdMapper.getVRegs(0));
2756
2757 const RegisterBank *DstBank = &AMDGPU::VGPRRegBank;
2758
2759 unsigned DstSize = DstTy.getSizeInBits();
2760 // 64-bit select is SGPR only
2761 const bool UseSel64 = DstSize > 32 &&
2762 SrcBank->getID() == AMDGPU::SGPRRegBankID;
2763
2764 // TODO: Should s16 select be legal?
2765 LLT SelType = UseSel64 ? LLT::scalar(64) : LLT::scalar(32);
2766 auto True = B.buildConstant(SelType, Signed ? -1 : 1);
2767 auto False = B.buildConstant(SelType, 0);
2768
2769 MRI.setRegBank(True.getReg(0), *DstBank);
2770 MRI.setRegBank(False.getReg(0), *DstBank);
2771 MRI.setRegBank(DstReg, *DstBank);
2772
2773 if (DstSize > 32) {
2774 B.buildSelect(DefRegs[0], SrcReg, True, False);
2775 extendLow32IntoHigh32(B, DefRegs[1], DefRegs[0], Opc, *SrcBank, true);
2776 } else if (DstSize < 32) {
2777 auto Sel = B.buildSelect(SelType, SrcReg, True, False);
2778 MRI.setRegBank(Sel.getReg(0), *DstBank);
2779 B.buildTrunc(DstReg, Sel);
2780 } else {
2781 B.buildSelect(DstReg, SrcReg, True, False);
2782 }
2783
2784 MI.eraseFromParent();
2785 return;
2786 }
2787
2788 break;
2789 }
2790 case AMDGPU::G_EXTRACT_VECTOR_ELT: {
2791 SmallVector<Register, 2> DstRegs(OpdMapper.getVRegs(0));
2792
2793 assert(OpdMapper.getVRegs(1).empty() && OpdMapper.getVRegs(2).empty());
2794
2795 Register DstReg = MI.getOperand(0).getReg();
2796 Register SrcReg = MI.getOperand(1).getReg();
2797
2798 const LLT S32 = LLT::scalar(32);
2799 LLT DstTy = MRI.getType(DstReg);
2800 LLT SrcTy = MRI.getType(SrcReg);
2801
2802 if (foldExtractEltToCmpSelect(B, MI, OpdMapper))
2803 return;
2804
2805 const ValueMapping &DstMapping
2806 = OpdMapper.getInstrMapping().getOperandMapping(0);
2807 const RegisterBank *DstBank = DstMapping.BreakDown[0].RegBank;
2808 const RegisterBank *SrcBank =
2809 OpdMapper.getInstrMapping().getOperandMapping(1).BreakDown[0].RegBank;
2810 const RegisterBank *IdxBank =
2811 OpdMapper.getInstrMapping().getOperandMapping(2).BreakDown[0].RegBank;
2812
2813 Register BaseIdxReg;
2814 unsigned ConstOffset;
2815 std::tie(BaseIdxReg, ConstOffset) =
2816 AMDGPU::getBaseWithConstantOffset(MRI, MI.getOperand(2).getReg());
2817
2818 // See if the index is an add of a constant which will be foldable by moving
2819 // the base register of the index later if this is going to be executed in a
2820 // waterfall loop. This is essentially to reassociate the add of a constant
2821 // with the readfirstlane.
2822 bool ShouldMoveIndexIntoLoop = IdxBank != &AMDGPU::SGPRRegBank &&
2823 ConstOffset > 0 &&
2824 ConstOffset < SrcTy.getNumElements();
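    // Illustrative example: for an index computed as (G_ADD %base, 2), the
    // waterfall loop iterates over %base and the +2 is re-applied inside the
    // loop after the readfirstlane.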
2825
2826 // Move the base register. We'll re-insert the add later.
2827 if (ShouldMoveIndexIntoLoop)
2828 MI.getOperand(2).setReg(BaseIdxReg);
2829
2830 // If this is a VGPR result only because the index was a VGPR result, the
2831 // actual indexing will be done on the SGPR source vector, which will
2832 // produce a scalar result. We need to copy to the VGPR result inside the
2833 // waterfall loop.
2834 const bool NeedCopyToVGPR = DstBank == &AMDGPU::VGPRRegBank &&
2835 SrcBank == &AMDGPU::SGPRRegBank;
2836 if (DstRegs.empty()) {
2837 applyDefaultMapping(OpdMapper);
2838
2839 executeInWaterfallLoop(B, MI, {2});
2840
2841 if (NeedCopyToVGPR) {
2842 // We don't want a phi for this temporary reg.
2843 Register TmpReg = MRI.createGenericVirtualRegister(DstTy);
2844 MRI.setRegBank(TmpReg, AMDGPU::SGPRRegBank);
2845 MI.getOperand(0).setReg(TmpReg);
2846 B.setInsertPt(*MI.getParent(), ++MI.getIterator());
2847
2848 // Use a v_mov_b32 here to make the exec dependency explicit.
2849 buildVCopy(B, DstReg, TmpReg);
2850 }
2851
2852 // Re-insert the constant offset add inside the waterfall loop.
2853 if (ShouldMoveIndexIntoLoop)
2854 reinsertVectorIndexAdd(B, MI, 2, ConstOffset);
2855
2856 return;
2857 }
2858
2859 assert(DstTy.getSizeInBits() == 64);
2860
2861 LLT Vec32 = LLT::fixed_vector(2 * SrcTy.getNumElements(), 32);
2862
2863 auto CastSrc = B.buildBitcast(Vec32, SrcReg);
2864 auto One = B.buildConstant(S32, 1);
2865
2866 MachineBasicBlock::iterator MII = MI.getIterator();
2867
2868 // Split the vector index into 32-bit pieces. Prepare to move all of the
2869 // new instructions into a waterfall loop if necessary.
2870 //
2871 // Don't put the bitcast or constant in the loop.
2872 MachineInstrSpan Span(MII, &B.getMBB());
2873
2874 // Compute 32-bit element indices, (2 * OrigIdx, 2 * OrigIdx + 1).
2875 auto IdxLo = B.buildShl(S32, BaseIdxReg, One);
2876 auto IdxHi = B.buildAdd(S32, IdxLo, One);
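    // Illustrative example: extracting s64 element 3 becomes extracting 32-bit
    // elements 6 (IdxLo) and 7 (IdxHi) of the bitcast vector.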
2877
2878 auto Extract0 = B.buildExtractVectorElement(DstRegs[0], CastSrc, IdxLo);
2879 auto Extract1 = B.buildExtractVectorElement(DstRegs[1], CastSrc, IdxHi);
2880
2881 MRI.setRegBank(DstReg, *DstBank);
2882 MRI.setRegBank(CastSrc.getReg(0), *SrcBank);
2883 MRI.setRegBank(One.getReg(0), AMDGPU::SGPRRegBank);
2884 MRI.setRegBank(IdxLo.getReg(0), AMDGPU::SGPRRegBank);
2885 MRI.setRegBank(IdxHi.getReg(0), AMDGPU::SGPRRegBank);
2886
2887 SmallSet<Register, 4> OpsToWaterfall;
2888 if (!collectWaterfallOperands(OpsToWaterfall, MI, MRI, { 2 })) {
2889 MI.eraseFromParent();
2890 return;
2891 }
2892
2893 // Remove the original instruction to avoid potentially confusing the
2894 // waterfall loop logic.
2895 B.setInstr(*Span.begin());
2896 MI.eraseFromParent();
2897 executeInWaterfallLoop(B, make_range(Span.begin(), Span.end()),
2898 OpsToWaterfall);
2899
2900 if (NeedCopyToVGPR) {
2901 MachineBasicBlock *LoopBB = Extract1->getParent();
2902 Register TmpReg0 = MRI.createGenericVirtualRegister(S32);
2903 Register TmpReg1 = MRI.createGenericVirtualRegister(S32);
2904 MRI.setRegBank(TmpReg0, AMDGPU::SGPRRegBank);
2905 MRI.setRegBank(TmpReg1, AMDGPU::SGPRRegBank);
2906
2907 Extract0->getOperand(0).setReg(TmpReg0);
2908 Extract1->getOperand(0).setReg(TmpReg1);
2909
2910 B.setInsertPt(*LoopBB, ++Extract1->getIterator());
2911
2912 buildVCopy(B, DstRegs[0], TmpReg0);
2913 buildVCopy(B, DstRegs[1], TmpReg1);
2914 }
2915
2916 if (ShouldMoveIndexIntoLoop)
2917 reinsertVectorIndexAdd(B, *IdxLo, 1, ConstOffset);
2918
2919 return;
2920 }
2921 case AMDGPU::G_INSERT_VECTOR_ELT: {
2922 SmallVector<Register, 2> InsRegs(OpdMapper.getVRegs(2));
2923
2924 Register DstReg = MI.getOperand(0).getReg();
2925 LLT VecTy = MRI.getType(DstReg);
2926
2927 assert(OpdMapper.getVRegs(0).empty());
2928 assert(OpdMapper.getVRegs(3).empty());
2929
2930 if (substituteSimpleCopyRegs(OpdMapper, 1))
2931 MRI.setType(MI.getOperand(1).getReg(), VecTy);
2932
2933 if (foldInsertEltToCmpSelect(B, MI, OpdMapper))
2934 return;
2935
2936 const RegisterBank *IdxBank =
2937 OpdMapper.getInstrMapping().getOperandMapping(3).BreakDown[0].RegBank;
2938
2939 Register SrcReg = MI.getOperand(1).getReg();
2940 Register InsReg = MI.getOperand(2).getReg();
2941 LLT InsTy = MRI.getType(InsReg);
2942 (void)InsTy;
2943
2944 Register BaseIdxReg;
2945 unsigned ConstOffset;
2946 std::tie(BaseIdxReg, ConstOffset) =
2947 AMDGPU::getBaseWithConstantOffset(MRI, MI.getOperand(3).getReg());
2948
2949 // See if the index is an add of a constant which will be foldable by moving
2950 // the base register of the index later if this is going to be executed in a
2951 // waterfall loop. This is essentially to reassociate the add of a constant
2952 // with the readfirstlane.
2953 bool ShouldMoveIndexIntoLoop = IdxBank != &AMDGPU::SGPRRegBank &&
2954 ConstOffset > 0 &&
2955 ConstOffset < VecTy.getNumElements();
2956
2957 // Move the base register. We'll re-insert the add later.
2958 if (ShouldMoveIndexIntoLoop)
2959 MI.getOperand(3).setReg(BaseIdxReg);
2960
2961
2962 if (InsRegs.empty()) {
2963 executeInWaterfallLoop(B, MI, {3});
2964
2965 // Re-insert the constant offset add inside the waterfall loop.
2966 if (ShouldMoveIndexIntoLoop) {
2967 reinsertVectorIndexAdd(B, MI, 3, ConstOffset);
2968 }
2969
2970 return;
2971 }
2972
2973 assert(InsTy.getSizeInBits() == 64);
2974
2975 const LLT S32 = LLT::scalar(32);
2976 LLT Vec32 = LLT::fixed_vector(2 * VecTy.getNumElements(), 32);
2977
2978 auto CastSrc = B.buildBitcast(Vec32, SrcReg);
2979 auto One = B.buildConstant(S32, 1);
2980
2981 // Split the vector index into 32-bit pieces. Prepare to move all of the
2982 // new instructions into a waterfall loop if necessary.
2983 //
2984 // Don't put the bitcast or constant in the loop.
2985 MachineInstrSpan Span(MI.getIterator(), &B.getMBB());
2986
2987 // Compute 32-bit element indices, (2 * OrigIdx, 2 * OrigIdx + 1).
2988 auto IdxLo = B.buildShl(S32, BaseIdxReg, One);
2989 auto IdxHi = B.buildAdd(S32, IdxLo, One);
2990
2991 auto InsLo = B.buildInsertVectorElement(Vec32, CastSrc, InsRegs[0], IdxLo);
2992 auto InsHi = B.buildInsertVectorElement(Vec32, InsLo, InsRegs[1], IdxHi);
2993
2994 const RegisterBank *DstBank =
2995 OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
2996 const RegisterBank *SrcBank =
2997 OpdMapper.getInstrMapping().getOperandMapping(1).BreakDown[0].RegBank;
2998 const RegisterBank *InsSrcBank =
2999 OpdMapper.getInstrMapping().getOperandMapping(2).BreakDown[0].RegBank;
3000
3001 MRI.setRegBank(InsReg, *InsSrcBank);
3002 MRI.setRegBank(CastSrc.getReg(0), *SrcBank);
3003 MRI.setRegBank(InsLo.getReg(0), *DstBank);
3004 MRI.setRegBank(InsHi.getReg(0), *DstBank);
3005 MRI.setRegBank(One.getReg(0), AMDGPU::SGPRRegBank);
3006 MRI.setRegBank(IdxLo.getReg(0), AMDGPU::SGPRRegBank);
3007 MRI.setRegBank(IdxHi.getReg(0), AMDGPU::SGPRRegBank);
3008
3009
3010 SmallSet<Register, 4> OpsToWaterfall;
3011 if (!collectWaterfallOperands(OpsToWaterfall, MI, MRI, { 3 })) {
3012 B.setInsertPt(B.getMBB(), MI);
3013 B.buildBitcast(DstReg, InsHi);
3014 MI.eraseFromParent();
3015 return;
3016 }
3017
3018 B.setInstr(*Span.begin());
3019 MI.eraseFromParent();
3020
3021 // Figure out the point after the waterfall loop before mangling the control
3022 // flow.
3023 executeInWaterfallLoop(B, make_range(Span.begin(), Span.end()),
3024 OpsToWaterfall);
3025
3026 // The insertion point is now right after the original instruction.
3027 //
3028 // Keep the bitcast to the original vector type out of the loop. Doing this
3029 // saved an extra phi we don't need inside the loop.
3030 B.buildBitcast(DstReg, InsHi);
3031
3032 // Re-insert the constant offset add inside the waterfall loop.
3033 if (ShouldMoveIndexIntoLoop)
3034 reinsertVectorIndexAdd(B, *IdxLo, 1, ConstOffset);
3035
3036 return;
3037 }
3038 case AMDGPU::G_AMDGPU_BUFFER_LOAD:
3039 case AMDGPU::G_AMDGPU_BUFFER_LOAD_USHORT:
3040 case AMDGPU::G_AMDGPU_BUFFER_LOAD_SSHORT:
3041 case AMDGPU::G_AMDGPU_BUFFER_LOAD_UBYTE:
3042 case AMDGPU::G_AMDGPU_BUFFER_LOAD_SBYTE:
3043 case AMDGPU::G_AMDGPU_BUFFER_LOAD_TFE:
3044 case AMDGPU::G_AMDGPU_BUFFER_LOAD_USHORT_TFE:
3045 case AMDGPU::G_AMDGPU_BUFFER_LOAD_SSHORT_TFE:
3046 case AMDGPU::G_AMDGPU_BUFFER_LOAD_UBYTE_TFE:
3047 case AMDGPU::G_AMDGPU_BUFFER_LOAD_SBYTE_TFE:
3048 case AMDGPU::G_AMDGPU_BUFFER_LOAD_FORMAT:
3049 case AMDGPU::G_AMDGPU_BUFFER_LOAD_FORMAT_TFE:
3050 case AMDGPU::G_AMDGPU_BUFFER_LOAD_FORMAT_D16:
3051 case AMDGPU::G_AMDGPU_TBUFFER_LOAD_FORMAT:
3052 case AMDGPU::G_AMDGPU_TBUFFER_LOAD_FORMAT_D16:
3053 case AMDGPU::G_AMDGPU_BUFFER_STORE:
3054 case AMDGPU::G_AMDGPU_BUFFER_STORE_BYTE:
3055 case AMDGPU::G_AMDGPU_BUFFER_STORE_SHORT:
3056 case AMDGPU::G_AMDGPU_BUFFER_STORE_FORMAT:
3057 case AMDGPU::G_AMDGPU_BUFFER_STORE_FORMAT_D16:
3058 case AMDGPU::G_AMDGPU_TBUFFER_STORE_FORMAT:
3059 case AMDGPU::G_AMDGPU_TBUFFER_STORE_FORMAT_D16: {
3060 applyDefaultMapping(OpdMapper);
3061 executeInWaterfallLoop(B, MI, {1, 4});
3062 return;
3063 }
3064 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_SWAP:
3065 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_ADD:
3066 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_SUB:
3067 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_SMIN:
3068 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_UMIN:
3069 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_SMAX:
3070 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_UMAX:
3071 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_AND:
3072 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_OR:
3073 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_XOR:
3074 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_INC:
3075 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_DEC: {
3076 applyDefaultMapping(OpdMapper);
3077 executeInWaterfallLoop(B, MI, {2, 5});
3078 return;
3079 }
3080 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_FADD:
3081 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_FMIN:
3082 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_FMAX: {
3083 applyDefaultMapping(OpdMapper);
3084 executeInWaterfallLoop(B, MI, {2, 5});
3085 return;
3086 }
3087 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_CMPSWAP: {
3088 applyDefaultMapping(OpdMapper);
3089 executeInWaterfallLoop(B, MI, {3, 6});
3090 return;
3091 }
3092 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD:
3093 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD_UBYTE:
3094 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD_SBYTE:
3095 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD_USHORT:
3096 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD_SSHORT: {
3097 applyMappingSBufferLoad(B, OpdMapper);
3098 return;
3099 }
3100 case AMDGPU::G_INTRINSIC:
3101 case AMDGPU::G_INTRINSIC_CONVERGENT: {
3102 switch (cast<GIntrinsic>(MI).getIntrinsicID()) {
3103 case Intrinsic::amdgcn_readlane: {
3104 substituteSimpleCopyRegs(OpdMapper, 2);
3105
3106 assert(OpdMapper.getVRegs(0).empty());
3107 assert(OpdMapper.getVRegs(3).empty());
3108
3109 // Make sure the index is an SGPR. It doesn't make sense to run this in a
3110 // waterfall loop, so assume it's a uniform value.
3111 constrainOpWithReadfirstlane(B, MI, 3); // Index
3112 return;
3113 }
3114 case Intrinsic::amdgcn_writelane: {
3115 assert(OpdMapper.getVRegs(0).empty());
3116 assert(OpdMapper.getVRegs(2).empty());
3117 assert(OpdMapper.getVRegs(3).empty());
3118
3119 substituteSimpleCopyRegs(OpdMapper, 4); // VGPR input val
3120 constrainOpWithReadfirstlane(B, MI, 2); // Source value
3121 constrainOpWithReadfirstlane(B, MI, 3); // Index
3122 return;
3123 }
3124 case Intrinsic::amdgcn_interp_p1:
3125 case Intrinsic::amdgcn_interp_p2:
3126 case Intrinsic::amdgcn_interp_mov:
3127 case Intrinsic::amdgcn_interp_p1_f16:
3128 case Intrinsic::amdgcn_interp_p2_f16:
3129 case Intrinsic::amdgcn_lds_param_load: {
3130 applyDefaultMapping(OpdMapper);
3131
3132 // Readlane for m0 value, which is always the last operand.
3133 // FIXME: Should this be a waterfall loop instead?
3134 constrainOpWithReadfirstlane(B, MI, MI.getNumOperands() - 1); // Index
3135 return;
3136 }
3137 case Intrinsic::amdgcn_interp_inreg_p10:
3138 case Intrinsic::amdgcn_interp_inreg_p2:
3139 case Intrinsic::amdgcn_interp_inreg_p10_f16:
3140 case Intrinsic::amdgcn_interp_inreg_p2_f16:
3141 case Intrinsic::amdgcn_interp_p10_rtz_f16:
3142 case Intrinsic::amdgcn_interp_p2_rtz_f16:
3143 applyDefaultMapping(OpdMapper);
3144 return;
3145 case Intrinsic::amdgcn_permlane16:
3146 case Intrinsic::amdgcn_permlanex16: {
3147 // Doing a waterfall loop over these wouldn't make any sense.
3148 substituteSimpleCopyRegs(OpdMapper, 2);
3149 substituteSimpleCopyRegs(OpdMapper, 3);
3150 constrainOpWithReadfirstlane(B, MI, 4);
3151 constrainOpWithReadfirstlane(B, MI, 5);
3152 return;
3153 }
3154 case Intrinsic::amdgcn_sbfe:
3155 applyMappingBFE(B, OpdMapper, true);
3156 return;
3157 case Intrinsic::amdgcn_ubfe:
3158 applyMappingBFE(B, OpdMapper, false);
3159 return;
3160 case Intrinsic::amdgcn_inverse_ballot:
3161 case Intrinsic::amdgcn_s_bitreplicate:
3162 case Intrinsic::amdgcn_s_quadmask:
3163 case Intrinsic::amdgcn_s_wqm:
3164 applyDefaultMapping(OpdMapper);
3165 constrainOpWithReadfirstlane(B, MI, 2); // Mask
3166 return;
3167 case Intrinsic::amdgcn_ballot:
3168 // Use default handling and insert copy to vcc source.
3169 break;
3170 }
3171 break;
3172 }
3173 case AMDGPU::G_AMDGPU_INTRIN_IMAGE_LOAD:
3174 case AMDGPU::G_AMDGPU_INTRIN_IMAGE_LOAD_D16:
3175 case AMDGPU::G_AMDGPU_INTRIN_IMAGE_STORE:
3176 case AMDGPU::G_AMDGPU_INTRIN_IMAGE_STORE_D16: {
3177 const AMDGPU::RsrcIntrinsic *RSrcIntrin =
3178 AMDGPU::lookupRsrcIntrinsic(AMDGPU::getIntrinsicID(MI));
3179 assert(RSrcIntrin && RSrcIntrin->IsImage);
3180 // Non-images can have complications from operands that allow both SGPR
3181 // and VGPR. For now it's too complicated to figure out the final opcode
3182 // to derive the register bank from the MCInstrDesc.
3183 applyMappingImage(B, MI, OpdMapper, RSrcIntrin->RsrcArg);
3184 return;
3185 }
3186 case AMDGPU::G_AMDGPU_INTRIN_BVH_INTERSECT_RAY: {
3187 unsigned N = MI.getNumExplicitOperands() - 2;
3188 applyDefaultMapping(OpdMapper);
3189 executeInWaterfallLoop(B, MI, {N});
3190 return;
3191 }
3192 case AMDGPU::G_INTRINSIC_W_SIDE_EFFECTS:
3193 case AMDGPU::G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS: {
3194 auto IntrID = cast<GIntrinsic>(MI).getIntrinsicID();
3195 switch (IntrID) {
3196 case Intrinsic::amdgcn_ds_ordered_add:
3197 case Intrinsic::amdgcn_ds_ordered_swap: {
3198 // This is only allowed to execute with 1 lane, so readfirstlane is safe.
3199 assert(OpdMapper.getVRegs(0).empty());
3200 substituteSimpleCopyRegs(OpdMapper, 3);
3201 constrainOpWithReadfirstlane(B, MI, 2); // M0
3202 return;
3203 }
3204 case Intrinsic::amdgcn_ds_gws_init:
3205 case Intrinsic::amdgcn_ds_gws_barrier:
3206 case Intrinsic::amdgcn_ds_gws_sema_br: {
3207 // Only the first lane executes, so readfirstlane is safe.
3208 substituteSimpleCopyRegs(OpdMapper, 1);
3209 constrainOpWithReadfirstlane(B, MI, 2); // M0
3210 return;
3211 }
3212 case Intrinsic::amdgcn_ds_gws_sema_v:
3213 case Intrinsic::amdgcn_ds_gws_sema_p:
3214 case Intrinsic::amdgcn_ds_gws_sema_release_all: {
3215 // Only the first lane executes, so readfirstlane is safe.
3216 constrainOpWithReadfirstlane(B, MI, 1); // M0
3217 return;
3218 }
3219 case Intrinsic::amdgcn_ds_append:
3220 case Intrinsic::amdgcn_ds_consume: {
3221 constrainOpWithReadfirstlane(B, MI, 2); // M0
3222 return;
3223 }
3224 case Intrinsic::amdgcn_s_sendmsg:
3225 case Intrinsic::amdgcn_s_sendmsghalt: {
3226 // FIXME: Should this use a waterfall loop?
3227 constrainOpWithReadfirstlane(B, MI, 2); // M0
3228 return;
3229 }
3230 case Intrinsic::amdgcn_s_setreg: {
3231 constrainOpWithReadfirstlane(B, MI, 2);
3232 return;
3233 }
3234 case Intrinsic::amdgcn_s_ttracedata:
3235 constrainOpWithReadfirstlane(B, MI, 1);
3236 return;
3237 case Intrinsic::amdgcn_raw_buffer_load_lds:
3238 case Intrinsic::amdgcn_raw_ptr_buffer_load_lds: {
3239 applyDefaultMapping(OpdMapper);
3240 constrainOpWithReadfirstlane(B, MI, 1); // rsrc
3241 constrainOpWithReadfirstlane(B, MI, 2); // M0
3242 constrainOpWithReadfirstlane(B, MI, 5); // soffset
3243 return;
3244 }
3245 case Intrinsic::amdgcn_struct_buffer_load_lds:
3246 case Intrinsic::amdgcn_struct_ptr_buffer_load_lds: {
3247 applyDefaultMapping(OpdMapper);
3248 constrainOpWithReadfirstlane(B, MI, 1); // rsrc
3249 constrainOpWithReadfirstlane(B, MI, 2); // M0
3250 constrainOpWithReadfirstlane(B, MI, 6); // soffset
3251 return;
3252 }
3253 case Intrinsic::amdgcn_global_load_lds: {
3254 applyDefaultMapping(OpdMapper);
3255 constrainOpWithReadfirstlane(B, MI, 2);
3256 return;
3257 }
3258 case Intrinsic::amdgcn_lds_direct_load: {
3259 applyDefaultMapping(OpdMapper);
3260 // Readlane for m0 value, which is always the last operand.
3261 constrainOpWithReadfirstlane(B, MI, MI.getNumOperands() - 1); // Index
3262 return;
3263 }
3264 case Intrinsic::amdgcn_exp_row:
3265 applyDefaultMapping(OpdMapper);
3266 constrainOpWithReadfirstlane(B, MI, 8); // M0
3267 return;
3268 case Intrinsic::amdgcn_s_sleep_var:
3269 assert(OpdMapper.getVRegs(1).empty());
3270 constrainOpWithReadfirstlane(B, MI, 1);
3271 return;
3272 case Intrinsic::amdgcn_s_barrier_signal_var:
3273 case Intrinsic::amdgcn_s_barrier_join:
3274 case Intrinsic::amdgcn_s_wakeup_barrier:
3275 constrainOpWithReadfirstlane(B, MI, 1);
3276 return;
3277 case Intrinsic::amdgcn_s_barrier_signal_isfirst_var:
3278 constrainOpWithReadfirstlane(B, MI, 2);
3279 return;
3280 case Intrinsic::amdgcn_s_barrier_init:
3281 constrainOpWithReadfirstlane(B, MI, 1);
3282 constrainOpWithReadfirstlane(B, MI, 2);
3283 return;
3284 case Intrinsic::amdgcn_s_get_barrier_state: {
3285 constrainOpWithReadfirstlane(B, MI, 2);
3286 return;
3287 }
3288 default: {
3289 if (const AMDGPU::RsrcIntrinsic *RSrcIntrin =
3290 AMDGPU::lookupRsrcIntrinsic(IntrID)) {
3291 // Non-images can have complications from operands that allow both SGPR
3292 // and VGPR. For now it's too complicated to figure out the final opcode
3293 // to derive the register bank from the MCInstrDesc.
3294 if (RSrcIntrin->IsImage) {
3295 applyMappingImage(B, MI, OpdMapper, RSrcIntrin->RsrcArg);
3296 return;
3297 }
3298 }
3299
3300 break;
3301 }
3302 }
3303 break;
3304 }
3305 case AMDGPU::G_SI_CALL: {
3306 // Use a set to avoid extra readfirstlanes in the case where multiple
3307 // operands are the same register.
3308 SmallSet<Register, 4> SGPROperandRegs;
3309
3310 if (!collectWaterfallOperands(SGPROperandRegs, MI, MRI, {1}))
3311 break;
3312
3313 // Move all copies to physical SGPRs that are used by the call instruction
3314 // into the loop block. Starting from the call, search backwards for these
3315 // copies until the ADJCALLSTACKUP is reached.
3316 unsigned FrameSetupOpcode = AMDGPU::ADJCALLSTACKUP;
3317 unsigned FrameDestroyOpcode = AMDGPU::ADJCALLSTACKDOWN;
3318
3319 // Move all non-copies before the copies, so that a complete range can be
3320 // moved into the waterfall loop.
3321 SmallVector<MachineInstr *, 4> NonCopyInstrs;
3322 // Count of NonCopyInstrs found until the current LastCopy.
3323 unsigned NonCopyInstrsLen = 0;
3324 MachineBasicBlock::iterator Start(&MI);
3325 MachineBasicBlock::iterator LastCopy = Start;
3326 MachineBasicBlock *MBB = MI.getParent();
3327 const SIMachineFunctionInfo *Info =
3328 MBB->getParent()->getInfo<SIMachineFunctionInfo>();
3329 while (Start->getOpcode() != FrameSetupOpcode) {
3330 --Start;
3331 bool IsCopy = false;
3332 if (Start->getOpcode() == AMDGPU::COPY) {
3333 auto &Dst = Start->getOperand(0);
3334 if (Dst.isReg()) {
3335 Register Reg = Dst.getReg();
3336 if (Reg.isPhysical() && MI.readsRegister(Reg, TRI)) {
3337 IsCopy = true;
3338 } else {
3339 // Also move the copy from the scratch rsrc descriptor into the loop
3340 // to allow it to be optimized away.
3341 auto &Src = Start->getOperand(1);
3342 if (Src.isReg()) {
3343 Reg = Src.getReg();
3344 IsCopy = Info->getScratchRSrcReg() == Reg;
3345 }
3346 }
3347 }
3348 }
3349
3350 if (IsCopy) {
3351 LastCopy = Start;
3352 NonCopyInstrsLen = NonCopyInstrs.size();
3353 } else {
3354 NonCopyInstrs.push_back(&*Start);
3355 }
3356 }
3357 NonCopyInstrs.resize(NonCopyInstrsLen);
3358
3359 for (auto *NonCopy : reverse(NonCopyInstrs)) {
3360 MBB->splice(LastCopy, MBB, NonCopy->getIterator());
3361 }
3362 Start = LastCopy;
3363
3364 // Do the same for copies after the loop
3365 NonCopyInstrs.clear();
3366 NonCopyInstrsLen = 0;
3367 MachineBasicBlock::iterator End(&MI);
3368 LastCopy = End;
3369 while (End->getOpcode() != FrameDestroyOpcode) {
3370 ++End;
3371 bool IsCopy = false;
3372 if (End->getOpcode() == AMDGPU::COPY) {
3373 auto &Src = End->getOperand(1);
3374 if (Src.isReg()) {
3375 Register Reg = Src.getReg();
3376 IsCopy = Reg.isPhysical() && MI.modifiesRegister(Reg, TRI);
3377 }
3378 }
3379
3380 if (IsCopy) {
3381 LastCopy = End;
3382 NonCopyInstrsLen = NonCopyInstrs.size();
3383 } else {
3384 NonCopyInstrs.push_back(&*End);
3385 }
3386 }
3387 NonCopyInstrs.resize(NonCopyInstrsLen);
3388
3389 End = LastCopy;
3390 ++LastCopy;
3391 for (auto *NonCopy : reverse(NonCopyInstrs)) {
3392 MBB->splice(LastCopy, MBB, NonCopy->getIterator());
3393 }
3394
3395 ++End;
3396 B.setInsertPt(B.getMBB(), Start);
3397 executeInWaterfallLoop(B, make_range(Start, End), SGPROperandRegs);
3398 break;
3399 }
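// Sketch of the net effect of the G_SI_CALL handling above, for a divergent
// (VGPR) callee: the ADJCALLSTACKUP..ADJCALLSTACKDOWN range, including the
// physical-SGPR argument copies collected above, is wrapped in a waterfall
// loop so the callee address can be read with readfirstlane for one active
// subset of lanes at a time. The copy reordering only exists so that range is
// contiguous before executeInWaterfallLoop splices it into the loop.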
3400 case AMDGPU::G_LOAD:
3401 case AMDGPU::G_ZEXTLOAD:
3402 case AMDGPU::G_SEXTLOAD: {
3403 if (applyMappingLoad(B, OpdMapper, MI))
3404 return;
3405 break;
3406 }
3407 case AMDGPU::G_DYN_STACKALLOC:
3408 applyMappingDynStackAlloc(B, OpdMapper, MI);
3409 return;
3410 case AMDGPU::G_STACKRESTORE: {
3411 applyDefaultMapping(OpdMapper);
3412 constrainOpWithReadfirstlane(B, MI, 0);
3413 return;
3414 }
3415 case AMDGPU::G_SBFX:
3416 applyMappingBFE(B, OpdMapper, /*Signed*/ true);
3417 return;
3418 case AMDGPU::G_UBFX:
3419 applyMappingBFE(B, OpdMapper, /*Signed*/ false);
3420 return;
3421 case AMDGPU::G_AMDGPU_MAD_U64_U32:
3422 case AMDGPU::G_AMDGPU_MAD_I64_I32:
3423 applyMappingMAD_64_32(B, OpdMapper);
3424 return;
3425 case AMDGPU::G_PREFETCH: {
3426 if (!Subtarget.hasPrefetch()) {
3427 MI.eraseFromParent();
3428 return;
3429 }
3430 Register PtrReg = MI.getOperand(0).getReg();
3431 unsigned PtrBank = getRegBankID(PtrReg, MRI, AMDGPU::SGPRRegBankID);
3432 if (PtrBank == AMDGPU::VGPRRegBankID) {
3433 MI.eraseFromParent();
3434 return;
3435 }
3436 unsigned AS = MRI.getType(PtrReg).getAddressSpace();
3437 if (!AMDGPU::isFlatGlobalAddrSpace(AS) &&
3438 AS != AMDGPUAS::CONSTANT_ADDRESS_32BIT) {
3439 MI.eraseFromParent();
3440 return;
3441 }
3442 applyDefaultMapping(OpdMapper);
3443 return;
3444 }
3445 default:
3446 break;
3447 }
3448
3449 return applyDefaultMapping(OpdMapper);
3450}
3451
3452// vgpr, sgpr -> vgpr
3453// vgpr, agpr -> vgpr
3454// agpr, agpr -> agpr
3455// agpr, sgpr -> vgpr
3456static unsigned regBankUnion(unsigned RB0, unsigned RB1) {
3457 if (RB0 == AMDGPU::InvalidRegBankID)
3458 return RB1;
3459 if (RB1 == AMDGPU::InvalidRegBankID)
3460 return RB0;
3461
3462 if (RB0 == AMDGPU::SGPRRegBankID && RB1 == AMDGPU::SGPRRegBankID)
3463 return AMDGPU::SGPRRegBankID;
3464
3465 if (RB0 == AMDGPU::AGPRRegBankID && RB1 == AMDGPU::AGPRRegBankID)
3466 return AMDGPU::AGPRRegBankID;
3467
3468 return AMDGPU::VGPRRegBankID;
3469}
3470
3471static unsigned regBankBoolUnion(unsigned RB0, unsigned RB1) {
3472 if (RB0 == AMDGPU::InvalidRegBankID)
3473 return RB1;
3474 if (RB1 == AMDGPU::InvalidRegBankID)
3475 return RB0;
3476
3477 // vcc, vcc -> vcc
3478 // vcc, sgpr -> vcc
3479 // vcc, vgpr -> vcc
3480 if (RB0 == AMDGPU::VCCRegBankID || RB1 == AMDGPU::VCCRegBankID)
3481 return AMDGPU::VCCRegBankID;
3482
3483 // vcc, vgpr -> vgpr
3484 return regBankUnion(RB0, RB1);
3485}
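// Illustrative values derived from the two helpers above; the join acts as a
// small lattice where any vector input pulls the result off the scalar bank:
//   regBankUnion(SGPRRegBankID, SGPRRegBankID)     -> SGPRRegBankID
//   regBankUnion(SGPRRegBankID, VGPRRegBankID)     -> VGPRRegBankID
//   regBankUnion(AGPRRegBankID, SGPRRegBankID)     -> VGPRRegBankID
//   regBankBoolUnion(SGPRRegBankID, VCCRegBankID)  -> VCCRegBankID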
3486
3488 const MachineInstr &MI) const {
3489 unsigned RegBank = AMDGPU::InvalidRegBankID;
3490
3491 for (const MachineOperand &MO : MI.operands()) {
3492 if (!MO.isReg())
3493 continue;
3494 Register Reg = MO.getReg();
3495 if (const RegisterBank *Bank = getRegBank(Reg, MRI, *TRI)) {
3496 RegBank = regBankUnion(RegBank, Bank->getID());
3497 if (RegBank == AMDGPU::VGPRRegBankID)
3498 break;
3499 }
3500 }
3501
3502 return RegBank;
3503}
3504 
3505 bool AMDGPURegisterBankInfo::isSALUMapping(const MachineInstr &MI) const {
3506 const MachineFunction &MF = *MI.getParent()->getParent();
3507 const MachineRegisterInfo &MRI = MF.getRegInfo();
3508 for (const MachineOperand &MO : MI.operands()) {
3509 if (!MO.isReg())
3510 continue;
3511 Register Reg = MO.getReg();
3512 if (const RegisterBank *Bank = getRegBank(Reg, MRI, *TRI)) {
3513 if (Bank->getID() != AMDGPU::SGPRRegBankID)
3514 return false;
3515 }
3516 }
3517 return true;
3518}
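// Note the difference between the two queries above: getMappingType() joins
// the operand banks (a single VGPR operand makes the result VGPR), while
// isSALUMapping() is a strict all-SGPR test. Operands with no bank assigned
// yet are ignored by both, so e.g. one SGPR operand plus one unassigned
// operand still reports a scalar mapping.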
3519 
3520 const RegisterBankInfo::InstructionMapping &
3521 AMDGPURegisterBankInfo::getDefaultMappingSOP(const MachineInstr &MI) const {
3522 const MachineFunction &MF = *MI.getParent()->getParent();
3523 const MachineRegisterInfo &MRI = MF.getRegInfo();
3524 SmallVector<const ValueMapping*, 8> OpdsMapping(MI.getNumOperands());
3525
3526 for (unsigned i = 0, e = MI.getNumOperands(); i != e; ++i) {
3527 const MachineOperand &SrcOp = MI.getOperand(i);
3528 if (!SrcOp.isReg())
3529 continue;
3530
3531 unsigned Size = getSizeInBits(SrcOp.getReg(), MRI, *TRI);
3532 OpdsMapping[i] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
3533 }
3534 return getInstructionMapping(1, 1, getOperandsMapping(OpdsMapping),
3535 MI.getNumOperands());
3536}
3537 
3538 const RegisterBankInfo::InstructionMapping &
3539 AMDGPURegisterBankInfo::getDefaultMappingVOP(const MachineInstr &MI) const {
3540 const MachineFunction &MF = *MI.getParent()->getParent();
3541 const MachineRegisterInfo &MRI = MF.getRegInfo();
3542 SmallVector<const ValueMapping*, 8> OpdsMapping(MI.getNumOperands());
3543
3544 // Even though we technically could use SGPRs, this would require knowledge of
3545 // the constant bus restriction. Force all sources to VGPR (except for VCC).
3546 //
3547 // TODO: Unary ops are trivially OK, so accept SGPRs?
3548 for (unsigned i = 0, e = MI.getNumOperands(); i != e; ++i) {
3549 const MachineOperand &Src = MI.getOperand(i);
3550 if (!Src.isReg())
3551 continue;
3552
3553 unsigned Size = getSizeInBits(Src.getReg(), MRI, *TRI);
3554 unsigned BankID = Size == 1 ? AMDGPU::VCCRegBankID : AMDGPU::VGPRRegBankID;
3555 OpdsMapping[i] = AMDGPU::getValueMapping(BankID, Size);
3556 }
3557
3558 return getInstructionMapping(1, 1, getOperandsMapping(OpdsMapping),
3559 MI.getNumOperands());
3560}
3561 
3562 const RegisterBankInfo::InstructionMapping &
3563 AMDGPURegisterBankInfo::getDefaultMappingAllVGPR(const MachineInstr &MI) const {
3564 const MachineFunction &MF = *MI.getParent()->getParent();
3565 const MachineRegisterInfo &MRI = MF.getRegInfo();
3566 SmallVector<const ValueMapping*, 8> OpdsMapping(MI.getNumOperands());
3567
3568 for (unsigned I = 0, E = MI.getNumOperands(); I != E; ++I) {
3569 const MachineOperand &Op = MI.getOperand(I);
3570 if (!Op.isReg())
3571 continue;
3572
3573 unsigned Size = getSizeInBits(Op.getReg(), MRI, *TRI);
3574 OpdsMapping[I] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
3575 }
3576
3577 return getInstructionMapping(1, 1, getOperandsMapping(OpdsMapping),
3578 MI.getNumOperands());
3579}
3580 
3581 const RegisterBankInfo::InstructionMapping &
3582 AMDGPURegisterBankInfo::getImageMapping(const MachineRegisterInfo &MRI,
3583 const MachineInstr &MI,
3584 int RsrcIdx) const {
3585 // The reported argument index is relative to the IR intrinsic call arguments,
3586 // so we need to shift by the number of defs and the intrinsic ID.
3587 RsrcIdx += MI.getNumExplicitDefs() + 1;
3588
3589 const int NumOps = MI.getNumOperands();
3590 SmallVector<const ValueMapping *, 8> OpdsMapping(NumOps);
3591
3592 // TODO: Should packed/unpacked D16 difference be reported here as part of
3593 // the value mapping?
3594 for (int I = 0; I != NumOps; ++I) {
3595 if (!MI.getOperand(I).isReg())
3596 continue;
3597
3598 Register OpReg = MI.getOperand(I).getReg();
3599 // We replace some dead address operands with $noreg
3600 if (!OpReg)
3601 continue;
3602
3603 unsigned Size = getSizeInBits(OpReg, MRI, *TRI);
3604
3605 // FIXME: Probably need a new intrinsic register bank searchable table to
3606 // handle arbitrary intrinsics easily.
3607 //
3608 // If this has a sampler, it immediately follows rsrc.
3609 const bool MustBeSGPR = I == RsrcIdx || I == RsrcIdx + 1;
3610
3611 if (MustBeSGPR) {
3612 // If this must be an SGPR, we must report whatever it is as legal.
3613 unsigned NewBank = getRegBankID(OpReg, MRI, AMDGPU::SGPRRegBankID);
3614 OpdsMapping[I] = AMDGPU::getValueMapping(NewBank, Size);
3615 } else {
3616 // Some operands must be VGPR, and these are easy to copy to.
3617 OpdsMapping[I] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
3618 }
3619 }
3620
3621 return getInstructionMapping(1, 1, getOperandsMapping(OpdsMapping), NumOps);
3622}
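// Example of the rule above for a typical image operation: with RsrcIdx
// pointing at the resource descriptor, operands RsrcIdx (rsrc) and
// RsrcIdx + 1 (sampler, when present) keep whatever bank they currently have
// (ideally SGPR), and every other register operand (vdata, vaddr, ...) is
// reported as VGPR. A divergent rsrc/sampler is fixed up later by
// applyMappingImage using a waterfall loop.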
3623
3624/// Return the mapping for a pointer argument.
3625 const RegisterBankInfo::ValueMapping *
3626 AMDGPURegisterBankInfo::getValueMappingForPtr(const MachineRegisterInfo &MRI,
3627 Register PtrReg) const {
3628 LLT PtrTy = MRI.getType(PtrReg);
3629 unsigned Size = PtrTy.getSizeInBits();
3630 if (Subtarget.useFlatForGlobal() ||
3631 !AMDGPU::isFlatGlobalAddrSpace(PtrTy.getAddressSpace()))
3632 return AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
3633
3634 // If we're using MUBUF instructions for global memory, an SGPR base register
3635 // is possible. Otherwise this needs to be a VGPR.
3636 const RegisterBank *PtrBank = getRegBank(PtrReg, MRI, *TRI);
3637 return AMDGPU::getValueMapping(PtrBank->getID(), Size);
3638}
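// For example, a uniform global pointer already living in SGPRs keeps its
// SGPR mapping here when the subtarget still addresses global memory through
// MUBUF (SGPR base is possible), but is reported as VGPR whenever flat
// instructions are used for global memory or the address space is not
// flat/global.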
3639 
3640 const RegisterBankInfo::InstructionMapping &
3641 AMDGPURegisterBankInfo::getInstrMappingForLoad(const MachineInstr &MI) const {
3642 
3643 const MachineFunction &MF = *MI.getParent()->getParent();
3644 const MachineRegisterInfo &MRI = MF.getRegInfo();
3645 SmallVector<const ValueMapping*, 2> OpdsMapping(2);
3646 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
3647 Register PtrReg = MI.getOperand(1).getReg();
3648 LLT PtrTy = MRI.getType(PtrReg);
3649 unsigned AS = PtrTy.getAddressSpace();
3650 unsigned PtrSize = PtrTy.getSizeInBits();
3651
3652 const ValueMapping *ValMapping;
3653 const ValueMapping *PtrMapping;
3654
3655 const RegisterBank *PtrBank = getRegBank(PtrReg, MRI, *TRI);
3656
3657 if (PtrBank == &AMDGPU::SGPRRegBank && AMDGPU::isFlatGlobalAddrSpace(AS)) {
3658 if (isScalarLoadLegal(MI)) {
3659 // We have a uniform instruction so we want to use an SMRD load
3660 ValMapping = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
3661 PtrMapping = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, PtrSize);
3662 } else {
3663 ValMapping = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
3664
3665 // If we're using MUBUF instructions for global memory, an SGPR base
3666 // register is possible. Otherwise this needs to be a VGPR.
3667 unsigned PtrBankID = Subtarget.useFlatForGlobal() ?
3668 AMDGPU::VGPRRegBankID : AMDGPU::SGPRRegBankID;
3669
3670 PtrMapping = AMDGPU::getValueMapping(PtrBankID, PtrSize);
3671 }
3672 } else {
3673 ValMapping = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
3674 PtrMapping = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, PtrSize);
3675 }
3676
3677 OpdsMapping[0] = ValMapping;
3678 OpdsMapping[1] = PtrMapping;
3679 const RegisterBankInfo::InstructionMapping &Mapping = getInstructionMapping(
3680 1, 1, getOperandsMapping(OpdsMapping), MI.getNumOperands());
3681 return Mapping;
3682
3683 // FIXME: Do we want to add a mapping for FLAT load, or should we just
3684 // handle that during instruction selection?
3685}
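// Worked example for the mapping above: a load from a flat/global pointer
// held in SGPRs, with a legal scalar memory access, maps both the result and
// the pointer to SGPRs (an SMEM load). If the pointer is divergent, or
// isScalarLoadLegal() rejects the access, the result becomes VGPR, and the
// pointer may stay SGPR only when MUBUF addressing of global memory is in
// use.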
3686
3687unsigned
3688 AMDGPURegisterBankInfo::getRegBankID(Register Reg,
3689 const MachineRegisterInfo &MRI,
3690 unsigned Default) const {
3691 const RegisterBank *Bank = getRegBank(Reg, MRI, *TRI);
3692 return Bank ? Bank->getID() : Default;
3693}
3694 
3695 const RegisterBankInfo::ValueMapping *
3696 AMDGPURegisterBankInfo::getSGPROpMapping(Register Reg,
3697 const MachineRegisterInfo &MRI,
3698 const TargetRegisterInfo &TRI) const {
3699 // Lie and claim anything is legal, even though this needs to be an SGPR.
3700 // applyMapping will have to deal with it as a waterfall loop.
3701 unsigned Bank = getRegBankID(Reg, MRI, AMDGPU::SGPRRegBankID);
3702 unsigned Size = getSizeInBits(Reg, MRI, TRI);
3703 return AMDGPU::getValueMapping(Bank, Size);
3704}
3705 
3706 const RegisterBankInfo::ValueMapping *
3707 AMDGPURegisterBankInfo::getVGPROpMapping(Register Reg,
3708 const MachineRegisterInfo &MRI,
3709 const TargetRegisterInfo &TRI) const {
3710 unsigned Size = getSizeInBits(Reg, MRI, TRI);
3711 return AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
3712}
3713 
3714 const RegisterBankInfo::ValueMapping *
3715 AMDGPURegisterBankInfo::getAGPROpMapping(Register Reg,
3716 const MachineRegisterInfo &MRI,
3717 const TargetRegisterInfo &TRI) const {
3718 unsigned Size = getSizeInBits(Reg, MRI, TRI);
3719 return AMDGPU::getValueMapping(AMDGPU::AGPRRegBankID, Size);
3720}
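// These three per-operand helpers mirror the instruction-level defaults:
// getSGPROpMapping() deliberately reports whatever bank the value currently
// has (a divergent value is repaired later with readfirstlane or a waterfall
// loop), while getVGPROpMapping()/getAGPROpMapping() force the reported bank,
// since copying into a VGPR/AGPR is always possible. For example, the MFMA
// mappings below use the AGPR form for vdst/srcC when the function may need
// AGPRs.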
3721
3722///
3723/// This function must return a legal mapping, because
3724/// AMDGPURegisterBankInfo::getInstrAlternativeMappings() is not called
3725/// in RegBankSelect::Mode::Fast. Any mapping that would cause a
3726/// VGPR to SGPR copy to be generated is illegal.
3727///
3728// Operands that must be SGPRs must accept potentially divergent VGPRs as
3729// legal. These will be dealt with in applyMappingImpl.
3730//
3733 const MachineFunction &MF = *MI.getParent()->getParent();
3734 const MachineRegisterInfo &MRI = MF.getRegInfo();
3735
3736 if (MI.isCopy() || MI.getOpcode() == AMDGPU::G_FREEZE) {
3737 // The default logic bothers to analyze impossible alternative mappings. We
3738 // want the most straightforward mapping, so just directly handle this.
3739 const RegisterBank *DstBank = getRegBank(MI.getOperand(0).getReg(), MRI,
3740 *TRI);
3741 const RegisterBank *SrcBank = getRegBank(MI.getOperand(1).getReg(), MRI,
3742 *TRI);
3743 assert(SrcBank && "src bank should have been assigned already");
3744 if (!DstBank)
3745 DstBank = SrcBank;
3746
3747 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
3748 if (MI.getOpcode() != AMDGPU::G_FREEZE &&
3749 cannotCopy(*DstBank, *SrcBank, TypeSize::getFixed(Size)))
3750 return getInvalidInstructionMapping();
3751
3752 const ValueMapping &ValMap = getValueMapping(0, Size, *DstBank);
3753 unsigned OpdsMappingSize = MI.isCopy() ? 1 : 2;
3754 SmallVector<const ValueMapping *, 1> OpdsMapping(OpdsMappingSize);
3755 OpdsMapping[0] = &ValMap;
3756 if (MI.getOpcode() == AMDGPU::G_FREEZE)
3757 OpdsMapping[1] = &ValMap;
3758
3759 return getInstructionMapping(
3760 1, /*Cost*/ 1,
3761 /*OperandsMapping*/ getOperandsMapping(OpdsMapping), OpdsMappingSize);
3762 }
3763
3764 if (MI.isRegSequence()) {
3765 // If any input is a VGPR, the result must be a VGPR. The default handling
3766 // assumes any copy between banks is legal.
3767 unsigned BankID = AMDGPU::SGPRRegBankID;
3768
3769 for (unsigned I = 1, E = MI.getNumOperands(); I != E; I += 2) {
3770 auto OpBank = getRegBankID(MI.getOperand(I).getReg(), MRI);
3771 // It doesn't make sense to use vcc or scc banks here, so just ignore
3772 // them.
3773 if (OpBank != AMDGPU::SGPRRegBankID) {
3774 BankID = AMDGPU::VGPRRegBankID;
3775 break;
3776 }
3777 }
3778 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
3779
3780 const ValueMapping &ValMap = getValueMapping(0, Size, getRegBank(BankID));
3781 return getInstructionMapping(
3782 1, /*Cost*/ 1,
3783 /*OperandsMapping*/ getOperandsMapping({&ValMap}), 1);
3784 }
3785
3786 // The default handling is broken and doesn't handle illegal VGPR->SGPR copies
3787 // properly.
3788 //
3789 // TODO: There are additional exec masking dependencies to analyze.
3790 if (auto *PHI = dyn_cast<GPhi>(&MI)) {
3791 unsigned ResultBank = AMDGPU::InvalidRegBankID;
3792 Register DstReg = PHI->getReg(0);
3793
3794 // Sometimes the result may have already been assigned a bank.
3795 if (const RegisterBank *DstBank = getRegBank(DstReg, MRI, *TRI))
3796 ResultBank = DstBank->getID();
3797
3798 for (unsigned I = 0; I < PHI->getNumIncomingValues(); ++I) {
3799 Register Reg = PHI->getIncomingValue(I);
3800 const RegisterBank *Bank = getRegBank(Reg, MRI, *TRI);
3801
3802 // FIXME: Assuming VGPR for any undetermined inputs.
3803 if (!Bank || Bank->getID() == AMDGPU::VGPRRegBankID) {
3804 ResultBank = AMDGPU::VGPRRegBankID;
3805 break;
3806 }
3807
3808 // FIXME: Need to promote SGPR case to s32
3809 unsigned OpBank = Bank->getID();
3810 ResultBank = regBankBoolUnion(ResultBank, OpBank);
3811 }
3812
3813 assert(ResultBank != AMDGPU::InvalidRegBankID);
3814
3815 unsigned Size = MRI.getType(DstReg).getSizeInBits();
3816
3817 const ValueMapping &ValMap =
3818 getValueMapping(0, Size, getRegBank(ResultBank));
3819 return getInstructionMapping(
3820 1, /*Cost*/ 1,
3821 /*OperandsMapping*/ getOperandsMapping({&ValMap}), 1);
3822 }
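// Example of the join above: a boolean phi with one VCC input and one SGPR
// (scalar condition) input gets a VCC result via regBankBoolUnion, while any
// VGPR or still-unassigned input immediately forces a VGPR result. For phis
// with no VCC inputs this degenerates to the plain regBankUnion join.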
3823 
3824 const RegisterBankInfo::InstructionMapping &Mapping = getInstrMappingImpl(MI);
3825 if (Mapping.isValid())
3826 return Mapping;
3827
3828 SmallVector<const ValueMapping*, 8> OpdsMapping(MI.getNumOperands());
3829
3830 switch (MI.getOpcode()) {
3831 default:
3832 return getInvalidInstructionMapping();
3833
3834 case AMDGPU::G_AND:
3835 case AMDGPU::G_OR:
3836 case AMDGPU::G_XOR:
3837 case AMDGPU::G_MUL: {
3838 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
3839 if (Size == 1) {
3840 const RegisterBank *DstBank
3841 = getRegBank(MI.getOperand(0).getReg(), MRI, *TRI);
3842
3843 unsigned TargetBankID = AMDGPU::InvalidRegBankID;
3844 unsigned BankLHS = AMDGPU::InvalidRegBankID;
3845 unsigned BankRHS = AMDGPU::InvalidRegBankID;
3846 if (DstBank) {
3847 TargetBankID = DstBank->getID();
3848 if (DstBank == &AMDGPU::VCCRegBank) {
3849 TargetBankID = AMDGPU::VCCRegBankID;
3850 BankLHS = AMDGPU::VCCRegBankID;
3851 BankRHS = AMDGPU::VCCRegBankID;
3852 } else {
3853 BankLHS = getRegBankID(MI.getOperand(1).getReg(), MRI,
3854 AMDGPU::SGPRRegBankID);
3855 BankRHS = getRegBankID(MI.getOperand(2).getReg(), MRI,
3856 AMDGPU::SGPRRegBankID);
3857 }
3858 } else {
3859 BankLHS = getRegBankID(MI.getOperand(1).getReg(), MRI,
3860 AMDGPU::VCCRegBankID);
3861 BankRHS = getRegBankID(MI.getOperand(2).getReg(), MRI,
3862 AMDGPU::VCCRegBankID);
3863
3864 // Both inputs should be true booleans to produce a boolean result.
3865 if (BankLHS == AMDGPU::VGPRRegBankID || BankRHS == AMDGPU::VGPRRegBankID) {
3866 TargetBankID = AMDGPU::VGPRRegBankID;
3867 } else if (BankLHS == AMDGPU::VCCRegBankID || BankRHS == AMDGPU::VCCRegBankID) {
3868 TargetBankID = AMDGPU::VCCRegBankID;
3869 BankLHS = AMDGPU::VCCRegBankID;
3870 BankRHS = AMDGPU::VCCRegBankID;
3871 } else if (BankLHS == AMDGPU::SGPRRegBankID && BankRHS == AMDGPU::SGPRRegBankID) {
3872 TargetBankID = AMDGPU::SGPRRegBankID;
3873 }
3874 }
3875
3876 OpdsMapping[0] = AMDGPU::getValueMapping(TargetBankID, Size);
3877 OpdsMapping[1] = AMDGPU::getValueMapping(BankLHS, Size);
3878 OpdsMapping[2] = AMDGPU::getValueMapping(BankRHS, Size);
3879 break;
3880 }
3881
3882 if (Size == 64) {
3883
3884 if (isSALUMapping(MI)) {
3885 OpdsMapping[0] = getValueMappingSGPR64Only(AMDGPU::SGPRRegBankID, Size);
3886 OpdsMapping[1] = OpdsMapping[2] = OpdsMapping[0];
3887 } else {
3888 OpdsMapping[0] = getValueMappingSGPR64Only(AMDGPU::VGPRRegBankID, Size);
3889 unsigned Bank1 = getRegBankID(MI.getOperand(1).getReg(), MRI /*, DefaultBankID*/);
3890 OpdsMapping[1] = AMDGPU::getValueMapping(Bank1, Size);
3891
3892 unsigned Bank2 = getRegBankID(MI.getOperand(2).getReg(), MRI /*, DefaultBankID*/);
3893 OpdsMapping[2] = AMDGPU::getValueMapping(Bank2, Size);
3894 }
3895
3896 break;
3897 }
3898
3899 [[fallthrough]];
3900 }
3901 case AMDGPU::G_PTR_ADD:
3902 case AMDGPU::G_PTRMASK:
3903 case AMDGPU::G_ADD:
3904 case AMDGPU::G_SUB:
3905 case AMDGPU::G_SHL:
3906 case AMDGPU::G_LSHR:
3907 case AMDGPU::G_ASHR:
3908 case AMDGPU::G_UADDO:
3909 case AMDGPU::G_USUBO:
3910 case AMDGPU::G_UADDE:
3911 case AMDGPU::G_SADDE:
3912 case AMDGPU::G_USUBE:
3913 case AMDGPU::G_SSUBE:
3914 case AMDGPU::G_SMIN:
3915 case AMDGPU::G_SMAX:
3916 case AMDGPU::G_UMIN:
3917 case AMDGPU::G_UMAX:
3918 case AMDGPU::G_ABS:
3919 case AMDGPU::G_SHUFFLE_VECTOR:
3920 case AMDGPU::G_SBFX:
3921 case AMDGPU::G_UBFX:
3922 case AMDGPU::G_AMDGPU_S_MUL_I64_I32:
3923 case AMDGPU::G_AMDGPU_S_MUL_U64_U32:
3924 if (isSALUMapping(MI))
3925 return getDefaultMappingSOP(MI);
3926 return getDefaultMappingVOP(MI);
3927 case AMDGPU::G_FADD:
3928 case AMDGPU::G_FSUB:
3929 case AMDGPU::G_FMUL:
3930 case AMDGPU::G_FMA:
3931 case AMDGPU::G_FFLOOR:
3932 case AMDGPU::G_FCEIL:
3933 case AMDGPU::G_INTRINSIC_ROUNDEVEN:
3934 case AMDGPU::G_FMINNUM:
3935 case AMDGPU::G_FMAXNUM:
3936 case AMDGPU::G_FMINIMUM:
3937 case AMDGPU::G_FMAXIMUM:
3938 case AMDGPU::G_INTRINSIC_TRUNC:
3939 case AMDGPU::G_STRICT_FADD:
3940 case AMDGPU::G_STRICT_FSUB:
3941 case AMDGPU::G_STRICT_FMUL:
3942 case AMDGPU::G_STRICT_FMA: {
3943 LLT Ty = MRI.getType(MI.getOperand(0).getReg());
3944 unsigned Size = Ty.getSizeInBits();
3945 if (Subtarget.hasSALUFloatInsts() && Ty.isScalar() &&
3946 (Size == 32 || Size == 16) && isSALUMapping(MI))
3947 return getDefaultMappingSOP(MI);
3948 return getDefaultMappingVOP(MI);
3949 }
3950 case AMDGPU::G_FPTOSI:
3951 case AMDGPU::G_FPTOUI:
3952 case AMDGPU::G_SITOFP:
3953 case AMDGPU::G_UITOFP: {
3954 unsigned SizeDst = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
3955 unsigned SizeSrc = MRI.getType(MI.getOperand(1).getReg()).getSizeInBits();
3956 if (Subtarget.hasSALUFloatInsts() && SizeDst == 32 && SizeSrc == 32 &&
3957 isSALUMapping(MI))
3958 return getDefaultMappingSOP(MI);
3959 return getDefaultMappingVOP(MI);
3960 }
3961 case AMDGPU::G_FPTRUNC:
3962 case AMDGPU::G_FPEXT: {
3963 unsigned SizeDst = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
3964 unsigned SizeSrc = MRI.getType(MI.getOperand(1).getReg()).getSizeInBits();
3965 if (Subtarget.hasSALUFloatInsts() && SizeDst != 64 && SizeSrc != 64 &&
3966 isSALUMapping(MI))
3967 return getDefaultMappingSOP(MI);
3968 return getDefaultMappingVOP(MI);
3969 }
3970 case AMDGPU::G_FSQRT:
3971 case AMDGPU::G_FEXP2:
3972 case AMDGPU::G_FLOG2: {
3973 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
3974 if (Subtarget.hasPseudoScalarTrans() && (Size == 16 || Size == 32) &&
3975 isSALUMapping(MI))
3976 return getDefaultMappingSOP(MI);
3977 return getDefaultMappingVOP(MI);
3978 }
3979 case AMDGPU::G_SADDSAT: // FIXME: Could lower sat ops for SALU
3980 case AMDGPU::G_SSUBSAT:
3981 case AMDGPU::G_UADDSAT:
3982 case AMDGPU::G_USUBSAT:
3983 case AMDGPU::G_FMAD:
3984 case AMDGPU::G_FLDEXP:
3985 case AMDGPU::G_FMINNUM_IEEE:
3986 case AMDGPU::G_FMAXNUM_IEEE:
3987 case AMDGPU::G_FCANONICALIZE:
3988 case AMDGPU::G_STRICT_FLDEXP:
3989 case AMDGPU::G_BSWAP: // TODO: Somehow expand for scalar?
3990 case AMDGPU::G_FSHR: // TODO: Expand for scalar
3991 case AMDGPU::G_AMDGPU_FMIN_LEGACY:
3992 case AMDGPU::G_AMDGPU_FMAX_LEGACY:
3993 case AMDGPU::G_AMDGPU_RCP_IFLAG:
3994 case AMDGPU::G_AMDGPU_CVT_F32_UBYTE0:
3995 case AMDGPU::G_AMDGPU_CVT_F32_UBYTE1:
3996 case AMDGPU::G_AMDGPU_CVT_F32_UBYTE2:
3997 case AMDGPU::G_AMDGPU_CVT_F32_UBYTE3:
3998 case AMDGPU::G_AMDGPU_CVT_PK_I16_I32:
3999 case AMDGPU::G_AMDGPU_SMED3:
4000 case AMDGPU::G_AMDGPU_FMED3:
4001 return getDefaultMappingVOP(MI);
4002 case AMDGPU::G_UMULH:
4003 case AMDGPU::G_SMULH: {
4004 if (Subtarget.hasScalarMulHiInsts() && isSALUMapping(MI))
4005 return getDefaultMappingSOP(MI);
4006 return getDefaultMappingVOP(MI);
4007 }
4008 case AMDGPU::G_AMDGPU_MAD_U64_U32:
4009 case AMDGPU::G_AMDGPU_MAD_I64_I32: {
4010 // Three possible mappings:
4011 //
4012 // - Default SOP
4013 // - Default VOP
4014 // - Scalar multiply: src0 and src1 are SGPRs, the rest is VOP.
4015 //
4016 // This allows instruction selection to keep the multiplication part of the
4017 // instruction on the SALU.
4018 bool AllSalu = true;
4019 bool MulSalu = true;
4020 for (unsigned i = 0; i < 5; ++i) {
4021 Register Reg = MI.getOperand(i).getReg();
4022 if (const RegisterBank *Bank = getRegBank(Reg, MRI, *TRI)) {
4023 if (Bank->getID() != AMDGPU::SGPRRegBankID) {
4024 AllSalu = false;
4025 if (i == 2 || i == 3) {
4026 MulSalu = false;
4027 break;
4028 }
4029 }
4030 }
4031 }
4032
4033 if (AllSalu)
4034 return getDefaultMappingSOP(MI);
4035
4036 // If the multiply-add is full-rate in VALU, use that even if the
4037 // multiplication part is scalar. Accumulating separately on the VALU would
4038 // take two instructions.
4039 if (!MulSalu || Subtarget.hasFullRate64Ops())
4040 return getDefaultMappingVOP(MI);
4041
4042 // Keep the multiplication on the SALU, then accumulate on the VALU.
4043 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 64);
4044 OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
4045 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 32);
4046 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 32);
4047 OpdsMapping[4] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 64);
4048 break;
4049 }
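// Example of the mixed mapping chosen above (no full-rate 64-bit VALU ops,
// both multiply sources uniform, accumulator divergent), roughly:
//   %d:vgpr(s64), %c:vcc(s1) = G_AMDGPU_MAD_U64_U32 %a:sgpr(s32), %b:sgpr(s32), %acc:vgpr(s64)
// Selection can then keep the 32x32->64 multiply on the SALU and only the
// accumulate of %acc on the VALU.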
4050 case AMDGPU::G_IMPLICIT_DEF: {
4051 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4052 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
4053 break;
4054 }
4055 case AMDGPU::G_FCONSTANT:
4056 case AMDGPU::G_CONSTANT:
4057 case AMDGPU::G_GLOBAL_VALUE:
4058 case AMDGPU::G_BLOCK_ADDR:
4059 case AMDGPU::G_READSTEADYCOUNTER:
4060 case AMDGPU::G_READCYCLECOUNTER: {
4061 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4062 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
4063 break;
4064 }
4065 case AMDGPU::G_FRAME_INDEX: {
4066 // TODO: This should be the same as other constants, but eliminateFrameIndex
4067 // currently assumes VALU uses.
4068 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4069 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4070 break;
4071 }
4072 case AMDGPU::G_DYN_STACKALLOC: {
4073 // Result is always uniform, and a wave reduction is needed for the source.
4074 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 32);
4075 unsigned SrcBankID = getRegBankID(MI.getOperand(1).getReg(), MRI);
4076 OpdsMapping[1] = AMDGPU::getValueMapping(SrcBankID, 32);
4077 break;
4078 }
4079 case AMDGPU::G_AMDGPU_WAVE_ADDRESS: {
4080 // This case is weird because we expect a physical register in the source,
4081 // but need to set a bank anyway.
4082 //
4083 // TODO: We could select the result to SGPR or VGPR
4084 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 32);
4085 OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 32);
4086 break;
4087 }
4088 case AMDGPU::G_INSERT: {
4089 unsigned BankID = getMappingType(MRI, MI);
4090 unsigned DstSize = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
4091 unsigned SrcSize = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
4092 unsigned EltSize = getSizeInBits(MI.getOperand(2).getReg(), MRI, *TRI);
4093 OpdsMapping[0] = AMDGPU::getValueMapping(BankID, DstSize);
4094 OpdsMapping[1] = AMDGPU::getValueMapping(BankID, SrcSize);
4095 OpdsMapping[2] = AMDGPU::getValueMapping(BankID, EltSize);
4096 OpdsMapping[3] = nullptr;
4097 break;
4098 }
4099 case AMDGPU::G_EXTRACT: {
4100 unsigned BankID = getRegBankID(MI.getOperand(1).getReg(), MRI);
4101 unsigned DstSize = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
4102 unsigned SrcSize = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
4103 OpdsMapping[0] = AMDGPU::getValueMapping(BankID, DstSize);
4104 OpdsMapping[1] = AMDGPU::getValueMapping(BankID, SrcSize);
4105 OpdsMapping[2] = nullptr;
4106 break;
4107 }
4108 case AMDGPU::G_BUILD_VECTOR:
4109 case AMDGPU::G_BUILD_VECTOR_TRUNC: {
4110 LLT DstTy = MRI.getType(MI.getOperand(0).getReg());
4111 if (DstTy == LLT::fixed_vector(2, 16)) {
4112 unsigned DstSize = DstTy.getSizeInBits();
4113 unsigned SrcSize = MRI.getType(MI.getOperand(1).getReg()).getSizeInBits();
4114 unsigned Src0BankID = getRegBankID(MI.getOperand(1).getReg(), MRI);
4115 unsigned Src1BankID = getRegBankID(MI.getOperand(2).getReg(), MRI);
4116 unsigned DstBankID = regBankUnion(Src0BankID, Src1BankID);
4117
4118 OpdsMapping[0] = AMDGPU::getValueMapping(DstBankID, DstSize);
4119 OpdsMapping[1] = AMDGPU::getValueMapping(Src0BankID, SrcSize);
4120 OpdsMapping[2] = AMDGPU::getValueMapping(Src1BankID, SrcSize);
4121 break;
4122 }
4123
4124 [[fallthrough]];
4125 }
4126 case AMDGPU::G_MERGE_VALUES:
4127 case AMDGPU::G_CONCAT_VECTORS: {
4128 unsigned Bank = getMappingType(MRI, MI);
4129 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4130 unsigned SrcSize = MRI.getType(MI.getOperand(1).getReg()).getSizeInBits();
4131
4132 OpdsMapping[0] = AMDGPU::getValueMapping(Bank, DstSize);
4133 // Op1 and Dst should use the same register bank.
4134 for (unsigned i = 1, e = MI.getNumOperands(); i != e; ++i)
4135 OpdsMapping[i] = AMDGPU::getValueMapping(Bank, SrcSize);
4136 break;
4137 }
4138 case AMDGPU::G_BITREVERSE:
4139 case AMDGPU::G_BITCAST:
4140 case AMDGPU::G_INTTOPTR:
4141 case AMDGPU::G_PTRTOINT:
4142 case AMDGPU::G_FABS:
4143 case AMDGPU::G_FNEG: {
4144 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4145 unsigned BankID = getRegBankID(MI.getOperand(1).getReg(), MRI);
4146 OpdsMapping[0] = OpdsMapping[1] = AMDGPU::getValueMapping(BankID, Size);
4147 break;
4148 }
4149 case AMDGPU::G_AMDGPU_FFBH_U32:
4150 case AMDGPU::G_AMDGPU_FFBL_B32:
4151 case AMDGPU::G_CTLZ_ZERO_UNDEF:
4152 case AMDGPU::G_CTTZ_ZERO_UNDEF: {
4153 unsigned Size = MRI.getType(MI.getOperand(1).getReg()).getSizeInBits();
4154 unsigned BankID = getRegBankID(MI.getOperand(1).getReg(), MRI);
4155 OpdsMapping[0] = AMDGPU::getValueMapping(BankID, 32);
4156 OpdsMapping[1] = AMDGPU::getValueMappingSGPR64Only(BankID, Size);
4157 break;
4158 }
4159 case AMDGPU::G_CTPOP: {
4160 unsigned Size = MRI.getType(MI.getOperand(1).getReg()).getSizeInBits();
4161 unsigned BankID = getRegBankID(MI.getOperand(1).getReg(), MRI);
4162 OpdsMapping[0] = AMDGPU::getValueMapping(BankID, 32);
4163
4164 // This should really be getValueMappingSGPR64Only, but allowing the generic
4165 // code to handle the register split just makes using LegalizerHelper more
4166 // difficult.
4167 OpdsMapping[1] = AMDGPU::getValueMapping(BankID, Size);
4168 break;
4169 }
4170 case AMDGPU::G_TRUNC: {
4171 Register Dst = MI.getOperand(0).getReg();
4172 Register Src = MI.getOperand(1).getReg();
4173 unsigned Bank = getRegBankID(Src, MRI);
4174 unsigned DstSize = getSizeInBits(Dst, MRI, *TRI);
4175 unsigned SrcSize = getSizeInBits(Src, MRI, *TRI);
4176 OpdsMapping[0] = AMDGPU::getValueMapping(Bank, DstSize);
4177 OpdsMapping[1] = AMDGPU::getValueMapping(Bank, SrcSize);
4178 break;
4179 }
4180 case AMDGPU::G_ZEXT:
4181 case AMDGPU::G_SEXT:
4182 case AMDGPU::G_ANYEXT:
4183 case AMDGPU::G_SEXT_INREG: {
4184 Register Dst = MI.getOperand(0).getReg();
4185 Register Src = MI.getOperand(1).getReg();
4186 unsigned DstSize = getSizeInBits(Dst, MRI, *TRI);
4187 unsigned SrcSize = getSizeInBits(Src, MRI, *TRI);
4188
4189 unsigned DstBank;
4190 const RegisterBank *SrcBank = getRegBank(Src, MRI, *TRI);
4191 assert(SrcBank);
4192 switch (SrcBank->getID()) {
4193 case AMDGPU::SGPRRegBankID:
4194 DstBank = AMDGPU::SGPRRegBankID;
4195 break;
4196 default:
4197 DstBank = AMDGPU::VGPRRegBankID;
4198 break;
4199 }
4200
4201 // Scalar extend can use 64-bit BFE, but VGPRs require extending to
4202 // 32-bits, and then to 64.
4203 OpdsMapping[0] = AMDGPU::getValueMappingSGPR64Only(DstBank, DstSize);
4204 OpdsMapping[1] = AMDGPU::getValueMappingSGPR64Only(SrcBank->getID(),
4205 SrcSize);
4206 break;
4207 }
4208 case AMDGPU::G_IS_FPCLASS: {
4209 Register SrcReg = MI.getOperand(1).getReg();
4210 unsigned SrcSize = MRI.getType(SrcReg).getSizeInBits();
4211 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4212 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, DstSize);
4213 OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, SrcSize);
4214 break;
4215 }
4216 case AMDGPU::G_STORE: {
4217 assert(MI.getOperand(0).isReg());
4218 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4219
4220 // FIXME: We need to specify a different reg bank once scalar stores are
4221 // supported.
4222 const ValueMapping *ValMapping =
4223 AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4224 OpdsMapping[0] = ValMapping;
4225 OpdsMapping[1] = getValueMappingForPtr(MRI, MI.getOperand(1).getReg());
4226 break;
4227 }
4228 case AMDGPU::G_ICMP:
4229 case AMDGPU::G_FCMP: {
4230 unsigned Size = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
4231
4232 // See if the result register has already been constrained to vcc, which may
4233 // happen due to control flow intrinsic lowering.
4234 unsigned DstBank = getRegBankID(MI.getOperand(0).getReg(), MRI,
4235 AMDGPU::SGPRRegBankID);
4236 unsigned Op2Bank = getRegBankID(MI.getOperand(2).getReg(), MRI);
4237 unsigned Op3Bank = getRegBankID(MI.getOperand(3).getReg(), MRI);
4238
4239 auto canUseSCCICMP = [&]() {
4240 auto Pred =
4241 static_cast<CmpInst::Predicate>(MI.getOperand(1).getPredicate());
4242 return Size == 32 ||
4243 (Size == 64 &&
4244 (Pred == CmpInst::ICMP_EQ || Pred == CmpInst::ICMP_NE) &&
4245 Subtarget.hasScalarCompareEq64());
4246 };
4247 auto canUseSCCFCMP = [&]() {
4248 return Subtarget.hasSALUFloatInsts() && (Size == 32 || Size == 16);
4249 };
4250
4251 bool isICMP = MI.getOpcode() == AMDGPU::G_ICMP;
4252 bool CanUseSCC = DstBank == AMDGPU::SGPRRegBankID &&
4253 Op2Bank == AMDGPU::SGPRRegBankID &&
4254 Op3Bank == AMDGPU::SGPRRegBankID &&
4255 (isICMP ? canUseSCCICMP() : canUseSCCFCMP());
4256
4257 DstBank = CanUseSCC ? AMDGPU::SGPRRegBankID : AMDGPU::VCCRegBankID;
4258 unsigned SrcBank = CanUseSCC ? AMDGPU::SGPRRegBankID : AMDGPU::VGPRRegBankID;
4259
4260 // TODO: Use 32-bit for scalar output size.
4261 // SCC results will need to be copied to a 32-bit SGPR virtual register.
4262 const unsigned ResultSize = 1;
4263
4264 OpdsMapping[0] = AMDGPU::getValueMapping(DstBank, ResultSize);
4265 OpdsMapping[1] = nullptr; // Predicate Operand.
4266 OpdsMapping[2] = AMDGPU::getValueMapping(SrcBank, Size);
4267 OpdsMapping[3] = AMDGPU::getValueMapping(SrcBank, Size);
4268 break;
4269 }
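// Example for the compare mapping above, with SALU float instructions
// available: a uniform f32 compare whose result is not constrained to VCC
// maps roughly as
//   %c:sgpr(s1) = G_FCMP floatpred(olt), %a:sgpr(s32), %b:sgpr(s32)
// (selectable to an SCC-setting compare), while any divergent operand or a
// VCC-constrained result falls back to
//   %c:vcc(s1) = G_FCMP floatpred(olt), %a:vgpr(s32), %b:vgpr(s32)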
4270 case AMDGPU::G_EXTRACT_VECTOR_ELT: {
4271 // VGPR index can be used for waterfall when indexing a SGPR vector.
4272 unsigned SrcBankID = getRegBankID(MI.getOperand(1).getReg(), MRI);
4273 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4274 unsigned SrcSize = MRI.getType(MI.getOperand(1).getReg()).getSizeInBits();
4275 unsigned IdxSize = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
4276 unsigned IdxBank = getRegBankID(MI.getOperand(2).getReg(), MRI);
4277 unsigned OutputBankID = regBankUnion(SrcBankID, IdxBank);
4278
4279 OpdsMapping[0] = AMDGPU::getValueMappingSGPR64Only(OutputBankID, DstSize);
4280 OpdsMapping[1] = AMDGPU::getValueMapping(SrcBankID, SrcSize);
4281
4282 // The index can be in either bank if the source vector is a VGPR.
4283 OpdsMapping[2] = AMDGPU::getValueMapping(IdxBank, IdxSize);
4284 break;
4285 }
4286 case AMDGPU::G_INSERT_VECTOR_ELT: {
4287 unsigned OutputBankID = isSALUMapping(MI) ?
4288 AMDGPU::SGPRRegBankID : AMDGPU::VGPRRegBankID;
4289
4290 unsigned VecSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4291 unsigned InsertSize = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
4292 unsigned IdxSize = MRI.getType(MI.getOperand(3).getReg()).getSizeInBits();
4293 unsigned InsertEltBankID = getRegBankID(MI.getOperand(2).getReg(), MRI);
4294 unsigned IdxBankID = getRegBankID(MI.getOperand(3).getReg(), MRI);
4295
4296 OpdsMapping[0] = AMDGPU::getValueMapping(OutputBankID, VecSize);
4297 OpdsMapping[1] = AMDGPU::getValueMapping(OutputBankID, VecSize);
4298
4299 // This is a weird case, because we need to break down the mapping based on
4300 // the register bank of a different operand.
4301 if (InsertSize == 64 && OutputBankID == AMDGPU::VGPRRegBankID) {
4302 OpdsMapping[2] = AMDGPU::getValueMappingSplit64(InsertEltBankID,
4303 InsertSize);
4304 } else {
4305 assert(InsertSize == 32 || InsertSize == 64);
4306 OpdsMapping[2] = AMDGPU::getValueMapping(InsertEltBankID, InsertSize);
4307 }
4308
4309 // The index can be in either bank if the source vector is a VGPR.
4310 OpdsMapping[3] = AMDGPU::getValueMapping(IdxBankID, IdxSize);
4311 break;
4312 }
4313 case AMDGPU::G_UNMERGE_VALUES: {
4314 unsigned Bank = getMappingType(MRI, MI);
4315
4316 // Op1 and Dst should use the same register bank.
4317 // FIXME: Shouldn't this be the default? Why do we need to handle this?
4318 for (unsigned i = 0, e = MI.getNumOperands(); i != e; ++i) {
4319 unsigned Size = getSizeInBits(MI.getOperand(i).getReg(), MRI, *TRI);
4320 OpdsMapping[i] = AMDGPU::getValueMapping(Bank, Size);
4321 }
4322 break;
4323 }
4324 case AMDGPU::G_AMDGPU_BUFFER_LOAD:
4325 case AMDGPU::G_AMDGPU_BUFFER_LOAD_UBYTE:
4326 case AMDGPU::G_AMDGPU_BUFFER_LOAD_SBYTE:
4327 case AMDGPU::G_AMDGPU_BUFFER_LOAD_USHORT:
4328 case AMDGPU::G_AMDGPU_BUFFER_LOAD_SSHORT:
4329 case AMDGPU::G_AMDGPU_BUFFER_LOAD_TFE:
4330 case AMDGPU::G_AMDGPU_BUFFER_LOAD_UBYTE_TFE:
4331 case AMDGPU::G_AMDGPU_BUFFER_LOAD_SBYTE_TFE:
4332 case AMDGPU::G_AMDGPU_BUFFER_LOAD_USHORT_TFE:
4333 case AMDGPU::G_AMDGPU_BUFFER_LOAD_SSHORT_TFE:
4334 case AMDGPU::G_AMDGPU_BUFFER_LOAD_FORMAT:
4335 case AMDGPU::G_AMDGPU_BUFFER_LOAD_FORMAT_TFE:
4336 case AMDGPU::G_AMDGPU_BUFFER_LOAD_FORMAT_D16:
4337 case AMDGPU::G_AMDGPU_TBUFFER_LOAD_FORMAT:
4338 case AMDGPU::G_AMDGPU_TBUFFER_LOAD_FORMAT_D16:
4339 case AMDGPU::G_AMDGPU_TBUFFER_STORE_FORMAT:
4340 case AMDGPU::G_AMDGPU_TBUFFER_STORE_FORMAT_D16:
4341 case AMDGPU::G_AMDGPU_BUFFER_STORE:
4342 case AMDGPU::G_AMDGPU_BUFFER_STORE_BYTE:
4343 case AMDGPU::G_AMDGPU_BUFFER_STORE_SHORT:
4344 case AMDGPU::G_AMDGPU_BUFFER_STORE_FORMAT:
4345 case AMDGPU::G_AMDGPU_BUFFER_STORE_FORMAT_D16: {
4346 OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
4347
4348 // rsrc
4349 OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
4350
4351 // vindex
4352 OpdsMapping[2] = getVGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
4353
4354 // voffset
4355 OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
4356
4357 // soffset
4358 OpdsMapping[4] = getSGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
4359
4360 // Any remaining operands are immediates and were correctly null
4361 // initialized.
4362 break;
4363 }
4364 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_SWAP:
4365 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_ADD:
4366 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_SUB:
4367 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_SMIN:
4368 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_UMIN:
4369 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_SMAX:
4370 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_UMAX:
4371 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_AND:
4372 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_OR:
4373 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_XOR:
4374 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_INC:
4375 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_DEC:
4376 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_FADD:
4377 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_FMIN:
4378 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_FMAX: {
4379 // vdata_out
4380 OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
4381
4382 // vdata_in
4383 OpdsMapping[1] = getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
4384
4385 // rsrc
4386 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
4387
4388 // vindex
4389 OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
4390
4391 // voffset
4392 OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
4393
4394 // soffset
4395 OpdsMapping[5] = getSGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
4396
4397 // Any remaining operands are immediates and were correctly null
4398 // initialized.
4399 break;
4400 }
4401 case AMDGPU::G_AMDGPU_BUFFER_ATOMIC_CMPSWAP: {
4402 // vdata_out
4403 OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
4404
4405 // vdata_in
4406 OpdsMapping[1] = getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
4407
4408 // cmp
4409 OpdsMapping[2] = getVGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
4410
4411 // rsrc
4412 OpdsMapping[3] = getSGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
4413
4414 // vindex
4415 OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
4416
4417 // voffset
4418 OpdsMapping[5] = getVGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
4419
4420 // soffset
4421 OpdsMapping[6] = getSGPROpMapping(MI.getOperand(6).getReg(), MRI, *TRI);
4422
4423 // Any remaining operands are immediates and were correctly null
4424 // initialized.
4425 break;
4426 }
4427 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD:
4428 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD_UBYTE:
4429 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD_SBYTE:
4430 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD_USHORT:
4431 case AMDGPU::G_AMDGPU_S_BUFFER_LOAD_SSHORT: {
4432 // Lie and claim everything is legal, even though some need to be
4433 // SGPRs. applyMapping will have to deal with it as a waterfall loop.
4434 OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
4435 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
4436
4437 // We need to convert this to a MUBUF if either the resource or offset is
4438 // VGPR.
4439 unsigned RSrcBank = OpdsMapping[1]->BreakDown[0].RegBank->getID();
4440 unsigned OffsetBank = OpdsMapping[2]->BreakDown[0].RegBank->getID();
4441 unsigned ResultBank = regBankUnion(RSrcBank, OffsetBank);
4442
4443 unsigned Size0 = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4444 OpdsMapping[0] = AMDGPU::getValueMapping(ResultBank, Size0);
4445 break;
4446 }
4447 case AMDGPU::G_INTRINSIC:
4448 case AMDGPU::G_INTRINSIC_CONVERGENT: {
4449 switch (cast<GIntrinsic>(MI).getIntrinsicID()) {
4450 default:
4451 return getInvalidInstructionMapping();
4452 case Intrinsic::amdgcn_div_fmas:
4453 case Intrinsic::amdgcn_div_fixup:
4454 case Intrinsic::amdgcn_trig_preop:
4455 case Intrinsic::amdgcn_sin:
4456 case Intrinsic::amdgcn_cos:
4457 case Intrinsic::amdgcn_log_clamp:
4458 case Intrinsic::amdgcn_rcp_legacy:
4459 case Intrinsic::amdgcn_rsq_legacy:
4460 case Intrinsic::amdgcn_rsq_clamp:
4461 case Intrinsic::amdgcn_fmul_legacy:
4462 case Intrinsic::amdgcn_fma_legacy:
4463 case Intrinsic::amdgcn_frexp_mant:
4464 case Intrinsic::amdgcn_frexp_exp:
4465 case Intrinsic::amdgcn_fract:
4466 case Intrinsic::amdgcn_cvt_pknorm_i16:
4467 case Intrinsic::amdgcn_cvt_pknorm_u16:
4468 case Intrinsic::amdgcn_cvt_pk_i16:
4469 case Intrinsic::amdgcn_cvt_pk_u16:
4470 case Intrinsic::amdgcn_fmed3:
4471 case Intrinsic::amdgcn_cubeid:
4472 case Intrinsic::amdgcn_cubema:
4473 case Intrinsic::amdgcn_cubesc:
4474 case Intrinsic::amdgcn_cubetc:
4475 case Intrinsic::amdgcn_sffbh:
4476 case Intrinsic::amdgcn_fmad_ftz:
4477 case Intrinsic::amdgcn_mbcnt_lo:
4478 case Intrinsic::amdgcn_mbcnt_hi:
4479 case Intrinsic::amdgcn_mul_u24:
4480 case Intrinsic::amdgcn_mul_i24:
4481 case Intrinsic::amdgcn_mulhi_u24:
4482 case Intrinsic::amdgcn_mulhi_i24:
4483 case Intrinsic::amdgcn_lerp:
4484 case Intrinsic::amdgcn_sad_u8:
4485 case Intrinsic::amdgcn_msad_u8:
4486 case Intrinsic::amdgcn_sad_hi_u8:
4487 case Intrinsic::amdgcn_sad_u16:
4488 case Intrinsic::amdgcn_qsad_pk_u16_u8:
4489 case Intrinsic::amdgcn_mqsad_pk_u16_u8:
4490 case Intrinsic::amdgcn_mqsad_u32_u8:
4491 case Intrinsic::amdgcn_cvt_pk_u8_f32:
4492 case Intrinsic::amdgcn_alignbyte:
4493 case Intrinsic::amdgcn_perm:
4494 case Intrinsic::amdgcn_fdot2:
4495 case Intrinsic::amdgcn_sdot2:
4496 case Intrinsic::amdgcn_udot2:
4497 case Intrinsic::amdgcn_sdot4:
4498 case Intrinsic::amdgcn_udot4:
4499 case Intrinsic::amdgcn_sdot8:
4500 case Intrinsic::amdgcn_udot8:
4501 case Intrinsic::amdgcn_fdot2_bf16_bf16:
4502 case Intrinsic::amdgcn_fdot2_f16_f16:
4503 case Intrinsic::amdgcn_fdot2_f32_bf16:
4504 case Intrinsic::amdgcn_sudot4:
4505 case Intrinsic::amdgcn_sudot8:
4506 case Intrinsic::amdgcn_dot4_f32_fp8_bf8:
4507 case Intrinsic::amdgcn_dot4_f32_bf8_fp8:
4508 case Intrinsic::amdgcn_dot4_f32_fp8_fp8:
4509 case Intrinsic::amdgcn_dot4_f32_bf8_bf8:
4510 case Intrinsic::amdgcn_cvt_f32_fp8:
4511 case Intrinsic::amdgcn_cvt_f32_bf8:
4512 case Intrinsic::amdgcn_cvt_pk_f32_fp8:
4513 case Intrinsic::amdgcn_cvt_pk_f32_bf8:
4514 case Intrinsic::amdgcn_cvt_pk_fp8_f32:
4515 case Intrinsic::amdgcn_cvt_pk_bf8_f32:
4516 case Intrinsic::amdgcn_cvt_sr_fp8_f32:
4517 case Intrinsic::amdgcn_cvt_sr_bf8_f32:
4518 case Intrinsic::amdgcn_wmma_bf16_16x16x16_bf16:
4519 case Intrinsic::amdgcn_wmma_f16_16x16x16_f16:
4520 case Intrinsic::amdgcn_wmma_bf16_16x16x16_bf16_tied:
4521 case Intrinsic::amdgcn_wmma_f16_16x16x16_f16_tied:
4522 case Intrinsic::amdgcn_wmma_f32_16x16x16_bf16:
4523 case Intrinsic::amdgcn_wmma_f32_16x16x16_f16:
4524 case Intrinsic::amdgcn_wmma_i32_16x16x16_iu4:
4525 case Intrinsic::amdgcn_wmma_i32_16x16x16_iu8:
4526 case Intrinsic::amdgcn_wmma_f32_16x16x16_fp8_fp8:
4527 case Intrinsic::amdgcn_wmma_f32_16x16x16_fp8_bf8:
4528 case Intrinsic::amdgcn_wmma_f32_16x16x16_bf8_fp8:
4529 case Intrinsic::amdgcn_wmma_f32_16x16x16_bf8_bf8:
4530 case Intrinsic::amdgcn_wmma_i32_16x16x32_iu4:
4531 case Intrinsic::amdgcn_swmmac_f32_16x16x32_f16:
4532 case Intrinsic::amdgcn_swmmac_f32_16x16x32_bf16:
4533 case Intrinsic::amdgcn_swmmac_f16_16x16x32_f16:
4534 case Intrinsic::amdgcn_swmmac_bf16_16x16x32_bf16:
4535 case Intrinsic::amdgcn_swmmac_i32_16x16x32_iu8:
4536 case Intrinsic::amdgcn_swmmac_i32_16x16x32_iu4:
4537 case Intrinsic::amdgcn_swmmac_i32_16x16x64_iu4:
4538 case Intrinsic::amdgcn_swmmac_f32_16x16x32_fp8_fp8:
4539 case Intrinsic::amdgcn_swmmac_f32_16x16x32_fp8_bf8:
4540 case Intrinsic::amdgcn_swmmac_f32_16x16x32_bf8_fp8:
4541 case Intrinsic::amdgcn_swmmac_f32_16x16x32_bf8_bf8:
4542 return getDefaultMappingVOP(MI);
4543 case Intrinsic::amdgcn_log:
4544 case Intrinsic::amdgcn_exp2:
4545 case Intrinsic::amdgcn_rcp:
4546 case Intrinsic::amdgcn_rsq:
4547 case Intrinsic::amdgcn_sqrt: {
4548 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4549 if (Subtarget.hasPseudoScalarTrans() && (Size == 16 || Size == 32) &&
4550 isSALUMapping(MI))
4551 return getDefaultMappingSOP(MI);
4552 return getDefaultMappingVOP(MI);
4553 }
4554 case Intrinsic::amdgcn_sbfe:
4555 case Intrinsic::amdgcn_ubfe:
4556 if (isSALUMapping(MI))
4557 return getDefaultMappingSOP(MI);
4558 return getDefaultMappingVOP(MI);
4559 case Intrinsic::amdgcn_ds_swizzle:
4560 case Intrinsic::amdgcn_ds_permute:
4561 case Intrinsic::amdgcn_ds_bpermute:
4562 case Intrinsic::amdgcn_update_dpp:
4563 case Intrinsic::amdgcn_mov_dpp8:
4564 case Intrinsic::amdgcn_mov_dpp:
4565 case Intrinsic::amdgcn_strict_wwm:
4566 case Intrinsic::amdgcn_wwm:
4567 case Intrinsic::amdgcn_strict_wqm:
4568 case Intrinsic::amdgcn_wqm:
4569 case Intrinsic::amdgcn_softwqm:
4570 case Intrinsic::amdgcn_set_inactive:
4571 case Intrinsic::amdgcn_set_inactive_chain_arg:
4572 case Intrinsic::amdgcn_permlane64:
4573 return getDefaultMappingAllVGPR(MI);
4574 case Intrinsic::amdgcn_cvt_pkrtz:
4575 if (Subtarget.hasSALUFloatInsts() && isSALUMapping(MI))
4576 return getDefaultMappingSOP(MI);
4577 return getDefaultMappingVOP(MI);
4578 case Intrinsic::amdgcn_kernarg_segment_ptr:
4579 case Intrinsic::amdgcn_s_getpc:
4580 case Intrinsic::amdgcn_groupstaticsize:
4581 case Intrinsic::amdgcn_reloc_constant:
4582 case Intrinsic::returnaddress: {
4583 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4584 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
4585 break;
4586 }
4587 case Intrinsic::amdgcn_wqm_vote: {
4588 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4589 OpdsMapping[0] = OpdsMapping[2]
4590 = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, Size);
4591 break;
4592 }
4593 case Intrinsic::amdgcn_ps_live: {
4594 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
4595 break;
4596 }
4597 case Intrinsic::amdgcn_div_scale: {
4598 unsigned Dst0Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4599 unsigned Dst1Size = MRI.getType(MI.getOperand(1).getReg()).getSizeInBits();
4600 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Dst0Size);
4601 OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, Dst1Size);
4602
4603 unsigned SrcSize = MRI.getType(MI.getOperand(3).getReg()).getSizeInBits();
4604 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, SrcSize);
4605 OpdsMapping[4] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, SrcSize);
4606 break;
4607 }
4608 case Intrinsic::amdgcn_class: {
4609 Register Src0Reg = MI.getOperand(2).getReg();
4610 Register Src1Reg = MI.getOperand(3).getReg();
4611 unsigned Src0Size = MRI.getType(Src0Reg).getSizeInBits();
4612 unsigned Src1Size = MRI.getType(Src1Reg).getSizeInBits();
4613 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4614 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, DstSize);
4615 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Src0Size);
4616 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Src1Size);
4617 break;
4618 }
4619 case Intrinsic::amdgcn_icmp:
4620 case Intrinsic::amdgcn_fcmp: {
4621 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4622 // This is not VCCRegBank because this is not used in boolean contexts.
4623 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, DstSize);
4624 unsigned OpSize = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
4625 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, OpSize);
4626 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, OpSize);
4627 break;
4628 }
4629 case Intrinsic::amdgcn_readlane: {
4630 // This must be an SGPR, but accept a VGPR.
4631 Register IdxReg = MI.getOperand(3).getReg();
4632 unsigned IdxSize = MRI.getType(IdxReg).getSizeInBits();
4633 unsigned IdxBank = getRegBankID(IdxReg, MRI, AMDGPU::SGPRRegBankID);
4634 OpdsMapping[3] = AMDGPU::getValueMapping(IdxBank, IdxSize);
4635 [[fallthrough]];
4636 }
4637 case Intrinsic::amdgcn_readfirstlane: {
4638 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4639 unsigned SrcSize = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
4640 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, DstSize);
4641 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, SrcSize);
4642 break;
4643 }
4644 case Intrinsic::amdgcn_writelane: {
4645 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4646 Register SrcReg = MI.getOperand(2).getReg();
4647 unsigned SrcSize = MRI.getType(SrcReg).getSizeInBits();
4648 unsigned SrcBank = getRegBankID(SrcReg, MRI, AMDGPU::SGPRRegBankID);
4649 Register IdxReg = MI.getOperand(3).getReg();
4650 unsigned IdxSize = MRI.getType(IdxReg).getSizeInBits();
4651 unsigned IdxBank = getRegBankID(IdxReg, MRI, AMDGPU::SGPRRegBankID);
4652 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, DstSize);
4653
4654 // These 2 must be SGPRs, but accept VGPRs. Readfirstlane will be inserted
4655 // to legalize.
4656 OpdsMapping[2] = AMDGPU::getValueMapping(SrcBank, SrcSize);
4657 OpdsMapping[3] = AMDGPU::getValueMapping(IdxBank, IdxSize);
4658 OpdsMapping[4] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, SrcSize);
4659 break;
4660 }
4661 case Intrinsic::amdgcn_if_break: {
4662 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
4663 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
4664 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
4665 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
4666 break;
4667 }
4668 case Intrinsic::amdgcn_permlane16:
4669 case Intrinsic::amdgcn_permlanex16: {
4670 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
4671 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4672 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4673 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4674 OpdsMapping[4] = getSGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
4675 OpdsMapping[5] = getSGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
4676 break;
4677 }
4678 case Intrinsic::amdgcn_permlane16_var:
4679 case Intrinsic::amdgcn_permlanex16_var: {
4680 unsigned Size = getSizeInBits(MI.getOperand(0).getReg(), MRI, *TRI);
4681 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4682 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4683 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4684 OpdsMapping[4] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4685 break;
4686 }
4687 case Intrinsic::amdgcn_mfma_f32_4x4x1f32:
4688 case Intrinsic::amdgcn_mfma_f32_4x4x4f16:
4689 case Intrinsic::amdgcn_mfma_i32_4x4x4i8:
4690 case Intrinsic::amdgcn_mfma_f32_4x4x2bf16:
4691 case Intrinsic::amdgcn_mfma_f32_16x16x1f32:
4692 case Intrinsic::amdgcn_mfma_f32_16x16x4f32:
4693 case Intrinsic::amdgcn_mfma_f32_16x16x4f16:
4694 case Intrinsic::amdgcn_mfma_f32_16x16x16f16:
4695 case Intrinsic::amdgcn_mfma_i32_16x16x4i8:
4696 case Intrinsic::amdgcn_mfma_i32_16x16x16i8:
4697 case Intrinsic::amdgcn_mfma_f32_16x16x2bf16:
4698 case Intrinsic::amdgcn_mfma_f32_16x16x8bf16:
4699 case Intrinsic::amdgcn_mfma_f32_32x32x1f32:
4700 case Intrinsic::amdgcn_mfma_f32_32x32x2f32:
4701 case Intrinsic::amdgcn_mfma_f32_32x32x4f16:
4702 case Intrinsic::amdgcn_mfma_f32_32x32x8f16:
4703 case Intrinsic::amdgcn_mfma_i32_32x32x4i8:
4704 case Intrinsic::amdgcn_mfma_i32_32x32x8i8:
4705 case Intrinsic::amdgcn_mfma_f32_32x32x2bf16:
4706 case Intrinsic::amdgcn_mfma_f32_32x32x4bf16:
4707 case Intrinsic::amdgcn_mfma_f32_32x32x4bf16_1k:
4708 case Intrinsic::amdgcn_mfma_f32_16x16x4bf16_1k:
4709 case Intrinsic::amdgcn_mfma_f32_4x4x4bf16_1k:
4710 case Intrinsic::amdgcn_mfma_f32_32x32x8bf16_1k:
4711 case Intrinsic::amdgcn_mfma_f32_16x16x16bf16_1k:
4712 case Intrinsic::amdgcn_mfma_f64_16x16x4f64:
4713 case Intrinsic::amdgcn_mfma_f64_4x4x4f64:
4714 case Intrinsic::amdgcn_mfma_i32_16x16x32_i8:
4715 case Intrinsic::amdgcn_mfma_i32_32x32x16_i8:
4716 case Intrinsic::amdgcn_mfma_f32_16x16x8_xf32:
4717 case Intrinsic::amdgcn_mfma_f32_32x32x4_xf32:
4718 case Intrinsic::amdgcn_mfma_f32_16x16x32_bf8_bf8:
4719 case Intrinsic::amdgcn_mfma_f32_16x16x32_bf8_fp8:
4720 case Intrinsic::amdgcn_mfma_f32_16x16x32_fp8_bf8:
4721 case Intrinsic::amdgcn_mfma_f32_16x16x32_fp8_fp8:
4722 case Intrinsic::amdgcn_mfma_f32_32x32x16_bf8_bf8:
4723 case Intrinsic::amdgcn_mfma_f32_32x32x16_bf8_fp8:
4724 case Intrinsic::amdgcn_mfma_f32_32x32x16_fp8_bf8:
4725 case Intrinsic::amdgcn_mfma_f32_32x32x16_fp8_fp8: {
4726 // Default for MAI intrinsics.
4727 // srcC can also be an immediate which can be folded later.
4728 // FIXME: Should we eventually add an alternative mapping with AGPR src
4729 // for srcA/srcB?
4730 //
4731 // vdst, srcA, srcB, srcC
4732 const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
4733 OpdsMapping[0] =
4734 Info->mayNeedAGPRs()
4735 ? getAGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI)
4736 : getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
4737 OpdsMapping[2] = getVGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
4738 OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
4739 OpdsMapping[4] =
4740 Info->mayNeedAGPRs()
4741 ? getAGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI)
4742 : getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
4743 break;
4744 }
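// A hedged illustration of the mapping chosen above (MIR sketch; the virtual
// register names are invented and the flag operands are shown as immediates):
//   %d:agpr(<32 x s32>) = G_INTRINSIC intrinsic(@llvm.amdgcn.mfma.f32.32x32x1f32),
//       %a:vgpr(s32), %b:vgpr(s32), %c:agpr(<32 x s32>), 0, 0, 0
// When mayNeedAGPRs() is false, %d and %c are mapped to VGPRs instead.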
4745 case Intrinsic::amdgcn_smfmac_f32_16x16x32_f16:
4746 case Intrinsic::amdgcn_smfmac_f32_32x32x16_f16:
4747 case Intrinsic::amdgcn_smfmac_f32_16x16x32_bf16:
4748 case Intrinsic::amdgcn_smfmac_f32_32x32x16_bf16:
4749 case Intrinsic::amdgcn_smfmac_i32_16x16x64_i8:
4750 case Intrinsic::amdgcn_smfmac_i32_32x32x32_i8:
4751 case Intrinsic::amdgcn_smfmac_f32_16x16x64_bf8_bf8:
4752 case Intrinsic::amdgcn_smfmac_f32_16x16x64_bf8_fp8:
4753 case Intrinsic::amdgcn_smfmac_f32_16x16x64_fp8_bf8:
4754 case Intrinsic::amdgcn_smfmac_f32_16x16x64_fp8_fp8:
4755 case Intrinsic::amdgcn_smfmac_f32_32x32x32_bf8_bf8:
4756 case Intrinsic::amdgcn_smfmac_f32_32x32x32_bf8_fp8:
4757 case Intrinsic::amdgcn_smfmac_f32_32x32x32_fp8_bf8:
4758 case Intrinsic::amdgcn_smfmac_f32_32x32x32_fp8_fp8: {
4759 // vdst, srcA, srcB, srcC, idx
4760 OpdsMapping[0] = getAGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
4761 OpdsMapping[2] = getVGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
4762 OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
4763 OpdsMapping[4] = getAGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
4764 OpdsMapping[5] = getVGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
4765 break;
4766 }
4767 case Intrinsic::amdgcn_interp_p1:
4768 case Intrinsic::amdgcn_interp_p2:
4769 case Intrinsic::amdgcn_interp_mov:
4770 case Intrinsic::amdgcn_interp_p1_f16:
4771 case Intrinsic::amdgcn_interp_p2_f16:
4772 case Intrinsic::amdgcn_lds_param_load: {
4773 const int M0Idx = MI.getNumOperands() - 1;
4774 Register M0Reg = MI.getOperand(M0Idx).getReg();
4775 unsigned M0Bank = getRegBankID(M0Reg, MRI, AMDGPU::SGPRRegBankID);
4776 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4777
4778 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, DstSize);
4779 for (int I = 2; I != M0Idx && MI.getOperand(I).isReg(); ++I)
4780 OpdsMapping[I] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4781
4782 // Must be SGPR, but we must take whatever the original bank is and fix it
4783 // later.
4784 OpdsMapping[M0Idx] = AMDGPU::getValueMapping(M0Bank, 32);
4785 break;
4786 }
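// Note (hedged): the trailing M0 operand ultimately has to live in an SGPR;
// reporting whatever bank it currently has keeps this mapping legal, and the
// SGPR requirement is enforced when the mapping is applied (for example by
// copying a VGPR value through a readfirstlane-style constraint).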
4787 case Intrinsic::amdgcn_interp_inreg_p10:
4788 case Intrinsic::amdgcn_interp_inreg_p2:
4789 case Intrinsic::amdgcn_interp_inreg_p10_f16:
4790 case Intrinsic::amdgcn_interp_inreg_p2_f16:
4791 case Intrinsic::amdgcn_interp_p10_rtz_f16:
4792 case Intrinsic::amdgcn_interp_p2_rtz_f16: {
4793 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4794 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, DstSize);
4795 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4796 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4797 OpdsMapping[4] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4798 break;
4799 }
4800 case Intrinsic::amdgcn_ballot: {
4801 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4802 unsigned SrcSize = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
4803 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, DstSize);
4804 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, SrcSize);
4805 break;
4806 }
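// Hedged MIR sketch of the ballot mapping above (wave64, names invented):
//   %mask:sgpr(s64) = G_INTRINSIC intrinsic(@llvm.amdgcn.ballot), %cond:vcc(s1)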
4807 case Intrinsic::amdgcn_inverse_ballot: {
4808 // This must be an SGPR, but accept a VGPR.
4809 Register MaskReg = MI.getOperand(2).getReg();
4810 unsigned MaskSize = MRI.getType(MaskReg).getSizeInBits();
4811 unsigned MaskBank = getRegBankID(MaskReg, MRI, AMDGPU::SGPRRegBankID);
4812 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
4813 OpdsMapping[2] = AMDGPU::getValueMapping(MaskBank, MaskSize);
4814 break;
4815 }
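// Hedged MIR sketch (wave64, names invented); if the mask is only available in
// a VGPR, it still has to be materialized as an SGPR value when the mapping is
// applied:
//   %d:vcc(s1) = G_INTRINSIC intrinsic(@llvm.amdgcn.inverse.ballot), %mask:sgpr(s64)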
4816 case Intrinsic::amdgcn_s_quadmask:
4817 case Intrinsic::amdgcn_s_wqm: {
4818 Register MaskReg = MI.getOperand(2).getReg();
4819 unsigned MaskSize = MRI.getType(MaskReg).getSizeInBits();
4820 unsigned MaskBank = getRegBankID(MaskReg, MRI, AMDGPU::SGPRRegBankID);
4821 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, MaskSize);
4822 OpdsMapping[2] = AMDGPU::getValueMapping(MaskBank, MaskSize);
4823 break;
4824 }
4825 case Intrinsic::amdgcn_wave_reduce_umin:
4826 case Intrinsic::amdgcn_wave_reduce_umax: {
4827 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4828 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, DstSize);
4829 unsigned OpSize = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
4830 auto regBankID =
4831 isSALUMapping(MI) ? AMDGPU::SGPRRegBankID : AMDGPU::VGPRRegBankID;
4832 OpdsMapping[2] = AMDGPU::getValueMapping(regBankID, OpSize);
4833 break;
4834 }
4835 case Intrinsic::amdgcn_s_bitreplicate:
4836 Register MaskReg = MI.getOperand(2).getReg();
4837 unsigned MaskBank = getRegBankID(MaskReg, MRI, AMDGPU::SGPRRegBankID);
4838 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 64);
4839 OpdsMapping[2] = AMDGPU::getValueMapping(MaskBank, 32);
4840 }
4841 break;
4842 }
4843 case AMDGPU::G_AMDGPU_INTRIN_IMAGE_LOAD:
4844 case AMDGPU::G_AMDGPU_INTRIN_IMAGE_LOAD_D16:
4845 case AMDGPU::G_AMDGPU_INTRIN_IMAGE_STORE:
4846 case AMDGPU::G_AMDGPU_INTRIN_IMAGE_STORE_D16: {
4847 auto IntrID = AMDGPU::getIntrinsicID(MI);
4848 const AMDGPU::RsrcIntrinsic *RSrcIntrin = AMDGPU::lookupRsrcIntrinsic(IntrID);
4849 assert(RSrcIntrin && "missing RsrcIntrinsic for image intrinsic");
4850 // Non-images can have complications from operands that allow both SGPR
4851 // and VGPR. For now it's too complicated to figure out the final opcode
4852 // to derive the register bank from the MCInstrDesc.
4853 assert(RSrcIntrin->IsImage);
4854 return getImageMapping(MRI, MI, RSrcIntrin->RsrcArg);
4855 }
4856 case AMDGPU::G_AMDGPU_INTRIN_BVH_INTERSECT_RAY: {
4857 unsigned N = MI.getNumExplicitOperands() - 2;
4858 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 128);
4859 OpdsMapping[N] = getSGPROpMapping(MI.getOperand(N).getReg(), MRI, *TRI);
4860 if (N == 3) {
4861 // Sequential form: all operands combined into VGPR256/VGPR512
4862 unsigned Size = MRI.getType(MI.getOperand(2).getReg()).getSizeInBits();
4863 if (Size > 256)
4864 Size = 512;
4865 OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4866 } else {
4867 // NSA form
4868 for (unsigned I = 2; I < N; ++I) {
4869 unsigned Size = MRI.getType(MI.getOperand(I).getReg()).getSizeInBits();
4870 OpdsMapping[I] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, Size);
4871 }
4872 }
4873 break;
4874 }
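// In other words: the sequential form carries the whole ray payload as one wide
// VGPR tuple (rounded up to 512 bits once it exceeds 256 bits), while the NSA
// form keeps each payload component as its own VGPR operand; the final operand
// is the SGPR resource descriptor in both forms.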
4875 case AMDGPU::G_INTRINSIC_W_SIDE_EFFECTS:
4876 case AMDGPU::G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS: {
4877 auto IntrID = cast<GIntrinsic>(MI).getIntrinsicID();
4878 switch (IntrID) {
4879 case Intrinsic::amdgcn_s_getreg:
4880 case Intrinsic::amdgcn_s_memtime:
4881 case Intrinsic::amdgcn_s_memrealtime:
4882 case Intrinsic::amdgcn_s_get_waveid_in_workgroup:
4883 case Intrinsic::amdgcn_s_sendmsg_rtn: {
4884 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4885 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
4886 break;
4887 }
4888 case Intrinsic::amdgcn_global_atomic_fadd:
4889 case Intrinsic::amdgcn_global_atomic_csub:
4890 case Intrinsic::amdgcn_global_atomic_fmin:
4891 case Intrinsic::amdgcn_global_atomic_fmax:
4892 case Intrinsic::amdgcn_global_atomic_fmin_num:
4893 case Intrinsic::amdgcn_global_atomic_fmax_num:
4894 case Intrinsic::amdgcn_flat_atomic_fadd:
4895 case Intrinsic::amdgcn_flat_atomic_fmin:
4896 case Intrinsic::amdgcn_flat_atomic_fmax:
4897 case Intrinsic::amdgcn_flat_atomic_fmin_num:
4898 case Intrinsic::amdgcn_flat_atomic_fmax_num:
4899 case Intrinsic::amdgcn_global_atomic_fadd_v2bf16:
4900 case Intrinsic::amdgcn_flat_atomic_fadd_v2bf16:
4901 case Intrinsic::amdgcn_atomic_cond_sub_u32:
4902 case Intrinsic::amdgcn_global_atomic_ordered_add_b64:
4903 case Intrinsic::amdgcn_global_load_tr_b64:
4904 case Intrinsic::amdgcn_global_load_tr_b128:
4905 return getDefaultMappingAllVGPR(MI);
4906 case Intrinsic::amdgcn_ds_ordered_add:
4907 case Intrinsic::amdgcn_ds_ordered_swap: {
4908 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4909 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, DstSize);
4910 unsigned M0Bank = getRegBankID(MI.getOperand(2).getReg(), MRI,
4911 AMDGPU::SGPRRegBankID);
4912 OpdsMapping[2] = AMDGPU::getValueMapping(M0Bank, 32);
4913 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4914 break;
4915 }
4916 case Intrinsic::amdgcn_ds_append:
4917 case Intrinsic::amdgcn_ds_consume: {
4918 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
4919 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, DstSize);
4920 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
4921 break;
4922 }
4923 case Intrinsic::amdgcn_exp_compr:
4924 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4925 OpdsMapping[4] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4926 break;
4927 case Intrinsic::amdgcn_exp:
4928 // FIXME: Could we support packed types here?
4929 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4930 OpdsMapping[4] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4931 OpdsMapping[5] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4932 OpdsMapping[6] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4933 break;
4934 case Intrinsic::amdgcn_exp_row:
4935 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4936 OpdsMapping[4] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4937 OpdsMapping[5] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4938 OpdsMapping[6] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
4939 OpdsMapping[8] = getSGPROpMapping(MI.getOperand(8).getReg(), MRI, *TRI);
4940 break;
4941 case Intrinsic::amdgcn_s_sendmsg:
4942 case Intrinsic::amdgcn_s_sendmsghalt: {
4943 // This must be an SGPR, but accept a VGPR.
4944 unsigned Bank = getRegBankID(MI.getOperand(2).getReg(), MRI,
4945 AMDGPU::SGPRRegBankID);
4946 OpdsMapping[2] = AMDGPU::getValueMapping(Bank, 32);
4947 break;
4948 }
4949 case Intrinsic::amdgcn_s_setreg: {
4950 // This must be an SGPR, but accept a VGPR.
4951 unsigned Bank = getRegBankID(MI.getOperand(2).getReg(), MRI,
4952 AMDGPU::SGPRRegBankID);
4953 OpdsMapping[2] = AMDGPU::getValueMapping(Bank, 32);
4954 break;
4955 }
4956 case Intrinsic::amdgcn_s_ttracedata: {
4957 // This must be an SGPR, but accept a VGPR.
4958 unsigned Bank =
4959 getRegBankID(MI.getOperand(1).getReg(), MRI, AMDGPU::SGPRRegBankID);
4960 OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
4961 break;
4962 }
4963 case Intrinsic::amdgcn_end_cf: {
4964 unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
4965 OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
4966 break;
4967 }
4968 case Intrinsic::amdgcn_else: {
4969 unsigned WaveSize = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
4970 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
4971 OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, WaveSize);
4972 OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, WaveSize);
4973 break;
4974 }
4975 case Intrinsic::amdgcn_live_mask: {
4976 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
4977 break;
4978 }
4979 case Intrinsic::amdgcn_wqm_demote:
4980 case Intrinsic::amdgcn_kill: {
4981 OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
4982 break;
4983 }
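// Hedged MIR sketch for the kill case (names invented):
//   G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.kill), %cond:vcc(s1)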
4984 case Intrinsic::amdgcn_raw_buffer_load:
4985 case Intrinsic::amdgcn_raw_ptr_buffer_load:
4986 case Intrinsic::amdgcn_raw_tbuffer_load:
4987 case Intrinsic::amdgcn_raw_ptr_tbuffer_load: {
4988 // FIXME: Should make intrinsic ID the last operand of the instruction,
4989 // then this would be the same as store
4990 OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
4991 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
4992 OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
4993 OpdsMapping[4] = getSGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
4994 break;
4995 }
4996 case Intrinsic::amdgcn_raw_buffer_load_lds:
4997 case Intrinsic::amdgcn_raw_ptr_buffer_load_lds: {
4998 OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
4999 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5000 OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
5001 OpdsMapping[5] = getSGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
5002 break;
5003 }
5004 case Intrinsic::amdgcn_raw_buffer_store:
5005 case Intrinsic::amdgcn_raw_ptr_buffer_store:
5006 case Intrinsic::amdgcn_raw_buffer_store_format:
5007 case Intrinsic::amdgcn_raw_ptr_buffer_store_format:
5008 case Intrinsic::amdgcn_raw_tbuffer_store:
5009 case Intrinsic::amdgcn_raw_ptr_tbuffer_store: {
5010 OpdsMapping[1] = getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
5011 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5012 OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
5013 OpdsMapping[4] = getSGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
5014 break;
5015 }
5016 case Intrinsic::amdgcn_struct_buffer_load:
5017 case Intrinsic::amdgcn_struct_ptr_buffer_load:
5018 case Intrinsic::amdgcn_struct_tbuffer_load:
5019 case Intrinsic::amdgcn_struct_ptr_tbuffer_load: {
5020 OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
5021 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5022 OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
5023 OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
5024 OpdsMapping[5] = getSGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
5025 break;
5026 }
5027 case Intrinsic::amdgcn_struct_buffer_load_lds:
5028 case Intrinsic::amdgcn_struct_ptr_buffer_load_lds: {
5029 OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
5030 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5031 OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
5032 OpdsMapping[5] = getVGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
5033 OpdsMapping[6] = getSGPROpMapping(MI.getOperand(6).getReg(), MRI, *TRI);
5034 break;
5035 }
5036 case Intrinsic::amdgcn_struct_buffer_store:
5037 case Intrinsic::amdgcn_struct_ptr_buffer_store:
5038 case Intrinsic::amdgcn_struct_tbuffer_store:
5039 case Intrinsic::amdgcn_struct_ptr_tbuffer_store: {
5040 OpdsMapping[1] = getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
5041 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5042 OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
5043 OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
5044 OpdsMapping[5] = getSGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
5045 break;
5046 }
5047 case Intrinsic::amdgcn_init_exec_from_input: {
5048 unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
5049 OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
5050 break;
5051 }
5052 case Intrinsic::amdgcn_ds_gws_init:
5053 case Intrinsic::amdgcn_ds_gws_barrier:
5054 case Intrinsic::amdgcn_ds_gws_sema_br: {
5055 OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
5056
5057 // This must be an SGPR, but accept a VGPR.
5058 unsigned Bank = getRegBankID(MI.getOperand(2).getReg(), MRI,
5059 AMDGPU::SGPRRegBankID);
5060 OpdsMapping[2] = AMDGPU::getValueMapping(Bank, 32);
5061 break;
5062 }
5063 case Intrinsic::amdgcn_ds_gws_sema_v:
5064 case Intrinsic::amdgcn_ds_gws_sema_p:
5065 case Intrinsic::amdgcn_ds_gws_sema_release_all: {
5066 // This must be an SGPR, but accept a VGPR.
5067 unsigned Bank = getRegBankID(MI.getOperand(1).getReg(), MRI,
5068 AMDGPU::SGPRRegBankID);
5069 OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
5070 break;
5071 }
5072 case Intrinsic::amdgcn_global_load_lds: {
5073 OpdsMapping[1] = getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
5074 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5075 break;
5076 }
5077 case Intrinsic::amdgcn_lds_direct_load: {
5078 const int M0Idx = MI.getNumOperands() - 1;
5079 Register M0Reg = MI.getOperand(M0Idx).getReg();
5080 unsigned M0Bank = getRegBankID(M0Reg, MRI, AMDGPU::SGPRRegBankID);
5081 unsigned DstSize = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
5082
5083 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, DstSize);
5084 for (int I = 2; I != M0Idx && MI.getOperand(I).isReg(); ++I)
5085 OpdsMapping[I] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
5086
5087 // Must be SGPR, but we must take whatever the original bank is and fix it
5088 // later.
5089 OpdsMapping[M0Idx] = AMDGPU::getValueMapping(M0Bank, 32);
5090 break;
5091 }
5092 case Intrinsic::amdgcn_ds_add_gs_reg_rtn:
5093 case Intrinsic::amdgcn_ds_sub_gs_reg_rtn:
5094 OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
5095 OpdsMapping[2] = getVGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5096 break;
5097 case Intrinsic::amdgcn_ds_bvh_stack_rtn: {
5098 OpdsMapping[0] =
5099 getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI); // %vdst
5100 OpdsMapping[1] =
5101 getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI); // %addr
5102 OpdsMapping[3] =
5103 getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI); // %addr
5104 OpdsMapping[4] =
5105 getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI); // %data0
5106 OpdsMapping[5] =
5107 getVGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI); // %data1
5108 break;
5109 }
5110 case Intrinsic::amdgcn_s_sleep_var:
5111 OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
5112 break;
5113 case Intrinsic::amdgcn_s_barrier_signal_var:
5114 case Intrinsic::amdgcn_s_barrier_join:
5115 case Intrinsic::amdgcn_s_wakeup_barrier:
5116 OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
5117 break;
5118 case Intrinsic::amdgcn_s_barrier_init:
5119 OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
5120 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5121 break;
5122 case Intrinsic::amdgcn_s_barrier_signal_isfirst_var: {
5123 const unsigned ResultSize = 1;
5124 OpdsMapping[0] =
5125 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, ResultSize);
5126 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5127 break;
5128 }
5129 case Intrinsic::amdgcn_s_barrier_signal_isfirst:
5130 case Intrinsic::amdgcn_s_barrier_leave: {
5131 const unsigned ResultSize = 1;
5132 OpdsMapping[0] =
5133 AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, ResultSize);
5134 break;
5135 }
5136 case Intrinsic::amdgcn_s_get_barrier_state: {
5137 OpdsMapping[0] = getSGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
5138 OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5139 break;
5140 }
5141 case Intrinsic::amdgcn_pops_exiting_wave_id:
5142 return getDefaultMappingSOP(MI);
5143 default:
5144 return getInvalidInstructionMapping();
5145 }
5146 break;
5147 }
5148 case AMDGPU::G_SELECT: {
5149 unsigned Size = MRI.getType(MI.getOperand(0).getReg()).getSizeInBits();
5150 unsigned Op2Bank = getRegBankID(MI.getOperand(2).getReg(), MRI,
5151 AMDGPU::SGPRRegBankID);
5152 unsigned Op3Bank = getRegBankID(MI.getOperand(3).getReg(), MRI,
5153 AMDGPU::SGPRRegBankID);
5154 bool SGPRSrcs = Op2Bank == AMDGPU::SGPRRegBankID &&
5155 Op3Bank == AMDGPU::SGPRRegBankID;
5156
5157 unsigned CondBankDefault = SGPRSrcs ?
5158 AMDGPU::SGPRRegBankID : AMDGPU::VCCRegBankID;
5159 unsigned CondBank = getRegBankID(MI.getOperand(1).getReg(), MRI,
5160 CondBankDefault);
5161 if (CondBank == AMDGPU::SGPRRegBankID)
5162 CondBank = SGPRSrcs ? AMDGPU::SGPRRegBankID : AMDGPU::VCCRegBankID;
5163 else if (CondBank == AMDGPU::VGPRRegBankID)
5164 CondBank = AMDGPU::VCCRegBankID;
5165
5166 unsigned Bank = SGPRSrcs && CondBank == AMDGPU::SGPRRegBankID ?
5167 AMDGPU::SGPRRegBankID : AMDGPU::VGPRRegBankID;
5168
5169 assert(CondBank == AMDGPU::VCCRegBankID || CondBank == AMDGPU::SGPRRegBankID);
5170
5171 // TODO: Should report 32-bit for scalar condition type.
5172 if (Size == 64) {
5173 OpdsMapping[0] = AMDGPU::getValueMappingSGPR64Only(Bank, Size);
5174 OpdsMapping[1] = AMDGPU::getValueMapping(CondBank, 1);
5175 OpdsMapping[2] = AMDGPU::getValueMappingSGPR64Only(Bank, Size);
5176 OpdsMapping[3] = AMDGPU::getValueMappingSGPR64Only(Bank, Size);
5177 } else {
5178 OpdsMapping[0] = AMDGPU::getValueMapping(Bank, Size);
5179 OpdsMapping[1] = AMDGPU::getValueMapping(CondBank, 1);
5180 OpdsMapping[2] = AMDGPU::getValueMapping(Bank, Size);
5181 OpdsMapping[3] = AMDGPU::getValueMapping(Bank, Size);
5182 }
5183
5184 break;
5185 }
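// Hedged MIR sketches of the two outcomes above (names invented):
//   divergent: %r:vgpr(s32) = G_SELECT %c:vcc(s1), %a:vgpr(s32), %b:vgpr(s32)
//   uniform:   %r:sgpr(s32) = G_SELECT %c:sgpr(s1), %a:sgpr(s32), %b:sgpr(s32)
// As the TODO notes, the uniform condition is still reported as a 1-bit SGPR
// value here even though it is ultimately handled as a 32-bit scalar.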
5186
5187 case AMDGPU::G_SI_CALL: {
5188 OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 64);
5189 // Lie and claim everything is legal, even though some need to be
5190 // SGPRs. applyMapping will have to deal with it as a waterfall loop.
5191 OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
5192
5193 // Allow anything for implicit arguments
5194 for (unsigned I = 4; I < MI.getNumOperands(); ++I) {
5195 if (MI.getOperand(I).isReg()) {
5196 Register Reg = MI.getOperand(I).getReg();
5197 auto OpBank = getRegBankID(Reg, MRI);
5198 unsigned Size = getSizeInBits(Reg, MRI, *TRI);
5199 OpdsMapping[I] = AMDGPU::getValueMapping(OpBank, Size);
5200 }
5201 }
5202 break;
5203 }
5204 case AMDGPU::G_LOAD:
5205 case AMDGPU::G_ZEXTLOAD:
5206 case AMDGPU::G_SEXTLOAD:
5207 return getInstrMappingForLoad(MI);
5208
5209 case AMDGPU::G_ATOMICRMW_XCHG:
5210 case AMDGPU::G_ATOMICRMW_ADD:
5211 case AMDGPU::G_ATOMICRMW_SUB:
5212 case AMDGPU::G_ATOMICRMW_AND:
5213 case AMDGPU::G_ATOMICRMW_OR:
5214 case AMDGPU::G_ATOMICRMW_XOR:
5215 case AMDGPU::G_ATOMICRMW_MAX:
5216 case AMDGPU::G_ATOMICRMW_MIN:
5217 case AMDGPU::G_ATOMICRMW_UMAX:
5218 case AMDGPU::G_ATOMICRMW_UMIN:
5219 case AMDGPU::G_ATOMICRMW_FADD:
5220 case AMDGPU::G_ATOMICRMW_FMIN:
5221 case AMDGPU::G_ATOMICRMW_FMAX:
5222 case AMDGPU::G_ATOMICRMW_UINC_WRAP:
5223 case AMDGPU::G_ATOMICRMW_UDEC_WRAP:
5224 case AMDGPU::G_AMDGPU_ATOMIC_CMPXCHG: {
5225 OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
5226 OpdsMapping[1] = getValueMappingForPtr(MRI, MI.getOperand(1).getReg());
5227 OpdsMapping[2] = getVGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5228 break;
5229 }
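// Hedged MIR sketch for one of the RMW cases above (global pointer, names
// invented); the pointer mapping comes from getValueMappingForPtr and may stay
// scalar in some global-address configurations, but the data operands are
// always VGPRs:
//   %old:vgpr(s32) = G_ATOMICRMW_ADD %ptr:vgpr(p1), %val:vgpr(s32)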
5230 case AMDGPU::G_ATOMIC_CMPXCHG: {
5231 OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
5232 OpdsMapping[1] = getValueMappingForPtr(MRI, MI.getOperand(1).getReg());
5233 OpdsMapping[2] = getVGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
5234 OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
5235 break;
5236 }
5237 case AMDGPU::G_BRCOND: {
5238 unsigned Bank = getRegBankID(MI.getOperand(0).getReg(), MRI,
5239 AMDGPU::SGPRRegBankID);
5240 assert(MRI.getType(MI.getOperand(0).getReg()).getSizeInBits() == 1);
5241 if (Bank != AMDGPU::SGPRRegBankID)
5242 Bank = AMDGPU::VCCRegBankID;
5243
5244 OpdsMapping[0] = AMDGPU::getValueMapping(Bank, 1);
5245 break;
5246 }
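// Hedged MIR sketches for the branch condition (names invented):
//   uniform:   G_BRCOND %c:sgpr(s1), %bb.1
//   divergent: G_BRCOND %c:vcc(s1), %bb.1
// Any non-SGPR condition is forced into the VCC bank above.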
5247 case AMDGPU::G_FPTRUNC_ROUND_UPWARD:
5248 case AMDGPU::G_FPTRUNC_ROUND_DOWNWARD:
5249 return getDefaultMappingVOP(MI);
5250 case AMDGPU::G_PREFETCH:
5251 OpdsMapping[0] = getSGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
5252 break;
5253 }
5254
5255 return getInstructionMapping(/*ID*/1, /*Cost*/1,
5256 getOperandsMapping(OpdsMapping),
5257 MI.getNumOperands());
5258}