LLVM 23.0.0git
AMDGPULowerModuleLDSPass.cpp
Go to the documentation of this file.
1//===-- AMDGPULowerModuleLDSPass.cpp ------------------------------*- C++ -*-=//
2//
3// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
4// See https://llvm.org/LICENSE.txt for license information.
5// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
6//
7//===----------------------------------------------------------------------===//
8//
9// This pass eliminates local data store, LDS, uses from non-kernel functions.
10// LDS is contiguous memory allocated per kernel execution.
11//
12// Background.
13//
14// The programming model is global variables, or equivalently function local
15// static variables, accessible from kernels or other functions. For uses from
16// kernels this is straightforward - assign an integer to the kernel for the
17// memory required by all the variables combined, allocate them within that.
18// For uses from functions there are performance tradeoffs to choose between.
19//
20// This model means the GPU runtime can specify the amount of memory allocated.
21// If this is more than the kernel assumed, the excess can be made available
22// using a language specific feature, which IR represents as a variable with
23// no initializer. This feature is referred to here as "Dynamic LDS" and is
24// lowered slightly differently to the normal case.
25//
26// Consequences of this GPU feature:
27// - memory is limited and exceeding it halts compilation
28// - a global accessed by one kernel exists independent of other kernels
29// - a global exists independent of simultaneous execution of the same kernel
30// - the address of the global may be different from different kernels as they
31// do not alias, which permits only allocating variables they use
32// - if the address is allowed to differ, functions need help to find it
33//
34// Uses from kernels are implemented here by grouping them in a per-kernel
35// struct instance. This duplicates the variables, accurately modelling their
36// aliasing properties relative to a single global representation. It also
37// permits control over alignment via padding.
38//
39// Uses from functions are more complicated and the primary purpose of this
40// IR pass. Several different lowering are chosen between to meet requirements
41// to avoid allocating any LDS where it is not necessary, as that impacts
42// occupancy and may fail the compilation, while not imposing overhead on a
43// feature whose primary advantage over global memory is performance. The basic
44// design goal is to avoid one kernel imposing overhead on another.
45//
46// Implementation.
47//
48// LDS variables with constant annotation or non-undef initializer are passed
49// through unchanged for simplification or error diagnostics in later passes.
50// Non-undef initializers are not yet implemented for LDS.
51//
52// LDS variables that are always allocated at the same address can be found
53// by lookup at that address. Otherwise runtime information/cost is required.
54//
55// The simplest strategy possible is to group all LDS variables in a single
56// struct and allocate that struct in every kernel such that the original
57// variables are always at the same address. LDS is however a limited resource
58// so this strategy is unusable in practice. It is not implemented here.
59//
60// Strategy | Precise allocation | Zero runtime cost | General purpose |
61// --------+--------------------+-------------------+-----------------+
62// Module | No | Yes | Yes |
63// Table | Yes | No | Yes |
64// Kernel | Yes | Yes | No |
65// Hybrid | Yes | Partial | Yes |
66//
67// "Module" spends LDS memory to save cycles. "Table" spends cycles and global
68// memory to save LDS. "Kernel" is as fast as kernel allocation but only works
69// for variables that are known reachable from a single kernel. "Hybrid" picks
70// between all three. When forced to choose between LDS and cycles we minimise
71// LDS use.
72
73// The "module" lowering implemented here finds LDS variables which are used by
74// non-kernel functions and creates a new struct with a field for each of those
75// LDS variables. Variables that are only used from kernels are excluded.
76//
77// The "table" lowering implemented here has three components.
78// First kernels are assigned a unique integer identifier which is available in
79// functions it calls through the intrinsic amdgcn_lds_kernel_id. The integer
80// is passed through a specific SGPR, thus works with indirect calls.
81// Second, each kernel allocates LDS variables independent of other kernels and
82// writes the addresses it chose for each variable into an array in consistent
83// order. If the kernel does not allocate a given variable, it writes undef to
84// the corresponding array location. These arrays are written to a constant
85// table in the order matching the kernel unique integer identifier.
86// Third, uses from non-kernel functions are replaced with a table lookup using
87// the intrinsic function to find the address of the variable.
88//
89// "Kernel" lowering is only applicable for variables that are unambiguously
90// reachable from exactly one kernel. For those cases, accesses to the variable
91// can be lowered to ConstantExpr address of a struct instance specific to that
92// one kernel. This is zero cost in space and in compute. It will raise a fatal
93// error on any variable that might be reachable from multiple kernels and is
94// thus most easily used as part of the hybrid lowering strategy.
95//
96// Hybrid lowering is a mixture of the above. It uses the zero cost kernel
97// lowering where it can. It lowers the variable accessed by the greatest
98// number of kernels using the module strategy as that is free for the first
99// variable. Any futher variables that can be lowered with the module strategy
100// without incurring LDS memory overhead are. The remaining ones are lowered
101// via table.
102//
103// Consequences
104// - No heuristics or user controlled magic numbers, hybrid is the right choice
105// - Kernels that don't use functions (or have had them all inlined) are not
106// affected by any lowering for kernels that do.
107// - Kernels that don't make indirect function calls are not affected by those
108// that do.
109// - Variables which are used by lots of kernels, e.g. those injected by a
110// language runtime in most kernels, are expected to have no overhead
111// - Implementations that instantiate templates per-kernel where those templates
112// use LDS are expected to hit the "Kernel" lowering strategy
113// - The runtime properties impose a cost in compiler implementation complexity
114//
115// Dynamic LDS implementation
116// Dynamic LDS is lowered similarly to the "table" strategy above and uses the
117// same intrinsic to identify which kernel is at the root of the dynamic call
118// graph. This relies on the specified behaviour that all dynamic LDS variables
119// alias one another, i.e. are at the same address, with respect to a given
120// kernel. Therefore this pass creates new dynamic LDS variables for each kernel
121// that allocates any dynamic LDS and builds a table of addresses out of those.
122// The AMDGPUPromoteAlloca pass skips kernels that use dynamic LDS.
123// The corresponding optimisation for "kernel" lowering where the table lookup
124// is elided is not implemented.
125//
126//
127// Implementation notes / limitations
128// A single LDS global variable represents an instance per kernel that can reach
129// said variables. This pass essentially specialises said variables per kernel.
130// Handling ConstantExpr during the pass complicated this significantly so now
131// all ConstantExpr uses of LDS variables are expanded to instructions. This
132// may need amending when implementing non-undef initialisers.
133//
134// Lowering is split between this IR pass and the back end. This pass chooses
135// where given variables should be allocated and marks them with metadata,
136// MD_absolute_symbol. The backend places the variables in coincidentally the
137// same location and raises a fatal error if something has gone awry. This works
138// in practice because the only pass between this one and the backend that
139// changes LDS is PromoteAlloca and the changes it makes do not conflict.
140//
141// Addresses are written to constant global arrays based on the same metadata.
142//
143// The backend lowers LDS variables in the order of traversal of the function.
144// This is at odds with the deterministic layout required. The workaround is to
145// allocate the fixed-address variables immediately upon starting the function
146// where they can be placed as intended. This requires a means of mapping from
147// the function to the variables that it allocates. For the module scope lds,
148// this is via metadata indicating whether the variable is not required. If a
149// pass deletes that metadata, a fatal error on disagreement with the absolute
150// symbol metadata will occur. For kernel scope and dynamic, this is by _name_
151// correspondence between the function and the variable. It requires the
152// kernel to have a name (which is only a limitation for tests in practice) and
153// for nothing to rename the corresponding symbols. This is a hazard if the pass
154// is run multiple times during debugging. Alternative schemes considered all
155// involve bespoke metadata.
156//
157// If the name correspondence can be replaced, multiple distinct kernels that
158// have the same memory layout can map to the same kernel id (as the address
159// itself is handled by the absolute symbol metadata) and that will allow more
160// uses of the "kernel" style faster lowering and reduce the size of the lookup
161// tables.
162//
163// There is a test that checks this does not fire for a graphics shader. This
164// lowering is expected to work for graphics if the isKernel test is changed.
165//
166// The current markUsedByKernel is sufficient for PromoteAlloca but is elided
167// before codegen. Replacing this with an equivalent intrinsic which lasts until
168// shortly after the machine function lowering of LDS would help break the name
169// mapping. The other part needed is probably to amend PromoteAlloca to embed
170// the LDS variables it creates in the same struct created here. That avoids the
171// current hazard where a PromoteAlloca LDS variable might be allocated before
172// the kernel scope (and thus error on the address check). Given a new invariant
173// that no LDS variables exist outside of the structs managed here, and an
174// intrinsic that lasts until after the LDS frame lowering, it should be
175// possible to drop the name mapping and fold equivalent memory layouts.
176//
177//===----------------------------------------------------------------------===//
178
179#include "AMDGPU.h"
180#include "AMDGPUMemoryUtils.h"
181#include "AMDGPUTargetMachine.h"
182#include "Utils/AMDGPUBaseInfo.h"
183#include "llvm/ADT/BitVector.h"
184#include "llvm/ADT/DenseMap.h"
185#include "llvm/ADT/DenseSet.h"
186#include "llvm/ADT/STLExtras.h"
191#include "llvm/IR/Constants.h"
192#include "llvm/IR/DerivedTypes.h"
193#include "llvm/IR/Dominators.h"
194#include "llvm/IR/IRBuilder.h"
195#include "llvm/IR/InlineAsm.h"
196#include "llvm/IR/Instructions.h"
197#include "llvm/IR/IntrinsicsAMDGPU.h"
198#include "llvm/IR/MDBuilder.h"
201#include "llvm/Pass.h"
203#include "llvm/Support/Debug.h"
204#include "llvm/Support/Format.h"
209
210#include <vector>
211
212#include <cstdio>
213
214#define DEBUG_TYPE "amdgpu-lower-module-lds"
215
216using namespace llvm;
217using namespace AMDGPU;
218
219namespace {
220
221cl::opt<bool> SuperAlignLDSGlobals(
222 "amdgpu-super-align-lds-globals",
223 cl::desc("Increase alignment of LDS if it is not on align boundary"),
224 cl::init(true), cl::Hidden);
225
226enum class LoweringKind { module, table, kernel, hybrid };
227cl::opt<LoweringKind> LoweringKindLoc(
228 "amdgpu-lower-module-lds-strategy",
229 cl::desc("Specify lowering strategy for function LDS access:"), cl::Hidden,
230 cl::init(LoweringKind::hybrid),
232 clEnumValN(LoweringKind::table, "table", "Lower via table lookup"),
233 clEnumValN(LoweringKind::module, "module", "Lower via module struct"),
235 LoweringKind::kernel, "kernel",
236 "Lower variables reachable from one kernel, otherwise abort"),
237 clEnumValN(LoweringKind::hybrid, "hybrid",
238 "Lower via mixture of above strategies")));
239
240template <typename T> std::vector<T> sortByName(std::vector<T> &&V) {
241 llvm::sort(V, [](const auto *L, const auto *R) {
242 return L->getName() < R->getName();
243 });
244 return {std::move(V)};
245}
246
247class AMDGPULowerModuleLDS {
248 const AMDGPUTargetMachine &TM;
249
250 static void
251 removeLocalVarsFromUsedLists(Module &M,
252 const DenseSet<GlobalVariable *> &LocalVars) {
253 // The verifier rejects used lists containing an inttoptr of a constant
254 // so remove the variables from these lists before replaceAllUsesWith
255 SmallPtrSet<Constant *, 8> LocalVarsSet;
256 for (GlobalVariable *LocalVar : LocalVars)
257 LocalVarsSet.insert(cast<Constant>(LocalVar->stripPointerCasts()));
258
260 M, [&LocalVarsSet](Constant *C) { return LocalVarsSet.count(C); });
261
262 for (GlobalVariable *LocalVar : LocalVars)
263 LocalVar->removeDeadConstantUsers();
264 }
265
266 static void markUsedByKernel(Function *Func, GlobalVariable *SGV) {
267 // The llvm.amdgcn.module.lds instance is implicitly used by all kernels
268 // that might call a function which accesses a field within it. This is
269 // presently approximated to 'all kernels' if there are any such functions
270 // in the module. This implicit use is redefined as an explicit use here so
271 // that later passes, specifically PromoteAlloca, account for the required
272 // memory without any knowledge of this transform.
273
274 // An operand bundle on llvm.donothing works because the call instruction
275 // survives until after the last pass that needs to account for LDS. It is
276 // better than inline asm as the latter survives until the end of codegen. A
277 // totally robust solution would be a function with the same semantics as
278 // llvm.donothing that takes a pointer to the instance and is lowered to a
279 // no-op after LDS is allocated, but that is not presently necessary.
280
281 // This intrinsic is eliminated shortly before instruction selection. It
282 // does not suffice to indicate to ISel that a given global which is not
283 // immediately used by the kernel must still be allocated by it. An
284 // equivalent target specific intrinsic which lasts until immediately after
285 // codegen would suffice for that, but one would still need to ensure that
286 // the variables are allocated in the anticipated order.
287 BasicBlock *Entry = &Func->getEntryBlock();
288 IRBuilder<> Builder(Entry, Entry->getFirstNonPHIIt());
289
291 Func->getParent(), Intrinsic::donothing, {});
292
293 Value *UseInstance[1] = {
294 Builder.CreateConstInBoundsGEP1_32(SGV->getValueType(), SGV, 0)};
295
296 Builder.CreateCall(
297 Decl, {}, {OperandBundleDefT<Value *>("ExplicitUse", UseInstance)});
298 }
299
300public:
301 AMDGPULowerModuleLDS(const AMDGPUTargetMachine &TM_) : TM(TM_) {}
302
303 struct LDSVariableReplacement {
304 GlobalVariable *SGV = nullptr;
305 DenseMap<GlobalVariable *, Constant *> LDSVarsToConstantGEP;
306 };
307
308 // remap from lds global to a constantexpr gep to where it has been moved to
309 // for each kernel
310 // an array with an element for each kernel containing where the corresponding
311 // variable was remapped to
312
313 static Constant *getAddressesOfVariablesInKernel(
315 const DenseMap<GlobalVariable *, Constant *> &LDSVarsToConstantGEP) {
316 // Create a ConstantArray containing the address of each Variable within the
317 // kernel corresponding to LDSVarsToConstantGEP, or poison if that kernel
318 // does not allocate it
319
321 ArrayType *KernelOffsetsType = ArrayType::get(LocalPtrTy, Variables.size());
322
324 for (GlobalVariable *GV : Variables) {
325 auto ConstantGepIt = LDSVarsToConstantGEP.find(GV);
326 if (ConstantGepIt != LDSVarsToConstantGEP.end()) {
327 Elements.push_back(ConstantGepIt->second);
328 } else {
329 Elements.push_back(PoisonValue::get(LocalPtrTy));
330 }
331 }
332 return ConstantArray::get(KernelOffsetsType, Elements);
333 }
334
335 static GlobalVariable *buildLookupTable(
337 ArrayRef<Function *> kernels,
339 if (Variables.empty()) {
340 return nullptr;
341 }
342 LLVMContext &Ctx = M.getContext();
343
344 const size_t NumberVariables = Variables.size();
345 const size_t NumberKernels = kernels.size();
346
348 ArrayType *KernelOffsetsType = ArrayType::get(LocalPtrTy, NumberVariables);
349
350 ArrayType *AllKernelsOffsetsType =
351 ArrayType::get(KernelOffsetsType, NumberKernels);
352
353 Constant *Missing = PoisonValue::get(KernelOffsetsType);
354 std::vector<Constant *> overallConstantExprElts(NumberKernels);
355 for (size_t i = 0; i < NumberKernels; i++) {
356 auto Replacement = KernelToReplacement.find(kernels[i]);
357 overallConstantExprElts[i] =
358 (Replacement == KernelToReplacement.end())
359 ? Missing
360 : getAddressesOfVariablesInKernel(
361 Ctx, Variables, Replacement->second.LDSVarsToConstantGEP);
362 }
363
364 Constant *init =
365 ConstantArray::get(AllKernelsOffsetsType, overallConstantExprElts);
366
367 return new GlobalVariable(
368 M, AllKernelsOffsetsType, true, GlobalValue::InternalLinkage, init,
369 "llvm.amdgcn.lds.offset.table", nullptr, GlobalValue::NotThreadLocal,
371 }
372
373 void replaceUseWithTableLookup(Module &M, IRBuilder<> &Builder,
374 GlobalVariable *LookupTable,
375 GlobalVariable *GV, Use &U,
376 Value *OptionalIndex) {
377 // Table is a constant array of the same length as OrderedKernels
378 LLVMContext &Ctx = M.getContext();
379 Type *I32 = Type::getInt32Ty(Ctx);
380 auto *I = cast<Instruction>(U.getUser());
381
382 Value *tableKernelIndex = getTableLookupKernelIndex(M, I->getFunction());
383
384 if (auto *Phi = dyn_cast<PHINode>(I)) {
385 BasicBlock *BB = Phi->getIncomingBlock(U);
386 Builder.SetInsertPoint(&(*(BB->getFirstInsertionPt())));
387 } else {
388 Builder.SetInsertPoint(I);
389 }
390
391 SmallVector<Value *, 3> GEPIdx = {
392 ConstantInt::get(I32, 0),
393 tableKernelIndex,
394 };
395 if (OptionalIndex)
396 GEPIdx.push_back(OptionalIndex);
397
398 Value *Address = Builder.CreateInBoundsGEP(
399 LookupTable->getValueType(), LookupTable, GEPIdx, GV->getName());
400
401 Value *Loaded = Builder.CreateLoad(GV->getType(), Address);
402 U.set(Loaded);
403 }
404
405 void replaceUsesInInstructionsWithTableLookup(
406 Module &M, ArrayRef<GlobalVariable *> ModuleScopeVariables,
407 GlobalVariable *LookupTable) {
408
409 LLVMContext &Ctx = M.getContext();
410 IRBuilder<> Builder(Ctx);
411 Type *I32 = Type::getInt32Ty(Ctx);
412
413 for (size_t Index = 0; Index < ModuleScopeVariables.size(); Index++) {
414 auto *GV = ModuleScopeVariables[Index];
415
416 for (Use &U : make_early_inc_range(GV->uses())) {
417 auto *I = dyn_cast<Instruction>(U.getUser());
418 if (!I)
419 continue;
420
421 replaceUseWithTableLookup(M, Builder, LookupTable, GV, U,
422 ConstantInt::get(I32, Index));
423 }
424 }
425 }
426
427 static DenseSet<Function *> kernelsThatIndirectlyAccessAnyOfPassedVariables(
428 Module &M, LDSUsesInfoTy &LDSUsesInfo,
429 DenseSet<GlobalVariable *> const &VariableSet) {
430
431 DenseSet<Function *> KernelSet;
432
433 if (VariableSet.empty())
434 return KernelSet;
435
436 for (Function &Func : M.functions()) {
437 if (Func.isDeclaration() || !isKernel(Func))
438 continue;
439 for (GlobalVariable *GV : LDSUsesInfo.indirect_access[&Func]) {
440 if (VariableSet.contains(GV)) {
441 KernelSet.insert(&Func);
442 break;
443 }
444 }
445 }
446
447 return KernelSet;
448 }
449
450 static GlobalVariable *
451 chooseBestVariableForModuleStrategy(const DataLayout &DL,
452 VariableFunctionMap &LDSVars) {
453 // Find the global variable with the most indirect uses from kernels
454
455 struct CandidateTy {
456 GlobalVariable *GV = nullptr;
457 size_t UserCount = 0;
458 size_t Size = 0;
459
460 CandidateTy() = default;
461
462 CandidateTy(GlobalVariable *GV, uint64_t UserCount, uint64_t AllocSize)
463 : GV(GV), UserCount(UserCount), Size(AllocSize) {}
464
465 bool operator<(const CandidateTy &Other) const {
466 // Fewer users makes module scope variable less attractive
467 if (UserCount < Other.UserCount) {
468 return true;
469 }
470 if (UserCount > Other.UserCount) {
471 return false;
472 }
473
474 // Bigger makes module scope variable less attractive
475 if (Size < Other.Size) {
476 return false;
477 }
478
479 if (Size > Other.Size) {
480 return true;
481 }
482
483 // Arbitrary but consistent
484 return GV->getName() < Other.GV->getName();
485 }
486 };
487
488 CandidateTy MostUsed;
489
490 for (auto &K : LDSVars) {
491 GlobalVariable *GV = K.first;
492 if (K.second.size() <= 1) {
493 // A variable reachable by only one kernel is best lowered with kernel
494 // strategy
495 continue;
496 }
497 CandidateTy Candidate(GV, K.second.size(), GV->getGlobalSize(DL));
498 if (MostUsed < Candidate)
499 MostUsed = Candidate;
500 }
501
502 return MostUsed.GV;
503 }
504
505 static void recordLDSAbsoluteAddress(Module *M, GlobalVariable *GV,
506 uint32_t Address) {
507 // Write the specified address into metadata where it can be retrieved by
508 // the assembler. Format is a half open range, [Address Address+1)
509 LLVMContext &Ctx = M->getContext();
510 auto *IntTy =
511 M->getDataLayout().getIntPtrType(Ctx, AMDGPUAS::LOCAL_ADDRESS);
512 auto *MinC = ConstantAsMetadata::get(ConstantInt::get(IntTy, Address));
513 auto *MaxC = ConstantAsMetadata::get(ConstantInt::get(IntTy, Address + 1));
514 GV->setMetadata(LLVMContext::MD_absolute_symbol,
515 MDNode::get(Ctx, {MinC, MaxC}));
516 }
517
518 DenseMap<Function *, Value *> tableKernelIndexCache;
519 Value *getTableLookupKernelIndex(Module &M, Function *F) {
520 // Accesses from a function use the amdgcn_lds_kernel_id intrinsic which
521 // lowers to a read from a live in register. Emit it once in the entry
522 // block to spare deduplicating it later.
523 auto [It, Inserted] = tableKernelIndexCache.try_emplace(F);
524 if (Inserted) {
525 auto InsertAt = F->getEntryBlock().getFirstNonPHIOrDbgOrAlloca();
526 IRBuilder<> Builder(&*InsertAt);
527
528 It->second = Builder.CreateIntrinsic(Intrinsic::amdgcn_lds_kernel_id, {});
529 }
530
531 return It->second;
532 }
533
534 static std::vector<Function *> assignLDSKernelIDToEachKernel(
535 Module *M, DenseSet<Function *> const &KernelsThatAllocateTableLDS,
536 DenseSet<Function *> const &KernelsThatIndirectlyAllocateDynamicLDS) {
537 // Associate kernels in the set with an arbitrary but reproducible order and
538 // annotate them with that order in metadata. This metadata is recognised by
539 // the backend and lowered to a SGPR which can be read from using
540 // amdgcn_lds_kernel_id.
541
542 std::vector<Function *> OrderedKernels;
543 if (!KernelsThatAllocateTableLDS.empty() ||
544 !KernelsThatIndirectlyAllocateDynamicLDS.empty()) {
545
546 for (Function &Func : M->functions()) {
547 if (Func.isDeclaration())
548 continue;
549 if (!isKernel(Func))
550 continue;
551
552 if (KernelsThatAllocateTableLDS.contains(&Func) ||
553 KernelsThatIndirectlyAllocateDynamicLDS.contains(&Func)) {
554 assert(Func.hasName()); // else fatal error earlier
555 OrderedKernels.push_back(&Func);
556 }
557 }
558
559 // Put them in an arbitrary but reproducible order
560 OrderedKernels = sortByName(std::move(OrderedKernels));
561
562 // Annotate the kernels with their order in this vector
563 LLVMContext &Ctx = M->getContext();
564 IRBuilder<> Builder(Ctx);
565
566 if (OrderedKernels.size() > UINT32_MAX) {
567 // 32 bit keeps it in one SGPR. > 2**32 kernels won't fit on the GPU
568 reportFatalUsageError("unimplemented LDS lowering for > 2**32 kernels");
569 }
570
571 for (size_t i = 0; i < OrderedKernels.size(); i++) {
572 Metadata *AttrMDArgs[1] = {
573 ConstantAsMetadata::get(Builder.getInt32(i)),
574 };
575 OrderedKernels[i]->setMetadata("llvm.amdgcn.lds.kernel.id",
576 MDNode::get(Ctx, AttrMDArgs));
577 }
578 }
579 return OrderedKernels;
580 }
581
582 static void partitionVariablesIntoIndirectStrategies(
583 Module &M, LDSUsesInfoTy const &LDSUsesInfo,
584 VariableFunctionMap &LDSToKernelsThatNeedToAccessItIndirectly,
585 DenseSet<GlobalVariable *> &ModuleScopeVariables,
586 DenseSet<GlobalVariable *> &TableLookupVariables,
587 DenseSet<GlobalVariable *> &KernelAccessVariables,
588 DenseSet<GlobalVariable *> &DynamicVariables) {
589
590 GlobalVariable *HybridModuleRoot =
591 LoweringKindLoc != LoweringKind::hybrid
592 ? nullptr
593 : chooseBestVariableForModuleStrategy(
594 M.getDataLayout(), LDSToKernelsThatNeedToAccessItIndirectly);
595
596 DenseSet<Function *> const EmptySet;
597 DenseSet<Function *> const &HybridModuleRootKernels =
598 HybridModuleRoot
599 ? LDSToKernelsThatNeedToAccessItIndirectly[HybridModuleRoot]
600 : EmptySet;
601
602 for (auto &K : LDSToKernelsThatNeedToAccessItIndirectly) {
603 // Each iteration of this loop assigns exactly one global variable to
604 // exactly one of the implementation strategies.
605
606 GlobalVariable *GV = K.first;
608 assert(K.second.size() != 0);
609
610 if (AMDGPU::isDynamicLDS(*GV)) {
611 DynamicVariables.insert(GV);
612 continue;
613 }
614
615 switch (LoweringKindLoc) {
616 case LoweringKind::module:
617 ModuleScopeVariables.insert(GV);
618 break;
619
620 case LoweringKind::table:
621 TableLookupVariables.insert(GV);
622 break;
623
624 case LoweringKind::kernel:
625 if (K.second.size() == 1) {
626 KernelAccessVariables.insert(GV);
627 } else {
628 // FIXME: This should use DiagnosticInfo
630 "cannot lower LDS '" + GV->getName() +
631 "' to kernel access as it is reachable from multiple kernels");
632 }
633 break;
634
635 case LoweringKind::hybrid: {
636 if (GV == HybridModuleRoot) {
637 assert(K.second.size() != 1);
638 ModuleScopeVariables.insert(GV);
639 } else if (K.second.size() == 1) {
640 KernelAccessVariables.insert(GV);
641 } else if (K.second == HybridModuleRootKernels) {
642 ModuleScopeVariables.insert(GV);
643 } else {
644 TableLookupVariables.insert(GV);
645 }
646 break;
647 }
648 }
649 }
650
651 // All LDS variables accessed indirectly have now been partitioned into
652 // the distinct lowering strategies.
653 assert(ModuleScopeVariables.size() + TableLookupVariables.size() +
654 KernelAccessVariables.size() + DynamicVariables.size() ==
655 LDSToKernelsThatNeedToAccessItIndirectly.size());
656 }
657
658 static GlobalVariable *lowerModuleScopeStructVariables(
659 Module &M, DenseSet<GlobalVariable *> const &ModuleScopeVariables,
660 DenseSet<Function *> const &KernelsThatAllocateModuleLDS) {
661 // Create a struct to hold the ModuleScopeVariables
662 // Replace all uses of those variables from non-kernel functions with the
663 // new struct instance Replace only the uses from kernel functions that will
664 // allocate this instance. That is a space optimisation - kernels that use a
665 // subset of the module scope struct and do not need to allocate it for
666 // indirect calls will only allocate the subset they use (they do so as part
667 // of the per-kernel lowering).
668 if (ModuleScopeVariables.empty()) {
669 return nullptr;
670 }
671
672 LLVMContext &Ctx = M.getContext();
673
674 LDSVariableReplacement ModuleScopeReplacement =
675 createLDSVariableReplacement(M, "llvm.amdgcn.module.lds",
676 ModuleScopeVariables);
677
678 appendToCompilerUsed(M, {static_cast<GlobalValue *>(
680 cast<Constant>(ModuleScopeReplacement.SGV),
681 PointerType::getUnqual(Ctx)))});
682
683 // module.lds will be allocated at zero in any kernel that allocates it
684 recordLDSAbsoluteAddress(&M, ModuleScopeReplacement.SGV, 0);
685
686 // historic
687 removeLocalVarsFromUsedLists(M, ModuleScopeVariables);
688
689 // Replace all uses of module scope variable from non-kernel functions
690 replaceLDSVariablesWithStruct(
691 M, ModuleScopeVariables, ModuleScopeReplacement, [&](Use &U) {
692 Instruction *I = dyn_cast<Instruction>(U.getUser());
693 if (!I) {
694 return false;
695 }
696 Function *F = I->getFunction();
697 return !isKernel(*F);
698 });
699
700 // Replace uses of module scope variable from kernel functions that
701 // allocate the module scope variable, otherwise leave them unchanged
702 // Record on each kernel whether the module scope global is used by it
703
704 for (Function &Func : M.functions()) {
705 if (Func.isDeclaration() || !isKernel(Func))
706 continue;
707
708 if (KernelsThatAllocateModuleLDS.contains(&Func)) {
709 replaceLDSVariablesWithStruct(
710 M, ModuleScopeVariables, ModuleScopeReplacement, [&](Use &U) {
711 Instruction *I = dyn_cast<Instruction>(U.getUser());
712 if (!I) {
713 return false;
714 }
715 Function *F = I->getFunction();
716 return F == &Func;
717 });
718
719 markUsedByKernel(&Func, ModuleScopeReplacement.SGV);
720 }
721 }
722
723 return ModuleScopeReplacement.SGV;
724 }
725
727 lowerKernelScopeStructVariables(
728 Module &M, LDSUsesInfoTy &LDSUsesInfo,
729 DenseSet<GlobalVariable *> const &ModuleScopeVariables,
730 DenseSet<Function *> const &KernelsThatAllocateModuleLDS,
731 GlobalVariable *MaybeModuleScopeStruct) {
732
733 // Create a struct for each kernel for the non-module-scope variables.
734
736 for (Function &Func : M.functions()) {
737 if (Func.isDeclaration() || !isKernel(Func))
738 continue;
739
740 DenseSet<GlobalVariable *> KernelUsedVariables;
741 // Allocating variables that are used directly in this struct to get
742 // alignment aware allocation and predictable frame size.
743 for (auto &v : LDSUsesInfo.direct_access[&Func]) {
744 if (!AMDGPU::isDynamicLDS(*v)) {
745 KernelUsedVariables.insert(v);
746 }
747 }
748
749 // Allocating variables that are accessed indirectly so that a lookup of
750 // this struct instance can find them from nested functions.
751 for (auto &v : LDSUsesInfo.indirect_access[&Func]) {
752 if (!AMDGPU::isDynamicLDS(*v)) {
753 KernelUsedVariables.insert(v);
754 }
755 }
756
757 // Variables allocated in module lds must all resolve to that struct,
758 // not to the per-kernel instance.
759 if (KernelsThatAllocateModuleLDS.contains(&Func)) {
760 for (GlobalVariable *v : ModuleScopeVariables) {
761 KernelUsedVariables.erase(v);
762 }
763 }
764
765 if (KernelUsedVariables.empty()) {
766 // Either used no LDS, or the LDS it used was all in the module struct
767 // or dynamically sized
768 continue;
769 }
770
771 // The association between kernel function and LDS struct is done by
772 // symbol name, which only works if the function in question has a
773 // name This is not expected to be a problem in practice as kernels
774 // are called by name making anonymous ones (which are named by the
775 // backend) difficult to use. This does mean that llvm test cases need
776 // to name the kernels.
777 if (!Func.hasName()) {
778 reportFatalUsageError("anonymous kernels cannot use LDS variables");
779 }
780
781 std::string VarName =
782 (Twine("llvm.amdgcn.kernel.") + Func.getName() + ".lds").str();
783
784 auto Replacement =
785 createLDSVariableReplacement(M, VarName, KernelUsedVariables);
786
787 // If any indirect uses, create a direct use to ensure allocation
788 // TODO: Simpler to unconditionally mark used but that regresses
789 // codegen in test/CodeGen/AMDGPU/noclobber-barrier.ll
790 auto Accesses = LDSUsesInfo.indirect_access.find(&Func);
791 if ((Accesses != LDSUsesInfo.indirect_access.end()) &&
792 !Accesses->second.empty())
793 markUsedByKernel(&Func, Replacement.SGV);
794
795 // remove preserves existing codegen
796 removeLocalVarsFromUsedLists(M, KernelUsedVariables);
797 KernelToReplacement[&Func] = Replacement;
798
799 // Rewrite uses within kernel to the new struct
800 replaceLDSVariablesWithStruct(
801 M, KernelUsedVariables, Replacement, [&Func](Use &U) {
802 Instruction *I = dyn_cast<Instruction>(U.getUser());
803 return I && I->getFunction() == &Func;
804 });
805 }
806 return KernelToReplacement;
807 }
808
809 static GlobalVariable *
810 buildRepresentativeDynamicLDSInstance(Module &M, LDSUsesInfoTy &LDSUsesInfo,
811 Function *func) {
812 // Create a dynamic lds variable with a name associated with the passed
813 // function that has the maximum alignment of any dynamic lds variable
814 // reachable from this kernel. Dynamic LDS is allocated after the static LDS
815 // allocation, possibly after alignment padding. The representative variable
816 // created here has the maximum alignment of any other dynamic variable
817 // reachable by that kernel. All dynamic LDS variables are allocated at the
818 // same address in each kernel in order to provide the documented aliasing
819 // semantics. Setting the alignment here allows this IR pass to accurately
820 // predict the exact constant at which it will be allocated.
821
822 assert(isKernel(*func));
823
824 LLVMContext &Ctx = M.getContext();
825 const DataLayout &DL = M.getDataLayout();
826 Align MaxDynamicAlignment(1);
827
828 auto UpdateMaxAlignment = [&MaxDynamicAlignment, &DL](GlobalVariable *GV) {
829 if (AMDGPU::isDynamicLDS(*GV)) {
830 MaxDynamicAlignment =
831 std::max(MaxDynamicAlignment, AMDGPU::getAlign(DL, GV));
832 }
833 };
834
835 for (GlobalVariable *GV : LDSUsesInfo.indirect_access[func]) {
836 UpdateMaxAlignment(GV);
837 }
838
839 for (GlobalVariable *GV : LDSUsesInfo.direct_access[func]) {
840 UpdateMaxAlignment(GV);
841 }
842
843 assert(func->hasName()); // Checked by caller
844 auto *emptyCharArray = ArrayType::get(Type::getInt8Ty(Ctx), 0);
846 M, emptyCharArray, false, GlobalValue::ExternalLinkage, nullptr,
847 Twine("llvm.amdgcn." + func->getName() + ".dynlds"), nullptr, GlobalValue::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS,
848 false);
849 N->setAlignment(MaxDynamicAlignment);
850
852 return N;
853 }
854
855 DenseMap<Function *, GlobalVariable *> lowerDynamicLDSVariables(
856 Module &M, LDSUsesInfoTy &LDSUsesInfo,
857 DenseSet<Function *> const &KernelsThatIndirectlyAllocateDynamicLDS,
858 DenseSet<GlobalVariable *> const &DynamicVariables,
859 std::vector<Function *> const &OrderedKernels) {
860 DenseMap<Function *, GlobalVariable *> KernelToCreatedDynamicLDS;
861 if (!KernelsThatIndirectlyAllocateDynamicLDS.empty()) {
862 LLVMContext &Ctx = M.getContext();
863 IRBuilder<> Builder(Ctx);
865
866 std::vector<Constant *> newDynamicLDS;
867
868 // Table is built in the same order as OrderedKernels
869 for (auto &func : OrderedKernels) {
870
871 if (KernelsThatIndirectlyAllocateDynamicLDS.contains(func)) {
872 assert(isKernel(*func));
873 if (!func->hasName()) {
874 reportFatalUsageError("anonymous kernels cannot use LDS variables");
875 }
876
878 buildRepresentativeDynamicLDSInstance(M, LDSUsesInfo, func);
879
880 KernelToCreatedDynamicLDS[func] = N;
881
882 markUsedByKernel(func, N);
883
884 newDynamicLDS.push_back(N);
885 } else {
886 newDynamicLDS.push_back(PoisonValue::get(LocalPtrTy));
887 }
888 }
889 assert(OrderedKernels.size() == newDynamicLDS.size());
890
891 ArrayType *t = ArrayType::get(LocalPtrTy, newDynamicLDS.size());
892 Constant *init = ConstantArray::get(t, newDynamicLDS);
893 GlobalVariable *table = new GlobalVariable(
894 M, t, true, GlobalValue::InternalLinkage, init,
895 "llvm.amdgcn.dynlds.offset.table", nullptr,
897
898 for (GlobalVariable *GV : DynamicVariables) {
899 for (Use &U : make_early_inc_range(GV->uses())) {
900 auto *I = dyn_cast<Instruction>(U.getUser());
901 if (!I)
902 continue;
903 if (isKernel(*I->getFunction()))
904 continue;
905
906 replaceUseWithTableLookup(M, Builder, table, GV, U, nullptr);
907 }
908 }
909 }
910 return KernelToCreatedDynamicLDS;
911 }
912
913 bool runOnModule(Module &M) {
914 CallGraph CG = CallGraph(M);
915 bool Changed = superAlignLDSGlobals(M);
916
918
919 Changed = true; // todo: narrow this down
920
921 // For each kernel, what variables does it access directly or through
922 // callees
923 LDSUsesInfoTy LDSUsesInfo = getTransitiveUsesOfLDS(CG, M);
924
925 // For each variable accessed through callees, which kernels access it
926 VariableFunctionMap LDSToKernelsThatNeedToAccessItIndirectly;
927 for (auto &K : LDSUsesInfo.indirect_access) {
928 Function *F = K.first;
929 assert(isKernel(*F));
930 for (GlobalVariable *GV : K.second) {
931 LDSToKernelsThatNeedToAccessItIndirectly[GV].insert(F);
932 }
933 }
934
935 // Partition variables accessed indirectly into the different strategies
936 DenseSet<GlobalVariable *> ModuleScopeVariables;
937 DenseSet<GlobalVariable *> TableLookupVariables;
938 DenseSet<GlobalVariable *> KernelAccessVariables;
939 DenseSet<GlobalVariable *> DynamicVariables;
940 partitionVariablesIntoIndirectStrategies(
941 M, LDSUsesInfo, LDSToKernelsThatNeedToAccessItIndirectly,
942 ModuleScopeVariables, TableLookupVariables, KernelAccessVariables,
943 DynamicVariables);
944
945 // If the kernel accesses a variable that is going to be stored in the
946 // module instance through a call then that kernel needs to allocate the
947 // module instance
948 const DenseSet<Function *> KernelsThatAllocateModuleLDS =
949 kernelsThatIndirectlyAccessAnyOfPassedVariables(M, LDSUsesInfo,
950 ModuleScopeVariables);
951 const DenseSet<Function *> KernelsThatAllocateTableLDS =
952 kernelsThatIndirectlyAccessAnyOfPassedVariables(M, LDSUsesInfo,
953 TableLookupVariables);
954
955 const DenseSet<Function *> KernelsThatIndirectlyAllocateDynamicLDS =
956 kernelsThatIndirectlyAccessAnyOfPassedVariables(M, LDSUsesInfo,
957 DynamicVariables);
958
959 GlobalVariable *MaybeModuleScopeStruct = lowerModuleScopeStructVariables(
960 M, ModuleScopeVariables, KernelsThatAllocateModuleLDS);
961
963 lowerKernelScopeStructVariables(M, LDSUsesInfo, ModuleScopeVariables,
964 KernelsThatAllocateModuleLDS,
965 MaybeModuleScopeStruct);
966
967 // Lower zero cost accesses to the kernel instances just created
968 for (auto &GV : KernelAccessVariables) {
969 auto &funcs = LDSToKernelsThatNeedToAccessItIndirectly[GV];
970 assert(funcs.size() == 1); // Only one kernel can access it
971 LDSVariableReplacement Replacement =
972 KernelToReplacement[*(funcs.begin())];
973
975 Vec.insert(GV);
976
977 replaceLDSVariablesWithStruct(M, Vec, Replacement, [](Use &U) {
978 return isa<Instruction>(U.getUser());
979 });
980 }
981
982 // The ith element of this vector is kernel id i
983 std::vector<Function *> OrderedKernels =
984 assignLDSKernelIDToEachKernel(&M, KernelsThatAllocateTableLDS,
985 KernelsThatIndirectlyAllocateDynamicLDS);
986
987 if (!KernelsThatAllocateTableLDS.empty()) {
988 LLVMContext &Ctx = M.getContext();
989 IRBuilder<> Builder(Ctx);
990
991 // The order must be consistent between lookup table and accesses to
992 // lookup table
993 auto TableLookupVariablesOrdered =
994 sortByName(std::vector<GlobalVariable *>(TableLookupVariables.begin(),
995 TableLookupVariables.end()));
996
997 GlobalVariable *LookupTable = buildLookupTable(
998 M, TableLookupVariablesOrdered, OrderedKernels, KernelToReplacement);
999 replaceUsesInInstructionsWithTableLookup(M, TableLookupVariablesOrdered,
1000 LookupTable);
1001 }
1002
1003 DenseMap<Function *, GlobalVariable *> KernelToCreatedDynamicLDS =
1004 lowerDynamicLDSVariables(M, LDSUsesInfo,
1005 KernelsThatIndirectlyAllocateDynamicLDS,
1006 DynamicVariables, OrderedKernels);
1007
1008 // Strip amdgpu-no-lds-kernel-id from all functions reachable from the
1009 // kernel. We may have inferred this wasn't used prior to the pass.
1010 // TODO: We could filter out subgraphs that do not access LDS globals.
1011 for (auto *KernelSet : {&KernelsThatIndirectlyAllocateDynamicLDS,
1012 &KernelsThatAllocateTableLDS})
1013 for (Function *F : *KernelSet)
1014 removeFnAttrFromReachable(CG, F, {"amdgpu-no-lds-kernel-id"});
1015
1016 // All kernel frames have been allocated. Calculate and record the
1017 // addresses.
1018 {
1019 const DataLayout &DL = M.getDataLayout();
1020
1021 for (Function &Func : M.functions()) {
1022 if (Func.isDeclaration() || !isKernel(Func))
1023 continue;
1024
1025 // All three of these are optional. The first variable is allocated at
1026 // zero. They are allocated by AMDGPUMachineFunction as one block.
1027 // Layout:
1028 //{
1029 // module.lds
1030 // alignment padding
1031 // kernel instance
1032 // alignment padding
1033 // dynamic lds variables
1034 //}
1035
1036 const bool AllocateModuleScopeStruct =
1037 MaybeModuleScopeStruct &&
1038 KernelsThatAllocateModuleLDS.contains(&Func);
1039
1040 auto Replacement = KernelToReplacement.find(&Func);
1041 const bool AllocateKernelScopeStruct =
1042 Replacement != KernelToReplacement.end();
1043
1044 const bool AllocateDynamicVariable =
1045 KernelToCreatedDynamicLDS.contains(&Func);
1046
1047 uint32_t Offset = 0;
1048
1049 if (AllocateModuleScopeStruct) {
1050 // Allocated at zero, recorded once on construction, not once per
1051 // kernel
1052 Offset += MaybeModuleScopeStruct->getGlobalSize(DL);
1053 }
1054
1055 if (AllocateKernelScopeStruct) {
1056 GlobalVariable *KernelStruct = Replacement->second.SGV;
1057 Offset = alignTo(Offset, AMDGPU::getAlign(DL, KernelStruct));
1058 recordLDSAbsoluteAddress(&M, KernelStruct, Offset);
1059 Offset += KernelStruct->getGlobalSize(DL);
1060 }
1061
1062 // If there is dynamic allocation, the alignment needed is included in
1063 // the static frame size. There may be no reference to the dynamic
1064 // variable in the kernel itself, so without including it here, that
1065 // alignment padding could be missed.
1066 if (AllocateDynamicVariable) {
1067 GlobalVariable *DynamicVariable = KernelToCreatedDynamicLDS[&Func];
1068 Offset = alignTo(Offset, AMDGPU::getAlign(DL, DynamicVariable));
1069 recordLDSAbsoluteAddress(&M, DynamicVariable, Offset);
1070 }
1071
1072 if (Offset != 0) {
1073 (void)TM; // TODO: Account for target maximum LDS
1074 std::string Buffer;
1075 raw_string_ostream SS{Buffer};
1076 SS << format("%u", Offset);
1077
1078 // Instead of explicitly marking kernels that access dynamic variables
1079 // using special case metadata, annotate with min-lds == max-lds, i.e.
1080 // that there is no more space available for allocating more static
1081 // LDS variables. That is the right condition to prevent allocating
1082 // more variables which would collide with the addresses assigned to
1083 // dynamic variables.
1084 if (AllocateDynamicVariable)
1085 SS << format(",%u", Offset);
1086
1087 Func.addFnAttr("amdgpu-lds-size", Buffer);
1088 }
1089 }
1090 }
1091
1092 for (auto &GV : make_early_inc_range(M.globals()))
1094 // probably want to remove from used lists
1096 if (GV.use_empty())
1097 GV.eraseFromParent();
1098 }
1099
1100 return Changed;
1101 }
1102
1103private:
1104 // Increase the alignment of LDS globals if necessary to maximise the chance
1105 // that we can use aligned LDS instructions to access them.
1106 static bool superAlignLDSGlobals(Module &M) {
1107 const DataLayout &DL = M.getDataLayout();
1108 bool Changed = false;
1109 if (!SuperAlignLDSGlobals) {
1110 return Changed;
1111 }
1112
1113 for (auto &GV : M.globals()) {
1115 // Only changing alignment of LDS variables
1116 continue;
1117 }
1118 if (!GV.hasInitializer()) {
1119 // cuda/hip extern __shared__ variable, leave alignment alone
1120 continue;
1121 }
1122
1123 if (GV.isAbsoluteSymbolRef()) {
1124 // If the variable is already allocated, don't change the alignment
1125 continue;
1126 }
1127
1128 Align Alignment = AMDGPU::getAlign(DL, &GV);
1129 uint64_t GVSize = GV.getGlobalSize(DL);
1130
1131 if (GVSize > 8) {
1132 // We might want to use a b96 or b128 load/store
1133 Alignment = std::max(Alignment, Align(16));
1134 } else if (GVSize > 4) {
1135 // We might want to use a b64 load/store
1136 Alignment = std::max(Alignment, Align(8));
1137 } else if (GVSize > 2) {
1138 // We might want to use a b32 load/store
1139 Alignment = std::max(Alignment, Align(4));
1140 } else if (GVSize > 1) {
1141 // We might want to use a b16 load/store
1142 Alignment = std::max(Alignment, Align(2));
1143 }
1144
1145 if (Alignment != AMDGPU::getAlign(DL, &GV)) {
1146 Changed = true;
1147 GV.setAlignment(Alignment);
1148 }
1149 }
1150 return Changed;
1151 }
1152
1153 static LDSVariableReplacement createLDSVariableReplacement(
1154 Module &M, std::string VarName,
1155 DenseSet<GlobalVariable *> const &LDSVarsToTransform) {
1156 // Create a struct instance containing LDSVarsToTransform and map from those
1157 // variables to ConstantExprGEP
1158 // Variables may be introduced to meet alignment requirements. No aliasing
1159 // metadata is useful for these as they have no uses. Erased before return.
1160
1161 LLVMContext &Ctx = M.getContext();
1162 const DataLayout &DL = M.getDataLayout();
1163 assert(!LDSVarsToTransform.empty());
1164
1166 LayoutFields.reserve(LDSVarsToTransform.size());
1167 {
1168 // The order of fields in this struct depends on the order of
1169 // variables in the argument which varies when changing how they
1170 // are identified, leading to spurious test breakage.
1171 auto Sorted = sortByName(std::vector<GlobalVariable *>(
1172 LDSVarsToTransform.begin(), LDSVarsToTransform.end()));
1173
1174 for (GlobalVariable *GV : Sorted) {
1176 AMDGPU::getAlign(DL, GV));
1177 LayoutFields.emplace_back(F);
1178 }
1179 }
1180
1181 performOptimizedStructLayout(LayoutFields);
1182
1183 std::vector<GlobalVariable *> LocalVars;
1184 BitVector IsPaddingField;
1185 LocalVars.reserve(LDSVarsToTransform.size()); // will be at least this large
1186 IsPaddingField.reserve(LDSVarsToTransform.size());
1187 {
1188 uint64_t CurrentOffset = 0;
1189 for (auto &F : LayoutFields) {
1190 GlobalVariable *FGV =
1191 static_cast<GlobalVariable *>(const_cast<void *>(F.Id));
1192 Align DataAlign = F.Alignment;
1193
1194 uint64_t DataAlignV = DataAlign.value();
1195 if (uint64_t Rem = CurrentOffset % DataAlignV) {
1196 uint64_t Padding = DataAlignV - Rem;
1197
1198 // Append an array of padding bytes to meet alignment requested
1199 // Note (o + (a - (o % a)) ) % a == 0
1200 // (offset + Padding ) % align == 0
1201
1202 Type *ATy = ArrayType::get(Type::getInt8Ty(Ctx), Padding);
1203 LocalVars.push_back(new GlobalVariable(
1204 M, ATy, false, GlobalValue::InternalLinkage,
1206 AMDGPUAS::LOCAL_ADDRESS, false));
1207 IsPaddingField.push_back(true);
1208 CurrentOffset += Padding;
1209 }
1210
1211 LocalVars.push_back(FGV);
1212 IsPaddingField.push_back(false);
1213 CurrentOffset += F.Size;
1214 }
1215 }
1216
1217 std::vector<Type *> LocalVarTypes;
1218 LocalVarTypes.reserve(LocalVars.size());
1219 std::transform(
1220 LocalVars.cbegin(), LocalVars.cend(), std::back_inserter(LocalVarTypes),
1221 [](const GlobalVariable *V) -> Type * { return V->getValueType(); });
1222
1223 StructType *LDSTy = StructType::create(Ctx, LocalVarTypes, VarName + ".t");
1224
1225 Align StructAlign = AMDGPU::getAlign(DL, LocalVars[0]);
1226
1227 GlobalVariable *SGV = new GlobalVariable(
1228 M, LDSTy, false, GlobalValue::InternalLinkage, PoisonValue::get(LDSTy),
1230 false);
1231 SGV->setAlignment(StructAlign);
1232
1234 Type *I32 = Type::getInt32Ty(Ctx);
1235 for (size_t I = 0; I < LocalVars.size(); I++) {
1236 GlobalVariable *GV = LocalVars[I];
1237 Constant *GEPIdx[] = {ConstantInt::get(I32, 0), ConstantInt::get(I32, I)};
1238 Constant *GEP = ConstantExpr::getGetElementPtr(LDSTy, SGV, GEPIdx, true);
1239 if (IsPaddingField[I]) {
1240 assert(GV->use_empty());
1241 GV->eraseFromParent();
1242 } else {
1243 Map[GV] = GEP;
1244 }
1245 }
1246 assert(Map.size() == LDSVarsToTransform.size());
1247 return {SGV, std::move(Map)};
1248 }
1249
1250 template <typename PredicateTy>
1251 static void replaceLDSVariablesWithStruct(
1252 Module &M, DenseSet<GlobalVariable *> const &LDSVarsToTransformArg,
1253 const LDSVariableReplacement &Replacement, PredicateTy Predicate) {
1254 LLVMContext &Ctx = M.getContext();
1255 const DataLayout &DL = M.getDataLayout();
1256
1257 // A hack... we need to insert the aliasing info in a predictable order for
1258 // lit tests. Would like to have them in a stable order already, ideally the
1259 // same order they get allocated, which might mean an ordered set container
1260 auto LDSVarsToTransform = sortByName(std::vector<GlobalVariable *>(
1261 LDSVarsToTransformArg.begin(), LDSVarsToTransformArg.end()));
1262
1263 // Create alias.scope and their lists. Each field in the new structure
1264 // does not alias with all other fields.
1265 SmallVector<MDNode *> AliasScopes;
1266 SmallVector<Metadata *> NoAliasList;
1267 const size_t NumberVars = LDSVarsToTransform.size();
1268 if (NumberVars > 1) {
1269 MDBuilder MDB(Ctx);
1270 AliasScopes.reserve(NumberVars);
1272 for (size_t I = 0; I < NumberVars; I++) {
1274 AliasScopes.push_back(Scope);
1275 }
1276 NoAliasList.append(&AliasScopes[1], AliasScopes.end());
1277 }
1278
1279 // Replace uses of ith variable with a constantexpr to the corresponding
1280 // field of the instance that will be allocated by AMDGPUMachineFunction
1281 for (size_t I = 0; I < NumberVars; I++) {
1282 GlobalVariable *GV = LDSVarsToTransform[I];
1283 Constant *GEP = Replacement.LDSVarsToConstantGEP.at(GV);
1284
1286
1287 APInt APOff(DL.getIndexTypeSizeInBits(GEP->getType()), 0);
1288 GEP->stripAndAccumulateInBoundsConstantOffsets(DL, APOff);
1289 uint64_t Offset = APOff.getZExtValue();
1290
1291 Align A =
1292 commonAlignment(Replacement.SGV->getAlign().valueOrOne(), Offset);
1293
1294 if (I)
1295 NoAliasList[I - 1] = AliasScopes[I - 1];
1296 MDNode *NoAlias =
1297 NoAliasList.empty() ? nullptr : MDNode::get(Ctx, NoAliasList);
1298 MDNode *AliasScope =
1299 AliasScopes.empty() ? nullptr : MDNode::get(Ctx, {AliasScopes[I]});
1300
1301 refineUsesAlignmentAndAA(GEP, A, DL, AliasScope, NoAlias);
1302 }
1303 }
1304
1305 static void refineUsesAlignmentAndAA(Value *Ptr, Align A,
1306 const DataLayout &DL, MDNode *AliasScope,
1307 MDNode *NoAlias, unsigned MaxDepth = 5) {
1308 if (!MaxDepth || (A == 1 && !AliasScope))
1309 return;
1310
1311 ScopedNoAliasAAResult ScopedNoAlias;
1312
1313 for (User *U : Ptr->users()) {
1314 if (auto *I = dyn_cast<Instruction>(U)) {
1315 if (AliasScope && I->mayReadOrWriteMemory()) {
1316 MDNode *AS = I->getMetadata(LLVMContext::MD_alias_scope);
1317 AS = (AS ? MDNode::getMostGenericAliasScope(AS, AliasScope)
1318 : AliasScope);
1319 I->setMetadata(LLVMContext::MD_alias_scope, AS);
1320
1321 MDNode *NA = I->getMetadata(LLVMContext::MD_noalias);
1322
1323 // Scoped aliases can originate from two different domains.
1324 // First domain would be from LDS domain (created by this pass).
1325 // All entries (LDS vars) into LDS struct will have same domain.
1326
1327 // Second domain could be existing scoped aliases that are the
1328 // results of noalias params and subsequent optimizations that
1329 // may alter thesse sets.
1330
1331 // We need to be careful how we create new alias sets, and
1332 // have right scopes and domains for loads/stores of these new
1333 // LDS variables. We intersect NoAlias set if alias sets belong
1334 // to the same domain. This is the case if we have memcpy using
1335 // LDS variables. Both src and dst of memcpy would belong to
1336 // LDS struct, they donot alias.
1337 // On the other hand, if one of the domains is LDS and other is
1338 // existing domain prior to LDS, we need to have a union of all
1339 // these aliases set to preserve existing aliasing information.
1340
1341 SmallPtrSet<const MDNode *, 16> ExistingDomains, LDSDomains;
1342 ScopedNoAlias.collectScopedDomains(NA, ExistingDomains);
1343 ScopedNoAlias.collectScopedDomains(NoAlias, LDSDomains);
1344 auto Intersection = set_intersection(ExistingDomains, LDSDomains);
1345 if (Intersection.empty()) {
1346 NA = NA ? MDNode::concatenate(NA, NoAlias) : NoAlias;
1347 } else {
1348 NA = NA ? MDNode::intersect(NA, NoAlias) : NoAlias;
1349 }
1350 I->setMetadata(LLVMContext::MD_noalias, NA);
1351 }
1352 }
1353
1354 if (auto *LI = dyn_cast<LoadInst>(U)) {
1355 LI->setAlignment(std::max(A, LI->getAlign()));
1356 continue;
1357 }
1358 if (auto *SI = dyn_cast<StoreInst>(U)) {
1359 if (SI->getPointerOperand() == Ptr)
1360 SI->setAlignment(std::max(A, SI->getAlign()));
1361 continue;
1362 }
1363 if (auto *AI = dyn_cast<AtomicRMWInst>(U)) {
1364 // None of atomicrmw operations can work on pointers, but let's
1365 // check it anyway in case it will or we will process ConstantExpr.
1366 if (AI->getPointerOperand() == Ptr)
1367 AI->setAlignment(std::max(A, AI->getAlign()));
1368 continue;
1369 }
1370 if (auto *AI = dyn_cast<AtomicCmpXchgInst>(U)) {
1371 if (AI->getPointerOperand() == Ptr)
1372 AI->setAlignment(std::max(A, AI->getAlign()));
1373 continue;
1374 }
1375 if (auto *GEP = dyn_cast<GetElementPtrInst>(U)) {
1376 unsigned BitWidth = DL.getIndexTypeSizeInBits(GEP->getType());
1377 APInt Off(BitWidth, 0);
1378 if (GEP->getPointerOperand() == Ptr) {
1379 Align GA;
1380 if (GEP->accumulateConstantOffset(DL, Off))
1381 GA = commonAlignment(A, Off.getLimitedValue());
1382 refineUsesAlignmentAndAA(GEP, GA, DL, AliasScope, NoAlias,
1383 MaxDepth - 1);
1384 }
1385 continue;
1386 }
1387 if (auto *I = dyn_cast<Instruction>(U)) {
1388 if (I->getOpcode() == Instruction::BitCast ||
1389 I->getOpcode() == Instruction::AddrSpaceCast)
1390 refineUsesAlignmentAndAA(I, A, DL, AliasScope, NoAlias, MaxDepth - 1);
1391 }
1392 }
1393 }
1394};
1395
1396class AMDGPULowerModuleLDSLegacy : public ModulePass {
1397public:
1398 const AMDGPUTargetMachine *TM;
1399 static char ID;
1400
1401 AMDGPULowerModuleLDSLegacy(const AMDGPUTargetMachine *TM = nullptr)
1402 : ModulePass(ID), TM(TM) {}
1403
1404 void getAnalysisUsage(AnalysisUsage &AU) const override {
1405 if (!TM)
1407 }
1408
1409 bool runOnModule(Module &M) override {
1410 if (!TM) {
1411 auto &TPC = getAnalysis<TargetPassConfig>();
1412 TM = &TPC.getTM<AMDGPUTargetMachine>();
1413 }
1414
1415 return AMDGPULowerModuleLDS(*TM).runOnModule(M);
1416 }
1417};
1418
1419} // namespace
1420char AMDGPULowerModuleLDSLegacy::ID = 0;
1421
1422char &llvm::AMDGPULowerModuleLDSLegacyPassID = AMDGPULowerModuleLDSLegacy::ID;
1423
1424INITIALIZE_PASS_BEGIN(AMDGPULowerModuleLDSLegacy, DEBUG_TYPE,
1425 "Lower uses of LDS variables from non-kernel functions",
1426 false, false)
1428INITIALIZE_PASS_END(AMDGPULowerModuleLDSLegacy, DEBUG_TYPE,
1429 "Lower uses of LDS variables from non-kernel functions",
1431
1432ModulePass *
1434 return new AMDGPULowerModuleLDSLegacy(TM);
1435}
1436
1439 return AMDGPULowerModuleLDS(TM).runOnModule(M) ? PreservedAnalyses::none()
1441}
assert(UImm &&(UImm !=~static_cast< T >(0)) &&"Invalid immediate!")
aarch64 promote const
The AMDGPU TargetMachine interface definition for hw codegen targets.
MachineBasicBlock MachineBasicBlock::iterator DebugLoc DL
This file implements the BitVector class.
static GCRegistry::Add< ErlangGC > A("erlang", "erlang-compatible garbage collector")
This file provides interfaces used to build and manipulate a call graph, which is a very useful tool ...
#define clEnumValN(ENUMVAL, FLAGNAME, DESC)
This file contains the declarations for the subclasses of Constant, which represent the different fla...
DXIL Forward Handle Accesses
This file defines the DenseMap class.
This file defines the DenseSet and SmallDenseSet classes.
#define DEBUG_TYPE
Hexagon Common GEP
#define F(x, y, z)
Definition MD5.cpp:54
#define I(x, y, z)
Definition MD5.cpp:57
This file provides an interface for laying out a sequence of fields as a struct in a way that attempt...
#define INITIALIZE_PASS_DEPENDENCY(depName)
Definition PassSupport.h:42
#define INITIALIZE_PASS_END(passName, arg, name, cfg, analysis)
Definition PassSupport.h:44
#define INITIALIZE_PASS_BEGIN(passName, arg, name, cfg, analysis)
Definition PassSupport.h:39
This file contains some templates that are useful if you are working with the STL at all.
This is the interface for a metadata-based scoped no-alias analysis.
This file defines generic set operations that may be used on set's of different types,...
Target-Independent Code Generator Pass Configuration Options pass.
Class for arbitrary precision integers.
Definition APInt.h:78
uint64_t getZExtValue() const
Get zero extended value.
Definition APInt.h:1555
Represent the analysis usage information of a pass.
AnalysisUsage & addRequired()
ArrayRef - Represent a constant reference to an array (0 or more elements consecutively in memory),...
Definition ArrayRef.h:40
size_t size() const
size - Get the array size.
Definition ArrayRef.h:142
bool empty() const
empty - Check if the array is empty.
Definition ArrayRef.h:137
static LLVM_ABI ArrayType * get(Type *ElementType, uint64_t NumElements)
This static method is the primary way to construct an ArrayType.
LLVM Basic Block Representation.
Definition BasicBlock.h:62
LLVM_ABI const_iterator getFirstInsertionPt() const
Returns an iterator to the first instruction in this block that is suitable for inserting a non-PHI i...
void reserve(unsigned N)
Definition BitVector.h:367
void push_back(bool Val)
Definition BitVector.h:485
The basic data container for the call graph of a Module of IR.
Definition CallGraph.h:72
static LLVM_ABI Constant * get(ArrayType *T, ArrayRef< Constant * > V)
static ConstantAsMetadata * get(Constant *C)
Definition Metadata.h:537
static LLVM_ABI Constant * getPointerBitCastOrAddrSpaceCast(Constant *C, Type *Ty)
Create a BitCast or AddrSpaceCast for a pointer type depending on the address space.
static Constant * getGetElementPtr(Type *Ty, Constant *C, ArrayRef< Constant * > IdxList, GEPNoWrapFlags NW=GEPNoWrapFlags::none(), std::optional< ConstantRange > InRange=std::nullopt, Type *OnlyIfReducedTy=nullptr)
Getelementptr form.
Definition Constants.h:1284
This is an important base class in LLVM.
Definition Constant.h:43
LLVM_ABI void removeDeadConstantUsers() const
If there are any dead constant users dangling off of this constant, remove them.
A parsed version of the target data layout string in and methods for querying it.
Definition DataLayout.h:64
iterator find(const_arg_type_t< KeyT > Val)
Definition DenseMap.h:178
std::pair< iterator, bool > try_emplace(KeyT &&Key, Ts &&...Args)
Definition DenseMap.h:256
iterator end()
Definition DenseMap.h:81
bool contains(const_arg_type_t< KeyT > Val) const
Return true if the specified key is in the map, false otherwise.
Definition DenseMap.h:169
Implements a dense probed hash-table based set.
Definition DenseSet.h:279
LLVM_ABI void setMetadata(unsigned KindID, MDNode *Node)
Set a particular kind of metadata attachment.
LLVM_ABI bool isAbsoluteSymbolRef() const
Returns whether this is a reference to an absolute symbol.
Definition Globals.cpp:447
PointerType * getType() const
Global values are always pointers.
@ InternalLinkage
Rename collisions when linking (static functions).
Definition GlobalValue.h:60
@ ExternalLinkage
Externally visible function.
Definition GlobalValue.h:53
Type * getValueType() const
bool hasInitializer() const
Definitions have initializers, declarations don't.
LLVM_ABI uint64_t getGlobalSize(const DataLayout &DL) const
Get the size of this global variable in bytes.
Definition Globals.cpp:561
LLVM_ABI void eraseFromParent()
eraseFromParent - This method unlinks 'this' from the containing module and deletes it.
Definition Globals.cpp:530
void setAlignment(Align Align)
Sets the alignment attribute of the GlobalVariable.
This provides a uniform API for creating instructions and inserting them into a basic block: either a...
Definition IRBuilder.h:2787
bool runOnModule(Module &) override
ImmutablePasses are never run.
Definition Pass.h:302
This is an important class for using LLVM in a threaded context.
Definition LLVMContext.h:68
MDNode * createAnonymousAliasScope(MDNode *Domain, StringRef Name=StringRef())
Return metadata appropriate for an alias scope root node.
Definition MDBuilder.h:195
MDNode * createAnonymousAliasScopeDomain(StringRef Name=StringRef())
Return metadata appropriate for an alias scope domain node.
Definition MDBuilder.h:188
Metadata node.
Definition Metadata.h:1080
static LLVM_ABI MDNode * getMostGenericAliasScope(MDNode *A, MDNode *B)
static LLVM_ABI MDNode * concatenate(MDNode *A, MDNode *B)
Methods for metadata merging.
static MDTuple * get(LLVMContext &Context, ArrayRef< Metadata * > MDs)
Definition Metadata.h:1572
static LLVM_ABI MDNode * intersect(MDNode *A, MDNode *B)
Root of the metadata hierarchy.
Definition Metadata.h:64
ModulePass class - This class is used to implement unstructured interprocedural optimizations and ana...
Definition Pass.h:255
A Module instance is used to store all the information related to an LLVM module.
Definition Module.h:67
A container for an operand bundle being viewed as a set of values rather than a set of uses.
static PointerType * getUnqual(Type *ElementType)
This constructs a pointer to an object of the specified type in the default address space (address sp...
static LLVM_ABI PointerType * get(Type *ElementType, unsigned AddressSpace)
This constructs a pointer to an object of the specified type in a numbered address space.
static LLVM_ABI PoisonValue * get(Type *T)
Static factory methods - Return an 'poison' object of the specified type.
A set of analyses that are preserved following a run of a transformation pass.
Definition Analysis.h:112
static PreservedAnalyses none()
Convenience factory function for the empty preserved set.
Definition Analysis.h:115
static PreservedAnalyses all()
Construct a special preserved set that preserves all passes.
Definition Analysis.h:118
A simple AA result which uses scoped-noalias metadata to answer queries.
static LLVM_ABI void collectScopedDomains(const MDNode *NoAlias, SmallPtrSetImpl< const MDNode * > &Domains)
Collect the set of scoped domains relevant to the noalias scopes.
bool insert(const value_type &X)
Insert a new element into the SetVector.
Definition SetVector.h:151
size_type count(ConstPtrType Ptr) const
count - Return 1 if the specified pointer is in the set, 0 otherwise.
std::pair< iterator, bool > insert(PtrType Ptr)
Inserts Ptr if and only if there is no element in the container equal to Ptr.
SmallPtrSet - This class implements a set which is optimized for holding SmallSize or less elements.
reference emplace_back(ArgTypes &&... Args)
void reserve(size_type N)
void append(ItTy in_start, ItTy in_end)
Add the specified range to the end of the SmallVector.
void push_back(const T &Elt)
This is a 'vector' (really, a variable-sized array), optimized for the case when the array is small.
Class to represent struct types.
static LLVM_ABI StructType * create(LLVMContext &Context, StringRef Name)
This creates an identified struct.
Definition Type.cpp:619
Target-Independent Code Generator Pass Configuration Options.
Twine - A lightweight data structure for efficiently representing the concatenation of temporary valu...
Definition Twine.h:82
The instances of the Type class are immutable: once they are created, they are never changed.
Definition Type.h:45
static LLVM_ABI IntegerType * getInt32Ty(LLVMContext &C)
Definition Type.cpp:296
LLVM_ABI unsigned getPointerAddressSpace() const
Get the address space of this pointer or pointer vector type.
static LLVM_ABI IntegerType * getInt8Ty(LLVMContext &C)
Definition Type.cpp:294
A Use represents the edge between a Value definition and its users.
Definition Use.h:35
LLVM Value Representation.
Definition Value.h:75
iterator_range< user_iterator > users()
Definition Value.h:426
LLVM_ABI void replaceUsesWithIf(Value *New, llvm::function_ref< bool(Use &U)> ShouldReplace)
Go through the uses list for this definition and make each use point to "V" if the callback ShouldRep...
Definition Value.cpp:561
bool use_empty() const
Definition Value.h:346
iterator_range< use_iterator > uses()
Definition Value.h:380
bool hasName() const
Definition Value.h:262
LLVM_ABI StringRef getName() const
Return a constant reference to the value's name.
Definition Value.cpp:322
std::pair< iterator, bool > insert(const ValueT &V)
Definition DenseSet.h:202
size_type size() const
Definition DenseSet.h:87
bool contains(const_arg_type_t< ValueT > V) const
Check if the set contains the given element.
Definition DenseSet.h:175
bool erase(const ValueT &V)
Definition DenseSet.h:100
A raw_ostream that writes to an std::string.
Changed
@ LOCAL_ADDRESS
Address space for local memory.
@ CONSTANT_ADDRESS
Address space for constant memory (VTX2).
bool isDynamicLDS(const GlobalVariable &GV)
void removeFnAttrFromReachable(CallGraph &CG, Function *KernelRoot, ArrayRef< StringRef > FnAttrs)
Strip FnAttr attribute from any functions where we may have introduced its use.
LLVM_READNONE constexpr bool isKernel(CallingConv::ID CC)
LDSUsesInfoTy getTransitiveUsesOfLDS(const CallGraph &CG, Module &M)
bool isLDSVariableToLower(const GlobalVariable &GV)
bool eliminateConstantExprUsesOfLDSFromAllInstructions(Module &M)
Align getAlign(const DataLayout &DL, const GlobalVariable *GV)
DenseMap< GlobalVariable *, DenseSet< Function * > > VariableFunctionMap
unsigned ID
LLVM IR allows to use arbitrary numbers as calling convention identifiers.
Definition CallingConv.h:24
@ C
The default llvm calling convention, compatible with C.
Definition CallingConv.h:34
LLVM_ABI Function * getOrInsertDeclaration(Module *M, ID id, ArrayRef< Type * > Tys={})
Look up the Function declaration of the intrinsic id in the Module M.
ValuesClass values(OptsTy... Options)
Helper to build a ValuesClass by forwarding a variable number of arguments as an initializer list to ...
initializer< Ty > init(const Ty &Val)
This is an optimization pass for GlobalISel generic memory operations.
Definition Types.h:26
@ Offset
Definition DWP.cpp:532
bool operator<(int64_t V1, const APSInt &V2)
Definition APSInt.h:360
decltype(auto) dyn_cast(const From &Val)
dyn_cast<X> - Return the argument parameter cast to the specified type.
Definition Casting.h:643
iterator_range< early_inc_iterator_impl< detail::IterOfRange< RangeT > > > make_early_inc_range(RangeT &&Range)
Make a range that does early increment to allow mutation of the underlying range without disrupting i...
Definition STLExtras.h:634
void sort(IteratorTy Start, IteratorTy End)
Definition STLExtras.h:1636
char & AMDGPULowerModuleLDSLegacyPassID
bool isa(const From &Val)
isa<X> - Return true if the parameter to the template is an instance of one of the template type argu...
Definition Casting.h:547
S1Ty set_intersection(const S1Ty &S1, const S2Ty &S2)
set_intersection(A, B) - Return A ^ B
LLVM_ABI void removeFromUsedLists(Module &M, function_ref< bool(Constant *)> ShouldRemove)
Removes global values from the llvm.used and llvm.compiler.used arrays.
format_object< Ts... > format(const char *Fmt, const Ts &... Vals)
These are helper functions used to produce formatted output.
Definition Format.h:129
ModulePass * createAMDGPULowerModuleLDSLegacyPass(const AMDGPUTargetMachine *TM=nullptr)
@ Other
Any other memory.
Definition ModRef.h:68
LLVM_ABI void appendToCompilerUsed(Module &M, ArrayRef< GlobalValue * > Values)
Adds global values to the llvm.compiler.used list.
LLVM_ABI std::pair< uint64_t, Align > performOptimizedStructLayout(MutableArrayRef< OptimizedStructLayoutField > Fields)
Compute a layout for a struct containing the given fields, making a best-effort attempt to minimize t...
uint64_t alignTo(uint64_t Size, Align A)
Returns a multiple of A needed to store Size bytes.
Definition Alignment.h:144
constexpr unsigned BitWidth
decltype(auto) cast(const From &Val)
cast<X> - Return the argument parameter cast to the specified type.
Definition Casting.h:559
Align commonAlignment(Align A, uint64_t Offset)
Returns the alignment that satisfies both alignments.
Definition Alignment.h:201
AnalysisManager< Module > ModuleAnalysisManager
Convenience typedef for the Module analysis manager.
Definition MIRParser.h:39
LLVM_ABI void reportFatalUsageError(Error Err)
Report a fatal error that does not indicate a bug in LLVM.
Definition Error.cpp:177
#define N
PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM)
const AMDGPUTargetMachine & TM
Definition AMDGPU.h:139
FunctionVariableMap direct_access
FunctionVariableMap indirect_access
This struct is a compact representation of a valid (non-zero power of two) alignment.
Definition Alignment.h:39
constexpr uint64_t value() const
This is a hole in the type system and should not be abused.
Definition Alignment.h:77