File: build/source/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
Warning: line 688, column 5: Value stored to 'Changed' is never read
1 | //===-- AMDGPULowerModuleLDSPass.cpp ------------------------------*- C++ -*-=// |
2 | // |
3 | // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. |
4 | // See https://llvm.org/LICENSE.txt for license information. |
5 | // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception |
6 | // |
7 | //===----------------------------------------------------------------------===// |
8 | // |
9 | // This pass eliminates local data store (LDS) uses from non-kernel functions. |
10 | // LDS is contiguous memory allocated per kernel execution. |
11 | // |
12 | // Background. |
13 | // |
14 | // The programming model is global variables, or equivalently function local |
15 | // static variables, accessible from kernels or other functions. For uses from |
16 | // kernels this is straightforward - assign an integer to the kernel for the |
17 | // memory required by all the variables combined, allocate them within that. |
18 | // For uses from functions there are performance tradeoffs to choose between. |
19 | // |
20 | // This model means the GPU runtime can specify the amount of memory allocated. |
21 | // If this is more than the kernel assumed, the excess can be made available |
22 | // using a language specific feature, which IR represents as a variable with |
23 | // no initializer. This feature is not yet implemented for non-kernel functions. |
24 | // This lowering could be extended to handle that use case, but would probably |
25 | // require closer integration with promoteAllocaToLDS. |
26 | // |
27 | // Consequences of this GPU feature: |
28 | // - memory is limited and exceeding it halts compilation |
29 | // - a global accessed by one kernel exists independent of other kernels |
30 | // - a global exists independent of simultaneous execution of the same kernel |
31 | // - the address of the global may differ between kernels; as the per-kernel |
32 | //   instances do not alias, each kernel need only allocate the variables it uses |
33 | // - if the address is allowed to differ, functions need help to find it |
34 | // |
35 | // Uses from kernels are implemented here by grouping them in a per-kernel |
36 | // struct instance. This duplicates the variables, accurately modelling their |
37 | // aliasing properties relative to a single global representation. It also |
38 | // permits control over alignment via padding. |
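// For example (a sketch; the names follow the conventions this pass uses, the
// kernel and variables are hypothetical): if kernel @k uses LDS variables
// @a (i32) and @b (i64), the pass builds roughly
//   %llvm.amdgcn.kernel.k.lds.t = type { i64, i32 }
//   @llvm.amdgcn.kernel.k.lds = internal addrspace(3) global
//       %llvm.amdgcn.kernel.k.lds.t undef, align 8
// and rewrites uses of @a and @b within @k to constantexpr GEPs into that
// instance. Field order comes from OptimizedStructLayout, not declaration
// order.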
39 | // |
40 | // Uses from functions are more complicated and are the primary purpose of this |
41 | // IR pass. Several different lowerings are chosen between to meet two |
42 | // requirements: avoid allocating any LDS where it is not necessary, as that |
43 | // impacts occupancy and may fail the compilation, while not imposing overhead |
44 | // on a feature whose primary advantage over global memory is performance. The |
45 | // basic design goal is to avoid one kernel imposing overhead on another. |
46 | // |
47 | // Implementation. |
48 | // |
49 | // LDS variables with constant annotation or non-undef initializer are passed |
50 | // through unchanged for simplification or error diagnostics in later passes. |
51 | // Non-undef initializers are not yet implemented for LDS. |
52 | // |
53 | // LDS variables that are always allocated at the same address can be found |
54 | // by lookup at that address. Otherwise runtime information/cost is required. |
55 | // |
56 | // The simplest strategy possible is to group all LDS variables in a single |
57 | // struct and allocate that struct in every kernel such that the original |
58 | // variables are always at the same address. LDS is however a limited resource |
59 | // so this strategy is unusable in practice. It is not implemented here. |
60 | // |
61 | // Strategy | Precise allocation | Zero runtime cost | General purpose | |
62 | // --------+--------------------+-------------------+-----------------+ |
63 | // Module | No | Yes | Yes | |
64 | // Table | Yes | No | Yes | |
65 | // Kernel | Yes | Yes | No | |
66 | // Hybrid | Yes | Partial | Yes | |
67 | // |
68 | // Module spends LDS memory to save cycles. Table spends cycles and global |
69 | // memory to save LDS. Kernel is as fast as kernel allocation but only works |
70 | // for variables that are known reachable from a single kernel. Hybrid picks |
71 | // between all three. When forced to choose between LDS and cycles it minimises |
72 | // LDS use. |
73 | |
74 | // The "module" lowering implemented here finds LDS variables which are used by |
75 | // non-kernel functions and creates a new struct with a field for each of those |
76 | // LDS variables. Variables that are only used from kernels are excluded. |
77 | // Kernels that do not use this struct are annotated with the attribute |
78 | // amdgpu-elide-module-lds, which allows the back end to elide the allocation. |
79 | // |
80 | // The "table" lowering implemented here has three components. |
81 | // First, kernels are assigned a unique integer identifier which is available |
82 | // in functions they call through the intrinsic amdgcn_lds_kernel_id. The |
83 | // integer is passed through a specific SGPR and thus works with indirect calls. |
84 | // Second, each kernel allocates LDS variables independent of other kernels and |
85 | // writes the addresses it chose for each variable into an array in consistent |
86 | // order. If the kernel does not allocate a given variable, it writes poison |
87 | // to the corresponding array location. These arrays are written to a constant |
88 | // table in the order matching the kernel unique integer identifier. |
89 | // Third, uses from non-kernel functions are replaced with a table lookup using |
90 | // the intrinsic function to find the address of the variable. |
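// As a sketch (IR paraphrased; the table and intrinsic names are the ones
// created later in this file), a use of variable V from a non-kernel function
// becomes:
//   %id   = call i32 @llvm.amdgcn.lds.kernel.id()
//   %slot = getelementptr inbounds @llvm.amdgcn.lds.offset.table,
//               i32 0, i32 %id, i32 <index of V>
//   %addr = load i32, ptr addrspace(4) %slot
//   %v    = inttoptr i32 %addr to ptr addrspace(3)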
91 | // |
92 | // "Kernel" lowering is only applicable for variables that are unambiguously |
93 | // reachable from exactly one kernel. For those cases, accesses to the variable |
94 | // can be lowered to ConstantExpr address of a struct instance specific to that |
95 | // one kernel. This is zero cost in space and in compute. It will raise a fatal |
96 | // error on any variable that might be reachable from multiple kernels and is |
97 | // thus most easily used as part of the hybrid lowering strategy. |
98 | // |
99 | // Hybrid lowering is a mixture of the above. It uses the zero cost kernel |
100 | // lowering where it can. It lowers the variable accessed by the greatest |
101 | // number of kernels using the module strategy, as that is free for the first |
102 | // variable. Any further variables that can be lowered with the module strategy |
103 | // without incurring LDS memory overhead are lowered that way. The remaining |
104 | // ones are lowered via table. |
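// For example (hypothetical module): given variables {a, b, c} where a is
// reachable from kernels {k0, k1, k2}, b from {k0, k1} and c from {k2} only,
// hybrid lowers a via the module struct (it is the most widely used, so the
// free first variable), c via the kernel strategy (single kernel), and b via
// the module struct too because its kernels are a subset of a's; a variable
// reachable from, say, {k0, k3} would instead be lowered via table.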
105 | // |
106 | // Consequences |
107 | // - No heuristics or user controlled magic numbers, hybrid is the right choice |
108 | // - Kernels that don't use functions (or have had them all inlined) are not |
109 | // affected by any lowering for kernels that do. |
110 | // - Kernels that don't make indirect function calls are not affected by those |
111 | // that do. |
112 | // - Variables which are used by lots of kernels, e.g. those injected by a |
113 | //   language runtime in most kernels, are expected to have no overhead. |
114 | // - Implementations that instantiate templates per-kernel where those templates |
115 | //   use LDS are expected to hit the "Kernel" lowering strategy. |
116 | // - The runtime properties impose a cost in compiler implementation complexity. |
117 | // |
118 | //===----------------------------------------------------------------------===// |
119 | |
120 | #include "AMDGPU.h" |
121 | #include "Utils/AMDGPUBaseInfo.h" |
122 | #include "Utils/AMDGPUMemoryUtils.h" |
123 | #include "llvm/ADT/BitVector.h" |
124 | #include "llvm/ADT/DenseMap.h" |
125 | #include "llvm/ADT/DenseSet.h" |
126 | #include "llvm/ADT/STLExtras.h" |
127 | #include "llvm/ADT/SetOperations.h" |
128 | #include "llvm/ADT/SetVector.h" |
129 | #include "llvm/Analysis/CallGraph.h" |
130 | #include "llvm/IR/Constants.h" |
131 | #include "llvm/IR/DerivedTypes.h" |
132 | #include "llvm/IR/IRBuilder.h" |
133 | #include "llvm/IR/InlineAsm.h" |
134 | #include "llvm/IR/Instructions.h" |
135 | #include "llvm/IR/IntrinsicsAMDGPU.h" |
136 | #include "llvm/IR/MDBuilder.h" |
137 | #include "llvm/IR/ReplaceConstant.h" |
138 | #include "llvm/InitializePasses.h" |
139 | #include "llvm/Pass.h" |
140 | #include "llvm/Support/CommandLine.h" |
141 | #include "llvm/Support/Debug.h" |
142 | #include "llvm/Support/OptimizedStructLayout.h" |
143 | #include "llvm/Transforms/Utils/BasicBlockUtils.h" |
144 | #include "llvm/Transforms/Utils/ModuleUtils.h" |
145 | |
146 | #include <tuple> |
147 | #include <vector> |
148 | |
149 | #include <cstdio> |
150 | |
151 | #define DEBUG_TYPE "amdgpu-lower-module-lds" |
152 | |
153 | using namespace llvm; |
154 | |
155 | namespace { |
156 | |
157 | cl::opt<bool> SuperAlignLDSGlobals( |
158 | "amdgpu-super-align-lds-globals", |
159 | cl::desc("Increase alignment of LDS if it is not on align boundary"), |
160 | cl::init(true), cl::Hidden); |
161 | |
162 | enum class LoweringKind { module, table, kernel, hybrid }; |
163 | cl::opt<LoweringKind> LoweringKindLoc( |
164 | "amdgpu-lower-module-lds-strategy", |
165 | cl::desc("Specify lowering strategy for function LDS access:"), cl::Hidden, |
166 | cl::init(LoweringKind::hybrid), |
167 | cl::values( |
168 | clEnumValN(LoweringKind::table, "table", "Lower via table lookup"), |
169 | clEnumValN(LoweringKind::module, "module", "Lower via module struct"), |
170 | clEnumValN( |
171 | LoweringKind::kernel, "kernel", |
172 | "Lower variables reachable from one kernel, otherwise abort"), |
173 | clEnumValN(LoweringKind::hybrid, "hybrid", |
174 | "Lower via mixture of above strategies"))); |
175 | |
176 | bool isKernelLDS(const Function *F) { |
177 | // Some weirdness here. AMDGPU::isKernelCC does not call into |
178 | // AMDGPU::isKernel with the calling conv, it instead calls into |
179 | // isModuleEntryFunction which returns true for more calling conventions |
180 | // than AMDGPU::isKernel does. There's a FIXME on AMDGPU::isKernel. |
181 | // There's also a test that checks that the LDS lowering does not hit on |
182 | // a graphics shader, denoted amdgpu_ps, so stay with the limited case. |
183 | // Putting LDS in the name of the function to draw attention to this. |
184 | return AMDGPU::isKernel(F->getCallingConv()); |
185 | } |
186 | |
187 | class AMDGPULowerModuleLDS : public ModulePass { |
188 | |
189 | static void |
190 | removeLocalVarsFromUsedLists(Module &M, |
191 | const DenseSet<GlobalVariable *> &LocalVars) { |
192 | // The verifier rejects used lists containing an inttoptr of a constant |
193 | // so remove the variables from these lists before replaceAllUsesWith |
194 | SmallPtrSet<Constant *, 8> LocalVarsSet; |
195 | for (GlobalVariable *LocalVar : LocalVars) |
196 | LocalVarsSet.insert(cast<Constant>(LocalVar->stripPointerCasts())); |
197 | |
198 | removeFromUsedLists( |
199 | M, [&LocalVarsSet](Constant *C) { return LocalVarsSet.count(C); }); |
200 | |
201 | for (GlobalVariable *LocalVar : LocalVars) |
202 | LocalVar->removeDeadConstantUsers(); |
203 | } |
204 | |
205 | static void markUsedByKernel(IRBuilder<> &Builder, Function *Func, |
206 | GlobalVariable *SGV) { |
207 | // The llvm.amdgcn.module.lds instance is implicitly used by all kernels |
208 | // that might call a function which accesses a field within it. This is |
209 | // presently approximated to 'all kernels' if there are any such functions |
210 | // in the module. This implicit use is redefined as an explicit use here so |
211 | // that later passes, specifically PromoteAlloca, account for the required |
212 | // memory without any knowledge of this transform. |
213 | |
214 | // An operand bundle on llvm.donothing works because the call instruction |
215 | // survives until after the last pass that needs to account for LDS. It is |
216 | // better than inline asm as the latter survives until the end of codegen. A |
217 | // totally robust solution would be a function with the same semantics as |
218 | // llvm.donothing that takes a pointer to the instance and is lowered to a |
219 | // no-op after LDS is allocated, but that is not presently necessary. |
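// As a sketch, the marker emitted below looks like (the zero-index GEP
// constant-folds to the struct instance itself):
//   call void @llvm.donothing() [ "ExplicitUse"(%T addrspace(3)*
//       @llvm.amdgcn.module.lds) ]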
220 | |
221 | // This intrinsic is eliminated shortly before instruction selection. It |
222 | // does not suffice to indicate to ISel that a given global which is not |
223 | // immediately used by the kernel must still be allocated by it. An |
224 | // equivalent target specific intrinsic which lasts until immediately after |
225 | // codegen would suffice for that, but one would still need to ensure that |
226 | // the variables are allocated in the anticipated order. |
227 | |
228 | LLVMContext &Ctx = Func->getContext(); |
229 | |
230 | Builder.SetInsertPoint(Func->getEntryBlock().getFirstNonPHI()); |
231 | |
232 | FunctionType *FTy = FunctionType::get(Type::getVoidTy(Ctx), {}); |
233 | |
234 | Function *Decl = |
235 | Intrinsic::getDeclaration(Func->getParent(), Intrinsic::donothing, {}); |
236 | |
237 | Value *UseInstance[1] = {Builder.CreateInBoundsGEP( |
238 | SGV->getValueType(), SGV, ConstantInt::get(Type::getInt32Ty(Ctx), 0))}; |
239 | |
240 | Builder.CreateCall(FTy, Decl, {}, |
241 | {OperandBundleDefT<Value *>("ExplicitUse", UseInstance)}, |
242 | ""); |
243 | } |
244 | |
245 | static bool eliminateConstantExprUsesOfLDSFromAllInstructions(Module &M) { |
246 | // Constants are uniqued within LLVM. A ConstantExpr referring to a LDS |
247 | // global may have uses from multiple different functions as a result. |
248 | // This pass specialises LDS variables with respect to the kernel that |
249 | // allocates them. |
250 | |
251 | // This is semantically equivalent to (the unimplemented as slow): |
252 | // for (auto &F : M.functions()) |
253 | // for (auto &BB : F) |
254 | // for (auto &I : BB) |
255 | // for (Use &Op : I.operands()) |
256 | // if (constantExprUsesLDS(Op)) |
257 | // replaceConstantExprInFunction(I, Op); |
258 | |
259 | SmallVector<Constant *> LDSGlobals; |
260 | for (auto &GV : M.globals()) |
261 | if (AMDGPU::isLDSVariableToLower(GV)) |
262 | LDSGlobals.push_back(&GV); |
263 | |
264 | return convertUsersOfConstantsToInstructions(LDSGlobals); |
265 | } |
266 | |
267 | public: |
268 | static char ID; |
269 | |
270 | AMDGPULowerModuleLDS() : ModulePass(ID) { |
271 | initializeAMDGPULowerModuleLDSPass(*PassRegistry::getPassRegistry()); |
272 | } |
273 | |
274 | using FunctionVariableMap = DenseMap<Function *, DenseSet<GlobalVariable *>>; |
275 | |
276 | using VariableFunctionMap = DenseMap<GlobalVariable *, DenseSet<Function *>>; |
277 | |
278 | static void getUsesOfLDSByFunction(CallGraph const &CG, Module &M, |
279 | FunctionVariableMap &kernels, |
280 | FunctionVariableMap &functions) { |
281 | |
282 | // Get uses from the current function, excluding uses by called functions |
283 | // Two output variables to avoid walking the globals list twice |
284 | for (auto &GV : M.globals()) { |
285 | if (!AMDGPU::isLDSVariableToLower(GV)) { |
286 | continue; |
287 | } |
288 | |
290 | for (User *V : GV.users()) { |
291 | if (auto *I = dyn_cast<Instruction>(V)) { |
292 | Function *F = I->getFunction(); |
293 | if (isKernelLDS(F)) { |
294 | kernels[F].insert(&GV); |
295 | } else { |
296 | functions[F].insert(&GV); |
297 | } |
298 | } |
299 | } |
300 | } |
301 | } |
302 | |
303 | struct LDSUsesInfoTy { |
304 | FunctionVariableMap direct_access; |
305 | FunctionVariableMap indirect_access; |
306 | }; |
307 | |
308 | static LDSUsesInfoTy getTransitiveUsesOfLDS(CallGraph const &CG, Module &M) { |
309 | |
310 | FunctionVariableMap direct_map_kernel; |
311 | FunctionVariableMap direct_map_function; |
312 | getUsesOfLDSByFunction(CG, M, direct_map_kernel, direct_map_function); |
313 | |
314 | // Collect variables that are used by functions whose address has escaped |
315 | DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer; |
316 | for (Function &F : M.functions()) { |
317 | if (!isKernelLDS(&F)) |
318 | if (F.hasAddressTaken(nullptr, |
319 | /* IgnoreCallbackUses */ false, |
320 | /* IgnoreAssumeLikeCalls */ false, |
321 | /* IgnoreLLVMUsed */ true, |
322 | /* IgnoreArcAttachedCall */ false)) { |
323 | set_union(VariablesReachableThroughFunctionPointer, |
324 | direct_map_function[&F]); |
325 | } |
326 | } |
327 | |
328 | auto functionMakesUnknownCall = [&](const Function *F) -> bool { |
329 | assert(!F->isDeclaration()); |
330 | for (CallGraphNode::CallRecord R : *CG[F]) { |
331 | if (!R.second->getFunction()) { |
332 | return true; |
333 | } |
334 | } |
335 | return false; |
336 | }; |
337 | |
338 | // Work out which variables are reachable through function calls |
339 | FunctionVariableMap transitive_map_function = direct_map_function; |
340 | |
341 | // If the function makes any unknown call, assume the worst case that it can |
342 | // access all variables accessed by functions whose address escaped |
343 | for (Function &F : M.functions()) { |
344 | if (!F.isDeclaration() && functionMakesUnknownCall(&F)) { |
345 | if (!isKernelLDS(&F)) { |
346 | set_union(transitive_map_function[&F], |
347 | VariablesReachableThroughFunctionPointer); |
348 | } |
349 | } |
350 | } |
351 | |
352 | // Direct implementation of collecting all variables reachable from each |
353 | // function |
354 | for (Function &Func : M.functions()) { |
355 | if (Func.isDeclaration() || isKernelLDS(&Func)) |
356 | continue; |
357 | |
358 | DenseSet<Function *> seen; // catches cycles |
359 | SmallVector<Function *, 4> wip{&Func}; |
360 | |
361 | while (!wip.empty()) { |
362 | Function *F = wip.pop_back_val(); |
363 | |
364 | // Can accelerate this by referring to transitive map for functions that |
365 | // have already been computed, with more care than this |
366 | set_union(transitive_map_function[&Func], direct_map_function[F]); |
367 | |
368 | for (CallGraphNode::CallRecord R : *CG[F]) { |
369 | Function *ith = R.second->getFunction(); |
370 | if (ith) { |
371 | if (!seen.contains(ith)) { |
372 | seen.insert(ith); |
373 | wip.push_back(ith); |
374 | } |
375 | } |
376 | } |
377 | } |
378 | } |
379 | |
380 | // direct_map_kernel lists which variables are used by the kernel |
381 | // find the variables which are used through a function call |
382 | FunctionVariableMap indirect_map_kernel; |
383 | |
384 | for (Function &Func : M.functions()) { |
385 | if (Func.isDeclaration() || !isKernelLDS(&Func)) |
386 | continue; |
387 | |
388 | for (CallGraphNode::CallRecord R : *CG[&Func]) { |
389 | Function *ith = R.second->getFunction(); |
390 | if (ith) { |
391 | set_union(indirect_map_kernel[&Func], transitive_map_function[ith]); |
392 | } else { |
393 | set_union(indirect_map_kernel[&Func], |
394 | VariablesReachableThroughFunctionPointer); |
395 | } |
396 | } |
397 | } |
398 | |
399 | return {std::move(direct_map_kernel), std::move(indirect_map_kernel)}; |
400 | } |
401 | |
402 | struct LDSVariableReplacement { |
403 | GlobalVariable *SGV = nullptr; |
404 | DenseMap<GlobalVariable *, Constant *> LDSVarsToConstantGEP; |
405 | }; |
406 | |
407 | // Remap from an LDS global to the constantexpr GEP it has been moved to |
408 | // within each kernel. |
409 | // Build an array with an element for each kernel containing where the |
410 | // corresponding variable was remapped to. |
411 | |
412 | static Constant *getAddressesOfVariablesInKernel( |
413 | LLVMContext &Ctx, ArrayRef<GlobalVariable *> Variables, |
414 | DenseMap<GlobalVariable *, Constant *> &LDSVarsToConstantGEP) { |
415 | // Create a ConstantArray containing the address of each Variable within the |
416 | // kernel corresponding to LDSVarsToConstantGEP, or poison if that kernel |
417 | // does not allocate it |
418 | // TODO: Drop the ptrtoint conversion |
419 | |
420 | Type *I32 = Type::getInt32Ty(Ctx); |
421 | |
422 | ArrayType *KernelOffsetsType = ArrayType::get(I32, Variables.size()); |
423 | |
424 | SmallVector<Constant *> Elements; |
425 | for (size_t i = 0; i < Variables.size(); i++) { |
426 | GlobalVariable *GV = Variables[i]; |
427 | if (LDSVarsToConstantGEP.count(GV) != 0) { |
428 | auto elt = ConstantExpr::getPtrToInt(LDSVarsToConstantGEP[GV], I32); |
429 | Elements.push_back(elt); |
430 | } else { |
431 | Elements.push_back(PoisonValue::get(I32)); |
432 | } |
433 | } |
434 | return ConstantArray::get(KernelOffsetsType, Elements); |
435 | } |
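// For example (sketch): a kernel that allocates V0 but not V1 contributes the
// row { ptrtoint(<gep to V0 in its struct>), i32 poison } to the table built
// below.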
436 | |
437 | static GlobalVariable *buildLookupTable( |
438 | Module &M, ArrayRef<GlobalVariable *> Variables, |
439 | ArrayRef<Function *> kernels, |
440 | DenseMap<Function *, LDSVariableReplacement> &KernelToReplacement) { |
441 | if (Variables.empty()) { |
442 | return nullptr; |
443 | } |
444 | LLVMContext &Ctx = M.getContext(); |
445 | |
446 | const size_t NumberVariables = Variables.size(); |
447 | const size_t NumberKernels = kernels.size(); |
448 | |
449 | ArrayType *KernelOffsetsType = |
450 | ArrayType::get(Type::getInt32Ty(Ctx), NumberVariables); |
451 | |
452 | ArrayType *AllKernelsOffsetsType = |
453 | ArrayType::get(KernelOffsetsType, NumberKernels); |
454 | |
455 | std::vector<Constant *> overallConstantExprElts(NumberKernels); |
456 | for (size_t i = 0; i < NumberKernels; i++) { |
457 | LDSVariableReplacement Replacement = KernelToReplacement[kernels[i]]; |
458 | overallConstantExprElts[i] = getAddressesOfVariablesInKernel( |
459 | Ctx, Variables, Replacement.LDSVarsToConstantGEP); |
460 | } |
461 | |
462 | Constant *init = |
463 | ConstantArray::get(AllKernelsOffsetsType, overallConstantExprElts); |
464 | |
465 | return new GlobalVariable( |
466 | M, AllKernelsOffsetsType, true, GlobalValue::InternalLinkage, init, |
467 | "llvm.amdgcn.lds.offset.table", nullptr, GlobalValue::NotThreadLocal, |
468 | AMDGPUAS::CONSTANT_ADDRESS); |
469 | } |
470 | |
471 | void replaceUsesInInstructionsWithTableLookup( |
472 | Module &M, ArrayRef<GlobalVariable *> ModuleScopeVariables, |
473 | GlobalVariable *LookupTable) { |
474 | |
475 | LLVMContext &Ctx = M.getContext(); |
476 | IRBuilder<> Builder(Ctx); |
477 | Type *I32 = Type::getInt32Ty(Ctx); |
478 | |
480 | for (size_t Index = 0; Index < ModuleScopeVariables.size(); Index++) { |
481 | auto *GV = ModuleScopeVariables[Index]; |
482 | |
483 | for (Use &U : make_early_inc_range(GV->uses())) { |
484 | auto *I = dyn_cast<Instruction>(U.getUser()); |
485 | if (!I) |
486 | continue; |
487 | |
488 | Value *tableKernelIndex = |
489 | getTableLookupKernelIndex(M, I->getFunction()); |
490 | |
491 | // A use in a PHI must be materialised in the incoming block for that |
492 | // use, not immediately before the PHI itself. |
493 | if (auto *Phi = dyn_cast<PHINode>(I)) { |
494 | BasicBlock *BB = Phi->getIncomingBlock(U); |
495 | Builder.SetInsertPoint(&(*(BB->getFirstInsertionPt()))); |
496 | } else { |
497 | Builder.SetInsertPoint(I); |
498 | } |
499 | |
500 | Value *GEPIdx[3] = { |
501 | ConstantInt::get(I32, 0), |
502 | tableKernelIndex, |
503 | ConstantInt::get(I32, Index), |
504 | }; |
505 | |
507 | Value *Address = Builder.CreateInBoundsGEP( |
508 | LookupTable->getValueType(), LookupTable, GEPIdx, GV->getName()); |
509 | |
510 | Value *loaded = Builder.CreateLoad(I32, Address); |
511 | |
512 | Value *replacement = |
513 | Builder.CreateIntToPtr(loaded, GV->getType(), GV->getName()); |
514 | |
515 | U.set(replacement); |
516 | } |
517 | } |
518 | } |
519 | |
520 | static DenseSet<Function *> kernelsThatIndirectlyAccessAnyOfPassedVariables( |
521 | Module &M, LDSUsesInfoTy &LDSUsesInfo, |
522 | DenseSet<GlobalVariable *> const &VariableSet) { |
523 | |
524 | DenseSet<Function *> KernelSet; |
525 | |
526 | if (VariableSet.empty()) return KernelSet; |
527 | |
528 | for (Function &Func : M.functions()) { |
529 | if (Func.isDeclaration() || !isKernelLDS(&Func)) |
530 | continue; |
531 | for (GlobalVariable *GV : LDSUsesInfo.indirect_access[&Func]) { |
532 | if (VariableSet.contains(GV)) { |
533 | KernelSet.insert(&Func); |
534 | break; |
535 | } |
536 | } |
537 | } |
538 | |
539 | return KernelSet; |
540 | } |
541 | |
542 | static GlobalVariable * |
543 | chooseBestVariableForModuleStrategy(const DataLayout &DL, |
544 | VariableFunctionMap &LDSVars) { |
545 | // Find the global variable with the most indirect uses from kernels |
546 | |
547 | struct CandidateTy { |
548 | GlobalVariable *GV = nullptr; |
549 | size_t UserCount = 0; |
550 | size_t Size = 0; |
551 | |
552 | CandidateTy() = default; |
553 | |
554 | CandidateTy(GlobalVariable *GV, uint64_t UserCount, uint64_t AllocSize) |
555 | : GV(GV), UserCount(UserCount), Size(AllocSize) {} |
556 | |
557 | bool operator<(const CandidateTy &Other) const { |
558 | // Fewer users makes module scope variable less attractive |
559 | if (UserCount < Other.UserCount) { |
560 | return true; |
561 | } |
562 | if (UserCount > Other.UserCount) { |
563 | return false; |
564 | } |
565 | |
566 | // Bigger makes module scope variable less attractive |
567 | if (Size < Other.Size) { |
568 | return false; |
569 | } |
570 | |
571 | if (Size > Other.Size) { |
572 | return true; |
573 | } |
574 | |
575 | // Arbitrary but consistent |
576 | return GV->getName() < Other.GV->getName(); |
577 | } |
578 | }; |
579 | |
580 | CandidateTy MostUsed; |
581 | |
582 | for (auto &K : LDSVars) { |
583 | GlobalVariable *GV = K.first; |
584 | if (K.second.size() <= 1) { |
585 | // A variable reachable by only one kernel is best lowered with kernel |
586 | // strategy |
587 | continue; |
588 | } |
589 | CandidateTy Candidate(GV, K.second.size(), |
590 | DL.getTypeAllocSize(GV->getValueType()).getFixedValue()); |
591 | if (MostUsed < Candidate) |
592 | MostUsed = Candidate; |
593 | } |
594 | |
595 | return MostUsed.GV; |
596 | } |
597 | |
598 | static void recordLDSAbsoluteAddress(Module *M, GlobalVariable *GV, |
599 | uint32_t Address) { |
600 | // Write the specified address into metadata where it can be retrieved by |
601 | // the assembler. Format is a half open range, [Address Address+1) |
602 | LLVMContext &Ctx = M->getContext(); |
603 | auto *IntTy = |
604 | M->getDataLayout().getIntPtrType(Ctx, AMDGPUAS::LOCAL_ADDRESS); |
605 | auto *MinC = ConstantAsMetadata::get(ConstantInt::get(IntTy, Address)); |
606 | auto *MaxC = ConstantAsMetadata::get(ConstantInt::get(IntTy, Address + 1)); |
607 | GV->setMetadata(LLVMContext::MD_absolute_symbol, |
608 | MDNode::get(Ctx, {MinC, MaxC})); |
609 | } |
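// For example, recording address 0 yields metadata of the form (sketch; LDS
// pointers are 32 bit, so IntTy is i32):
//   @var = ... !absolute_symbol !0
//   !0 = !{i32 0, i32 1}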
610 | |
611 | DenseMap<Function *, Value *> tableKernelIndexCache; |
612 | Value *getTableLookupKernelIndex(Module &M, Function *F) { |
613 | // Accesses from a function use the amdgcn_lds_kernel_id intrinsic which |
614 | // lowers to a read from a live in register. Emit it once in the entry |
615 | // block to spare deduplicating it later. |
616 | if (tableKernelIndexCache.count(F) == 0) { |
617 | LLVMContext &Ctx = M.getContext(); |
618 | IRBuilder<> Builder(Ctx); |
619 | FunctionType *FTy = FunctionType::get(Type::getInt32Ty(Ctx), {}); |
620 | Function *Decl = |
621 | Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_lds_kernel_id, {}); |
622 | |
623 | BasicBlock::iterator it = |
624 | F->getEntryBlock().getFirstNonPHIOrDbgOrAlloca(); |
625 | Instruction &i = *it; |
626 | Builder.SetInsertPoint(&i); |
627 | |
628 | tableKernelIndexCache[F] = Builder.CreateCall(FTy, Decl, {}); |
629 | } |
630 | |
631 | return tableKernelIndexCache[F]; |
632 | } |
633 | |
634 | std::vector<Function *> assignLDSKernelIDToEachKernel( |
635 | Module *M, DenseSet<Function *> const &KernelsThatAllocateTableLDS) { |
636 | // Associate kernels in the set with an arbitrary but reproducible order and |
637 | // annotate them with that order in metadata. This metadata is recognised by |
638 | // the backend and lowered to an SGPR which can be read from using |
639 | // amdgcn_lds_kernel_id. |
640 | |
641 | std::vector<Function *> OrderedKernels; |
642 | |
643 | for (Function &Func : M->functions()) { |
644 | if (Func.isDeclaration()) |
645 | continue; |
646 | if (!isKernelLDS(&Func)) |
647 | continue; |
648 | |
649 | if (KernelsThatAllocateTableLDS.contains(&Func)) { |
650 | assert(Func.hasName()); // else fatal error earlier |
651 | OrderedKernels.push_back(&Func); |
652 | } |
653 | } |
654 | |
655 | // Put them in an arbitrary but reproducible order |
656 | llvm::sort(OrderedKernels.begin(), OrderedKernels.end(), |
657 | [](const Function *lhs, const Function *rhs) -> bool { |
658 | return lhs->getName() < rhs->getName(); |
659 | }); |
660 | |
661 | // Annotate the kernels with their order in this vector |
662 | LLVMContext &Ctx = M->getContext(); |
663 | IRBuilder<> Builder(Ctx); |
664 | |
665 | if (OrderedKernels.size() > UINT32_MAX) { |
666 | // 32 bit keeps it in one SGPR. > 2**32 kernels won't fit on the GPU |
667 | report_fatal_error("Unimplemented LDS lowering for > 2**32 kernels"); |
668 | } |
669 | |
670 | for (size_t i = 0; i < OrderedKernels.size(); i++) { |
671 | Metadata *AttrMDArgs[1] = { |
672 | ConstantAsMetadata::get(Builder.getInt32(i)), |
673 | }; |
674 | OrderedKernels[i]->setMetadata("llvm.amdgcn.lds.kernel.id", |
675 | MDNode::get(Ctx, AttrMDArgs)); |
676 | |
677 | } |
678 | |
680 | return OrderedKernels; |
681 | } |
682 | |
683 | bool runOnModule(Module &M) override { |
684 | LLVMContext &Ctx = M.getContext(); |
685 | CallGraph CG = CallGraph(M); |
686 | superAlignLDSGlobals(M); |
687 | |
688 | eliminateConstantExprUsesOfLDSFromAllInstructions(M); |
689 | |
690 | bool Changed = true; // todo: narrow this down |
691 | |
692 | // For each kernel, what variables does it access directly or through |
693 | // callees |
694 | LDSUsesInfoTy LDSUsesInfo = getTransitiveUsesOfLDS(CG, M); |
695 | |
696 | // For each variable accessed through callees, which kernels access it |
697 | VariableFunctionMap LDSToKernelsThatNeedToAccessItIndirectly; |
698 | for (auto &K : LDSUsesInfo.indirect_access) { |
699 | Function *F = K.first; |
700 | assert(isKernelLDS(F)); |
701 | for (GlobalVariable *GV : K.second) { |
702 | LDSToKernelsThatNeedToAccessItIndirectly[GV].insert(F); |
703 | } |
704 | } |
705 | |
706 | // Partition variables accessed indirectly into the different strategies |
707 | DenseSet<GlobalVariable *> ModuleScopeVariables; |
708 | DenseSet<GlobalVariable *> TableLookupVariables; |
709 | DenseSet<GlobalVariable *> KernelAccessVariables; |
710 | |
711 | { |
712 | GlobalVariable *HybridModuleRoot = |
713 | LoweringKindLoc != LoweringKind::hybrid |
714 | ? nullptr |
715 | : chooseBestVariableForModuleStrategy( |
716 | M.getDataLayout(), |
717 | LDSToKernelsThatNeedToAccessItIndirectly); |
718 | |
719 | DenseSet<Function *> const EmptySet; |
720 | DenseSet<Function *> const &HybridModuleRootKernels = |
721 | HybridModuleRoot |
722 | ? LDSToKernelsThatNeedToAccessItIndirectly[HybridModuleRoot] |
723 | : EmptySet; |
724 | |
725 | for (auto &K : LDSToKernelsThatNeedToAccessItIndirectly) { |
726 | // Each iteration of this loop assigns exactly one global variable to |
727 | // exactly one of the implementation strategies. |
728 | |
729 | GlobalVariable *GV = K.first; |
730 | assert(AMDGPU::isLDSVariableToLower(*GV)); |
731 | assert(K.second.size() != 0); |
732 | |
733 | switch (LoweringKindLoc) { |
734 | case LoweringKind::module: |
735 | ModuleScopeVariables.insert(GV); |
736 | break; |
737 | |
738 | case LoweringKind::table: |
739 | TableLookupVariables.insert(GV); |
740 | break; |
741 | |
742 | case LoweringKind::kernel: |
743 | if (K.second.size() == 1) { |
744 | KernelAccessVariables.insert(GV); |
745 | } else { |
746 | report_fatal_error( |
747 | "cannot lower LDS '" + GV->getName() + |
748 | "' to kernel access as it is reachable from multiple kernels"); |
749 | } |
750 | break; |
751 | |
752 | case LoweringKind::hybrid: { |
753 | if (GV == HybridModuleRoot) { |
754 | assert(K.second.size() != 1); |
755 | ModuleScopeVariables.insert(GV); |
756 | } else if (K.second.size() == 1) { |
757 | KernelAccessVariables.insert(GV); |
758 | } else if (set_is_subset(K.second, HybridModuleRootKernels)) { |
759 | ModuleScopeVariables.insert(GV); |
760 | } else { |
761 | TableLookupVariables.insert(GV); |
762 | } |
763 | break; |
764 | } |
765 | } |
766 | } |
767 | |
768 | // All LDS variables accessed indirectly have now been partitioned into |
769 | // the distinct lowering strategies. |
770 | assert(ModuleScopeVariables.size() + TableLookupVariables.size() + |
771 | KernelAccessVariables.size() == |
772 | LDSToKernelsThatNeedToAccessItIndirectly.size()); |
773 | } |
774 | |
775 | // If the kernel accesses a variable that is going to be stored in the |
776 | // module instance through a call then that kernel needs to allocate the |
777 | // module instance |
778 | DenseSet<Function *> KernelsThatAllocateModuleLDS = |
779 | kernelsThatIndirectlyAccessAnyOfPassedVariables(M, LDSUsesInfo, |
780 | ModuleScopeVariables); |
781 | DenseSet<Function *> KernelsThatAllocateTableLDS = |
782 | kernelsThatIndirectlyAccessAnyOfPassedVariables(M, LDSUsesInfo, |
783 | TableLookupVariables); |
784 | |
785 | GlobalVariable *MaybeModuleScopeStruct = nullptr; |
786 | if (!ModuleScopeVariables.empty()) { |
787 | LDSVariableReplacement ModuleScopeReplacement = |
788 | createLDSVariableReplacement(M, "llvm.amdgcn.module.lds", |
789 | ModuleScopeVariables); |
790 | MaybeModuleScopeStruct = ModuleScopeReplacement.SGV; |
791 | appendToCompilerUsed(M, |
792 | {static_cast<GlobalValue *>( |
793 | ConstantExpr::getPointerBitCastOrAddrSpaceCast( |
794 | cast<Constant>(ModuleScopeReplacement.SGV), |
795 | Type::getInt8PtrTy(Ctx)))}); |
796 | |
797 | // module.lds will be allocated at zero in any kernel that allocates it |
798 | recordLDSAbsoluteAddress(&M, ModuleScopeReplacement.SGV, 0); |
799 | |
800 | // historic |
801 | removeLocalVarsFromUsedLists(M, ModuleScopeVariables); |
802 | |
803 | // Replace all uses of module scope variable from non-kernel functions |
804 | replaceLDSVariablesWithStruct( |
805 | M, ModuleScopeVariables, ModuleScopeReplacement, [&](Use &U) { |
806 | Instruction *I = dyn_cast<Instruction>(U.getUser()); |
807 | if (!I) { |
808 | return false; |
809 | } |
810 | Function *F = I->getFunction(); |
811 | return !isKernelLDS(F); |
812 | }); |
813 | |
814 | // Replace uses of module scope variable from kernel functions that |
815 | // allocate the module scope variable, otherwise leave them unchanged |
816 | // Record on each kernel whether the module scope global is used by it |
817 | |
818 | LLVMContext &Ctx = M.getContext(); |
819 | IRBuilder<> Builder(Ctx); |
820 | |
821 | for (Function &Func : M.functions()) { |
822 | if (Func.isDeclaration() || !isKernelLDS(&Func)) |
823 | continue; |
824 | |
825 | if (KernelsThatAllocateModuleLDS.contains(&Func)) { |
826 | replaceLDSVariablesWithStruct( |
827 | M, ModuleScopeVariables, ModuleScopeReplacement, [&](Use &U) { |
828 | Instruction *I = dyn_cast<Instruction>(U.getUser()); |
829 | if (!I) { |
830 | return false; |
831 | } |
832 | Function *F = I->getFunction(); |
833 | return F == &Func; |
834 | }); |
835 | |
836 | markUsedByKernel(Builder, &Func, ModuleScopeReplacement.SGV); |
837 | |
838 | } else { |
839 | Func.addFnAttr("amdgpu-elide-module-lds"); |
840 | } |
841 | } |
842 | } |
843 | |
844 | // Create a struct for each kernel for the non-module-scope variables |
845 | DenseMap<Function *, LDSVariableReplacement> KernelToReplacement; |
846 | for (Function &Func : M.functions()) { |
847 | if (Func.isDeclaration() || !isKernelLDS(&Func)) |
848 | continue; |
849 | |
850 | DenseSet<GlobalVariable *> KernelUsedVariables; |
851 | // Allocate variables used directly by this kernel in the struct, to get |
852 | // alignment aware allocation and a predictable frame size. |
853 | for (auto &v : LDSUsesInfo.direct_access[&Func]) { |
854 | KernelUsedVariables.insert(v); |
855 | } |
856 | |
857 | // Also allocate variables accessed indirectly, so that a lookup of this |
858 | // struct instance can find them from nested functions. |
859 | for (auto &v : LDSUsesInfo.indirect_access[&Func]) { |
860 | KernelUsedVariables.insert(v); |
861 | } |
862 | |
863 | // Variables allocated in module lds must all resolve to that struct, |
864 | // not to the per-kernel instance. |
865 | if (KernelsThatAllocateModuleLDS.contains(&Func)) { |
866 | for (GlobalVariable *v : ModuleScopeVariables) { |
867 | KernelUsedVariables.erase(v); |
868 | } |
869 | } |
870 | |
871 | if (KernelUsedVariables.empty()) { |
872 | // Either used no LDS, or the LDS it used was all in the module struct |
873 | continue; |
874 | } |
875 | |
876 | // The association between kernel function and LDS struct is done by |
877 | // symbol name, which only works if the function in question has a |
878 | // name. This is not expected to be a problem in practice as kernels |
879 | // are called by name, making anonymous ones (which are named by the |
880 | // backend) difficult to use. This does mean that llvm test cases need |
881 | // to name the kernels. |
882 | if (!Func.hasName()) { |
883 | report_fatal_error("Anonymous kernels cannot use LDS variables"); |
884 | } |
885 | |
886 | std::string VarName = |
887 | (Twine("llvm.amdgcn.kernel.") + Func.getName() + ".lds").str(); |
888 | |
889 | auto Replacement = |
890 | createLDSVariableReplacement(M, VarName, KernelUsedVariables); |
891 | |
892 | // This struct is allocated at a predictable address that can be |
893 | // calculated now, recorded in metadata then used to lower references to |
894 | // it during codegen. |
895 | { |
896 | // frame layout, starting from 0 |
897 | //{ |
898 | // module.lds |
899 | // alignment padding |
900 | // kernel instance |
901 | //} |
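// Arithmetic sketch: a 12 byte module.lds followed by a kernel struct
// with alignment 8 places the kernel struct at alignTo(12, 8) == 16.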
902 | |
903 | if (!MaybeModuleScopeStruct || |
904 | Func.hasFnAttribute("amdgpu-elide-module-lds")) { |
905 | // There's no module.lds for this kernel so this replacement struct |
906 | // goes first |
907 | recordLDSAbsoluteAddress(&M, Replacement.SGV, 0); |
908 | } else { |
909 | const DataLayout &DL = M.getDataLayout(); |
910 | TypeSize ModuleSize = |
911 | DL.getTypeAllocSize(MaybeModuleScopeStruct->getValueType()); |
912 | GlobalVariable *KernelStruct = Replacement.SGV; |
913 | Align KernelAlign = AMDGPU::getAlign(DL, KernelStruct); |
914 | recordLDSAbsoluteAddress(&M, Replacement.SGV, |
915 | alignTo(ModuleSize, KernelAlign)); |
916 | } |
917 | } |
918 | |
919 | // Removing these from the used lists preserves existing codegen. |
920 | removeLocalVarsFromUsedLists(M, KernelUsedVariables); |
921 | KernelToReplacement[&Func] = Replacement; |
922 | |
923 | // Rewrite uses within kernel to the new struct |
924 | replaceLDSVariablesWithStruct( |
925 | M, KernelUsedVariables, Replacement, [&Func](Use &U) { |
926 | Instruction *I = dyn_cast<Instruction>(U.getUser()); |
927 | return I && I->getFunction() == &Func; |
928 | }); |
929 | } |
930 | |
931 | // Lower zero cost accesses to the kernel instances just created |
932 | for (auto &GV : KernelAccessVariables) { |
933 | auto &funcs = LDSToKernelsThatNeedToAccessItIndirectly[GV]; |
934 | assert(funcs.size() == 1); // Only one kernel can access it |
935 | LDSVariableReplacement Replacement = |
936 | KernelToReplacement[*(funcs.begin())]; |
937 | |
938 | DenseSet<GlobalVariable *> Vec; |
939 | Vec.insert(GV); |
940 | |
941 | // TODO: Looks like a latent bug, Replacement may not be marked |
942 | // UsedByKernel here |
943 | replaceLDSVariablesWithStruct(M, Vec, Replacement, [](Use &U) { |
944 | return isa<Instruction>(U.getUser()); |
945 | }); |
946 | } |
947 | |
948 | if (!KernelsThatAllocateTableLDS.empty()) { |
949 | LLVMContext &Ctx = M.getContext(); |
950 | IRBuilder<> Builder(Ctx); |
951 | |
952 | // The ith element of this vector is kernel id i |
953 | std::vector<Function *> OrderedKernels = |
954 | assignLDSKernelIDToEachKernel(&M, KernelsThatAllocateTableLDS); |
955 | |
956 | for (size_t i = 0; i < OrderedKernels.size(); i++) { |
957 | markUsedByKernel(Builder, OrderedKernels[i], |
958 | KernelToReplacement[OrderedKernels[i]].SGV); |
959 | } |
960 | |
961 | // The order must be consistent between lookup table and accesses to |
962 | // lookup table |
963 | std::vector<GlobalVariable *> TableLookupVariablesOrdered( |
964 | TableLookupVariables.begin(), TableLookupVariables.end()); |
965 | llvm::sort(TableLookupVariablesOrdered.begin(), |
966 | TableLookupVariablesOrdered.end(), |
967 | [](const GlobalVariable *lhs, const GlobalVariable *rhs) { |
968 | return lhs->getName() < rhs->getName(); |
969 | }); |
970 | |
971 | GlobalVariable *LookupTable = buildLookupTable( |
972 | M, TableLookupVariablesOrdered, OrderedKernels, KernelToReplacement); |
973 | replaceUsesInInstructionsWithTableLookup(M, TableLookupVariablesOrdered, |
974 | LookupTable); |
975 | } |
976 | |
977 | for (auto &GV : make_early_inc_range(M.globals())) |
978 | if (AMDGPU::isLDSVariableToLower(GV)) { |
979 | // probably want to remove from used lists |
980 | GV.removeDeadConstantUsers(); |
981 | if (GV.use_empty()) |
982 | GV.eraseFromParent(); |
983 | } |
984 | |
985 | return Changed; |
986 | } |
987 | |
988 | private: |
989 | // Increase the alignment of LDS globals if necessary to maximise the chance |
990 | // that we can use aligned LDS instructions to access them. |
991 | static bool superAlignLDSGlobals(Module &M) { |
992 | const DataLayout &DL = M.getDataLayout(); |
993 | bool Changed = false; |
994 | if (!SuperAlignLDSGlobals) { |
995 | return Changed; |
996 | } |
997 | |
998 | for (auto &GV : M.globals()) { |
999 | if (GV.getType()->getPointerAddressSpace() != AMDGPUAS::LOCAL_ADDRESS) { |
1000 | // Only changing alignment of LDS variables |
1001 | continue; |
1002 | } |
1003 | if (!GV.hasInitializer()) { |
1004 | // cuda/hip extern __shared__ variable, leave alignment alone |
1005 | continue; |
1006 | } |
1007 | |
1008 | Align Alignment = AMDGPU::getAlign(DL, &GV); |
1009 | TypeSize GVSize = DL.getTypeAllocSize(GV.getValueType()); |
1010 | |
1011 | if (GVSize > 8) { |
1012 | // We might want to use a b96 or b128 load/store |
1013 | Alignment = std::max(Alignment, Align(16)); |
1014 | } else if (GVSize > 4) { |
1015 | // We might want to use a b64 load/store |
1016 | Alignment = std::max(Alignment, Align(8)); |
1017 | } else if (GVSize > 2) { |
1018 | // We might want to use a b32 load/store |
1019 | Alignment = std::max(Alignment, Align(4)); |
1020 | } else if (GVSize > 1) { |
1021 | // We might want to use a b16 load/store |
1022 | Alignment = std::max(Alignment, Align(2)); |
1023 | } |
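// e.g. a 6 byte variable, naturally aligned to 2, is raised to align 8
// here so a single b64 access is possible (sketch of the intent).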
1024 | |
1025 | if (Alignment != AMDGPU::getAlign(DL, &GV)) { |
1026 | Changed = true; |
1027 | GV.setAlignment(Alignment); |
1028 | } |
1029 | } |
1030 | return Changed; |
1031 | } |
1032 | |
1033 | static LDSVariableReplacement createLDSVariableReplacement( |
1034 | Module &M, std::string VarName, |
1035 | DenseSet<GlobalVariable *> const &LDSVarsToTransform) { |
1036 | // Create a struct instance containing LDSVarsToTransform and map from those |
1037 | // variables to ConstantExprGEP |
1038 | // Variables may be introduced to meet alignment requirements. No aliasing |
1039 | // metadata is useful for these as they have no uses. Erased before return. |
1040 | |
1041 | LLVMContext &Ctx = M.getContext(); |
1042 | const DataLayout &DL = M.getDataLayout(); |
1043 | assert(!LDSVarsToTransform.empty()); |
1044 | |
1045 | SmallVector<OptimizedStructLayoutField, 8> LayoutFields; |
1046 | LayoutFields.reserve(LDSVarsToTransform.size()); |
1047 | { |
1048 | // The order of fields in this struct depends on the order of |
1049 | // variables in the argument, which varies when changing how they |
1050 | // are identified, leading to spurious test breakage. |
1051 | std::vector<GlobalVariable *> Sorted(LDSVarsToTransform.begin(), |
1052 | LDSVarsToTransform.end()); |
1053 | llvm::sort(Sorted.begin(), Sorted.end(), |
1054 | [](const GlobalVariable *lhs, const GlobalVariable *rhs) { |
1055 | return lhs->getName() < rhs->getName(); |
1056 | }); |
1057 | for (GlobalVariable *GV : Sorted) { |
1058 | OptimizedStructLayoutField F(GV, |
1059 | DL.getTypeAllocSize(GV->getValueType()), |
1060 | AMDGPU::getAlign(DL, GV)); |
1061 | LayoutFields.emplace_back(F); |
1062 | } |
1063 | } |
1064 | |
1065 | performOptimizedStructLayout(LayoutFields); |
1066 | |
1067 | std::vector<GlobalVariable *> LocalVars; |
1068 | BitVector IsPaddingField; |
1069 | LocalVars.reserve(LDSVarsToTransform.size()); // will be at least this large |
1070 | IsPaddingField.reserve(LDSVarsToTransform.size()); |
1071 | { |
1072 | uint64_t CurrentOffset = 0; |
1073 | for (size_t I = 0; I < LayoutFields.size(); I++) { |
1074 | GlobalVariable *FGV = static_cast<GlobalVariable *>( |
1075 | const_cast<void *>(LayoutFields[I].Id)); |
1076 | Align DataAlign = LayoutFields[I].Alignment; |
1077 | |
1078 | uint64_t DataAlignV = DataAlign.value(); |
1079 | if (uint64_t Rem = CurrentOffset % DataAlignV) { |
1080 | uint64_t Padding = DataAlignV - Rem; |
1081 | |
1082 | // Append an array of padding bytes to meet alignment requested |
1083 | // Note (o + (a - (o % a)) ) % a == 0 |
1084 | // (offset + Padding ) % align == 0 |
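// e.g. CurrentOffset == 5 with DataAlign == 4 appends 4 - (5 % 4) == 3
// padding bytes, so the next field lands at offset 8.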
1085 | |
1086 | Type *ATy = ArrayType::get(Type::getInt8Ty(Ctx), Padding); |
1087 | LocalVars.push_back(new GlobalVariable( |
1088 | M, ATy, false, GlobalValue::InternalLinkage, UndefValue::get(ATy), |
1089 | "", nullptr, GlobalValue::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS, |
1090 | false)); |
1091 | IsPaddingField.push_back(true); |
1092 | CurrentOffset += Padding; |
1093 | } |
1094 | |
1095 | LocalVars.push_back(FGV); |
1096 | IsPaddingField.push_back(false); |
1097 | CurrentOffset += LayoutFields[I].Size; |
1098 | } |
1099 | } |
1100 | |
1101 | std::vector<Type *> LocalVarTypes; |
1102 | LocalVarTypes.reserve(LocalVars.size()); |
1103 | std::transform( |
1104 | LocalVars.cbegin(), LocalVars.cend(), std::back_inserter(LocalVarTypes), |
1105 | [](const GlobalVariable *V) -> Type * { return V->getValueType(); }); |
1106 | |
1107 | StructType *LDSTy = StructType::create(Ctx, LocalVarTypes, VarName + ".t"); |
1108 | |
1109 | Align StructAlign = AMDGPU::getAlign(DL, LocalVars[0]); |
1110 | |
1111 | GlobalVariable *SGV = new GlobalVariable( |
1112 | M, LDSTy, false, GlobalValue::InternalLinkage, UndefValue::get(LDSTy), |
1113 | VarName, nullptr, GlobalValue::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS, |
1114 | false); |
1115 | SGV->setAlignment(StructAlign); |
1116 | |
1117 | DenseMap<GlobalVariable *, Constant *> Map; |
1118 | Type *I32 = Type::getInt32Ty(Ctx); |
1119 | for (size_t I = 0; I < LocalVars.size(); I++) { |
1120 | GlobalVariable *GV = LocalVars[I]; |
1121 | Constant *GEPIdx[] = {ConstantInt::get(I32, 0), ConstantInt::get(I32, I)}; |
1122 | Constant *GEP = ConstantExpr::getGetElementPtr(LDSTy, SGV, GEPIdx, true); |
1123 | if (IsPaddingField[I]) { |
1124 | assert(GV->use_empty()); |
1125 | GV->eraseFromParent(); |
1126 | } else { |
1127 | Map[GV] = GEP; |
1128 | } |
1129 | } |
1130 | assert(Map.size() == LDSVarsToTransform.size()); |
1131 | return {SGV, std::move(Map)}; |
1132 | } |
1133 | |
1134 | template <typename PredicateTy> |
1135 | void replaceLDSVariablesWithStruct( |
1136 | Module &M, DenseSet<GlobalVariable *> const &LDSVarsToTransformArg, |
1137 | LDSVariableReplacement Replacement, PredicateTy Predicate) { |
1138 | LLVMContext &Ctx = M.getContext(); |
1139 | const DataLayout &DL = M.getDataLayout(); |
1140 | |
1141 | // A hack... we need to insert the aliasing info in a predictable order for |
1142 | // lit tests. Would like to have them in a stable order already, ideally the |
1143 | // same order they get allocated, which might mean an ordered set container |
1144 | std::vector<GlobalVariable *> LDSVarsToTransform( |
1145 | LDSVarsToTransformArg.begin(), LDSVarsToTransformArg.end()); |
1146 | llvm::sort(LDSVarsToTransform.begin(), LDSVarsToTransform.end(), |
1147 | [](const GlobalVariable *lhs, const GlobalVariable *rhs) { |
1148 | return lhs->getName() < rhs->getName(); |
1149 | }); |
1150 | |
1151 | // Create alias.scope metadata and the noalias lists. Each field in the new |
1152 | // structure does not alias any of the other fields. |
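// For example (sketch): with two fields A and B, memory accesses through
// A's GEP get alias.scope !{!A} and noalias !{!B}, and symmetrically for B.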
1153 | SmallVector<MDNode *> AliasScopes; |
1154 | SmallVector<Metadata *> NoAliasList; |
1155 | const size_t NumberVars = LDSVarsToTransform.size(); |
1156 | if (NumberVars > 1) { |
1157 | MDBuilder MDB(Ctx); |
1158 | AliasScopes.reserve(NumberVars); |
1159 | MDNode *Domain = MDB.createAnonymousAliasScopeDomain(); |
1160 | for (size_t I = 0; I < NumberVars; I++) { |
1161 | MDNode *Scope = MDB.createAnonymousAliasScope(Domain); |
1162 | AliasScopes.push_back(Scope); |
1163 | } |
1164 | NoAliasList.append(&AliasScopes[1], AliasScopes.end()); |
1165 | } |
1166 | |
1167 | // Replace uses of ith variable with a constantexpr to the corresponding |
1168 | // field of the instance that will be allocated by AMDGPUMachineFunction |
1169 | for (size_t I = 0; I < NumberVars; I++) { |
1170 | GlobalVariable *GV = LDSVarsToTransform[I]; |
1171 | Constant *GEP = Replacement.LDSVarsToConstantGEP[GV]; |
1172 | |
1173 | GV->replaceUsesWithIf(GEP, Predicate); |
1174 | |
1175 | APInt APOff(DL.getIndexTypeSizeInBits(GEP->getType()), 0); |
1176 | GEP->stripAndAccumulateInBoundsConstantOffsets(DL, APOff); |
1177 | uint64_t Offset = APOff.getZExtValue(); |
1178 | |
1179 | Align A = |
1180 | commonAlignment(Replacement.SGV->getAlign().valueOrOne(), Offset); |
1181 | |
1182 | if (I) |
1183 | NoAliasList[I - 1] = AliasScopes[I - 1]; |
1184 | MDNode *NoAlias = |
1185 | NoAliasList.empty() ? nullptr : MDNode::get(Ctx, NoAliasList); |
1186 | MDNode *AliasScope = |
1187 | AliasScopes.empty() ? nullptr : MDNode::get(Ctx, {AliasScopes[I]}); |
1188 | |
1189 | refineUsesAlignmentAndAA(GEP, A, DL, AliasScope, NoAlias); |
1190 | } |
1191 | } |
1192 | |
1193 | void refineUsesAlignmentAndAA(Value *Ptr, Align A, const DataLayout &DL, |
1194 | MDNode *AliasScope, MDNode *NoAlias, |
1195 | unsigned MaxDepth = 5) { |
1196 | if (!MaxDepth || (A == 1 && !AliasScope)) |
1197 | return; |
1198 | |
1199 | for (User *U : Ptr->users()) { |
1200 | if (auto *I = dyn_cast<Instruction>(U)) { |
1201 | if (AliasScope && I->mayReadOrWriteMemory()) { |
1202 | MDNode *AS = I->getMetadata(LLVMContext::MD_alias_scope); |
1203 | AS = (AS ? MDNode::getMostGenericAliasScope(AS, AliasScope) |
1204 | : AliasScope); |
1205 | I->setMetadata(LLVMContext::MD_alias_scope, AS); |
1206 | |
1207 | MDNode *NA = I->getMetadata(LLVMContext::MD_noalias); |
1208 | NA = (NA ? MDNode::intersect(NA, NoAlias) : NoAlias); |
1209 | I->setMetadata(LLVMContext::MD_noalias, NA); |
1210 | } |
1211 | } |
1212 | |
1213 | if (auto *LI = dyn_cast<LoadInst>(U)) { |
1214 | LI->setAlignment(std::max(A, LI->getAlign())); |
1215 | continue; |
1216 | } |
1217 | if (auto *SI = dyn_cast<StoreInst>(U)) { |
1218 | if (SI->getPointerOperand() == Ptr) |
1219 | SI->setAlignment(std::max(A, SI->getAlign())); |
1220 | continue; |
1221 | } |
1222 | if (auto *AI = dyn_cast<AtomicRMWInst>(U)) { |
1223 | // atomicrmw cannot take a pointer as its value operand, but check the |
1224 | // pointer operand anyway in case that changes or we process a ConstantExpr. |
1225 | if (AI->getPointerOperand() == Ptr) |
1226 | AI->setAlignment(std::max(A, AI->getAlign())); |
1227 | continue; |
1228 | } |
1229 | if (auto *AI = dyn_cast<AtomicCmpXchgInst>(U)) { |
1230 | if (AI->getPointerOperand() == Ptr) |
1231 | AI->setAlignment(std::max(A, AI->getAlign())); |
1232 | continue; |
1233 | } |
1234 | if (auto *GEP = dyn_cast<GetElementPtrInst>(U)) { |
1235 | unsigned BitWidth = DL.getIndexTypeSizeInBits(GEP->getType()); |
1236 | APInt Off(BitWidth, 0); |
1237 | if (GEP->getPointerOperand() == Ptr) { |
1238 | Align GA; |
1239 | if (GEP->accumulateConstantOffset(DL, Off)) |
1240 | GA = commonAlignment(A, Off.getLimitedValue()); |
1241 | refineUsesAlignmentAndAA(GEP, GA, DL, AliasScope, NoAlias, |
1242 | MaxDepth - 1); |
1243 | } |
1244 | continue; |
1245 | } |
1246 | if (auto *I = dyn_cast<Instruction>(U)) { |
1247 | if (I->getOpcode() == Instruction::BitCast || |
1248 | I->getOpcode() == Instruction::AddrSpaceCast) |
1249 | refineUsesAlignmentAndAA(I, A, DL, AliasScope, NoAlias, MaxDepth - 1); |
1250 | } |
1251 | } |
1252 | } |
1253 | }; |
1254 | |
1255 | } // namespace |
1256 | char AMDGPULowerModuleLDS::ID = 0; |
1257 | |
1258 | char &llvm::AMDGPULowerModuleLDSID = AMDGPULowerModuleLDS::ID; |
1259 | |
1260 | INITIALIZE_PASS(AMDGPULowerModuleLDS, DEBUG_TYPE, |
1261 | "Lower uses of LDS variables from non-kernel functions", false, |
1262 | false) |
1263 | |
1264 | ModulePass *llvm::createAMDGPULowerModuleLDSPass() { |
1265 | return new AMDGPULowerModuleLDS(); |
1266 | } |
1267 | |
1268 | PreservedAnalyses AMDGPULowerModuleLDSPass::run(Module &M, |
1269 | ModuleAnalysisManager &) { |
1270 | return AMDGPULowerModuleLDS().runOnModule(M) ? PreservedAnalyses::none() |
1271 | : PreservedAnalyses::all(); |
1272 | } |