LLVM 20.0.0git
AMD Kernel Code Object (amd_kernel_code_t).
#include "Target/AMDGPU/AMDKernelCodeT.h"
Public Attributes | |
uint32_t | amd_kernel_code_version_major |
uint32_t | amd_kernel_code_version_minor |
uint16_t | amd_machine_kind |
uint16_t | amd_machine_version_major |
uint16_t | amd_machine_version_minor |
uint16_t | amd_machine_version_stepping |
int64_t | kernel_code_entry_byte_offset |
Byte offset (possibly negative) from start of amd_kernel_code_t object to kernel's entry point instruction. | |
int64_t | kernel_code_prefetch_byte_offset |
Range of bytes to consider prefetching expressed as an offset and size. | |
uint64_t | kernel_code_prefetch_byte_size |
uint64_t | reserved0 |
Reserved. Must be 0. | |
uint64_t | compute_pgm_resource_registers |
Shader program settings for CS. | |
uint32_t | code_properties |
Code properties. | |
uint32_t | workitem_private_segment_byte_size |
The amount of memory required for the combined private, spill and arg segments for a work-item in bytes. | |
uint32_t | workgroup_group_segment_byte_size |
The amount of group segment memory required by a work-group in bytes. | |
uint32_t | gds_segment_byte_size |
Number of bytes of GDS required by kernel dispatch. | |
uint64_t | kernarg_segment_byte_size |
The size in bytes of the kernarg segment that holds the values of the arguments to the kernel. | |
uint32_t | workgroup_fbarrier_count |
Number of fbarriers used in the kernel and all functions it calls. | |
uint16_t | wavefront_sgpr_count |
Number of scalar registers used by a wavefront. | |
uint16_t | workitem_vgpr_count |
Number of vector registers used by each work-item. | |
uint16_t | reserved_vgpr_first |
If reserved_vgpr_count is 0 then must be 0. | |
uint16_t | reserved_vgpr_count |
The number of consecutive VGPRs reserved by the client. | |
uint16_t | reserved_sgpr_first |
If reserved_sgpr_count is 0 then must be 0. | |
uint16_t | reserved_sgpr_count |
The number of consecutive SGPRs reserved by the client. | |
uint16_t | debug_wavefront_private_segment_offset_sgpr |
If is_debug_supported is 0 then must be 0. | |
uint16_t | debug_private_segment_buffer_sgpr |
If is_debug_supported is 0 then must be 0. | |
uint8_t | kernarg_segment_alignment |
The maximum byte alignment of variables used by the kernel in the specified memory segment. | |
uint8_t | group_segment_alignment |
uint8_t | private_segment_alignment |
uint8_t | wavefront_size |
Wavefront size expressed as a power of two. | |
int32_t | call_convention |
uint8_t | reserved3 [12] |
uint64_t | runtime_loader_kernel_symbol |
uint64_t | control_directives [16] |
AMD Kernel Code Object (amd_kernel_code_t).
GPU CP uses the AMD Kernel Code Object to set up the hardware to execute the kernel dispatch.
Initial Kernel Register State.
Initial kernel register state will be set up by CP/SPI prior to the start of execution of every wavefront. This is limited by the constraints of the current hardware.
The order of the SGPR registers is defined, but the Finalizer can specify which ones are actually set up in the amd_kernel_code_t object using the enable_sgpr_* bit fields. The register numbers used for enabled registers are dense starting at SGPR0: the first enabled register is SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have an SGPR number.
The initial SGPRs comprise up to 16 User SGPRs that are set up by CP and apply to all waves of the grid. It is possible to specify more than 16 User SGPRs using the enable_sgpr_* bit fields, in which case only the first 16 are actually initialized. These are then immediately followed by the System SGPRs that are set up by ADC/SPI and can have different values for each wave of the grid dispatch.
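The dense numbering rule can be illustrated with a minimal sketch. The types and the function `assignUserSgprs` are hypothetical and are not part of AMDKernelCodeT.h; the per-field register counts are the ones listed in this section, and the 16-register cap follows the paragraph above.

```cpp
#include <cstdint>

// Hypothetical indices for the User SGPR fields, in the order this section
// lists them. These names are illustrative, not LLVM identifiers.
enum UserSgprField {
  PrivateSegmentBuffer, DispatchPtr, QueuePtr, KernargSegmentPtr, DispatchId,
  FlatScratchInit, PrivateSegmentSize, GridWorkgroupCountX,
  GridWorkgroupCountY, GridWorkgroupCountZ, NumUserSgprFields
};

// Hypothetical mirror of the enable_sgpr_* bits for the User SGPRs.
struct UserSgprEnables {
  bool enabled[NumUserSgprFields];
};

struct UserSgprLayout {
  int first[NumUserSgprFields]; // first SGPR number of a field, -1 if disabled
  int requested;                // User SGPRs requested by the enable bits
  int initialized;              // at most 16 are actually initialized
};

constexpr UserSgprLayout assignUserSgprs(UserSgprEnables e) {
  // Number of SGPRs each field occupies, as listed in this section.
  const int width[NumUserSgprFields] = {4, 2, 2, 2, 2, 2, 1, 1, 1, 1};
  UserSgprLayout out{};
  int next = 0; // numbering is dense and starts at SGPR0
  for (int i = 0; i < NumUserSgprFields; ++i) {
    out.first[i] = e.enabled[i] ? next : -1;
    if (e.enabled[i])
      next += width[i];
  }
  out.requested = next;
  out.initialized = next < 16 ? next : 16;
  return out;
}
```

With only the Private Segment Buffer and Kernarg Segment Ptr enabled, the sketch places them at SGPR0..3 and SGPR4..5; the System SGPRs would follow from SGPR6.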
SGPR register initial state is defined as follows:
Private Segment Buffer (enable_sgpr_private_segment_buffer): Number of User SGPR registers: 4. V# that can be used, together with Scratch Wave Offset as an offset, to access the Private/Spill/Arg segments using a segment address. It must be set as follows:
Dispatch Ptr (enable_sgpr_dispatch_ptr): Number of User SGPR registers: 2. 64 bit address of AQL dispatch packet for kernel actually executing.
Queue Ptr (enable_sgpr_queue_ptr): Number of User SGPR registers: 2. 64 bit address of AmdQueue object for AQL queue on which the dispatch packet was queued.
Kernarg Segment Ptr (enable_sgpr_kernarg_segment_ptr): Number of User SGPR registers: 2. 64 bit address of Kernarg segment. This is directly copied from the kernargPtr in the dispatch packet. Having CP load it once avoids loading it at the beginning of every wavefront.
Dispatch Id (enable_sgpr_dispatch_id): Number of User SGPR registers: 2. 64 bit Dispatch ID of the dispatch packet being executed.
Flat Scratch Init (enable_sgpr_flat_scratch_init): Number of User SGPR registers: 2.
For CI/VI: The first SGPR is a 32 bit byte offset from SH_MEM_HIDDEN_PRIVATE_BASE to base of memory for scratch for this dispatch. This is the same offset used in computing the Scratch Segment Buffer base address. The value of Scratch Wave Offset must be added by the kernel code and moved to SGPRn-4 for use as the FLAT SCRATCH BASE in flat memory instructions.
The second SGPR is 32 bit byte size of a single work-item's scratch memory usage. This is directly loaded from the dispatch packet Private Segment Byte Size and rounded up to a multiple of DWORD.
The kernel code must move to SGPRn-3 for use as the FLAT SCRATCH SIZE in flat memory instructions. Having CP load it once avoids loading it at the beginning of every wavefront.
For PI: This is the 64 bit base address of the scratch backing memory allocated by CP for this dispatch.
Private Segment Size (enable_sgpr_private_segment_size): Number of User SGPR registers: 1. The 32 bit byte size of a single work-item's scratch memory allocation. This is the value from the dispatch packet Private Segment Byte Size, rounded up by CP to a multiple of DWORD.
Having CP load it once avoids loading it at the beginning of every wavefront.
Grid Work-Group Count X (enable_sgpr_grid_workgroup_count_x): Number of User SGPR registers: 1. 32 bit count of the number of work-groups in the X dimension for the grid being executed. Computed from the fields in the HsaDispatchPacket as ((gridSize.x+workgroupSize.x-1)/workgroupSize.x).
Grid Work-Group Count Y (enable_sgpr_grid_workgroup_count_y): Number of User SGPR registers: 1. 32 bit count of the number of work-groups in the Y dimension for the grid being executed. Computed from the fields in the HsaDispatchPacket as ((gridSize.y+workgroupSize.y-1)/workgroupSize.y).
Only initialized if fewer than 16 previous SGPRs are initialized.
Grid Work-Group Count Z (enable_sgpr_grid_workgroup_count_z): Number of User SGPR registers: 1. 32 bit count of the number of work-groups in the Z dimension for the grid being executed. Computed from the fields in the HsaDispatchPacket as ((gridSize.z+workgroupSize.z-1)/workgroupSize.z).
Only initialized if fewer than 16 previous SGPRs are initialized.
Work-Group Id X (enable_sgpr_workgroup_id_x): Number of System SGPR registers: 1. 32 bit work group id in X dimension of grid for wavefront. Always present.
Work-Group Id Y (enable_sgpr_workgroup_id_y): Number of System SGPR registers: 1. 32 bit work group id in Y dimension of grid for wavefront.
Work-Group Id Z (enable_sgpr_workgroup_id_z): Number of System SGPR registers: 1. 32 bit work group id in Z dimension of grid for wavefront. If present then Work-Group Id Y will also be present.
Work-Group Info (enable_sgpr_workgroup_info): Number of System SGPR registers: 1. {first_wave, 14'b0000, ordered_append_term[10:0], threadgroup_size_in_waves[5:0]}
Private Segment Wave Byte Offset (enable_sgpr_private_segment_wave_byte_offset): Number of System SGPR registers: 1. 32 bit byte offset from base of dispatch scratch base. Must be used as an offset with Private/Spill/Arg segment address when using Scratch Segment Buffer. It must be added to Flat Scratch Offset if setting up FLAT SCRATCH for flat addressing.
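Two of the entries above carry small computations that a sketch may make concrete: the Grid Work-Group Count formula and the Work-Group Info bit layout. The names below are hypothetical. The decoder assumes the braces denote an MSB-first concatenation in which first_wave is a single bit (bit 31), followed by 14 zero bits, ordered_append_term in bits 16:6 and threadgroup_size_in_waves in bits 5:0; the source does not state the bit positions explicitly.

```cpp
#include <cstdint>

// Rounding-up division from the Grid Work-Group Count entries:
// (gridSize + workgroupSize - 1) / workgroupSize.
// Mirrors the documented formula as written; the addition wraps for grid
// sizes near UINT32_MAX and workgroupSize must be non-zero.
constexpr uint32_t gridWorkGroupCount(uint32_t gridSize,
                                      uint32_t workgroupSize) {
  return (gridSize + workgroupSize - 1) / workgroupSize;
}

// Hypothetical decoded view of the Work-Group Info SGPR.
struct WorkGroupInfo {
  bool first_wave;
  uint32_t ordered_append_term;       // 11 bits
  uint32_t threadgroup_size_in_waves; // 6 bits
};

// Assumed layout (MSB first): {first_wave, 14 zero bits,
// ordered_append_term[10:0], threadgroup_size_in_waves[5:0]}.
constexpr WorkGroupInfo decodeWorkGroupInfo(uint32_t sgpr) {
  return WorkGroupInfo{(sgpr >> 31) != 0, (sgpr >> 6) & 0x7FFu, sgpr & 0x3Fu};
}
```

For example, a grid of 1000 work-items in X with a work-group size of 256 needs 4 work-groups, the last of which is partially filled.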
The order of the VGPR registers is defined, but the Finalizer can specify which ones are actually set up in the amd_kernel_code_t object using the enable_vgpr_* bit fields. The register numbers used for enabled registers are dense starting at VGPR0: the first enabled register is VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a VGPR number.
VGPR register initial state is defined as follows:
Work-Item Id X (always initialized): Number of registers: 1. 32 bit work item id in X dimension of work-group for wavefront lane.
Work-Item Id Y (enable_vgpr_workitem_id > 0): Number of registers: 1. 32 bit work item id in Y dimension of work-group for wavefront lane.
Work-Item Id Z (enable_vgpr_workitem_id > 1): Number of registers: 1. 32 bit work item id in Z dimension of work-group for wavefront lane.
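Because SPI only supports the (X), (X, Y) or (X, Y, Z) combinations, the VGPR number of an initialized work-item id equals its dimension index. The helper below is a hypothetical sketch of that rule, assuming enable_vgpr_workitem_id takes the values 0 (X only), 1 (X and Y) and 2 (X, Y and Z); that encoding is an assumption not spelled out on this page.

```cpp
// dim: 0 = X, 1 = Y, 2 = Z.
// Returns the VGPR number holding that work-item id, or -1 if the id is not
// initialized for the given enable_vgpr_workitem_id value (assumed 0..2).
constexpr int workItemIdVgpr(unsigned dim, unsigned enable_vgpr_workitem_id) {
  return (dim <= 2 && dim <= enable_vgpr_workitem_id) ? static_cast<int>(dim)
                                                      : -1;
}
```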
The registers are set by existing GPU hardware as follows:
1) SGPRs before the Work-Group Ids are set by CP using the 16 User Data registers.
2) Work-Group Id registers X, Y, Z are set by SPI, which supports any combination including none.
3) Scratch Wave Offset is also set by SPI, which is why its value cannot be added into the value of Flat Scratch Offset; doing so would avoid the Finalizer-generated prolog having to do the add.
4) The VGPRs are set by SPI, which only supports specifying either (X), (X, Y) or (X, Y, Z).
Flat Scratch Dispatch Offset and Flat Scratch Size are adjacent SGPRs so they can be moved as a 64 bit value to the hardware required SGPRn-4 and SGPRn-3 respectively using the Finalizer "FLAT_SCRATCH" register.
The global segment can be accessed either using flat operations or buffer operations. If buffer operations are used then the Global Buffer used to access HSAIL Global/Readonly/Kernarg (which are combined) segments using a segment address is not passed into the kernel code by CP since its base address is always 0. Instead the Finalizer generates prolog code to initialize 4 SGPRs with a V# that has the following properties, and then uses that in the buffer instructions:
When the Global Buffer is used to access the Kernarg segment, the dispatch packet kernArgPtr must be added to a kernarg segment address before using this V#. Alternatively scalar loads can be used if the kernarg offset is uniform, as the kernarg segment is constant for the duration of the kernel execution.
Definition at line 526 of file AMDKernelCodeT.h.
uint32_t amd_kernel_code_t::amd_kernel_code_version_major |
Definition at line 527 of file AMDKernelCodeT.h.
uint32_t amd_kernel_code_t::amd_kernel_code_version_minor |
Definition at line 528 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::amd_machine_kind |
Definition at line 529 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::amd_machine_version_major |
Definition at line 530 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::amd_machine_version_minor |
Definition at line 531 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::amd_machine_version_stepping |
Definition at line 532 of file AMDKernelCodeT.h.
int32_t amd_kernel_code_t::call_convention |
Definition at line 645 of file AMDKernelCodeT.h.
uint32_t amd_kernel_code_t::code_properties |
Code properties.
See amd_code_property_mask_t for a full list of properties.
Definition at line 562 of file AMDKernelCodeT.h.
uint64_t amd_kernel_code_t::compute_pgm_resource_registers |
Shader program settings for CS.
Contains COMPUTE_PGM_RSRC1 and COMPUTE_PGM_RSRC2 registers.
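A minimal sketch of packing and unpacking the two registers. It assumes COMPUTE_PGM_RSRC1 occupies the low 32 bits and COMPUTE_PGM_RSRC2 the high 32 bits of the field; this page does not state the order, so treat it as an assumption. The helper names are hypothetical.

```cpp
#include <cstdint>

// Assumption: RSRC1 is the low half and RSRC2 the high half of
// compute_pgm_resource_registers.
constexpr uint64_t packComputePgmRsrc(uint32_t rsrc1, uint32_t rsrc2) {
  return static_cast<uint64_t>(rsrc1) | (static_cast<uint64_t>(rsrc2) << 32);
}

constexpr uint32_t computePgmRsrc1(uint64_t compute_pgm_resource_registers) {
  return static_cast<uint32_t>(compute_pgm_resource_registers & 0xFFFFFFFFu);
}

constexpr uint32_t computePgmRsrc2(uint64_t compute_pgm_resource_registers) {
  return static_cast<uint32_t>(compute_pgm_resource_registers >> 32);
}
```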
Definition at line 558 of file AMDKernelCodeT.h.
uint64_t amd_kernel_code_t::control_directives[16] |
Definition at line 648 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::debug_private_segment_buffer_sgpr |
If is_debug_supported is 0 then must be 0.
Otherwise, this is the fixed SGPR number of the first of 4 SGPRs used to hold the scratch V# used for the entire kernel execution, or uint16_t(-1) if the registers are not used or not known.
Definition at line 629 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::debug_wavefront_private_segment_offset_sgpr |
If is_debug_supported is 0 then must be 0.
Otherwise, this is the fixed SGPR number used to hold the wave scratch offset for the entire kernel execution, or uint16_t(-1) if the register is not used or not known.
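Both debug SGPR fields share the same convention: 0 when is_debug_supported is 0, otherwise a fixed SGPR number or the uint16_t(-1) sentinel. The checks below are a hypothetical sketch of how a consumer might interpret them; the names are illustrative only.

```cpp
#include <cstdint>

// Sentinel meaning "register not used or not known" (0xFFFF).
constexpr uint16_t kDebugSgprUnknown = static_cast<uint16_t>(-1);

// The field must be 0 when is_debug_supported is 0.
constexpr bool isDebugSgprFieldWellFormed(bool is_debug_supported,
                                          uint16_t sgpr) {
  return is_debug_supported || sgpr == 0;
}

// True only if debugging is supported and a concrete SGPR number is recorded.
constexpr bool hasKnownDebugSgpr(bool is_debug_supported, uint16_t sgpr) {
  return is_debug_supported && sgpr != kDebugSgprUnknown;
}
```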
Definition at line 623 of file AMDKernelCodeT.h.
uint32_t amd_kernel_code_t::gds_segment_byte_size |
Number of bytes of GDS required by kernel dispatch.
Must be 0 if not using GDS.
Definition at line 578 of file AMDKernelCodeT.h.
uint8_t amd_kernel_code_t::group_segment_alignment |
Definition at line 635 of file AMDKernelCodeT.h.
uint8_t amd_kernel_code_t::kernarg_segment_alignment |
The maximum byte alignment of variables used by the kernel in the specified memory segment.
Expressed as a power of two. Must be at least HSA_POWERTWO_16.
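Since the alignment fields store an exponent rather than a byte count, a reader may find a conversion sketch useful. The helper names are hypothetical, and the value 4 for HSA_POWERTWO_16 (2^4 = 16) is an assumption about that enumerator rather than something stated on this page.

```cpp
#include <cstdint>

// Assumed numeric value of HSA_POWERTWO_16: the exponent 4, i.e. 16 bytes.
constexpr uint8_t kAssumedPowerTwo16 = 4;

// Converts a power-of-two exponent to a byte alignment.
// The exponent must be below 64 for the shift to be defined.
constexpr uint64_t alignmentInBytes(uint8_t log2_alignment) {
  return uint64_t{1} << log2_alignment;
}

// Checks the "at least HSA_POWERTWO_16" requirement under that assumption.
constexpr bool isAlignmentAtLeast16(uint8_t log2_alignment) {
  return log2_alignment >= kAssumedPowerTwo16;
}
```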
Definition at line 634 of file AMDKernelCodeT.h.
uint64_t amd_kernel_code_t::kernarg_segment_byte_size |
The size in bytes of the kernarg segment that holds the values of the arguments to the kernel.
This could be used by CP to prefetch the kernarg segment pointed to by the dispatch packet.
Definition at line 583 of file AMDKernelCodeT.h.
int64_t amd_kernel_code_t::kernel_code_entry_byte_offset |
Byte offset (possibly negative) from start of amd_kernel_code_t object to kernel's entry point instruction.
The actual code for the kernel is required to be 256 byte aligned to match hardware requirements (SQ cache line is 16). The code must be position independent code (PIC) for AMD devices to give runtime the option of copying code to discrete GPU memory or APU L2 cache. The Finalizer should endeavour to allocate all kernel machine code in contiguous memory pages so that a device pre-fetcher will tend to only pre-fetch Kernel Code objects, improving cache performance.
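As an illustration, the entry point address can be derived from the address of the amd_kernel_code_t object and the signed offset, and the 256 byte alignment requirement can be checked on the result. The function names are hypothetical, and addresses are modelled as plain 64-bit integers.

```cpp
#include <cstdint>

// Adds the (possibly negative) byte offset to the address of the
// amd_kernel_code_t object. Unsigned wrap-around arithmetic makes a negative
// offset subtract as expected.
constexpr uint64_t kernelEntryAddress(uint64_t code_object_address,
                                      int64_t kernel_code_entry_byte_offset) {
  return code_object_address +
         static_cast<uint64_t>(kernel_code_entry_byte_offset);
}

// The kernel code is required to be 256 byte aligned.
constexpr bool isKernelCodeAligned(uint64_t entry_address) {
  return (entry_address & 255u) == 0;
}
```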
Definition at line 544 of file AMDKernelCodeT.h.
int64_t amd_kernel_code_t::kernel_code_prefetch_byte_offset |
Range of bytes to consider prefetching expressed as an offset and size.
The offset (possibly negative) is from the start of the amd_kernel_code_t object. Set both to 0 if no prefetch information is available.
Definition at line 550 of file AMDKernelCodeT.h.
uint64_t amd_kernel_code_t::kernel_code_prefetch_byte_size |
Definition at line 551 of file AMDKernelCodeT.h.
uint8_t amd_kernel_code_t::private_segment_alignment |
Definition at line 636 of file AMDKernelCodeT.h.
uint64_t amd_kernel_code_t::reserved0 |
Reserved. Must be 0.
Definition at line 554 of file AMDKernelCodeT.h.
uint8_t amd_kernel_code_t::reserved3[12] |
Definition at line 646 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::reserved_sgpr_count |
The number of consecutive SGPRs reserved by the client.
If is_debug_supported then this count includes SGPRs reserved for debugger use.
Definition at line 617 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::reserved_sgpr_first |
If reserved_sgpr_count is 0 then must be 0.
Otherwise, this is the first fixed SGPR number reserved.
Definition at line 612 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::reserved_vgpr_count |
The number of consecutive VGPRs reserved by the client.
If is_debug_supported then this count includes VGPRs reserved for debugger use.
Definition at line 608 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::reserved_vgpr_first |
If reserved_vgpr_count is 0 then must be 0.
Otherwise, this is the first fixed VGPR number reserved.
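The reserved_sgpr_* and reserved_vgpr_* pairs each describe one consecutive register range. A hypothetical sketch of the consistency rule and a membership test, usable for either register file:

```cpp
#include <cstdint>

// If the count is 0 then the first register number must also be 0.
constexpr bool isReservedRangeWellFormed(uint16_t first, uint16_t count) {
  return count != 0 || first == 0;
}

// True if reg lies in [first, first + count). The sum is computed in 32 bits
// so it cannot wrap around.
constexpr bool isRegisterReserved(uint16_t reg, uint16_t first,
                                  uint16_t count) {
  return reg >= first &&
         static_cast<uint32_t>(reg) < static_cast<uint32_t>(first) + count;
}
```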
Definition at line 603 of file AMDKernelCodeT.h.
uint64_t amd_kernel_code_t::runtime_loader_kernel_symbol |
Definition at line 647 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::wavefront_sgpr_count |
Number of scalar registers used by a wavefront.
This includes the special SGPRs for VCC, Flat Scratch Base, Flat Scratch Size and XNACK (for GFX8 (VI)). It does not include the 16 SGPRs added if a trap handler is enabled. Used to set COMPUTE_PGM_RSRC1.SGPRS.
Definition at line 595 of file AMDKernelCodeT.h.
uint8_t amd_kernel_code_t::wavefront_size |
Wavefront size expressed as a power of two.
Must be a power of 2 in range 1..64 inclusive. Used to support a runtime query that obtains the wavefront size, which may be used by the application to allocate dynamic group memory and set the dispatch work-group size.
Definition at line 643 of file AMDKernelCodeT.h.
uint32_t amd_kernel_code_t::workgroup_fbarrier_count |
Number of fbarriers used in the kernel and all functions it calls.
If the implementation uses group memory to allocate the fbarriers then that amount must already be included in the workgroup_group_segment_byte_size total.
Definition at line 589 of file AMDKernelCodeT.h.
uint32_t amd_kernel_code_t::workgroup_group_segment_byte_size |
The amount of group segment memory required by a work-group in bytes.
This does not include any dynamically allocated group segment memory that may be added when the kernel is dispatched.
Definition at line 574 of file AMDKernelCodeT.h.
uint32_t amd_kernel_code_t::workitem_private_segment_byte_size |
The amount of memory required for the combined private, spill and arg segments for a work-item in bytes.
If is_dynamic_callstack is 1 then additional space must be added to this value for the call stack.
Definition at line 568 of file AMDKernelCodeT.h.
uint16_t amd_kernel_code_t::workitem_vgpr_count |
Number of vector registers used by each work-item.
Used to set COMPUTE_PGM_RSRC1.VGPRS.
Definition at line 599 of file AMDKernelCodeT.h.