I initially spotted this issue with AOMP, but it seems to come from upstream clang. See ROCm/aomp#24.
Reproducer:
git clone https://github.com/ye-luo/miniqmc
cd miniqmc/build
cmake -DCMAKE_CXX_COMPILER=clang++ -DENABLE_OFFLOAD=ON \
  -DUSE_OBJECT_TARGET=ON -DCMAKE_EXE_LINKER_FLAGS="-v" ..
make -j32 check_spo_batched
All 6 kernels use 254 registers.
Then I comment out "target teams" at lines 159, 311, and 405 and rebuild:
make -j32 check_spo_batched
Now the 3 remaining kernels all use 243 registers.
If I add -DCMAKE_CXX_FLAGS="-Xcuda-ptxas -v" to the cmake command, ptxas prints the register usage of each kernel: the three kernels take 146, 30, and 30 registers when compiled.
I think the register usage is fine when the kernels are compiled individually.
Somehow, at link time, every assembled kernel ends up with the worst register usage among all the individual kernels.
It destroys performance completely.
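To put rough numbers on the performance impact, here is a minimal sketch of how register usage bounds the number of threads that can be resident on one SM. The hardware limits (65536 registers and 2048 resident threads per SM) are assumptions typical of recent NVIDIA GPUs and are not taken from this report; the function name is hypothetical, and the estimate ignores register allocation granularity, shared memory, and block-size limits.

```cpp
#include <algorithm>
#include <cassert>

// Assumed hardware limits (typical of recent NVIDIA GPUs; not from this report).
constexpr int kRegistersPerSM = 65536;  // 32-bit registers available per SM
constexpr int kMaxThreadsPerSM = 2048;  // cap on resident threads per SM

// Upper bound on resident threads per SM imposed by register usage alone.
// Real occupancy can be lower: this ignores allocation granularity,
// shared memory usage, and block-size constraints.
int max_resident_threads(int regs_per_thread) {
  assert(regs_per_thread > 0);
  return std::min(kMaxThreadsPerSM, kRegistersPerSM / regs_per_thread);
}
```

Under these assumptions, a kernel that needs 30 registers can keep the SM full at 2048 threads, while the same kernel forced to 243 registers is limited to 269 threads, roughly an eightfold drop in the threads available to hide memory latency.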
Partially resolved with 5b0581a and D83832. For better automatic results we need LTO. I'll mark this as fixed for now as the original issue is gone (I think).