Created attachment 24690
Non-reduced, un-preprocessed source

Here is a reduced test case:

#include <stdlib.h>
typedef union {
  int16_t i16 __attribute__((__vector_size__(32)));
  int32_t i32 __attribute__((__vector_size__(32)));
} simde__m256i_private;
simde__m256i_private simde__m256i_to_private();
int simde_mm256_madd_epi16() {
  simde__m256i_private r_, a_ = simde__m256i_to_private(),
                           b_ = simde__m256i_to_private();
  for (size_t i = 0; i < sizeof sizeof(r_); i += 2)
    r_.i32[i] = a_.i16[i] * b_.i16[i] + a_.i16[i + 1] * b_.i16[i + 1];
  simde__m256i_from_private(r_);
}

Compile with -O2 using clang (clang++ works) on x86_64. Godbolt link: https://godbolt.org/z/71o5hdY4h

The problem only manifests in my codebase with clang 12, but this test case seems to reliably reproduce the issue in earlier versions as well (back to 7 on godbolt).

I'm also attaching the original (non-reduced) source. Please let me know if you need any additional information.
This should prevent the crash: https://reviews.llvm.org/rGa283d7258360 But presumably we want the output to be a "pmaddwd" instruction?
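For reference, the arithmetic that a 256-bit "pmaddwd" performs can be sketched in scalar C: adjacent pairs of signed 16-bit lanes are multiplied and each pair of products is summed into one signed 32-bit lane. This is only an illustration of the instruction's semantics, not code from the linked commit; the function name madd_epi16_ref is hypothetical, and the unsigned addition is an assumption made here to model wraparound without signed-overflow undefined behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for a 256-bit pmaddwd:
 * 16 signed 16-bit input lanes per operand, 8 signed 32-bit output lanes. */
static void madd_epi16_ref(const int16_t a[16], const int16_t b[16],
                           int32_t r[8]) {
  for (size_t i = 0; i < 8; i++) {
    /* Each product of two int16_t values fits in int32_t. */
    int32_t lo = (int32_t)a[2 * i] * (int32_t)b[2 * i];
    int32_t hi = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
    /* Add as unsigned so the one overflowing case (all four inputs equal to
     * -32768) wraps instead of being undefined. Converting the result back
     * to int32_t is implementation-defined in C; clang and gcc wrap. */
    r[i] = (int32_t)((uint32_t)lo + (uint32_t)hi);
  }
}
```

Note that output lane i reads input lanes 2*i and 2*i + 1, whereas the loop in the reduced test case uses the same index i on both sides.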
The matching code for this pattern was translated from a different, existing pattern here: https://reviews.llvm.org/D49636 But it looks like we need to make more adjustments: extract a subvector rather than truncate? It is also possible that the inputs are shorter vectors than the output?
Is the "sizeof sizeof(r_)" in the for loop a mistake in your code that exposed the bug? It doesn't logically make sense.
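To make the question concrete, here is a small sketch of what that loop bound evaluates to. The union mirrors the one in the reduced test case; the type name vec256_private and both helper functions are hypothetical names introduced for illustration. "sizeof sizeof(r_)" is the size of a size_t value (8 on x86_64), which is unrelated to the size of r_ itself and only coincidentally equals the number of 32-bit lanes on that target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type mirroring the union in the reduced test case
 * (uses the clang/gcc vector_size extension). */
typedef union {
  int16_t i16 __attribute__((__vector_size__(32)));
  int32_t i32 __attribute__((__vector_size__(32)));
} vec256_private;

/* The bound as written in the test case: sizeof applied to the result of
 * sizeof(r_), i.e. sizeof(size_t). sizeof does not evaluate its operand. */
static size_t bound_as_written(void) {
  vec256_private r_;
  return sizeof sizeof(r_);
}

/* The number of 32-bit lanes in the vector: 32 bytes / 4 bytes per lane. */
static size_t lanes_i32(void) {
  vec256_private r_;
  return sizeof(r_.i32) / sizeof(int32_t);
}
```

With i stepping by 2 up to a bound of 8, the loop in the test case touches indices 0, 2, 4 and 6 only.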
More pmadd matching: https://reviews.llvm.org/rGe694e19a7931
Resolving as fixed. I added a test based on Craig's suggestion in https://reviews.llvm.org/D99531 that shows we could go even further to try to match pmadd, so if there's a real-world need for that, please file another bug.