define i32 @xor4_popcount(i32 %x) {
  %m = and i32 %x, 15
  %p = tail call i32 @llvm.ctpop.i32(i32 %m)
  %r = and i32 %p, 1
  ret i32 %r
}
declare i32 @llvm.ctpop.i32(i32)

--------------------------------------------------------------------

% llc -o - pop.ll
	xorl	%eax, %eax
	testb	$15, %dil
	setnp	%al

% llc -o - pop.ll -mattr=popcnt
	andl	$15, %edi
	popcntl	%edi, %eax
	andl	$1, %eax

--------------------------------------------------------------------

Debug spew shows that we convert to a parity node either way, but then convert back to ctpop for a target that supports that instruction. The test+setnp sequence likely has better latency/throughput than popcnt + 2 mask instructions on all recent x86 CPUs.
This example is derived from the post-commit discussion in: https://reviews.llvm.org/D110170
Can I take this?
(In reply to Craig Topper from comment #2)
> Can I take this?

Sure!
Candidate patch: https://reviews.llvm.org/D111249
Should be fixed after 58b68e70ebf6308f982426a2618782f473218eed