Hello,

Code:

bool foo(int *n, int bit)
{
    unsigned mask = (1u << bit);
    return (__sync_fetch_and_or(n, mask) & mask) != 0;
}

Clang -O3:

foo(int*, int):
        mov     edx, 1
        mov     ecx, esi
        shl     edx, cl
        mov     eax, dword ptr [rdi]
.LBB0_1:
        mov     ecx, eax
        or      ecx, edx
        lock cmpxchg dword ptr [rdi], ecx
        jne     .LBB0_1
        test    eax, edx
        setne   al
        ret

GCC 7+ generates code using the BTS instruction:

foo(int*, int):
        lock bts DWORD PTR [rdi], esi
        setc    al
        ret

For more code examples see: https://godbolt.org/g/28nkRu
Just as a comment: it was suggested in https://reviews.llvm.org/D48606#1144023 that the memory-operand forms of BT* should not be used.
We also don't optimize the single-bit immediate version:

bool foo(int *n, int bit)
{
    unsigned mask = 0x80;
    return (__sync_fetch_and_or (n, mask) & mask) != 0;
}
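For the immediate case the bit position is a compile-time constant (0x80 is bit 7), so the expected lowering would presumably be just as compact — something like the following sketch of GCC-style output (my guess, not verified output from the report):

foo(int*, int):
        lock bts DWORD PTR [rdi], 7
        setc    al
        ret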
Even though the memory forms of BTS/BTR/BTC are >10 uop flows, they are probably still worthwhile in this atomic case because they let us remove the entire loop around the cmpxchg. Unfortunately, this is tricky to implement. AtomicExpandPass currently turns every atomicrmw or/and/xor whose previous value is needed into a binop + cmpxchg loop before X86 instruction selection runs. We could try to disable that expansion for the single-bit case, but if the shift instruction and the atomicrmw end up in separate basic blocks, basic-block-at-a-time isel won't be able to find the bit position. We almost need an IR construct that expresses atomic bittest+set/clear/complement as one instruction.
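For example, such a construct might look like a dedicated intrinsic that carries the pointer and the bit position together, so isel never has to chase the shift across blocks. The name and signature below are purely hypothetical — nothing like this exists in the IR today:

; hypothetical intrinsic: atomically OR (1 << %bit) into *%p,
; return the previous value of that bit
%oldbit = call i1 @llvm.atomic.bit.test.and.set.i32(i32* %p, i32 %bit)

With the bit position as a direct operand, the backend could lower this to lock bts + setc regardless of where the mask computation originally lived.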