Hello,

Code:

bool foo(int *n, int bit)
{
    unsigned mask = (1u << bit);
    return (__sync_fetch_and_or(n, mask) & mask) != 0;
}

Clang -O3:

foo(int*, int):
        mov     edx, 1
        mov     ecx, esi
        shl     edx, cl
        mov     eax, dword ptr [rdi]
.LBB0_1:
        mov     ecx, eax
        or      ecx, edx
        lock cmpxchg dword ptr [rdi], ecx
        jne     .LBB0_1
        test    eax, edx
        setne   al
        ret

GCC 7+ generates code using the BTS instruction:

foo(int*, int):
        lock bts DWORD PTR [rdi], esi
        setc    al
        ret

For more code examples see: https://godbolt.org/g/28nkRu
Just as a comment: it was suggested in https://reviews.llvm.org/D48606#1144023 that the memory-operand forms of BT* should not be used.
We also don't optimize the single-bit immediate version:

bool foo(int *n, int bit)
{
    unsigned mask = 0x80;
    return (__sync_fetch_and_or (n, mask) & mask) != 0;
}
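For the immediate case the bit position is a compile-time constant (0x80 is bit 7), so the expected lowering would presumably be just as compact — something like the following sketch of GCC-style output (my guess, not verified output from the report):

foo(int*, int):
        lock bts DWORD PTR [rdi], 7
        setc    al
        ret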
Even though the memory forms of BTS/BTR/BTC are >10 uop flows, they are probably still worthwhile in this atomic case because they let us remove the entire loop around the cmpxchg. Unfortunately, this is tricky to implement. AtomicExpandPass currently turns every atomicrmw or/and/xor whose previous value is needed into a binop + cmpxchg loop before X86 instruction selection runs. We could try to disable that expansion for the single-bit case, but if the shift instruction and the atomicrmw end up in separate basic blocks, basic-block-at-a-time isel won't be able to find the bit position. We almost need an IR construct that expresses atomic bittest+set/clear/complement as one instruction.
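For example, such a construct might look like a dedicated intrinsic that carries the pointer and the bit position together, so isel never has to chase the shift across blocks. The name and signature below are purely hypothetical — nothing like this exists in the IR today:

; hypothetical intrinsic: atomically OR (1 << %bit) into *%p,
; return the previous value of that bit
%oldbit = call i1 @llvm.atomic.bit.test.and.set.i32(i32* %p, i32 %bit)

With the bit position as a direct operand, the backend could lower this to lock bts + setc regardless of where the mask computation originally lived.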