Atomic Operations and Synchronization Representations

Written by Chandler Carruth

A Proposal for New LLVM Intrinsics

The LLVM representation aims to be light-weight and low-level while being expressive, typed, and extensible at the same time. It aims to be a "universal IR" of sorts, by being at a low enough level that high-level ideas may be cleanly mapped to it (similar to how microprocessors are "universal IR's", allowing many source languages to be mapped to them).

LLVM Language Reference

This proposal attempts to expand the "universal IR" of LLVM to include hardware constructs for atomic operations and memory synchronization. This will provide an interface to the hardware, not an interface to the programmer. It is aimed at a low enough level to allow any programming models or APIs which need atomic behaviors to map cleanly onto it. It is also modeled primarily on hardware behavior. Just as hardware provides a "universal IR" for source languages, it also provides a starting point for developing a "universal" atomic operation and synchronization IR.

The proposal is for an LLVM hardware interface. It is not an API in the manner of high-level threading libraries, software transactional memory systems, lower-level atomic primitives, or the intrinsic functions found in BSD, GNU libc, atomic_ops, APR, and other system and application libraries. The hardware interface provided by LLVM should allow a clean implementation of all of these APIs and parallel programming models. No one model or paradigm should be selected above others unless the hardware itself ubiquitously does so.

Understanding the hardware itself is the first step toward crafting a representation that is unbiased toward any language representation and maps clearly onto the hardware. The various target architectures for LLVM were researched in order to investigate what capabilities are directly provided by the hardware. All of these targets provide sufficient functionality to achieve the atomicity needed by APIs and programming interfaces. The targets differ in how they provide the necessary atomicity, and in how much is directly available through hardware constructs. The following table summarizes the hardware constructs provided across the various architectures.

Architecture  Memory Synchronization        Atomic Compare and Swap  Atomic Test and Set  Atomic Swap  Atomic Add  Atomic Sub  Atomic Increment  Atomic Decrement
SPARC         MEMBAR                        CASA and CASXA           Unreliable           Deprecated   N/A         N/A         N/A               N/A
x86 / x86_64  MFENCE, SFENCE, LFENCE        LOCK CMPXCHG             LOCK BTS             LOCK XCHG    LOCK XADD   LOCK XADD   LOCK INC          LOCK DEC
ia64          mf and .acq / .rel modifiers  cmpxchg                  N/A                  N/A          fetchadd    fetchadd    N/A               N/A
PPC           sync                          Spinning Conditional Load/Store: lwarx and stwcx. (all atomic operations)
MIPS          SYNC                          Spinning Conditional Load/Store: LL and SC (all atomic operations)
ARM           DMB                           Spinning Conditional Load/Store: LDREX and STREX (all atomic operations)
Alpha         mb, wmb                       Spinning Conditional Load/Store: ld*_l and st*_c (all atomic operations)

Memory Synchronization / Barrier

All architectures provide some memory synchronization functionality. These constructs are often called memory barriers or fences. There are two ways of implementing them. The first provides memory constraints only for a specific operation, and is often constructed as a modifier on that operation. The second is a standalone instruction that enforces some ordering of memory accesses relative to that instruction.

The only architecture which supports operation-based synchronization constructs is the Itanium, and it still supports a coarse-grained standalone instruction. Because all other target architectures represent these constraints as standalone constructs, and because an operation-based constraint cannot guard other operations, a standalone representation was chosen for this proposal. While some standalone representation is necessary, it will be possible to extend the operations themselves to have built-in memory constraints if future hardware developments or further support for Itanium demand it. This would be an incremental improvement on this proposal and is not inhibited by using a coarse-grained standalone representation.

The single instruction constraints can, at their most flexible, constrain any set of possible pairings of loads from memory and stores to memory. That is, they can provide a barrier between loads and stores, between loads and loads, between stores and loads, etc. These pairings can then be combined logically to provide a barrier between loads and loads, as well as between loads and stores, all with a single instruction constraint. This most flexible arrangement was selected for the proposal in order to efficiently provide all available memory constraint constructs on the hardware targets. A graceful fallback to a sufficient representation is always provided.

Atomic Operations

Atomic operations on all architectures provide a mechanism to modify the value of data in memory atomically. These can be thought of as read-modify-write or load-modify-store operations where all three actions occur atomically. They make up the basis for synchronization constructs such as mutual exclusion locks and semaphores. Note, however, that these are typically not blocking operations, nor do they directly provide locks or semaphores. They simply provide atomicity guarantees needed to efficiently and effectively implement these and other concurrent thread synchronization mechanisms.

The most important of these operations is the compare-and-swap construct. This operation has a consensus number of +Inf, which allows it to support many synchronization constructs that other operations do not. Specifically, test-and-set, fetch-and-add, and other constructs are not sufficient for non-blocking resolution of consensus between more than two processes or threads. All of the hardware targets for LLVM support some form of compare and swap, allowing this to be the cornerstone atomic operation of this proposal.

Several architectures provide address reservation and a conditional store mechanism (some literature calls this load-linked/store-conditional) to implement atomic operations. These constructs allow the operation to be performed immediately and locally. The atomicity of the operation with respect to a specific address is checked, and if it was completed successfully, the result is stored back to memory. Otherwise, the atomicity was lost, and the operation is performed again until it successfully achieves an atomic sequence. This is similar to transaction systems where the transaction is not committed to the material database until its atomicity has been assured.

A similar retry approach is used on other architectures when a specific operation is not directly supported atomically in hardware. In these cases, the compare-and-swap hardware support can be used to attempt the operation until atomicity has been achieved. Neither of these uses is blocking, and both retain full atomic behavior. They are simply the forms these various atomic actions take on the respective hardware implementations.
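The retry pattern described above can be sketched in C++. This is only an illustration and not part of the proposal: it assumes C++11 'std::atomic' as a stand-in for the hardware compare-and-swap, and the name 'fetch_or_via_cas' is hypothetical. It builds an atomic OR, which several targets lack as a single instruction, out of a compare-and-swap loop.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomically ORs `mask` into `cell` using only
// compare-and-swap, retrying until no other thread interfered.
// Returns the value that was in memory before the update.
inline std::uint32_t fetch_or_via_cas(std::atomic<std::uint32_t>& cell,
                                      std::uint32_t mask) {
    std::uint32_t observed = cell.load();
    // compare_exchange_weak stores (observed | mask) only if the cell still
    // holds `observed`. On failure it reloads `observed` and the loop retries.
    while (!cell.compare_exchange_weak(observed, observed | mask)) {
    }
    return observed;
}
```

The loop never blocks on another thread: a failed attempt means some other update succeeded in the meantime, so the system as a whole keeps making progress.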

Proposed LLVM Representation

The proposed LLVM representation of these atomic operations and synchronization constructs of modern parallel systems consists of five intrinsic functions. Each is presented below and documented similarly to those defined in LLVM's Language Reference. Additionally, rough outlines for the implementation of these intrinsics are provided for all of LLVM's major target architectures. The representation includes all major atomic behaviors available on these architectures. They can be easily emulated on any architecture which does not support them natively, and by providing LLVM representations for them, the supporting architectures can lower them correctly to their native representations.

'llvm.atomic.cas' Intrinsic

Syntax:

declare <ty> @llvm.atomic.cas( <ty>* <ptr>, <ty> <cmp>, <ty> <swap> )

Overview:

This compares a value in shared memory to a given value. If they are equal, the value in shared memory is replaced with a given new value.

Arguments:

The 'llvm.atomic.cas' intrinsic takes three arguments. The 'ptr' must be a pointer to a value of the type 'ty'. Both 'cmp' and 'swap' must be values of type 'ty'. All three of the 'ty' types must be the same integer type. They can be any size integer, but the targets may only lower integer representations they support.

Semantics:

This entire intrinsic must be executed atomically. It first compares the value in shared memory pointed to by 'ptr' with the value 'cmp'. If they are equal, the value in memory is replaced with the value of 'swap', else the value in memory remains the same. The value originally stored in memory is yielded in either case.

This operation does not perform a true swap due to the semantics of SSA. Rather, it yields the value that would be swapped into 'swap'. It yields this value in all cases, even when the change to memory was not performed, allowing a quick check against 'cmp' to determine success or failure of the swap, and immediately have the actual memory value available.
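The "yield the original value in all cases" behavior can be modeled in C++. This is a minimal sketch under the assumption that C++11 'std::atomic' approximates the proposed intrinsic for 'i32'; the function name 'atomic_cas' is hypothetical and is not the LLVM API.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model of the proposed 'llvm.atomic.cas' for i32.
// Always returns the value that was in memory before the operation.
inline std::int32_t atomic_cas(std::atomic<std::int32_t>& mem,
                               std::int32_t cmp, std::int32_t swap) {
    std::int32_t expected = cmp;
    // On success memory held `cmp`, so `expected` already equals the old value.
    // On failure compare_exchange_strong writes the current memory value
    // into `expected`, which is again the original value.
    mem.compare_exchange_strong(expected, swap);
    return expected;
}
```

A caller detects success by comparing the returned value against 'cmp', which mirrors the 'icmp eq' check used in the examples of this section.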

Examples:
%ptr      = malloc i32
            store i32 4, i32* %ptr

%val1     = add i32 4, 4
%result1  = call i32 @llvm.atomic.cas( i32* %ptr, i32 4, i32 %val1 )
                                          ; yields {i32}:result1 = 4
%swapped1 = icmp eq i32 %result1, 4       ; yields {i1}:swapped1 = true
%memval1  = load i32* %ptr                ; yields {i32}:memval1 = 8

%val2     = add i32 1, 1
%result2  = call i32 @llvm.atomic.cas( i32* %ptr, i32 5, i32 %val2 )
                                          ; yields {i32}:result2 = 8
%swapped2 = icmp eq i32 %result2, 5       ; yields {i1}:swapped2 = false
%memval2  = load i32* %ptr                ; yields {i32}:memval2 = 8
Implementations:

SPARC:
; i0 = ptr
; i1 = cmp
; i2 = swap
; o0 = result
or        %g0,%i2,%o0
casa      [%i0] 0,%i1,%o0

x86:
; eax = cmp
; ebx = ptr
; ecx = swap
; edx = result
movl      %ecx,%edx
lock
cmpxchgl  %edx,(%ebx)
movl      %eax,%edx

Itanium:
; in0 = ptr
; in1 = cmp
; in2 = swap
; ret0 = result
mov       r1=in0
mov       ar.ccv=in1
mov       r2=in2
cmpxchg4  r0=[r1],r2
mov       ret0=r0

PPC:
; r0 = result
; r1 = ptr
; r2 = cmp
; r3 = swap
spin:
lwarx     r0,0,r1
cmpw      r0,r2
bne-      exit
stwcx.    r3,0,r1
bne-      spin
exit:

MIPS:
; s0 = result
; s1 = ptr
; s2 = cmp
; s3 = swap
spin:
add       $t0,$s3,$0
ll        $s0,0($s1)
bne       $s0,$s2,exit
sc        $t0,0($s1)
beq       $t0,$0,spin
exit:

ARM:
; r0 = result
; r1 = ptr
; r2 = cmp
; r3 = swap
spin:
ldrex     r0,[r1]
cmp       r0,r2
bne       exit
strex     r4,r3,[r1]
cmp       r4,#0
bne       spin
exit:

Alpha:
; r0 = result
; r1 = ptr
; r2 = cmp
; r3 = swap
; r5 = scratch copy of swap (stl_c overwrites its source with the status)
spin:
mov       r3,r5
ldl_l     r0,0(r1)
cmpeq     r0,r2,r4
beq       r4,exit
stl_c     r5,0(r1)
beq       r5,spin
exit:

'llvm.atomic.swap' Intrinsic

Syntax:

declare <ty> @llvm.atomic.swap( <ty>* <ptr>, <ty> <swap> )

Overview:

This intrinsic swaps the value stored in shared memory at 'ptr' with 'swap' and yields the value from memory.

Arguments:

The 'llvm.atomic.swap' intrinsic takes two arguments. The first, 'ptr', is a pointer to a value of type 'ty'. The 'swap' argument must be of type 'ty'. The type 'ty' must be an integer type and can be of any size, but the targets may only lower integer representations they support.

Semantics:

This intrinsic loads the value pointed to by 'ptr', and stores 'swap' back into it atomically. Due to SSA rules, the value from memory is yielded rather than stored in 'swap'.
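As a rough model of these semantics, the following C++ sketch assumes C++11 'std::atomic' as a stand-in for the proposed intrinsic on 'i32'. The name 'atomic_swap' is hypothetical and chosen only for illustration.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model of the proposed 'llvm.atomic.swap' for i32.
// Stores `swap` into memory and returns the value it replaced,
// with the load and the store forming one atomic step.
inline std::int32_t atomic_swap(std::atomic<std::int32_t>& mem,
                                std::int32_t swap) {
    return mem.exchange(swap);
}
```

Unlike compare-and-swap, this operation cannot fail: the store always happens, and the returned value only reports what was overwritten.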

Examples:
%ptr      = malloc i32
            store i32 4, i32* %ptr

%val1     = add i32 4, 4
%result1  = call i32 @llvm.atomic.swap( i32* %ptr, i32 %val1 )
                                        ; yields {i32}:result1 = 4
%swapped1 = icmp eq i32 %result1, 4     ; yields {i1}:swapped1 = true
%memval1  = load i32* %ptr              ; yields {i32}:memval1 = 8

%val2     = add i32 1, 1
%result2  = call i32 @llvm.atomic.swap( i32* %ptr, i32 %val2 )
                                        ; yields {i32}:result2 = 8
%swapped2 = icmp eq i32 %result2, 8     ; yields {i1}:swapped2 = true
%memval2  = load i32* %ptr              ; yields {i32}:memval2 = 2
Implementations:

SPARC:
; i0 = ptr
; i1 = swap
; o0 = result
lduw      [%i0] 0,%l0
1:
or        %i1,%g0,%o0
casa      [%i0] 0,%l0,%o0
subcc     %l0,%o0,%l1
brne,pn   1
or        %o0,%g0,%l0

x86:
; eax = result
; ebx = ptr
; ecx = swap
movl      %ecx,%eax
xchg      %eax,(%ebx)

Itanium:
; TODO

PPC:
; r0 = result
; r1 = ptr
; r2 = swap
spin:
lwarx     r0,0,r1
stwcx.    r2,0,r1
bne-      spin

MIPS:
; s0 = result
; s1 = ptr
; s2 = swap
spin:
add       $t0,$s2,$0
ll        $s0,0($s1)
sc        $t0,0($s1)
beq       $t0,$0,spin

ARM:
; r0 = result
; r1 = ptr
; r2 = swap
spin:
ldrex     r0,[r1]
strex     r3,r2,[r1]
cmp       r3,#0
bne       spin

Alpha:
; r0 = result
; r1 = ptr
; r2 = swap
; r3 = scratch copy of swap (stl_c overwrites its source with the status)
spin:
mov       r2,r3
ldl_l     r0,0(r1)
stl_c     r3,0(r1)
beq       r3,spin

'llvm.atomic.las' Intrinsic

Syntax:

declare <ty> @llvm.atomic.las( <ty>* <ptr>, <ty> <delta> )

Overview:

This intrinsic adds 'delta' to the value stored in shared memory at 'ptr'. It yields the original value at 'ptr'.

Arguments:

The intrinsic takes two arguments, the first a pointer to a value of type 'ty' and the second a value of type 'ty'. The type 'ty' must be an integer type and can be of any size, but the targets may only lower integer representations they support.

Semantics:

This intrinsic performs a series of operations atomically. It first loads the value stored at 'ptr'. It then adds 'delta' and stores the result to 'ptr'. It yields the original value stored at 'ptr'.
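The load, add, and store sequence can be modeled in C++ as a single fetch-and-add. This is a minimal sketch assuming C++11 'std::atomic' as a stand-in for the proposed intrinsic on 'i32'; 'atomic_las' is a hypothetical name, not the LLVM API.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model of the proposed 'llvm.atomic.las' (load-add-store).
// Adds `delta` to memory and returns the value held before the addition.
inline std::int32_t atomic_las(std::atomic<std::int32_t>& mem,
                               std::int32_t delta) {
    return mem.fetch_add(delta);
}
```

Because the old value is returned, a caller that wants the new value can recompute it locally as the returned value plus 'delta' without touching memory again.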

Examples:
%ptr      = malloc i32
            store i32 4, i32* %ptr
%result1  = call i32 @llvm.atomic.las( i32* %ptr, i32 4 )
                                    ; yields {i32}:result1 = 4
%result2  = call i32 @llvm.atomic.las( i32* %ptr, i32 2 )
                                    ; yields {i32}:result2 = 8
%result3  = call i32 @llvm.atomic.las( i32* %ptr, i32 5 )
                                    ; yields {i32}:result3 = 10
%memval   = load i32* %ptr          ; yields {i32}:memval = 15
%swapped  = icmp eq i32 %memval, 15 ; yields {i1}:swapped = true
Implementations:

SPARC:
; i0 = ptr
; i1 = delta
; o0 = result
lduw      [%i0] 0,%l0
1:
add       %i1,%l0,%o0
casa      [%i0] 0,%l0,%o0
subcc     %l0,%o0,%l1
brne,pn   1
or        %o0,%g0,%l0

x86:
; eax = result
; ebx = ptr
; ecx = delta
movl      %ecx,%eax
lock
xaddl     %eax,(%ebx)

Itanium:
; TODO

PPC:
; r0 = result
; r1 = ptr
; r2 = delta
spin:
lwarx     r0,0,r1
add       r3,r2,r0
stwcx.    r3,0,r1
bne-      spin

MIPS:
; s0 = result
; s1 = ptr
; s2 = delta
spin:
ll        $s0,0($s1)
add       $t1,$s0,$s2
sc        $t1,0($s1)
beq       $t1,$0,spin

ARM:
; r0 = result
; r1 = ptr
; r2 = delta
spin:
ldrex     r0,[r1]
add       r3,r0,r2
strex     r4,r3,[r1]
cmp       r4,#0
bne       spin

Alpha:
; r0 = result
; r1 = ptr
; r2 = delta
spin:
ldl_l     r0,0(r1)
addl      r0,r2,r3
stl_c     r3,0(r1)
beq       r3,spin

'llvm.atomic.lss' Intrinsic

Syntax:

declare <ty> @llvm.atomic.lss( <ty>* <ptr>, <ty> <delta> )

Overview:

This intrinsic subtracts 'delta' from the value stored in shared memory at 'ptr'. It yields the original value at 'ptr'.

Arguments:

The intrinsic takes two arguments, the first a pointer to a value of type 'ty' and the second a value of type 'ty'. The type 'ty' must be an integer type and can be of any size, but the targets may only lower integer representations they support.

Semantics:

This intrinsic performs a series of operations atomically. It first loads the value stored at 'ptr'. It then subtracts 'delta' and stores the result to 'ptr'. It yields the original value stored at 'ptr'.
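A target with only an atomic add can still provide this operation by adding the negated 'delta', which is what the x86 outline in this section does with a negate followed by an exchange-and-add. The C++ sketch below illustrates that idea under stated assumptions: C++11 'std::atomic' stands in for the hardware, an unsigned type is used so that negation wraps instead of overflowing, and 'atomic_lss' is a hypothetical name.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model of the proposed 'llvm.atomic.lss' (load-sub-store),
// expressed as an atomic add of the two's-complement negation of `delta`.
// Returns the value held in memory before the subtraction.
inline std::uint32_t atomic_lss(std::atomic<std::uint32_t>& mem,
                                std::uint32_t delta) {
    std::uint32_t negated = 0u - delta;  // wraps modulo 2^32, no overflow
    return mem.fetch_add(negated);
}
```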

Examples:
%ptr      = malloc i32
            store i32 32, i32* %ptr
%result1  = call i32 @llvm.atomic.lss( i32* %ptr, i32 4 )
                                    ; yields {i32}:result1 = 32
%result2  = call i32 @llvm.atomic.lss( i32* %ptr, i32 2 )
                                    ; yields {i32}:result2 = 28
%result3  = call i32 @llvm.atomic.lss( i32* %ptr, i32 5 )
                                    ; yields {i32}:result3 = 26
%memval   = load i32* %ptr          ; yields {i32}:memval = 21
%swapped  = icmp eq i32 %memval, 21 ; yields {i1}:swapped = true
Implementations:

SPARC:
; i0 = ptr
; i1 = delta
; o0 = result
lduw      [%i0] 0,%l0
1:
sub       %l0,%i1,%o0
casa      [%i0] 0,%l0,%o0
subcc     %l0,%o0,%l1
brne,pn   1
or        %o0,%g0,%l0

x86:
; eax = result
; ebx = ptr
; ecx = delta
movl      %ecx,%eax
negl      %eax
lock
xaddl     %eax,(%ebx)

Itanium:
; TODO

PPC:
; r0 = result
; r1 = ptr
; r2 = delta
spin:
lwarx     r0,0,r1
sub       r3,r0,r2
stwcx.    r3,0,r1
bne-      spin

MIPS:
; s0 = result
; s1 = ptr
; s2 = delta
spin:
ll        $s0,0($s1)
sub       $t1,$s0,$s2
sc        $t1,0($s1)
beq       $t1,$0,spin

ARM:
; r0 = result
; r1 = ptr
; r2 = delta
spin:
ldrex     r0,[r1]
sub       r3,r0,r2
strex     r4,r3,[r1]
cmp       r4,#0
bne       spin

Alpha:
; r0 = result
; r1 = ptr
; r2 = delta
spin:
ldl_l     r0,0(r1)
subl      r0,r2,r3
stl_c     r3,0(r1)
beq       r3,spin

'llvm.atomic.membarrier' Intrinsic

Syntax:

declare void @llvm.atomic.membarrier( i1 <ll>, i1 <ls>, i1 <sl>, i1 <ss> )

Overview:

The 'llvm.atomic.membarrier' intrinsic guarantees ordering between specific pairs of memory access types.

Arguments:

The 'llvm.atomic.membarrier' intrinsic requires four boolean arguments. Each argument enables a specific barrier as listed below.

ll: load-load barrier
ls: load-store barrier
sl: store-load barrier
ss: store-store barrier

Semantics:

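This intrinsic causes the system to enforce some ordering constraints upon the loads and stores of the program. This barrier does not indicate when any events will occur; it only enforces the order in which they occur. For any of the specified pairs of load and store operations (for example, load-load or store-load), all of the first operations preceding the barrier will complete before any of the second operations succeeding the barrier begin. Specifically, the semantics for each pairing are as follows:

ll: All loads before the barrier must complete before any load after the barrier begins.
ls: All loads before the barrier must complete before any store after the barrier begins.
sl: All stores before the barrier must complete before any load after the barrier begins.
ss: All stores before the barrier must complete before any store after the barrier begins.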

These semantics are applied with a logical "and" behavior when more than one is enabled in a single memory barrier intrinsic.
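To make the combination rule concrete, the sketch below maps the four flags onto C++11 fences. This is an approximation under stated assumptions and not part of the proposal: it relies on the common reading of an acquire fence as load-load plus load-store ordering, a release fence as load-store plus store-store ordering, and a sequentially consistent fence as the only one that also orders stores before later loads. The names 'fence_order_for' and 'membarrier' are hypothetical.

```cpp
#include <atomic>

// Hypothetical mapping from the four barrier flags to the weakest C++11
// fence that covers every requested pairing (it may order more than asked).
inline std::memory_order fence_order_for(bool ll, bool ls, bool sl, bool ss) {
    if (sl) return std::memory_order_seq_cst;   // store-load needs a full fence
    if (!ll && !ls && !ss) return std::memory_order_relaxed;  // nothing requested
    if (!ss) return std::memory_order_acquire;  // only ll and/or ls requested
    if (!ll) return std::memory_order_release;  // ss, optionally with ls
    return std::memory_order_acq_rel;           // both ll and ss requested
}

// Hypothetical stand-in for the proposed intrinsic: issues the chosen fence.
// A relaxed fence has no effect, matching a call with all flags false.
inline void membarrier(bool ll, bool ls, bool sl, bool ss) {
    std::atomic_thread_fence(fence_order_for(ll, ls, sl, ss));
}
```

Choosing a stronger fence than requested is always safe, which is the same graceful fallback the target outlines use when a precise barrier instruction is unavailable.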

Example:
%ptr      = malloc i32
            store i32 4, i32* %ptr

%result1  = load i32* %ptr      ; yields {i32}:result1 = 4
            call void @llvm.atomic.membarrier( i1 false, i1 true, i1 false, i1 false )
                                ; guarantee the above finishes
            store i32 8, i32* %ptr   ; before this begins
Implementations:

Barriers        SPARC                                                      x86     Itanium  PPC    MIPS  ARM  Alpha
ll,!ls,!sl,!ss  membar #LoadLoad                                           lfence  mf       sync   sync  dmb  mb
!ll,ls,!sl,!ss  membar #LoadStore                                          mfence  mf       sync   sync  dmb  mb
!ll,!ls,sl,!ss  membar #StoreLoad                                          mfence  mf       sync   sync  dmb  mb
!ll,!ls,!sl,ss  membar #StoreStore                                         sfence  mf       eieio  sync  dmb  wmb
ll,ls,!sl,!ss   membar #LoadLoad | #LoadStore                              mfence  mf       sync   sync  dmb  mb
ll,!ls,sl,!ss   membar #LoadLoad | #StoreLoad                              mfence  mf       sync   sync  dmb  mb
ll,!ls,!sl,ss   membar #LoadLoad | #StoreStore                             mfence  mf       sync   sync  dmb  mb
!ll,ls,sl,!ss   membar #LoadStore | #StoreLoad                             mfence  mf       sync   sync  dmb  mb
!ll,ls,!sl,ss   membar #LoadStore | #StoreStore                            mfence  mf       sync   sync  dmb  mb
!ll,!ls,sl,ss   membar #StoreStore | #StoreLoad                            mfence  mf       sync   sync  dmb  mb
ll,ls,sl,!ss    membar #LoadLoad | #LoadStore | #StoreLoad                 mfence  mf       sync   sync  dmb  mb
ll,ls,!sl,ss    membar #LoadLoad | #LoadStore | #StoreStore                mfence  mf       sync   sync  dmb  mb
ll,!ls,sl,ss    membar #LoadLoad | #StoreLoad | #StoreStore                mfence  mf       sync   sync  dmb  mb
!ll,ls,sl,ss    membar #LoadStore | #StoreStore | #StoreLoad               mfence  mf       sync   sync  dmb  mb
ll,ls,sl,ss     membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad   mfence  mf       sync   sync  dmb  mb

GCC Builtin Lowering

A primary use of these intrinsics is to provide a lowering for GCC builtin functions representing atomic constructs and memory synchronization. The following attempts to provide starting points for those implementations. These should be correct in principle, although they will likely need adjustment before being used directly in the front end. Additionally, the GCC specification is quite vague about memory barriers. These implementations assume a very conservative interpretation.

result = __sync_fetch_and_add( <ty>* ptr, <ty> value )
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%result   = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )

result = __sync_fetch_and_sub( <ty>* ptr, <ty> value )
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%result   = call <ty> @llvm.atomic.lss( <ty>* %ptr, <ty> %value )

result = __sync_fetch_and_or( <ty>* ptr, <ty> value )
Spin:
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val  = load <ty>* %ptr
%or_val   = or <ty> %ptr_val, %value
%result   = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %or_val )
%success  = icmp eq <ty> %result, %ptr_val
            br i1 %success, label %Done, label %Spin
Done:

result = __sync_fetch_and_and( <ty>* ptr, <ty> value )
Spin:
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val  = load <ty>* %ptr
%and_val  = and <ty> %ptr_val, %value
%result   = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %and_val )
%success  = icmp eq <ty> %result, %ptr_val
            br i1 %success, label %Done, label %Spin
Done:

result = __sync_fetch_and_xor( <ty>* ptr, <ty> value )
Spin:
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val  = load <ty>* %ptr
%xor_val  = xor <ty> %ptr_val, %value
%result   = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %xor_val )
%success  = icmp eq <ty> %result, %ptr_val
            br i1 %success, label %Done, label %Spin
Done:

result = __sync_fetch_and_nand( <ty>* ptr, <ty> value )
Spin:
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val  = load <ty>* %ptr
%and_val  = and <ty> %ptr_val, %value
%nand_val = xor <ty> %and_val, -1
%result   = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %nand_val )
%success  = icmp eq <ty> %result, %ptr_val
            br i1 %success, label %Done, label %Spin
Done:

result = __sync_add_and_fetch( <ty>* ptr, <ty> value )
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%old_val  = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )
%result   = add <ty> %old_val, %value

result = __sync_sub_and_fetch( <ty>* ptr, <ty> value )
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%old_val  = call <ty> @llvm.atomic.lss( <ty>* %ptr, <ty> %value )
%result   = sub <ty> %old_val, %value

result = __sync_or_and_fetch( <ty>* ptr, <ty> value )
Spin:
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val  = load <ty>* %ptr
%result   = or <ty> %ptr_val, %value
%old_val  = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %result )
%success  = icmp eq <ty> %old_val, %ptr_val
            br i1 %success, label %Done, label %Spin
Done:

result = __sync_and_and_fetch( <ty>* ptr, <ty> value )
Spin:
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val  = load <ty>* %ptr
%result   = and <ty> %ptr_val, %value
%old_val  = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %result )
%success  = icmp eq <ty> %old_val, %ptr_val
            br i1 %success, label %Done, label %Spin
Done:

result = __sync_xor_and_fetch( <ty>* ptr, <ty> value )
Spin:
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val  = load <ty>* %ptr
%result   = xor <ty> %ptr_val, %value
%old_val  = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %result )
%success  = icmp eq <ty> %old_val, %ptr_val
            br i1 %success, label %Done, label %Spin
Done:

result = __sync_nand_and_fetch( <ty>* ptr, <ty> value )
Spin:
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val  = load <ty>* %ptr
%and_val  = and <ty> %ptr_val, %value
%result   = xor <ty> %and_val, -1
%old_val  = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %result )
%success  = icmp eq <ty> %old_val, %ptr_val
            br i1 %success, label %Done, label %Spin
Done:

result = __sync_val_compare_and_swap( <ty>* ptr, <ty> oldval, <ty> newval )
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%result   = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %oldval, <ty> %newval )

result = __sync_bool_compare_and_swap( <ty>* ptr, <ty> oldval, <ty> newval )
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%tmp      = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %oldval, <ty> %newval )
%result   = icmp eq <ty> %tmp, %oldval

__sync_synchronize()
            call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )

result = __sync_lock_test_and_set( <ty>* ptr, <ty> value )
%result   = call <ty> @llvm.atomic.swap( <ty>* %ptr, <ty> %value )
            call void @llvm.atomic.membarrier( i1 false, i1 false, i1 true, i1 true )

__sync_lock_release( <ty>* ptr )
            call void @llvm.atomic.membarrier( i1 false, i1 true, i1 false, i1 true )
            store <ty> 0, <ty>* %ptr
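
For context on how the last two builtins are typically paired, the following C++ sketch builds a simple spin lock from them. It assumes a GCC-compatible compiler that provides the '__sync' builtins; the 'SpinLock' type and its member names are hypothetical and serve only as an illustration of the acquire and release roles discussed above.

```cpp
// Hypothetical spin lock built on the GCC '__sync' builtins.
// Assumes a GCC-compatible compiler (for example GCC or Clang).
struct SpinLock {
    int state = 0;  // 0 = free, 1 = held

    // Attempts to take the lock once. __sync_lock_test_and_set writes 1 and
    // returns the previous value, so a previous value of 0 means success.
    bool try_lock() { return __sync_lock_test_and_set(&state, 1) == 0; }

    // Spins until the lock is taken.
    void lock() {
        while (!try_lock()) {
        }
    }

    // Writes 0 to the lock word, making the lock available again.
    void unlock() { __sync_lock_release(&state); }
};
```

The swap in 'try_lock' corresponds to the 'llvm.atomic.swap' lowering, and the store of zero in 'unlock' corresponds to the barrier followed by a plain store in the '__sync_lock_release' lowering.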

References