The LLVM representation aims to be lightweight and low-level while remaining expressive, typed, and extensible. It aims to be a "universal IR" of sorts: low-level enough that high-level ideas can be mapped onto it cleanly, much as microprocessors act as "universal IRs" onto which many source languages are mapped.
This proposal extends LLVM's "universal IR" to include hardware constructs for atomic operations and memory synchronization. It provides an interface to the hardware, not an interface to the programmer. It is aimed at a level low enough that any programming model or API needing atomic behavior can map cleanly onto it, and it is modeled primarily on hardware behavior. Just as hardware provides a "universal IR" for source languages, it also provides a starting point for developing a "universal" atomic operation and synchronization IR.
The proposal is for an LLVM hardware interface. It is not an API in the manner of high-level threading libraries, software transactional memory systems, or the lower-level atomic primitives and intrinsic functions found in BSD, GNU libc, atomic_ops, APR, and other system and application libraries. The hardware interface provided by LLVM should allow a clean implementation of all of these APIs and parallel programming models. No one model or paradigm should be favored over the others unless the hardware itself ubiquitously does so.
Understanding the hardware is the first step toward crafting a representation that is unbiased toward any particular language and maps clearly onto the hardware. The various LLVM target architectures were surveyed to determine which capabilities the hardware provides directly. All of these targets provide sufficient functionality to achieve the atomicity needed by APIs and programming interfaces. The targets differ in how they provide the necessary atomicity and in how much is directly available through hardware constructs. The following table summarizes the hardware constructs provided across the various architectures.
| Architectures | Memory Synchronization | Atomic Compare and Swap | Atomic Test and Set | Atomic Swap | Atomic Add | Atomic Sub | Atomic Increment | Atomic Decrement |
|---|---|---|---|---|---|---|---|---|
| SPARC | MEMBAR | CASA and CASXA | Unreliable | Deprecated | N/A | N/A | N/A | N/A |
| x86 / x86_64 | MFENCE, SFENCE, LFENCE | LOCK CMPXCHG | LOCK BTS | LOCK XCHG | LOCK XADD | LOCK XADD | LOCK INC | LOCK DEC |
| ia64 | mf and .acq / .rel modifiers | cmpxchg | N/A | N/A | fetchadd | fetchadd | N/A | N/A |
| PPC | sync | Spinning Conditional Load/Store: lwarx and stwcx | ||||||
| MIPS | SYNC | Spinning Conditional Load/Store: LL and SC | ||||||
| ARM | DMB | Spinning Conditional Load/Store: LDREX and STREX | ||||||
| Alpha | mb, wmb | Spinning Conditional Load/Store: ld*_l and st*_c | ||||||
All architectures provide some memory synchronization functionality, often called memory barriers or fences. There are two ways of implementing these. The first provides memory constraints only for a specific operation, usually as a modifier on that operation. The second uses a standalone instruction that enforces some ordering of memory accesses relative to that instruction.
The only architecture that supports operation-based synchronization constructs is the Itanium, and it still provides a coarse-grained standalone instruction. Because all other target architectures represent these constraints as standalone constructs, and because an operation-based constraint cannot guard other operations, a standalone representation was chosen for this proposal. While some standalone representation is necessary, the operations themselves could later be extended with built-in memory constraints if future hardware developments or further Itanium support demand it. That would be an incremental improvement on this proposal and is not inhibited by using a coarse-grained standalone representation.
The single instruction constraints can, at their most flexible, constrain any set of possible pairings of loads from memory and stores to memory. That is, they can provide a barrier between loads and stores, between loads and loads, between stores and loads, etc. These pairings can then be combined logically to provide a barrier between loads and loads, as well as between loads and stores, all with a single instruction constraint. This most flexible arrangement was selected for the proposal in order to efficiently provide all available memory constraint constructs on the hardware targets. A graceful fallback to a sufficient representation is always provided.
Atomic operations on all architectures provide a mechanism to modify the value of data in memory atomically. These can be thought of as read-modify-write or load-modify-store operations where all three actions occur atomically. They make up the basis for synchronization constructs such as mutual exclusion locks and semaphores. Note, however, that these are typically not blocking operations, nor do they directly provide locks or semaphores. They simply provide atomicity guarantees needed to efficiently and effectively implement these and other concurrent thread synchronization mechanisms.
The most important of these operations is the compare-and-swap construct. This operation has a consensus number of +Inf, which allows it to support many synchronization constructs that other operations do not. Specifically, test-and-set, fetch-and-add, and other constructs are not sufficient for non-blocking resolution of consensus between more than two processes or threads. All of the hardware targets for LLVM support some form of compare and swap, allowing this to be the cornerstone atomic operation of this proposal.
Several architectures provide address reservation and a conditional store mechanism (some literature calls this load-linked/store-conditional) to implement atomic operations. These constructs allow the operation to be performed immediately and locally. The atomicity of the operation with respect to a specific address is then checked, and if it completed successfully, the result is stored back to memory. Otherwise, atomicity was lost, and the operation is performed again until it achieves an atomic sequence. This is similar to transaction systems, where a transaction is not committed to the material database until its atomicity has been assured.
The same retry approach is also used on other architectures when a specific operation is not directly supported atomically in hardware. In these cases, the hardware compare-and-swap is used to attempt the operation repeatedly until atomicity has been achieved. Neither of these uses is blocking, and both retain full atomic behavior. They are simply the forms these atomic actions take on the respective hardware implementations.
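The retry pattern described above can be sketched in C++ with `std::atomic`. This is only an illustration under assumptions: the helper name `fetch_add_via_cas` is hypothetical and not part of the proposal, and `compare_exchange_weak` stands in for whatever compare-and-swap the hardware offers.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the proposal): fetch-and-add built only
// from compare-and-swap, retried until the update lands atomically.
inline std::uint32_t fetch_add_via_cas(std::atomic<std::uint32_t>& cell,
                                       std::uint32_t delta) {
    std::uint32_t observed = cell.load();
    // On failure compare_exchange_weak stores the value it found into
    // `observed`, so each retry recomputes the sum from fresh data.
    while (!cell.compare_exchange_weak(observed, observed + delta)) {
    }
    return observed;  // value held immediately before the successful update
}
```

The loop never blocks on another thread: a failed attempt only means some other update won the race, and the next attempt starts from that newer value.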
The proposed LLVM representation of these atomic operations and synchronization constructs of modern parallel systems consists of five intrinsic functions. Each is presented below and documented similarly to those defined in LLVM's Language Reference. Additionally, rough outlines for the implementation of these intrinsics are provided for all of LLVM's major target architectures. The representation includes all major atomic behaviors available on these architectures. They can be easily emulated on any architecture which does not support them natively, and by providing LLVM representations for them, the supporting architectures can lower them correctly to their native representations.
declare <ty> @llvm.atomic.cas( <ty>* <ptr>, <ty> <cmp>, <ty> <swap> )
This compares a value in shared memory to a given value. If they are equal, the value in the shared memory is swapped with some value.
The 'llvm.atomic.cas' intrinsic takes three arguments. The 'ptr' must be a pointer to a value of the type 'ty'. Both 'cmp' and 'swap' must be values of type 'ty'. All three of the 'ty' types must be the same integer type. They can be any size integer, but the targets may only lower integer representations they support.
This entire intrinsic must be executed atomically. It first compares the value in shared memory pointed to by 'ptr' with the value 'cmp'. If they are equal, the value in memory is replaced with the value of 'swap', else the value in memory remains the same. The value originally stored in memory is yielded in either case.
This operation does not perform a true swap due to the semantics of SSA. Rather, it yields the value that would be swapped into 'swap'. It yields this value in all cases, even when the change to memory was not performed, allowing a quick check against 'cmp' to determine success or failure of the swap, and immediately have the actual memory value available.
%ptr = malloc i32
store i32 4, i32* %ptr
%val1 = add i32 4, 4
%result1 = call i32 @llvm.atomic.cas( i32* %ptr, i32 4, i32 %val1 )
; yields {i32}:result1 = 4
%swapped1 = icmp eq i32 %result1, 4 ; yields {i1}:swapped1 = true
%memval1 = load i32* %ptr ; yields {i32}:memval1 = 8
%val2 = add i32 1, 1
%result2 = call i32 @llvm.atomic.cas( i32* %ptr, i32 5, i32 %val2 )
; yields {i32}:result2 = 8
%swapped2 = icmp eq i32 %result2, 5 ; yields {i1}:swapped2 = false
%memval2 = load i32* %ptr ; yields {i32}:memval2 = 8
| Architecture | Registers | Instruction sequence |
|---|---|---|
| SPARC | i0 = ptr, i1 = cmp, i2 = swap, o0 = result | `or %g0,%i2,%o0`<br>`casa [%i0] 0,%i1,%o0` |
| x86 | eax = cmp, ebx = ptr, ecx = swap, edx = result | `movl %ecx,%edx`<br>`lock cmpxchgl %edx,(%ebx)`<br>`movl %eax,%edx` |
| Itanium | in0 = ptr, in1 = cmp, in2 = swap, ret0 = result | `mov r1=in0`<br>`mov ar.ccv=in1`<br>`mov r2=in2`<br>`cmpxchg4 r0=[r1],r2`<br>`mov ret0=r0` |
| PPC | r0 = result, r1 = ptr, r2 = cmp, r3 = swap | `spin: lwarx r0,0,r1`<br>`cmpw r0,r2`<br>`bne- exit`<br>`stwcx. r3,0,r1`<br>`bne- spin`<br>`exit:` |
| MIPS | s0 = result, s1 = ptr, s2 = cmp, s3 = swap | `spin: add $t0,$s3,$0`<br>`ll $s0,0($s1)`<br>`bne $s0,$s2,exit`<br>`sc $t0,0($s1)`<br>`beq $t0,$0,spin`<br>`exit:` |
| ARM | r0 = result, r1 = ptr, r2 = cmp, r3 = swap | `spin: ldrex r0,[r1]`<br>`cmp r0,r2`<br>`b.ne exit`<br>`strex r4,r3,[r1]`<br>`cmp r4,0`<br>`b.ne spin`<br>`exit:` |
| Alpha | r0 = result, r1 = ptr, r2 = cmp, r3 = swap | `spin: ldl_l r0,0(r1)`<br>`cmpeq r0,r2,r4`<br>`beq r4,exit`<br>`mov r3,r4`<br>`stl_c r4,0(r1)`<br>`beq r4,spin`<br>`exit:` |
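The compare-and-swap semantics described above (always yield the original memory value, store only on a match) can be modeled in C++ as a rough sketch. The function name `atomic_cas` is an assumption made for illustration; it mirrors the intrinsic's argument order but is not the proposed interface itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical C++ model of the semantics described for llvm.atomic.cas.
inline std::int32_t atomic_cas(std::atomic<std::int32_t>& mem,
                               std::int32_t cmp, std::int32_t swap) {
    std::int32_t expected = cmp;
    // Success: memory held `cmp`, it now holds `swap`, `expected` is unchanged.
    // Failure: memory is untouched and `expected` receives the actual value.
    mem.compare_exchange_strong(expected, swap);
    return expected;  // the original memory value in both cases
}
```

A caller detects success by comparing the returned value with `cmp`, exactly as the `icmp eq` instructions do in the IR example.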
declare <ty> @llvm.atomic.swap( <ty>* <ptr>, <ty> <swap> )
This intrinsic swaps the value stored in shared memory at 'ptr' with 'swap' and yields the value from memory.
The 'llvm.atomic.swap' intrinsic takes two arguments. The first, 'ptr', is a pointer to a value of type 'ty'. The 'swap' argument must be of type 'ty'. The type 'ty' must be an integer type and can be of any size, but the targets may only lower integer representations they support.
This intrinsic loads the value pointed to by 'ptr', and stores 'swap' back into it atomically. Due to SSA rules, the value from memory is yielded rather than stored in 'swap'.
%ptr = malloc i32
store i32 4, i32* %ptr
%val1 = add i32 4, 4
%result1 = call i32 @llvm.atomic.swap( i32* %ptr, i32 %val1 )
; yields {i32}:result1 = 4
%swapped1 = icmp eq i32 %result1, 4 ; yields {i1}:swapped1 = true
%memval1 = load i32* %ptr ; yields {i32}:memval1 = 8
%val2 = add i32 1, 1
%result2 = call i32 @llvm.atomic.swap( i32* %ptr, i32 %val2 )
; yields {i32}:result2 = 8
%swapped2 = icmp eq i32 %result2, 8 ; yields {i1}:swapped2 = true
%memval2 = load i32* %ptr ; yields {i32}:memval2 = 2
| Architecture | Registers | Instruction sequence |
|---|---|---|
| SPARC | i0 = ptr, i1 = swap, o0 = result | `lduw [%i0] 0,%l0`<br>`1: or %i1,%g0,%o0`<br>`casa [%i0] 0,%l0,%o0`<br>`subcc %l0,%o0,%l1`<br>`brne,pn 1`<br>`or %o0,%g0,%l0` |
| x86 | eax = result, ebx = ptr, ecx = swap | `movl %ecx,%eax`<br>`xchg %eax,(%ebx)` |
| Itanium | | TODO |
| PPC | r0 = result, r1 = ptr, r2 = swap | `spin: lwarx r0,0,r1`<br>`stwcx. r2,0,r1`<br>`bne- spin` |
| MIPS | s0 = result, s1 = ptr, s2 = swap | `spin: add $t0,$s2,$0`<br>`ll $s0,0($s1)`<br>`sc $t0,0($s1)`<br>`beq $t0,$0,spin` |
| ARM | r0 = result, r1 = ptr, r2 = swap | `spin: ldrex r0,[r1]`<br>`strex r3,r2,[r1]`<br>`cmp r3,0`<br>`b.ne spin` |
| Alpha | r0 = result, r1 = ptr, r2 = swap | `spin: ldl_l r0,0(r1)`<br>`mov r2,r3`<br>`stl_c r3,0(r1)`<br>`beq r3,spin`<br>`exit:` |
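For comparison, the unconditional swap described above corresponds closely to `std::atomic::exchange` in C++. The wrapper below is a minimal sketch; the name `atomic_swap` is hypothetical and chosen only to echo the intrinsic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical C++ model of the semantics described for llvm.atomic.swap:
// store `swap` unconditionally and yield the value it replaced.
inline std::int32_t atomic_swap(std::atomic<std::int32_t>& mem,
                                std::int32_t swap) {
    return mem.exchange(swap);
}
```

Unlike compare-and-swap, the store always happens, so the returned value carries information but never signals failure.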
declare <ty> @llvm.atomic.las( <ty>* <ptr>, <ty> <delta> )
This intrinsic adds 'delta' to the value stored in shared memory at 'ptr'. It yields the original value at 'ptr'.
The intrinsic takes two arguments, the first a pointer to a value of type 'ty' and the second a value of type 'ty'. The type 'ty' must be an integer type and can be of any size, but the targets may only lower integer representations they support.
This intrinsic performs a series of operations atomically. It first loads the value stored at 'ptr'. It then adds 'delta' and stores the result to 'ptr'. It yields the original value stored at 'ptr'.
%ptr = malloc i32
store i32 4, i32* %ptr
%result1 = call i32 @llvm.atomic.las( i32* %ptr, i32 4 )
; yields {i32}:result1 = 4
%result2 = call i32 @llvm.atomic.las( i32* %ptr, i32 2 )
; yields {i32}:result2 = 8
%result3 = call i32 @llvm.atomic.las( i32* %ptr, i32 5 )
; yields {i32}:result3 = 10
%memval = load i32* %ptr ; yields {i32}:memval = 15
%swapped = icmp eq i32 %memval, 15 ; yields {i1}:swapped = true
| Architecture | Registers | Instruction sequence |
|---|---|---|
| SPARC | i0 = ptr, i1 = delta, o0 = result | `lduw [%i0] 0,%l0`<br>`1: add %i1,%l0,%o0`<br>`casa [%i0] 0,%l0,%o0`<br>`subcc %l0,%o0,%l1`<br>`brne,pn 1`<br>`or %o0,%g0,%l0` |
| x86 | eax = result, ebx = ptr, ecx = delta | `movl %ecx,%eax`<br>`lock xaddl %eax,(%ebx)` |
| Itanium | | TODO |
| PPC | r0 = result, r1 = ptr, r2 = delta | `spin: lwarx r0,0,r1`<br>`add r3,r2,r0`<br>`stwcx. r3,0,r1`<br>`bne- spin` |
| MIPS | s0 = result, s1 = ptr, s2 = delta | `spin: ll $s0,0($s1)`<br>`add $t1,$s0,$s2`<br>`sc $t1,0($s1)`<br>`beq $t1,$0,spin` |
| ARM | r0 = result, r1 = ptr, r2 = delta | `spin: ldrex r0,[r1]`<br>`add r3,r0,r2`<br>`strex r4,r3,[r1]`<br>`cmp r4,0`<br>`b.ne spin` |
| Alpha | r0 = result, r1 = ptr, r2 = delta | `spin: ldl_l r0,0(r1)`<br>`addl r0,r2,r3`<br>`stl_c r3,0(r1)`<br>`beq r3,spin`<br>`exit:` |
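The load-add-store behavior described above matches what C++ exposes as `fetch_add`. The following sketch replays the IR example's sequence; `atomic_las` is a hypothetical name used only to line up with the intrinsic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical C++ model of the semantics described for llvm.atomic.las:
// add `delta` to memory atomically and yield the value from before the add.
inline std::int32_t atomic_las(std::atomic<std::int32_t>& mem,
                               std::int32_t delta) {
    return mem.fetch_add(delta);
}
```

Because the old value is returned, concurrent callers each observe a distinct intermediate value, which is what makes this operation useful for counters and ticket-style allocation.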
declare <ty> @llvm.atomic.lss( <ty>* <ptr>, <ty> <delta> )
This intrinsic subtracts 'delta' from the value stored in shared memory at 'ptr'. It yields the original value at 'ptr'.
The intrinsic takes two arguments, the first a pointer to a value of type 'ty' and the second a value of type 'ty'. The type 'ty' must be an integer type and can be of any size, but the targets may only lower integer representations they support.
This intrinsic performs a series of operations atomically. It first loads the value stored at 'ptr'. It then subtracts 'delta' and stores the result to 'ptr'. It yields the original value stored at 'ptr'.
%ptr = malloc i32
store i32 32, i32* %ptr
%result1 = call i32 @llvm.atomic.lss( i32* %ptr, i32 4 )
; yields {i32}:result1 = 32
%result2 = call i32 @llvm.atomic.lss( i32* %ptr, i32 2 )
; yields {i32}:result2 = 28
%result3 = call i32 @llvm.atomic.lss( i32* %ptr, i32 5 )
; yields {i32}:result3 = 26
%memval = load i32* %ptr ; yields {i32}:memval = 21
%swapped = icmp eq i32 %memval, 21 ; yields {i1}:swapped = true
| Architecture | Registers | Instruction sequence |
|---|---|---|
| SPARC | i0 = ptr, i1 = delta, o0 = result | `lduw [%i0] 0,%l0`<br>`1: sub %l0,%i1,%o0`<br>`casa [%i0] 0,%l0,%o0`<br>`subcc %l0,%o0,%l1`<br>`brne,pn 1`<br>`or %o0,%g0,%l0` |
| x86 | eax = result, ebx = ptr, ecx = delta | `movl %ecx,%eax`<br>`negl %eax`<br>`lock xaddl %eax,(%ebx)` |
| Itanium | | TODO |
| PPC | r0 = result, r1 = ptr, r2 = delta | `spin: lwarx r0,0,r1`<br>`sub r3,r0,r2`<br>`stwcx. r3,0,r1`<br>`bne- spin` |
| MIPS | s0 = result, s1 = ptr, s2 = delta | `spin: ll $s0,0($s1)`<br>`sub $t1,$s0,$s2`<br>`sc $t1,0($s1)`<br>`beq $t1,$0,spin` |
| ARM | r0 = result, r1 = ptr, r2 = delta | `spin: ldrex r0,[r1]`<br>`sub r3,r0,r2`<br>`strex r4,r3,[r1]`<br>`cmp r4,0`<br>`b.ne spin` |
| Alpha | r0 = result, r1 = ptr, r2 = delta | `spin: ldl_l r0,0(r1)`<br>`subl r0,r2,r3`<br>`stl_c r3,0(r1)`<br>`beq r3,spin`<br>`exit:` |
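The x86 sequence negates 'delta' and then performs an atomic add, which suggests that subtraction needs no dedicated hardware support wherever an atomic add exists. A rough C++ sketch of that idea follows; the name `atomic_lss_via_add` is hypothetical, and unsigned arithmetic is an assumption chosen so that the negation wraps without undefined behavior.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: atomic subtract expressed as an atomic add of the
// negated operand, in the spirit of the negl + lock xaddl sequence.
inline std::uint32_t atomic_lss_via_add(std::atomic<std::uint32_t>& mem,
                                        std::uint32_t delta) {
    // Unsigned negation wraps modulo 2^32, so adding (0 - delta) subtracts delta.
    return mem.fetch_add(0u - delta);
}
```

The returned value is still the one held before the update, so callers cannot tell the difference from a native atomic subtract.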
declare void @llvm.atomic.membarrier( i1 <ll>, i1 <ls>, i1 <sl>, i1 <ss> )
The 'llvm.atomic.membarrier' intrinsic guarantees ordering between specific pairs of memory access types.
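The 'llvm.atomic.membarrier' intrinsic requires four boolean arguments. Each argument enables one specific barrier: 'll' enables a load-load barrier, 'ls' a load-store barrier, 'sl' a store-load barrier, and 'ss' a store-store barrier.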
This intrinsic causes the system to enforce ordering constraints upon the loads and stores of the program. The barrier does not indicate when any events will occur; it only enforces an order in which they occur. For any of the specified pairs of load and store operations (e.g. load-load or store-load), all of the first operations preceding the barrier will complete before any of the second operations succeeding the barrier begin. Specifically, 'll' orders preceding loads before succeeding loads, 'ls' orders preceding loads before succeeding stores, 'sl' orders preceding stores before succeeding loads, and 'ss' orders preceding stores before succeeding stores.
%ptr = malloc i32
store i32 4, i32* %ptr
%result1 = load i32* %ptr ; yields {i32}:result1 = 4
call void @llvm.atomic.membarrier( i1 false, i1 true, i1 false, i1 false )
; guarantee the above finishes
store i32 8, %ptr ; before this begins
| Barrier flags | SPARC | x86 | Itanium | PPC | MIPS | ARM | Alpha |
|---|---|---|---|---|---|---|---|
| ll,!ls,!sl,!ss | membar #LoadLoad | lfence | mf | sync | sync | dmb | mb |
| !ll,ls,!sl,!ss | membar #LoadStore | mfence | mf | sync | sync | dmb | mb |
| !ll,!ls,sl,!ss | membar #StoreLoad | mfence | mf | sync | sync | dmb | mb |
| !ll,!ls,!sl,ss | membar #StoreStore | sfence | mf | eieio | sync | dmb | wmb |
| ll,ls,!sl,!ss | membar #LoadLoad \| #LoadStore | mfence | mf | sync | sync | dmb | mb |
| ll,!ls,sl,!ss | membar #LoadLoad \| #StoreLoad | mfence | mf | sync | sync | dmb | mb |
| ll,!ls,!sl,ss | membar #LoadLoad \| #StoreStore | mfence | mf | sync | sync | dmb | mb |
| !ll,ls,sl,!ss | membar #LoadStore \| #StoreLoad | mfence | mf | sync | sync | dmb | mb |
| !ll,ls,!sl,ss | membar #LoadStore \| #StoreStore | mfence | mf | sync | sync | dmb | mb |
| !ll,!ls,sl,ss | membar #StoreStore \| #StoreLoad | mfence | mf | sync | sync | dmb | mb |
| ll,ls,sl,!ss | membar #LoadLoad \| #LoadStore \| #StoreLoad | mfence | mf | sync | sync | dmb | mb |
| ll,ls,!sl,ss | membar #LoadLoad \| #LoadStore \| #StoreStore | mfence | mf | sync | sync | dmb | mb |
| ll,!ls,sl,ss | membar #LoadLoad \| #StoreLoad \| #StoreStore | mfence | mf | sync | sync | dmb | mb |
| !ll,ls,sl,ss | membar #LoadStore \| #StoreStore \| #StoreLoad | mfence | mf | sync | sync | dmb | mb |
| ll,ls,sl,ss | membar #LoadLoad \| #LoadStore \| #StoreStore \| #StoreLoad | mfence | mf | sync | sync | dmb | mb |
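The fallback idea behind these lowerings (pick the weakest available construct that is at least as strong as the request) can be illustrated with C++ fences. This is a loose sketch under stated assumptions: `fence_order_for` and `membarrier` are hypothetical names, and C++ fences are specified through synchronization between atomic operations rather than pairwise load/store ordering, so the mapping is an approximation and not a definition of the intrinsic.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical mapping from the four barrier flags to the weakest C++ fence
// order that is at least as strong as the requested pairings.
inline std::memory_order fence_order_for(bool ll, bool ls, bool sl, bool ss) {
    if (sl) return std::memory_order_seq_cst;        // store-load needs a full fence
    if (ll && ss) return std::memory_order_acq_rel;  // needs both halves
    if (ll) return std::memory_order_acquire;        // covers load-load and load-store
    if (ss) return std::memory_order_release;        // covers load-store and store-store
    if (ls) return std::memory_order_acquire;        // acquire or release would both do
    return std::memory_order_relaxed;                // nothing requested: no ordering
}

// Hypothetical wrapper that issues the selected fence.
inline void membarrier(bool ll, bool ls, bool sl, bool ss) {
    std::atomic_thread_fence(fence_order_for(ll, ls, sl, ss));
}
```

As in the table, only a few distinct fences exist on any one target, so most flag combinations collapse onto the strongest one.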
A primary use of these intrinsics is to provide a lowering for the GCC builtin functions that represent atomic constructs and memory synchronization. The following sketches provide starting points for those implementations. They should be correct in principle, although they will likely need adjustment before they can be used directly in the front end. Additionally, the GCC specification is quite vague about memory barriers, so these implementations assume a very conservative interpretation.
; __sync_fetch_and_add
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )
; __sync_fetch_and_sub
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%result = call <ty> @llvm.atomic.lss( <ty>* %ptr, <ty> %value )
; __sync_fetch_and_or
Spin:
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val = load <ty>* %ptr
%or_val = or <ty> %ptr_val, %value
%result = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %or_val )
%success = icmp eq <ty> %result, %ptr_val
br i1 %success, label %Done, label %Spin
Done:
; __sync_fetch_and_and
Spin:
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val = load <ty>* %ptr
%and_val = and <ty> %ptr_val, %value
%result = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %and_val )
%success = icmp eq <ty> %result, %ptr_val
br i1 %success, label %Done, label %Spin
Done:
; __sync_fetch_and_xor
Spin:
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val = load <ty>* %ptr
%xor_val = xor <ty> %ptr_val, %value
%result = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %xor_val )
%success = icmp eq <ty> %result, %ptr_val
br i1 %success, label %Done, label %Spin
Done:
; __sync_fetch_and_nand
Spin:
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val = load <ty>* %ptr
%and_val = and <ty> %ptr_val, %value
%nand_val = xor <ty> %and_val, -1
%result = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %nand_val )
%success = icmp eq <ty> %result, %ptr_val
br i1 %success, label %Done, label %Spin
Done:
; __sync_add_and_fetch
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%old_val = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )
%result = add <ty> %old_val, %value
; __sync_sub_and_fetch
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%old_val = call <ty> @llvm.atomic.lss( <ty>* %ptr, <ty> %value )
%result = sub <ty> %old_val, %value
; __sync_or_and_fetch
Spin:
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val = load <ty>* %ptr
%result = or <ty> %ptr_val, %value
%old_val = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %result )
%success = icmp eq <ty> %old_val, %ptr_val
br i1 %success, label %Done, label %Spin
Done:
; __sync_and_and_fetch
Spin:
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val = load <ty>* %ptr
%result = and <ty> %ptr_val, %value
%old_val = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %result )
%success = icmp eq <ty> %old_val, %ptr_val
br i1 %success, label %Done, label %Spin
Done:
; __sync_xor_and_fetch
Spin:
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val = load <ty>* %ptr
%result = xor <ty> %ptr_val, %value
%old_val = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %result )
%success = icmp eq <ty> %old_val, %ptr_val
br i1 %success, label %Done, label %Spin
Done:
; __sync_nand_and_fetch
Spin:
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%ptr_val = load <ty>* %ptr
%and_val = and <ty> %ptr_val, %value
%result = xor <ty> %and_val, -1
%old_val = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %ptr_val, <ty> %result )
%success = icmp eq <ty> %old_val, %ptr_val
br i1 %success, label %Done, label %Spin
Done:
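The operate-then-fetch loops above all share one shape: compute the new value from a snapshot, attempt a compare-and-swap, and yield the new value once the swap succeeds. A C++ sketch of the nand case is given here for comparison; `nand_and_fetch` is a hypothetical stand-in for the GCC builtin, written against `std::atomic` rather than the proposed intrinsics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: atomically replace the cell with ~(cell & value) and
// return the new contents, using only compare-and-swap.
inline std::uint32_t nand_and_fetch(std::atomic<std::uint32_t>& cell,
                                    std::uint32_t value) {
    std::uint32_t observed = cell.load();
    std::uint32_t desired = ~(observed & value);
    while (!cell.compare_exchange_weak(observed, desired)) {
        desired = ~(observed & value);  // recompute from the refreshed snapshot
    }
    return desired;  // the value that was actually stored
}
```

Returning `desired` rather than `observed` is the only difference from the fetch-then-operate variants.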
; __sync_val_compare_and_swap
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%result = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %oldval, <ty> %newval )
; __sync_bool_compare_and_swap
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%tmp = call <ty> @llvm.atomic.cas( <ty>* %ptr, <ty> %oldval, <ty> %newval )
%result = icmp eq <ty> %tmp, %oldval
call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
%result = call <ty> @llvm.atomic.swap( <ty>* %ptr, <ty> %value )
call void @llvm.atomic.membarrier( i1 false, i1 false, i1 true, i1 true )
call void @llvm.atomic.membarrier( sl, ss )
call void @llvm.atomic.membarrier( i1 false, i1 true, i1 false, i1 true )
store <ty> 0, <ty>* %ptr
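To show how an atomic swap plus partial barriers combine into a usable primitive, here is a minimal test-and-set lock in C++, in the style of GCC's `__sync_lock_test_and_set` and `__sync_lock_release` builtins. The `SpinLock` type and its member names are hypothetical, and the acquire and release orderings are an assumption about the intended barrier strength, not something this proposal specifies.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical minimal lock: 0 means free, 1 means held.
struct SpinLock {
    std::atomic<int> state{0};

    // Swap in 1; the lock was obtained only if the old value was 0.
    bool try_lock() {
        // Acquire: later accesses may not move before the swap.
        return state.exchange(1, std::memory_order_acquire) == 0;
    }

    // Retry the swap until it observes a free lock.
    void lock() {
        while (!try_lock()) {
        }
    }

    void unlock() {
        // Release: earlier accesses complete before the lock is seen as free.
        state.store(0, std::memory_order_release);
    }
};
```

The swap itself never blocks; waiting is entirely the caller's loop, which matches the earlier observation that the atomic operations are building blocks for locks rather than locks themselves.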