Module Multicore_magic

This is a library of magic multicore utilities intended for experts for extracting the best possible performance from multicore OCaml.

Hopefully future releases of multicore OCaml will make this library obsolete!

Helpers for using padding to avoid false sharing

val copy_as_padded : 'a -> 'a

Depending on the object, either creates a shallow clone of it or returns it as is. When cloned, the clone will have extra padding words added after the last used word.

This is designed to help avoid false sharing. False sharing has a negative impact on multicore performance. Accesses of both atomic and non-atomic locations, whether read-only or read-write, may suffer from false sharing.

The intended use case for this is to pad all long lived objects that are being accessed highly frequently (read or written).

Many kinds of objects can be padded, for example:

let padded_atomic = Multicore_magic.copy_as_padded (Atomic.make 101)

let padded_ref = Multicore_magic.copy_as_padded (ref 42)

let padded_record = Multicore_magic.copy_as_padded {
  number = 76;
  pointer = 1 :: 2 :: 3 :: [];
}

let padded_variant = Multicore_magic.copy_as_padded (Some 1)

Padding changes the length of an array. If you need to pad an array, use make_padded_array.

val copy_as : ?padded:bool -> 'a -> 'a

copy_as x by default simply returns x. When ~padded:true is explicitly specified, returns copy_as_padded x.

val make_padded_array : int -> 'a -> 'a array

Creates a padded array. The length of the returned array includes padding. Use length_of_padded_array to get the unpadded length.

val length_of_padded_array : 'a array -> int

Returns the length of an array created by make_padded_array without the padding.

WARNING: This is not guaranteed to work with copy_as_padded.

val length_of_padded_array_minus_1 : 'a array -> int

Returns the length of an array created by make_padded_array without the padding minus 1.

WARNING: This is not guaranteed to work with copy_as_padded.

Missing Atomic operations

val fenceless_get : 'a Stdlib.Atomic.t -> 'a

Get a value from the atomic without performing an acquire fence.

Consider the following prototypical example of a lock-free algorithm:

let rec prototypical_lock_free_algorithm () =
  let expected = Atomic.get atomic in
  let desired = (* computed from expected *) in
  if not (Atomic.compare_and_set atomic expected desired) then
    (* failure, maybe retry *)
  else
    (* success *)

A potential performance problem with the above example is that it performs two acquire fences. Both the Atomic.get and the Atomic.compare_and_set perform an acquire fence. This may have a negative impact on performance.

Assuming the first fence is not necessary, we can rewrite the example using fenceless_get as follows:

let rec prototypical_lock_free_algorithm () =
  let expected = Multicore_magic.fenceless_get atomic in
  let desired = (* computed from expected *) in
  if not (Atomic.compare_and_set atomic expected desired) then
    (* failure, maybe retry *)
  else
    (* success *)

Now only a single acquire fence is performed by Atomic.compare_and_set and performance may be improved.

val fenceless_set : 'a Stdlib.Atomic.t -> 'a -> unit

Set the value of an atomic without performing a full fence.

Consider the following example:

let new_atomic = Atomic.make dummy_value in
(* prepare data_structure referring to new_atomic *)
Atomic.set new_atomic data_structure;
(* publish the data_structure: *)
Atomic.exchance old_atomic data_structure

A potential performance problem with the above example is that it performs two full fences. Both the Atomic.set used to initialize the data structure and the Atomic.exchange used to publish the data structure perform a full fence. The same would also apply in cases where Atomic.compare_and_set or Atomic.set would be used to publish the data structure. This may have a negative impact on performance.

Using fenceless_set we can rewrite the example as follows:

let new_atomic = Atomic.make dummy_value in
(* prepare data_structure referring to new_atomic *)
Multicore_magic.fenceless_set new_atomic data_structure;
(* publish the data_structure: *)
Atomic.exchance old_atomic data_structure

Now only a single full fence is performed by Atomic.exchange and performance may be improved.

val fence : int Stdlib.Atomic.t -> unit

Perform a full acquire-release fence on the given atomic.

fence atomic is equivalent to ignore (Atomic.fetch_and_add atomic 0).

Fixes and workarounds

module Transparent_atomic : sig ... end

A replacement for Stdlib.Atomic with fixes and performance improvements

Missing functionality

module Atomic_array : sig ... end

Array of (potentially unboxed) atomic locations.

Avoiding contention

val instantaneous_domain_index : unit -> int

instantaneous_domain_index () potentially (re)allocates and returns a non-negative integer "index" for the current domain. The indices are guaranteed to be unique among the domains that exist at a point in time. Each call of instantaneous_domain_index () may return a different index.

The intention is that the returned value can be used as an index into a contention avoiding parallelism safe data structure. For example, a naïve scalable increment of one counter from an array of counters could be done as follows:

let incr counters =
  (* Assuming length of [counters] is a power of two and larger than
     the number of domains. *)
  let mask = Array.length counters - 1 in
  let index = instantaneous_domain_index () in
  Atomic.incr counters.(index land mask)

The implementation ensures that the indices are allocated as densely as possible at any given moment. This should allow allocating as many counters as needed and essentially eliminate contention.

On OCaml 4 instantaneous_domain_index () will always return 0.