Posted on
27 Mar 2019

'On Vectorization of Algorithmic Adjoint C++ Code' is written by NAG Collaborators – Johannes Lotz, Klaus Leppkes, Uwe Naumann (RWTH Aachen University, Germany), and was originally published on QuantMinds

Modern C++ compilers are able to automatically vectorize suitable C++ code, that is they can create vector instructions in the object code based on the results of data-dependence analysis applied to the scalar source code. This capability relies on the programmer taking care of things like data alignment and pointer aliasing. In addition, compilers can be given hints in form of vendor-specific keywords or pragmas. For example, the GNU compiler supports __builtin_assume_aligned and the OpenMP standard provides #pragma simd to mark loops operating in SIMD (single instruction multiple data) mode. The respective loop bodies need to be simple; for example, Intel suggests to use "straight-line code (a single basic block)" only and to avoid "function calls".

Vectorization of AAD is essential unless one is willing to accept a substantial increase in run time of the adjoint code relative to the primal code. Without vectorization and depending on the properties of the primal target code as well as on the level of maturity of the AAD solution/tool this ratio often ranges somewhere between  two and ten. Failure to vectorize the adjoint can easily add an order of magnitude. The additional code complexity due to AAD implemented by operator and function overloading and possibly based on sophisticated template meta-programming methodology almost certainly prevents the compiler from autovectorization as illustrated by the following image.

We plot absolute run times for a Monte Carlo solution of a simple stochastic differential equation compiled with the Intel C++ compiler (version 19.0) on AVX2 hardware. The relative run time of the scalar adjoint generated by dco/c++ turns out to be approximately equal to two (blue columns). Auto-vectorization performs adequately for the primal. It fails on the adjoint (orange columns). Use of the dco/c++ vector capabilities not only regains vectorization for the adjoint. It even decreases the primal run time (yellow columns).

Consider the following simplified Monte Carlo code for computing N paths with a set of given parameters p and random variable vector Z of matching size. The output of each individual path is stored in the vector y .

std::vector<double> /* inputs  */ p(M),
/* outputs */ y(N);
for (size_t i = 0; i < N; ++i) {
/* single Monte Carlo path */
y[i] = f(p, Z[i]);
}

The path-wise adjoint method yields the following adjoint code with new input ya (adjoint of primal output y ) and new output pa (adjoint of primal input p ); only additional declarations are shown.

std::vector<double> ya(N), pa(M);
for (size_t i = 0; i < N; ++i) {
/* adjoint of single Monte Carlo path */
fa(p, pa, Z[i], ya[i]);
}


dco/c++ provides vector types in combination with corresponding vector instructions for use within the primal code. The SIMD length needs to be supplied as a compile time constant in form of a template argument. The memory occupied by variables of this type is required to be 32-byte aligned; a specialized allocator for std::vector comes with dco/c++. Moreover, malloc -like platform independent allocation and deallocation functions are available.

/* vector type with SIMD length L */
using vtype = dco::gv<double, L>::type;
std::vector<vtype> p(M);
/* => p is initialized here <= */
/* iterate chunk-wise */
for (auto range : dco::subranges<L>(N)) {
/* vectorized Monte Carlo paths */
vtype yi = f(p, Z[range.global_index]);
/* unpack chunk into y */
dco::vunpack(range, y, yi);
}


The dco/c++ vector type can be used as base type for tangent (directional) as well as adjoint derivative types yielding vectorized computation of tangents and adjoints. Recursive nesting of derivative types enables vectorized second- and higher-order tangents and adjoints. The following code shows a combination of a standard dco/c++ adjoint driver with vectorization.

using vtype = dco::gv<double, L>::type;
/* the adjoint mode and data type */
using amode = dco::ga1s<vtype>;
using atype = amode::type;
/* create tape data structure */
amode::global_tape = amode::tape_t::create();
std::vector<atype> p(M);
for (auto range : dco::subranges<L>(N)) {
/* define which derivatives we want */
amode::global_tape->register_variable(p, M);
/* record vectorized Monte Carlo paths */
atype yi = f(p, Z[range.global_index]);
for (auto i : range) {
y[i.global_index] = dco::value(yi)[i.sub_index];
dco::derivative(yi)[i.sub_index] = ya[i.global_index];
}