Multi-Dimensional Finite Difference Methods on a GPU

One key aspect for the performance of multi-dimensional finite difference methods based on operator splitting is the performance of the underlying tridiagonal system solver [1]. The authors in [2] have analysed different solver strategies and they have reported a speed-up factor of around 15 between GPU and CPU for the best solver strategy (cyclic reduction) on very large systems. CUDA’s sparse matrix library cuSPARSE contains a routine to solve tridiagonal systems based on cyclic reduction. nVIDIA claims a speed-up factor of  around ten compared against Intel’s MKL library [3].

These factors are smaller than the speed-up factors reported for pure Monte-Carlo pricing algorithms. Main reason is that a tridiagonal system solver can not be parallelised by a simple divide and conquer approach like many Monte-Carlo pricing algorithms.

Main classes of the GPU implementation of the Hundsdorfer-Verwer operator splitting scheme are:

  1. GPUFdmLinearOp: implements the FdmLinearOp interface and can be initialized from any instance of FdmLinearOp/FdmLinearOpComposite via the toMatrix()/toMatrixDecomp() methods.
  2. GPUTripleBandLinearOp: GPU based implementation of TripleBandLinearOp. This linear operator can be initialized from any instance of the classes FdmLinearOp/FdmLinearOpComposite via the toMatrix/toMatrixDecomp() methods. The solver for the tridiagonal system is based on cuSPARSE.
  3. GPUHundsdorferScheme: GPU based implementation of the Hundsdorfer-Verwer operator splitting scheme.

plot

plotThe code is available here and it is based on the latest QuantLib version from the trunk, CUDA 4.1 or higher and Cusp 0.3 or higher.  As can be seen from the diagrams above the speed-up factor depends on the problem size. Speed-up factors above ten can only be achieved for very large two-dimensional or for medium-sized three-dimensional problems. The GPU precision is defined in cudatypes.hpp by the typedef for the type “real”.

[1] Karel in ’t Hout, ADI finite difference schemes for the Heston model with correlation.

[2] Pablo Quesada-Barriuso, Julián Lamas-Rodríguez, Dora B. Heras, Montserrat Bóo, Francisco Argüello, Selecting the Best Tridiagonal System Solver Projected on Multi-Core CPU and GPU Platforms.

[3] nVIDIA, cuSPARSE

The Sobol Brownian Bridge Generator on a GPU

The biggest mistake that can be made with quasi random numbers is to just use them in the same way as one uses pseudo random numbers. The real advantage of quasi Monte-Carlo shows up only after

N\sim e^d

samples, where d is the number of dimensions of the problem. If one is using significantly less samples then it becomes very likely that the results are totally wrong. Therefore a dimensional reduction of the problem is often the first step when applying quasi Monte-Carlo methods.

The Brownian bridge is tailor-made to reduce the number of significant dimensions of a Monte-Carlo integration of a stochastic differential equation  In his recent text-book [1] Mark Joshi shows how to use Brownian bridges efficiently for quasi Monte-Carlo pricing and he also outlines the QuantLib implementation. It has been shown in [3] that the overall statistical error of a structured equity portfolio has been reduced by a factor of three when switching from a pseudo random number generator to a Sobol Brownian bridge generator, which in turn means a speed-up factor of nine leaving the Monte Carlo error unchanged.

An efficient implementation of a Sobol Brownian bridge generator is therefore a standard building block for quasi Monte-Carlo option pricing. A CUDA 4.2 implementation for nVIDIA GPUs based on QuantLib 1.2 is available here. The interface and usage are closely related to QuantLib’s Sobol Brownian bridge generator for CPUs. The memory layout of the results on the GPU is CURAND_ORDERING_QUASI_DEFAULT. The zip file also contains a validation program, which checks that both versions of the Sobol Brownian bridge are generating the same results.

As the diagram above shows, the GPU (GTX 560) out-performance the CPU ([email protected]) on average by a factor of 50 for single precision and by a factor of 12 for double precision.

[1] Mark Joshi, More Mathematical Finance
[2] Sebasien Gurrieri, An Analysis of Sobol Sequence and the Brownian Bridge
[3] Andre Bernemann, Raplh Schreyer, Klaus Spanderen, Accelerating Exotic Option Pricing and Model Calibration Using GPUs

The Hybrid Heston-Hull-White Model in the H1HW Approximation

In his PhD thesis [1] Lech Grzelak studies several hybrid equity interest rate models. Among others the thesis presents the H1HW approximation of the hybrid Heston-Hull-White model:

\begin{array}{rcl} dS_t &=& (r_t-q_t) S_t dt + \sqrt{v_t} S_t dW_t^S \\ dv_t &=& \kappa_v (\theta_v - v_t) dt + \sigma_v \sqrt{v_t} dW_t^v \\ dr_t & = & \kappa_r(\theta_{r,t}-r_t) dt + \sigma_r dW_t^r \\ \rho_{Sv} dt &=& dW_t^S dW_t^v\\ \rho_{Sr} dt &=& dW_t^S dW_t^r \\ \rho_{vr} dt &=& dW_t^v dW_t^r \end{array}

This model is tailor-made to analyse the impact of stochastic interest rates on structured equity notes like e.g. auto-callables. The corresponding partial differential equation

\begin{array}{rcl} 0 &=& \frac{\partial u}{\partial t} + \frac{1}{2}S^2\nu \frac{\partial^2 u}{\partial S^2}+ \frac{1}{2}\sigma_\nu^2\nu\frac{\partial^2u}{\partial \nu^2}+ \frac{1}{2}\sigma_r^2\frac{\partial^2 u}{\partial r^2} + \rho_{S\nu}\sigma_\nu S\nu\frac{\partial^2u}{\partial S\partial\nu} + \rho_{Sr}\sigma_r S\sqrt\nu \frac{\partial^2 u}{\partial S\partial r} \nonumber\\ &+& (r-q)S\frac{\partial u}{\partial S}+ \kappa_v(\theta_v-\nu)\frac{\partial u}{\partial \nu}+ \kappa_r(\theta_{r,t}-r)\frac{\partial u}{\partial r} -ru \nonumber \end{array}

does not fit into into the affine diffusion process framework and therefore has no closed form solution based on the characteristic function like the Heston model.

The H1HW approximation replaces the term \sqrt \nu in the correlation interaction by its expectation value E(\sqrt\nu). Semi closed form solutions for E(\sqrt\nu) exist, if \nu_t is a CIR type process. The corresponding hybrid model is affine and a pricing engine for plain vanilla options can easily be implemented in QL based on the characteristic function framework.

The diagram below shows the root-mean-square error for vanilla european options with strikes equal to 40%, 80%, 100%, 120% and 180%. The parameters of the process are given by

\kappa_v=0.3, \theta_v=0.05, \rho_{S\nu}=0.3, \rho_{Sr}=-0.6, \kappa_r=0.01, \sigma_r=0.01

The vol-of-vol \sigma_{\nu} is set so that the Feller parameter \sigma_\nu^2/(\kappa_v \theta_v) varies between 0.25 and 128. The reference prices are calculated using the finite difference Heston-Hull-White engine.

The results of the H1HW approximation look very well if the Feller constraint \sigma_\nu^2/(\kappa_\nu \theta_\nu) < 2 is fulfilled but has problems in high \sigma_\nu scenarios. The integration of the characteristic function of the H1HW approximation becomes unstable if \rho_{Sr} < 0. The example code is available here and depends on the latest QL version based from the SVN trunk.

[1] Lech A. Grzelak: Equity and Foreign Exchange Hybrid Models for Pricing Long-Maturity Financial Derivatives

Running QuantLib on a Raspberry Pi

QuantLib 1.2 runs on a Raspberry Pi, a credit card computer build around an 700MHz ARM processor and priced at US$ 35. Compiling QL on the Pi is straight forward:

  1. install boost unit tests: sudo apt-get install libboost-test-dev
  2. ./configure –disable-static
  3. make

If your distribution is based on boost 1.49 then you will get many warnings like “Warning: swp{b} use is deprecated for this architecture”. These warnings can be ignored or follow the short note here to get rid of them. The QuantLib benchmark runs at 28.3MFlops, which is comparable to the performance of an Pentium II@300MHz.

Richardson Extrapolation for American Options

Popular finite difference schemes like the one-dimensional Crank-Nicholson scheme or multi dimensional operating splitting methods like the Hundsdorfer-Verwer scheme often achieve second order convergence in \Delta t for typical partial differential equations in financial engineering. Unfortunately if these schemes are used together with dynamic programming to price America options the second order convergence is destroyed by the first order convergence of the dynamic programming step.

The Richardson extrapolation is a simple numerical gem to increase the convergence order of a numerical method. Say a numerical method has the following convergence behavior

f(\Delta t) = f_0 + \alpha \cdot (\Delta t)^n + O((\Delta t)^{n+1}) .

Please note that this is a more restrictive requirement than to assume that the numerical method is of order n because \alpha does not depend on \Delta t. Now the formula

R(x, \Delta t) = \frac{x^n f\left(\frac{\Delta t}{x}\right)-f(\Delta t)}{x^n-1} = f_0 + O((\Delta t)^{n+1})

also has

\lim_{\Delta t\rightarrow 0} R(x,\Delta t) = f_0

but the order of convergence of R(x, \Delta t) is  n+1. The order does not depend on the particular value of x but often x is chosen to be two. This simple trick can be extended to the case where n is unknown or the Richardson extrapolation can be used m times to increase the order of convergence to n+m [2].

Lets apply the Richardson extrapolation to the pricing of an American Put under the Heston stochastic differential equation. The corresponding partial differential equation

0 = \frac{\partial u}{\partial t} + \frac{1}{2}S^2v\frac{\partial^2 u}{\partial S^2}+\rho\sigma S\nu \frac{\partial^2 u}{\partial S\partial \nu} + \frac{1}{2}\sigma^2\nu \frac{\partial^2 u}{\partial \nu^2} + rS\frac{\partial u}{\partial S} + \kappa(\theta - \nu)\frac{\partial u}{\partial \nu} -r u

is solved using the Hundsdorfer-Verwer scheme [1]. The early exercise condition is impressed using dynamic programming after every time step. The parameters of the experiment are

\small{T=1, K=10, S_0=11, r=0.1, \nu_0=0.0625, \kappa=5, \theta=0.16, \sigma=0.9, \rho=0.1} .

As can be seen in the diagram below even though the Richardson extrapolation is a simple formula it increases the convergence by one order from 1.0 to approx. 2.1. If the order is not known a priori then the resulting order is still 1.4.

The source code is available here. It depends on the latest QuantLib version from the SVN trunk.

[1] Karel in ’t Hout, ADI finite difference schemes for the Heston model with correlation.

[2] Eric Hung-Lin Liu, Fundamental Methods of Numerical Extrapolation with Applications.

Inter-Thread Communication using Lock-Free Algorithms

In his blog Martin Fowler discusses the LMAX architecture, a high throughput retail financial trading platform. A remarkable detail of the architecture is the central business logic processor, which is implemented as a single threaded Java program. The supporting pre- and post-processing is running as a multi threaded application using lock-free ring buffers.

In general lock-free algorithms are implemented using atomic read-modify-write primitives which are provided by modern CPUs. Probably the most used lock-free algorithm in the boost library is the reference counting in boost::shared_ptr. On popular hardware platforms these reference count updates are based on atomic increments/decrements using  compare-and-swap (CAS) operations. The performance improvements over normal mutex exclusion locks are significant as the following little experiment shows. Basis for this experiment is a function which increments an integer counter 500 million times in a loop [1]. This loop is carried out in

  • simple single threaded loop
  • single thread loop using atomic increments
  • single thread loop acquiring a mutex exclusion lock for every pass.
  • two threads using atomic increments
  • two threads using mutex exclusion locks

A Java and the C++ implementation is available here. The Java code uses the package java.util.concurrent.atomic and the C++ code uses boost::detail::atomic_count to implement the atomic increments. The run times are measured on a Core i3@3GHz.

Lock-free algorithms are difficult to implement and to debug. Tim Blechmann has written a little gem library boost::lockfree (Parts of the library are now in the boost release 1.53.0). Among others this library contains a wait-free single-producer/single-consumer ring buffer. This ring buffer is e.g. tailor-made to separate the Monte-Carlo path generation from the pricing of a derivate. With a few lines of code the path generation can then run in a different thread.

The code available here contains a BufferedDataFactory based on the boost::lockfree::ringbuffer class and a slightly modified version of the QuantLib test case HybridHestonHullWhiteProcessTest::testZeroBondPricing. In this version of the test case the path generation is running in a separated thread and uses the BufferedDataFactory to hand over the data to the pricing thread.

[1] Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads.

Parallel Model Calibration using MPI and Boost.MPI

The message passing standard MPI is a language-independent communication protocol. MPI supports the parallelization of numerical algorithms on both massive parallel computers and on symmetric multi processor systems. MPI is standardized, highly portable and the de facto standard on massive parallel supercomputers. Even though MPI can be used in a multi-threading environment it is normally used in a multi-process environment. Therefore MPI is tailor-made to parallelize algorithms based on the non thread-safe QuantLib.

The roots of the MPI specification are going back to the early 90’s and you will feel the age if you use the C-API, which is designed to achieve maximum performance. The Boost.MPI library – quoting from the web page – “is a C++ friendly interface to the standard Message Passing Interface… Boost.MPI can build MPI data types for user-defined types using the Boost.Serialization library”.

Model calibration can be a very time-consuming task, e.g. the calibration of a Heston or a Heston-Hull-White model using American puts with discrete dividends. The class MPICalibrationHelper acts as a MPI wrapper for a given CalibrationHelper and allows to parallelize an existing model calibration routine (hopefully with minimal impact/effort). The source code is available here. It contains the MPICalibrationHelper class and as an example the parallel version of the DAXCalibration test case (part of test-suite/hestonmodel.cpp). The code depends on QuantLib 1.0 or higher, Boost.Thread and Boost.MPI.

The diagram above shows the speed-up of a Heston-Hull-White calibration with discrete dividends on an eight core machine using a finite difference pricing engine. The main reason for the sub-linear scaling is the limited memory bandwidth between the CPUs and the main memory and not the MPI communication overhead.

QuantLib-SWIG and a Thread-Safe Observer Pattern in C++

Update 23.11.2015: A modified version of the patch is now part of the official QuantLib Release 1.7.

Update 28.02.2015: Please find the latest version of the patch for QuantLib 1.5 here.

Update 11.05.2014: Please find the latest version of the patch for QuantLib 1.4 here.

The QuantLib library is not really thread-safe. Successful usage in multi-core and parallel environments is often achieved by message passing between multiple processes rather than shared memory and multi-threading. If different threads are acting on distinct objects then QuantLib can also be used in the multi-threading environment (define QL_ENABLE_SESSIONS to make the singletons thread dependent.).

If you are using QuantLib in Java or Scala via SWIG [1] then the QuantLib routines are automatically executed in a multi-threading environment even if your main routine is single threaded. The garbage collector of the JVM is usually running in different thread. In conjunction with QuantLib’s implementation of the Observer pattern this can lead to serious problems, e.g. the following single threaded Scala code crashes on a multi-core computer after a short period of time.

import org.quantlib.{Array => QArray, _}
object ObserverTest {
    def main(args: Array[String]) : Unit = {
        System.loadLibrary("QuantLibJNI");
        val aSimpleQuote = new SimpleQuote(0)

        while (true) {
            (0 until 10).foreach(_ => {
                new QuoteHandle(aSimpleQuote)
                aSimpleQuote.setValue(aSimpleQuote.value + 1)
            })
            System.gc
        }
    }
}

The garbage collector might call the destructor of an observer (QuoteHandle) at the same point in time when the update method on the same object is invoked by aSimpleQuote.setValue. (Find here the original postings on the QuantLib mailing list regarding this problem.). For a greenfield project the author in [2] has a nice solution for this problem in C++. Unfortunately this solution can not be applied to the QuantLib because the solution involves a change of signature of the Observer interface and thereafter would lead to a lot of change in the QuantLib library.

The main problem here is that we have to disable the Observer before the destructor starts to work. This could in theory be achieved by adding a special “Deleter” to every new instance of a boost::shared_ptr. But again this would lead to a lot of changes in the library.

When compiled with the preprocessor directive BOOST_SP_ENABLE_DEBUG_HOOKS the boost library adds a callback hook before a destructor is called by any boost smart pointer. This hook can be utilized to disable an Observer before the actual destructor is executed. In addition the boost::signals2 library [3] provides a simple and thread-safe notification mechanism.

The corresponding thread-safe implementation of the original QuantLib Observer/Observable interface can be found here. Set the preprocessor directive BOOST_SP_ENABLE_DEBUG_HOOKS for every source file. Replace/Add the files observable.hpp and observable.cpp and recompile the QuantLib library and the QuantLib-SWIG module.

Note: if you want to use the patch with QL 1.2 and not for the trunk, please see the comment by gk below.

[1] Simplified Wrapper and Interface Generator, SWIG

[2] Shuo Chen, Where Destructors meet Threads

[3] Douglas G., Mori Hess F., Boost.Signals2

Accelerate VPP Pricing on the GPU using Preconditioner

As outlined in Solving Sparse Linear Systems using CUSP and CUDA a VPP pricing problem can be solved on a GPU using finite difference methods based on the implicit Euler scheme. The matrix inversion is carried out on the GPU using the BiCGStab solver of the numerical library cusp [1]. The  optimal exercise problem for the VPP can be solved using dynamic programming for every time slice on the GPU (see [2]).

The lower convergence order in \Delta t of the implicit Euler scheme is not a disadvantage because the hourly optimization step forces \Delta t = 1h anyhow. This step size is small enough for the implicit Euler scheme to obtain a very good accuracy. The explicit Euler scheme is  numerically unstable, even if \Delta t = 1h. It converged only for large \Delta x, \Delta y and \Delta u.

Clearly the matrix inversion dominates the overall runtime of the pricing routine. Preconditioning is a standard technique to improve the performance of the involved BiCGStab algorithm. The cusp library supports a set of suited preconditioner for this problem.

  • Diagonal: preconditioning is very inexpensive to use, but has often limited effectiveness.
  • Approximate Inverse: Non-symmetric Bridson with different dropping strategies.
  • Smoothed Aggregation: Algebraic Multi-grid preconditioner based on smoothed aggregation.

On a CPU QuantLib supports preconditioning based on the Thomas algorithm. This algorithm solves a tridiagonal matrix with O(8n) operations. The corresponding preconditioner can be seen as a natural extension of the diagonal preconditioner. Experiments on a CPU indicate that  the tridiagonal preconditioner leads to a significant speed-up for our VPP pricing problem. The upcoming CUSPARSE library of the CUDA 4.1 release contains a tridiagonal solver for the GPU. With the help of a small interface class the CUDA tridiagonal solver can be used as a cusp preconditioner. The source code is available here. The code depends on the latest QuantLib version from the SVN trunk.

The testbed is a VPP contract with t_{minUp} = t_{minDown} = 2h and a maturity of 4 weeks (corresponds to 4*7*24=672 exercise steps). Therefore in addition to the three dimensions for the stochastic model the problem has a fourth dimension of size 2t_{minUp} + t_{minDown} = 6  to solve the optimization problem via dynamic programming.

The following diagram shows the different run-times of the algorithms on a GPU and a CPU.

The fastest algorithm on the GPU (implicit Euler scheme with BiCGStab and tridiagonal preconditioner) outperforms the best CPU implementation (operator splitting based on the Douglas scheme) roughly by a factor of ten. The non-symmetric Bridson preconditioner with the dropping strategy from Lin & More on the GPU is the second best algorithm but it ran out of memory on larger grids. The diagonal preconditioner slows down the overall performance on the GPU and the BiCGStab algorithm with the smoothed aggregation preconditioner did not converge.

[1] cusp, Generic Parallel Algorithms for Sparse Matrix and Graph Computations

[2] Pricing of a Virtual Power Plant on a GPU

Bermudan Swaption Pricing based on Finite Difference Methods

Even though short rate models like the Hull-White model or the G2++ model are getting a bit long in the tooth these models are still used for risk management or as benchmark models. Since the early days QuantLib supports the pricing of Bermudan swaptions based on trinomial trees. It’s time to compare the performance and accuracy of the trinomial tree pricing algorithm with finite difference methods.

The G2++ model is defined by the following stochastic differential equation

\begin{array}{rcl} dx_t &=& -ax_tdt+\sigma dW_t^1 \\ dy_t &=& -by_tdt + \eta dW_t^2 \\ dW_t^1 dW_t^2 &=& \rho dt \\ r_t &=& x_t + y_t+\phi_t \\ \phi_t &=& f^M(0,T) +\frac{\sigma^2}{2a^2}\left(1-e^{-aT} \right )^2 + \frac{\eta^2}{2b^2}\left(1-e^{-bT} \right )^2 \\ &+&\rho \frac{\sigma\eta}{ab}\left(1-e^{-aT} \right )\left(1-e^{-bT} \right ) \end{array}

where f^M(0,T) is denoting the market instantaneous forward rate at time 0 for the maturity T (see. e.g. [1]). The Hull-White model can be seen as a one-dimensional simplification of the G2++ model. The corresponding partial differential equation (PDE) can be derived using the Feynman-Kac theorem.

\begin{array}{rcl} \frac{\partial V}{\partial t} = ax\frac{\partial V}{\partial x} + by\frac{\partial V}{\partial y} - \frac{\sigma}{2}\frac{\partial^2 V}{\partial x^2} - \frac{\eta}{2}\frac{\partial^2 V}{\partial y^2}-\rho\sigma\eta\frac{\partial^2 V}{\partial x \partial y} + rV \end{array}

QuantLib supports several operator splitting schemes to solve multi-dimensional PDEs.  The SVN trunk contains FDM engines to price Bermudan swaptions under the Hull-White and the G2++ model. Other short rate models like the CIR++ or Black-Karasinski model can be implemented in the same manner.

As can be seen in the diagrams below the finite difference method based pricing engines outperform the trinomial tree based engines.

[1] D. Brigo, F. Mercurio, Interest Rate Models – Theory and Practice