4.1 Numerical integration
The term numerical integration refers to the approximation of the integral of a function. Since function evaluations tend to be computationally expensive, numerical integration generally aims to use as few of them as possible. A variety of methods has been proposed in the literature to perform numerical integration; throughout this Section, I will focus on quadrature rules, i.e. methods that evaluate the function to be integrated at a set of points over the integration domain and combine the resulting values to obtain an approximation of the integral. Quadrature rules vary in complexity and accuracy, and generally accuracy improves as rules get more complex. Additionally, integration of functions in few dimensions is generally not too problematic; the task becomes more difficult when integrating over many dimensions, as obtaining an acceptable level of accuracy often requires an unfeasible number of function evaluations.
4.1.1 Unidimensional functions
The simplest method to numerically approximate the integral of a unidimensional function \(f(x)\) over an integration domain \([a, b]\) is given by the Riemann sum. A particular form of Riemann sum is given by the midpoint rule, which approximates the integral of a continuous function by the area under a set of \(N\) step functions, with the midpoint of each matching \(f\): \[ \int_a^b f(x) \ dx \approx \frac{b - a}{N} \sum_{i = 1}^N f(a + (i - 0.5)(b - a) / N) \] Alternatively, the trapezoidal rule approximates the region under a continuous function as a trapezoid and then computes its area: \[ \int_a^b f(x) \ dx \approx (b - a) \left[ \frac{f(a) + f(b)}{2} \right] \] It works best when partitioning the integration domain into many subintervals, applying the trapezoidal rule to each of them, and then summing the results: \[ \int_a^b f(x) \ dx \approx \sum_{k = 1} ^ N \frac{f(x_{k - 1}) + f(x_k)}{2} \Delta x_k, \] with \(\{x_k\}\) a partition of \([a, b]\) such that \(a = x_0 < x_1 < x_2 < \dots < x_{N-1} < x_N = b\) and \(\Delta x_k = x_k - x_{k - 1}\) the length of the \(k\)th subinterval.
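As an illustration, the following is a minimal base R sketch of the midpoint and composite trapezoidal rules; `midpoint_rule()` and `trapezoid_rule()` are illustrative names, not part of any package.

```r
# Midpoint rule: evaluate f at the midpoint of each of the N subintervals
midpoint_rule <- function(f, a, b, N = 100) {
  x <- a + (seq_len(N) - 0.5) * (b - a) / N
  (b - a) / N * sum(f(x))
}

# Composite trapezoidal rule over the partition a = x_0 < ... < x_N = b
trapezoid_rule <- function(f, a, b, N = 100) {
  x <- seq(a, b, length.out = N + 1)
  sum((f(x[-1]) + f(x[-(N + 1)])) / 2 * diff(x))
}

midpoint_rule(sin, 0, pi)   # exact value: 2
trapezoid_rule(sin, 0, pi)  # exact value: 2
```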
The accuracy of the midpoint and trapezoidal rules depends on the number of subintervals \(N\) used to approximate the function, but so does computational complexity. The only requirement for applying these rules is being able to evaluate the function \(f(x)\) at any given point in its domain. If \(f(x)\) is cheap to evaluate, then the midpoint and trapezoidal rules may be just fine; otherwise, it would be better to move on to more complicated methods that yield more accurate results.
A first method that is only slightly more complicated but yields better results is Simpson's rule. It works analogously to the midpoint and trapezoidal rules, but uses a smooth quadratic interpolant that takes the same values as \(f(x)\) at the extremes of the integration interval \([a, b]\) and at the midpoint \(m = (a + b) / 2\): \[ \int_a^b f(x) \ dx \approx \frac{b - a}{6} \left[ f(a) + 4f((a + b) / 2) + f(b) \right] \] As with the trapezoidal rule, greater accuracy can be obtained by splitting the integration interval into many subintervals, applying Simpson's rule to each subinterval, and adding up the results.
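A minimal sketch of the composite version of Simpson's rule, under the same assumptions as the snippet above (`simpson_rule()` is an illustrative name):

```r
# Composite Simpson's rule: apply the quadratic interpolant to each of the
# N subintervals and sum the results
simpson_rule <- function(f, a, b, N = 100) {
  x <- seq(a, b, length.out = N + 1)
  h <- diff(x)                    # subinterval widths
  m <- (x[-1] + x[-(N + 1)]) / 2  # subinterval midpoints
  sum(h / 6 * (f(x[-(N + 1)]) + 4 * f(m) + f(x[-1])))
}

simpson_rule(sin, 0, pi)  # exact value: 2
```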
Second, it is possible to show that by carefully choosing the points at which to evaluate \(f(x)\) and the weights assigned to each point, the integral of any polynomial of degree \(2N - 1\) or less can be computed exactly with \(N\) function evaluations (proof in Monahan (2011)). Methods that exploit this feature are commonly named Gaussian quadrature methods. Let \(f(x)\) be a polynomial of degree \(2N - 1\) or less to integrate over the conventional domain \([-1, 1]\); the quadrature formula is defined as: \[ \int_{-1}^{1} f(x) \ dx = \sum_{i = 1} ^ N w_i f(x_i) \] Before applying Gaussian quadrature, any integral over a generic domain \([a, b]\) needs to be transformed to the interval \([-1, 1]\). The change of interval is applied as \[ \int_a^b f(x) \ dx = \frac{b - a}{2} \int_{-1}^{1} f\left( \frac{b - a}{2} x + \frac{a + b}{2} \right) \ dx, \] and the Gaussian quadrature rule is then \[ \int_a^b f(x) \ dx = \frac{b - a}{2} \sum_{i = 1} ^ N w_i f\left( \frac{b - a}{2} x_i + \frac{a + b}{2} \right). \] If \(f(x) = w(x) g(x)\) can be written as the product of a polynomial function \(g(x)\) and a known weighting function \(w(x)\), alternative weights \(w_i'\) and nodes \(x_i'\) give better results: \[ \int_{-1}^{1} f(x) \ dx = \int_{-1}^{1} w(x) g(x) \ dx = \sum_{i = 1} ^ N w_i' g(x_i') \] Depending on the choice of the weighting function \(w(x)\), different Gaussian quadrature rules can be obtained that allow integrating over different domains. When \(w(x) = 1\), the associated polynomials \(g(x)\) are Legendre polynomials, the quadrature rule is named the Gauss-Legendre quadrature rule, and it allows integrating over the interval \([-1, 1]\); the integration points are obtained as the \(N\) roots of the \(N\)th Legendre polynomial: \(x = \{x_1, x_2, \dots, x_N\}\). When choosing the weight function \(w(x) = \exp(-x)\) the associated polynomials are Laguerre polynomials, the quadrature rule is named the Gauss-Laguerre quadrature rule, and the integration domain is \([0, +\infty)\). Finally, when choosing the weight function \(w(x) = \exp(-x^2)\) the associated polynomials are Hermite polynomials, the quadrature rule is named the Gauss-Hermite quadrature rule, and the integration domain is \((-\infty, +\infty)\).
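For illustration, Gauss-Legendre nodes and weights can be computed in base R via the Golub-Welsch algorithm, i.e. the eigendecomposition of the symmetric tridiagonal Jacobi matrix associated with the Legendre polynomials; this is a minimal sketch, and `gauss_legendre()` and `gauss_quad()` are illustrative names.

```r
# Gauss-Legendre nodes and weights via the Golub-Welsch algorithm
gauss_legendre <- function(N) {
  k <- seq_len(N - 1)
  b <- k / sqrt(4 * k^2 - 1)  # off-diagonal of the Jacobi matrix
  J <- diag(0, N)
  J[cbind(k, k + 1)] <- b
  J[cbind(k + 1, k)] <- b
  e <- eigen(J, symmetric = TRUE)
  list(nodes   = e$values,          # roots of the Nth Legendre polynomial
       weights = 2 * e$vectors[1, ]^2)
}

# Integrate f over [a, b] after the change of interval to [-1, 1]
gauss_quad <- function(f, a, b, N = 5) {
  gl <- gauss_legendre(N)
  (b - a) / 2 * sum(gl$weights * f((b - a) / 2 * gl$nodes + (a + b) / 2))
}

# A degree-5 polynomial is integrated exactly with N = 3 nodes:
gauss_quad(function(x) x^5 + x^2, 0, 1, N = 3)  # exact value: 1/2
```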
Finally, a slightly more sophisticated version of Gaussian quadrature is given by the Gauss-Kronrod quadrature rule. In the Gauss-Kronrod quadrature rule, the evaluation points are chosen dynamically so that an accurate approximation can be computed by reusing the information produced by the computation of a less accurate approximation. In practice, integration points from previous iterations can be reused as part of the new set of points, whereas usual Gaussian quadrature would require recomputation of all abscissas at each iteration. This is particularly important when a specified degree of accuracy is needed but the number of points required to achieve this accuracy is not known ahead of time. The Gauss-Kronrod quadrature rule is implemented in R as the integrate() function.
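For instance, integrate() is available directly in base R (it wraps adaptive quadrature routines from QUADPACK) and also handles unbounded integration domains:

```r
# Adaptive Gauss-Kronrod quadrature via the base R integrate() function
integrate(dnorm, lower = -Inf, upper = Inf)  # total mass of N(0, 1): ~1
integrate(function(x) exp(-x^2), 0, Inf)     # sqrt(pi)/2 ~ 0.8862
```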
4.1.2 Multidimensional functions
All the methods presented so far only apply to the integration of unidimensional functions. It is possible to extend quadrature rules to multidimensional settings by recursively applying unidimensional quadrature rules. Say I want to approximate the integral of a bi-dimensional function \(f(x, y)\); the bi-dimensional Gaussian quadrature rule has the form: \[ \int_X \int_Y f(x, y) \ dx \ dy \approx \sum_j \sum_i w_j w_i f(x_j, y_i) \] This can be extended to any number of dimensions \(d\), but it quickly becomes computationally expensive, as an \(N\)-point rule requires \(N^d\) function evaluations. Two simple tricks for improving accuracy and efficiency are given in Jäckel (2005), and consist in pruning and/or rotating the matrix of location nodes. When pruning the matrix of nodes, those with extremely low weights, which therefore contribute very little to the total integral value, can be removed to spare some computational cost. When rotating the matrix of nodes, instead, the correlation between the \(d\) dimensions is taken into account. For instance, assume I am integrating over the random effects of a joint model for longitudinal and survival data: the random effects variance-covariance matrix can be decomposed using either the Cholesky decomposition or the spectral decomposition, and the matrix of location nodes can then be pre-multiplied by such decomposition to rotate the grid of integration points, better adapting it to the shape of the multidimensional function to integrate.
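A minimal sketch of the tensor-product construction in two dimensions, reusing the illustrative gauss_legendre() helper from the earlier snippet (`gauss_quad_2d()` is likewise an illustrative name):

```r
# Tensor-product (bi-dimensional) Gauss-Legendre rule over [a1,b1] x [a2,b2];
# the N^2 node combinations illustrate the N^d cost in d dimensions
gauss_quad_2d <- function(f, a1, b1, a2, b2, N = 5) {
  gl <- gauss_legendre(N)
  x <- (b1 - a1) / 2 * gl$nodes + (a1 + b1) / 2
  y <- (b2 - a2) / 2 * gl$nodes + (a2 + b2) / 2
  g <- expand.grid(i = seq_len(N), j = seq_len(N))
  s <- sum(gl$weights[g$i] * gl$weights[g$j] * f(x[g$i], y[g$j]))
  (b1 - a1) / 2 * (b2 - a2) / 2 * s
}

gauss_quad_2d(function(x, y) x * y, 0, 1, 0, 1)  # exact value: 1/4
```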
A better option when the number of dimensions \(d\) to integrate over is large is given by Monte Carlo integration. Consider integrating a multidimensional function \(f(x)\) over some region \(\Omega\) of volume \(V(\Omega)\): \[ I_{\Omega} = \int_{\Omega} f(x) \ dx = E[f(U)] V(\Omega), \] with \(U \sim\) uniform over \(\Omega\). Drawing \(N\) uniform random vectors \(u_i\), an estimator for \(I_{\Omega}\) is \[ \hat{I}_{\Omega} = \frac{V(\Omega)}{N} \sum_{i = 1} ^ N f(u_i); \] this estimator defines Monte Carlo integration. More details can be found in Monahan (2011).
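A minimal sketch of the estimator \(\hat{I}_{\Omega}\) over the unit square (so that \(V(\Omega) = 1\)), assuming \(f\) accepts a two-column matrix of points:

```r
# Monte Carlo integration of f(x, y) = x * y over [0, 1]^2; exact value: 1/4
set.seed(42)
N <- 1e5
u <- matrix(runif(2 * N), ncol = 2)  # N uniform random vectors over Omega
f <- function(u) u[, 1] * u[, 2]
mean(f(u))                           # V(Omega) / N * sum of f(u_i), V = 1
```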
Both Gaussian quadrature and Monte Carlo integration can be further tweaked to improve accuracy and convergence rates: two appealing options are, respectively, adaptive Gaussian quadrature and importance sampling. Adaptive Gaussian quadrature works best when using the Gauss-Hermite rule with the normal density kernel as weighting function. Recall using the spectral or Cholesky decomposition of the random effects variance-covariance matrix to pre-multiply the matrix of location nodes: this will not place the nodes in optimal locations for each subject, and if between-subject variability is large, a common matrix of location nodes is likely to be quite inefficient. Adaptive Gauss-Hermite quadrature, as proposed by Pinheiro and Bates (1995), aims to solve this problem. In brief, the idea consists in applying subject-specific (or cluster-specific) centring and scaling of the quadrature nodes to place them at an optimal position for each subject. That is achieved by using an alternative normal kernel distribution \(\phi(x; \hat{M}_i, \hat{\Sigma}_i)\) with subject-specific mean vector \(\hat{M}_i\) and variance-covariance matrix \(\hat{\Sigma}_i\): \[ f(x) = \frac{f(x)}{\phi(x; \hat{M}_i, \hat{\Sigma}_i)} \phi(x; \hat{M}_i, \hat{\Sigma}_i) \] This yields a quadrature rule based on the subject-specific normal kernel: the location nodes and weights then depend on \(\hat{M}_i\) and \(\hat{\Sigma}_i\). Further, \(\hat{M}_i\) and \(\hat{\Sigma}_i\) can be updated iteratively (e.g. using empirical Bayes estimates) to better adapt the grid of quadrature points to the actual shape of the integral to approximate, as the accuracy of estimating \(\hat{M}_i\) and \(\hat{\Sigma}_i\) increases. Monte Carlo integration, in turn, works best when it is possible to draw a sample from the target distribution (i.e. the distribution appearing in the integral to approximate); unfortunately, that is rarely the case in practice. The idea of importance sampling consists in drawing a sample from a proposal distribution and then re-weighting the estimated integral using importance weights to better adapt to the target distribution.
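A minimal importance-sampling sketch, with all names illustrative: the integral of a target integrand \(f\) over the real line is approximated by sampling from a heavier-tailed Student-\(t\) proposal and re-weighting each draw by the ratio of \(f\) to the proposal density.

```r
# Importance sampling: I = integral of f(x) dx, approximated as the mean of
# the importance weights f(z_i) / q(z_i), with z_i drawn from the proposal q
set.seed(42)
f <- function(x) exp(-x^2 / 2)  # target integrand; exact I = sqrt(2 * pi)
N <- 1e5
z <- rt(N, df = 3)              # draws from the proposal (Student-t, 3 df)
w <- f(z) / dt(z, df = 3)       # importance weights
mean(w)                         # ~2.5066
sqrt(2 * pi)                    # exact value, for comparison
```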
References
Monahan, John F. 2011. Numerical Methods of Statistics. 2nd ed. Statistical and Probabilistic Mathematics. Cambridge University Press.
Jäckel, Peter. 2005. “A Note on Multivariate Gauss-Hermite Quadrature.” http://awdz65.dsl.pipex.com/ANoteOnMultivariateGaussHermiteQuadrature.pdf.
Pinheiro, José C., and Douglas M. Bates. 1995. “Approximations to the Log-Likelihood Function in the Nonlinear Mixed-Effects Model.” Journal of Computational and Graphical Statistics 4 (1): 12–35. doi:10.2307/1390625.