Neural Network–Based Financial Volatility Forecasting: A Systematic Review

Abstract
Volatility forecasting is an important aspect of finance as it dictates many decisions of market players. A snapshot of state-of-the-art neural network–based financial volatility forecasting was generated by examining 35 studies, published after 2015. Several issues were identified, such as the inability for easy and meaningful comparisons, and the large gap between modern machine learning models and those applied to volatility forecasting. A shared task was proposed to evaluate state-of-the-art models, and several promising ways to bridge the gap were suggested. Finally, adequate background was provided to serve as an introduction to the field of neural network volatility forecasting.


1 INTRODUCTION
One of the most important tasks in finance is to monitor the volatility of market variables such as commodity prices, interest rates, and the variables that constitute the value of a portfolio. Volatility is also a key factor in the pricing of many financial instruments and is the underlying asset of many derivatives. However, like all financial forecasting and prediction (used interchangeably in this text), forecasting volatility is not a simple task [103]; the stylised facts of volatility [39, 50, 90], the efficient market hypothesis [87, 124], and the ephemeral nature of financial relationships [29, 37] are just a few of the many reasons why. Despite this, it is still possible to forecast volatility to some degree [103, 124].

Volatility is often referred to as latent and unobservable [103], even ex-post, and finding a suitable way to quantify or estimate it is a problem in and of itself, leading to many different definitions being proposed [101]. Although these are only proxies of latent volatility, they still have practical value, as they provide a quantitative means of comparison and often align with market definitions [1].

In addition to the many volatility proxies, there are many methods that attempt to model, understand, and forecast volatility. One of the most widely used is the generalised autoregressive conditional heteroscedasticity (GARCH) model, and its family of variants [12]. In contrast to these traditional models, intelligent methods have recently gained much traction and are often nonlinear, encompassing methods such as machine learning (ML) and deep learning (DL), evolutionary algorithms (EAs), and fuzzy logic [20, 114]. ML and DL specifically have surged in popularity over recent years due to a flurry of successful neural network (NN) applications [62], a movement that is also being seen in financial volatility forecasting [20].

This systematic review looks to provide an overview of NN-based financial volatility forecasting, henceforth referred to as NN volatility forecasting. This article does not aim to be a comprehensive review of volatility forecasting in financial markets; for such a review, we direct the readers to Poon and Granger’s 2003 paper [103], which provides a dated yet relevant review of financial volatility forecasting, or Sezer, Gudelek, and Ozbayoglu’s 2020 publication [114], which provides a broad snapshot of state-of-the-art DL in financial time series forecasting. This study differs from the above in that the scope will be significantly narrowed to volatility forecasting with NN-based methods, from 2015 onward, thus offering a higher level of detail. To the best of our knowledge, there exists no timely and comprehensive review for volatility forecasting on financial time-series data using NNs.

Specifically, we had the following three aims: (1) to create a text that can be used as an introductory source to the field of financial volatility forecasting, (2) to provide a snapshot of the state-of-the-art in NN volatility forecasting, and (3) to identify some common issues, how these may be addressed, and some future directions.

The rest of the review is organised as follows: Section 2 provides a broad background on financial volatility, including some theory surrounding the different definitions and when these definitions may be appropriate. Background is also provided on several traditional forecasting methods, intelligent forecasting methods, as well as how the forecasting task can be defined. Section 3 outlines the methodology for the analysis and sub-analyses of this review. Section 4 presents several key results synthesised from the analysis, and Section 5 discusses these, as well as presents a few common issues observed, how these may be addressed, and a few limitations. Concluding remarks are given in Section 6.


2 BACKGROUND
Financial volatility and its forecasting have been studied extensively for many years, and the demand for a suitable model to understand and forecast volatility only increases as uncertainty about the future grows and the number of market players rises. Although our focus is on NN volatility forecasting, it is valuable to first provide background on volatility forecasting in general.

2.1 Defining Volatility
Due to the numerous ways in which volatility can be quantified, it must be concretely defined before any discussion can begin. Volatility is often referred to as latent and unobservable. This is because, in theory, it is an instantaneous variable that scales a stochastic Wiener process, also known as a Brownian motion. The appropriate way to quantify it would be an integral over time, resulting in the integrated volatility [103]. However, this is not possible due to the discrete nature of financial systems. Instead, we can look to the theory of quadratic variation [66], in which the integrated volatility is approximated by the sum of squared returns within a time window \(\tau_1\) to \(\tau_2\), given a high enough sampling frequency. This results in the Realised Volatility (RV) proxy [4, 6]:
(1) \(\mathrm{RV} = \sqrt{\sum_{t=\tau_1}^{\tau_2} r_t^2}\),
where \(r_t = \log(P_t / P_{t-1})\) is the log return at time \(t\), and \(P_t\) is the price at time \(t\). When deciding on the sampling frequency, there is a tradeoff between wanting a very high frequency, to approximate continuously observed frictionless prices, and wanting a lower frequency, to avoid market microstructure effects. Seminal works suggest 5- to 15-minute intervals as a good balance [5, 6, 59, 112]; however, the decision ultimately depends on the market liquidity of the given asset.
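To make Equation (1) concrete, the following is a minimal Python sketch of computing daily RV from an intraday price series with pandas; the 5-minute sampling interval, the variable names, and the choice to group by calendar day are illustrative assumptions rather than the setup of any reviewed study.

```python
import numpy as np
import pandas as pd

def realised_volatility(prices: pd.Series) -> pd.Series:
    """Daily realised volatility (Equation (1)) from intraday prices.

    `prices` is assumed to be a DatetimeIndex-ed series sampled at, e.g.,
    5-minute intervals. Each day's RV is the square root of the sum of
    squared intraday log returns (overnight returns are included here;
    conventions differ on whether to drop them).
    """
    log_returns = np.log(prices / prices.shift(1)).dropna()
    return np.sqrt((log_returns ** 2).groupby(log_returns.index.date).sum())

# Hypothetical usage, assuming `five_minute_close` is a pd.Series of prices:
# daily_rv = realised_volatility(five_minute_close)
```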

Oftentimes, a financial asset does not meet the market liquidity condition, or high-frequency data for the asset is not easily attainable; in such cases, other proxies for volatility are required. One such method is to quantify the dispersion in the daily closing prices of an asset, known as the close-to-close method or, more commonly, Historic Volatility (HV) [103]. It is defined as the standard deviation (or variance) of the log return series over the time window \(\tau_1\) to \(\tau_2\):
(2) \(\mathrm{HV} = \sqrt{\frac{1}{N}\sum_{t=\tau_1}^{\tau_2}\left(r_t - \mu_{\tau_1,\tau_2}\right)^2}\),
where \(\mu_{\tau_1,\tau_2} = \frac{1}{N}\sum_{t=\tau_1}^{\tau_2} r_t\) is the mean of the log returns within the time window, \(N = \tau_2 - \tau_1\) is the number of samples within that window, \(r_t = \log(P_t / P_{t-1})\) is the log return at time \(t\), and \(P_t\) is the price at time \(t\). The log return series is used rather than the daily closing prices because log returns often fit a Gaussian distribution [5, 55]. This is important because the standard deviation is only an appropriate and meaningful measure of dispersion when the underlying distribution of the sample is Gaussian, and it is well known that asset prices do not meet this condition. Occasionally, the simple returns series is used rather than the log returns, since \(\log(1+x) \approx x\) when \(x\) is small. In this case, the returns would be defined as \(R_t = P_t/P_{t-1} - 1\), that is, \(\log(P_t/P_{t-1}) \approx P_t/P_{t-1} - 1\). There are many ways to test whether the drawn data fits a Gaussian distribution; one that is commonly used is the Jarque-Bera test [65], though this may be biased for distributions with short tails, and other tests may be more appropriate, such as the Shapiro-Wilk test or a modified Cramér–von Mises test [123].
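As a complement to Equation (2), the sketch below computes HV for a window of log returns and applies the Jarque-Bera and Shapiro-Wilk tests from SciPy; the synthetic returns are a placeholder for real data.

```python
import numpy as np
from scipy import stats

def historic_volatility(log_returns: np.ndarray) -> float:
    """Close-to-close (historic) volatility, Equation (2): the standard
    deviation of the log returns over the window (1/N form)."""
    return float(np.std(log_returns))

# Placeholder data: roughly one year of synthetic daily log returns.
rng = np.random.default_rng(0)
log_returns = rng.normal(loc=0.0, scale=0.01, size=252)

hv = historic_volatility(log_returns)

# Check whether the returns plausibly fit a Gaussian distribution.
jb_stat, jb_pvalue = stats.jarque_bera(log_returns)
sw_stat, sw_pvalue = stats.shapiro(log_returns)
print(f"HV={hv:.4f}, Jarque-Bera p={jb_pvalue:.3f}, Shapiro-Wilk p={sw_pvalue:.3f}")
```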

Closely related to HV are several other proxies that incorporate additional daily data, each with its own set of assumptions, advantages, and disadvantages. The extreme value method, proposed by Parkinson (1980) [100], assumes a continuous geometric Brownian motion with no drift and utilises the high and low prices rather than the closing price. Another proxy, proposed by Garman and Klass (1980) [46], also assumes a continuous geometric Brownian motion with no drift, but utilises the open, close, high, and low prices. Since then, many other proxies that utilise the daily price range have emerged, such as that of Rogers and Satchell (1991) [107], which allows for drift and was empirically shown to be superior to an adjusted Garman–Klass method in the presence of time-varying drift [108]. Yang and Zhang (2000) [131] proposed a proxy that also assumes a continuous geometric Brownian motion, but can handle drift and opening price jumps. This was affirmed in simulation: all four proxies are good estimates of the true variance under a geometric Brownian motion with small drift and no opening jumps; the Parkinson and Garman–Klass proxies overestimate the true variance in the presence of large drift; and only Yang and Zhang's proxy is stable in the presence of large opening jumps [117]. Additionally, all four were shown to be close approximations of the RV (15-minute sum of squared log returns) of the S&P 500 [117]. For a more comprehensive view and comparison of range-based proxies, we refer readers to the works of Chou et al. [24, 25]. Interestingly, there is evidence that some range-based proxies can be forecasted with less error than the traditional close-to-close method (or HV) [75]. However, others have found that this does not hold when reasonable modifications are made, such as using an adjusted closing price rather than the raw closing price for the close-to-close method [102].
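For illustration, the sketch below implements the textbook forms of the Parkinson and Garman–Klass estimators from daily OHLC arrays; readers should verify the exact formulations against the cited papers before use.

```python
import numpy as np

def parkinson(high: np.ndarray, low: np.ndarray) -> float:
    """Parkinson (1980) extreme-value estimate over the window,
    using only the daily high and low prices."""
    hl = np.log(high / low)
    return float(np.sqrt(np.mean(hl ** 2) / (4.0 * np.log(2.0))))

def garman_klass(open_: np.ndarray, high: np.ndarray,
                 low: np.ndarray, close: np.ndarray) -> float:
    """Garman and Klass (1980) estimate over the window,
    using the daily open, high, low, and close prices."""
    hl = np.log(high / low)
    co = np.log(close / open_)
    variance = np.mean(0.5 * hl ** 2 - (2.0 * np.log(2.0) - 1.0) * co ** 2)
    return float(np.sqrt(variance))
```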

A different line of thinking gives rise to Implied Volatility (IV), in which volatility is backed out from an option price via some option pricing model, such as the well-known Black-Scholes model [91]. There is no closed-form inverse for the Black-Scholes option implied volatility, and it is often evaluated through numerical solvers, such as the Newton-Raphson method [73, 91], or through a closed-form approximation [15, 119]. Because option prices are set by the market, and by appeal to the efficient market hypothesis, IV can be thought of as the market's efficient expectation of future volatility for a specific asset within a certain time period, that is, a good forecast of future volatility. It is also the definition underlying major volatility indices, such as the Chicago Board Options Exchange's Volatility Index (VIX) [1], which is the 30-day expected volatility derived from the bid and ask prices of options whose expiry lies between 23 and 37 days. IV has also been shown to hold explanatory power in forecasting other volatility proxies, more so than past values of those proxies themselves [18, 27]. Despite its widespread use, IV has several flaws. One such flaw is known as the volatility smile, in which options that are identical in all but strike price result in different levels of IV. The effect is smallest for at-the-money options and increases as options become increasingly in-the-money or out-of-the-money [3], meaning that the implied volatility of an asset will differ depending on which option is used. Another flaw is that option pricing models typically assume constant volatility throughout the period, which is generally never the case [91]. Additionally, it is empirically observed that IV is typically higher than other measures of volatility, such as RV or HV, for the same period [27, 30, 43].
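Since Black-Scholes IV is usually obtained numerically, the following is a minimal sketch of backing it out with Newton-Raphson; the option quote in the final line is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def bs_call_price(S, K, T, r, sigma):
    """Black-Scholes price of a European call option."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

def implied_vol(price, S, K, T, r, sigma0=0.2, tol=1e-8, max_iter=100):
    """Invert the Black-Scholes formula with Newton-Raphson; vega
    (the price sensitivity to sigma) is the derivative used in the update."""
    sigma = sigma0
    for _ in range(max_iter):
        d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
        vega = S * np.sqrt(T) * norm.pdf(d1)
        diff = bs_call_price(S, K, T, r, sigma) - price
        if abs(diff) < tol:
            break
        sigma -= diff / vega
    return sigma

# Hypothetical quote: a 3-month at-the-money call on a 100-dollar stock.
print(implied_vol(price=4.5, S=100.0, K=100.0, T=0.25, r=0.01))
```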

2.2 Forecasting Methods
There exists a plethora of methods with which to forecast, model, and understand volatility; one of these is NNs, which have recently gained significant popularity due to numerous successful applications in other fields. The field of NNs and ML is extremely broad, and there are many resources for readers to gain a deeper understanding [9, 49]; this text only provides a brief overview of NNs, with emphasis on the context of financial volatility forecasting. NNs are often referred to as universal function approximators [58] as they have the ability to learn an arbitrary nonlinear mapping \(f\) from input \(X\) to output \(y\); \(y = f(X)\) [129]. In the context of financial volatility forecasting, the mapping may be represented as
(3) \(\hat{y}_{t+1} = f(X_t)\),
where \(\hat{y}_{t+1}\) may be the forecasted volatility proxy for the next time period, and the inputs \(X_t\) may be a matrix of observations (e.g., previous returns \(X_t = [r_t, r_{t-1}, r_{t-2}, \ldots]^T\)). This mapping is learned through some optimisation method (typically back-propagation), in conjunction with some loss function and sampled data, in the hope that the learned map can generalise to out-of-sample instances. The complexity of this mapping is bounded by the structure and quantity of computational nodes, also known as neurons. Increasing the complexity of this mapping is not always beneficial, as it may lead to over-fitting of the in-sample data, thus failing to generalise to out-of-sample instances; this is one of the reasons why the complexity of the network is often restricted [10].

Roughly, NNs can be categorised into three broad groups, each of which is designed for different applications and leverages different aspects of the data. First, multi-layer perceptrons (MLPs) do not naturally leverage any structure within the data and can be thought of as the simplest, most traditional category of NNs, with the longest history and largest body of research. The MLP is extremely flexible and can be formulated in different ways to exhibit different properties, such as within a nonlinear autoregressive (NAR) framework, which enforces an autoregressive property on the nonlinear mapping so that the forecast is a function of previously observed values (e.g., \(\hat{y}_{t+1} = f([y_t, y_{t-1}, \ldots, y_{t-m}]^T)\)) [26, 77]. This can be extended to include exogenous variables (a NARX framework), thus incorporating more information [17, 83]. The learning method can be altered, as in an extreme learning machine (ELM), to learn significantly faster while maintaining relatively good generalisation performance [60, 98]. The non-linearity can be altered, as in a radial basis function (RBF) network, to more easily capture different kinds of nonlinearity within the data [16, 88]. The loss can be defined differently, as in a quantile regression NN (QRNN), allowing the network to estimate the conditional median and other quantiles, rather than just the conditional mean [104, 122].
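As a minimal illustration of the NAR formulation above, the sketch below fits a small MLP (scikit-learn's MLPRegressor) that maps the previous five values of a volatility series to the next one; the synthetic series is a placeholder for a real proxy such as RV or HV.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_nar_dataset(series: np.ndarray, n_lags: int):
    """Build (X, y) pairs where each row of X holds the previous `n_lags`
    values and y is the next value, i.e. y_{t+1} = f([y_t, ..., y_{t-m}])."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

# Placeholder volatility series (positive, roughly the scale of daily HV).
rng = np.random.default_rng(0)
vol_series = 0.01 + 0.005 * np.abs(rng.standard_normal(500))

X, y = make_nar_dataset(vol_series, n_lags=5)
split = int(0.8 * len(X))               # simple chronological train/test split

model = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
model.fit(X[:split], y[:split])
out_of_sample_forecasts = model.predict(X[split:])
```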

Second, recurrent neural networks (RNNs) were originally designed to analyse sequential data and exploit the order in which data appear, a natural fit for time series forecasting. The RNN builds a hidden state as it recursively parses the input sequence, retaining useful information from previous elements; this is often referred to as the network's memory. However, the vanilla RNN can typically only hold short-term information [69], which has led to several extensions, such as the long short-term memory (LSTM) [57] and gated recurrent unit (GRU) [28]. These are widely used and combat the problem by using gates that allow the network to remember, update, and forget information. One important consideration is how much of the series should be exposed to the model; a recurrent network can parse the entire history of the time series or just a small window of it, a key differentiating point compared to other NN architectures. The former allows the network to build context from the entire series, whilst the latter can only build context from the small time window.
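The sketch below shows one way such a window-based recurrent forecaster might look in PyTorch; the window length, hidden size, and random input batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Reads a window of past observations with an LSTM and maps the final
    hidden state to a one-step-ahead volatility forecast."""
    def __init__(self, n_features: int = 1, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):              # x: (batch, window, n_features)
        _, (h_n, _) = self.lstm(x)     # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1])      # (batch, 1)

# Toy forward pass on random data standing in for windowed volatility inputs.
model = LSTMForecaster()
y_hat = model(torch.randn(8, 22, 1))   # 8 windows of 22 daily observations
```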

Third, convolutional neural networks (CNNs), originally inspired by the visual cortex [45], were designed for image tasks and leverage spatial information in the data. Independent units scan over the entire image in small sections, looking for the presence of a feature to create a feature map. Thus, translational invariance is built into the model, meaning the presence of a feature is detected regardless of its location in the input data, which is not the case with the MLP or RNN. This process is hierarchically repeated on the feature maps until the final layer aggregates all the extracted information to perform a prediction. Due to this highly hierarchical structure, a well-trained CNN can be thought of as a feature extractor, as the lower levels of the network often detect base features that are common to all images, whilst the upper levels are conditioned for the specific task [74]. There are many ways in which time series data can be used with a CNN. The first is to replace the standard two-dimensional convolutions with one-dimensional convolutions, meaning the spatial information is leveraged in a single dimension: time. Some successful examples of this are WaveNet [125] and temporal convolutional networks (TCNs) [78]. It is also possible to convert the time series into an image, such as through a recurrence plot [36], Gramian Angular Fields (GAFs) [128], or Markov Transition Fields (MTFs) [19], and then use traditional two-dimensional convolutions.
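A minimal one-dimensional CNN along the lines described above might look as follows in PyTorch; the filter counts, kernel sizes, and random inputs are placeholders rather than a recommended architecture.

```python
import torch
import torch.nn as nn

class Conv1DForecaster(nn.Module):
    """Stacks one-dimensional convolutions over the time axis, pools the
    resulting feature maps, and maps them to a one-step forecast."""
    def __init__(self, n_features: int = 1, n_filters: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_features, n_filters, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # aggregate each feature map over time
        )
        self.head = nn.Linear(n_filters, 1)

    def forward(self, x):               # x: (batch, n_features, window)
        return self.head(self.features(x).squeeze(-1))

model = Conv1DForecaster()
y_hat = model(torch.randn(8, 1, 22))    # 8 windows of 22 observations
```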

In terms of volatility forecasting models, the traditional GARCH family of models is well known for its theoretical underpinnings and popularity. The GARCH model is an extension of the autoregressive conditional heteroscedasticity (ARCH) model [38], belonging to the autoregressive (AR) universe of models. As the name suggests, an ARCH(q) model forecasts future volatility conditioned on previous observations, and can be represented as
(4) \(\epsilon_t = \sigma_t z_t\),
(5) \(\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2\),
where \(\epsilon_t\) are the residuals (or innovations), \(z_t\) is white noise, \(\sigma_t^2\) is the conditional variance, \(\alpha_0\) and \(\alpha_i\) are learned coefficients, and \(q\) is the number of lagged squared residuals to include (i.e., the order of the ARCH model). This can be extended to the GARCH(p, q) model by assuming the error variances follow an autoregressive moving average (ARMA) model, which can be expressed as
(6) \(\sigma_t^2 = w + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i \sigma_{t-i}^2\),
where \(w\), \(\alpha_i\), and \(\beta_i\) are learned coefficients, and \(p\) is the number of lagged conditional variances to include (the order of the model). Since the conception of the GARCH(p, q) model, there have been many advancements that address the model's inability to capture several stylised facts of volatility [39]. Exponential, threshold, and Glosten-Jagannathan-Runkle versions (EGARCH, TGARCH, GJR-GARCH) allow for asymmetric dependencies in volatility. Integrated and fractionally integrated versions (IGARCH, FIGARCH) address volatility persistence, where an observed shock in the volatility series appears to impact future volatility over a long horizon. Despite the countless other versions of the GARCH model, some have found that a GARCH(1, 1) forecasting model outperforms all other GARCH variants (including ARCH) for foreign exchange volatility. However, this was not the case for stock market volatility, where the GARCH(1, 1) was outperformed by a variant (though it was still significantly better than the ARCH model) [53]. Others have found that the ARCH model is superior to GARCH for country indices when estimating value at risk (VaR) [99].
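In practice, a GARCH(1, 1) such as Equation (6) with p = q = 1 can be fitted in a few lines; the sketch below uses the third-party Python `arch` package and synthetic returns as placeholders, so it illustrates typical usage rather than the setup of any reviewed study.

```python
import numpy as np
from arch import arch_model  # third-party package: pip install arch

# Placeholder daily returns, expressed in percent for numerical stability.
rng = np.random.default_rng(0)
returns = 100 * rng.normal(0.0, 0.01, size=1000)

# GARCH(1, 1) with a constant mean.
am = arch_model(returns, mean="Constant", vol="GARCH", p=1, q=1)
res = am.fit(disp="off")

# One-step-ahead forecast of the conditional variance.
forecast = res.forecast(horizon=1)
next_variance = forecast.variance.iloc[-1, 0]
print(res.params)
print(next_variance)
```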

There are also many other models belonging to the AR universe that can be used for forecasting. One example used in the literature is the conditional autoregressive range (CARR) model [23, 48], a generalisation of GARCH models using extreme value theory that allows range-based volatility proxies to be forecast from range-based inputs. Another is the smooth transition autoregressive (STAR) family of models, which uses different AR models in different regimes with a smooth transition function between them [8, 21]. A further example is the heterogeneous autoregressive (HAR) family of models, inspired by the Heterogeneous Market Hypothesis, relatively simple models that consider volatilities realised over different interval sizes [7, 31]. A final example is the moving average (MA) family of models, such as ARMA [14, 132] and the exponentially weighted moving average (EWMA) [63, 68]. All these models are unified in that the forecasts they produce are conditioned on past values, a key property of all AR models; however, they differ in exact formulation and thus have different attributes.
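As an example of how simple some of these AR-family benchmarks are, the sketch below fits a standard HAR-RV regression (daily, weekly, and monthly components) by ordinary least squares; the 5- and 22-day lag lengths follow common convention, and the RV series is a synthetic placeholder.

```python
import numpy as np

def har_design(rv: np.ndarray):
    """Build the HAR-RV design matrix: intercept, daily RV, 5-day mean RV,
    and 22-day mean RV, with the next day's RV as the target."""
    X, y = [], []
    for t in range(21, len(rv) - 1):
        X.append([1.0,
                  rv[t],                      # daily component
                  rv[t - 4:t + 1].mean(),     # weekly component
                  rv[t - 21:t + 1].mean()])   # monthly component
        y.append(rv[t + 1])
    return np.array(X), np.array(y)

# Placeholder RV series; in practice this would come from Equation (1).
rng = np.random.default_rng(0)
rv = 0.01 + 0.005 * np.abs(rng.standard_normal(500))

X, y = har_design(rv)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares fit

# Forecast the next (out-of-sample) day from the latest components.
x_next = np.array([1.0, rv[-1], rv[-5:].mean(), rv[-22:].mean()])
next_day_rv = x_next @ beta
```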

A hybridised NN forecasting method can also be fashioned by combining the previously mentioned AR methods with a NN, each compensating for the other's flaws. There are also many other methods with which to hybridise NNs, such as methods derived from fuzzy logic [33, 121] or chaos theory [84]. Fuzzy models allow better handling of imprecise and non-numerical information with differing degrees of truth by allocating inputs into fuzzy sets, which can increase robustness to uncertainty. One way to incorporate this with a NN is through an adaptive network-based fuzzy inference system (ANFIS) architecture [33, 64]. It combines the rule base of fuzzy theory, used to describe complex relationships, with the learning ability of a NN to adjust the membership functions and the rule base. Chaos theory models the apparent disorder within a system that nevertheless obeys specific rules and is highly sensitive to initial conditions. Although there has not historically been solid support for the presence of chaos in economics [40], some recent developments in testing have found empirical evidence of chaotic behaviour in volatility, specifically volatility indices [84].

There are also a number of other methods that can be used in conjunction with NNs, but which were not considered hybridisations for the purposes of this review. These include transforming or preprocessing the input data to make information more readily available to the forecasting model, such as Principal Component Analysis (PCA) for dimensionality reduction [121, 130], the Discrete Wavelet Transform (DWT) for basis-function-dependent signal decomposition [115, 132], and Empirical Mode Decomposition (EMD) to decompose signals into Intrinsic Mode Functions (IMFs) [61, 133]. Evolutionary computing can also be used as an alternative parameter search and optimisation method to train the weights of a NN, as opposed to standard first-order back-propagation.
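The preprocessing steps mentioned above are typically a few lines each; the sketch below shows a wavelet decomposition with the third-party PyWavelets package and a PCA compression with scikit-learn, applied to placeholder data (the wavelet choice, decomposition level, and component count are arbitrary illustrations).

```python
import numpy as np
import pywt                                   # third-party: pip install PyWavelets
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=512)     # placeholder return series

# DWT: decompose the series, zero the detail coefficients, and reconstruct
# a smoothed signal that could serve as an additional input feature.
coeffs = pywt.wavedec(returns, "db4", level=3)
coeffs_smooth = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
smoothed = pywt.waverec(coeffs_smooth, "db4")[: len(returns)]

# PCA: compress a block of (here random) exogenous features into 5 components.
exogenous = rng.normal(size=(512, 40))
components = PCA(n_components=5).fit_transform(exogenous)
```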


2.3 Defining the Forecasting Task
Given that the NN is a universal approximator that attempts to learn the mapping \(f\) in \(\hat{y}_{t+1} = f(X_t)\), the task defines what \(\hat{y}_{t+1}\) and \(X_t\) are. This is an important factor in the success of a model; the inputs \(X_t\) must contain enough information to predict the output \(\hat{y}_{t+1}\). Note that information is not used interchangeably with data: data is simply a set of observations or samples, whilst information is more abstract; noise can produce an infinite amount of data but will contain almost no information. The input can be constructed in many ways, the simplest of which is a univariate approach, where the input contains only one variable (though possibly multiple lags of it). This leads to a simple and parsimonious model, but may not perform well, as the inputs may not hold enough information to predict the output effectively. An example can be seen in Figure 1, in which a NAR MLP network forecasts volatility 1 day ahead from the previous five observed values of volatility. Additional information can be introduced to the model through more variables, either derived from the same underlying asset or from other assets. This can better describe the movements of both the underlying asset and the market in general, but may mean facing the curse of dimensionality, in which high-dimensional inputs cause samples to become exponentially more sparse. All samples then become approximately equidistant, so the network cannot discover useful clusters, ultimately leading to over-fitting and poor out-of-sample performance. The output can also be constructed in many ways, not only in terms of which volatility proxy to use, but also in terms of when the forecast window starts and how long it is (specifically, \(\tau_1\) and \(\tau_2\) in Equations (1) and (2)). Different forecasting windows require different information from the inputs. Whether a set of inputs holds enough information to forecast an output is particularly difficult to evaluate, as typical measures, such as correlation or information criteria, are linear and have trouble capturing how well a nonlinear NN mapping will perform.
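To make the task definition concrete, the sketch below builds \((X_t, y_{t+h})\) pairs from a (possibly multivariate) feature matrix and a chosen volatility proxy, for a given number of lags and forecast horizon; the data and dimensions are placeholders.

```python
import numpy as np

def make_forecast_task(features: np.ndarray, proxy: np.ndarray,
                       n_lags: int, horizon: int):
    """Build (X_t, y_{t+horizon}) pairs.

    `features` is a (T, d) matrix of input variables (one or several assets
    or proxies); `proxy` is the length-T series of the chosen volatility
    proxy. Each X_t stacks the last `n_lags` rows of `features`, and the
    label is the proxy `horizon` steps ahead.
    """
    X, y = [], []
    for t in range(n_lags - 1, len(proxy) - horizon):
        X.append(features[t - n_lags + 1: t + 1].ravel())
        y.append(proxy[t + horizon])
    return np.array(X), np.array(y)

# Placeholder: two input variables (e.g. returns and lagged RV), 5 lags, 1-day horizon.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 2))
proxy = 0.01 + 0.005 * np.abs(rng.standard_normal(500))
X, y = make_forecast_task(features, proxy, n_lags=5, horizon=1)
```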




