1 Introduction

On February 20, 2020, the systemic risk arising from the financial contagion effects of the COVID-19 pandemic turned a steadily rising S&P 500 index market into one of panic and distress. Between February 20 and March 23, the market plummeted at a record pace and lost more than 30%. During this time, the volatility index (VIX) peaked at 82% on March 16, 2020. This was similar to the subprime mortgage crisis, when the VIX closed at 79% on October 24, 2008. However, the bear market that ensued did not last long, and the recovery started on March 24, 2020. By June 8, 2020, the S&P 500 index market had recuperated about 85% of the losses from its record closing high on February 19, 2020.

The option market followed the underlying S&P 500 index market closely. Specifically, the average premium of call options written on the S&P 500 index (across all strike prices and maturities) declined from $139.34 on February 19, 2020 to $101.31 on March 23, 2020. More generally, considering that the S&P 500 index option prices represent discounted expectations of future outcomes of the S&P 500 index, if the market expects a crash, the prices of out-of-the-money put options would be relatively large compared to the out-of-the-money call options. In fact, on February 19, 2020, the average price of all the out-of-the-money put (call) options was $25.40 ($39.64). Hence, it appears that the COVID-19 crash that followed had not been anticipated by the market participants. Then, over the period until March 23, 2020, the average daily price of all the out-of-the-money put (call) options increased (decreased) to $105.83 ($34.99). This indicates the emergence of extreme pessimism of option traders at the trough of the crisis. In the same vein, by analyzing the prices of options on commodity futures in March, 2020, Vercammen (2020) also reported traders’ growing confusion and concern about the future of commodity prices and disruptions to supply chains due to COVID-19.

Broadly speaking, the COVID-19 contagion has had a devastating global impact on various businesses and markets over both short- and long-run (Sharif et al. 2020). Certain industries such as tourism, transportation, hospitality and automotive manufacturing were halted abruptly. Mazur et al. (2021) investigated the performance of the stock constituents of the S&P 1500 index during the COVID-19 crisis. They found that health care, food, natural gas, and software sectors performed abnormally well, while crude petroleum, real estate, entertainment and hospitality sectors underperformed and produced high negative returns. Similarly, for the Asian stock markets, Liu et al. (2020) documented large cumulative abnormal returns for the pharmaceutical, software and IT services, while transportation, lodging and catering incurred heavy losses. Such a discouraging outlook for profitability was further confirmed in Harjoto et al. (2021) in the sense that COVID-19 generated substantial negative shocks to emerging markets and for small firms. As explained in Bellalah et al. (2020), the COVID-19 crisis was marked by a significant worldwide reduction in allocated investments and their model was able to assist in portfolio decisions in presence of regime-switching.

In the context of financial contagion, Ramelli and Wagner (2020) illustrated how the effects of the COVID-19 pandemic were amplified through financial channels. Morelli and Petrella (2021) suggested that the contagion spread between option and stock markets during the COVID-19 outbreak and that it affected both European and American options in a similar manner. Furthermore, Akhtaruzzaman et al. (2021c) demonstrated that the strength of the relationship between Chinese and G7 financial and non-financial companies’ stock returns increased substantially during the COVID-19 crash. With regard to oil prices, Akhtaruzzaman et al. (2021a) found that the COVID-19 pandemic appeared to have moderated the oil price risk exposure of both financial and non-financial firms. As for the U.S. stock market, by studying the frequency domain causality, Lento and Gradojevic (2021) documented that the S&P 500 index returns caused oil returns prior to the pandemic, but that the causality direction was reversed during the market crash and recovery. They also reported mostly bi-directional causalities between the S&P 500 index returns and various asset classes when the market crashed during the February 20, 2020 to March 23, 2020 period. With respect to gold prices, Akhtaruzzaman et al. (2021b) showed that gold lost its ‘safe haven’ property during the market recovery (from March 17, 2020 to April 24, 2020). Their evidence corroborates that of Lento and Gradojevic (2021), who observed price spillover effects from gold to the stock market at this time.

This paper studies the behavior of the S&P 500 index option market in response to the COVID-19 pandemic through the lens of explainable artificial intelligence (XAI). First, our objective is to track the option pricing accuracy of an array of competing models across three time periods: from January 1, 2020 to February 19, 2020 (pre-COVID-19), from February 20, 2020 to March 23, 2020 (COVID-19 market crash), and from March 24, 2020 to June 15, 2020 (post-COVID-19 recovery). Since the seminal works of Bates (1991, 2000), there have been a few papers that analyzed the performance of option pricing models during a financial crisis (Bates 2012; Fulop et al. 2014; Calvet et al. 2015; Driouchi et al. 2018; Luo et al. 2001), Gençay and Altay-Salih (2003), and Gradojevic et al. (2009). More recent work can be seen in, for example, Gradojevic (2016) and Jang and Lee (2019). The goal of such scholarly efforts is to harness the learning ability and flexibility of machine learning models to achieve better prediction accuracy than the classical (parametric) financial option models.

In relation to the existing option pricing literature, the novelty of this paper lies in the following: (1) making use of the advanced AI methods such as extreme gradient boosting and random forest in option pricing; (2) introducing two additional inputs (open interest and volume) to a traditional non-parametric model and measuring their significance; (3) proposing a new clustering approach based on a re-organizing neural network to facilitate option pricing and model interpretation; (4) utilizing XAI in interpreting the non-parametric models; and (5) testing option pricing models across various market regimes surrounding the COVID-19 market meltdown.

In short, the main results of this paper can be summarized as follows: (1) non-parametric models (extreme gradient boosting and random forest) produce the most accurate forecast performance across all market regimes; (2) the Black-Scholes model’s pricing accuracy is comparable to that of the non-parametric models during the market crash; (3) the relative importance of traditional option pricing model’s inputs (moneyness, time-to-maturity and volatility) is robust during regime shifts; (4) open interest becomes a more important input at times of market volatility and distress; and (5) the proposed data clustering method is beneficial for providing deeper insights into the behavior of market participants and data generating mechanisms of option prices.

The paper is laid out as follows: The next section reviews the option pricing literature with the special emphasis on non-parametric models. Section 3 describes the S&P 500 index options data. Section 4 presents the methodological steps involved in our research design as follows: (i) empirical models (support vector regression, artificial neural network, and random forest), (ii) optimal clustering (with re-organizing neural networks), (iii) sub-period analysis, and (iv) feature importance estimation. Section 5 provides the empirical findings, while Sect. 6 concludes.

2 Background literature: option pricing

To address the well-known pricing biases of the Black-Scholes model (Black and Scholes 1973) such as the “volatility smile” and “volatility smirk” phenomena, research efforts have expanded into developing both parametric and non-parametric alternatives. The research on parametric models has mainly focused on introducing the stochastic volatility (SV), stochastic volatility random jump (SVJ) and stochastic interest rate (SI) to the Black-Scholes setting. These parametric extensions have been shown to be superior to the Black-Scholes model in out-of-sample pricing and hedging exercises (Bakshi et al. 1997). Specifically, the SV model has been shown to have first-order importance over the Black-Scholes model (Gençay and Gibson 2009). The SVJ model further enhances the SV model for pricing short-term options, while the SI model extends the SVJ model with regard to the pricing of long-term options. Additional notable parametric research contributions include Bates (1991, 2000), Heston (1993), Christoffersen et al. (2009), Calvet et al. (2015), He and Zhu (2016), Wong and Lo (2009), Cai and Kou (2011) and Gaß et al. (2018).

Although parametric models in general improve the pricing accuracy of the Black-Scholes model, these models exhibit certain pricing biases and are often inferior to non-parametric approaches. Non-parametric models are not bound by the normality of return distributions (or other unjustifiable parametric assumptions) and can benefit from their adaptive learning abilities (Jang and Lee 2019; Gradojevic et al. 2009). For example, in Gençay and Gibson (2009), the out-of-sample performance of an artificial neural network model was compared to the SVJ, SI and SV parametric approaches for the S&P 500 stock market index. The paper showed that all three parametric models were dominated by the neural network pricing model with the GARCH (1, 1) volatility. Similarly, a semi-parametric approach from Andreou et al. (2008) was able to improve upon the SV and SVJ models.

The non-parametric approaches to option pricing have also been used by Hutchinson et al. (1994), Garcia and Gençay (2000), Qi and Maddala (1996), Gençay and Qi (2001), and Gençay and Altay-Salih (2003). More recent examples can be found in Barone-Adesi et al. (2008), Fan and Mancini (2009), von Spreckelsen et al. (2014) and Guidolin and Hansen (2016). An innovative strand of literature concerns option pricing with wavelets (Liu et al. 2019). This paper used a set of daily options written on the DAX-30 index over the 2009–2012 period. The wavelet-based option pricing model outperformed the SVJ parametric benchmark on out-of-sample data, but it was still inferior to the neural network non-parametric model.

In addition, the predictive accuracy and hedging prowess of neural network-based models have been highlighted in a very recent paper by Cao et al. (2021). They developed a novel hybrid gated neural network option pricing model and used a similar method to predict the S&P 500 index implied volatility. This paper reinforces the work of Culkin and Das (2017), who showed how neural networks could be trained to mimic option pricing traders. Buehler et al. (2019) is another related contribution that demonstrates the advantages of hedging with deep learning neural networks on the S&P 500 index option data (2013–2018) in comparison to the Black-Scholes model.

Some other important option pricing approaches are the mixture of distributions model by Melick and Thomas (1997) and Bhat and Kumar (2012), and the semi-parametric estimator by Aït-Sahalia and Lo (1998). These models have shown sizable improvements in option pricing accuracy compared to the Black-Scholes model; however, their out-of-sample pricing is inferior to non-parametric modular neural network models (Gradojevic et al. 2009). It is also worthwhile to mention non-parametric approaches that are based on the affine jump-diffusion models (Carr and Wu 2004) and the normal inverse Gaussian models (Eriksson et al. 2009; Barndorff-Nielsen and Shephard 2001). Furthermore, fuzzy logic has been proven very useful for option pricing in an uncertain market environment (Agliardi and Agliardi 2009). Nevertheless, fuzzy logic models have never been compared to non-parametric competitors in an out-of-sample pricing exercise with real market data.

3 Data

The data are provided by DeltaNeutral and represent the daily closing S&P 500 index European call option prices (the average of the bid-ask quotes), taken from the Chicago Board Options Exchange. Call options across different strike prices and maturities are considered for the period from January 1, 2020 to June 15, 2020. Since it is one of the deepest and the most liquid option markets in the United States, the S&P 500 index option market is sufficiently close to the theoretical setting of the Black-Scholes model. The implied volatility used in the estimations is a proprietary mean estimate provided by DeltaNeutral. The risk-free rate is approximated by the monthly yield of the U.S. Treasury bills. To reduce the size of the data set, options with zero volume and open interest on a given day were eliminated.

The data were divided into three non-overlapping sub-samples as follows:

  • January 1, 2020 to February 19, 2020 (pre-COVID-19): 12989 observations;

  • February 20, 2020 to March 23, 2020 (COVID-19 market crash): 19547 observations;

  • March 24, 2020 to June 15, 2020 (post-COVID-19 recovery): 38909 observations.

Table 1 Sample properties of the S&P 500 index European call option prices

The statistical properties of the data set are presented in Table 1. The data across three sub-periods are divided into several categories in terms of moneyness and time-to-maturity (\(\tau \)). A call option is defined to be out-of-the-money (OTM) if \((S_t/K)<0.95\), near-the-money (NTM) if \(0.95\le (S_t/K)\le 1.05\) and in-the-money (ITM) if \((S_t/K)>1.05\). An option is classified as short-term if \(\tau <60\) days, medium-term if \(60\le \tau \le 180\) days and long-term if \(\tau >180\) days. The reported numbers are the average quoted bid-ask midpoint price, the standard deviation of the bid-ask midpoint prices (shown in parentheses) and the total number of observations for each moneyness-maturity category. OTM, NTM, and ITM options account for approximately 50%, 33%, and 17% of the total sample, respectively. The average prices of call options range from $0.91 for the OTM, short-term options in the pre-COVID-19 period to $807.46 for the ITM, long-term options in the post-COVID-19 period. Standard deviations become extremely large for the ITM options in all sub-samples and they generally increase with maturity.
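The moneyness and maturity categories defined above can be expressed as a small classification routine. The following is a minimal sketch; the function name `classify_option` and its argument names are hypothetical and are introduced only for illustration.

```python
def classify_option(spot, strike, days_to_maturity):
    """Return (moneyness, maturity) labels using the thresholds stated in the text."""
    ratio = spot / strike  # S_t / K
    if ratio < 0.95:
        moneyness = "OTM"
    elif ratio <= 1.05:
        moneyness = "NTM"
    else:
        moneyness = "ITM"

    if days_to_maturity < 60:
        maturity = "short"
    elif days_to_maturity <= 180:
        maturity = "medium"
    else:
        maturity = "long"
    return moneyness, maturity
```

For instance, a call with \(S_t=3000\), \(K=3300\) and 30 days to maturity has \(S_t/K\approx 0.91\) and falls into the OTM, short-term category.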

It is useful to note that the COVID-19 and post-COVID-19 sub-periods markedly affected all options, regardless of their moneyness or maturity. First, the prices of both OTM and NTM medium- and short-term options on average increased during the COVID-19 crash and, then, slightly decreased during the recovery period. This is not the case, however, with long-term options whose prices were the highest in the post-COVID-19 period for all maturities. Clearly, long-term uncertainty was priced at its peak after the crash. Standard deviations followed a similar inverted-U pattern across the sub-periods for all OTM and NTM options, while the standard deviations for ITM options did not show any discernible temporal patterns.

4 Methodology

4.1 The model

The option pricing formula is defined in the spirit of Hutchinson et al. (1994), Garcia and Gençay (2000) and Gradojevic et al. (2009):

$$\begin{aligned} C_t=\phi (S_t,K,\tau ,\sigma _{IV},OPI,VOL) \end{aligned}$$
(1)

where \(C_t\) is the call option price, \(S_t\) is the price of the underlying asset, K is the strike price, \(\tau \) is the time-to-maturity, OPI denotes open interest (number of open call contracts), VOL is volume (number of contracts traded) and \(\sigma _{IV}\) is the implied volatility. Assuming the homogeneity of degree one of the pricing function \(\phi \) with respect to \(S_t\) and K, one can write the option pricing function as follows:

$$\begin{aligned} \underbrace{\frac{C_t}{K}}_{c_t}=\phi \left( \underbrace{\frac{S_t}{K}}_{x_1},1,\underbrace{\tau }_{x_2},\underbrace{\sigma _{IV}}_{x_3},\underbrace{OPI}_{x_4},\underbrace{VOL}_{x_5}\right) =\phi (x_1,x_2,x_3,x_4,x_5). \end{aligned}$$
(2)

In general, options are often referred to as plain vanilla derivatives because their payoff (or price) is determined by the so-called underlying, which is in our case the S&P 500 stock market index. Call options are more profitable for the buyer when, ceteris paribus, the price of the underlying (\(S_t\)) increases or the strike price (K) decreases. Therefore, intuitively, these two variables must be integral parts of the option pricing formula. Further, when time-to-maturity (\(\tau \)) increases, call options in general become more valuable. This is explained by the fact that it is more likely that the option will be in the money (\(S_t-K>0\)) and, thus, worthwhile exercising at maturity. The preceding explanatory variables are extended with the implied volatility that is a standard input to an option pricing model. The two non-standard inputs that are used in our model are open interest and volume. Apart from the possibility that such inputs may gain importance in a distressed market (e.g., COVID-19 crash), the rationale for using these two predictors is hinted at in Gârleanu et al. (2009) who found that demand pressures could influence option prices. Volume and open interest reflect the activity of option traders (i.e., market sentiment) that is related to demand for options.
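To make the normalization in Eq. 2 concrete, the sketch below builds the five inputs \(x_1,\ldots ,x_5\) and the target \(c_t\) from raw quotes. The function and argument names are hypothetical, and no rescaling of open interest or volume is applied because the text does not specify one.

```python
def build_features(call_price, spot, strike, tau, sigma_iv, open_interest, volume):
    """Map one raw option quote to the inputs and target of Eq. 2."""
    c = call_price / strike          # c_t = C_t / K
    x = [
        spot / strike,               # x1: moneyness S_t / K
        tau,                         # x2: time-to-maturity
        sigma_iv,                    # x3: implied volatility
        open_interest,               # x4: OPI
        volume,                      # x5: VOL
    ]
    return x, c
```

Under the homogeneity assumption, scaling \(S_t\), \(K\) and \(C_t\) by the same factor leaves both the inputs and the target unchanged, which is why only the ratios enter the model.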

4.2 Clustering of options

First, we determine the optimal number of clusters based on the partitioning of the input space for all sub-periods. For such a purpose we utilize the Davies-Bouldin (DB) index (Davies and Bouldin 1979):

$$\begin{aligned} DB=\frac{1}{M} \sum _{i=1}^{M}R_i,\quad R_i=\max (R_{ij}),\quad j=1,\ldots ,M;j\ne i \end{aligned}$$
(3)

where \(R_{ij}=(S_i+S_j)/D_{ij}\) is a similarity measure between observations in data partitions (clusters) i and j, \(S_i\) is a dispersion measure of the observations of \(i^{th}\) cluster calculated as the average Euclidean distance of the data points in cluster i to its center, and \(D_{ij}\) is a cluster’s dissimilarity measure (distance between the centers of cluster pairs). The partition that produces the minimum DB is considered to be optimal.
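As an illustration of Eq. 3, the following sketch computes the DB index for a labeled set of points using only the Python standard library. The function name `davies_bouldin` is hypothetical; the code assumes at least two clusters with distinct centers.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index of Eq. 3 for points (sequences of floats) and cluster labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    ids = sorted(clusters)

    # Cluster centers: coordinate-wise arithmetic means.
    centers = {i: [sum(col) / len(clusters[i]) for col in zip(*clusters[i])] for i in ids}
    # S_i: average Euclidean distance of the members of cluster i to its center.
    disp = {i: sum(math.dist(p, centers[i]) for p in clusters[i]) / len(clusters[i]) for i in ids}

    total = 0.0
    for i in ids:
        # R_i = max over j != i of (S_i + S_j) / D_ij.
        total += max((disp[i] + disp[j]) / math.dist(centers[i], centers[j])
                     for j in ids if j != i)
    return total / len(ids)
```

A compact, well-separated partition yields a small value, whereas a partition that mixes distant points into the same cluster yields a large one, so candidate numbers of clusters can be compared by evaluating this function on each resulting partition.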

The actual clustering is performed by using a competitive learning algorithm called Re-Organizing Neural Network (RONN) (Kukolj and Levi 2004; Kukolj et al. 2006). The algorithm relies on training data that contain N input observations \({x_k}, k =1,\ldots , N\), where N is the sample size. If we denote the dimension of the input vector by n, then the clusters we find will represent areas where data values are highly concentrated.

The RONN generates a one-layered network with one input layer where \(x_k\)’s are fed. Nodes in the output layer represent cluster centers. The total number of clusters is an input parameter of the algorithm. In tandem with the DB index measure, for each pre-selected number of clusters, the RONN algorithm finds the cluster centers, which is followed by the selection of optimal clustering structure, based on the minimal DB index. The RONN model is an iterative learning algorithm and its main loop contains two smaller loops within itself. The first loop performs iterative adjustments of the node coordinates using the k-means algorithm until the nodes stabilize in their geometric positions. Coordinates of the nodes are calculated as arithmetic means of observations in each cluster. A cluster’s mean-squared error (MSE) is a measure of deviation of observations in the cluster from its center.

The second loop concerns ‘dead nodes’, i.e., cluster centers which have ended up without any observations in their vicinity. If there are no ‘dead nodes’, the cluster with the smallest number of observations is identified. The center of that cluster is then considered a ‘dead node’, while its observations are re-distributed to the closest clusters. During the iterative adjustments each ‘dead node’ observation is allocated to the node with the largest MSE and the new coordinates of this node are then given by:

$$\begin{aligned} {v_i}^{new}={v_{\max }}^q + \delta ,\quad i=1,\ldots ,q \end{aligned}$$
(4)

where \({v_{\max }}^q\) is the location of the selected node among the q nodes with the largest MSE values, \({v_i}^{new}\) is the position of the new node with respect to the center of a possible new cluster, and \(\delta =[\delta _1,\delta _2,\ldots ,\delta _n]^T\) are small random numbers. The procedure depicted by Eq. 4 is repeated for all ‘dead nodes’ until they are allocated to clusters. A detailed description of the RONN algorithm can be found in Kukolj et al. (2006).
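The two-loop structure can be illustrated with a deliberately simplified sketch. It is not the RONN algorithm of Kukolj et al. (2006): it only combines k-means updates with the re-seeding step of Eq. 4 for nodes that end up without observations, taking \(q=1\) (the single node with the largest MSE) and omitting the handling of the smallest cluster. The function name, the default `delta` scale and the fixed iteration count are assumptions.

```python
import math
import random

def kmeans_with_reseeding(points, centers, n_iter=20, delta=1e-3, seed=0):
    """k-means updates plus re-seeding of empty ('dead') nodes near the worst node."""
    rng = random.Random(seed)
    centers = [list(c) for c in centers]
    for _ in range(n_iter):
        # First loop: assign each observation to its nearest node.
        members = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            members[nearest].append(p)
        # Move every non-empty node to the arithmetic mean of its observations.
        for i, group in enumerate(members):
            if group:
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
        # Cluster MSE: mean squared distance of observations to their node.
        mse = [sum(math.dist(p, centers[i]) ** 2 for p in g) / len(g) if g else 0.0
               for i, g in enumerate(members)]
        worst = max(range(len(centers)), key=lambda i: mse[i])
        # Second loop: place each dead node next to the node with the largest MSE,
        # shifted by small random numbers as in Eq. 4.
        for i, group in enumerate(members):
            if not group:
                centers[i] = [v + rng.uniform(-delta, delta) for v in centers[worst]]
    return centers
```

In this sketch, a node initialized far from all data is pulled next to the most dispersed cluster and then competes for that cluster's observations in the following iterations.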

4.3 Feedforward artificial neural network (FF-ANN)

To explain the concept of an FF-ANN model, we will refer to Eq. 2 and our goal is to estimate the parameters of non-linear function (\(\phi \)). In the context of the FF-ANN model, the parameters are called connection weights (\(\alpha _{ij}\) and \(\beta _{j}\)) and node biases (\(\alpha _{j0}\) and \(\beta _{0}\)). The model consists of three building blocks that are the input layer (where predictors are fed into the model), hidden layers (where the functional approximation or learning takes place) and the output layer (where the option price prediction is generated).

If we initially assume the model architecture involves only one hidden layer with q computational elements (nodes or neurons), the FF-ANN model is estimated as:

$$\begin{aligned} c_{t}= \phi \Bigg (\beta _{0}+\sum _{j=1}^{q}\beta _{j}\psi \Big (\alpha _{j0}+\sum _{i=1}^{k}\alpha _{ij}x_{it}\Big )\Bigg ) \end{aligned}$$
(5)

In this example, q is the number of hidden nodes, where the single hidden and the output layers are characterized by two flexible classes of non-linearities: \(\psi \) and \(\phi \), respectively. \(\alpha _{ij}\) and \(\beta _{j}\) denote appropriate connection weights between the adjacent layers. Subscripts 0 for \(\alpha \) and \(\beta \) stand for the biases.
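A forward pass through Eq. 5 can be sketched as follows. The sketch assumes a sigmoid hidden non-linearity \(\psi \) and a linear (identity) output function \(\phi \); the function names and the list-based weight layout are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ff_ann_forward(x, alpha0, alpha, beta0, beta):
    """Single-hidden-layer forward pass of Eq. 5.

    x      : list of k inputs x_{it}
    alpha0 : list of q hidden-node biases alpha_{j0}
    alpha  : k-by-q nested list, alpha[i][j] is the weight from input i to hidden node j
    beta0  : output bias
    beta   : list of q hidden-to-output weights
    """
    q = len(beta)
    hidden = [
        sigmoid(alpha0[j] + sum(alpha[i][j] * x[i] for i in range(len(x))))
        for j in range(q)
    ]
    # Linear output layer (phi taken as the identity in this sketch).
    return beta0 + sum(beta[j] * hidden[j] for j in range(q))
```

With all \(\alpha \) weights and biases set to zero, each hidden node outputs 0.5, so the prediction reduces to \(\beta _0+0.5\sum _j\beta _j\), which provides a quick sanity check of an implementation.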

The training algorithm is Adam, a first-order gradient-based optimizer based on adaptive estimates of lower-order moments. It is computationally efficient and suitable for large datasets with a lot of parameters to estimate. The default parameters follow those provided in Kingma and Ba (2015). The activation function types used in the hidden layer are sigmoid, with the linear function in the output layer. Validation data is used to select the optimal model architecture. The model will set apart 10% of the training data, will not train on it, and will evaluate the loss (i.e., mean-squared error) and any model metrics on this data at the end of each epoch. If, during the last 5 epochs the monitored quantity (validation loss) has no improvement, the training will be stopped.
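The stopping rule can be written as a short check on the history of validation losses. This is one plausible reading of the rule (training stops when none of the last five epochs improves on the best loss seen before them); the function name `should_stop` is hypothetical, and the exact tie-breaking used by a given deep learning library may differ.

```python
def should_stop(val_losses, patience=5):
    """Return True if the last `patience` epochs brought no improvement in validation loss."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent >= best_before
```

The check would be evaluated at the end of each epoch, after the loss on the held-out 10% of the training data has been appended to `val_losses`.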

4.4 Support vector machine (SVM) model

SVM is a supervised learning model for prediction and classification, introduced by Vapnik (1999). The basic idea of this algorithm can be described by two steps: (a) transform training samples into a space of higher dimension using a non-linear mapping function; and (b) perform linear regression in the space of higher dimension in order to separate the data samples. Transforming data from original to the new (higher order) space is performed using a predetermined kernel function. The kernel function is defined as dot product of mapped data input vectors (\(x_{it},~i=1,\ldots ,k\)) from the original space that enables computations of points in the new feature space without explicitly calculating the unknown non-linear mapping (\(\phi (x_{1t},x_{2t},\ldots ,x_{kt})\)). In this implementation, we employ a second order polynomial for the initial kernel function (\(K(x_{it},x)\)). This function is redefined by a normalization, as given by the following expression:

$$\begin{aligned} {\widetilde{K}}(x_{it},x)= \frac{K(x_{it},x)}{\sqrt{K(x_{it},x_{it})K(x,x)}} \end{aligned}$$
(6)

The normalization of the kernel function can be viewed as a simultaneous rescaling of data rows and columns to obtain a matrix with all diagonal entries set to one (Graf and Borer 2001). In the second step of the SVM algorithm (frequently called SV Regression—SVR), we perform data separation by constructing linear regression in the higher dimension feature space:

$$\begin{aligned} f(x)= \omega ^T \phi (x)+b, \end{aligned}$$
(7)

where \(\omega \) represents the vector of weights and b is a bias term. The optimization goal is to determine the trade-off between the flatness of f(x) while making sure that it has at most an \(\epsilon \) deviation between the obtained targets and training data outputs. In our work, we use the SVR implementation given in Shevade et al. (2000). The \(\epsilon \) parameter of the \(\epsilon \)-insensitive loss function is set to 0.001.
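The kernel normalization of Eq. 6 and the \(\epsilon \)-insensitive loss can be illustrated with the sketch below. The inhomogeneous form \((u^Tv+1)^2\) of the second order polynomial kernel is an assumption, since the text does not state the offset, and all function names are hypothetical.

```python
import math

def poly_kernel(u, v, degree=2, offset=1.0):
    """Polynomial kernel (u.v + offset)^degree; the offset of 1 is an assumption."""
    return (sum(a * b for a, b in zip(u, v)) + offset) ** degree

def normalized_kernel(u, v, kernel=poly_kernel):
    """Normalized kernel of Eq. 6: K(u, v) / sqrt(K(u, u) * K(v, v))."""
    return kernel(u, v) / math.sqrt(kernel(u, u) * kernel(v, v))

def eps_insensitive_loss(y_true, y_pred, eps=0.001):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)
```

By construction, the normalized kernel of any input with itself equals one, which corresponds to the unit diagonal of the kernel matrix mentioned above.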

4.5 Random forest (RF) model

A random forest (RF) is an ensemble machine learning technique introduced by Breiman (2001). It consists of a collection of regression trees whose averaged outputs determine the final prediction of the ensemble. The RF learning is based on a bootstrap aggregation (bagging) and random features subspace selection. Through a bagging procedure, each tree in the ensemble has randomly selected portions of training samples (with replacement) from the original dataset. In order to avoid possible correlations between constituent random trees and enhance the estimation performance of the RF model, the idea of a random features subspace is applied. As a result, each tree is grown on the basis of a randomly chosen input subset. For each node, the splitting algorithm searches over a subset of selected features to determine the best split point. The RF growth during the learning process is determined by two parameters: the size of the features subset used in each regression tree and the number of trees that form the forest. We tuned these two parameters to find the best regression RF model, as presented in Probst et al. (2019). Because of the small number of input features, a grid search was adopted as the simplest tuning strategy. Varying the number of trees on the [50, 150] interval and the features subset size from 2 to 6 showed that the RF with 80 trees and the features subset size of four has the lowest RMSE across the three data sub-periods considered.

In the process of constructing each decision tree, the RF learning algorithm uses the classification and regression trees (CART) learning algorithm (Breiman et al. 1984) which adopts the Gini index as the impurity-based measure. More precisely, the Gini index is used every time a split of a node is made on a certain variable. The learning algorithm is able to cope with numeric variables, characteristic for the problem considered in this study, by discretizing the continuous scale into two intervals. The optimal cut-off point is determined on the basis of evaluation of a threshold pool consisting of adjacent distinct values. The Gini index describes a decrease in the node impurity weighted by the probability of reaching that node:

$$\begin{aligned} G = \sum _{i=1}^{C} {p_i}(1-p_i), \end{aligned}$$
(8)

where \(p_i\) is the node probability calculated as the number of observations that reach the node divided by the total number of observations and C is the number of classes in the target variable. The variance reduction of a given tree node describes a decrease of the variance of the target variable due to the split at this node. Adding up and averaging the variance reductions for every node over all trees in the forest gives the value of relative importance of an input variable. Feature importance in the form of Gini index is commonly used to identify the contribution of individual predictors toward explaining the output (option price). Such an approach that relies on a decision tree classifier is considered intrinsically XAI in nature and it will assist us in obtaining the economic interpretation of our results.
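The two split criteria mentioned here can be sketched in a few lines. In this illustration, \(p_i\) in Eq. 8 is treated as the proportion of class i among the observations in a node, which is the usual reading of the Gini impurity and is an assumption about the intended definition; the variance reduction is computed for a single candidate split of a continuous target. All function names are hypothetical.

```python
def gini_impurity(class_probs):
    """Eq. 8: sum over classes of p_i * (1 - p_i)."""
    return sum(p * (1.0 - p) for p in class_probs)

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Decrease in target variance when `parent` is split into `left` and `right`."""
    n = len(parent)
    return (variance(parent)
            - (len(left) / n) * variance(left)
            - (len(right) / n) * variance(right))
```

Summing such reductions over all nodes that split on a given input, and averaging over the trees of the forest, gives the impurity-based importance of that input.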

5 Results

5.1 Comparative performance on sub-periods

We run our predictive models and price call options on the last 10% of data (testing set) across the three sub-periods: (1) pre-COVID-19, (2) COVID-19 crash, and (3) post-COVID-19 recovery. Our goal is to compare the forecast performance of the competing models (FF-ANN, SVM, RF, XGB, and the Black-Scholes benchmark) on each subsample and to also identify the most important predictors of option price fluctuations. Table 2 shows the pricing accuracy reflected in the root mean-squared prediction error (RMSE) of the models that were tested on out-of-sample data.
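The evaluation protocol can be sketched as follows: hold out the last 10% of each sub-sample, price the held-out options, and compute the RMSE. The Black-Scholes call formula below is the textbook version without dividends, which is a simplifying assumption because the treatment of the index dividend yield is not described here; the function names are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def black_scholes_call(spot, strike, tau, rate, sigma):
    """European call price; tau in years, rate and sigma annualized, no dividends."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * tau) * norm_cdf(d2)

def rmse(actual, predicted):
    """Root mean-squared prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def chronological_split(rows, test_fraction=0.10):
    """Keep the last `test_fraction` of time-ordered rows as the testing set."""
    cut = len(rows) - int(round(len(rows) * test_fraction))
    return rows[:cut], rows[cut:]
```

Any of the non-parametric models can be substituted for `black_scholes_call` when producing the predictions passed to `rmse`, so that all models are compared on the same held-out observations.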

Table 2 Comparison of predictive models for the sub-periods of data

First, we find that the non-parametric option pricing models strongly outperformed the Black-Scholes model during the pre-COVID-19 period. Overall, the most accurate pricing performance is generated by the XGB model. The second best-performing model on all time periods is a RF that produces almost 70% lower out-of-sample RMSE than the Black-Scholes model over the first sub-sample. We can also conclude that the regime shift which took place in the second and third sub-samples to a certain extent limited learning and generalization abilities of non-parametric models. Surprisingly, the pricing accuracy of the Black-Scholes model improved during the pandemic, although it was still inferior to that of the RF and XGB models. Being a pre-specified non-linearity, the Black-Scholes model showed less sensitivity to a regime shift than the competing non-parametric models. This evidence is in accord with Kukolj et al. (2012) for the 2008 crisis. Another interesting finding is that, except for the XGB model, the pricing ability of the models diminished during the post-COVID-19 recovery period. As it could also be observed in Table 1, the uncertainty appeared to have been greater post-COVID-19 than during the actual crash, which impeded the option pricing accuracy.

Next, we will study the relative importance of inputs in the RF model. For such a purpose, we estimate the Gini indices (based on the average impurity decrease) while pricing options on each sub-sample of data. Table 3 displays the estimates for the inputs across the three sub-periods. It can be observed that the moneyness of options (\(S_t/K\)) was among the top two most important inputs in each market regime. In fact, moneyness is the input that contributed to option pricing the most during extreme market volatility (COVID-19 and post-COVID-19 periods). Interestingly, during a steady market regime, option prices were the most sensitive to implied volatility, while this input was the second most important predictor during the post-COVID-19 recovery. We can conclude that when the market is relatively stable or on an upswing, implied volatility is important for option pricing. In contrast, during a market downswing, market participants react to excessive movements by focusing more on open interest (and even maturity) when determining option prices. This evidence of distressed behavior is a departure from the traditional no-arbitrage principle in option pricing where market-makers are able to perfectly hedge their inventories. Apparently, when the market is collapsing and is in a state of panic, open interest changes affect option demand imbalances, which, in turn, impact option prices substantially.

Table 3 Feature importance in the RF model for the sub-periods of data

Following another referee’s suggestion, we supplement the XAI analysis with the Shapley additive explanations (SHAP) values for our set of inputs across the three subsamples (Lundberg and Lee 2017). The advantage of this method lies in its ability to explain the output of machine learning models. Essentially, the SHAP values show how much each predictor contributes (positively or negatively) to the output. Another benefit of the SHAP approach is its local interpretability where each observation is assigned its own set of SHAP values. First, we will estimate the SHAP values across the training data sets for the most accurate models (RF and XGB) by taking the mean absolute value of each feature.
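To illustrate how per-observation attributions are aggregated into a global ranking, the sketch below computes exact Shapley values for a small model by enumerating all feature coalitions and then takes the mean absolute value per feature. It is a simplification: absent features are replaced by a single baseline vector rather than averaged over background data, the cost grows exponentially with the number of inputs, and it is not the estimator implemented in SHAP software. All names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at x, with absent features set to `baseline`."""
    n = len(x)

    def value(present):
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def mean_abs_shap(model, rows, baseline):
    """Global importance: mean absolute Shapley value of each feature over `rows`."""
    all_phi = [shapley_values(model, row, baseline) for row in rows]
    return [sum(abs(p[i]) for p in all_phi) / len(all_phi) for i in range(len(baseline))]
```

For a linear model with a zero baseline, each Shapley value equals the coefficient times the input, and the values for one observation sum to the difference between its prediction and the baseline prediction.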

Table 4 SHAP in the RF and XGB models for the sub-periods of data

Table 4 demonstrates the key role that traditional inputs, moneyness (\(S_t/K\)) and time-to-maturity (\(\tau \)), play in option pricing across all three market regimes. The fact that moneyness is the most informative input is consistent with the Gini indices from Table 3. However, according to the SHAP measures, implied volatility is not as important as the Gini indices suggested, but is consistently the third most important input. Also, when compared to Table 3, Table 4 suggests a diminished impact of open interest during the COVID-19 market crash. Nevertheless, relative to the other two sub-periods, it can be observed that open interest affects the output the most during the crash. As explained before, this shows that the markets experienced a greater degree of distress at this time.

Table 5 Feature importance in the XGB and RF models for the sub-periods of data

For the sake of direct comparison between the Gini indices estimated by the RF and XGB methods, we next show their normalized values side by side in Table 5. Clearly, the evidence for the RF model mirrors Table 3, while that for the XGB model is slightly different from the SHAP values in Table 4. More precisely, the Gini indices for the XGB model during the pre-COVID-19 period suggest that the implied volatility input was more important than time-to-maturity, which, in fact, reinforces the results for the RF model (i.e., the first column of Table 5). All other Gini values for the XGB model in Table 5 resemble the corresponding ones in Table 4. In all, we conclude that the XAI interpretation, as may be expected, depends to a certain extent on the underlying non-linear modeling method (e.g., RF vs. XGB) as well as on the XAI technique that is applied (e.g., Gini vs. SHAP). When an input is highly dominant, its influence and interpretation are robust across methods, but when an input is relatively uninformative, there exist differences in interpretation that we believe are complementary in nature as they provide further insights into the phenomenon of interest.

Fig. 1

SHAP summary plots. Notes: In Panel A, the SHAP feature contributions are shown for the pre-COVID-19 period (January 1, 2020 to February 19, 2020). In Panel B, the SHAP feature contributions are shown for the COVID-19 market crash period (February 20, 2020 to March 23, 2020). In Panel C, the SHAP feature contributions are shown for the post-COVID-19 recovery period (March 24, 2020 to June 15, 2020). The aggregate SHAP values for each instance-input combination and their relationship to the output (option price) are denoted by red (direct relationship) or blue (inverse relationship) dots. The underlying model in all three panels is estimated with the XGB algorithm

Figure 1 displays the SHAP summary plots for the XGB model, which explain how each input contributes to the output and in which direction. The plots use the training data and rank features vertically in descending order of the magnitude of their SHAP values. The color of each dot represents the direction of the feature's impact on the model output, where blue (red) dots represent an inverse (direct) relationship. For each observation, the location along the x-axis shows whether the impact is associated with a higher or lower prediction. What stands out in Fig. 1 is the strong dominance of the moneyness input and its positive effect on option prices in all market regimes. The other inputs have a much weaker contribution, which can be deduced from the high concentration of their SHAP values around zero. Notably, implied volatility and time-to-maturity display mostly positive relationships, while open interest and volume exhibit negligible negative impacts.

5.2 Clustering and feature importance

In this subsection, we perform optimal clustering of options in each market regime. This provides a deeper analysis of the pricing accuracy and the relevance of the model’s inputs. We also present an economic interpretation of our findings. Our first goal is to estimate the optimal number of clusters for each of the three time periods. Figure 2 plots the values of the DB index for various cluster choices. Clearly, the optimal number of clusters for the pre-COVID-19 market is six, while it is five for both the COVID-19 and post-COVID-19 data. We conjecture that the pandemic changed the market microstructure toward more homogeneous trading behavior among investors, originating in the convergence of their beliefs. Put differently, a change in the risk aversion of an average investor and an increased concentration thereof caused a reduction in the number of market “sub-regimes” (or clusters).

Fig. 2

Selection of the optimal number of clusters. Notes: The selection is performed for three sub-samples: (1) January 1, 2020 to February 19, 2020 (pre-COVID-19): 12989 observations (dot-dashed line), (2) February 20, 2020 to March 23, 2020 (COVID-19 market crash): 19547 observations (solid line), and (3) March 24, 2020 to June 15, 2020 (post-COVID-19 recovery): 38909 observations (dashed line). The optimal number of clusters (horizontal axis) is where the Davies-Bouldin criteria (vertical axis) are at the minimum value
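
The selection rule described above (choose the number of clusters that minimizes the Davies-Bouldin index) can be sketched as follows. This is only an illustration under stated assumptions: a plain k-means with a deterministic farthest-point initialization stands in for the RONN algorithm, and the two-dimensional synthetic data with three well-separated groups are hypothetical.

```python
import math
import random
from statistics import fmean


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]


def kmeans(points, k, iters=100):
    """Plain k-means (a stand-in for RONN) with farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the previous center if a cluster happens to be empty.
            updated.append(tuple(fmean(c) for c in zip(*members)) if members else centers[i])
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers


def davies_bouldin(points, labels, centers):
    """Mean over clusters of the worst ratio (s_i + s_j) / d(c_i, c_j)."""
    k = len(centers)
    scatter = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        scatter.append(fmean(math.dist(p, centers[i]) for p in members) if members else 0.0)
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return fmean(worst)


# Hypothetical data: three well-separated groups of 30 points each.
rng = random.Random(0)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blobs for _ in range(30)]

scores = {k: davies_bouldin(points, *kmeans(points, k)) for k in range(2, 7)}
best_k = min(scores, key=scores.get)
print(best_k, {k: round(v, 3) for k, v in scores.items()})
```

A lower index indicates compact clusters that lie far apart, so the minimum of the curve marks the preferred number of clusters. Library routines such as scikit-learn's `davies_bouldin_score` compute the same criterion.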

To locate the optimal clusters and their centers on the pre-COVID-19 data, we apply the RONN algorithm. For visual illustration only, we reduce the five-dimensional space to a two-dimensional projection and display it in Fig. 3. The cluster boundaries are clearly defined and the clusters are reasonably spaced, further attesting to the validity of our approach. It is worthwhile to stress that the first two principal components cumulatively preserve about 95% of the original data variance in the reduced space. When de-normalized to their original coordinates, the cluster centers for inputs (\(S_t/K, \sigma _{IV}, \tau , VOL, OPI\)) are located in the original input space as follows:

  • Cluster 1 \(\rightarrow \) (1.22, 0.16, 1.71, 93.78, 2018.09);

  • Cluster 2 \(\rightarrow \) (0.90, 0.15, 2.87, 40.08, 131.75);

  • Cluster 3 \(\rightarrow \) (1.17, 0.10, 0.10, 2333.03, 50609.87);

  • Cluster 4 \(\rightarrow \) (1.00, 0.10, 0.16, 322.90, 2384.70);

  • Cluster 5 \(\rightarrow \) (1.03, 0.14, 0.83, 197.83, 3948.50);

  • Cluster 6 \(\rightarrow \) (1.03, 0.10, 0.16, 1419.17, 18355.29).
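
The de-normalization of cluster centers mentioned before the list can be illustrated with a short sketch. It assumes z-score normalization (subtract the column mean, divide by the column standard deviation); the normalization actually used in the paper is not restated here, so this choice, the four rows of data and the cluster labels are all hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Column means and population standard deviations of the raw inputs."""
    cols = list(zip(*rows))
    return [fmean(c) for c in cols], [pstdev(c) for c in cols]


def normalize(row, means, stds):
    """Map a raw observation to z-scores."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def denormalize(row, means, stds):
    """Map a point in z-score space back to the original units."""
    return [v * s + m for v, m, s in zip(row, means, stds)]


# Hypothetical observations: (S_t/K, sigma_IV, tau, VOL, OPI).
rows = [
    [1.02, 0.12, 0.10, 300.0, 2500.0],
    [0.98, 0.10, 0.20, 350.0, 2300.0],
    [1.20, 0.16, 1.50, 90.0, 2000.0],
    [1.25, 0.17, 1.90, 100.0, 2100.0],
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignment

means, stds = fit_scaler(rows)
scaled = [normalize(r, means, stds) for r in rows]

# Cluster centers are computed in the normalized space ...
centers_scaled = [
    [fmean(col) for col in zip(*[s for s, lab in zip(scaled, labels) if lab == k])]
    for k in sorted(set(labels))
]
# ... and reported in the original units after de-normalization.
centers_raw = [denormalize(c, means, stds) for c in centers_scaled]
print([[round(v, 2) for v in c] for c in centers_raw])
```

Because the transformation is linear, a de-normalized center coincides with the average of the raw observations in that cluster, which is why the reported coordinates can be read directly as typical moneyness, volatility, maturity, volume and open interest.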

From the coordinates of the cluster centers, we identify cluster 1 as deep-ITM, long-term options and cluster 2 as OTM, very long-term options. Cluster 3 contains deep-ITM, short-term options with high volume and open interest. Cluster 4 contains at-the-money, short-term options, and cluster 5 contains NTM, long-term options. Finally, cluster 6 contains NTM, short-term options with high volume and open interest. It is of interest to test the pricing accuracy of the RF model on each cluster and assess how clustering changes the importance of inputs. The pricing exercise and feature importance estimation will be run in-sample, followed by a comparative study across clusters.

Fig. 3

Pre-COVID-19: clusters and their centers. Notes: The clusters are found by using the RONN algorithm for the pre-COVID-19 data (January 1, 2020 to February 19, 2020: 12989 observations). We reduce the five-dimensional space to a two-dimensional projection where PC#1 and PC#2 denote the first two principal components. Numbers 1-6 indicate the location of cluster centers in the reduced space

Table 6 lists the Gini indices as well as the pricing errors (RMSEs) for the six pre-COVID-19 clusters of data. In terms of pricing accuracy, we find that it was easier to price NTM, short-term options with relatively lower volatility and higher volume (as in clusters 4 and 6). RMSEs appear to increase with maturity and moneyness. The largest pricing errors were produced in clusters 5 and 1, i.e., for long-term options with lower volume.
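
The per-cluster pricing error reported in this type of table can be computed as in the following sketch, which groups squared pricing errors by cluster label and takes the root of their mean. The prices, model predictions and labels are hypothetical numbers chosen for the example; they are not taken from Table 6.

```python
from collections import defaultdict
from math import sqrt


def rmse_by_cluster(actual, predicted, labels):
    """Root mean squared pricing error computed separately for each cluster label."""
    squared = defaultdict(list)
    for a, p, lab in zip(actual, predicted, labels):
        squared[lab].append((a - p) ** 2)
    return {lab: sqrt(sum(v) / len(v)) for lab, v in sorted(squared.items())}


# Hypothetical market prices, model prices and cluster assignments.
actual = [10.0, 12.0, 50.0, 60.0]
predicted = [11.0, 11.0, 53.0, 56.0]
labels = [4, 4, 1, 1]

errors = rmse_by_cluster(actual, predicted, labels)
print(errors)
```

Since RMSE is expressed in the units of the option price, clusters of expensive (long-term or ITM) contracts can show larger errors even when the relative mispricing is similar, which is worth keeping in mind when comparing clusters.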

Table 6 Pre-COVID-19: feature importance across clusters

As for the relative importance of inputs, the Gini indices in Table 6 reveal that moneyness was always among the two top-ranked features. In clusters 1 and 2, however, implied volatility emerged as the most important input. This is in line with the fact that clusters 1 and 2 contained options with the highest volatility of all clusters. Therefore, when pricing options in a cluster with large implied volatility, the relative importance of the volatility input increases. We also infer that open interest becomes a dominant input in clusters 3 and 6, i.e., for the options with high open interest. It is remarkable that some options in the market were already sensitive to demand-based shocks in the pre-COVID-19 period. This surprising evidence was not observable in Table 3 because it examined all option contracts together in a single cluster. Lastly, neither the volume nor the maturity input seems informative for option pricing.

Fig. 4

COVID-19 crash: clusters and their centers. Notes: The clusters are found by using the RONN algorithm for the COVID-19 data (February 20, 2020 to March 23, 2020: 19547 observations). We reduce the five-dimensional space to a two-dimensional projection where PC#1 and PC#2 denote the first two principal components. Numbers 1-5 indicate the location of cluster centers in the reduced space

The optimal clusters and their centers for the COVID-19 period (reduced to a two-dimensional projection) are shown in Fig. 4. In this case, the first two principal components preserve in total about 90% of the original data variance in the reduced space. The cluster centers for inputs (\(S_t/K, \sigma _{IV}, \tau , VOL, OPI\)) in the five-dimensional original space are at the following coordinates:

  • Cluster 1 \(\rightarrow \) (1.02, 0.21, 1.57, 99.38, 2283.53);

  • Cluster 2 \(\rightarrow \) (0.93, 0.24, 0.72, 242.90, 3710.52);

  • Cluster 3 \(\rightarrow \) (0.97, 0.43, 0.10, 3156.97, 35520.08);

  • Cluster 4 \(\rightarrow \) (0.90, 0.21, 2.76, 25.21, 199.81);

  • Cluster 5 \(\rightarrow \) (0.92, 0.36, 0.15, 388.79, 2900.18).

Thus, cluster 1 represents ITM, long-term options, while cluster 2 contains OTM, long-term options with higher open interest. Cluster 3 contains OTM, short-term options with high volatility and volume, and very high open interest. Cluster 4 contains deep-OTM, long-term options with low volume and open interest. Cluster 5 contains deep-OTM, short-term options with high volatility and open interest. The Gini indices and RMSEs for the clusters can be found in Table 7. During the COVID-19 period, the highest pricing accuracy was reached in clusters 3 and 5, which are OTM, short-term options with high volatility and open interest. In general, longer maturity is detrimental to pricing accuracy, while the other inputs do not have a clear systematic effect on the model’s pricing ability across the clusters.

Table 7 COVID-19: feature importance across clusters

Table 7 also confirms that moneyness is the most useful input in all clusters, followed by the implied volatility and open interest. Similar to the pre-COVID-19 period, the open interest input gained importance in the clusters where high open interest options are concentrated. In light of our findings for the pre-COVID-19 period, the fact that option demand also drove option prices during the COVID-19 crash was not unexpected. Further, volume was relatively more important only in the third cluster that contains very high volume options. Again, the maturity input was of minor relevance for option pricing.

Fig. 5

Post-COVID-19 recovery: clusters and their centers. Notes: The clusters are found by using the RONN algorithm for the post-COVID-19 data (March 24, 2020 to June 15, 2020: 38909 observations). We reduce the five-dimensional space to a two-dimensional projection where PC#1 and PC#2 denote the first two principal components. Numbers 1-5 indicate the location of cluster centers in the reduced space

Finally, we plot the optimal clusters and their centers for the post-COVID-19 time period in Fig. 5. Here, the first two principal components cumulatively preserve about 94% of the data variance in the new feature space. The cluster centers for inputs (\(S_t/K, \sigma _{IV}, \tau , VOL, OPI\)) are distributed as follows:

  • Cluster 1 \(\rightarrow \) (1.00, 0.27, 0.14, 145.07, 1300.40);

  • Cluster 2 \(\rightarrow \) (0.99, 0.22, 2.62, 38.94, 469.04);

  • Cluster 3 \(\rightarrow \) (0.96, 0.25, 0.27, 783.05, 13851.57);

  • Cluster 4 \(\rightarrow \) (0.92, 0.22, 1.38, 108.25, 2287.64);

  • Cluster 5 \(\rightarrow \) (1.04, 0.27, 0.63, 113.28, 1901.41).

The cluster centers 1-5 above identify the following option types: (1) at-the-money, short-term options with high volatility and open interest, (2) at-the-money, very long-term options with low volume and open interest, (3) OTM, short-term options with high volume and open interest, (4) deep-OTM, long-term options with high open interest, and (5) ITM, medium-term options with high open interest. With regard to the pricing performance in the individual clusters, presented in Table 8, the lowest (highest) error was generated in cluster 3 (4). In accord with some of our previous evidence, pricing errors increase with both moneyness and maturity. Nonetheless, the effect of the remaining inputs on the RMSEs was not significant enough for a stylized economic interpretation.

Table 8 Post-COVID-19: feature importance across clusters

In contrast to the COVID-19 period, implied volatility was the most dominant input in three of the five option clusters. Although moneyness was always among the top two inputs in terms of the Gini index, it was the strongest only in clusters 1 and 3. Hence, clustering provides new findings that were not apparent from Table 3, which analyzed all post-COVID-19 options together. In particular, we demonstrate that volatility was a more important input than moneyness during the market recovery. In all likelihood, among other risk factors, volatility risk was the most significant and it commanded greater option prices in the post-COVID-19 period. Furthermore, demand-induced shocks were not as pronounced at that time as they had been in the first two periods. Open interest was among the two highest Gini indices only in cluster 1, but it is noteworthy that it was consistently the third most important input in the other clusters. As observed before, volume and time-to-maturity were the least useful for pricing in all clusters.

6 Conclusions

This paper tackles the problem of the lack of parametric transparency and economic interpretability in non-parametric option pricing models by utilizing data clustering and XAI. We also recognize that in real-world markets options are hedged imperfectly. Consequently, we depart from the Black-Scholes framework (where prices are determined by no-arbitrage) and expand the model with additional inputs—open interest and volume—that reflect market demand and sentiment. By concentrating on the most recent S&P 500 index options data that include the COVID-19 crash and post-COVID-19 recovery, we provide unique insights into the option market microstructure and the behavior of traders exposed to multiple regime shifts.

First, we perform a comparative option pricing exercise across various model specifications and time periods. The evidence shows that non-parametric techniques easily dominate the Black-Scholes model in a stable, pre-COVID-19 market. On the other hand, in the volatile, pandemic market, the Black-Scholes model delivers pricing accuracy that is comparable to non-parametric models. This is consistent with Garcia and Gençay (2000) and Kukolj et al. (2012), who reported relative success of the Black-Scholes model during regime shifts in 1987 and 2008, respectively. Overall, in all three sub-periods (pre-COVID-19, COVID-19, post-COVID-19), the XGB and RF models produce the most accurate option prices out-of-sample.

We also find that non-parametric option pricing (and the resulting economic interpretation) depends on the nature of non-parametric models and the XAI technique that is applied. More specifically, when an option pricing model’s input is highly important (e.g., moneyness), its contribution and interpretation are invariant to model specification and time period. In contrast, when an input is of minor non-linear importance (e.g., volume), its significance may vary. Although such findings could be perceived as a limitation of our study, it is worthwhile to note that, when we combine the evidence from multiple methods, we are able to obtain a deeper grasp of the forces behind the data generating process of option prices. Our findings point to the dominant role played by the moneyness input in option pricing. The other inputs are significantly less important, but the influence of time-to-maturity and implied volatility is frequently present across models and market regimes. An intriguing piece of evidence we report is that certain options were already sensitive to demand-based shocks, reflected in the open interest input, in the pre-COVID-19 period. Moreover, during the COVID-19 market crash, we observe an increased importance of open interest, which suggests an emergence of liquidity and demand shocks during market distress. More specifically, in all time periods, the importance of open interest was greater in option clusters with high implied volatility and open interest (and to a certain extent moneyness). This in general corroborates the evidence of a positive relationship between net demand for options and excess implied volatility found in Gârleanu et al. (2009).

The models and methods proposed in this paper help improve our understanding of the dynamics of option prices and their underlying risk factors. Our approach could be generalized to any financial derivative, provided sufficient market information is available. Within the context of stock options, future research efforts might concentrate on explaining idiosyncratic and systematic risk factors inherent in option contracts across various industries (as in, e.g., Ramelli and Wagner 2020). Another potentially interesting research avenue is to employ high-frequency data where premia for stochastic volatility or jump risk could also contribute to option price movements (see Footnote 8). It is of interest to policy-makers to measure how much variability in option (or, more generally, asset) prices can be attributed to non-traditional factors such as open interest and other potential gauges of market sentiment, especially in times of crisis. Also, ignoring market microstructure effects (i.e., trader behavior) inferred from our clustering method may lead policy-makers to wrong conclusions about the effectiveness of a particular policy aimed at calming volatile markets.

In all, we demonstrate clear benefits of clustering and XAI to our understanding of the option market mechanisms, especially during periods of extreme uncertainty. Our empirical explorations complement the extant literature on the econometrics of option pricing. Specifically, we shed more light on the potential latent variables and complex non-linearities of the pricing model (Garcia et al. 2010).