Statistical arbitrage with graph clustering algorithms
Applying network theory to statistical arbitrage strategies.
This article is based on the paper ‘Correlation Matrix Clustering for Statistical Arbitrage Portfolios’ (Cartea et al. 2023). The paper describes a framework for constructing statistical arbitrage portfolios using graph clustering algorithms. In this article I will implement and backtest several trading strategies based on the approach described in the paper.
The general algorithm of the strategy is as follows.
Calculate residual returns.
Compute correlation matrix of residuals and interpret it as a weighted signed network.
Use graph clustering algorithms to partition stocks into groups such that, on average, the correlation between stocks in different groups is low, while the correlation within the same group is high.
Employ rolling windows to identify stocks whose returns are above or below the cluster mean.
Construct a contrarian portfolio by taking long positions in ‘losers’ and short positions in ‘winners’.
I will start by implementing different components of the algorithm individually, and then bring them together to create a complete trading strategy.
We begin by preparing the data. I’ll use the same dataset as in my recent articles on portfolio optimization, which includes around 500 assets — constituents of the S&P 500 index. We’ll also need the risk-free rate and market return data to calculate residuals, for which I’ll use data from the Kenneth R. French Data Library. A description of the specific dataset I’m using can be found here.
import numpy as np
import pandas as pd

# price data
prices = pd.read_csv('../sp500.csv', index_col=0)
prices.index = pd.to_datetime(prices.index)
returns = prices.pct_change().iloc[1:]

# factor data
ff_data = pd.read_csv('../F-F_Research_Data_Factors_daily.CSV', skiprows=3, index_col=0).iloc[:-1]
ff_data.index = pd.to_datetime(ff_data.index)
ff_data = ff_data.loc[returns.index]
ff_data /= 100  # percent to decimal
The first step in the algorithm is calculating residuals. For each asset i, we regress its excess return on the market excess return, r_i − r_f = beta_i * (r_Mkt − r_f) + epsilon_i, estimate beta_i by ordinary least squares, and take the regression residuals epsilon_i as the residual returns. Code for this is provided below.
from sklearn.linear_model import LinearRegression
from tqdm import tqdm

# estimate betas
returns_rf = returns - ff_data[['RF']].values  # excess returns over the risk-free rate
res = LinearRegression().fit(ff_data[['Mkt-RF']].values, returns_rf.values)
betas = pd.Series(res.coef_.flatten(), index=returns.columns)

# calculate residuals
resid = pd.DataFrame(index=returns.index, columns=returns.columns)
for s in tqdm(returns.columns):
    resid[s] = returns_rf[s] - betas.loc[s] * ff_data['Mkt-RF']
To use the clustering algorithms described in the paper, we need to specify the number of clusters. The paper proposes two approaches to estimating this:
RMT (Random Matrix Theory) approach,
Explained variance approach.
Note: In this section I assume that the reader is familiar with the concepts of eigenvalues and eigenvectors in linear algebra; if not, here’s a good introductory video.
The RMT approach is based on a theorem from Random Matrix Theory, which I discussed in my previous article on portfolio optimization with RMT. Here, the number of clusters equals the number of eigenvalues of the correlation matrix that exceed the threshold lambda+ = (1 + sqrt(1/Q))^2 = 1 + 1/Q + 2*sqrt(1/Q), where Q = T/N is the ratio of the number of observations to the number of assets.
First, we calculate the sample correlation matrix of the residuals and its eigenvalues, then count the number of eigenvalues that exceed the threshold lambda+. The entire procedure is shown below.
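Here is a minimal sketch of that calculation, assuming resid is the residual returns dataframe computed above (columns with missing data are dropped first):
resid_clean = resid.dropna(axis=1)               # drop assets with missing data
corr = resid_clean.corr()                        # sample correlation matrix of residuals
evalues = np.linalg.eigh(corr)[0]                # eigenvalues (ascending order)
Q = resid_clean.shape[0] / resid_clean.shape[1]  # number of observations per asset
lambda_plus = 1 + 1/Q + 2*np.sqrt(1/Q)           # upper RMT bound
n_clusters_rmt = (evalues > lambda_plus).sum()   # eigenvalues above the bound = number of clusters
print(n_clusters_rmt)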
In the explained variance approach, we simply set the number of clusters equal to the number of eigenvectors (corresponding to the largest eigenvalues) required to explain a certain percentage of the variance in the data. Below, I calculate the number of eigenvectors required to explain 90% of the variability. The number obtained here is significantly higher than with the RMT approach.
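A sketch of this calculation, reusing the eigenvalues from the previous snippet:
evalues_desc = np.sort(evalues)[::-1]                      # eigenvalues in descending order
explained = np.cumsum(evalues_desc) / evalues_desc.sum()   # cumulative fraction of variance explained
n_clusters_var = np.argmax(explained >= 0.90) + 1          # eigenvectors needed to explain 90%
print(n_clusters_var)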
Another approach tested in the paper involved manually setting the number of clusters to a fixed value of 30. In this article, I will focus solely on testing the RMT approach.
Now, let’s move on to the final missing piece: the clustering algorithm. Fortunately, all the clustering methods are already implemented in the SigNet library, allowing us to use each of them with just one line of code.
The paper discusses three clustering techniques:
Spectral Clustering.
Signed Laplacian Clustering.
SPONGE (Signed Positive Over Negative Generalised Eigenproblem) clustering.
The pros and cons of these techniques are described in the paper, so I won’t repeat them here. Instead, I’ll simply demonstrate how to use each of them.
The first step for each clustering algorithm is creating a Cluster object, in which the correlation matrix is interpreted as an adjacency matrix. The object is initialized with separate positive and negative adjacency matrices, both of which must be in Compressed Sparse Column (CSC) format. To convert numpy arrays to CSC, we use the csc_matrix function from the scipy library.
In spectral clustering, we work with the absolute values of the correlation matrix, so the negative adjacency matrix contains only zeros. Defining the Cluster object and performing the clustering are demonstrated below.
from signet.cluster import Cluster
from scipy.sparse import csc_matrix

c = Cluster((csc_matrix(abs(corr)), csc_matrix(np.zeros(corr.shape))))
c.spectral_cluster_adjacency(k=25)
The final result of the clustering is an array of cluster assignments, with one integer label per stock.
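To inspect the assignments, one simple option is to capture the output of the call above in a variable (labels here is just an illustrative name):
labels = c.spectral_cluster_adjacency(k=25)        # one cluster label per stock
assignments = pd.Series(labels, index=corr.index)  # map each ticker to its cluster label
print(assignments.value_counts())                  # number of stocks in each cluster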
Now, let’s move on to signed Laplacian clustering. Here, both positive and negative elements are allowed in the adjacency matrix. We split the positive and negative elements of the correlation matrix into separate dataframes, convert them to CSC format, and create a new cluster object. The clustering is performed in the same way, but using a different function.
# split adjacency matrix into positive and negative parts
corr_positive = pd.DataFrame(index=corr.index, columns=corr.columns)
corr_negative = pd.DataFrame(index=corr.index, columns=corr.columns)
corr_positive[corr > 0] = corr[corr > 0]
corr_positive = corr_positive.fillna(0)
corr_negative[corr < 0] = -corr[corr < 0]
corr_negative = corr_negative.fillna(0)

c = Cluster((csc_matrix(corr_positive), csc_matrix(corr_negative)))
c.spectral_cluster_laplacian(k=25)
SPONGE clustering uses the same input as Laplacian clustering; it is simply performed with a different function.
c = Cluster((csc_matrix(corr_positive), csc_matrix(corr_negative)))
c.SPONGE_sym(k=25)
That’s basically it. We’ve implemented all the key building blocks required to construct a trading strategy.
Now, I will implement and backtest a baseline strategy for comparison purposes. This will be a simple contrarian strategy where, each day, I buy the ‘losers’ (stocks with returns lower than the market) and sell the ‘winners’ (stocks with returns higher than the market). It is similar to the strategies we are going to test, except it only has one cluster.
The code for backtesting is provided below.
start_date = '2019-01-01'
end_date = '2022-12-31'

# subtract risk-free rate from returns
returns_rf = returns - ff_data[['RF']].values

# calculate positions
positions = pd.DataFrame(index=returns_rf.loc[start_date:end_date].index, columns=returns_rf.columns)
positions[returns_rf < ff_data[['Mkt-RF']].values] = 1
positions[returns_rf > ff_data[['Mkt-RF']].values] = -1

# make absolute values of long and short positions sum to 1
positions[positions > 0] *= 1 / abs(positions[positions > 0].sum(axis=1)).values.reshape(-1, 1)
positions[positions < 0] *= 1 / abs(positions[positions < 0].sum(axis=1)).values.reshape(-1, 1)

# calculate strategy returns
ret = (positions.shift() * returns.loc[positions.index]).sum(axis=1)
cumret_baseline = (1 + ret).cumprod()
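As a side note, here is one minimal way to compute annualized Sharpe and Sortino ratios from the daily strategy returns ret (a sketch assuming 252 trading days per year and using the standard deviation of negative returns as the downside deviation):
ann = 252                                        # trading days per year
sharpe = ret.mean() / ret.std() * np.sqrt(ann)   # annualized Sharpe ratio
downside = ret[ret < 0].std()                    # downside deviation (simple convention)
sortino = ret.mean() / downside * np.sqrt(ann)   # annualized Sortino ratio
total_return = cumret_baseline.iloc[-1] - 1      # total return over the backtest period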
Performance metrics and cumulative returns of the baseline strategy, along with the market, are shown above. Now we are ready to move on to the strategies based on clustering algorithms.
The strategy described in the paper contains many different parameters:
Number of days to estimate the number of clusters: 20
Length of rolling window to estimate beta and compute residuals: 60
Number of lookback days to construct the correlation matrix and compute cluster mean returns: 5
Rebalance period to recompute the correlation matrix, recompute the clusters and rebalance the portfolio: 3
Threshold to identify whether the stock is a winner or a loser: 0
Threshold to consider the portfolio mean reverted (stop-win): 5%
Percentage of winners and losers to select for trading: 100%
Some of the parameters above don’t make sense to me. For example, why use one number of days to estimate the number of clusters and another to estimate beta and compute residuals? Other parameters (like the rebalancing period) are useful in a production environment, but to test whether the trading idea is feasible, we can do without them. Moreover, having a large number of parameters increases the risk of overfitting. Therefore, I’m going to simplify the strategy proposed in the paper and use the following algorithm:
Each day, use a 60-day lookback period to estimate betas, calculate residuals, and compute their sample correlation matrix.
Calculate the eigenvalues of the correlation matrix to estimate the number of clusters and perform clustering.
Calculate the mean returns within each cluster using the same 60-day period and open short positions in winners (assets with returns above the cluster mean) and long positions in losers (assets with returns below the cluster mean).
Ensure that the resulting portfolio is dollar-neutral — the amount of capital allocated to long positions must equal the amount allocated to short positions.
Repeat the process each trading day.
Now the number of parameters has been significantly reduced:
Length of rolling window for historical data: 60
Threshold to identify whether a stock is a winner or a loser: 0
Percentage of winners and losers to select for trading: 100%
Let’s try to backtest it.
Here, I am using a backtesting routine that is slightly different from the one I used in my other articles. I define a function that calculates one row of the positions dataframe for a given day. Then I run this function over the backtesting period and combine the results into a single dataframe. Code for backtesting the strategy using spectral clustering is shown below.
import warnings

w = 60  # length of the rolling window

def spectral_clustering_bt(t):
    # prepare data
    returns_rf_tmp = returns_rf.loc[:t].iloc[-w:]
    returns_rf_tmp = returns_rf_tmp.dropna(axis=1)
    ff_data_tmp = ff_data.loc[returns_rf_tmp.index]
    # compute betas
    res = LinearRegression().fit(ff_data_tmp[['Mkt-RF']].values, returns_rf_tmp.values)
    betas = pd.Series(res.coef_.flatten(), index=returns_rf_tmp.columns)
    # compute residuals
    resid = pd.DataFrame(index=returns_rf_tmp.index, columns=returns_rf_tmp.columns)
    for s in resid.columns:
        resid[s] = returns_rf_tmp[s] - betas.loc[s] * ff_data_tmp['Mkt-RF']
    # estimate the number of clusters
    corr = resid.corr()  # correlation matrix
    evalues = np.linalg.eigh(corr)[0]
    Q = returns_rf_tmp.shape[0] / returns_rf_tmp.shape[1]
    lambda_plus = 1 + 1/Q + 2*np.sqrt(1/Q)   # upper RMT bound
    K = len(evalues[evalues > lambda_plus])  # number of clusters
    # perform clustering
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        c = Cluster((csc_matrix(abs(corr)), csc_matrix(np.zeros(corr.shape))))  # use absolute values of the correlation matrix as input
        ca = c.spectral_cluster_adjacency(k=K)
    # construct positions
    positions = pd.DataFrame(index=[t], columns=returns_rf.columns)
    positions = positions.fillna(0)
    for cluster in range(K):
        c_members = returns_rf_tmp.columns[ca == cluster]
        c_mean_returns = returns_rf_tmp[c_members].mean()  # mean return of each stock in the cluster
        c_winners = c_mean_returns[c_mean_returns > c_mean_returns.mean()].index
        c_losers = c_mean_returns[c_mean_returns < c_mean_returns.mean()].index
        positions.loc[t, c_winners] = -1 / len(c_winners)  # short positions in winners
        positions.loc[t, c_losers] = 1 / len(c_losers)     # long positions in losers
    return positions

positions = pd.concat([spectral_clustering_bt(t) for t in returns_rf.loc[start_date:end_date].index])

# make absolute values of long and short positions sum to 1
positions[positions > 0] *= 1 / abs(positions[positions > 0].sum(axis=1)).values.reshape(-1, 1)
positions[positions < 0] *= 1 / abs(positions[positions < 0].sum(axis=1)).values.reshape(-1, 1)

# calculate strategy returns
ret = (positions.shift() * returns.loc[positions.index]).sum(axis=1)
cumret_spectral = (1 + ret).cumprod()
The other strategies are similar; only the clustering part changes. I won’t post their code here, but you can find it in the corresponding Jupyter notebook.
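To give a sense of the change, here is a sketch of the lines that would replace the spectral clustering step inside the backtest function (corr.clip is a compact equivalent of the positive/negative split shown earlier, and K is the estimated number of clusters):
corr_positive = corr.clip(lower=0)      # positive part of the correlation matrix
corr_negative = (-corr).clip(lower=0)   # negative part, stored with positive weights
c = Cluster((csc_matrix(corr_positive), csc_matrix(corr_negative)))
ca = c.spectral_cluster_laplacian(k=K)  # signed Laplacian clustering
# ca = c.SPONGE_sym(k=K)                # or SPONGE clustering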
Let’s take a look at the final results. Performance metrics and cumulative returns of each strategy are shown below. Note that, due to the randomness involved in the clustering algorithms, your results might vary slightly.
In terms of Sharpe ratios, all algorithms are approximately the same. However, if we look at their Sortino ratios, the strategies based on Laplacian and SPONGE clustering algorithms significantly outperform the other strategies and the overall market. Although the total return of these strategies is lower than the market, the strategies are dollar-neutral, allowing us to use leverage to increase returns without changing the underlying risk-adjusted performance (Sharpe and Sortino ratios).
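To illustrate the point about leverage: scaling the daily returns by a constant leverage factor leaves the Sharpe and Sortino ratios unchanged (ignoring financing costs and margin requirements), while the cumulative return grows. A toy sketch:
leverage = 2
ret_levered = leverage * ret                                             # daily returns scale linearly
sharpe_levered = ret_levered.mean() / ret_levered.std() * np.sqrt(252)   # same Sharpe as before
cumret_levered = (1 + ret_levered).cumprod()                             # but higher cumulative return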
Ideas for further research:
Try using other factor models for calculating residuals.
Adjust the lengths of lookback windows and introduce stop-wins and stop-losses, as described in the paper.
Test different approaches to estimating the number of clusters.
Perform some analysis of the clusters computed by each clustering algorithm.
In conclusion, we have explored the use of graph clustering algorithms to construct statistical arbitrage strategies. While all strategies showed similar Sharpe ratios, clustering-based approaches stood out in terms of Sortino ratios, indicating better downside risk management.
This study demonstrates the potential of clustering-based approaches in statistical arbitrage and emphasizes the importance of further research to refine and optimize these methods for practical trading scenarios.
The Jupyter notebook with the source code is available here.
If you have any questions, suggestions or corrections please post them in the comments. Thanks for reading.
References
[1] Correlation Matrix Clustering for Statistical Arbitrage Portfolios (Cartea et al. 2023)
[2] SigNet - a package for clustering of signed networks
[3] Portfolio Optimization. Random Matrix Approach to Estimating Covariance Matrix