Simple Strategies. Rotation strategy in SPY and TLT
Implementing and analyzing a simple swing trading strategy.
This post is based on the article 10 Best Swing Trading Strategies 2024, which describes a few simple swing trading ideas. Simplicity often leads to more robust and easier-to-understand models, which can be just as effective as complex ones. In this series of articles, I will backtest and analyze a variety of such straightforward strategies, starting with the rotation strategy in SPY and TLT (strategy #5 from the article).
The strategy we are testing today is really simple. Here’s a quote from the article describing its logic:
Every month rank SPY and TLT from the month’s close to the previous month’s close.
On the next day’s open, buy the ETF with the best performance of the last month. If you are already long the one with the best performance, keep holding.
Rinse and repeat.
Basically, each month we invest all our capital in whichever instrument performed better during the previous month. Recall that SPY tracks the 500 large-cap US stocks of the S&P 500, while TLT holds US Treasury bonds with remaining maturities longer than 20 years. I believe the main idea behind this strategy is that stocks and bonds tend to be negatively correlated: when stocks perform poorly, bonds often perform well, and vice versa.
Let’s see if it works.
First we need to download and prepare the data. I am going to download historical prices of these two instruments from 2003-01-01 to 2023-12-31. SPY has a longer price history, but TLT was launched in 2002, limiting our analysis to approximately 21 years of data.
import yfinance as yf

start = '2003-01-01'
end = '2023-12-31'
s1 = 'SPY'
s2 = 'TLT'

s1_prices = yf.download(s1, start=start, end=end)
s2_prices = yf.download(s2, start=start, end=end)
First, we download daily OHLC prices for each instrument. The first rows of the resulting dataframe are shown below.
Now we need to resample those dataframes to monthly resolution. In order to calculate each month’s return, we need an opening price and a closing price for each month. For closing prices we will use adjusted close prices (as usual), which account for corporate actions such as dividends, stock splits, etc. Similarly, we need adjusted open prices, but they are not included in our dataset. Therefore I’m going to use a simple trick (described here) to estimate adjusted open prices from adjusted close prices (the first two lines of the code below).
s1_prices['Adj Open'] = s1_prices['Open'] * s1_prices['Adj Close'] / s1_prices['Close']
s2_prices['Adj Open'] = s2_prices['Open'] * s2_prices['Adj Close'] / s2_prices['Close']

from collections import OrderedDict

s1_prices_m = s1_prices.resample('1M').agg(
    OrderedDict([
        ('Adj Open', 'first'),
        ('Adj Close', 'last')
    ])
)
s2_prices_m = s2_prices.resample('1M').agg(
    OrderedDict([
        ('Adj Open', 'first'),
        ('Adj Close', 'last')
    ])
)
After adjusted open prices are calculated, we resample the data to monthly resolution and obtain the following results.
Then we calculate the monthly returns of each instrument and combine them into one dataframe.
import pandas as pd

returns_m = pd.DataFrame(columns=[s1, s2])
returns_m[s1] = s1_prices_m['Adj Close'] / s1_prices_m['Adj Open'] - 1
returns_m[s2] = s2_prices_m['Adj Close'] / s2_prices_m['Adj Open'] - 1
The data is prepared and we are ready to implement a backtest.
According to the rules, at the end of each month we need to choose the best performing instrument and invest all our capital in it. As you can see below, this can be done with a few lines of code. Note that I also calculate the cumulative returns of each instrument separately for comparison purposes.
# prepare positions dataframe: 1 = hold the ETF, 0 = no position
positions = pd.DataFrame(index=returns_m.index, columns=returns_m.columns)
positions[s1] = (returns_m[s1] > returns_m[s2]).astype(int)
positions[s2] = (returns_m[s2] > returns_m[s1]).astype(int)

# strategy returns: positions are chosen at the end of each month and held through the next one
algo_ret = (positions.shift() * returns_m).sum(axis=1)

# cumulative returns
algo_cumret = (1 + algo_ret).cumprod()
s1_cumret = (1 + returns_m[s1]).cumprod()
s2_cumret = (1 + returns_m[s2]).cumprod()

# make sure that each series starts from 1
algo_cumret /= algo_cumret.iloc[0]
s1_cumret /= s1_cumret.iloc[0]
s2_cumret /= s2_cumret.iloc[0]
The plot of cumulative returns is shown above. Note that our simple strategy slightly outperforms both SPY and TLT until around 2021. Let’s take a look at the performance metrics.
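These metrics are computed with a helper function calculate_metrics, which is not shown in the original post but is reused in the random simulations later. A minimal sketch of such a helper, assuming it takes a series of monthly cumulative returns and that MaxDDD means maximum drawdown duration in months (both assumptions on my part, the notebook's implementation may differ), could look like this.

import numpy as np
import pandas as pd

def calculate_metrics(cumret: pd.Series) -> pd.Series:
    """Performance metrics from a series of monthly cumulative returns (starting at 1)."""
    ret = cumret.pct_change().dropna()                      # monthly simple returns
    total_return = cumret.iloc[-1] / cumret.iloc[0] - 1
    apr = (1 + total_return) ** (12 / len(ret)) - 1         # annualised return
    sharpe = np.sqrt(12) * ret.mean() / ret.std()           # annualised Sharpe, rf = 0
    drawdown = cumret / cumret.cummax() - 1
    max_dd = drawdown.min()                                 # maximum drawdown (negative number)
    underwater = (drawdown < 0).astype(int)
    # longest streak of consecutive months spent below the previous peak
    max_ddd = underwater.groupby((underwater == 0).cumsum()).cumsum().max()
    return pd.Series({'Total return': total_return, 'APR': apr, 'Sharpe': sharpe,
                      'MaxDD': max_dd, 'MaxDDD': max_ddd})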
Our algorithm outperforms both individual assets in every calculated metric. Recall that this includes the last few years, where the algorithm didn’t perform very well. Let’s exclude this last part and check how well it performed up until that drop.
The peak occurred in July 2021, so we will exclude all the data after that point.
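One way this truncation could be done is sketched below; the exact cutoff date and the calculate_metrics helper sketched earlier are assumptions on my part.

# keep only data up to the equity-curve peak in July 2021 (cutoff date assumed)
cutoff = '2021-07-31'
algo_cumret_pre = algo_cumret.loc[:cutoff]
s1_cumret_pre = s1_cumret.loc[:cutoff]
s2_cumret_pre = s2_cumret.loc[:cutoff]

# recompute the performance metrics on the truncated series
results_pre = pd.DataFrame(
    [calculate_metrics(c) for c in (algo_cumret_pre, s1_cumret_pre, s2_cumret_pre)],
    index=['Algo', s1, s2]
)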
As expected, the algorithm’s performance metrics are even better.
Now let’s try to figure out what makes this simple strategy so profitable. The first thing that comes to mind is our trading signal. Recall that we simply buy the asset with the best performance over the last month, so our signal is the sign of the difference between last month’s returns of SPY and TLT. If it is greater than zero, SPY outperformed TLT during the last month and we expect that to happen again in the following month; if it is lower than zero, TLT outperformed SPY and we expect that to repeat. So, basically, our trading signal can be defined as

signal(t) = sign(r_SPY(t) - r_TLT(t)),

where r(t) is an instrument's return in month t: a value of +1 means we hold SPY during month t+1, and -1 means we hold TLT.
Let’s calculate and plot it.
import numpy as np
import matplotlib.pyplot as plt

signal = np.sign(returns_m[s1] - returns_m[s2])

plt.figure(figsize=(18, 4))
plt.plot(signal)
plt.axhline(0, linestyle='dashed', color='r')
Since we always invest in the asset with the highest return over the previous month, we are implicitly assuming that the same asset will outperform again in the following month. If that were true, the signal should tend to stay in the same state for several months at a time. It’s hard to tell from the plot, but the signal seems to flip between its two states quite often.
If the signal really tends to stay the same, it should show some autocorrelation. Let’s plot the ACF and PACF to see if that’s the case.
import statsmodels.api as sm

fig = plt.figure(figsize=(18, 8))
ax1 = fig.add_subplot(211)
sm.graphics.tsa.plot_acf(signal, ax=ax1)
ax2 = fig.add_subplot(212)
sm.graphics.tsa.plot_pacf(signal, ax=ax2)
We can’t see any statistically significant autocorrelation in the plots above. Let's calculate how many times the signal changes from one month to the next, as well as the number and length of the periods where it remains unchanged.
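The counting code is not shown in the original post; a minimal sketch, assuming the signal series defined above, could look like this.

# month-over-month comparison of the signal with its previous value
same_as_prev = (signal == signal.shift()).iloc[1:]
print(f'changes: {(~same_as_prev).sum()}, stays the same: {same_as_prev.sum()}, '
      f'transitions: {len(same_as_prev)}')

# lengths of the consecutive runs during which the signal keeps the same value
run_id = (signal != signal.shift()).cumsum()
run_lengths = signal.groupby(run_id).size()
print(run_lengths.describe())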
Out of 251 month-to-month transitions, the signal changes 128 times and stays the same 123 times. Since our implicit prediction is that the signal stays the same, it is correct in only 123 of 251 months, a hit rate of roughly 49%.
We can try to visually assess the performance of our signal using a confusion matrix.
import seaborn as sns
from sklearn.metrics import confusion_matrix

# prepare data: the "prediction" for month t is the signal observed at month t-1
y_true = signal.iloc[1:].values
y_pred = signal.shift().iloc[1:].values

# https://www.kaggle.com/code/agungor2/various-confusion-matrix-plots
data = confusion_matrix(y_true, y_pred)
df_cm = pd.DataFrame(data, columns=np.unique(y_true), index=np.unique(y_true))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'

plt.figure(figsize=(4, 3))
sns.heatmap(df_cm, cmap="Blues", annot=True)
From the plot above, we can see that we have:
49 true negatives,
74 true positives,
64 false negatives,
64 false positives.
From these numbers, we can calculate a few performance metrics, as shown below.
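The metric values themselves appeared as a screenshot in the original post. As a sketch, they can be computed directly from the y_true and y_pred arrays defined above with scikit-learn, treating +1 ("SPY outperforms") as the positive class.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# +1 ("SPY outperforms") is treated as the positive class
print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred, pos_label=1))
print('recall   :', recall_score(y_true, y_pred, pos_label=1))
print('f1       :', f1_score(y_true, y_pred, pos_label=1))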
For comparison, take a look at the histograms of precision and recall scores of randomly generated signals below.
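The code producing these histograms is not shown in the post either; a minimal sketch, assuming 1000 random ±1 signals of the same length scored against the same y_true, might look like this.

from sklearn.metrics import precision_score, recall_score

random_precisions, random_recalls = [], []
for i in range(1000):
    np.random.seed(i)
    y_rand = np.random.choice([-1, 1], size=len(y_true))   # random +/-1 "predictions"
    random_precisions.append(precision_score(y_true, y_rand, pos_label=1))
    random_recalls.append(recall_score(y_true, y_rand, pos_label=1))

fig, axs = plt.subplots(1, 2, figsize=(18, 4))
axs[0].hist(random_precisions, bins=20)
axs[0].set_title('Precision of random signals')
axs[1].hist(random_recalls, bins=20)
axs[1].set_title('Recall of random signals')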
It doesn't look like our signal is performing well. Perhaps the algorithm performs better during larger market movements, helping us avoid significant losses in turbulent markets or capture substantial gains when the market trends. To test this hypothesis, we need to compare the return distributions when the signal's predictions are correct versus when they are incorrect. Let’s split the strategy’s returns into these two groups and calculate their summary statistics.
ret_signal_correct = algo_ret.iloc[np.where(y_true == y_pred)]    # months where the signal's prediction is correct
ret_signal_incorrect = algo_ret.iloc[np.where(y_true != y_pred)]  # months where the signal's prediction is incorrect

pd.concat([ret_signal_correct.describe().rename('correct'),
           ret_signal_incorrect.describe().rename('incorrect')], axis=1)
There are some differences in extreme values, but otherwise, both groups look very similar. We can also compare them visually using histograms.
fig, axs = plt.subplots(1, 2, figsize=(18, 4))
axs[0].hist(ret_signal_correct, bins=20)
axs[0].set_title('Signal\'s prediction correct')
axs[0].set_xlabel('return')
axs[0].set_ylabel('frequency')
axs[1].hist(ret_signal_incorrect, bins=20)
axs[1].set_title('Signal\'s prediction incorrect')
axs[1].set_xlabel('return')
axs[1].set_ylabel('frequency')
The two histograms look quite similar, suggesting that both groups of returns may come from the same underlying distribution. To test this hypothesis, we can apply the two-sample Kolmogorov-Smirnov test to the two samples.
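The test is a one-liner with scipy; here is a minimal sketch applied to the two return groups defined above.

from scipy.stats import ks_2samp

# two-sample KS test on the two groups of monthly strategy returns
stat, p_value = ks_2samp(ret_signal_correct, ret_signal_incorrect)
print(f'KS statistic: {stat:.3f}, p-value: {p_value:.3f}')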
Recall that the null hypothesis of the Kolmogorov-Smirnov test is that the two samples come from the same distribution. The test gives a p-value of 0.76, so we cannot reject the null hypothesis: there is no significant evidence that the two samples come from different distributions.
So far, we haven’t been able to find any evidence that our signal has any predictive power. Could it be the case that the strategy’s performance is just random? What results will we get if we use random sequences of 1 and -1 as a trading signal? We can easily test that using random simulations. In the code below, I perform 1000 such simulations, saving the performance metrics from each simulation into a dataframe.
from tqdm import tqdm

random_signal_metrics = pd.DataFrame(columns=['Total return', 'APR', 'Sharpe', 'MaxDD', 'MaxDDD'])

for i in tqdm(range(1000)):
    np.random.seed(i)
    # random sequence of +1/-1 used as a trading signal
    signal = np.random.choice([-1, 1], size=252)
    positions = pd.DataFrame(index=returns_m.index, columns=returns_m.columns)
    positions[s1] = (signal > 0).astype(int)
    positions[s2] = (signal < 0).astype(int)
    algo_ret = (positions.shift() * returns_m).sum(axis=1)
    algo_cumret = (1 + algo_ret).cumprod()
    algo_cumret /= algo_cumret.iloc[0]
    random_signal_metrics.loc[i] = calculate_metrics(algo_cumret)
Summary statistics of the resulting dataframe are shown below.
Recall that in our backtest we got the following performance metrics.
Most metrics are close to the mean values obtained from the random simulations, which suggests that the backtest's performance could be driven by randomness. Let’s calculate the fraction of random strategies that outperform our strategy for each metric separately.
for metric in random_signal_metrics.columns[:-1]:
    backtest_val = results_df.loc['Algo'][metric]  # value of the metric in the backtest
    num_outperf = random_signal_metrics[random_signal_metrics[metric] >= backtest_val].shape[0]
    print(f'{metric} : {num_outperf / len(random_signal_metrics)}')

# MaxDDD must be LOWER than in the backtest
metric = 'MaxDDD'
backtest_val = results_df.loc['Algo'][metric]
num_outperf = random_signal_metrics[random_signal_metrics[metric] <= backtest_val].shape[0]
print(f'{metric} : {num_outperf / len(random_signal_metrics)}')
As you can see above, a large fraction of random strategies performs better than the strategy we are analyzing. However, here we tested each performance metric individually. Let’s calculate the fraction of random strategies that outperform our strategy across all performance metrics.
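This joint comparison is not shown above; a minimal sketch, assuming the results_df metrics table from the backtest and treating a higher (less negative) MaxDD and a lower MaxDDD as better, could be:

backtest = results_df.loc['Algo']  # metrics of our strategy from the backtest

# a random strategy "outperforms" only if it is at least as good in every metric
outperf_all = (
    (random_signal_metrics['Total return'] >= backtest['Total return']) &
    (random_signal_metrics['APR'] >= backtest['APR']) &
    (random_signal_metrics['Sharpe'] >= backtest['Sharpe']) &
    (random_signal_metrics['MaxDD'] >= backtest['MaxDD']) &    # MaxDD is negative, higher is better
    (random_signal_metrics['MaxDDD'] <= backtest['MaxDDD'])    # shorter drawdown duration is better
)
print(f'fraction outperforming in every metric: {outperf_all.mean():.2%}')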
About 14% of random simulations outperform our strategy in every metric; in other words, a purely random signal has roughly a 14% chance of doing at least as well as ours across the board. Furthermore, the trading signal appears random, with no detectable predictive power. Therefore, we have to conclude that the backtest results of this strategy are unreliable.
Here are a few ideas that could be used to improve the results:
Trade at a higher frequency (weekly or daily).
Exit the market (sell everything and keep the portfolio in cash) when the returns of both instruments are negative (a sketch of this variant is shown after the list).
Add more assets to trade.
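As an illustration of the second idea, here is a minimal, untested sketch of the cash-exit rule: hold no position in a month if both instruments had a negative return in the previous month. This is only one possible way to implement the rule.

# idea #2, sketched (untested): stay in cash after a month where both ETFs lost money
positions_cash = pd.DataFrame(index=returns_m.index, columns=returns_m.columns)
any_positive = (returns_m > 0).any(axis=1)
positions_cash[s1] = ((returns_m[s1] > returns_m[s2]) & any_positive).astype(int)
positions_cash[s2] = ((returns_m[s2] > returns_m[s1]) & any_positive).astype(int)

cash_ret = (positions_cash.shift() * returns_m).sum(axis=1)
cash_cumret = (1 + cash_ret).cumprod()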
A Jupyter notebook with the source code is available here.
If you have any questions, suggestions or corrections please post them in the comments. Thanks for reading.
References