What is a Backtest?
Realism, Taxes, and Transaction Costs
The Time-Lag Problem
The Survivorship Problem
The Pricing Problem
The Data Mining Problem
The MEGO Problem
Revised 4/20/99
There are two ways of dealing with the time-lag problem. First, you could make an assumption about when the data became available to investors, and then assume that the transactions took place after that date. That's what James O'Shaughnessy did in What Works on Wall Street. O'Shaughnessy used a time lag of 11 months or more, though he never fully explains what he meant by that. Since he rebalanced his portfolios on December 31, presumably he disregarded all quarterly and annual reports for periods ending after the preceding January 31 (11 months earlier). He asserted that the long lag was necessary to ensure that all the data he used was available to the public on his rebalancing date.
O'Shaughnessy calls the long lag "conservative," but it may have introduced a distortion of its own. Between January 31 and December 31 the typical company issues at least two, and more likely three, quarterly reports. A company with a December 31 fiscal year-end will issue reports for the quarters ending March 31, June 30, and September 30. The vast majority of these reports come out less than two months after the end of the quarter; many are public within three or four weeks. Investors presumably use these reports in deciding what a company is worth, and they probably carry far more weight than a report that is two or three quarters old. The causal link between a company's 12/31/97 report and its stock performance 12 to 24 months later (i.e., from 12/31/98 to 12/31/99) seems far more tenuous than the impact of, say, the 9/30/98 report on that performance.
Q-Investor, a backtesting program from Q-Analytics, allows the backtester to choose the lag factor. The default is seven weeks; in other words, the program assumes that annual reports for the period ending 12/31 are not available until about 2/12. That's not a bad assumption. Unfortunately, for years before 1996 Q-Investor has only annual (fiscal-year) data -- no quarterly data. If you use a buy date of January 1 and a seven-week lag, Q-Investor will use information only for periods ending before the previous mid-November. For most companies (which end their fiscal year on 12/31), that means a simulated investment decision on 1/1/95 uses information from the annual report for the year ending 12/31/93. Like O'Shaughnessy, Q-Investor users must contend with the possible distortions of very long lag times.
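To make the lag arithmetic concrete, here is a minimal sketch in Python (the function name is invented; this is not Q-Investor's actual code) of how a lag factor turns a buy date into the latest usable fiscal-period end:

    from datetime import date, timedelta

    def latest_usable_period_end(buy_date, lag_days):
        # A report for a fiscal period ending on day D is assumed public
        # only lag_days after D, so the latest period end usable on
        # buy_date is buy_date minus lag_days.
        return buy_date - timedelta(days=lag_days)

    # Q-Investor's default seven-week lag with a 1/1/95 buy date:
    print(latest_usable_period_end(date(1995, 1, 1), 7 * 7))  # 1994-11-13

With only fiscal-year data for the older years, a company with a 12/31 fiscal year-end last qualifies with its 12/31/93 annual report, as described above.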
The second way to combat time lag is to use data as it was actually published on a particular date in the past, and assume the transactions occurred shortly after publication of that data. This is what Robert Sheard did in The Unemotional Investor: he simply went to the library and looked up old issues of the Value Line Investment Survey. Probably the best way of doing this kind of time-lagging is with historical Value Line datasets in electronic form. These sets include monthly data going back to the beginning of 1986. The exact date of the data varies from month to month, but normally it was first published sometime during the first week of the month or the last few days of the preceding month. Value Line generally compiles the data on Wednesday and publishes it on Friday; weeks with holidays are the exception, with compilation and publication dates all over the lot.
Using data as it was actually published, such as the Value Line data, also eliminates survivorship bias: all the companies that later disappeared are still in the historical Value Line datasets.
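A sketch of the as-published approach, again in Python, assuming a simple list of dataset publication dates (the dates below are placeholders, not actual Value Line publication records):

    from datetime import date

    # Hypothetical publication dates for successive monthly datasets.
    pub_dates = [date(1999, 1, 4), date(1999, 2, 1), date(1999, 3, 1)]

    def dataset_for(trade_date):
        # Use only the most recent dataset actually published on or
        # before the trade date; delisted companies remain in each
        # historical set, so survivorship bias never enters.
        usable = [d for d in pub_dates if d <= trade_date]
        return max(usable) if usable else None

    print(dataset_for(date(1999, 2, 15)))  # 1999-02-01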
It is not clear whether any particular time of day carries a systematic price bias -- whether a transaction at the open is more or less favorable than one at midday or at the close. So long as the backtest consistently uses the same kind of price for both purchases and sales, any such bias should wash out: anything you might gain by waiting to buy at the close is given back at the next rebalancing, when you sell at the close. As a practical matter, closing prices are more likely to be available in a usable form, so most backtests use closing prices.
Bid/ask spreads are another problem investors face. Quotes from online services during the trading day may be bid prices (the price at which someone has offered to buy the stock), ask prices (the price at which someone is willing to sell the stock), or last-transaction prices (a price at which the stock actually changed hands). The difference between the bid and the ask is the spread. It would be unreasonable to assume that you could always buy at the bid or sell at the ask; bid and ask prices are not meant to reflect actual transactions in the marketplace, only the current state of the auction.
Closing prices, by contrast, always reflect an actual trade -- the last of the day. Moreover, some brokers permit their clients to submit orders with instructions that they be executed at the close. Assuming that purchases and sales in a backtest take place at closing prices therefore seems reasonable.
For calculating returns, always use split-adjusted prices; otherwise, you are likely to understate your returns. On the other hand, if the screen uses price as a screening criterion (for instance, the Foolish Four sorts Dow stocks in ascending order of price), use the actual trading price when deciding which stocks to pick, and switch to split-adjusted prices when calculating the investment returns.
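A minimal sketch of this two-price rule, with hypothetical field names (raw_price for the price as actually traded, adj_close for the split-adjusted series):

    def pick_lowest_priced(stocks, n=4):
        # Screen on the actual trading price investors saw at the time,
        # as a Foolish Four-style ascending-price sort requires.
        return sorted(stocks, key=lambda s: s["raw_price"])[:n]

    def holding_return(stock, buy_date, sell_date):
        # ...but compute the return from split-adjusted closes, so a
        # 2:1 split is not mistaken for a 50% price drop.
        return stock["adj_close"][sell_date] / stock["adj_close"][buy_date] - 1.0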
Splits can also cause other "crystal ball" problems. For instance, one early and promising Q-Investor screen looked for stocks with high volume, measured in shares traded per week. Strangely, the screen did not work as well when the dollar value of shares changing hands was used instead. The author of the screen finally realized that Q-Investor split-adjusts volume as well as price: when a stock splits 2:1, its historical volume in Q-Investor doubles along with its share count. This made it seem as though stocks that would later split because of their great performance had enormous volume all along; in real life (i.e., without the split adjustment) their volume was not that remarkable. The screen was a crystal ball: high-volume stocks were "destined" to split in the future because they would go up. Be on the lookout for crystal balls in your screens.
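If your data source stores only split-adjusted volume, one way to defuse this particular crystal ball -- assuming you can obtain a cumulative split factor for each date; the names below are invented -- is to undo the adjustment before screening:

    def raw_volume(adj_volume, cum_split_factor):
        # A later 2:1 split doubles split-adjusted historical volume
        # (cum_split_factor == 2.0), so dividing recovers the number
        # of shares that actually changed hands at the time.
        return adj_volume / cum_split_factor

    print(raw_volume(2_000_000, 2.0))  # 1000000.0 -- unremarkable after all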
Data mining is an even greater hazard as backtesters gain new programming techniques and more computer horsepower. "Genetic algorithms" sometimes spit out eye-popping numbers when run against historical data. Q-Investor gives you the option to "Q-Optimize" a screen with genetic algorithms; it remains to be seen whether these algorithms actually help predict anything. As O'Shaughnessy put it in What Works on Wall Street, "Torture the data enough and it will confess to anything."
There are two primary ways to combat data mining. The most obvious is to view with suspicion any result that contradicts common sense or other well-designed studies. For instance, suppose a backtest finds that requiring a return on invested capital (ROIC) of less than 15% increases an investment model's total return. In the real world, low ROIC is not normally a desirable characteristic. Possibly the rest of the screen was particularly good at identifying candidates in capital-intensive industries, and the low-ROIC filter simply excluded companies outside those industries; the screen might do even better if it focused on those industries directly rather than on ROIC. Another possibility is that the original version of the strategy chose a few bad market performers that coincidentally had high ROIC. The low-ROIC requirement eliminates these "problem child" stocks, but who can say whether future problem stocks will also have high ROIC? Those possibilities (and others) should be investigated before getting too excited.
The second way to avoid data mining is to divide the available historical data into a "play" set and a "confirmation" set. Experiment on the play set, then see whether the confirmation set bears out your hypothesis. If you had all 160 months of Value Line data at your disposal (January 1986 through April 1999), you might use a play set of 30 randomly selected months. This method is almost sure to tell you whether you are really eliminating problem children or just trying too hard. If you do not have a fairly large set of data to start with, you cannot use this technique, and you should be very cautious in interpreting your results.
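A minimal sketch of such a split (the 160-month universe matches the Value Line range described above; the seed and set sizes are arbitrary):

    import random

    months = list(range(160))  # Jan 1986 .. Apr 1999, one index per month
    random.seed(1)             # fixed seed keeps the split reproducible
    play = set(random.sample(months, 30))
    confirmation = [m for m in months if m not in play]

    # Tune the screen on the play months only; a rule that also holds
    # up on the confirmation months is less likely to be a data-mining
    # artifact.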
Here are the questions most investors want a backtest to answer before putting money into an investment model: