Data-mining bias, sample selection bias, survivorship bias, look-ahead bias, and time-period bias

We have already discussed the issues involved in selecting an appropriate sample size. If the sample size is large enough, the central limit theorem lets us treat the sampling distribution of the sample mean as approximately normal, whatever the shape of the population distribution (provided its variance is finite). However, there are challenges if the sample size is small and the population distribution is non-normal.

Apart from sample size, many other issues affect sampling. A few of the important ones are discussed below.

Data-Mining Bias: It arises from the misuse of sample data. The analyst searches through a dataset for a statistically significant pattern by repeatedly drilling into the same data until one is found. If enough variables or rules are tested, some will appear significant purely by chance: at a 5% significance level, about 1 in 20 tests of a true null hypothesis will be rejected. Investment strategies born of data mining therefore often fail in the future, because their results appear far more significant than they actually are.

Data-mining bias can be checked with an out-of-sample test. A data-mined strategy tends to work well on the in-sample data used to discover it, but it is unlikely to work on data that were not part of the search. If it works out of sample as well, the result is more likely to reflect a genuine relationship, and the probability of data-mining bias is much lower.
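The in-sample versus out-of-sample contrast can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the strategies are hypothetical, their returns are pure noise with a true mean of zero, and the function name, the 5% volatility, and the sample sizes are invented for the example.

```python
import random
from statistics import fmean


def best_strategy_trial(rng, n_strategies=100, n_periods=60, vol=0.05):
    """Pick the best of many zero-skill strategies in-sample, then re-test it.

    Every strategy's returns are noise with a true mean of zero (an
    assumption of this sketch), so any in-sample "edge" comes from the search.
    """
    best_in, best_out = float("-inf"), 0.0
    for _ in range(n_strategies):
        returns = [rng.gauss(0.0, vol) for _ in range(2 * n_periods)]
        in_sample = fmean(returns[:n_periods])    # data used for the search
        out_sample = fmean(returns[n_periods:])   # data held back from the search
        if in_sample > best_in:
            best_in, best_out = in_sample, out_sample
    return best_in, best_out


rng = random.Random(42)  # fixed seed so the run is repeatable
trials = [best_strategy_trial(rng) for _ in range(30)]
avg_in = fmean([t[0] for t in trials])
avg_out = fmean([t[1] for t in trials])
print(f"winner's average in-sample mean return:     {avg_in:.4f}")
print(f"winner's average out-of-sample mean return: {avg_out:.4f}")
```

Because the winner is chosen for its in-sample result, its in-sample mean is positive by construction of the search, while its out-of-sample mean should hover around the true value of zero.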

Intergenerational data mining also causes statistical problems, because the significance of its results is usually overstated. It involves using information developed by previous researchers working with a dataset to guide current research on the same or a related dataset.

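The two warning signs of potential data mining are the following:

  • Too much digging, too little confidence: many variables or specifications were tried before a significant one was found, so the reported significance deserves less confidence.
  • Lack of a story or economic rationale: there is no plausible economic reason why the pattern should exist or persist.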

Sample Selection Bias: When certain data are systematically excluded from a dataset, the resulting sample is biased, and this is called sample selection bias. It can occur even in markets where the quality and consistency of the data are high, but it is much more prevalent where data are not reported consistently, such as the hedge fund market. Hedge funds face few reporting requirements, and they generally avoid reporting their data when performance is poor.

Survivorship Bias: It is a type of sample selection bias. It arises when the sample is drawn only from the entities that currently exist, ignoring those that have ceased to exist. It is quite prevalent in the hedge fund industry: if you examine only the hedge funds that have survived, the funds that closed, usually after poor performance, are left out. This generally biases measured returns upward.
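The upward bias can be seen in a small simulated fund universe. This is a minimal sketch with invented numbers: the 500 funds, the return distribution, and the -5% closure threshold are all assumptions made for illustration, not data about real hedge funds.

```python
import random
from statistics import fmean

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical universe: 500 funds, each summarized by one annualized return
# drawn from the same distribution (true mean 4%, standard deviation 10%).
all_funds = [rng.gauss(0.04, 0.10) for _ in range(500)]

# Assumption for illustration: funds that returned less than -5% shut down
# and vanish from the database an analyst would download today.
CLOSE_BELOW = -0.05
survivors = [r for r in all_funds if r >= CLOSE_BELOW]

mean_all = fmean(all_funds)
mean_survivors = fmean(survivors)
print(f"funds in full universe: {len(all_funds)}, survivors: {len(survivors)}")
print(f"mean return, all funds:      {mean_all:.4f}")
print(f"mean return, survivors only: {mean_survivors:.4f}")
```

Dropping the worst performers can only raise the average, so the survivors-only mean overstates what an investor choosing among all funds at the start would have earned.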

Look-Ahead Bias: It occurs when a test uses information that was not available on the test date. Price-to-book (P/B) ratios are a common example. The price of the stock is readily available on the test date, but the latest published book value is the one for the last quarter; the book value for the current quarter becomes available only after that quarter ends. A test that pairs the test-date price with a book value published later assumes knowledge the investor did not have.
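One way to guard against look-ahead bias in a backtest is to select fundamentals by their release date rather than by their fiscal period. The sketch below assumes a hypothetical list of filings; the dates, book values, price, and function names are invented for illustration.

```python
from datetime import date

# Hypothetical filings: (fiscal period end, public release date, book value per share)
REPORTS = [
    (date(2022, 9, 30), date(2022, 11, 10), 18.0),
    (date(2022, 12, 31), date(2023, 2, 20), 19.0),
    (date(2023, 3, 31), date(2023, 5, 10), 20.0),
]


def book_value_known_on(test_date, reports):
    """Return the most recent book value already published on test_date."""
    published = [r for r in reports if r[1] <= test_date]
    if not published:
        raise ValueError("no book value had been published by the test date")
    return max(published, key=lambda r: r[0])[2]


def book_value_with_look_ahead(test_date, reports):
    """Biased version: matches on fiscal period end and ignores the release date."""
    ended = [r for r in reports if r[0] <= test_date]
    if not ended:
        raise ValueError("no fiscal period had ended by the test date")
    return max(ended, key=lambda r: r[0])[2]


test_date = date(2023, 4, 15)
price = 40.0  # hypothetical share price on the test date
pb_clean = price / book_value_known_on(test_date, REPORTS)          # uses 19.0
pb_biased = price / book_value_with_look_ahead(test_date, REPORTS)  # uses 20.0
print(f"P/B without look-ahead: {pb_clean:.3f}")
print(f"P/B with look-ahead:    {pb_biased:.3f}")
```

On the test date the March quarter has ended but its figures are not yet public, so the clean version falls back to the December book value while the biased version uses a number no investor could have seen.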

Time-Period Bias: It arises when the results depend on the specific time period chosen. A shorter time period might give more relevant results but may lack statistical significance. A longer time period offers more statistical power but might be less relevant, because the data may contain structural changes. For example, if an analyst measures the returns of gold over a very long period, the results could lack relevance because before 1971 the US dollar was backed by gold.
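The trade-off between relevance and statistical precision can be sketched with a constructed return series. The numbers below are hypothetical and chosen only so that the structural change is obvious; the helper name is likewise invented for the example.

```python
from math import sqrt
from statistics import fmean, stdev

# Hypothetical per-period returns with a structural change after period 40.
old_regime = [0.00, 0.02] * 20   # 40 periods, mean 1%
new_regime = [0.03, 0.05] * 5    # 10 periods, mean 4%
full_sample = old_regime + new_regime


def mean_and_standard_error(returns):
    """Sample mean and its standard error, s / sqrt(n)."""
    return fmean(returns), stdev(returns) / sqrt(len(returns))


mean_full, se_full = mean_and_standard_error(full_sample)
mean_recent, se_recent = mean_and_standard_error(new_regime)
print(f"full sample:   mean {mean_full:.4f}, standard error {se_full:.4f}")
print(f"recent window: mean {mean_recent:.4f}, standard error {se_recent:.4f}")
```

In this constructed series the full-sample estimate is more precise (smaller standard error) but blends two regimes, while the recent window reflects the current regime at the cost of a noisier estimate.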
