Data snooping (part I)
Are these correlations statistically significant? The answer is not obvious.
- Viewed in isolation, each correlation is significant.
- But each regression is just one of over 20000 regressions performed. Even with uncorrelated datasets, the smallest observed p-value would likely be around 1/20000.
- Assuming that the 20000 regressions are independent is too conservative.
- Yet we chose which regressions to run using general knowledge that derives indirectly from regressions performed by others previously. So the true size of the entire space of regressions searched is larger than 20000.
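The point about the smallest p-value can be checked with a small simulation. The sketch below assumes that all 20000 tests are independent and that every null hypothesis is true, so each p-value is uniform on (0, 1). The function name, seed, and number of repetitions are illustrative choices, not part of the original analysis.

```python
import random

def smallest_p_value(n_tests, rng):
    """Smallest of n_tests p-values when every null hypothesis is true.

    Under the null, each p-value is uniform on (0, 1), so we draw
    uniform random numbers instead of running actual regressions.
    """
    return min(rng.random() for _ in range(n_tests))

n_tests = 20000          # number of regressions, as in the text
rng = random.Random(0)   # fixed seed so the run is repeatable

# Repeat the whole "20000 regressions" experiment many times.
minima = [smallest_p_value(n_tests, rng) for _ in range(200)]
mean_min_p = sum(minima) / len(minima)

print("average smallest p-value:", mean_min_p)
print("1 / n_tests:             ", 1 / n_tests)
```

The average smallest p-value comes out on the order of 1/20000, so a p-value of that size is what pure chance delivers when this many regressions are run.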
The p-value corresponding to a t-value of 5.0 is about 10^-6. It is reasonable to assume that an observed correlation with this strength is not a statistical fluke.
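The arithmetic behind this can be sketched as follows. The sketch assumes a two-sided test and a sample large enough that the t-distribution is well approximated by the standard normal. The Bonferroni adjustment (multiplying by the number of tests) is an illustrative choice for accounting for the 20000 regressions, not necessarily the correction used in the original analysis.

```python
import math

def two_sided_p_normal(t_value):
    """Two-sided p-value for a test statistic under the standard normal.

    Assumes a large sample, so the t-distribution is close to normal.
    """
    return math.erfc(abs(t_value) / math.sqrt(2.0))

t_value = 5.0
n_tests = 20000  # number of regressions, as in the text

p_single = two_sided_p_normal(t_value)

# Bonferroni adjustment: an upper bound on the chance that at least one
# of n_tests null tests reaches this p-value (illustrative assumption).
p_bonferroni = min(1.0, p_single * n_tests)

print("p-value for a single test:", p_single)
print("Bonferroni-adjusted:      ", p_bonferroni)
```

Under these assumptions the single-test p-value is a little under 10^-6, and even after multiplying by 20000 it stays close to 0.01.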
Doubts about significance remain:
- There is no guarantee a correlation persists over time.
- When two series are independent of each other but each is autocorrelated, the standard t-test is unreliable and an alternative t-test is required.
- Other least-squares regression assumptions may also be violated.
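The concern about autocorrelated series can be illustrated with a simulation. The sketch below generates pairs of independent AR(1) series and applies the ordinary t-test for their correlation, which equals the slope t-test in a simple regression. The AR(1) model, the coefficient 0.9, the series length, and the 1.96 cutoff are all assumptions chosen for illustration.

```python
import math
import random

def ar1(n, phi, rng):
    """Generate n points of an AR(1) series x[t] = phi * x[t-1] + noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def corr_t_stat(x, y):
    """Ordinary t-statistic for the correlation between x and y.

    This is the same t-value as the slope in a simple regression of y on x,
    and it assumes the observations are not autocorrelated.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

rng = random.Random(0)       # fixed seed so the run is repeatable
n, phi, trials = 200, 0.9, 300

# The two series in each pair are independent, so every "significant"
# result at the nominal 5% level is a false positive.
false_positives = sum(
    abs(corr_t_stat(ar1(n, phi, rng), ar1(n, phi, rng))) > 1.96
    for _ in range(trials)
)
false_positive_rate = false_positives / trials

print("nominal rate: 0.05, observed rate:", false_positive_rate)
```

With strong autocorrelation the observed false positive rate is far above the nominal 5%, which is why the ordinary t-value overstates significance for such series.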