Modelling Non-stationary 'Big Data'

Apr 2020 | 905

Authors: Jennifer Castle, Jurgen Doornik, David Hendry

Abstract: Seeking substantive relationships among vast numbers of spurious connections when modelling Big Data requires an appropriate approach. Big Data are useful if they can increase the probability that the data generation process is nested in the postulated model, increase the power of specification and mis-specification tests, and yet do not raise the chances of adventitious significance. Simply choosing the best-fitting equation, or trying hundreds of empirical fits and selecting a preferred one (perhaps contradicted by others that go unreported), is not going to lead to a useful outcome. Wide-sense non-stationarity (including both distributional shifts and integrated data) must be taken into account. The paper discusses the use of principal components analysis to identify cointegrating relations as a route to handling that aspect of non-stationary Big Data, along with indicator saturation to handle distributional shifts. It then models the monthly UK unemployment rate using both macroeconomic and Google Trends data, searching over 3000 explanatory variables and yet identifying a parsimonious, well-specified and theoretically interpretable model.

JEL Codes: C51, Q54

Keywords: Cointegration; Big Data; Model Selection; Outliers; Indicator Saturation; Autometrics
