Statistical learning in complex and temporal data: distances, two-sample testing, clustering, classification and Big Data

Montero Manso, Pablo

Supervised by:

José Vilar Director

Defence university: Universidade da Coruña

Defence date: 30 January 2019

Committee:

Alicia Troncoso Lara Chair
Rubén Fernández Casal Secretary
Jorge Caiado Committee member

Type: Thesis

Teseo: 580496

Abstract

This thesis addresses statistical learning on complex objects, with an emphasis on time series data. The approach is to make it easier to introduce domain knowledge about the underlying phenomena by means of distances and features. A distance-based two-sample test is proposed, and its performance is studied under a wide range of scenarios. Distances designed for time series classification and clustering are also shown to increase statistical power when applied to two-sample testing. The proposed test compares favorably with other methods in its flexibility against different alternatives. A new distance for time series is defined by comparing the lagged distributions of the series in a novel way. This distance inherits the good empirical performance of existing methods while removing some of their limitations. A forecasting method based on time series features is proposed. The method combines individual standard forecasting algorithms using a weighted average, with weights produced by a learning model fitted on a large training set. Finally, a distributed classification algorithm is proposed, based on using a distance to compare the empirical distribution function of the dataset that each computing node receives with that of the test set.
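
The weighted-average combination mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `combine_forecasts`, the three base forecasts, and the weights are all hypothetical, and the weights are simply given here, whereas in the thesis they come from a learning model fitted on time series features.

```python
from typing import List, Sequence


def combine_forecasts(
    forecasts: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of several forecasts over a common horizon.

    forecasts: one sequence per base method, all of the same length.
    weights:   one non-negative weight per base method (assumed given here).
    """
    if len(forecasts) != len(weights):
        raise ValueError("need exactly one weight per forecast")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    horizon = len(forecasts[0])
    if any(len(f) != horizon for f in forecasts):
        raise ValueError("all forecasts must cover the same horizon")

    # Normalise so the weights sum to one, then average step by step.
    norm = [w / total for w in weights]
    return [
        sum(w * f[h] for w, f in zip(norm, forecasts))
        for h in range(horizon)
    ]


# Hypothetical three-step-ahead forecasts from three base methods.
naive_fc = [10.0, 10.0, 10.0]
drift_fc = [11.0, 12.0, 13.0]
mean_fc = [9.0, 9.0, 9.0]

# Hypothetical weights standing in for the output of a learned model.
weights = [0.5, 0.3, 0.2]

combined = combine_forecasts([naive_fc, drift_fc, mean_fc], weights)
print(combined)
```

Normalising the weights inside the function keeps the result a convex combination of the base forecasts even if the supplied weights do not sum exactly to one.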