# Intelligent systems for function approximation and the integration of heterogeneous biological data

- Héctor Pomares Cintas Director
- Ignacio Rojas Ruiz Co-director

Defence university: Universidad de Granada

Defence date: 10 October 2011

- Alberto Prieto Espinosa Chair
- Alberto Guillén Perales Secretary
- Hubert Hackl Committee member
- Manuel Gonzalo Claros Díaz Committee member
- Jose Ramon Gonzalez Gonzalez Committee member

Type: Thesis

## Abstract

This dissertation presents a set of contributions grouped into three parts.

The first part concerns the integration of heterogeneous biological data for the prediction of functional associations between proteins, which in recent years has become one of the major goals of biological research. Several machine learning methods have been applied in the literature to the integration of heterogeneous biological data sources. However, all of them share a common problem: a lack of interpretability and simplicity for the decision maker. To address this, a data integration methodology is proposed that is based on simple, interpretable IF-THEN rules reflecting the contributions of different types of evidence or data sources to the prediction of functional associations between proteins. A multi-objective genetic programming (MO-GP) approach run on parallel architectures provides a set of Pareto-optimal IF-THEN classification rules, each of which can be used to build a functional linkage network (FLN) with a given level of accuracy. Furthermore, the decision maker does not have to specify partial preferences on the desired accuracy of the FLN: because the entire Pareto front is covered, different FLNs are obtained, each with a different level of accuracy.

The second part concerns the automation of Affymetrix 3' microarray data analysis. Because microarray data are commonly used in the data integration described above, a microarray data analysis tool is proposed with the following features: (1) automatic detection of low-quality microarrays, so that the decision maker can decide whether one or more arrays are defective on the basis of a full set of quantitative and qualitative measures; (2) automatic selection of the best pre-processing methods for a given data set, among several candidates, through objective quality metrics; and (3) automatic generation of reliable and complete lists of differentially expressed genes according to the set of best pre-processing methods selected previously. This automation represents an important advance in microarray data analysis and a great help to the decision maker, since the automatic detection of low-quality microarrays and the automatic selection of the best pre-processing methods prevent later phases of microarray data analysis, such as classification, from being affected by low-quality arrays and/or an incorrect choice of pre-processing methods.

The third part addresses two problems: distributing the original data set (input/output data) into two representative and balanced sets for function approximation tasks, and model selection. Two contributions are proposed.

The first contribution relates to one of the most common methodologies for evaluating models built by supervised learning algorithms. This methodology consists of partitioning the original data set (input/output data) into two sets: learning and test. The learning set is used to build models that capture the relationships between inputs and outputs, while the test set is used to check the models' generalization ability on data not used in the learning process. In the literature, the partition into learning and test sets does not usually take into account the variability and geometry of the original data. This can lead to unbalanced and unrepresentative learning and test sets and, in turn, to wrong conclusions about the accuracy of the learning algorithm. A new deterministic data mining approach is therefore proposed for distributing a given data set (input/output data) into two representative and balanced sets of roughly equal size to be used in function approximation problems. The distribution takes into account the variability of the data set, with the aim of both allowing a fair evaluation of the learning accuracy and making reproducible the machine learning experiments that are usually based on random distributions.

The second contribution addresses one of the problems in selecting the best model for Radial Basis Function Neural Networks (RBFNNs) in time series prediction tasks. The problem stems from the methodology commonly used in the literature to select the best model structure, which is based on the K-fold cross-validation model evaluation strategy. This strategy has some drawbacks, such as its random nature and the subjective choice of a proper value of K. A new deterministic model selection methodology is therefore proposed, with applications to incremental RBFNN construction in time series prediction problems. This model selection approach is a combined algorithm that takes advantage of the balanced and representative training and validation sets obtained through the data distribution approach described above, using them in all steps of the RBFNN design: initialization, optimization and network model evaluation. In this way, the model prediction accuracy is improved, the computation time spent in selecting the model is reduced, and random and computationally expensive model selection methodologies based on K-fold cross-validation procedures are avoided.
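
The idea of a deterministic, balanced distribution of a data set into two sets can be illustrated with a very small sketch. The abstract does not describe the thesis's actual algorithm, so the code below uses a deliberately simple stand-in: samples are sorted by output value and dealt alternately to the two sets. The function name `deterministic_split` and the toy data are assumptions made purely for illustration.

```python
from statistics import mean


def deterministic_split(samples):
    """Split (x, y) pairs into two sets of near-equal size.

    Hypothetical stand-in, NOT the thesis's algorithm: samples are sorted
    by output value (ties broken by input) and dealt alternately to the
    two sets, so both sets span the output range and no random seed is
    involved.
    """
    ordered = sorted(samples, key=lambda s: (s[1], s[0]))
    learning = ordered[0::2]  # 1st, 3rd, 5th, ... sample in sorted order
    test = ordered[1::2]      # 2nd, 4th, 6th, ... sample in sorted order
    return learning, test


# Toy data (illustrative only): y = x^2 sampled at 10 points in [0, 0.9].
data = [(x / 10, (x / 10) ** 2) for x in range(10)]
learning, test = deterministic_split(data)

# Compare the mean output of the two sets as a crude balance check.
mean_learning = mean(y for _, y in learning)
mean_test = mean(y for _, y in test)
```

Unlike a random partition or a K-fold scheme, calling the function again, or presenting the same samples in a different order, yields the same two sets, which is the reproducibility property the abstract emphasizes. A split that also respects the geometry of the input space would need a more elaborate criterion than this one-dimensional sort.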