Model integration in data mining: from local to global decisions

Author:
  1. Bella Sanjuán, Antonio
Supervised by:
  1. César Ferri Ramírez (Supervisor)

Defended at: Universitat Politècnica de València

Date of defence: 23 July 2012

Examination committee:
  1. Vicente J. Botti Navarro (Chair)
  2. Francisco Javier Oliver Villarroya (Secretary)
  3. J. S. Sanchez (Member)
  4. Alicia Troncoso Lara (Member)
  5. Antonio Bahamonde Rionda (Member)

Type: Thesis

Teseo: 329303

Abstract

Machine Learning is a research area that provides algorithms and techniques capable of learning automatically from past experience. These techniques are essential in the area of Knowledge Discovery from Databases (KDD), whose central stage is typically referred to as Data Mining. The KDD process can be seen as the learning of a model from previous data (model generation) and the application of this model to new data (model deployment). Model deployment is very important, because people and, especially, organisations make decisions depending on the results of the models. Usually, each model is learned independently from the others, trying to obtain the best (local) result. However, when several models have to be used together, some of them may depend on each other (e.g., the outputs of one model are the inputs of other models) and constraints appear on their application. In this scenario, the best local decision for each individual problem may not give the best global result, or the result may be invalid if it does not fulfil the problem constraints.

Customer Relationship Management (CRM) is an area that has given rise to real application problems where data mining and (global) optimisation need to be combined. For example, prescription problems deal with selecting or ranking the products to be offered to each customer (or, symmetrically, selecting the customers to whom an offer should be made). These areas (KDD, CRM) lack tools for a more holistic view of the problems and a better integration of models according to their interdependencies and the global and local constraints. The classical application of data mining to prescription problems has usually taken a rather monolithic and static view of the process, where we have one or more products to be offered to a pool of customers and we need to determine a sequence of (product, customer) offers that maximises profit. We consider that a better customisation is possible by tuning or adapting several features of the product in order to increase earnings. Therefore, we present a taxonomy of prescription problems based on the presence or absence of special features that we call negotiable features. We propose a solution for each kind of problem, based on global optimisation (combining several models in the deployment phase) and negotiation (introducing new concepts, problems and techniques). In general, in this scenario, the best global solution cannot be obtained analytically, and simulation techniques are useful for obtaining good global results.

Furthermore, when several models are combined, they must be combined using an unbiased criterion. In the case of having an estimated probability for each example, these probabilities must be realistic. In machine learning, the degree to which estimated probabilities match the actual probabilities is known as calibration. We revisit the problem of classifier calibration, proposing a new non-monotonic calibration method inspired by binning-based methods. Moreover, we study the role of calibration before and after probabilistic classifier combination, and we present a series of findings that allow us to recommend several layouts for the use of calibration in classifier combination.
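To make the calibration idea concrete, here is a minimal sketch of classical equal-width binning calibration, the family of methods the abstract says the new non-monotonic proposal is inspired by. The abstract does not detail that method, so this is only a generic illustration; the function name and variable names are ours.

```python
import numpy as np

def fit_binning_calibrator(scores, labels, n_bins=10):
    """Classical equal-width binning calibration (a generic sketch, not the
    thesis's non-monotonic method): each bin maps raw scores to the empirical
    positive rate observed for that bin on held-out labelled data."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    bin_prob = np.full(n_bins, labels.mean())      # empty bins fall back to the prior
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            bin_prob[b] = labels[in_bin].mean()    # empirical frequency of positives per bin

    def calibrate(new_scores):
        new_scores = np.asarray(new_scores, dtype=float)
        ids = np.clip(np.digitize(new_scores, edges[1:-1]), 0, n_bins - 1)
        return bin_prob[ids]

    return calibrate
```

A classifier's scores and true labels on a held-out validation set would be passed to the fitting function; the returned function then maps deployment-time scores to calibrated probabilities that can be compared or combined across models.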
Finally, before deploying the model to make individual decisions for each new case, a global view of the problem may be required in order to study the feasibility of the model or the resources that will be needed. Quantification is a machine learning task that can help to obtain this global view of the problem. We present a new approach to quantification based on scaling the average estimated probability. We also analyse the impact of having good probability estimators on the new quantification methods based on probability averaging, and the relation of quantification with global calibration.

Summarising, in this work we have developed new techniques, methods and algorithms that can be applied during the model deployment phase for a better model integration. These new contributions outperform previous approaches or cover areas that had not yet been studied by the machine learning community. As a result, we now have a wider and more powerful range of tools for obtaining good global results when several local models are combined.
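As a rough illustration of quantification via probability averaging described above, the sketch below estimates the proportion of positives in an unlabelled deployment set from the average predicted probability and then rescales it using average probabilities measured on labelled validation data. The exact scaling used in the thesis is not given in the abstract, so the correction step here is an assumed, adjusted-count-style variant, and all names are illustrative.

```python
import numpy as np

def scaled_probability_average(p_test, p_val, y_val):
    """Estimate the positive-class prevalence of an unlabelled test set by
    scaling the average predicted probability (a sketch of the general idea;
    the thesis's exact scaling may differ).
    p_test       : predicted P(y=1) for the unlabelled deployment examples
    p_val, y_val : predicted probabilities and true labels on validation data
    """
    p_test, p_val, y_val = (np.asarray(a, dtype=float) for a in (p_test, p_val, y_val))
    pa = p_test.mean()                      # plain probability average
    tp_pa = p_val[y_val == 1].mean()        # average probability on real positives
    fp_pa = p_val[y_val == 0].mean()        # average probability on real negatives
    if tp_pa == fp_pa:                      # uninformative scores: no correction possible
        return float(pa)
    return float(np.clip((pa - fp_pa) / (tp_pa - fp_pa), 0.0, 1.0))
```

The better calibrated the probability estimates are, the closer the uncorrected average itself is to the true prevalence, which reflects the link between quantification and (global) calibration mentioned in the abstract.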