Foundations of Data Science (eISSN: 2639-8001)

December 2019, Volume 1, Issue 4


Issues using logistic regression with class imbalance, with a case study from credit risk modelling
Yazhe Li, Tony Bellotti and Niall Adams
2019, 1(4): 389-417. doi: 10.3934/fods.2019016
Abstract:

The class imbalance problem arises in two-class classification problems when the less frequent (minority) class is observed far less often than the majority class. This characteristic is endemic in many applications, such as default modelling and fraud detection. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen's results to show that the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation study and on a real mortgage dataset, we show that logistic regression does not provide the best out-of-sample predictive performance and that an approach that models the underlying structure of the minority class is often superior.
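The relabelling idea lends itself to a short sketch. The following is a minimal illustration, assuming scikit-learn and NumPy are available: the minority class is split into clusters by KMeans, a multiclass logistic regression is fitted on the relabelled data, and the minority-cluster probabilities are summed back into a single minority score. The function names, the number of clusters and the choice of KMeans are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def relabel_and_fit(X, y, n_clusters=3, random_state=0):
    """Cluster the minority class (y == 1) and fit a multiclass logistic model."""
    X_min = X[y == 1]
    km = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X_min)
    # Majority class keeps label 0; minority points get labels 1..n_clusters.
    y_new = np.zeros_like(y)
    y_new[y == 1] = km.labels_ + 1
    return LogisticRegression(max_iter=1000).fit(X, y_new)

def minority_score(clf, X):
    # Collapse the minority-cluster probabilities back into one score.
    proba = clf.predict_proba(X)
    minority_cols = [i for i, c in enumerate(clf.classes_) if c >= 1]
    return proba[:, minority_cols].sum(axis=1)
```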

Quantum topological data analysis with continuous variables
George Siopsis
2019, 1(4): 419-431. doi: 10.3934/fods.2019017
Abstract:

I introduce a continuous-variable quantum topological data analysis algorithm. The goal of the quantum algorithm is to calculate the Betti numbers in persistent homology, which are the dimensions of the kernel of the combinatorial Laplacian. I accomplish this task by using qRAM to create an oracle that organizes sets of data. I then perform continuous-variable phase estimation on a Dirac operator to obtain a probability distribution with eigenvalue peaks. The results also leverage an implementation of a continuous-variable conditional swap gate.
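As a point of reference for the quantity the quantum algorithm estimates, here is a purely classical sketch, assuming NumPy: the k-th Betti number is obtained as the kernel dimension of the combinatorial Laplacian built from boundary matrices. The hollow-triangle example is an illustrative assumption and has nothing to do with the continuous-variable circuit itself.

```python
import numpy as np

def betti_number(Bk, Bk_plus_1, tol=1e-10):
    """Betti_k = dim ker(Bk^T Bk + B_{k+1} B_{k+1}^T) for boundary matrices B."""
    laplacian = Bk.T @ Bk + Bk_plus_1 @ Bk_plus_1.T
    eigvals = np.linalg.eigvalsh(laplacian)
    return int(np.sum(np.abs(eigvals) < tol))  # number of (near-)zero eigenvalues

# Hollow triangle: three vertices, three oriented edges, no filled 2-simplex.
B1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]], dtype=float)  # boundary map: edges -> vertices
B2 = np.zeros((3, 0))                       # no 2-simplices
print(betti_number(B1, B2))                 # 1: one 1-dimensional hole
```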

On the incorporation of box-constraints for ensemble Kalman inversion
Neil K. Chada, Claudia Schillings and Simon Weissmann
2019, 1(4): 433-456. doi: 10.3934/fods.2019018
Abstract:

The Bayesian approach to inverse problems is widely used in practice to infer unknown parameters from noisy observations. In this framework, ensemble Kalman inversion has been successfully applied to the quantification of uncertainties in various areas of application. In recent years, a complete analysis of the method has been developed for linear inverse problems by adopting an optimization viewpoint. However, many applications require the incorporation of additional constraints on the parameters, e.g. arising from physical constraints. We propose a new variant of ensemble Kalman inversion that includes box constraints on the unknown parameters, motivated by the theory of projected preconditioned gradient flows. Based on the continuous-time limit of the constrained ensemble Kalman inversion, we present a complete convergence analysis for linear forward problems. We adopt techniques from filtering, such as variance inflation, which are crucial for improving the performance and establishing a correct descent. These benefits are highlighted through a number of numerical examples on various inverse problems based on partial differential equations.
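For orientation, a hedged sketch of a single constrained ensemble Kalman inversion update, assuming NumPy: the particles are moved by the usual ensemble Kalman step and then clipped componentwise onto a box. The clipping is a simple stand-in for the projected preconditioned gradient-flow construction analysed in the paper; all names and shapes are illustrative.

```python
import numpy as np

def eki_step_box(ensemble, G, y, Gamma, lower, upper):
    """One EKI step. ensemble: (J, d) particles; G: forward map R^d -> R^m;
    y: (m,) data; Gamma: (m, m) noise covariance; lower/upper: box bounds."""
    J = ensemble.shape[0]
    Gu = np.array([G(u) for u in ensemble])               # forward evaluations, (J, m)
    u_mean, g_mean = ensemble.mean(axis=0), Gu.mean(axis=0)
    C_uG = (ensemble - u_mean).T @ (Gu - g_mean) / J       # cross-covariance, (d, m)
    C_GG = (Gu - g_mean).T @ (Gu - g_mean) / J             # output covariance, (m, m)
    K = C_uG @ np.linalg.inv(C_GG + Gamma)                 # Kalman-type gain, (d, m)
    updated = ensemble + (y - Gu) @ K.T                    # standard EKI update
    return np.clip(updated, lower, upper)                  # projection onto the box
```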

Partitioned integrators for thermodynamic parameterization of neural networks
Benedict Leimkuhler, Charles Matthews and Tiffany Vlaar
2019, 1(4): 457-489. doi: 10.3934/fods.2019019
Abstract:

Traditionally, neural networks are parameterized using optimization procedures such as stochastic gradient descent, RMSProp and ADAM. These procedures tend to drive the parameters of the network toward a local minimum. In this article, we employ alternative "sampling" algorithms (referred to here as "thermodynamic parameterization methods") which rely on discretized stochastic differential equations having a defined target distribution on parameter space. We show that the thermodynamic perspective already improves neural network training. Moreover, by partitioning the parameters based on the natural layer structure, we obtain schemes with very rapid convergence for data sets with complicated loss landscapes.

We describe easy-to-implement hybrid partitioned numerical algorithms, based on discretized stochastic differential equations, which are adapted to feed-forward neural networks, including a multi-layer Langevin algorithm, AdLaLa (combining the adaptive Langevin and Langevin algorithms) and LOL (combining Langevin and overdamped Langevin). We examine the convergence of these methods in numerical studies and compare their performance with one another and with standard alternatives such as stochastic gradient descent and ADAM. We present evidence that thermodynamic parameterization methods can be (i) faster, (ii) more accurate, and (iii) more robust than standard algorithms used within machine learning frameworks.
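To make the "thermodynamic" viewpoint concrete, here is a minimal sketch of a single overdamped Langevin update, assuming NumPy: a gradient step plus Gaussian noise scaled by an inverse temperature. The paper's partitioned schemes (AdLaLa, LOL) assign different dynamics to different layer groups; this illustration shows only one generic update rule, and the step size and temperature are hypothetical.

```python
import numpy as np

def overdamped_langevin_step(theta, grad_loss, h=1e-3, beta=1e4, rng=None):
    """theta: flat parameter vector; grad_loss(theta): gradient of the loss;
    h: step size; beta: inverse temperature of the target distribution."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(theta.shape)
    # Discretized overdamped Langevin dynamics: drift down the loss plus noise.
    return theta - h * grad_loss(theta) + np.sqrt(2.0 * h / beta) * noise
```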

Cluster, classify, regress: A general method for learning discontinuous functions
David E. Bernholdt, Mark R. Cianciosa, David L. Green, Jin M. Park, Kody J. H. Law and Clement Etienam
2019, 1(4): 491-506. doi: 10.3934/fods.2019020
Abstract:

This paper presents a method for solving the supervised learning problem in which the output is highly nonlinear and discontinuous. It is proposed to solve this problem in three stages: (i) cluster the input-output data pairs, yielding a label for each point; (ii) classify the data, with the cluster label as the output; and finally (iii) perform one separate regression for each class, where the training data is the subset of the original input-output pairs that carry that label according to the classifier. To our knowledge, combining these three fundamental building blocks of machine learning in such a simple and powerful fashion has not previously been proposed. The approach can be viewed as a form of deep learning in which any of the intermediate layers may itself be deep. The utility and robustness of the methodology are illustrated on some toy problems, including one example problem arising from the simulation of plasma fusion in a tokamak.
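A hedged sketch of the three-stage pipeline, assuming scikit-learn: KMeans, a random forest classifier and ridge regression are used as stand-ins for whichever clustering, classification and regression components one would actually choose; all names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

def fit_ccr(X, y, n_clusters=4, random_state=0):
    # (i) Cluster the joint input-output pairs to obtain a label for each point.
    labels = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(
        np.column_stack([X, y]))
    # (ii) Classify: learn to predict the cluster label from the input alone.
    clf = RandomForestClassifier(random_state=random_state).fit(X, labels)
    # (iii) Regress: fit one regressor per cluster on that cluster's data.
    regs = {k: Ridge().fit(X[labels == k], y[labels == k]) for k in np.unique(labels)}
    return clf, regs

def predict_ccr(clf, regs, X):
    labels = clf.predict(X)
    return np.array([regs[k].predict(x.reshape(1, -1))[0] for k, x in zip(labels, X)])
```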
